pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
Iurii Zdebskyi	722faeb2a4	[RELAND] Added optimizers based on multi tensor apply (#45408) Summary: Original PR https://github.com/pytorch/pytorch/pull/45299. The present PR fixes minor bugs that caused the revert. Adding a new namespace `torch.optim._multi_tensor` with a bunch of updated optimizers. Those optimizers use the _foreach APIs, which improve performance significantly.
### Tests
- updated existing tests to use both optimizers
- added `test_multi_tensor_optimizers` test to verify correctness.
### Perf results
Adam timeit: 42.69 ms --> 10.16 ms autorange: 41.96 ms --> 10.28 ms
AdamW timeit: 51.38 ms --> 15.63 ms autorange: 50.82 ms --> 16.07 ms
SGD timeit: 6.28 ms --> 4.40 ms autorange: 6.13 ms --> 4.73 ms
RMSprop timeit: 28.63 ms --> 5.89 ms autorange: 28.27 ms --> 5.76 ms
Rprop timeit: 213.30 --> 178.42 autorange: 212.03 --> 178.03
ASGD timeit: 21.67 --> 9.33 autorange: 21.64 --> 9.27
Adamax timeit: 55.60 --> 48.29 autorange: 55.22 --> 49.13
Perf script used
```
import torch
import time
import torch.optim as optim
from torch.autograd import Variable
from torch.optim.lr_scheduler import ExponentialLR, ReduceLROnPlateau, StepLR
import torch.nn as nn
import time
import torchvision
import torch.utils._benchmark as benchmark_utils

device = "cuda"
model = torchvision.models.resnet.resnet101(pretrained=True).to(device)
targets = torch.randint(0, 1000, (100, 100), device=device)
criterion = nn.CrossEntropyLoss()

optimizer = optim.SGD(model.parameters(), lr=1e-3) # <----------------------- optimizer.
# would compare optim.SGD vs optim._multi_tensor.SGD
running_loss = 0.0
target = torch.empty(128, dtype=torch.long, device=device).random_(5)

optimizer.zero_grad()
inputs = torch.rand(128, 3, 100, 100, device=device , requires_grad=True)
outputs = model(inputs)
loss = criterion(outputs, target)
loss.backward()
optimizer.step()
running_loss += loss.item()

def main():
    timer = benchmark_utils.Timer(
        stmt="optimizer.step()",
        globals=globals(),
        label="str(optimizer)",
    )

    for i in range(1):
        print(f"Run: {i}\n{'-' * 40}")
        print(f"timeit:\n{timer.timeit(1000)}\n")
        print(f"autorange:\n{timer.blocked_autorange()}\n\n")

if __name__ == "__main__":
    main()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45408 Reviewed By: gchanan Differential Revision: D23956680 Pulled By: izdeby fbshipit-source-id: c5eab7bf5fce14a287c15cead1cdc26e42cfed94	2020-09-28 13:14:04 -07:00
Bram Wasti	87b356d093	[static runtime] Split out graph preparation from runtime (#44131 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44131 Test Plan: Imported from OSS Reviewed By: hlu1 Differential Revision: D23604305 Pulled By: bwasti fbshipit-source-id: 7b47da4961d99074199417ef1407a788c7d80ee6	2020-09-28 13:01:23 -07:00
Nikolay Korovaiko	6ab1c0b1ca	Disable a few tests in preparation to enabling PE+TE (#44815 ) Summary: Disable a few tests in preparation to enabling PE+TE Next PR: https://github.com/pytorch/pytorch/pull/45396 Pull Request resolved: https://github.com/pytorch/pytorch/pull/44815 Reviewed By: ZolotukhinM Differential Revision: D23948445 Pulled By: Krovatkin fbshipit-source-id: 93e641b7b8a3f13bd3fd3840116076553408f224	2020-09-28 12:55:12 -07:00
Xiang Gao	36c3fbc9e3	CUDA BFloat Conv (non-cuDNN) (#45007 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45007 Reviewed By: zou3519 Differential Revision: D23933174 Pulled By: ngimel fbshipit-source-id: 84eb028f09c9197993fb9981c0efb535014e5f78	2020-09-28 11:42:42 -07:00
Bert Maher	03342af3a3	Add env variable to bypass CUDACachingAllocator for debugging (#45294 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45294 While tracking down a recent memory corruption bug we found that cuda-memcheck wasn't finding the bad accesses, and ngimel pointed out that it's because we use a caching allocator so a lot of "out of bounds" accesses land in a valid slab. This PR adds a runtime knob (`PYTORCH_NO_CUDA_MEMORY_CACHING`) that, when set, bypasses the caching allocator's caching logic so that allocations go straight to cudaMalloc. This way, cuda-memcheck will actually work. Test Plan: Insert some memory errors and run a test under cuda-memcheck; observe that cuda-memcheck flags an error where expected. Specifically I removed the output-masking logic here: https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/tensorexpr/cuda_codegen.cpp#L819-L826 And ran: ``` PYTORCH_NO_CUDA_MEMORY_CACHING=1 cuda-memcheck pytest -k test_superslomo test_jit_fuser_te.py ``` Reviewed By: ngimel Differential Revision: D23964734 Pulled By: bertmaher fbshipit-source-id: 04efd11e8aff037b9edde80c70585cb820ee6e39	2020-09-28 11:40:04 -07:00
Nikolay Korovaiko	993628c74a	Build shape expressions and remove outputs that are only used by `aten::size`s (#45080 ) Summary: Currently, TE materializes all intermediate results even if they are only used for computing their shapes. This diff ports the approach the OF (Old Fuser) took to deal with this issue. Namely, given the structure of a fusion group we infer all the sizes outside a fusion group based on fusion group's inputs. A simple example would be: ``` def test_fuse(a, b): c = a + b d = c + b return d ``` Here we don't need to cache `c` as computing a gradient for `b` in `d = c + b` doesn't need it. We do need to compute sizes for all arguments here in case broadcasts happen. Without this optimization, TE would need to materialize `c` so we can get its size ``` [DUMP profiling_graph_executor_impl.cpp:499] Optimized Graph: [DUMP profiling_graph_executor_impl.cpp:499] graph(%a.1 : Tensor, [DUMP profiling_graph_executor_impl.cpp:499] %b.1 : Tensor): [DUMP profiling_graph_executor_impl.cpp:499] %11 : Tensor = prim::DifferentiableGraph_0(%b.1, %a.1) [DUMP profiling_graph_executor_impl.cpp:499] return (%11) [DUMP profiling_graph_executor_impl.cpp:499] with prim::DifferentiableGraph_0 = graph(%11 : Tensor, [DUMP profiling_graph_executor_impl.cpp:499] %13 : Tensor): [DUMP profiling_graph_executor_impl.cpp:499] %59 : int[] = aten::size(%13) # <string>:3:44 [DUMP profiling_graph_executor_impl.cpp:499] %62 : int[] = aten::size(%11) # <string>:3:93 [DUMP profiling_graph_executor_impl.cpp:499] %83 : Double(1:1, requires_grad=0, device=cuda:0), %84 : Double(1:1, requires_grad=0, device=cuda:0), %85 : bool = prim::TypeCheck(%11, %13) [DUMP profiling_graph_executor_impl.cpp:499] %86 : Tensor, %87 : Tensor = prim::If(%85) [DUMP profiling_graph_executor_impl.cpp:499] block0(): [DUMP profiling_graph_executor_impl.cpp:499] %d.4 : Double(1:1, requires_grad=0, device=cuda:0), %c.4 : Double(1:1, requires_grad=0, device=cuda:0) = prim::TensorExprGroup_0(%83, %84) [DUMP 
profiling_graph_executor_impl.cpp:499] -> (%d.4, %c.4) [DUMP profiling_graph_executor_impl.cpp:499] block1(): [DUMP profiling_graph_executor_impl.cpp:499] %94 : Function = prim::Constant[name="fallback_function", fallback=1]() [DUMP profiling_graph_executor_impl.cpp:499] %95 : (Tensor, Tensor) = prim::CallFunction(%94, %11, %13) [DUMP profiling_graph_executor_impl.cpp:499] %96 : Tensor, %97 : Tensor = prim::TupleUnpack(%95) [DUMP profiling_graph_executor_impl.cpp:499] -> (%96, %97) [DUMP profiling_graph_executor_impl.cpp:499] %60 : int[] = aten::size(%87) # <string>:3:55 [DUMP profiling_graph_executor_impl.cpp:499] %61 : int[]? = aten::_size_if_not_equal(%59, %60) # <string>:3:19 [DUMP profiling_graph_executor_impl.cpp:499] %64 : int[]? = aten::_size_if_not_equal(%62, %60) # <string>:3:68 [DUMP profiling_graph_executor_impl.cpp:499] %67 : int[] = aten::size(%86) # <string>:3:55 [DUMP profiling_graph_executor_impl.cpp:499] %68 : int[]? = aten::_size_if_not_equal(%60, %67) # <string>:3:19 [DUMP profiling_graph_executor_impl.cpp:499] %71 : int[]? 
= aten::_size_if_not_equal(%62, %67) # <string>:3:68 [DUMP profiling_graph_executor_impl.cpp:499] return (%86, %61, %64, %68, %71) [DUMP profiling_graph_executor_impl.cpp:499] with prim::TensorExprGroup_0 = graph(%1 : Double(1:1, requires_grad=0, device=cuda:0), [DUMP profiling_graph_executor_impl.cpp:499] %4 : Double(1:1, requires_grad=0, device=cuda:0)): [DUMP profiling_graph_executor_impl.cpp:499] %5 : int = prim::Constant[value=1]() [DUMP profiling_graph_executor_impl.cpp:499] %c.3 : Double(1:1, requires_grad=0, device=cuda:0) = aten::add(%4, %1, %5) # /scratch/villedepommes/pytorches/bench/test/test_jit.py:2872:16 [DUMP profiling_graph_executor_impl.cpp:499] %2 : int = prim::Constant[value=1]() [DUMP profiling_graph_executor_impl.cpp:499] %d.3 : Double(1:1, requires_grad=0, device=cuda:0) = aten::add(%c.3, %1, %2) # /scratch/villedepommes/pytorches/bench/test/test_jit.py:2873:16 [DUMP profiling_graph_executor_impl.cpp:499] return (%d.3, %c.3) ``` With this optimization we use `prim::BroadcastSizes` to compute the size of `c`. No need to materialize it. 
``` [DUMP profiling_graph_executor_impl.cpp:499] Optimized Graph: [DUMP profiling_graph_executor_impl.cpp:499] graph(%a.1 : Tensor, [DUMP profiling_graph_executor_impl.cpp:499] %b.1 : Tensor): [DUMP profiling_graph_executor_impl.cpp:499] %11 : Tensor = prim::DifferentiableGraph_0(%b.1, %a.1) [DUMP profiling_graph_executor_impl.cpp:499] return (%11) [DUMP profiling_graph_executor_impl.cpp:499] with prim::DifferentiableGraph_0 = graph(%11 : Tensor, [DUMP profiling_graph_executor_impl.cpp:499] %13 : Tensor): [DUMP profiling_graph_executor_impl.cpp:499] %59 : int[] = aten::size(%13) # <string>:3:44 [DUMP profiling_graph_executor_impl.cpp:499] %62 : int[] = aten::size(%11) # <string>:3:93 [DUMP profiling_graph_executor_impl.cpp:499] %88 : Double(1:1, requires_grad=0, device=cuda:0), %89 : Double(1:1, requires_grad=0, device=cuda:0), %90 : bool = prim::TypeCheck(%11, %13) [DUMP profiling_graph_executor_impl.cpp:499] %91 : Tensor = prim::If(%90) [DUMP profiling_graph_executor_impl.cpp:499] block0(): [DUMP profiling_graph_executor_impl.cpp:499] %d.4 : Double(1:1, requires_grad=0, device=cuda:0) = prim::TensorExprGroup_0(%88, %89) [DUMP profiling_graph_executor_impl.cpp:499] -> (%d.4) [DUMP profiling_graph_executor_impl.cpp:499] block1(): [DUMP profiling_graph_executor_impl.cpp:499] %97 : Function = prim::Constant[name="fallback_function", fallback=1]() [DUMP profiling_graph_executor_impl.cpp:499] %98 : (Tensor) = prim::CallFunction(%97, %11, %13) [DUMP profiling_graph_executor_impl.cpp:499] %99 : Tensor = prim::TupleUnpack(%98) [DUMP profiling_graph_executor_impl.cpp:499] -> (%99) [DUMP profiling_graph_executor_impl.cpp:499] %85 : int[] = aten::size(%91) [DUMP profiling_graph_executor_impl.cpp:499] %86 : int[] = prim::BroadcastSizes(%59, %62) [DUMP profiling_graph_executor_impl.cpp:499] %61 : int[]? = aten::_size_if_not_equal(%59, %86) # <string>:3:19 [DUMP profiling_graph_executor_impl.cpp:499] %64 : int[]? 
= aten::_size_if_not_equal(%62, %86) # <string>:3:68 [DUMP profiling_graph_executor_impl.cpp:499] %68 : int[]? = aten::_size_if_not_equal(%86, %85) # <string>:3:19 [DUMP profiling_graph_executor_impl.cpp:499] %71 : int[]? = aten::_size_if_not_equal(%62, %85) # <string>:3:68 [DUMP profiling_graph_executor_impl.cpp:499] return (%91, %61, %64, %68, %71) [DUMP profiling_graph_executor_impl.cpp:499] with prim::TensorExprGroup_0 = graph(%1 : Double(1:1, requires_grad=0, device=cuda:0), [DUMP profiling_graph_executor_impl.cpp:499] %4 : Double(1:1, requires_grad=0, device=cuda:0)): [DUMP profiling_graph_executor_impl.cpp:499] %5 : int = prim::Constant[value=1]() [DUMP profiling_graph_executor_impl.cpp:499] %c.3 : Double(1:1, requires_grad=0, device=cuda:0) = aten::add(%4, %1, %5) # /scratch/villedepommes/pytorches/bench/test/test_jit.py:2872:16 [DUMP profiling_graph_executor_impl.cpp:499] %2 : int = prim::Constant[value=1]() [DUMP profiling_graph_executor_impl.cpp:499] %d.3 : Double(1:1, requires_grad=0, device=cuda:0) = aten::add(%c.3, %1, %2) # /scratch/villedepommes/pytorches/bench/test/test_jit.py:2873:16 [DUMP profiling_graph_executor_impl.cpp:499] return (%d.3) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/45080 Reviewed By: bertmaher Differential Revision: D23856410 Pulled By: Krovatkin fbshipit-source-id: 2956286eb03a4894a5baa151c35e6092466322b1	2020-09-28 10:45:56 -07:00
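The commit above replaces materializing `c` with a size computation over the fusion group's inputs. As a rough illustration of that idea only, the sketch below computes a broadcast output shape from input shapes in plain Python. The helper name `broadcast_sizes` is hypothetical and is not the implementation of `prim::BroadcastSizes`; it only assumes the usual right-aligned broadcasting rule.

```python
from itertools import zip_longest


def broadcast_sizes(*shapes):
    """Return the broadcast shape of the given shapes (hypothetical helper).

    Dimensions are aligned from the right; a missing dimension counts as 1.
    Two dimensions are compatible if they are equal or one of them is 1.
    """
    result = []
    reversed_shapes = (reversed(s) for s in shapes)
    for dims in zip_longest(*reversed_shapes, fillvalue=1):
        non_one = {d for d in dims if d != 1}
        if len(non_one) > 1:
            raise ValueError(f"shapes {shapes} are not broadcastable")
        result.append(non_one.pop() if non_one else 1)
    return tuple(reversed(result))


# Shapes for the example in the commit message: c = a + b, d = c + b.
a_shape, b_shape = (3, 1), (1, 4)
c_shape = broadcast_sizes(a_shape, b_shape)  # size of `c` without computing `c`
d_shape = broadcast_sizes(c_shape, b_shape)
print(c_shape, d_shape)
```

Because the shape of `c` depends only on the shapes of `a` and `b`, the tensor itself never has to be returned from the fused group just so its size can be read.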
Luca Wehrstedt	e5242aaf89	Update TensorPipe submodule (#45433 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45433 Primarily in order to pick up the fix landed in https://github.com/pytorch/tensorpipe/pull/225 which fixes the handling of scopes in link-local IPv6 addresses, which was reported by a user. Test Plan: The specific upstream change is covered by new unit tests. The submodule update will be validated by the PyTorch CI. Reviewed By: beauby Differential Revision: D23962289 fbshipit-source-id: 4ed762fc19c4aeb1398d1337d61b3188c4c228be	2020-09-28 10:32:06 -07:00
Rong Rong	48d29c830d	[hotfix] disable problematic cuda tests on rocm builds (#45435 ) Summary: Disable the recent 3 cuda tests on amd rocm build/tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/45435 Reviewed By: malfet Differential Revision: D23962881 Pulled By: walterddr fbshipit-source-id: ad4ea1f835b4722cdbdce685806cfd64376cc16f	2020-09-28 10:02:12 -07:00
Eli Uriegas	e2ffdf467a	docker: Add torchelastic to docker image (#45438 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45438 Adds torchelastic (as well as its dependencies) to the official docker images Signed-off-by: Eli Uriegas <eliuriegas@fb.com> Test Plan: Imported from OSS Reviewed By: tierex Differential Revision: D23963787 Pulled By: seemethere fbshipit-source-id: 54ebb4b9c50699e543f264975dadf99badf55753	2020-09-28 09:53:07 -07:00
Nikita Vedeneev	e4950a093a	Backward support for generalized eigenvalue solver with LOBPCG in forward [only k-rank SYMEIG case] (#43002) Summary: As per title. Fixes [#38948](https://github.com/pytorch/pytorch/issues/38948). That issue contains some blueprints for the algorithm used in this PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/43002 Reviewed By: zou3519 Differential Revision: D23931326 Pulled By: albanD fbshipit-source-id: e6994af70d94145f974ef87aa5cea166d6deff1e	2020-09-28 07:22:35 -07:00
Mike Ruberry	6417a70465	Updates linalg warning + docs (#45415 ) Summary: Changes the deprecation of norm to a docs deprecation, since PyTorch components still rely on norm and some behavior, like automatically flattening tensors, may need to be ported to torch.linalg.norm. The documentation is also updated to clarify that torch.norm and torch.linalg.norm are distinct. Pull Request resolved: https://github.com/pytorch/pytorch/pull/45415 Reviewed By: ngimel Differential Revision: D23958252 Pulled By: mruberry fbshipit-source-id: fd54e807c59a2655453a6bcd9f4073cb2c12e8ac	2020-09-28 05:28:42 -07:00
generatedunixname89002005325676	7818a214c5	[AutoAccept][Codemod][FBSourceClangFormatLinter] Daily `arc lint --take CLANGFORMAT` Reviewed By: zertosh Differential Revision: D23959094 fbshipit-source-id: 6caa046d263114bff38a38d756099aac357e4f04	2020-09-28 05:08:46 -07:00
Negin Raoof	95a97e51b5	[ONNX] Improve scripting inplace indexing ops (#44351) Summary: Fix a couple of issues with scripting inplace indexing in the prepare_inplace_ops_for_onnx pass. 1- Tracing index copy (in cases like x[1:3] = data) already applies broadcasting on the rhs if needed. The broadcasting node (aten::expand) is missing in scripting cases. 2- Inplace indexing with ellipsis (aten::copy_) is replaced with aten::index_put and then handled with slice+select in this pass. Support for negative indices is added for this op. Shape inference is also enabled for scripting tests using the new JIT API. A few more tests are enabled for scripting. Pull Request resolved: https://github.com/pytorch/pytorch/pull/44351 Reviewed By: ezyang Differential Revision: D23880267 Pulled By: bzinodev fbshipit-source-id: 78b33444633eb7ae0fbabc7415e3b16001f5207f	2020-09-28 00:32:36 -07:00
Zino Benaissa	13f76f2be4	Fix preserve submodule attribute in freezing (#45143 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45143 This PR prevents freezing cleaning up a submodule when user requests to preserve a submodule. Test Plan: Imported from OSS Reviewed By: eellison Differential Revision: D23844969 Pulled By: bzinodev fbshipit-source-id: 80e6db3fc12460d62e634ea0336ae2a3551c2151	2020-09-28 00:05:38 -07:00
liqunfu	c3bf402cbb	handle onnx nll with default ignore index (#44816) Summary: In the ONNX NegativeLogLikelihoodLoss specification, ignore_index is optional and has no default value. Therefore, when converting the nll op to ONNX, we need to set the ignore_index attribute even if it is not specified (e.g. ignore_index=-100). Pull Request resolved: https://github.com/pytorch/pytorch/pull/44816 Reviewed By: ezyang Differential Revision: D23880354 Pulled By: bzinodev fbshipit-source-id: d0bdd58d0a4507ed9ce37133e68533fe6d1bdf2b	2020-09-27 23:26:19 -07:00
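To illustrate why the attribute has to be written out explicitly, here is a minimal pure-Python sketch of a mean negative log-likelihood loss. It is not the exporter code: the function name and the use of `ignore_index=None` to model "attribute absent" are assumptions made for the example. The value -100 is the PyTorch-side default mentioned in the commit message.

```python
def nll_loss(log_probs, targets, ignore_index=None):
    """Mean negative log-likelihood over the targets that are not ignored.

    ignore_index=None models an operator whose ignore_index attribute is
    absent (an assumption for this sketch): every target is then treated
    as a real class index.
    """
    losses = [
        -row[t]
        for row, t in zip(log_probs, targets)
        if ignore_index is None or t != ignore_index
    ]
    return sum(losses) / len(losses) if losses else float("nan")


log_probs = [[-0.1, -2.3], [-1.2, -0.4]]
targets = [0, -100]  # second sample is padding on the PyTorch side

# Attribute set explicitly: the padded sample is skipped.
loss_with_attr = nll_loss(log_probs, targets, ignore_index=-100)

# Attribute absent: -100 is treated as a class index and fails here.
try:
    nll_loss(log_probs, targets)
    failed_without_attr = False
except IndexError:
    failed_without_attr = True
```

Under these assumptions, a model that relied on the implicit default would behave differently after export unless the exporter always emits the attribute.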
Mike Ruberry	8bdbedd4ee	Revert "Updates and simplifies nonzero as_tuple behavior" This reverts commit `8b143771d0`.	2020-09-27 20:58:42 -07:00
Mike Ruberry	8b143771d0	Updates and simplifies nonzero as_tuple behavior	2020-09-27 20:56:30 -07:00
shubhambhokare1	5b839bca78	[ONNX] Optimize export_onnx api to reduce string and model proto exchange (#44332 ) Summary: Optimize export_onnx api to reduce string and model proto exchange in export.cpp Pull Request resolved: https://github.com/pytorch/pytorch/pull/44332 Reviewed By: bwasti, eellison Differential Revision: D23880129 Pulled By: bzinodev fbshipit-source-id: 1d216d8f710f356cbba2334fb21ea15a89dd16fa	2020-09-27 16:29:08 -07:00
neginraoof	4005afe94b	[ONNX] Update narrow for dynamic inputs (#44039 ) Summary: Update narrow for dynamic inputs Pull Request resolved: https://github.com/pytorch/pytorch/pull/44039 Reviewed By: mruberry Differential Revision: D23742215 Pulled By: bzinodev fbshipit-source-id: 0d58d2fe996f91a124af988a9a21ee433e842d07	2020-09-27 15:52:57 -07:00
Natalia Gimelshein	78caa028b6	Revert D23009117: [Distributed] DeleteKey API for c10d TCP Store Test Plan: revert-hammer Differential Revision: D23009117 (`addf94f2d6`) Original commit changeset: 1a0d95b43d79 fbshipit-source-id: ad3fe5501267e1a0a7bf23410766f1e92b34b24d	2020-09-27 12:04:42 -07:00
Natalia Gimelshein	f84b2e865f	Revert D23878455: [Distributed] Adding Python tests for the TCPStore getNumKeys and deleteKey Test Plan: revert-hammer Differential Revision: D23878455 (`cf808bed73`) Original commit changeset: 0a17ecf66b28 fbshipit-source-id: 93e60b23f66324e3e5266c45abb0cec295bb3d23	2020-09-27 12:02:24 -07:00
Mikhail Zolotukhin	bc5710f2f7	Benchmarks: tweak PE config settings. (#45349 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45349 Test Plan: Imported from OSS Reviewed By: Krovatkin Differential Revision: D23935518 Pulled By: ZolotukhinM fbshipit-source-id: 5a7c508c6fc84eafbc23399f095d732b903510dc	2020-09-26 23:13:29 -07:00
Mikhail Zolotukhin	a07d82982a	CI: Add a run of FastRNN benchmarks in default executor/fuser configuration. (#45348 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45348 Test Plan: Imported from OSS Reviewed By: Krovatkin Differential Revision: D23935520 Pulled By: ZolotukhinM fbshipit-source-id: efecaaab68caaaa057b354884f4ae37b6ef36983	2020-09-26 23:13:27 -07:00
Mikhail Zolotukhin	8cef7326f4	Benchmarks: add 'default' options for fuser and executor. (#45347 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45347 Test Plan: Imported from OSS Reviewed By: Krovatkin Differential Revision: D23935519 Pulled By: ZolotukhinM fbshipit-source-id: 8323fafe7828683c4d29c12a1e5722adb6f945ff	2020-09-26 23:09:02 -07:00
Natalia Gimelshein	37a671abc7	Revert D23828257: Quantization: add API summary section Test Plan: revert-hammer Differential Revision: D23828257 (`d2bd556e7d`) Original commit changeset: 9311ee3f394c fbshipit-source-id: 80b16fc123191e249e6a070ec5360a15fe91cf61	2020-09-26 22:53:10 -07:00
Natalia Gimelshein	110aa45387	Revert D23842456: Quantization: combine previous summary with new summary Test Plan: revert-hammer Differential Revision: D23842456 (`278da57255`) Original commit changeset: db2399e51e9a fbshipit-source-id: 7878257330bf83751cb17c0971a5c894bdf256ba	2020-09-26 22:53:07 -07:00
Natalia Gimelshein	3da1061059	Revert D23916669: quant docs: add reduce_range explanatation to top level doc Test Plan: revert-hammer Differential Revision: D23916669 (`eb39624394`) Original commit changeset: ef93fb774cb1 fbshipit-source-id: 7b56020427e76e13f847494044179c81d508db11	2020-09-26 22:48:38 -07:00
Mike Ruberry	54a253fded	Revert D23931987: Added optimizers based on multi tensor apply Test Plan: revert-hammer Differential Revision: D23931987 (`2b21e7767e`) Original commit changeset: 582134ef2d40 fbshipit-source-id: ffd500aea55fda34155442fb15e2529cb9c00100	2020-09-26 18:11:54 -07:00
Mike Ruberry	e52762cbb7	Revert D23917034: quant docs: document how to customize qconfigs in eager mode Test Plan: revert-hammer Differential Revision: D23917034 (`7763e1d7b1`) Original commit changeset: ccf71ce4300c fbshipit-source-id: 9ce99e880b4a22e824f4413354a0f3703e7c5c2c	2020-09-26 18:05:38 -07:00
Rohan Varma	23dfca8351	Support record_shapes in RPC profiling (#44419 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44419 Closes https://github.com/pytorch/pytorch/issues/39969 This PR adds support for propagation of input shapes over the wire when the profiler is invoked with `record_shapes=True` over RPC. Previously, we did not respect this argument. This is done by saving the shapes as an ivalue list and recovering it as the type expected (`std::vector<std::vector<int>>` on the client). Test is added to ensure that remote ops have the same `input_shapes` as if the op were run locally. ghstack-source-id: 112977899 Reviewed By: pritamdamania87 Differential Revision: D23591274 fbshipit-source-id: 7cf3b2e8df26935ead9d70e534fc2c872ccd6958	2020-09-26 13:26:44 -07:00
Rohan Varma	19dda7c68a	Fallback to CPU when remote end does not have CUDA for profiling (#44967) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44967 When enabling the profiler on the server, if it is a different machine it may not have CUDA while the caller does. Previously we would crash in this case; now we fall back to CPU and log a warning. ghstack-source-id: 112977906 Test Plan: CI Reviewed By: pritamdamania87 Differential Revision: D23790729 fbshipit-source-id: dc6eba172b7e666842d54553f52a6b9d5f0a5362	2020-09-26 13:12:55 -07:00
Iurii Zdebskyi	2b21e7767e	Added optimizers based on multi tensor apply (#45299) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45299 Adding a new namespace `torch.optim._multi_tensor` with a bunch of updated optimizers. Those optimizers use the _foreach APIs, which improve performance significantly.
### Tests
- updated existing tests to use both optimizers
- added `test_multi_tensor_optimizers` test to verify correctness.
### Perf results
Adam timeit: 42.69 ms --> 10.16 ms autorange: 41.96 ms --> 10.28 ms
AdamW timeit: 51.38 ms --> 15.63 ms autorange: 50.82 ms --> 16.07 ms
SGD timeit: 6.28 ms --> 4.40 ms autorange: 6.13 ms --> 4.73 ms
RMSprop timeit: 28.63 ms --> 5.89 ms autorange: 28.27 ms --> 5.76 ms
Rprop timeit: 213.30 --> 178.42 autorange: 212.03 --> 178.03
ASGD timeit: 21.67 --> 9.33 autorange: 21.64 --> 9.27
Adamax timeit: 55.60 --> 48.29 autorange: 55.22 --> 49.13
Perf script used
```
import torch
import time
import torch.optim as optim
from torch.autograd import Variable
from torch.optim.lr_scheduler import ExponentialLR, ReduceLROnPlateau, StepLR
import torch.nn as nn
import time
import torchvision
import torch.utils._benchmark as benchmark_utils

device = "cuda"
model = torchvision.models.resnet.resnet101(pretrained=True).to(device)
targets = torch.randint(0, 1000, (100, 100), device=device)
criterion = nn.CrossEntropyLoss()

optimizer = optim.SGD(model.parameters(), lr=1e-3) # <----------------------- optimizer.
# would compare optim.SGD vs optim._multi_tensor.SGD
running_loss = 0.0
target = torch.empty(128, dtype=torch.long, device=device).random_(5)

optimizer.zero_grad()
inputs = torch.rand(128, 3, 100, 100, device=device , requires_grad=True)
outputs = model(inputs)
loss = criterion(outputs, target)
loss.backward()
optimizer.step()
running_loss += loss.item()

def main():
    timer = benchmark_utils.Timer(
        stmt="optimizer.step()",
        globals=globals(),
        label="str(optimizer)",
    )

    for i in range(1):
        print(f"Run: {i}\n{'-' * 40}")
        print(f"timeit:\n{timer.timeit(1000)}\n")
        print(f"autorange:\n{timer.blocked_autorange()}\n\n")

if __name__ == "__main__":
    main()
```
Test Plan: Imported from OSS Reviewed By: ngimel Differential Revision: D23931987 Pulled By: izdeby fbshipit-source-id: 582134ef2d402909d27d89a45c5b588fb7130ea1	2020-09-26 12:17:43 -07:00
Thomas Bredillet	0fa551f0ab	[c2] Fix int types for learning rate Summary: Currently GetSingleArgument is overflowing since it's expecting an int instead of an int64 when using a 1cycle (hill policy) annealing schedule Test Plan: unittest buck test caffe2/caffe2/python/operator_test:learning_rate_op_test Differential Revision: D23938169 fbshipit-source-id: 20d65df800d7a0f1dd9520705af31f63ae716463	2020-09-26 10:59:29 -07:00
Omkar Salpekar	cf808bed73	[Distributed] Adding Python tests for the TCPStore getNumKeys and deleteKey (#45223 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45223 Previous diffs in this stack implemented the getNumKeys and deleteKey APIs in the c10d Store as well as added tests at the C++ layer. This diff adds tests at the Python level in test_c10d.py ghstack-source-id: 112939763 Test Plan: Ensured these new python tests as well as previous C++ tests pass Reviewed By: jiayisuse Differential Revision: D23878455 fbshipit-source-id: 0a17ecf66b28d46438a77346e5bf36414e05e25c	2020-09-26 00:54:28 -07:00
Omkar Salpekar	addf94f2d6	[Distributed] DeleteKey API for c10d TCP Store (#43963 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43963 Added a DeleteKey API for the TCP Store ghstack-source-id: 112939762 Test Plan: Modified the existing get/set test to use delete. verified that the correct keys were deleted and that the numKeys API returned the right values Reviewed By: jiayisuse Differential Revision: D23009117 fbshipit-source-id: 1a0d95b43d79e665a69b2befbaa059b2b50a1f66	2020-09-26 00:54:21 -07:00
Omkar Salpekar	304e1d1e19	[Distributed] getNumKeys API to c10d TCPStore (#43962 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43962 TCPStore needs a getNumKeys API for our logging needs. ghstack-source-id: 112939761 Test Plan: Adding tests to C++ Store Tests Reviewed By: pritamdamania87 Differential Revision: D22985085 fbshipit-source-id: 8a0d286fbd6fd314dcc997bae3aad0e62b51af83	2020-09-26 00:49:00 -07:00
Zafar	d9af3d2fcd	[quant] ConvTranspose warnings (#45081 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45081 Test Plan: Imported from OSS Reviewed By: vkuzo Differential Revision: D23822449 Pulled By: z-a-f fbshipit-source-id: f21a5f3ef4d09f703c96fff0bc413dbadeac8202	2020-09-25 22:30:14 -07:00
Wang Xu	92189b34b7	Add get_all_users_of function to GraphManipulation (#45216) Summary: This PR adds the get_all_users_of function, which returns all the users of a specific node. A unit test is also added. Pull Request resolved: https://github.com/pytorch/pytorch/pull/45216 Reviewed By: ezyang Differential Revision: D23883572 Pulled By: scottxu0730 fbshipit-source-id: 3eb68a411c3c6db39ed2506c9cb7bb7337520ee4	2020-09-25 19:32:49 -07:00
Vasiliy Kuznetsov	7763e1d7b1	quant docs: document how to customize qconfigs in eager mode (#45306 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45306 Adds details to the main quantization doc on how specifically users can skip or customize quantization of layers. Test Plan: Imported from OSS Reviewed By: jerryzh168 Differential Revision: D23917034 Pulled By: vkuzo fbshipit-source-id: ccf71ce4300c1946b2ab63d1f35a07691fd7a2af	2020-09-25 18:33:35 -07:00
Vasiliy Kuznetsov	eb39624394	quant docs: add reduce_range explanatation to top level doc (#45305) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45305 Adds an explanation for reduce_range to the main quantization doc page. Test Plan: Imported from OSS Reviewed By: jerryzh168 Differential Revision: D23916669 Pulled By: vkuzo fbshipit-source-id: ef93fb774cb15741cd92889f114f6ab76c39f051	2020-09-25 18:33:32 -07:00
Vasiliy Kuznetsov	278da57255	Quantization: combine previous summary with new summary (#45135 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45135 The previous quantization summary had steps on what to do for dynamic, static, QAT. This PR moves these steps to comments in the example code, so it is more clear how to accomplish the steps. Test Plan: Imported from OSS Reviewed By: jerryzh168 Differential Revision: D23842456 Pulled By: vkuzo fbshipit-source-id: db2399e51e9ae33c8a1ac610e3d7dbdb648742b0	2020-09-25 18:33:30 -07:00
Vasiliy Kuznetsov	d2bd556e7d	Quantization: add API summary section (#45093 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45093 This adds a tl;dr; style summary of the quantization API to the documentation. Hopefully this will make this easier for new folks to learn how to use quantization. This is not meant to be all-encompassing. Future PRs can improve the documentation further. Test Plan: 1. build the doc as specified in https://github.com/pytorch/pytorch#building-the-documentation 2. inspect the quantization page in Chrome, format looks good Reviewed By: jerryzh168 Differential Revision: D23828257 Pulled By: vkuzo fbshipit-source-id: 9311ee3f394cd83af0aeafb6e2fcdc3e0321fa38	2020-09-25 18:30:51 -07:00
Zafar	958c208666	[quant] conv_transpose graph patterns (#45078 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45078 Test Plan: Imported from OSS Reviewed By: vkuzo Differential Revision: D23821580 Pulled By: z-a-f fbshipit-source-id: 813a4ef1bbc429720765d61791fe754b6678a334	2020-09-25 18:14:29 -07:00
Ailing Zhang	606b1a9a2e	Move xla codegen to aten. (#45241 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45241 Test Plan: Imported from OSS Reviewed By: soumith Differential Revision: D23926750 Pulled By: ailzhang fbshipit-source-id: f768e24a9baeca9f9df069a62d6f8b94a853a1ee	2020-09-25 18:07:32 -07:00
Wanchao Liang	32c355af5b	[dist_optim] introduce distributed functional optimizer (#45221) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45221 This PR introduces a distributed functional optimizer, so that the distributed optimizer can reuse the functional optimizer APIs and maintain its own states. This could enable the TorchScript-compatible functional optimizer when using the distributed optimizer, which helps get rid of the GIL and improves overall training performance, especially for distributed model parallel training. Test Plan: Imported from OSS Reviewed By: ailzhang Differential Revision: D23935256 Pulled By: wanchaol fbshipit-source-id: 59b6d77ff4693ab24a6e1cbb6740bcf614cc624a	2020-09-25 17:13:10 -07:00
Wanchao Liang	08caf15502	[optimizer] refactor Adam to use functional API (#44791 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44791 Test Plan: Imported from OSS Reviewed By: ailzhang Differential Revision: D23935257 Pulled By: wanchaol fbshipit-source-id: 6f6e22a9287f5515d2e4e6abd4dee2fe7e17b945	2020-09-25 17:13:08 -07:00
Wanchao Liang	0444c372e1	[optimizer] introduce optimizer functional API, refactor Adagrad (#44715) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44715 We have provided a nice and intuitive API in Python. But in the context of large scale distributed training (e.g. Distributed Model Parallel), users often want to use multithreaded training instead of multiprocess training, as it provides better resource utilization and efficiency. This PR introduces the functional optimizer concept (similar to the concept of `nn.functional`). We split the optimizer into two parts: 1. optimizer state management 2. optimizer computation. We expose the computation part as a separate functional API that is available to internal and OSS developers; the caller of the functional API maintains its own states in order to call the functional API directly. While keeping the end user API the same, the functional API is TorchScript friendly and could be used by the distributed optimizer to speed up training without the GIL. Test Plan: Imported from OSS Reviewed By: ailzhang Differential Revision: D23935258 Pulled By: wanchaol fbshipit-source-id: d2a5228439edb3bc64f7771af2bb9e891847136a	2020-09-25 17:10:26 -07:00
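The split described in this commit (state management versus computation) can be sketched in plain Python. This is an illustrative toy under stated assumptions, not the code added by the PR: the names `adagrad_step` and `Adagrad` below are local to the example, parameters are plain floats instead of tensors, and the update omits options such as weight decay and learning-rate decay.

```python
import math


def adagrad_step(params, grads, state_sums, *, lr, eps=1e-10):
    """Computation part: a stateless update that mutates the lists it is given.

    The caller owns `params` and `state_sums`; nothing is stored here.
    """
    for i, (p, g) in enumerate(zip(params, grads)):
        state_sums[i] += g * g  # running sum of squared gradients
        params[i] = p - lr * g / (math.sqrt(state_sums[i]) + eps)


class Adagrad:
    """State-management part: keeps the familiar object API for end users."""

    def __init__(self, params, lr=0.01, eps=1e-10):
        self.params = params
        self.lr = lr
        self.eps = eps
        self.state_sums = [0.0] * len(params)

    def step(self, grads):
        # Delegate all arithmetic to the functional API.
        adagrad_step(self.params, grads, self.state_sums, lr=self.lr, eps=self.eps)


# End-user style: the optimizer object holds the state.
params = [1.0, -2.0]
opt = Adagrad(params, lr=0.1)
opt.step([2.0, 0.5])

# Functional style: the caller (e.g. a distributed optimizer) holds the state.
other_params, other_sums = [1.0, -2.0], [0.0, 0.0]
adagrad_step(other_params, [2.0, 0.5], other_sums, lr=0.1)
```

Both call paths run the same arithmetic, which is the point of the design: a caller that manages its own state can reuse the computation without going through the stateful class.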
Nikita Shulga	8ab2ad306d	Enable `torch.cuda.nccl` typechecking (#45344 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/45336 Pull Request resolved: https://github.com/pytorch/pytorch/pull/45344 Reviewed By: walterddr Differential Revision: D23935306 Pulled By: malfet fbshipit-source-id: dd09d4f8ff7a327131764487158675027a13bf69	2020-09-25 17:02:47 -07:00
Shen Li	5211fb97ac	Remove device maps from TensorPipe for v1.7 release (#45353 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45353 Temporarily removing this feature, will add this back after branch cut. Test Plan: Imported from OSS Reviewed By: rohan-varma Differential Revision: D23939865 Pulled By: mrshenli fbshipit-source-id: 7dceaffea6b9a16512b5ba6036da73e7f8f83a8e	2020-09-25 16:51:45 -07:00
Brian Hirsh	439930c81b	adding a beta parameter to the smooth_l1 loss fn (#44433) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44433 Not entirely sure why, but changing the type of beta from `float` to `double` in autocast_mode.cpp and FunctionsManual.h fixes my compiler errors, failing instead at link time. Fixing some type errors, updated fn signature in a few more files. Removing my usage of Scalar, making beta a double everywhere instead. Test Plan: Imported from OSS Reviewed By: mrshenli Differential Revision: D23636720 Pulled By: bdhirsh fbshipit-source-id: caea2a1f8dd72b3b5fd1d72dd886b2fcd690af6d	2020-09-25 16:36:28 -07:00


30217 Commits