Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45294
While tracking down a recent memory corruption bug, we found that
cuda-memcheck wasn't finding the bad accesses. ngimel pointed out that
this is because we use a caching allocator, so many "out of bounds" accesses
still land inside a valid slab.
This PR adds a runtime knob (`PYTORCH_NO_CUDA_MEMORY_CACHING`) that, when set,
bypasses the caching allocator's caching logic so that allocations go straight
to cudaMalloc. This way, cuda-memcheck will actually work.
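For illustration, enabling the knob from Python looks roughly like this (a sketch, assuming the variable must be set before the first CUDA allocation; setting it in the shell, as in the Test Plan below, is equivalent):
```
import os

# Assumption: the allocator consults the environment when it is first used,
# so set the knob before any CUDA tensor is created.
os.environ["PYTORCH_NO_CUDA_MEMORY_CACHING"] = "1"

import torch

# With caching bypassed, this allocation goes straight to cudaMalloc,
# so cuda-memcheck can see its true bounds.
x = torch.empty(1024, device="cuda")
```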
Test Plan:
Insert some memory errors and run a test under cuda-memcheck;
observe that cuda-memcheck flags an error where expected.
Specifically I removed the output-masking logic here:
https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/tensorexpr/cuda_codegen.cpp#L819-L826
And ran:
```
PYTORCH_NO_CUDA_MEMORY_CACHING=1 cuda-memcheck pytest -k test_superslomo test_jit_fuser_te.py
```
Reviewed By: ngimel
Differential Revision: D23964734
Pulled By: bertmaher
fbshipit-source-id: 04efd11e8aff037b9edde80c70585cb820ee6e39
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45433
Primarily in order to pick up the fix landed in https://github.com/pytorch/tensorpipe/pull/225 which fixes the handling of scopes in link-local IPv6 addresses, which was reported by a user.
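For background (not part of this change), a link-local IPv6 address is only meaningful together with its zone/scope identifier, which appears after a `%`; a quick way to see the scope id from Python:
```
import socket

# "%1" is a numeric zone index (interface 1); on real hosts you would more
# commonly see an interface name such as "%eth0".
infos = socket.getaddrinfo("fe80::1%1", 29500, socket.AF_INET6)
family, _, _, _, sockaddr = infos[0]
# sockaddr is (address, port, flowinfo, scope_id); for link-local addresses
# the scope_id matters, which is the kind of detail the upstream fix handles.
print(sockaddr)
```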
Test Plan: The specific upstream change is covered by new unit tests. The submodule update will be validated by the PyTorch CI.
Reviewed By: beauby
Differential Revision: D23962289
fbshipit-source-id: 4ed762fc19c4aeb1398d1337d61b3188c4c228be
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45438
Adds torchelastic (as well as its dependencies) to the official docker
images
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Test Plan: Imported from OSS
Reviewed By: tierex
Differential Revision: D23963787
Pulled By: seemethere
fbshipit-source-id: 54ebb4b9c50699e543f264975dadf99badf55753
Summary:
As per the title. Fixes [#38948](https://github.com/pytorch/pytorch/issues/38948), which contains some blueprints for the algorithm used in this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43002
Reviewed By: zou3519
Differential Revision: D23931326
Pulled By: albanD
fbshipit-source-id: e6994af70d94145f974ef87aa5cea166d6deff1e
Summary:
Changes the deprecation of norm to a docs deprecation, since PyTorch components still rely on norm and some behavior, like automatically flattening tensors, may need to be ported to torch.linalg.norm. The documentation is also updated to clarify that torch.norm and torch.linalg.norm are distinct.
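As a concrete illustration of the distinction (a sketch, not from this PR): with a numeric `p` and no `dim`, `torch.norm` flattens a matrix and computes a vector norm, while `torch.linalg.norm` with `ord=2` computes a true matrix (spectral) norm:
```
import torch

B = torch.tensor([[1.0, 2.0],
                  [3.0, 4.0]])

# torch.norm flattens B and returns the vector 2-norm: sqrt(1+4+9+16) ~= 5.477
print(torch.norm(B, p=2))

# torch.linalg.norm treats 2-D input as a matrix: ord=2 is the largest
# singular value (spectral norm), ~= 5.465 here
print(torch.linalg.norm(B, ord=2))
```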
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45415
Reviewed By: ngimel
Differential Revision: D23958252
Pulled By: mruberry
fbshipit-source-id: fd54e807c59a2655453a6bcd9f4073cb2c12e8ac
Summary:
Fixes a couple of issues with scripting inplace indexing in the prepare_inplace_ops_for_onnx pass (see the sketch below):
1- Tracing index copy (such as cases like `x[1:3] = data`) already applies broadcasting on the rhs if needed. The broadcasting node (aten::expand) was missing in scripting cases.
2- Inplace indexing with ellipsis (aten::copy_) is replaced with aten::index_put and then handled with slice+select in this pass.
Support for negative indices for this op has been added.
Shape inference is also enabled for scripting tests using the new JIT API.
A few more tests are enabled for scripting.
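For reference, the scripted patterns these fixes target look roughly like the following (a hedged sketch; the actual test cases live in the ONNX export test suite):
```
import torch

class InplaceIndexing(torch.nn.Module):
    def forward(self, x, data):
        # (1) slice assignment: in scripting, `data` may need an explicit
        # aten::expand to broadcast to the shape of x[1:3]
        x[1:3] = data
        # (2) ellipsis indexing with a negative index, lowered via
        # aten::copy_ -> aten::index_put and then slice + select
        x[..., -1] = 0.0
        return x

scripted = torch.jit.script(InplaceIndexing())
out = scripted(torch.zeros(4, 3), torch.ones(3))
# torch.onnx.export(scripted, (torch.zeros(4, 3), torch.ones(3)), "m.onnx")
```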
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44351
Reviewed By: ezyang
Differential Revision: D23880267
Pulled By: bzinodev
fbshipit-source-id: 78b33444633eb7ae0fbabc7415e3b16001f5207f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45143
This PR prevents freezing from cleaning up a submodule when the user
requests that the submodule be preserved.
Test Plan: Imported from OSS
Reviewed By: eellison
Differential Revision: D23844969
Pulled By: bzinodev
fbshipit-source-id: 80e6db3fc12460d62e634ea0336ae2a3551c2151
Summary:
In the ONNX NegativeLogLikelihoodLoss specification, ignore_index is optional and has no default value.
Therefore, when converting the nll op to ONNX, we need to set the ignore_index attribute even if it is not specified by the user (e.g. ignore_index=-100).
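In other words, an export like this hedged sketch should now always carry an ignore_index attribute, using PyTorch's default of -100 when the user does not set one:
```
import torch

class NLL(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # no ignore_index given explicitly; PyTorch defaults to -100, and the
        # exporter now writes that default into the ONNX node
        self.loss = torch.nn.NLLLoss()

    def forward(self, logp, target):
        return self.loss(logp, target)

logp = torch.log_softmax(torch.randn(3, 5), dim=1)
target = torch.tensor([1, 0, 4])
# NegativeLogLikelihoodLoss requires opset >= 12
torch.onnx.export(NLL(), (logp, target), "nll.onnx", opset_version=12)
```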
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44816
Reviewed By: ezyang
Differential Revision: D23880354
Pulled By: bzinodev
fbshipit-source-id: d0bdd58d0a4507ed9ce37133e68533fe6d1bdf2b
Summary:
Optimizes the export_onnx API to reduce string and model proto exchange in export.cpp.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44332
Reviewed By: bwasti, eellison
Differential Revision: D23880129
Pulled By: bzinodev
fbshipit-source-id: 1d216d8f710f356cbba2334fb21ea15a89dd16fa
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44419
Closes https://github.com/pytorch/pytorch/issues/39969
This PR adds support for propagation of input shapes over the wire when the profiler is invoked with `record_shapes=True` over RPC. Previously, we did not respect this argument.
This is done by saving the shapes as an IValue list and recovering it on the client as the expected type (`std::vector<std::vector<int>>`). A test is added to ensure that remote ops report the same `input_shapes` as if the op were run locally.
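A hedged sketch of the usage this enables (worker names are placeholders and RPC is assumed to be initialized):
```
import torch
import torch.distributed.rpc as rpc
from torch.autograd.profiler import profile

# Assumes rpc.init_rpc("worker0", rank=0, world_size=2) has been called and
# a peer named "worker1" exists.
with profile(record_shapes=True) as prof:
    rpc.rpc_sync("worker1", torch.add, args=(torch.ones(2, 3), torch.ones(2, 3)))

# Remote events should now report the same input_shapes as a local torch.add.
for evt in prof.function_events:
    print(evt.name, evt.input_shapes)
```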
ghstack-source-id: 112977899
Reviewed By: pritamdamania87
Differential Revision: D23591274
fbshipit-source-id: 7cf3b2e8df26935ead9d70e534fc2c872ccd6958
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44967
When enabling the profiler on the server, the server may be a different
machine that does not have CUDA while the caller does. Previously this would
crash; now we fall back to CPU profiling and log a warning.
ghstack-source-id: 112977906
Test Plan: CI
Reviewed By: pritamdamania87
Differential Revision: D23790729
fbshipit-source-id: dc6eba172b7e666842d54553f52a6b9d5f0a5362
Summary: Currently GetSingleArgument overflows because it expects an int instead of an int64 when using a 1cycle (hill policy) annealing schedule.
Test Plan:
unittest
buck test caffe2/caffe2/python/operator_test:learning_rate_op_test
Differential Revision: D23938169
fbshipit-source-id: 20d65df800d7a0f1dd9520705af31f63ae716463
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45223
Previous diffs in this stack implemented the getNumKeys and deleteKey
APIs in the c10d Store and added tests at the C++ layer. This diff adds
tests at the Python level in test_c10d.py.
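The Python surface exercised by those tests looks roughly like this (a hedged sketch; a single-process store is used for brevity):
```
import datetime
import torch.distributed as dist

store = dist.TCPStore("127.0.0.1", 29500, 1, True,
                      datetime.timedelta(seconds=30))

store.set("first_key", "first_value")
store.set("second_key", "second_value")

# num_keys may also count an internal coordination key, so only check deltas
before = store.num_keys()
store.delete_key("first_key")
assert store.num_keys() == before - 1
```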
ghstack-source-id: 112939763
Test Plan: Ensured that the new Python tests as well as the previous C++ tests pass
Reviewed By: jiayisuse
Differential Revision: D23878455
fbshipit-source-id: 0a17ecf66b28d46438a77346e5bf36414e05e25c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43963
Added a DeleteKey API for the TCP Store
ghstack-source-id: 112939762
Test Plan:
Modified the existing get/set test to use delete. Verified that the
correct keys were deleted and that the numKeys API returned the right values
Reviewed By: jiayisuse
Differential Revision: D23009117
fbshipit-source-id: 1a0d95b43d79e665a69b2befbaa059b2b50a1f66
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43962
TCPStore needs a getNumKeys API for our logging needs.
ghstack-source-id: 112939761
Test Plan: Adding tests to C++ Store Tests
Reviewed By: pritamdamania87
Differential Revision: D22985085
fbshipit-source-id: 8a0d286fbd6fd314dcc997bae3aad0e62b51af83
Summary:
This PR adds the get_all_users_of function, which returns all the users of a specific node. A unit test is also added.
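The idea is straightforward; a minimal sketch using FX's public Graph/Node API (the helper added by this PR may live elsewhere and have a different signature):
```
import operator
import torch
import torch.fx

def all_users_of(graph: torch.fx.Graph, target: torch.fx.Node):
    # A node "uses" `target` if `target` appears among its args or kwargs.
    return [n for n in graph.nodes
            if target in n.args or target in n.kwargs.values()]

class M(torch.nn.Module):
    def forward(self, x):
        y = x + 1
        return y * 2, y - 3

gm = torch.fx.symbolic_trace(M())
add_node = next(n for n in gm.graph.nodes if n.target is operator.add)
print(all_users_of(gm.graph, add_node))  # the mul and sub nodes
```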
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45216
Reviewed By: ezyang
Differential Revision: D23883572
Pulled By: scottxu0730
fbshipit-source-id: 3eb68a411c3c6db39ed2506c9cb7bb7337520ee4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45306
Adds details to the main quantization doc on how users can skip or
customize quantization for specific layers.
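A hedged sketch of the kind of recipe the doc now spells out (eager-mode API; the model and layer choice are placeholders):
```
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(8, 8),
    torch.nn.ReLU(),
    torch.nn.Linear(8, 2),
)

# Customize: pick a qconfig for the whole model...
model.qconfig = torch.quantization.get_default_qconfig("fbgemm")
# ...and skip quantization for a specific submodule by clearing its qconfig.
model[2].qconfig = None

torch.quantization.prepare(model, inplace=True)
# (run calibration data through the model here)
torch.quantization.convert(model, inplace=True)
```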
Test Plan: Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D23917034
Pulled By: vkuzo
fbshipit-source-id: ccf71ce4300c1946b2ab63d1f35a07691fd7a2af
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45305
Adds an explanation of reduce_range to the main quantization doc page.
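For context, reduce_range is an observer argument; a hedged sketch of where it appears in a qconfig:
```
import torch
from torch.quantization import QConfig, MinMaxObserver, default_weight_observer

# reduce_range=True drops one bit of activation range, which avoids potential
# overflow in some x86 int8 instruction sequences used by fbgemm.
qconfig = QConfig(
    activation=MinMaxObserver.with_args(dtype=torch.quint8, reduce_range=True),
    weight=default_weight_observer,
)
```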
Test Plan: Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D23916669
Pulled By: vkuzo
fbshipit-source-id: ef93fb774cb15741cd92889f114f6ab76c39f051
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45135
The previous quantization summary had steps on what to do for dynamic
quantization, static quantization, and QAT. This PR moves these steps into
comments in the example code, so it is clearer how to accomplish them.
Test Plan: Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D23842456
Pulled By: vkuzo
fbshipit-source-id: db2399e51e9ae33c8a1ac610e3d7dbdb648742b0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45093
This adds a tl;dr-style summary of the quantization API
to the documentation. Hopefully this will make it easier
for new folks to learn how to use quantization.
This is not meant to be all-encompassing; future PRs
can improve the documentation further.
Test Plan:
1. build the doc as specified in https://github.com/pytorch/pytorch#building-the-documentation
2. inspect the quantization page in Chrome, format looks good
Reviewed By: jerryzh168
Differential Revision: D23828257
Pulled By: vkuzo
fbshipit-source-id: 9311ee3f394cd83af0aeafb6e2fcdc3e0321fa38
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45221
This PR introduces a distributed functional optimizer, so that the
distributed optimizer can reuse the functional optimizer APIs while
maintaining its own states. This enables a TorchScript-compatible
functional optimizer when using the distributed optimizer, which helps get
rid of the GIL and improves the overall performance of training, especially
for distributed model parallel training.
Test Plan: Imported from OSS
Reviewed By: ailzhang
Differential Revision: D23935256
Pulled By: wanchaol
fbshipit-source-id: 59b6d77ff4693ab24a6e1cbb6740bcf614cc624a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44715
We have provided a nice and intuitive optimizer API in Python. But in the context of large-scale distributed training (e.g. distributed model parallel), users often want to use multithreaded training instead of multiprocess training, as it provides better resource utilization and efficiency.
This PR introduces a functional optimizer concept (similar in spirit to `nn.functional`): we split the optimizer into two parts, 1. optimizer state management and 2. optimizer computation. We expose the computation part as a separate functional API available to internal and OSS developers; the caller of the functional API maintains their own states and calls the functional API directly. While the end-user API stays the same, the functional API is TorchScript friendly and can be used by the distributed optimizer to speed up training without the GIL.
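A minimal sketch of that split (illustrative only, not the actual APIs added by this stack): the computation is a pure, scriptable function and the caller owns all state:
```
from typing import List
import torch
from torch import Tensor

# Computation half: a pure, TorchScript-friendly step. The caller passes in
# the parameters, gradients, and state (momentum buffers) explicitly.
@torch.jit.script
def functional_sgd_step(params: List[Tensor], grads: List[Tensor],
                        momentum_buffers: List[Tensor],
                        lr: float, momentum: float):
    for i in range(len(params)):
        buf = momentum_buffers[i]
        buf.mul_(momentum).add_(grads[i])
        params[i].add_(buf, alpha=-lr)

# State-management half: a thin wrapper that owns the buffers and calls the
# functional step, mirroring how a distributed optimizer can keep per-worker
# state while sharing the computation.
class TinySGD:
    def __init__(self, params, lr=0.01, momentum=0.9):
        self.params = list(params)
        self.buffers = [torch.zeros_like(p) for p in self.params]
        self.lr, self.momentum = lr, momentum

    def step(self):
        grads = [p.grad for p in self.params]  # assumes backward() has run
        functional_sgd_step(self.params, grads, self.buffers,
                            self.lr, self.momentum)
```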
Test Plan: Imported from OSS
Reviewed By: ailzhang
Differential Revision: D23935258
Pulled By: wanchaol
fbshipit-source-id: d2a5228439edb3bc64f7771af2bb9e891847136a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45353
Temporarily removing this feature; it will be added back after the branch cut.
Test Plan: Imported from OSS
Reviewed By: rohan-varma
Differential Revision: D23939865
Pulled By: mrshenli
fbshipit-source-id: 7dceaffea6b9a16512b5ba6036da73e7f8f83a8e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44433
Not entirely sure why, but changing the type of beta from `float` to `double` in autocast_mode.cpp and FunctionsManual.h fixes my compiler errors, failing instead at link time.
Fixed some type errors and updated the function signature in a few more files.
Removed my usage of Scalar, making beta a double everywhere instead.
Test Plan: Imported from OSS
Reviewed By: mrshenli
Differential Revision: D23636720
Pulled By: bdhirsh
fbshipit-source-id: caea2a1f8dd72b3b5fd1d72dd886b2fcd690af6d