Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46755
As reported in https://github.com/pytorch/pytorch/issues/41324, there is a bug in DDP when `find_unused_parameters=True` and 2 or more parameters share the same gradient accumulator.
In the reducer, we currently keep a mapping of grad accumulator to index and populate it with `map[accumulator] = index`, which overwrites the stored index whenever two or more parameters share the same accumulator. To fix this, switch the mapping values to a vector of indices so that all indices sharing an accumulator are retained.
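As an illustration only, here is a rough Python rendering of the mapping change (the actual fix lives in the C++ reducer; the names below are hypothetical):
```
from collections import defaultdict

# Collect every parameter index that shares a gradient accumulator,
# instead of letting the last writer win.
grad_accumulator_to_indices = defaultdict(list)

def register(accumulator, index):
    # Before: map[accumulator] = index  -- later indices overwrote earlier ones.
    # After: keep all indices that share the same accumulator.
    grad_accumulator_to_indices[accumulator].append(index)
```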
ghstack-source-id: 115453567
Test Plan: Added UT
Reviewed By: pritamdamania87
Differential Revision: D24497388
fbshipit-source-id: d32dfa9c5cd0b7a8df13c7873d5d28917b766640
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46773
Changed the constructor of RemoteModule to accept a `remote_device` arg in the following format:
"<workername>/<device>" (e.g., "trainer0/cpu", "ps0/cuda:0")
This arg merges the original `on` and `device` args.
Original PR issue: RemoteDevice Format #46554
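A rough usage sketch of the new format, assuming the import path and constructor signature `torch.distributed.nn.RemoteModule(remote_device, module_cls, args, kwargs)` and an already-initialized RPC framework:
```
import torch.nn as nn
from torch.distributed.nn import RemoteModule

# Assumes torch.distributed.rpc.init_rpc() has been called on all workers.
linear_on_cpu = RemoteModule("trainer0/cpu", nn.Linear, args=(20, 30))
linear_on_gpu = RemoteModule("ps0/cuda:0", nn.Linear, args=(20, 30))
```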
ghstack-source-id: 115448051
Test Plan: buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- RemoteModule
Reviewed By: pritamdamania87
Differential Revision: D24482562
fbshipit-source-id: 5acfc73772576a4b674df27625bf560b8f8e67c1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46955
Initially we were thinking of adding an `invalidate_quantized_float_parameters` option to free the memory
of the floating point parameters of quantized modules, but it turns out we will do a module swap, just like in eager mode, for the modules
that are quantized, so the old floating point module will not be referenced after quantization. Therefore this feature
is only needed for functionals; since most people use quantization with modules, we may not need it.
We'll revisit if a need for this comes up.
Test Plan: Imported from OSS
Reviewed By: supriyar
Differential Revision: D24579400
fbshipit-source-id: fbb0e567405dc0604a2089fc001573affdade986
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46897
These APIs implicitly assumed that the GPU index for a rank equals the rank itself, but
that is not necessarily true. For example, the first GPU could be reserved for a
different purpose, and rank 0 could use GPU 1, rank 1 GPU 2, etc. Thus, we
mandate that the user specify the device to use via `torch.cuda.set_device()`
before making calls to this API. This expectation should be okay since we
clearly document it, and we expect the user to set this for
DistributedDataParallel as well.
Also adds/tidies up some documentation.
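A minimal sketch of the expected call order, assuming the NCCL backend and an environment-variable rendezvous; the GPU choice here (`rank + 1`) is an arbitrary illustration of a mapping that is not the rank itself:
```
import torch
import torch.distributed as dist

def setup(rank, world_size):
    # The device must be chosen explicitly; it is no longer inferred from the rank.
    torch.cuda.set_device(rank + 1)
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    dist.barrier()  # collectives now run on the device set above
```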
ghstack-source-id: 115359633
Test Plan: Modified unittests
Reviewed By: divchenko
Differential Revision: D24556177
fbshipit-source-id: 7e826007241eba0fde3019180066ed56faf3c0ca
Summary:
Caffe2 and Torch currently do not have a consistent mechanism for determining whether a kernel has launched successfully. The result is difficult-to-detect or silent errors. This diff provides functionality to fix that. Subsequent diffs on the stack fix the identified issues.
Kernel launch errors may arise if the launch parameters (number of blocks, number of threads, shared memory, or stream id) are invalid for the hardware, or for other reasons. Interestingly, unless these launch errors are specifically checked for, CUDA will silently fail and return garbage answers that can affect downstream computation. Therefore, catching launch errors is important.
Launches are currently checked by placing
```
AT_CUDA_CHECK(cudaGetLastError());
```
somewhere below the kernel launch. This is bad for two reasons.
1. The check may be performed at a site distant to the kernel launch, making debugging difficult.
2. The separation of the launch from the check means that it is difficult for humans and static analyzers to determine whether the check has taken place.
This diff defines a macro:
```
#define TORCH_CUDA_KERNEL_LAUNCH_CHECK() AT_CUDA_CHECK(cudaGetLastError())
```
which clearly indicates the check.
This diff also introduces a new test which analyzes code to identify kernel launches and determines whether the line immediately following the launch contains `TORCH_CUDA_KERNEL_LAUNCH_CHECK();`.
A search of the Caffe2 codebase identifies 104 instances of `AT_CUDA_CHECK(cudaGetLastError());` while the foregoing test identifies 1,467 launches which are not paired with a check. Visual inspection indicates that few of these are false positives, highlighting the need for some sort of static analysis system.
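As a rough Python sketch of the kind of scan the new test performs (a simplification for illustration, not the actual test code):
```
import re

LAUNCH = re.compile(r"<<<.*>>>")          # heuristic: a CUDA triple-chevron launch
CHECK = "TORCH_CUDA_KERNEL_LAUNCH_CHECK();"

def unchecked_launches(lines):
    """Yield 1-based line numbers of launches not immediately followed by the check."""
    for i, line in enumerate(lines):
        if LAUNCH.search(line) and (i + 1 >= len(lines) or CHECK not in lines[i + 1]):
            yield i + 1
```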
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46474
Test Plan:
The new test is run with:
```
buck test //caffe2/test:kernel_launch_checks -- --print-passing-details
```
And should be launched automatically with the other land tests. (TODO: Is it?)
The test is currently set up only to provide warnings but can later be adjusted to require checks.
Otherwise, I rely on the existing test frameworks to ensure that changes resulting from reorganizing existing launch checks don't cause regressions.
Reviewed By: ngimel
Differential Revision: D24309971
Pulled By: r-barnes
fbshipit-source-id: 0dc97984a408138ad06ff2bca86ad17ef2fdf0b6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46772
When running `buck run caffe2/benchmarks/operator_benchmark/pt:qactivation_test -- --use_jit`, I encountered the following error: P146518683. The error was traced down to the fact that `torch.allclose` does not work with quantized tensors (the error was triggered by this particular multiplication https://fburl.com/diffusion/8vw647o6, since native mul cannot work with a float scalar and a quantized tensor).
Minimum example to reproduce:
```
(Pdb) input = torch.ones(5)
(Pdb) aa = torch.quantize_per_tensor(input, scale=1.0, zero_point=0, dtype=torch.quint8)
(Pdb) bb = torch.quantize_per_tensor(input, scale=1.0, zero_point=0, dtype=torch.quint8)
(Pdb) torch.allclose(aa, bb)
Comparison exception: promoteTypes with quantized numbers is not handled yet; figure out what the correct rules should be, offending types: QUInt8 Float
```
Here the proposed fix is to compare quantized tensors strictly within `_compare_tensors_internal`.
The other two possible fixes are (a sketch of option 1 and of the strict comparison follows the list):
1. convert quantized tensors to float tensors before sending them to `torch.allclose`
2. change `torch.allclose` to handle quantized tensors.
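A minimal sketch of the two comparison routes, not the exact code added to `_compare_tensors_internal`:
```
import torch

x = torch.ones(5)
aa = torch.quantize_per_tensor(x, scale=1.0, zero_point=0, dtype=torch.quint8)
bb = torch.quantize_per_tensor(x, scale=1.0, zero_point=0, dtype=torch.quint8)

# Option 1 above: dequantize first, then use torch.allclose.
assert torch.allclose(aa.dequantize(), bb.dequantize())

# Strict comparison of the underlying integer representations plus the
# quantization parameters, roughly what "compare strictly" means here.
assert torch.equal(aa.int_repr(), bb.int_repr())
assert aa.q_scale() == bb.q_scale() and aa.q_zero_point() == bb.q_zero_point()
```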
Test Plan: buck run caffe2/benchmarks/operator_benchmark/pt:qactivation_test -- --use_jit
Reviewed By: kimishpatel
Differential Revision: D24506723
fbshipit-source-id: 6426ea2a88854b4fb89abef0edd2b49921283796
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46568
This PR adds support for an RRef.backward() API. This would be useful
in applications like pipeline parallelism as described here:
https://github.com/pytorch/pytorch/issues/44827
This PR only adds support for local RRefs; remote RRef support will be added in
a follow-up PR.
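A minimal local-RRef sketch, assuming the API mirrors `Tensor.backward()` and that `torch.distributed.rpc.init_rpc` has already been called on this worker:
```
import torch
from torch.distributed.rpc import RRef

t = torch.rand(3, 3, requires_grad=True)
rref = RRef(t.sum())   # RRef holding a locally created value
rref.backward()        # expected to populate t.grad, like Tensor.backward()
```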
ghstack-source-id: 115100729
Test Plan:
1) unit tests.
2) waitforbuildbot
Reviewed By: mrshenli
Differential Revision: D24406311
fbshipit-source-id: fb0b4e185d9721bf57f4dea9847e0aaa66b3e513
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41807
Test Plan: Make sure CI tests pass, including the newly written test
Reviewed By: mrshenli
Differential Revision: D22640839
Pulled By: osandoval-fb
fbshipit-source-id: 3ff98d8e8c6e6d08575e307f05b5e159442d7216
Summary:
As per title. Limitations: only for batches of square full-rank matrices.
CC albanD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46284
Reviewed By: zou3519
Differential Revision: D24448266
Pulled By: albanD
fbshipit-source-id: d98215166268553a648af6bdec5a32ad601b7814
Summary:
Follow-up of https://github.com/pytorch/pytorch/issues/46461 with a similar goal
Makes them more readable and possibly faster. Care has to be taken when choosing the replacement: a list comprehension `[f(x) for x in xs]` builds the whole list immediately, while a generator expression `(f(x) for x in xs)` is evaluated lazily. The lazy form is a benefit in cases where the intermediate list never needs to exist in memory (e.g. when the result is passed to `tuple` or `extend` or `join`)
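A small illustration of the difference (plain Python, nothing PyTorch-specific):
```
xs = ["1", "2", "3"]

as_list = [int(x) for x in xs]    # list comprehension: built eagerly, held in memory
as_gen = (int(x) for x in xs)     # generator expression: items produced on demand

total = sum(as_gen)               # consumed without materializing a list
joined = ", ".join(x.strip() for x in xs)   # join accepts the generator directly
```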
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46462
Reviewed By: zou3519
Differential Revision: D24422343
Pulled By: ezyang
fbshipit-source-id: 252e33499c92ac0b15238f2df32681dbbda2b237
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46372
Currently, in `_run_function`, we catch an exception from the python
function which is run and report it back to the master. However, in some large-scale
training jobs, it would be valuable to also log the error on the trainer
itself for faster debugging.
Test Plan: Added unittest.
Reviewed By: pritamdamania87
Differential Revision: D24324578
fbshipit-source-id: 88460d7599ea69d2c38fd9c10eb6471f7edd4100
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46304
When a single process operates on only one GPU, we can avoid the scatter and
instead replace it with a recursive version of `to` which transfers the input
tensors to the correct device.
The implementation of `_recursive_to` is modeled after `scatter` in https://github.com/pytorch/pytorch/blob/master/torch/nn/parallel/scatter_gather.py, in order to keep parity with the previous conventions (i.e. custom types not having their tensors moved).
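A rough sketch of the recursion (not the actual `_recursive_to`); custom types are deliberately left untouched, following the scatter_gather conventions referenced above:
```
import torch

def recursive_to(obj, device):
    if isinstance(obj, torch.Tensor):
        return obj.to(device)
    if isinstance(obj, (list, tuple)):
        # Note: a real implementation also has to handle namedtuples specially.
        return type(obj)(recursive_to(o, device) for o in obj)
    if isinstance(obj, dict):
        return {k: recursive_to(v, device) for k, v in obj.items()}
    return obj  # custom types are returned as-is; tensors inside them are not moved
```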
ghstack-source-id: 114896677
Test Plan: Added unittest, and CI
Reviewed By: pritamdamania87
Differential Revision: D24296377
fbshipit-source-id: 536242da05ecabfcd36dffe14168b1f2cf58ca1d
Summary:
References https://github.com/pytorch/pytorch/issues/42515
> Enable integer -> float unary type promotion for ops like sin
Will follow-up for other such Ops once this PR is merged.
cc: mruberry
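A quick illustration of the promotion described above:
```
import torch

t = torch.tensor([0, 1, 2])   # integer (int64) input
out = torch.sin(t)            # promoted to a floating point result
print(out.dtype)              # torch.float32 under the default dtype
```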
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45733
Reviewed By: zou3519
Differential Revision: D24431194
Pulled By: mruberry
fbshipit-source-id: db600bc5de0e535b538d2aa301c3526b7c75ed17
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46601
* except excluded tests and magic methods.
https://github.com/pytorch/pytorch/issues/38731
Previously, we'd only run these tests for inplace operations. Since this adds a lot more tests, the following issues that came up when running them were fixed:
- Updated the schema of conj() to reflect existing behaviour.
- Updated the deepEquals method in check_alias_annotation.cpp to re-use the overloaded == operator. The previous implementation did not cover all types of IValues.
- Corrected the order in which inputs are passed during autograd testing of 'view' & 'reshape'.
- Substituted `aten::ger` with the function it is aliased to, `aten::outer`, for testing. The alias annotation checking code doesn't handle aliased operators properly.
ghstack-source-id: 114830903
Test Plan: Ran all tests in test:jit and verified they pass.
Reviewed By: eellison
Differential Revision: D24424955
fbshipit-source-id: 382d7e2585911b81b1573f21fff1d54a5e9a2054
Summary:
Resolves one item in https://github.com/pytorch/pytorch/issues/46321
This PR sets up DistExamplesTest which will be used as the class to implement future tests for examples. This class is run as part of CI tests. It also creates a dist_examples folder and includes the [batch parameter server example](https://github.com/pytorch/examples/blob/master/distributed/rpc/batch/parameter_server.py), which is slightly modified to allow it to be tested.
Run tests:
```
pytest test/distributed/rpc/test_tensorpipe_agent.py -k test_batch_updating_parameter_server -vs
pytest test/distributed/rpc/test_process_group_agent.py -k test_batch_updating_parameter_server -vs
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46510
Reviewed By: mrshenli
Differential Revision: D24379296
Pulled By: H-Huang
fbshipit-source-id: 1c102041e338b022b7a659a51894422addc0e06f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46337
We plan to pass around the mappings instead of using the global registration API, to keep
the mappings local to the transformations the user is performing
Test Plan: Imported from OSS
Reviewed By: vkuzo
Differential Revision: D24317436
fbshipit-source-id: 81569b88f05eeeaa9595447e482a12827aeb961f
Summary:
The `IN_CIRCLECI` variable is a misnomer since the flag really indicates that we enable XML reporting because we want to run the test in CI. Since this doesn't necessarily mean CircleCI in particular, `IN_CI` is more accurate and general.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46567
Reviewed By: walterddr
Differential Revision: D24407642
Pulled By: janeyx99
fbshipit-source-id: 5e141a0571b914310a174a58ac0fde58e9521c6b
Summary:
According to https://app.circleci.com/pipelines/github/pytorch/pytorch/228154/workflows/31951076-b633-4391-bd0d-b2953c940876/jobs/8290059
TestFFTCUDA.test_fftn_backward_cuda_complex128 takes 242 seconds to finish, with most of the time spent checking the 2nd gradient.
- Refactor the common part of test_fft_backward and test_fftn_backward into _fft_grad_check_helper
- Introduce a `slowAwareTest` decorator
- Split the test into fast and slow parts by checking the 2nd-degree gradient only in the slow part
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46509
Reviewed By: walterddr
Differential Revision: D24378901
Pulled By: malfet
fbshipit-source-id: 606670c2078480219905f63b9b278b835e760a66
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45989
This test was failing internally for the Thrift-based RPC agent, since
it has a different error regex. Use `self.get_timeout_error_regex` which gets
the timeout error string for each backend to fix this.
ghstack-source-id: 114463458
Test Plan: CI
Reviewed By: pritamdamania87
Differential Revision: D24170394
fbshipit-source-id: 9b30945e3e30f36472268d042173f8175ad88098
Summary:
Related to https://github.com/pytorch/pytorch/issues/44592
This PR fixes the deprecation warning for the nan_to_num function.
Below is the warning message when building the latest code.
```
../aten/src/ATen/native/UnaryOps.cpp: In function ‘at::Tensor& at::native::nan_to_num_out(at::Tensor&,
const at::Tensor&, c10::optional<double>, c10::optional<double>, c10::optional<double>)’:
../aten/src/ATen/native/UnaryOps.cpp:397:45: warning: ‘bool c10::isIntegralType(c10::ScalarType)’
is deprecated: isIntegralType is deprecated.
Please use the overload with 'includeBool' parameter instead. [-Wdeprecated-declarations]
if (c10::isIntegralType(self.scalar_type())) {
```
The deprecated warning is defined in `ScalarType.h`.
d790ec6de0/c10/core/ScalarType.h (L255-L260)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46309
Reviewed By: mrshenli
Differential Revision: D24310248
Pulled By: heitorschueroff
fbshipit-source-id: 0f9f2ad304eb5a2da9d2b415343f2fc9029037af
Summary:
This PR adds support for complex-valued input for `torch.pinverse`.
It also fixes the CUDA SVD implementation to return singular values with a real dtype.
Fixes https://github.com/pytorch/pytorch/issues/45385.
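A small usage sketch of the new complex support:
```
import torch

a = torch.randn(2, 4, 5, dtype=torch.complex128)   # batch of complex matrices
a_pinv = torch.pinverse(a)

# Moore-Penrose property A @ A+ @ A == A, checked loosely.
assert torch.allclose(a @ a_pinv @ a, a, atol=1e-6)
```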
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45819
Reviewed By: heitorschueroff
Differential Revision: D24306539
Pulled By: anjali411
fbshipit-source-id: 2fe19bc630de528e0643132689e1bc5ffeaa162a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45940
**Summary**
In `try_ann_to_type`, if an annotation has an attribute named
`__torch_script_class__`, it is assumed to be a TorchScript class that
has already been scripted. However, if it is a class that extends
another class, this code path causes a crash because it looks up the
JIT type for the class by name in the compilation unit. This JIT type
obviously cannot exist because inheritance is not supported.
This commit fixes this by looking up the qualified name of a class
in torch.jit._state._script_class in order to ascertain whether it has
already been scripted (instead of looking for a `__torch_script_class__`
attribute on the class object).
**Test Plan**
This commit adds a unit test consisting of the code sample from the
issue that reported this problem.
**Fixes**
This commit fixes #45860.
Test Plan: Imported from OSS
Reviewed By: anjali411
Differential Revision: D24310027
Pulled By: SplitInfinity
fbshipit-source-id: 9f8225f3316fd50738d98e3544bf5562b16425b6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46221
The RPC framework only allowed sending RPCs based on a provided
WorkerInfo or name. When using RPC with DDP, sometimes it might just be easier
to refer to everything in terms of ranks, since DDP doesn't support names yet.
As a result, supporting a `to` parameter in the RPC APIs that also allows
specifying a rank would be helpful.
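A sketch of the rank-based addressing this adds (assuming `rpc.init_rpc` has already been called on every participating worker; previously `to` had to be a worker name or WorkerInfo):
```
import torch
import torch.distributed.rpc as rpc

ret = rpc.rpc_sync(1, torch.add, args=(torch.ones(2), 3))   # `to` given as rank 1
fut = rpc.rpc_async(0, torch.add, args=(torch.ones(2), 3))  # async variant, rank 0
```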
ghstack-source-id: 114207172
Test Plan:
1) waitforbuildbot
2) Unit Tests
Reviewed By: mrshenli
Differential Revision: D24264989
fbshipit-source-id: 5edf5d92e2bd2f213471dfe7c74eebfa9efc9f70
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45994
Send/Recv tests were disabled because of the https://github.com/pytorch/pytorch/issues/42517. With that issue fixed, this diff enables those tests.
ghstack-source-id: 113970569
Test Plan: waitforsandcastle
Reviewed By: jiayisuse
Differential Revision: D24172484
fbshipit-source-id: 7492ee2e9bf88840c0d0086003ce8e99995aeb91
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45933
Occasionally users run DDP with models that have unused params; in this
case we would like to surface an error message telling them to run with
find_unused_parameters=True. However, a recent change to the rebuild_buckets logic (https://github.com/pytorch/pytorch/pull/44798) made
it so that we raise a size mismatch error when this happens, but the
information about unused parameters is more useful and is likely the most
common cause of failure. Prefer raising this error over the
subsequent size mismatch errors.
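A sketch of the situation, assuming the usual process group setup has been done elsewhere; the `unused` layer never contributes to the output, so DDP without `find_unused_parameters=True` would hit the error described above:
```
import torch
import torch.nn as nn

class PartiallyUsed(nn.Module):
    def __init__(self):
        super().__init__()
        self.used = nn.Linear(4, 4)
        self.unused = nn.Linear(4, 4)   # never called in forward

    def forward(self, x):
        return self.used(x)

# model = nn.parallel.DistributedDataParallel(
#     PartiallyUsed().cuda(rank), device_ids=[rank], find_unused_parameters=True)
```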
ghstack-source-id: 113914759
Test Plan: Added unittest
Reviewed By: mrshenli
Differential Revision: D24151256
fbshipit-source-id: 5d349a988b4aac7d3e0ef7b3cd84dfdcbe9db675
Summary:
- `TCPStoreTest.test_numkeys_delkeys` takes 5+ min (mostly in an idle wait for a socket timeout)
- `TestDataLoader.test_proper_exit` and `TestDataLoaderPersistentWorkers.test_proper_exit` take 2.5 min each
- `TestXNNPACKConv1dTransformPass.test_conv1d_with_relu_fc` takes 2 min to finish
Add an option to `print_test_stats.py` to skip reporting test classes that run for less than a second, and speed up `TestTorchDeviceTypeCUDA.test_matmul_45724_cuda`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46068
Reviewed By: mruberry
Differential Revision: D24208660
Pulled By: malfet
fbshipit-source-id: 780e0d8be4f0cf69ea28de79e423291a1f3349b7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45873
This diff adds support for sending/receiving to/from self. It also fixes a bug where p2p operations are not used by all processes.
ghstack-source-id: 113910526
Test Plan: waitforsandcastle
Reviewed By: jiayisuse
Differential Revision: D24124413
fbshipit-source-id: edccb830757ac64f569e7908fec8cb2b43cd098d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45783
After the previous device maps commits, `pipeWrite` might throw. In
this case, if we increment active calls before `pipeWrite` on the
caller, that active call won't be decremented properly when `pipeWrite`
throws. As a result, `shutdown` can silently time out. I noticed this
as some tests take more than 60s to finish.
This commit extracts the tensor device checking logic out of pipeWrite,
and makes sure the error is thrown before the active call count is
incremented.
Differential Revision: D24094803
Test Plan: Imported from OSS
Reviewed By: mruberry
Pulled By: mrshenli
fbshipit-source-id: d30316bb23d2afd3ba4f5540c3bd94a2ac10969b