Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57413
An internal test fails because `Tuple[()]` is somehow not considered compatible with `Tuple[Any]` in TorchScript, even if the code that involves variables of this type is never executed.
Therefore, create separate templates for instantiation to avoid the type-check failure. This addresses the FIXME left in https://github.com/pytorch/pytorch/pull/57288.
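As a rough sketch of the approach (the template bodies, placeholder names, and the `enable_moving_cpu_tensors_to_cuda` flag below are illustrative, not the PR's exact generated code), the instantiator picks one of several templates up front, so the scripted `forward` never contains a never-executed branch whose annotations trip the `Tuple[()]` vs. `Tuple[Any]` check:
```
# Hedged sketch only: names and template bodies are illustrative.
_FORWARD_WITH_MOVE_TEMPLATE = """
def forward(self, {arg_decls}):
    args = ({args},)
    kwargs = {{{kwargs}}}
    # Variant that moves CPU tensors in args/kwargs to the module's CUDA device.
    return _remote_forward_with_move(self.module_rref, self.device, args, kwargs)
"""

_FORWARD_NO_MOVE_TEMPLATE = """
def forward(self, {arg_decls}):
    args = ({args},)
    kwargs = {{{kwargs}}}
    return _remote_forward(self.module_rref, args, kwargs)
"""


def _select_forward_template(enable_moving_cpu_tensors_to_cuda: bool) -> str:
    # Selecting the template at instantiation time keeps the unused variant
    # (and its incompatible tuple annotations) out of the scripted code entirely.
    return (
        _FORWARD_WITH_MOVE_TEMPLATE
        if enable_moving_cpu_tensors_to_cuda
        else _FORWARD_NO_MOVE_TEMPLATE
    )
```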
#Closes: https://github.com/pytorch/pytorch/issues/51670
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- RemoteModule -j 1
buck test mode/dev-nosan caffe2/torch/fb/training_toolkit/applications/sparse_nn/batch_distributed_inference/tests:batch_distributed_inference_test -- test_load_di_parts
Reviewed By: wanchaol
Differential Revision: D28138864
fbshipit-source-id: 39e3e67b0c3979b607ff104d84b4fb1070ffefd6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57288
If the device map provided by RemoteModule is not empty, then the TensorPipe RPC backend can send GPU tensors directly over the wire.
Also add a pybind binding for `_get_device_map`.
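A minimal sketch of how this is used from the caller's side (worker names, ranks, and the device mapping below are placeholders, not taken from the PR):
```
import os

import torch
import torch.distributed.rpc as rpc
from torch.distributed.nn import RemoteModule


def run_owner() -> None:
    # Rendezvous settings for a hypothetical 2-worker setup.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")

    options = rpc.TensorPipeRpcBackendOptions()
    # Map local cuda:0 to cuda:0 on "worker1" so CUDA tensors can be sent
    # directly over the wire instead of being staged through CPU.
    options.set_device_map("worker1", {0: 0})
    rpc.init_rpc("worker0", rank=0, world_size=2, rpc_backend_options=options)

    # Place the module on worker1's cuda:0 and call forward with a CUDA input.
    remote_linear = RemoteModule("worker1/cuda:0", torch.nn.Linear, args=(10, 10))
    out = remote_linear.forward(torch.randn(5, 10, device="cuda:0"))
    print(out.shape)

    rpc.shutdown()
```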
The changes to the unit test setup are separated out into a follow-up PR, as they currently break some tests in `distributed/rpc/test_faulty_agent.py`.
`test_load_di_parts` in `torch/fb/training_toolkit/applications/sparse_nn/batch_distributed_inference/tests:batch_distributed_inference_test` still needs to be fixed. Currently an early return is used to bypass this test failure.
#Original PR issue: https://github.com/pytorch/pytorch/issues/51670
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- test_input_moved_to_cuda_device
buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- test_input_moved_to_cuda_device_script
buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- RemoteModule -j 1
CAUTION: This one actually fails and is currently bypassed. See the FIXME in `_remote_forward`.
buck test mode/dev-nosan caffe2/torch/fb/training_toolkit/applications/sparse_nn/batch_distributed_inference/tests:batch_distributed_inference_test -- test_load_di_parts
Reviewed By: wanchaol
Differential Revision: D28021672
fbshipit-source-id: a89245dc35e1d9479811ec6f98d9f34116837d79
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56943
If the module is placed on a CUDA device, then all CPU tensors in `args` and `kwargs` are implicitly moved to the same CUDA device before running forward (see the sketch below).
Currently the forward output still needs to be moved from the CUDA device back to CPU, until:
1) The process group RPC backend is completely deprecated and the TensorPipe RPC backend is always used;
2) A device map is explicitly provided to the TensorPipe RPC backend.
These steps will be done in a separate PR.
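A hedged sketch of this implicit device movement (the helper names below are illustrative, not the PR's internals):
```
from typing import Any, Dict, Tuple

import torch


def _move_tensor(obj: Any, device: torch.device) -> Any:
    # Only tensors are moved; everything else is passed through unchanged.
    return obj.to(device) if isinstance(obj, torch.Tensor) else obj


def _run_forward_on_device(
    module: torch.nn.Module,
    device: torch.device,
    args: Tuple[Any, ...],
    kwargs: Dict[str, Any],
) -> Any:
    # Move CPU tensors in args/kwargs to the module's device before forward.
    args = tuple(_move_tensor(a, device) for a in args)
    kwargs = {k: _move_tensor(v, device) for k, v in kwargs.items()}
    out = module(*args, **kwargs)
    # For now, the output goes back to CPU before being sent over the wire.
    return _move_tensor(out, torch.device("cpu"))
```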
#Original PR issue: https://github.com/pytorch/pytorch/issues/51670
ghstack-source-id: 127457584
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- test_input_moved_to_cuda_device
buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- test_input_moved_to_cuda_device_script
buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- RemoteModule
buck test mode/dev-nosan //caffe2/torch/fb/training_toolkit/applications/sparse_nn/batch_distributed_inference/tests:batch_distributed_inference_test -- --exact 'caffe2/torch/fb/training_toolkit/applications/sparse_nn/batch_distributed_inference/tests:batch_distributed_inference_test - test_load_di_parts (caffe2.torch.fb.training_toolkit.applications.sparse_nn.batch_distributed_inference.tests.batch_distributed_inference_test.BatchDistributedInferenceTest)'
Reviewed By: wanchaol
Differential Revision: D27934791
fbshipit-source-id: de27e27b905db83cc52800e63684fc6c942e9dc7
Summary:
As this diff shows, currently there are a couple hundred instances of raw `noqa` in the codebase, which just ignore all errors on a given line. That isn't great, so this PR changes all existing instances of that antipattern to qualify the `noqa` with respect to a specific error code, and adds a lint to prevent more of this from happening in the future.
Interestingly, some of the examples the `noqa` lint catches are genuine attempts to qualify the `noqa` with a specific error code, such as these two:
```
test/jit/test_misc.py:27: print(f"{hello + ' ' + test}, I'm a {test}") # noqa E999
test/jit/test_misc.py:28: print(f"format blank") # noqa F541
```
However, those are still wrong because they are [missing a colon](https://flake8.pycqa.org/en/3.9.1/user/violations.html#in-line-ignoring-errors), which actually causes the error code to be completely ignored:
- If you change those error codes to anything else, the warnings are still suppressed.
- If you add the necessary colons, it turns out that `E261` was also being suppressed, unintentionally:
```
test/jit/test_misc.py:27:57: E261 at least two spaces before inline comment
test/jit/test_misc.py:28:35: E261 at least two spaces before inline comment
```
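For illustration (these lines are not taken from the PR), the colon is what makes the qualification take effect:
```
print(f"format blank")  # noqa F541   <- no colon: the code is ignored and every error on this line is suppressed
print(f"format blank")  # noqa: F541  <- with the colon: only F541 (f-string without placeholders) is suppressed
```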
I did try using [flake8-noqa](https://pypi.org/project/flake8-noqa/) instead of a custom `git grep` lint, but it didn't seem to work. This PR is definitely missing some of the functionality that flake8-noqa is supposed to provide, though, so if someone can figure out how to use it, we should do that instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56272
Test Plan:
CI should pass on the tip of this PR, and we know that the lint works because the following CI run (before this PR was finished) failed:
- https://github.com/pytorch/pytorch/runs/2365189927
Reviewed By: janeyx99
Differential Revision: D27830127
Pulled By: samestep
fbshipit-source-id: d6dcf4f945ebd18cd76c46a07f3b408296864fcb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40902
See the bottom of this stack for context.
Test Plan: Imported from OSS
Reviewed By: eellison
Differential Revision: D22360210
Pulled By: suo
fbshipit-source-id: 4275127173a36982ce9ad357aa344435b98e1faf
Summary:
Previously the module would log some data using `print()`. This can be a problem in contexts where the process expects to write data to stdout itself. This diff changes those log statements to use `logger` instead, making them consistent with other log statements in the same module.
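A minimal sketch of the kind of change involved (the function below is hypothetical, not the module's actual code):
```
import logging

logger = logging.getLogger(__name__)


def _instantiate_template(module_name: str) -> None:
    # Previously: print(f"Instantiating {module_name}")
    logger.info("Instantiating %s", module_name)
```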
Test Plan:
Confirmed that no unexpected test output showed up when running:
buck test caffe2/test/distributed/nn/api:remote_module_fork
Differential Revision: D22136172
fbshipit-source-id: a3d144eba6c75925ed684981793c84b36eb45a5d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40173
- Avoid path sharing across runs and workers, so that even when test methods/workers run in parallel on the same host, they don't interfere with each other.
- In some environments (e.g. the fb internal CI platform), the torch package file tree is not writable, but the temporary folder chosen by Python's `tempfile` module is always writable (on Linux it's "/tmp"). See the sketch below.
Closes https://github.com/pytorch/pytorch/issues/40120
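A hedged sketch of the approach (the directory variable and helper below are illustrative, not necessarily the PR's exact code):
```
import os
import tempfile

# A per-process temporary directory: unique for each run/worker and always
# writable, unlike the installed torch package tree.
_TEMP_DIR = tempfile.TemporaryDirectory()
INSTANTIATED_TEMPLATE_DIR_PATH = _TEMP_DIR.name


def _write_instantiated_module(module_name: str, source: str) -> str:
    path = os.path.join(INSTANTIATED_TEMPLATE_DIR_PATH, f"{module_name}.py")
    with open(path, "w") as f:
        f.write(source)
    return path
```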
ghstack-source-id: 106086340
Test Plan:
```
buck test mode/dev-nosan //caffe2/test/distributed/nn/jit:test_instantiator
buck build mode/dev-nosan //caffe2/test/distributed/nn/jit:test_instantiator && \
buck-out/gen/caffe2/test/distributed/nn/jit/test_instantiator\#binary.par -r test_instantiate_scripted_remote_module_template
buck build mode/dev-nosan //caffe2/test/distributed/nn/jit:test_instantiator && \
buck-out/gen/caffe2/test/distributed/nn/jit/test_instantiator\#binary.par -r test_instantiate_non_scripted_remote_module_template
```
```
buck test mode/dev-nosan //caffe2/test/distributed/nn/api:remote_module_fork
```
Differential Revision: D5708493
fbshipit-source-id: dd92695682433aaf79d1912c7956cef40a450eaf