Summary:
Pull Request resolved: https://github.com/pytorch/torchrec/pull/39
Pull Request resolved: https://github.com/facebookresearch/torchrec/pull/6
This makes it so that shared parameters get their own entry in `named_parameters`.
More broadly, this makes it so that
```
params_and_buffers = {**dict(mod.named_parameters(remove_duplicate=False)), **dict(mod.named_buffers(remove_duplicate=False))}
_stateless.functional_call(mod, params_and_buffers, args, kwargs)
```
is identical to calling the original module's forward pass.
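For illustration, here is a minimal sketch of that equivalence using a hypothetical module that ties two attributes to the same submodule (the `Tied` class is made up; the `_stateless` import path is the one referenced above and may differ in later releases):
```
import torch
from torch import nn
from torch.nn.utils import _stateless

class Tied(nn.Module):
    def __init__(self):
        super().__init__()
        self.a = nn.Linear(4, 4)
        self.b = self.a  # shared submodule: its parameters show up twice

    def forward(self, x):
        return self.b(self.a(x))

mod = Tied()
x = torch.randn(2, 4)
# With remove_duplicate=False, the shared parameters keep their own entries.
params_and_buffers = {
    **dict(mod.named_parameters(remove_duplicate=False)),
    **dict(mod.named_buffers(remove_duplicate=False)),
}
out = _stateless.functional_call(mod, params_and_buffers, (x,))
assert torch.allclose(out, mod(x))
```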
cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71542
Reviewed By: jbschlosser, albanD
Differential Revision: D33716716
Pulled By: Chillee
fbshipit-source-id: ff1ed9980bd1a3f7ebaf695ee5e401202b543213
(cherry picked from commit d6e3ad3cd0)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68035
RemoteModule is sometimes created using `object.__new__` (e.g., in `init_from_module_rref`); in that case the logging in the `__init__` method does not pick it up.
As a result, this adds a `__new__` method to RemoteModule to log all usages
appropriately.
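A minimal sketch of that pattern (illustrative only, not the PR's code), using the standard logging module in place of the internal API-usage logger:
```
import logging

logger = logging.getLogger(__name__)

class RemoteModuleSketch:
    def __new__(cls, *args, **kwargs):
        # Log every instantiation, including construction paths that skip
        # __init__ (e.g., cls.__new__(cls) followed by manual setup).
        logger.info("%s instance created", cls.__name__)
        return super().__new__(cls)
```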
ghstack-source-id: 142762019
Test Plan: waitforbuildbot
Reviewed By: vipannalla
Differential Revision: D32263978
fbshipit-source-id: a95ab0bb5d0836da8fe6333c41593af164b008d9
Summary:
RecursiveScriptModule has customized `__copy__` and `__deepcopy__` methods defined, so the warning/error saying that it is not copyable is outdated.
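A small usage sketch of the behavior described above, copying a scripted module via the standard library's copy module:
```
import copy
import torch

scripted = torch.jit.script(torch.nn.Linear(4, 2))
clone = copy.deepcopy(scripted)  # uses RecursiveScriptModule.__deepcopy__
x = torch.randn(1, 4)
assert torch.equal(scripted(x), clone(x))
```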
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64085
Reviewed By: rohan-varma
Differential Revision: D30598623
Pulled By: gmagogsfm
fbshipit-source-id: 0701d8617f42d818bc7b88244caee4cd47fbe976
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62927
As part of the ShardedTensor work, we realized we do need some sort of
_RemoteDevice structure that deals with our format of "workername/device" so
that users don't have to worry about parsing this string directly.
Right now this structure is just the bare minimum and is mostly a container for
describing a remote device. It is currently only used in ShardedTensor,
ShardingSpec and RemoteModule.
Once we actually have a consolidated remote device proposal, this class can be
extended appropriately if needed.
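A hypothetical sketch of how the "workername/device" format could be parsed (not the actual class, just an illustration of the container described above):
```
import torch

class RemoteDeviceSketch:
    """Illustrative container for a "workername/device" string."""

    def __init__(self, remote_device: str):
        fields = remote_device.split("/")
        if len(fields) == 2:
            self.worker_name, device = fields
        elif len(fields) == 1:
            # A bare worker name defaults to CPU.
            self.worker_name, device = fields[0], "cpu"
        else:
            raise ValueError(f"Invalid remote device: {remote_device}")
        self.device = torch.device(device)

rd = RemoteDeviceSketch("trainer0/cuda:0")
print(rd.worker_name, rd.device)  # trainer0 cuda:0
```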
ghstack-source-id: 135534086
Test Plan:
1) unit tests
2) waitforbuildbot
Reviewed By: SciPioneer
Differential Revision: D30170689
fbshipit-source-id: 1ac2e81c7a597dc40bf3fbf2c1168c382c66649f
Summary:
During development it is common practice to put `type: ignore` comments on lines that are correct but that `mypy` doesn't recognize as such. This often stems from the fact that the `mypy` version in use wasn't able to handle the pattern.
With every new release `mypy` gets better at handling complex code. In addition to fixing all the previously accepted but now failing patterns, we should also revisit all `type: ignore` comments to see whether they are still needed. Fortunately, we don't need to do that manually: by adding `warn_unused_ignores = True` to the configuration, `mypy` will error out whenever it encounters a `type: ignore` that is no longer needed.
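For illustration, a hypothetical file like the one below is flagged once `warn_unused_ignores = True` is set (or `mypy --warn-unused-ignores` is passed on the command line), because the ignore is no longer needed on recent `mypy` versions:
```
# example.py (hypothetical): the assignment below type-checks cleanly, so the
# ignore comment is unused and mypy reports 'Unused "type: ignore" comment'.
x: int = 0  # type: ignore
```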
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60006
Reviewed By: jbschlosser, malfet
Differential Revision: D29133237
Pulled By: albanD
fbshipit-source-id: 41e82edc5cd5affa7ccedad044b59b94dad4425a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59026
#Closes: https://github.com/pytorch/pytorch/issues/51480
Enabled the `train` and `eval` methods in RemoteModule to call the underlying train/eval methods on the actual nn.Module.
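A hedged usage sketch, assuming an already-initialized RPC worker group and the `torch.distributed.nn.RemoteModule` import path; the worker name is made up:
```
import torch
from torch.distributed.nn import RemoteModule

remote_linear = RemoteModule("worker1/cpu", torch.nn.Linear, args=(20, 30))
remote_linear.train()  # sets the underlying nn.Module's training mode to True
remote_linear.eval()   # sets the underlying nn.Module's training mode to False
```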
ghstack-source-id: 130421137
Test Plan:
Call these two updated methods in test_send_remote_module_over_the_wire in remote_module_test.py. To verify correctness, after running `train` the training mode should be True; after running `eval` the training mode of the remote module should be False.
Related test output:
✓ Pass: caffe2/test/distributed/rpc:process_group_agent - test_send_remote_module_over_the_wire (fb.test_process_group_agent.ProcessGroupThreeWorkersRemoteModuleTestWithFork) (23.059)
✓ Pass: caffe2/test/distributed/rpc:thrift_agent - test_send_remote_module_over_the_wire (fb.test_thrift_agent.ThriftThreeWorkersRemoteModuleTestWithFork) (27.965)
✓ Pass: caffe2/test/distributed/rpc:process_group_agent - test_send_remote_module_over_the_wire (test_process_group_agent.ProcessGroupThreeWorkersRemoteModuleTestWithSpawn) (74.481)
✓ Pass: caffe2/test/distributed/rpc:thrift_agent - test_send_remote_module_over_the_wire (fb.test_thrift_agent.ThriftThreeWorkersRemoteModuleTestWithSpawn) (77.243)
✓ Pass: caffe2/test/distributed/rpc:tensorpipe_agent - test_send_remote_module_over_the_wire (fb.test_tensorpipe_agent.TensorPipeThreeWorkersRemoteModuleTestWithFork) (58.644)
✓ Pass: caffe2/test/distributed/rpc:tensorpipe_agent - test_send_remote_module_over_the_wire (test_tensorpipe_agent.TensorPipeThreeWorkersRemoteModuleTestWithSpawn) (90.229)
Reviewed By: pritamdamania87, SciPioneer
Differential Revision: D28721078
fbshipit-source-id: aa45c1e5755f583200144ecfec3704f28221972c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59242
#Original PR Issue: https://github.com/pytorch/pytorch/issues/58274
This provides a workaround: instead of passing a script `RemoteModule` over RPC, pass its `module_rref` field over RPC and then construct a new `RemoteModule` on the receiver end.
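A hedged sketch of that workaround, assuming an initialized RPC group and the `init_from_module_rref` constructor mentioned elsewhere in this log; worker names and the module choice are made up:
```
import torch
from torch.distributed.nn import RemoteModule

remote_linear = RemoteModule("worker1/cpu", torch.nn.Linear, args=(20, 30))

# Instead of sending the RemoteModule itself over RPC, send its module_rref
# and rebuild a RemoteModule on the receiving end.
module_rref = remote_linear.get_module_rref()
rebuilt = RemoteModule.init_from_module_rref("worker1/cpu", module_rref)
```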
ghstack-source-id: 130268018
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- test_send_remote_module_over_the_wire_script_not_supported
buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- test_remote_module_py_pickle_not_supported_script
buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- test_create_remote_module_by_module_rref
Reviewed By: vipannalla
Differential Revision: D28794905
fbshipit-source-id: 1a677ff0d4b47c078ad47b50d7102a198a1fc39b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55728
Full design: https://github.com/pytorch/pytorch/issues/55207
This PR introduces ChunkShardingSpec (SingleShardingSpec in the design). The name ChunkShardingSpec was chosen because the spec is very similar to `torch.chunk` in terms of how a Tensor is split up, and it feels clearer than SingleShardingSpec.
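A hedged usage sketch; the import path has moved between releases, so treat it as an assumption:
```
from torch.distributed._shard.sharding_spec import ChunkShardingSpec

# Shard a tensor chunk-wise (like torch.chunk) along dim 0, placing the
# chunks on four ranks/devices.
spec = ChunkShardingSpec(
    dim=0,
    placements=[
        "rank:0/cuda:0",
        "rank:1/cuda:1",
        "rank:2/cuda:2",
        "rank:3/cuda:3",
    ],
)
```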
ghstack-source-id: 129603318
Test Plan: waitforbuildbot
Reviewed By: SciPioneer
Differential Revision: D27694108
fbshipit-source-id: c8764abe6a4d5fc56d023fda29b74b5af2a73b49
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58345
1. Add a sanity check to make sure any new attribute added to the constructor is added to either `_REMOTE_MODULE_PICKLED_ATTRIBUTES` or `_REMOTE_MODULE_ATTRIBUTES_IGNORE_FOR_PICKLING`.
2. Update some comments and warnings -- now if a new attribute is added after construction, it will not be pickled. Previously this would trigger a runtime error, which is hard to cover in a unit test (one worker hits the runtime error while the other worker times out).
Context: https://github.com/pytorch/pytorch/pull/58019#discussion_r632322083
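A hypothetical sketch of such a sanity check (the helper name and the example list contents are made up; the two attribute lists are the ones named above):
```
_REMOTE_MODULE_PICKLED_ATTRIBUTES = ("on", "device", "module_rref")          # illustrative contents
_REMOTE_MODULE_ATTRIBUTES_IGNORE_FOR_PICKLING = ("_module_interface_cls",)   # illustrative contents

def check_attribute_picklability(obj):
    # Every constructor-assigned attribute must be covered by one of the lists.
    for name in obj.__dict__:
        if (
            name not in _REMOTE_MODULE_PICKLED_ATTRIBUTES
            and name not in _REMOTE_MODULE_ATTRIBUTES_IGNORE_FOR_PICKLING
        ):
            raise AttributeError(
                f"Attribute {name} must be added to either "
                "_REMOTE_MODULE_PICKLED_ATTRIBUTES or "
                "_REMOTE_MODULE_ATTRIBUTES_IGNORE_FOR_PICKLING."
            )
```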
ghstack-source-id: 129070358
Test Plan: unit test
Reviewed By: rohan-varma
Differential Revision: D28460744
fbshipit-source-id: 8028186fc447c88fbf2bf57f5c5d321f42ba54ed
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58019
In order to support sending `RemoteModule` over RPC, the pickling/unpickling of `RemoteModule` was previously implemented via `__setstate__` and `__getstate__`. However, this meant that a user could invoke the same logic through the regular Python pickler/unpickler, which should not be allowed.
This PR ensures that the pickling can only happen over RPC and not via regular Python pickle.
Additionally, when a new attribute is added to `RemoteModule` and it is not listed in either `_REMOTE_MODULE_PICKLED_ATTRIBUTES` or `_REMOTE_MODULE_ATTRIBUTES_IGNORE_FOR_PICKLING`, the attribute will be ignored and an error message will be printed to stderr. Unlike before, no exception is raised, because an exception raised at the RPC layer can cause a timeout.
#Closes: https://github.com/pytorch/pytorch/issues/57516
ghstack-source-id: 128868501
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- test_send_remote_module_over_the_wire
buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- test_remote_module_py_pickle_not_supported
buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- test_send_remote_module_with_a_new_attribute_ignored_over_the_wire
buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- RemoteModule
buck test mode/dev-nosan //caffe2/torch/fb/csrc/concurrency/test:atomic_int_interprocess_test -- --exact 'caffe2/torch/fb/csrc/concurrency/test:atomic_int_interprocess_test - test_multiple_processes (caffe2.torch.fb.csrc.concurrency.test.atomic_int_interprocess_test.ForkMultipleProcessTest)'
buck test mode/dev //caffe2/torch/distributed/fb/test:app_test -- --exact 'caffe2/torch/distributed/fb/test:app_test - test_custom_init_rpc (caffe2.torch.distributed.fb.test.app_test.TestRpc)'
Reviewed By: mrshenli
Differential Revision: D28318270
fbshipit-source-id: 7e7df2a6690f0860c4531a244d38789db424496f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57413
An internal test fails because `Tuple[()]` is somehow not considered compatible with `Tuple[Any]` in TorchScript, even if the code involving variables of this type is never executed.
Therefore, create separate templates for instantiation to avoid the type-check failure. This addresses the FIXME left in https://github.com/pytorch/pytorch/pull/57288
#Closes: https://github.com/pytorch/pytorch/issues/51670
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- RemoteModule -j 1
buck test mode/dev-nosan caffe2/torch/fb/training_toolkit/applications/sparse_nn/batch_distributed_inference/tests:batch_distributed_inference_test -- test_load_di_parts
Reviewed By: wanchaol
Differential Revision: D28138864
fbshipit-source-id: 39e3e67b0c3979b607ff104d84b4fb1070ffefd6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57288
If the device map provided by RemoteModule is not empty, then the TensorPipe RPC backend can support directly sending GPU tensors over the wire.
Also add pybind of `_get_device_map`.
The changes to the unit test setup are separated out into a follow-up PR, as they currently break some tests in `distributed/rpc/test_faulty_agent.py`.
We still need to fix test_load_di_parts in `torch/fb/training_toolkit/applications/sparse_nn/batch_distributed_inference/tests:batch_distributed_inference_test`; currently an early return is used to bypass this test failure.
#Original PR issue: https://github.com/pytorch/pytorch/issues/51670
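A hedged sketch of the device-map setup that makes this possible, using the TensorPipe backend; worker names and device indices are made up:
```
import torch.distributed.rpc as rpc

options = rpc.TensorPipeRpcBackendOptions()
# Map local cuda:0 to cuda:1 on "worker1" for tensors sent to that worker.
options.set_device_map("worker1", {0: 1})

rpc.init_rpc(
    "worker0",
    rank=0,
    world_size=2,
    rpc_backend_options=options,
)
```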
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- test_input_moved_to_cuda_device
buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- test_input_moved_to_cuda_device_script
buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- RemoteModule -j 1
CAUTION: This one actually fails and now it is bypassed. See FIXME in `_remote_forward`.
buck test mode/dev-nosan caffe2/torch/fb/training_toolkit/applications/sparse_nn/batch_distributed_inference/tests:batch_distributed_inference_test -- test_load_di_parts
Reviewed By: wanchaol
Differential Revision: D28021672
fbshipit-source-id: a89245dc35e1d9479811ec6f98d9f34116837d79
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54812
Needed for quantization, since different attributes might refer to the same module instance.
Test Plan: Imported from OSS
Reviewed By: vkuzo
Differential Revision: D27408376
fbshipit-source-id: cada85c4a1772d3dd9502c3f6f9a56d690d527e7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54645
Had to replace RRef[..] with just RRef in the return signature, since Sphinx seemed to completely mess up the rendering of RRef[..].
ghstack-source-id: 125024783
Test Plan: View locally.
Reviewed By: SciPioneer
Differential Revision: D27314609
fbshipit-source-id: 2dd9901e79f31578ac7733f79dbeb376f686ed75
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53084
Adding RemoteModule to master RPC docs since it is a prototype
feature.
ghstack-source-id: 122816689
Test Plan: waitforbuildbot
Reviewed By: rohan-varma
Differential Revision: D26743372
fbshipit-source-id: 00ce9526291dfb68494e07be3e67d7d9c2686f1b
Summary:
Add a new device type 'XPU' ('xpu' in lower case) to PyTorch. Changes are needed in code related to the device model and kernel dispatch, e.g. DeviceType, Backend, and DispatchKey.
https://github.com/pytorch/pytorch/issues/48246
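A small sketch of the user-visible effect: the new device type can be named when constructing a `torch.device` (actually running tensors on XPU still requires a matching backend build):
```
import torch

dev = torch.device("xpu:0")
print(dev.type, dev.index)  # xpu 0
```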
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49786
Reviewed By: mrshenli
Differential Revision: D25893962
Pulled By: ezyang
fbshipit-source-id: 7ff0a316ee34cf0ed6fc7ead08ecdeb7df4b0052
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48983
Expose an API for users to retrieve the RRef for the underlying module.
This is useful if users would like to run custom code on the remote end against the underlying nn.Module.
Original PR issue: RemoteModule enhancements #40550
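A hedged sketch, assuming an initialized RPC group and that the retrieval method is `get_module_rref` as in current releases; the helper function and worker name are made up:
```
import torch
import torch.distributed.rpc as rpc
from torch.distributed.nn import RemoteModule

def count_parameters(module_rref):
    # Custom code executed on the owning worker against the underlying nn.Module.
    module = module_rref.local_value()
    return sum(p.numel() for p in module.parameters())

remote_linear = RemoteModule("worker1/cpu", torch.nn.Linear, args=(20, 30))
module_rref = remote_linear.get_module_rref()
num_params = rpc.rpc_sync(module_rref.owner(), count_parameters, args=(module_rref,))
```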
ghstack-source-id: 118378601
Test Plan: buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- RemoteModule
Reviewed By: pritamdamania87
Differential Revision: D25386042
fbshipit-source-id: 2dff33e8d5c9770be464eacf0b26c3e82f49a943
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46773
Changed the constructor of RemoteModule to accept a `remote_device` arg in the following format:
"<workername>/<device>" (e.g., "trainer0/cpu", "ps0/cuda:0")
This arg merges the original `on` and `device` args.
Original PR issue: RemoteDevice Format #46554
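A hedged usage sketch of the merged argument, assuming an initialized RPC group; the worker name is made up and the import path is the one used in current releases:
```
import torch
from torch.distributed.nn import RemoteModule

# Place the module on worker "trainer0"'s first GPU.
remote_linear = RemoteModule("trainer0/cuda:0", torch.nn.Linear, args=(20, 30))
```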
ghstack-source-id: 115448051
Test Plan: buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- RemoteModule
Reviewed By: pritamdamania87
Differential Revision: D24482562
fbshipit-source-id: 5acfc73772576a4b674df27625bf560b8f8e67c1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44254
Add a device parameter to RemoteModule, so it can be placed on any device
and not just CPU.
Original PR issue: RemoteModule enhancements #40550
Test Plan: buck test test/distributed/rpc:process_group_agent -- RemoteModule
Reviewed By: pritamdamania87
Differential Revision: D23483803
fbshipit-source-id: 4918583c15c6a38a255ccbf12c9168660ab7f6db
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43906
This method returns a list of RRefs of remote parameters that can be fed into the DistributedOptimizer.
Original PR issue: RemoteModule enhancements #40550
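A hedged sketch of feeding those RRefs into DistributedOptimizer, assuming an initialized RPC group and that the method is `remote_parameters` as in current releases; names are made up:
```
import torch
from torch.distributed.nn import RemoteModule
from torch.distributed.optim import DistributedOptimizer

remote_linear = RemoteModule("worker1/cpu", torch.nn.Linear, args=(20, 30))
opt = DistributedOptimizer(
    torch.optim.SGD,
    remote_linear.remote_parameters(),
    lr=0.05,
)
```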
Test Plan: buck test caffe2/test/distributed/rpc:process_group_agent -- RemoteModule
Reviewed By: rohan-varma
Differential Revision: D23399586
fbshipit-source-id: 4b0f1ccf2e47c8a9e4f79cb2c8668f3cdbdff820