Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30330
This is now possible due to previous changes made in `gloo` and `ProcessGroupGloo`. We `abort` the listener thread that is waiting for a message, and join all other threads. The API is changed so that the previous `wait_all_workers` does not destroy the agent, and this is now done in a new `shutdown` method. All callsites are updated appropriately.
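A hedged sketch of the resulting per-worker call pattern (names follow the description above; exact signatures at the time of this PR may differ):
```python
import torch.distributed.rpc as rpc

def run_worker(rank, world_size):
    rpc.init_rpc("worker{}".format(rank), rank=rank, world_size=world_size)
    # ... issue RPCs ...
    # Block until all workers have finished their outstanding work;
    # this no longer destroys the local agent.
    rpc.wait_all_workers()
    # Explicitly destroy the agent: abort the listener thread blocked on a
    # receive and join the remaining threads.
    rpc.shutdown()
```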
ghstack-source-id: 94673884
Test Plan: Unit tests pass.
Reviewed By: mrshenli
Differential Revision: D18661775
fbshipit-source-id: 5aaa7c14603e18253394224994f6cd43234301c2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30217
Before this commit, RRefContext throws an error if it detects any
RRef leak during shutdown. However, this requires applications to
make sure they have freed all references to RRefs in application
code, which can be a bad debugging experience for large
applications. Besides, this also relies on Python GC to free things
up in time, which might not always happen. After this commit,
RRefContext ignores leaked RRefs during shutdown, as shutdown
is called when the application has finished training and no longer
cares about local state. Hence, it should be OK to just ignore
those leaks and destroy the OwnerRRefs. If an application would like to
enforce that there are no leaks, set torch.distributed.rpc.api._ignore_rref_leak
to False.
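For example (a hedged sketch; `_ignore_rref_leak` is private, and the wind-down entry point changed names across adjacent PRs in this stack):
```python
import torch.distributed.rpc as rpc
import torch.distributed.rpc.api as rpc_api

def run_worker(rank, world_size):
    # Leaked RRefs are now ignored at shutdown by default; flip the flag
    # to make the wind-down raise if any RRef is still referenced.
    rpc_api._ignore_rref_leak = False

    rpc.init_rpc("worker{}".format(rank), rank=rank, world_size=world_size)
    # ... create and use RRefs ...
    rpc.shutdown()  # raises on leaked RRefs only when the flag is False
```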
Test Plan: Imported from OSS
Differential Revision: D18632546
Pulled By: mrshenli
fbshipit-source-id: 2744b2401dafdd16de0e0a76cf8e07777bed0f38
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30208
Adds a default argument for `init_method` so users don't have to pass it in,
and moves it to the `RpcBackendOptions` struct. Removes the `init_method` arg from `rpc.init_rpc`. Also fixes some docs.
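A hedged sketch of the call sites after this change (the `ProcessGroupRpcBackendOptions` class name and the `rpc_backend_options` kwarg are assumptions based on the description above):
```python
import torch.distributed.rpc as rpc

def run_worker_default(rank, world_size):
    # init_method now has a default and lives on the backend options,
    # so the common case does not mention it at all:
    rpc.init_rpc("worker{}".format(rank), rank=rank, world_size=world_size)

def run_worker_custom(rank, world_size):
    # Overriding the rendezvous goes through the options object rather
    # than a separate init_rpc argument:
    opts = rpc.ProcessGroupRpcBackendOptions(init_method="tcp://localhost:29500")
    rpc.init_rpc("worker{}".format(rank), rank=rank, world_size=world_size,
                 rpc_backend_options=opts)
```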
ghstack-source-id: 94500475
Test Plan: Unit tests pass.
Reviewed By: mrshenli
Differential Revision: D18630074
fbshipit-source-id: 04b7dd7ec96f4c4da311b71d250233f1f262135a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30020
This is now possible due to previous changes made in `gloo` and `ProcessGroupGloo`. We `abort` the listener thread that is waiting for a message, and join all other threads. The destructor calls this same `localShutdown` method, but we ensure this is not called multiple times.
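The "not called multiple times" guarantee amounts to guarding the teardown with a once flag; a minimal sketch of the pattern (in Python for brevity, although the actual change is in the C++ `ProcessGroupAgent`):
```python
import threading

class AgentSketch:
    def __init__(self):
        self._shutdown_lock = threading.Lock()
        self._shutdown_done = False

    def local_shutdown(self):
        # Abort the blocked listener and join worker threads, but only once,
        # whether invoked explicitly or from the destructor.
        with self._shutdown_lock:
            if self._shutdown_done:
                return
            self._shutdown_done = True
        # ... abort listener, join threads ...

    def __del__(self):
        self.local_shutdown()
```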
ghstack-source-id: 94415336
Test Plan: Unit tests pass.
Differential Revision: D5578006
fbshipit-source-id: 6258879fb44c9fca97fdfad64468c1488c16ac02
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30201
Provide a default constructor so that users don't have to construct
the RPC agent options themselves. Also rename this to `RpcBackendOptions` as suggested.
ghstack-source-id: 94411768
Test Plan: Unit tests pass.
Differential Revision: D18628698
fbshipit-source-id: 81fb45f124ad1006e628f6045162308093c9d446
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30050
Renames this API to wait_all_workers as discussed.
ghstack-source-id: 94273005
Test Plan: Unit tests pass
Differential Revision: D18581466
fbshipit-source-id: 4ff5d5fb2d528f17252d5b5f30c3047d2efb92bf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30093
https://github.com/pytorch/pytorch/pull/28226 introduced the `worker_to_id` arg to the `init_rpc` function for other `RpcAgent`s, but it is not actually used by `ProcessGroupAgent`. Cleanup is wanted for this, as described in https://github.com/pytorch/pytorch/issues/29031.
To accommodate the differences between `RpcAgent` implementations, add an `RpcAgentOptions` base class, which allows leveraging inheritance to add extra fields.
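A hedged sketch of the intended options hierarchy as it appears from Python (field names here are illustrative; the real classes are defined in C++ and bound into `torch.distributed.rpc`):
```python
class RpcAgentOptions:
    """Fields common to every RpcAgent backend."""
    def __init__(self, rpc_timeout=None):
        self.rpc_timeout = rpc_timeout

class ProcessGroupRpcAgentOptions(RpcAgentOptions):
    """Extra fields that only the ProcessGroup-based agent needs."""
    def __init__(self, num_send_recv_threads=4, **common):
        super().__init__(**common)
        self.num_send_recv_threads = num_send_recv_threads

# init_rpc can accept a single options object and let each backend use its
# own subclass, instead of taking backend-specific kwargs like worker_to_id.
opts = ProcessGroupRpcAgentOptions(num_send_recv_threads=8, rpc_timeout=30)
```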
ghstack-source-id: 94197295
Test Plan:
### OSS RPC + RRef tests
```
buck test mode/dev-nosan //caffe2/test:rpc_fork
```
```
buck test mode/dev-nosan caffe2/torch/fb/distributed/thriftRpcBackend/test:thrift_rpc_fork_test -- test_sync_rpc
```
### Prototype RRef tests
```
buck test mode/dev-nosan caffe2/torch/fb/distributed/pytorch/tests:test_rpc
```
```
buck test mode/dev-nosan //caffe2/torch/fb/distributed/pytorch/tests:test_rpc_thrift_rpc_agent
```
### Dist autograd
```
buck test mode/dev-nosan caffe2/test:dist_autograd_fork
```
```
buck test mode/dev-nosan caffe2/torch/fb/distributed/thriftRpcBackend/test:thrift_dist_autograd_fork_test
```
Differential Revision: D18595578
fbshipit-source-id: 616fca3b844c171ed5277bbc6a2b1693bc3a8065
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30092
There is duplicated code across components that rely on RpcAgent. Extract it into a reusable test fixture class.
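A hedged sketch of the shape of such a fixture (class and attribute names are illustrative, not necessarily the ones added in this diff):
```python
import unittest
import torch.distributed.rpc as rpc

class RpcAgentTestFixture(unittest.TestCase):
    """Shared setup for RPC, dist autograd, and dist optimizer tests."""

    @property
    def world_size(self):
        return 4

    def worker_name(self, rank):
        return "worker{}".format(rank)

    def init_rpc(self, rank):
        # Subclasses (e.g. a Thrift-backed fixture) override the backend
        # selection; the rest of the test body stays identical.
        rpc.init_rpc(self.worker_name(rank), rank=rank,
                     world_size=self.world_size)
```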
ghstack-source-id: 94196891
Test Plan:
### RPC + RRef
```
buck test mode/dev-nosan //caffe2/test:rpc_fork
buck test mode/dev-nosan //caffe2/test:rpc_spawn
```
```
buck test mode/dev-nosan //caffe2/test:rpc_fork_thrift
buck test mode/dev-nosan //caffe2/test:rpc_spawn_thrift
```
### Dist Autograd
```
buck test mode/dev-nosan //caffe2/test:dist_autograd_fork
buck test mode/dev-nosan //caffe2/test:dist_autograd_spawn
```
```
buck test mode/dev-nosan //caffe2/test:dist_autograd_fork_thrift
buck test mode/dev-nosan //caffe2/test:dist_autograd_spawn_thrift
```
### Dist Optimizer
```
buck test mode/dev-nosan //caffe2/test:dist_optimizer_fork
buck test mode/dev-nosan //caffe2/test:dist_optimizer_spawn
```
```
buck test mode/dev-nosan //caffe2/test:dist_optimizer_fork_thrift
buck test mode/dev-nosan //caffe2/test:dist_optimizer_spawn_thrift
```
Differential Revision: D18595408
fbshipit-source-id: 8360759c63e838fb19d4eb1aeacca0bf8eb4b55f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29747
There is duplicated code across components that rely on RpcAgent. Extract it into a reusable test fixture class.
Test Plan:
### RPC + RRef
```
buck test mode/dev-nosan //caffe2/test:rpc_fork
buck test mode/dev-nosan //caffe2/test:rpc_spawn
```
```
buck test mode/dev-nosan //caffe2/test:rpc_fork_thrift
buck test mode/dev-nosan //caffe2/test:rpc_spawn_thrift
```
### Dist Autograd
```
buck test mode/dev-nosan //caffe2/test:dist_autograd_fork
buck test mode/dev-nosan //caffe2/test:dist_autograd_spawn
```
```
buck test mode/dev-nosan //caffe2/test:dist_autograd_fork_thrift
buck test mode/dev-nosan //caffe2/test:dist_autograd_spawn_thrift
```
### Dist Optimizer
```
buck test mode/dev-nosan //caffe2/test:dist_optimizer_fork
buck test mode/dev-nosan //caffe2/test:dist_optimizer_spawn
```
```
buck test mode/dev-nosan //caffe2/test:dist_optimizer_fork_thrift
buck test mode/dev-nosan //caffe2/test:dist_optimizer_spawn_thrift
```
Differential Revision: D5689636
fbshipit-source-id: f35eea1359addaaac9bd8d00d0a5df228a236511
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29762
Rename this API as discussed, since its use cases extend beyond
model parallelism.
ghstack-source-id: 94020627
Test Plan: Unit tests pass
Differential Revision: D18491743
fbshipit-source-id: d07676bb14f072c64da0ce99ee818bcc582efc57
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29148
Previously, we would skip rpc.join_rpc() in the case of `clean_shutdown=False`.
This would exit the process without properly cleaning up the local RpcAgent,
resulting in a crash.
To fix this, we now call rpc.join_rpc() even during an unclean
shutdown. Note that rpc.join_rpc() eventually needs to be replaced with a local
`shutdown` call, since we need a way to shut down the local RPC agent
properly.
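A minimal sketch of the resulting teardown pattern in the test helpers (assuming a `clean_shutdown` flag as above; `join_rpc` was the public wind-down call at the time):
```python
import torch.distributed.rpc as rpc

def tear_down(clean_shutdown):
    # clean_shutdown only changes what the test does before teardown; the
    # local RpcAgent must always be wound down, otherwise the process
    # crashes on exit.
    rpc.join_rpc()
```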
Test Plan: waitforbuildbot
Reviewed By: xush6528
Differential Revision: D18306941
fbshipit-source-id: 2685db3924f7aa4516f3b28f58d6c127bcd55ba9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27940
1) If we receive an error for outstanding RPCs, we enqueue an appropriate error
on the local autograd engine.
2) Add an `exit_on_error` mode for the local autograd engine, where the
computation stops if we see an error.
ghstack-source-id: 92603377
Test Plan: Added unit tests to test failures.
Differential Revision: D17916844
fbshipit-source-id: 199a7832f1033c36a9bbcc1e80d86576c04965d0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28606
Without passing setup_model_parallel=True to dist_init, the decorator
actually takes the decorated function object as the value for the
flag.
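This is the classic optional-argument-decorator pitfall; a minimal illustration of the failure mode (the real `dist_init` lives in the RPC test utilities and is more involved):
```python
def dist_init(setup_model_parallel=True):
    def decorator(test_fn):
        def wrapper(self, *args, **kwargs):
            # ... set up / tear down RPC around the test ...
            return test_fn(self, *args, **kwargs)
        return wrapper
    return decorator

@dist_init            # BUG: used without parentheses, the test function itself
def test_buggy(self):     # becomes the value of setup_model_parallel, and the
    pass                  # test is replaced by `decorator`, not `wrapper`.

@dist_init(setup_model_parallel=True)   # correct: call the decorator factory
def test_fixed(self):
    pass
```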
Test Plan: Imported from OSS
Differential Revision: D18120507
Pulled By: mrshenli
fbshipit-source-id: afbaa381647e8f284e28fa9dbdd2a7c411073b3f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28226
# Goal
The rendezvous step should be the first step not only for `init_process_group` but also for `init_model_parallel`.
The roadblock is that there is a special step in `init_process_group` where the `rank` and `world_size` arguments passed to `init_process_group(..)` are appended to the `init_method` URL string.
We need to make this argument-appending step common and reusable for both `init_process_group` and `init_model_parallel`.
# Solution
- Put the argument appending inside the `rendezvous` function (see the sketch below).
- Remove manual `init_method` url construction. Delegate the responsibility to the `rendezvous` function.
- Use the `rendezvous` function for any `RpcAgent`.
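A hedged sketch of the argument-appending step being folded into `rendezvous` (the query-parameter names follow the existing `init_method` URL convention; the real implementation may differ):
```python
def _append_rank_and_world_size(init_method, rank, world_size):
    """Fold explicit rank/world_size arguments into the init_method URL."""
    sep = "&" if "?" in init_method else "?"
    return "{}{}rank={}&world_size={}".format(init_method, sep, rank, world_size)

# Both init_process_group and init_model_parallel (and any RpcAgent) can then
# simply call rendezvous() with a fully-specified URL:
url = _append_rank_and_world_size("tcp://localhost:29500", rank=0, world_size=2)
# -> "tcp://localhost:29500?rank=0&world_size=2"
```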
Test Plan:
```
buck test mode/dev-nosan caffe2/test:c10d
```
```
buck test mode/dev-nosan caffe2/test:rpc_fork -- test_invalid_names
buck-out/gen/caffe2/test/rpc_fork\#binary.par -r test_worker_id
```
```
buck test mode/dev-nosan caffe2/torch/fb/distributed/pytorch/tests:test_rpc -- test_sync_rpc
```
```
buck test mode/dev-nosan caffe2/torch/fb/rendezvous:zeus_test
```
```
buck test mode/dev-nosan //caffe2/torch/fb/distributed/modules/tests:test_sharded_pairwise_attention_pooling -- test_single_trainer_multiple_pss
```
Differential Revision: D5524494
fbshipit-source-id: 50be58ec3c928621b0874b044ef4a1640534d8ef
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27776
I think it's not worth it to equip other `RPCAgent`s with collective communication capability, i.e. 1) having GLOO contained in `RPCAgent`, or 2) implementing ::barrier() and ::drain() based on RPC messaging.
The only use case that does not have a master is the OSS unit test suite, caffe2/test/rpc_test.py.
I think having those unit tests use a master is simpler than equipping `RPCAgent` with collective communication capability.
Differential Revision: D5445858
fbshipit-source-id: 56ee24703abd8c5b366829430bef657e0f1dfeba
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27022
This change implements the "FAST" mode distributed autograd backward
pass as described in https://github.com/pytorch/pytorch/issues/23110.
At a high level the backward pass works as follows:
1. We start by computing dependencies on the node that calls
`torch.distributed.backward`.
2. This node computes the dependencies starting from the root nodes provided in
the backward call and all the 'send' functions present in the current autograd
context. The "FAST" mode assumes all 'send' functions are part of the autograd
computation.
3. Once the dependency computation is done, the distributed autograd engine
calls the local autograd engine to execute the autograd graph. Note that the
autograd graph on a single node is not necessarily connected because of
inter-node communication. As a result, we have special handling to ensure the
local autograd engine executes the entire graph starting from the
provided roots and all 'send' functions on the node.
4. When the local autograd engine hits a 'recv' function, it performs an async
RPC to send the gradients over to the appropriate node and stores a future in
the autograd context to keep track of this RPC.
5. On the destination node, the appropriate 'send' function is looked up and
enqueued on the local autograd engine. If this is the first time the node is
hearing about this autograd context id on the backward pass, then the node
computes dependencies for the local autograd engine.
6. As part of computing dependencies, the distributed autograd engine discovers
all leaf nodes and ensures those are passed as 'outputs' to the local autograd
engine. This avoids running the 'AccumulateGrad' function.
7. The gradients computed for the leaf nodes are then actually accumulated in
`DistAutogradContext` for the appropriate autograd context id.
8. The distributed autograd engine waits for the local autograd engine
to complete and also waits for all the 'Futures' (stored in 4.) for respective
RPCs to finish.
We have made the following changes to the local autograd engine for this
purpose:
1. Expose GraphTask and NodeTask so that the distributed autograd engine can
use them.
2. Expose an `execute_with_graph_task` API which allows the distributed engine
to build a GraphTask and pass it to the local autograd engine.
3. Expose an `enqueue_on_cpu` API, which allows the distributed engine to build
a `NodeTask` for a 'send' function and enqueue it on the local autograd engine.
In addition to this, a few general improvements:
1. Added a `PropagateGradients` RPC call for the 'recv' function to pass
gradients to the appropriate node during the backward pass.
2. Use IValues as much as possible in serialization for RpcWithAutograd.
3. If Future.wait() contains a message of type EXCEPTION, we throw an appropriate
exception instead of just returning the message. This is in line with what most
Future.wait() APIs do.
4. Added a `get_gradients(context_id)` API which allows users to retrieve a map
from Tensor to its respective gradient for the provided context_id on the local
node (see the usage sketch below).
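A hedged end-to-end usage sketch of the pieces described above (the `backward(context_id, ...)` spelling follows the later public API; at the time of this PR the context may have been picked up implicitly):
```python
import torch
import torch.distributed.autograd as dist_autograd
import torch.distributed.rpc as rpc

def run_trainer():
    # Assumes rpc.init_rpc(...) has already been called on every worker.
    with dist_autograd.context() as context_id:
        t1 = torch.rand(3, 3, requires_grad=True)
        t2 = torch.rand(3, 3, requires_grad=True)
        # The forward RPC attaches 'send'/'recv' autograd functions.
        loss = rpc.rpc_sync("worker1", torch.add, args=(t1, t2)).sum()
        # "FAST" mode backward: dependencies are computed from the roots and
        # from every 'send' function registered in this autograd context.
        dist_autograd.backward(context_id, [loss])
        # Gradients are accumulated per-context rather than in .grad fields.
        grads = dist_autograd.get_gradients(context_id)
        print(grads[t1], grads[t2])
```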
ghstack-source-id: 91794926
Test Plan: unit tests.
Differential Revision: D17652615
fbshipit-source-id: 96f65c52adb2706ee29f4b49e1655afaa0a3bec3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25527
Master GH issue: https://github.com/pytorch/pytorch/issues/23110.
This change builds upon https://github.com/pytorch/pytorch/pull/24876 and
provides all the autograd hooks needed for a forward pass with distributed rpc
for builtin operators. This change does not address distributed rpc for python
UDFs and that will be addressed in follow up PRs.
Summary of changes:
1. Attach send autograd functions when a request is sent from the client and
response is sent from the server.
2. Attach receive autograd functions when a request is received on the server
and a response is received on the client.
3. Generate a globally unique autograd_message_id for each send/recv autograd
function pair to uniquely identify them.
ghstack-source-id: 91240466
Test Plan: unit tests.
Differential Revision: D17148077
fbshipit-source-id: 192d8a3f552ed7cc939f55dcca332965c9bd3233