Commit Graph

25 Commits

Author SHA1 Message Date
Rohan Varma
1350b99de4 Add local shutdown to process group agent (#30330)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30330

This is now possible due to previous changes made in `gloo` and `ProcessGroupGloo`. We `abort` the listener thread that is waiting for a message, and join all other threads. The API is changed so that the previous `wait_all_workers` does not destroy the agent, and this is now done in a new `shutdown` method. All callsites are updated appropriately.
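
A hedged usage sketch of the resulting API split, assuming the names in this stack (`wait_all_workers` stays a synchronization barrier; the new `shutdown` destroys the agent); a packaged release may expose these differently:

```
import torch.distributed.rpc as rpc

rpc.init_rpc("worker0", rank=0, world_size=2)
# ... issue rpc_sync / rpc_async / remote calls ...

# Block until every worker reaches this point; no longer destroys the agent.
rpc.wait_all_workers()
# Tear down locally: abort the listener thread and join the other threads.
rpc.shutdown()
```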

ghstack-source-id: 94673884

Test Plan: Unit tests pass.

Reviewed By: mrshenli

Differential Revision: D18661775

fbshipit-source-id: 5aaa7c14603e18253394224994f6cd43234301c2
2019-11-27 22:34:08 -08:00
Shen Li
efe1859ad9 By default ignore RRef leaks during shutdown (#30217)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30217

Before this commit, RRefContext throws an error if it detects any
RRef leak during shutdown. However, this requires applications to
make sure they have freed all references to RRefs in application
code, which can be a bad debugging experience for large
applications. Besides, this also relies on Python GC to free things
up in time, which might not always happen. After this commit,
RRefContext ignores leaking RRefs during shutdown, as shutdown
is called when the application has finished training and no longer
cares about local state. Hence, it should be OK to just ignore
those leaks and destroy OwnerRRefs. If an application would like to
enforce no leaks, set torch.distributed.rpc.api._ignore_rref_leak
to False (see the sketch below).
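
A minimal sketch of opting back into strict leak checking, using the flag named above; the `shutdown` entry point assumes the commit above this one (earlier stacks used `wait_all_workers`):

```
import torch.distributed.rpc as rpc

rpc.init_rpc("worker0", rank=0, world_size=2)
# ... create and use RRefs ...

# Default after this commit: leaks are ignored. Flip the flag to enforce
# that every OwnerRRef has been released before shutdown.
rpc.api._ignore_rref_leak = False
rpc.shutdown()  # now raises if any RRef is still alive
```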

Test Plan: Imported from OSS

Differential Revision: D18632546

Pulled By: mrshenli

fbshipit-source-id: 2744b2401dafdd16de0e0a76cf8e07777bed0f38
2019-11-26 06:53:58 -08:00
Rohan Varma
5c6705e62c add default arg for init_method (#30208)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30208

Adds a default arg for init_method so users don't have to pass it in,
and moves it to the `RpcBackendOptions` struct. Removes the `init_method` arg from rpc.init_rpc. Also fixes some docs.
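
A usage sketch under the new shape of the API; `ProcessGroupRpcBackendOptions` is the concrete options class in later released PyTorch, standing in here for the `RpcBackendOptions` struct this commit describes:

```
import torch.distributed.rpc as rpc

# init_method is no longer an init_rpc argument; it lives on the backend
# options and has a default, so the bare call works:
rpc.init_rpc("worker0", rank=0, world_size=2)

# To override the rendezvous URL, go through the options struct instead:
opts = rpc.ProcessGroupRpcBackendOptions(init_method="tcp://127.0.0.1:29500")
rpc.init_rpc("worker1", rank=1, world_size=2, rpc_backend_options=opts)
```
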
ghstack-source-id: 94500475

Test Plan: Unit tests pass.

Reviewed By: mrshenli

Differential Revision: D18630074

fbshipit-source-id: 04b7dd7ec96f4c4da311b71d250233f1f262135a
2019-11-25 14:52:48 -08:00
Shen Li
a9f3f48f88 Revert D5578006: Add local shutdown to process group agent
Test Plan: revert-hammer

Differential Revision:
D5578006

Original commit changeset: 6258879fb44c

fbshipit-source-id: 11b893b3a280a8383eeb20a0548626811616dca1
2019-11-22 11:31:04 -08:00
Rohan Varma
c478a92b93 Add local shutdown to process group agent (#30020)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30020
This is now possible due to previous changes made in `gloo` and `ProcessGroupGloo`. We `abort` the listener thread that is waiting for a message, and join all other threads. The destructor calls this same `localShutdown` method, but we ensure this is not called multiple times.

ghstack-source-id: 94415336

Test Plan: Unit tests pass.

Differential Revision: D5578006

fbshipit-source-id: 6258879fb44c9fca97fdfad64468c1488c16ac02
2019-11-22 10:03:00 -08:00
Rohan Varma
f41422121e default construct rpc agent options based on the backend type (#30201)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30201

Provide a default constructor so that users don't have to construct
RPC agent options themselves. Also rename this to `RpcBackendOptions` as suggested.
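
A sketch of what the default construction buys, assuming the `BackendType` enum exposed via `torch.distributed.rpc.backend_registry` in later releases:

```
import torch.distributed.rpc as rpc
from torch.distributed.rpc.backend_registry import BackendType

# rpc_backend_options omitted: a default options object is constructed
# internally based on the chosen backend type.
rpc.init_rpc(
    "worker0",
    backend=BackendType.PROCESS_GROUP,
    rank=0,
    world_size=2,
)
```
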
ghstack-source-id: 94411768

Test Plan: Unit tests pass.

Differential Revision: D18628698

fbshipit-source-id: 81fb45f124ad1006e628f6045162308093c9d446
2019-11-22 08:18:06 -08:00
Rohan Varma
f304bd5062 rename join_rpc to wait_all_workers in public api (#30050)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30050

Renames this API to wait_all_workers as discussed.
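
A before/after sketch of the rename; `wait_all_workers` later became internal, so treat the names as those of this stack:

```
import torch.distributed.rpc as rpc

# Before this commit:
# rpc.join_rpc()

# After this commit:
rpc.wait_all_workers()
```
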
ghstack-source-id: 94273005

Test Plan: Unit tests pass

Differential Revision: D18581466

fbshipit-source-id: 4ff5d5fb2d528f17252d5b5f30c3047d2efb92bf
2019-11-20 12:38:35 -08:00
Shihao Xu
80e3f17301 Resubmit "Add RpcAgentOptions struct type, which bundles different required arguments for different RpcAgents" (#30093)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30093

https://github.com/pytorch/pytorch/pull/28226 introduced the `worker_to_id` arg to the `def init_rpc` function for other `RpcAgent`s, but it's not actually used by `ProcessGroupAgent`. Cleanup is wanted for this, as described in https://github.com/pytorch/pytorch/issues/29031.

To adapt to the differences between `RpcAgent`s, add a `RpcAgentOptions` base class, which allows leveraging inheritance to add extra fields (sketched below).
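
A Python sketch of the inheritance pattern described above; the real struct is C++ with pybind bindings, and fields beyond `init_method`/`rpc_timeout` are illustrative:

```
from dataclasses import dataclass

@dataclass
class RpcAgentOptions:
    # Fields every RpcAgent needs.
    rpc_timeout: float = 60.0
    init_method: str = "env://"

@dataclass
class ProcessGroupRpcAgentOptions(RpcAgentOptions):
    # Extra fields only ProcessGroupAgent cares about.
    num_send_recv_threads: int = 4
```
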
ghstack-source-id: 94197295

Test Plan:
### OSS RPC + RRef tests

```
buck test mode/dev-nosan //caffe2/test:rpc_fork
```

```
buck test mode/dev-nosan caffe2/torch/fb/distributed/thriftRpcBackend/test:thrift_rpc_fork_test -- test_sync_rpc
```

### Prototype RRef tests

```
buck test mode/dev-nosan caffe2/torch/fb/distributed/pytorch/tests:test_rpc
```

```
buck test mode/dev-nosan //caffe2/torch/fb/distributed/pytorch/tests:test_rpc_thrift_rpc_agent
```

### Dist autograd

```
buck test mode/dev-nosan caffe2/test:dist_autograd_fork
```

```
buck test mode/dev-nosan caffe2/torch/fb/distributed/thriftRpcBackend/test:thrift_dist_autograd_fork_test
```

Differential Revision: D18595578

fbshipit-source-id: 616fca3b844c171ed5277bbc6a2b1693bc3a8065
2019-11-19 18:52:30 -08:00
Shihao Xu
868cb05a30 Resubmit "Add RpcAgentTestFixture to extract duplicate code" (#30092)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30092

There is duplicate code in components that rely on RpcAgent. Extract it into a re-usable test fixture class (sketched below).
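
A sketch of the extraction, with hypothetical test names; the fixture owns the RpcAgent boilerplate so each suite only adds its test bodies:

```
import unittest

class RpcAgentTestFixture(unittest.TestCase):
    # Shared boilerplate for every suite that relies on an RpcAgent.
    def setUp(self):
        super().setUp()
        self.rank = 0
        self.world_size = 2
        self.init_method = "tcp://127.0.0.1:29500"

class RpcTest(RpcAgentTestFixture):
    def test_sync_rpc(self):
        # Test body reuses self.rank / self.world_size / self.init_method.
        self.assertEqual(self.world_size, 2)

class DistAutogradTest(RpcAgentTestFixture):
    def test_backward(self):
        self.assertEqual(self.rank, 0)
```
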
ghstack-source-id: 94196891

Test Plan:
### RPC + RRef

```
buck test mode/dev-nosan //caffe2/test:rpc_fork

buck test mode/dev-nosan //caffe2/test:rpc_spawn
```

```
buck test mode/dev-nosan //caffe2/test:rpc_fork_thrift

buck test mode/dev-nosan //caffe2/test:rpc_spawn_thrift
```

### Dist Autograd

```
buck test mode/dev-nosan //caffe2/test:dist_autograd_fork

buck test mode/dev-nosan //caffe2/test:dist_autograd_spawn
```

```
buck test mode/dev-nosan //caffe2/test:dist_autograd_fork_thrift

buck test mode/dev-nosan //caffe2/test:dist_autograd_spawn_thrift
```

### Dist Optimizer

```
buck test mode/dev-nosan //caffe2/test:dist_optimizer_fork

buck test mode/dev-nosan //caffe2/test:dist_optimizer_spawn
```

```
buck test mode/dev-nosan //caffe2/test:dist_optimizer_fork_thrift

buck test mode/dev-nosan //caffe2/test:dist_optimizer_spawn_thrift
```

Differential Revision: D18595408

fbshipit-source-id: 8360759c63e838fb19d4eb1aeacca0bf8eb4b55f
2019-11-19 16:24:51 -08:00
Edward Yang
7d287688eb Revert D5689636: Add RpcAgentTestFixture to extract duplicate code
Test Plan: revert-hammer

Differential Revision:
D5689636

Original commit changeset: f35eea1359ad

fbshipit-source-id: 31928fce5e96b3beceefbc9a03f54769f10b7e1a
2019-11-19 08:14:44 -08:00
Edward Yang
1dda8186ae Revert D18549919: Add RpcAgentOptions struct type, which bundles different required arguments for different RpcAgents
Test Plan: revert-hammer

Differential Revision:
D18549919

Original commit changeset: b9f3f1a41d1f

fbshipit-source-id: 2d5e578d18c0725b59eb99a0e942fbf7fe3341ee
2019-11-19 08:14:40 -08:00
Shihao Xu
21dc1d4543 Add RpcAgentOptions struct type, which bundles different required arguments for different RpcAgents (#29972)
Summary:
https://github.com/pytorch/pytorch/pull/28226 introduced the `worker_to_id` arg to the `def init_rpc` function for other `RpcAgent`s, but it's not actually used by `ProcessGroupAgent`. Cleanup is wanted for this, as described in https://github.com/pytorch/pytorch/issues/29031.

To adapt to the differences between `RpcAgent`s, add a `RpcAgentOptions` base class, which allows leveraging inheritance to add extra fields.

closes https://github.com/pytorch/pytorch/issues/29031
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29972

Differential Revision: D18549919

Pulled By: xush6528

fbshipit-source-id: b9f3f1a41d1ff18498734081870820b055d56f5b
2019-11-19 01:00:08 -08:00
Shihao Xu
8dd67057f1 Add RpcAgentTestFixture to extract duplicate code (#29747)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29747

There is duplicate code in components that rely on RpcAgent. Extract it into a re-usable test fixture class.

Test Plan:
### RPC + RRef

```
buck test mode/dev-nosan //caffe2/test:rpc_fork

buck test mode/dev-nosan //caffe2/test:rpc_spawn
```

```
buck test mode/dev-nosan //caffe2/test:rpc_fork_thrift

buck test mode/dev-nosan //caffe2/test:rpc_spawn_thrift
```

### Dist Autograd

```
buck test mode/dev-nosan //caffe2/test:dist_autograd_fork

buck test mode/dev-nosan //caffe2/test:dist_autograd_spawn
```

```
buck test mode/dev-nosan //caffe2/test:dist_autograd_fork_thrift

buck test mode/dev-nosan //caffe2/test:dist_autograd_spawn_thrift
```

### Dist Optimizer

```
buck test mode/dev-nosan //caffe2/test:dist_optimizer_fork

buck test mode/dev-nosan //caffe2/test:dist_optimizer_spawn
```

```
buck test mode/dev-nosan //caffe2/test:dist_optimizer_fork_thrift

buck test mode/dev-nosan //caffe2/test:dist_optimizer_spawn_thrift
```

Differential Revision: D5689636

fbshipit-source-id: f35eea1359addaaac9bd8d00d0a5df228a236511
2019-11-18 12:54:17 -08:00
Rohan Varma
639133d6d1 rename init_model_parallel to init_rpc (#29762)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29762

Rename this API as discussed, since its use cases extend beyond
model parallelism.
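
A before/after sketch of the rename (argument names approximate the API at this point in the stack):

```
import torch.distributed.rpc as rpc

# Before this commit:
# rpc.init_model_parallel("worker0", rank=0, world_size=2)

# After this commit:
rpc.init_rpc("worker0", rank=0, world_size=2)
```
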
ghstack-source-id: 94020627

Test Plan: Unit tests pass

Differential Revision: D18491743

fbshipit-source-id: d07676bb14f072c64da0ce99ee818bcc582efc57
2019-11-18 06:07:44 -08:00
Pritam Damania
310343e946 Properly shutdown RPC even in the case of clean_shutdown=False. (#29148)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29148

We would skip rpc.join_rpc() in the case of `clean_shutdown=False`.
This would exit the process without properly cleaning up the local RPCAgent,
resulting in a crash.

To fix this, we still call rpc.join_rpc() even in an unclean shutdown (see
the sketch below). Note that rpc.join_rpc() eventually needs to be replaced
with a local `shutdown` call, since we need a way to shut down the local RPC
agent properly.
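
A sketch of the pattern this fix enforces, using the `join_rpc` name from this point in the history; `run_training` is a hypothetical workload:

```
import torch.distributed.rpc as rpc

rpc.init_rpc("worker0", rank=0, world_size=2)
try:
    run_training()  # hypothetical workload; may raise on unclean shutdown
finally:
    # Runs on both clean and unclean paths, so the local RPCAgent is torn
    # down before process exit instead of crashing.
    rpc.join_rpc()
```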

Test Plan: waitforbuildbot

Reviewed By: xush6528

Differential Revision: D18306941

fbshipit-source-id: 2685db3924f7aa4516f3b28f58d6c127bcd55ba9
2019-11-11 11:30:48 -08:00
Shihao Xu
8f1564b8ab Add enum type to rpc registry for consolidating RPC initialization code path (#28628)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28628

Consolidate the code paths for ProcessGroupAgent construction and other RPC backend construction behind an enum-keyed registry (sketched below).
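
A minimal sketch of an enum-keyed registry in the spirit of this change; the names are hypothetical, the real one lives in `torch.distributed.rpc.backend_registry`:

```
from enum import Enum

class BackendType(Enum):
    PROCESS_GROUP = 0

_init_handlers = {}

def register_backend(backend_type, init_handler):
    # Each backend registers the function that constructs its RpcAgent.
    _init_handlers[backend_type] = init_handler

def init_rpc_backend(backend_type, *args, **kwargs):
    # Single consolidated code path for constructing any RpcAgent.
    return _init_handlers[backend_type](*args, **kwargs)

# Example registration for the process-group backend (handler is a stub).
register_backend(BackendType.PROCESS_GROUP, lambda *a, **kw: "ProcessGroupAgent")
```
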
ghstack-source-id: 92845348

Differential Revision: D5516188

fbshipit-source-id: 151d9b7b74f68631d6673fecc74dec525949b8f0
2019-10-29 17:26:15 -07:00
Pritam Damania
1322daa506 Improve error handling for distributed autograd engine. (#27940)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27940

1) If we receive an error for outstanding rpcs, we enqueue an appropriate error
on the local autograd engine.
2) Add an `exit_on_error` mode for the local autograd engine, where the
computation stops if we see an error.
ghstack-source-id: 92603377

Test Plan: Added unit tests to test failures.

Differential Revision: D17916844

fbshipit-source-id: 199a7832f1033c36a9bbcc1e80d86576c04965d0
2019-10-25 12:07:27 -07:00
Shen Li
261a13a84b Enable dist autograd tests (#28606)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28606

Without passing setup_model_parallel=True to dist_init, the
decorator actually takes the function object as the value for the
flag.

Test Plan: Imported from OSS

Differential Revision: D18120507

Pulled By: mrshenli

fbshipit-source-id: afbaa381647e8f284e28fa9dbdd2a7c411073b3f
2019-10-24 15:30:27 -07:00
Shihao Xu
59402f51cf Make init_method url appending step re-usable by both init_process_group and init_model_parallel(init_rpc) (#28226)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28226

# Goal

The rendezvous step should be the first step not only for `init_process_group` but also for `init_model_parallel`.

The roadblock is that there is a special step in `init_process_group` where the `rank` and `world_size` arguments passed to `init_process_group(..)` are appended to the `init_method` url string.

We need to make this argument appending step common and re-usable for both `init_process_group` and `init_model_parallel`.

# Solution

- Put argument appending inside of `rendezvous` function.
- Remove manual `init_method` url construction. Delegate the responsibility to the `rendezvous` function.
- Use the `rendezvous` function for any `RpcAgent` (the appending step is sketched below).
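
A sketch of the appending step now living inside `rendezvous()`; the helper name is hypothetical, the URL handling uses the standard library:

```
from urllib.parse import urlparse, urlunparse

def _append_rank_and_world_size(init_method: str, rank: int, world_size: int) -> str:
    # Fold rank/world_size into the init_method URL's query string so the
    # rendezvous handler sees them regardless of how the URL was built.
    parsed = urlparse(init_method)
    extra = f"rank={rank}&world_size={world_size}"
    query = f"{parsed.query}&{extra}" if parsed.query else extra
    return urlunparse(parsed._replace(query=query))

# tcp://127.0.0.1:29500?rank=0&world_size=2
print(_append_rank_and_world_size("tcp://127.0.0.1:29500", 0, 2))
```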

Test Plan:
```
buck test mode/dev-nosan caffe2/test:c10d
```

```
buck test mode/dev-nosan caffe2/test:rpc_fork -- test_invalid_names

buck-out/gen/caffe2/test/rpc_fork\#binary.par -r test_worker_id
```

```
buck test mode/dev-nosan caffe2/torch/fb/distributed/pytorch/tests:test_rpc -- test_sync_rpc
```

```
buck test mode/dev-nosan caffe2/torch/fb/rendezvous:zeus_test
```

```
buck test mode/dev-nosan //caffe2/torch/fb/distributed/modules/tests:test_sharded_pairwise_attention_pooling -- test_single_trainer_multiple_pss
```

Differential Revision: D5524494

fbshipit-source-id: 50be58ec3c928621b0874b044ef4a1640534d8ef
2019-10-23 21:51:08 -07:00
Shihao Xu
3523e5427a Add master to OSS RPC test (#27776)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27776

I think it's not worth equipping other `RPCAgent`s with collective communication capability, i.e. 1) having GLOO contained in `RPCAgent`, or 2) implementing ::barrier() and ::drain() based on RPC messaging.

The only use case that does not have a master is the OSS unit test suite, caffe2/test/rpc_test.py.

I think having those unit tests use a master is simpler than equipping `RPCAgent` with collective communication capability.

Differential Revision: D5445858

fbshipit-source-id: 56ee24703abd8c5b366829430bef657e0f1dfeba
2019-10-16 13:45:45 -07:00
Pritam Damania
3bccd3fc0d Distributed Autograd - FAST mode backward pass implementation. (#27022)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27022

This change implements the "FAST" mode distributed autograd backward
pass as described in https://github.com/pytorch/pytorch/issues/23110.

At a high level the backward pass works as follows:
1. We start by computing dependencies on the node that calls
`torch.distributed.backward`.
2. This node computes the dependencies starting from the root nodes provided in
the backward call and all the 'send' functions present in the current autograd
context. The "FAST" mode assumes all 'send' functions are part of the autograd
computation.
3. Once the dependency computation is done, the distributed autograd engine
calls the local autograd engine to execute the autograd graph. Note that the
autograd graph on a single node is not necessarily connected because of
inter-node communication. As a result, we have special handling to ensure the
local autograd engine executes the entire graph starting from the provided
roots and all 'send' functions on the node.
4. When the local autograd engine hits a 'recv' function, it performs an async
RPC to send the gradients over to the appropriate node and stores a future in
the autograd context to keep track of this RPC.
5. On the destination node, the appropriate 'send' function is looked up and
enqueued on the local autograd engine. If this is the first time the node is
hearing about this autograd context id on the backward pass, then the node
computes dependencies for the local autograd engine.
6. As part of compute dependencies, the distributed autograd engine discovers
all leaf nodes and ensures those are passed as 'outputs' to the local autograd
engine. This avoids running the 'AccumulateGrad' function.
7. The gradients computed for the leaf nodes are then actually accumulated in
`DistAutogradContext` for the appropriate autograd context id.
8. The distributed autograd engine waits for the local autograd engine
to complete and also waits for all the 'Futures' (stored in 4.) for respective
RPCs to finish.

We have made the following changes to the local autograd engine for this
purpose:

1. Expose GraphTask and NodeTask so that the distributed autograd engine can
use them.
2. Expose an `execute_with_graph_task` API which allows the distributed engine
to build a GraphTask and pass it to the local autograd engine.
3. Expose an `enqueue_on_cpu` API, which allows the distributed engine to build
a `NodeTask` for a 'send' function and enqueue it on the local autograd engine.

In addition to this a few general improvements:
1. Added a `PropagateGradients` RPC call for the 'recv' function to pass
gradients to the appropriate node during the backward pass.
2. Use IValues as much as possible in serialization for RpcWithAutograd.
3. If Future.wait(), contains a message type EXCEPTION, we throw an appropriate
exception instead of just returning the message. This is inline with what most
Future.wait() APIs do.
4. Added a `get_gradients(context_id)` API which allows users to retrieve a map
from Tensor to its respective gradient for the provided context_id on the
local node (see the usage sketch below).
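
A hedged usage sketch of the backward pass described above; the two-argument `backward(context_id, roots)` form matches later released PyTorch and may differ from the exact signature in this stack:

```
import torch
import torch.distributed.autograd as dist_autograd
import torch.distributed.rpc as rpc

with dist_autograd.context() as context_id:
    # Forward pass: each RPC attaches paired 'send'/'recv' functions.
    t1 = torch.ones(2, requires_grad=True)
    t2 = torch.ones(2, requires_grad=True)
    loss = rpc.rpc_sync("worker1", torch.add, args=(t1, t2)).sum()

    # FAST mode: dependencies are computed from the roots plus every
    # 'send' function recorded in this context.
    dist_autograd.backward(context_id, [loss])

    # Leaf gradients accumulate in the context, not in .grad:
    grads = dist_autograd.get_gradients(context_id)
```
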
ghstack-source-id: 91794926

Test Plan: unit tests.

Differential Revision: D17652615

fbshipit-source-id: 96f65c52adb2706ee29f4b49e1655afaa0a3bec3
2019-10-12 09:47:49 -07:00
Shihao Xu
130127ca59 Rename BACKEND to RPC_BACKEND to separate it from COMMUNICATION_BACKEND (e.g. gloo, nccl) in rpc_test.py (#27792)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27792

Close https://github.com/pytorch/pytorch/issues/27232
ghstack-source-id: 91807741

Differential Revision: D5474297

fbshipit-source-id: 5b230a6857813ec981e5056880abb5859655daa2
2019-10-11 19:49:46 -07:00
Pieter Noordhuis
b4ce922b58 Move RPC API to torch.distributed.rpc
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27290

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D17808212

Pulled By: pietern

fbshipit-source-id: c79907940fe4888b2ceaaa1cda0078e39c89b454
2019-10-08 11:31:25 -07:00
Shihao Xu
e166bcbbde Make RpcTest re-usable by other RPC backends by using init_method to initialize a RPC backend (#27320)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27320

https://github.com/pytorch/pytorch/pull/27208/

# Problem

Other RPC backends take init_method.

# Solution

Set up init_method in rpc tests.
ghstack-source-id: 91335127

Differential Revision: D17709219

fbshipit-source-id: 3184c6e9b922a6ff9f4d1cb9abfa118b23f43eeb
2019-10-04 09:20:05 -07:00
Pritam Damania
fe4170bda8 Add send and recv backward functions for builtin operators RPC. (#25527)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25527

Master GH issue: https://github.com/pytorch/pytorch/issues/23110.

This change builds upon https://github.com/pytorch/pytorch/pull/24876 and
provides all the autograd hooks needed for a forward pass with distributed rpc
for builtin operators. This change does not address distributed rpc for Python
UDFs; that will be addressed in follow-up PRs.

Summary of changes:
1. Attach send autograd functions when a request is sent from the client and
response is sent from the server.
2. Attach receive autograd functions when a request is received on the server
and a response is received on the client.
3. Generate a globally unique autograd_message_id for each send/recv autograd
function pair to uniquely identify them (see the sketch below).
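
A conceptual sketch of the forward-pass hooks, in terms of the public API; the send/recv pairing itself happens inside the C++ RPC layer:

```
import torch
import torch.distributed.autograd as dist_autograd
import torch.distributed.rpc as rpc

with dist_autograd.context() as context_id:
    x = torch.ones(2, requires_grad=True)
    # Client side: a 'send' function is attached to x before the request
    # leaves; the server attaches the matching 'recv' on arrival. The pair
    # shares a globally unique autograd_message_id.
    y = rpc.rpc_sync("worker1", torch.add, args=(x, x))
    # The response path attaches the reverse recv/send pair for y.
```
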
ghstack-source-id: 91240466

Test Plan: unit tests.

Differential Revision: D17148077

fbshipit-source-id: 192d8a3f552ed7cc939f55dcca332965c9bd3233
2019-10-03 01:18:46 -07:00