Commit Graph

31 Commits

Author SHA1 Message Date
Xu Zhao
eaa993a2e0 Add type annotations to torch._C._distributed_rpc module. (#46624)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46624

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D24761656

Pulled By: xuzhao9

fbshipit-source-id: b55aee5dd2b97f573a50e5bbfddde7d984943fec
2020-11-06 01:28:51 -08:00
Shen Li
8cb7280242 Revert "Remove device maps from TensorPipe for v1.7 release (#45353)" (#45762)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45762

This reverts commit 5211fb97ac.

Test Plan: Imported from OSS

Reviewed By: colesbury

Differential Revision: D24088231

Pulled By: mrshenli

fbshipit-source-id: b6ee15ec5ae137ea127bdc2db8e1842764bc01d4
2020-10-02 15:14:05 -07:00
Shen Li
5211fb97ac Remove device maps from TensorPipe for v1.7 release (#45353)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45353

Temporarily removing this feature; it will be added back after the branch cut.

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D23939865

Pulled By: mrshenli

fbshipit-source-id: 7dceaffea6b9a16512b5ba6036da73e7f8f83a8e
2020-09-25 16:51:45 -07:00
Luca Wehrstedt
76dc50e9c8 [RPC] Infer backend type if only options are given (#45065)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45065

To preserve backwards compatibility with applications that were passing in some ProcessGroupRpcBackendOptions without explicitly setting backend=BackendType.PROCESS_GROUP, we now infer the backend type from the options when only the options are passed. If neither is passed, we default to TensorPipe, as before this change.
ghstack-source-id: 112586258
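
A minimal sketch of the inferred-backend behavior described above (worker names, init_method, and world size are illustrative, not from this PR):

```
import torch.distributed.rpc as rpc

# Passing only ProcessGroup options now implies backend=BackendType.PROCESS_GROUP:
opts = rpc.ProcessGroupRpcBackendOptions(init_method="tcp://127.0.0.1:29500")
rpc.init_rpc(name="worker0", rank=0, world_size=2, rpc_backend_options=opts)

# Passing neither backend nor options still defaults to TensorPipe.
```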

Test Plan: Added new unit tests.

Reviewed By: pritamdamania87

Differential Revision: D23814289

fbshipit-source-id: f4be7919e0817a4f539a50ab12216dc3178cb752
2020-09-23 00:46:27 -07:00
Shen Li
09e7f62ce2 Fix RPC and ProcessGroup GIL deadlock (#45088)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45088

Fixes #45082

Found a few problems while working on #44983

1. We deliberately swallow RPC timeouts during shutdown, as we haven't
found a good way to handle them. When we converted `_wait_all_workers`
into `_all_gather`, the same logic was inherited. However, as
`_all_gather` is meant to be used in more general scenarios, it should
no longer stay silent about errors. This commit lets errors propagate
out of `_all_gather` and has `shutdown()` catch and log them.
2. After fixing (1), I found that `UnpickledPythonCall` needs to
acquire the GIL on destruction, and this can lead to a deadlock when
used in conjunction with `ProcessGroup`, because the `ProcessGroup`
ctor is a synchronization point which holds the GIL. In `init_rpc`,
followers (`rank != 0`) can exit before the leader (`rank == 0`). When
the two overlap, a follower can exit `init_rpc` after running
`_broadcast_to_followers` and before reaching the dtor of
`UnpickledPythonCall`. It then runs the ctor of `ProcessGroup`,
which holds the GIL while waiting for the leader to join. However, the
leader is waiting for the response from `_broadcast_to_followers`,
which is blocked by the dtor of `UnpickledPythonCall`. Hence the
deadlock. This commit drops the GIL in the `ProcessGroup` ctor.
3. After fixing (2), I found that the `TensorPipe` backend
nondeterministically fails `test_local_shutdown`, for a reason similar
to (2), except this time `shutdown()` on a follower runs before the
leader finishes `init_rpc`. This commit adds a join for the
`TensorPipe` backend's `init_rpc` after `_all_gather`.

The third fix alone should resolve the second issue as well, but since
I saw no reason to hold the GIL during the `ProcessGroup` ctor, I made
that change too.

Test Plan: Imported from OSS

Reviewed By: pritamdamania87

Differential Revision: D23825592

Pulled By: mrshenli

fbshipit-source-id: 94920f2ad357746a6b8e4ffaa380dd56a7310976
2020-09-21 21:47:27 -07:00
Xiang Gao
20ac736200 Remove py2 compatible future imports (#44735)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44735

Reviewed By: mruberry

Differential Revision: D23731306

Pulled By: ezyang

fbshipit-source-id: 0ba009a99e475ddbe22981be8ac636f8a1c8b02f
2020-09-16 12:55:57 -07:00
Shen Li
5006d24302 Make TensorPipe the default backend for RPC (#43246)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43246

Test Plan: Imported from OSS

Reviewed By: osalpekar

Differential Revision: D23206042

Pulled By: osalpekar

fbshipit-source-id: 258481ea9e753cd36c2787183827ca3b81d678e3
2020-08-20 14:17:02 -07:00
Shen Li
06aaf8c20d Add set_device_map to TensorPipeOptions to support GPU args (#42637)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42637

This commit enables sending non-CPU tensors through RPC using the
TensorPipe backend. Users can configure device mappings by calling
`set_device_map` on `TensorPipeRpcBackendOptions`. Internally,
the `init_rpc` API verifies the correctness of the device mappings; it
shuts down RPC if the check fails, or proceeds and passes the global
mappings to `TensorPipeAgent` if the check succeeds. For serde,
we added a device-indices field to the TensorPipe read and write
buffers, which must either be empty (all tensors on CPU) or match
the tensors in the RPC message in order and number. This commit
does not yet achieve zero-copy: the tensor is always moved to CPU
on the sender and then moved to the specified device on the receiver.
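
A minimal sketch of the API described above (worker names and device indices are illustrative):

```
import torch
import torch.distributed.rpc as rpc

opts = rpc.TensorPipeRpcBackendOptions()
# Map this worker's cuda:0 to worker1's cuda:1 for tensors sent over RPC.
opts.set_device_map("worker1", {0: 1})
rpc.init_rpc(name="worker0", rank=0, world_size=2,
             backend=rpc.backend_registry.BackendType.TENSORPIPE,
             rpc_backend_options=opts)

# GPU tensors can now appear in RPC args; per the summary above, they are
# staged through CPU on the sender and restored on the mapped device.
ret = rpc.rpc_sync("worker1", torch.add, args=(torch.ones(2).cuda(0), 1))
```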

Test Plan: Imported from OSS

Reviewed By: izdeby

Differential Revision: D23011572

Pulled By: mrshenli

fbshipit-source-id: 62b617eed91237d4e9926bc8551db78b822a1187
2020-08-14 18:46:55 -07:00
Luca Wehrstedt
2393bab036 [TensorPipe] Update documentation (#40222)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40222

Mention the TensorPipe agent in the RPC docs and give users the information they need to choose which agent to use.
ghstack-source-id: 106225711

Test Plan: Export to GitHub, build locally and try out the docs.

Differential Revision: D22116494

fbshipit-source-id: 30703ba8410c40f64e785f60d71dfd9faa8de4a1
2020-06-19 04:26:49 -07:00
Luca Wehrstedt
7c9e78fdf5 [TensorPipe] Add options for agent, including backend killswitches (#40162)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40162

The only public option is `num_worker_threads`. The other ones are private (as indicated by the leading underscore; is that enough?) and allow specifying a different set and order of transports/channels. They can thus be used to disable a backend (by not specifying it) or to force one (by raising its priority), and can therefore be used to work around defective backends, in case we find any post-release.
ghstack-source-id: 106103238
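
A sketch of how these options might be set (the private `_transports`/`_channels` names follow the description above; the values are illustrative):

```
import torch.distributed.rpc as rpc

opts = rpc.TensorPipeRpcBackendOptions(
    num_worker_threads=16,   # the only public option
    _transports=["uv"],      # disable a transport (e.g. shm) by omitting it
    _channels=["basic"],     # force a channel by listing it first/alone
)
rpc.init_rpc(name="worker0", rank=0, world_size=2,
             backend=rpc.backend_registry.BackendType.TENSORPIPE,
             rpc_backend_options=opts)
```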

Test Plan: Built //caffe2:ifbpy and, using TensorPipe's verbose logging, verified that the transports/channels I specified were indeed the ones that were being registered.

Differential Revision: D22090661

fbshipit-source-id: 789bbe3bde4444cfa20c40276246e4ab67c50cd0
2020-06-18 02:54:17 -07:00
Pritam Damania
145df306ae Avoid using default process group in ProcessGroupAgent. (#39909)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39909

As described in https://github.com/pytorch/pytorch/issues/33583,
ProcessGroupAgent initializes the default process group, and this causes
issues if the user initializes the default process group themselves:
either the RPC initialization or the user's process group initialization
would fail.

To avoid this, I've changed ProcessGroupAgent init to create its own
ProcessGroupGloo and not use the default one at all.

Closes: https://github.com/pytorch/pytorch/issues/33583
ghstack-source-id: 105953303
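
An illustrative sketch of the failure mode being fixed (two-rank setup assumed; names are illustrative):

```
import torch.distributed as dist
import torch.distributed.rpc as rpc

# The user initializes the default process group for their own collectives...
dist.init_process_group("gloo", init_method="env://", rank=0, world_size=2)

# ...then brings up RPC. Previously, ProcessGroupAgent also tried to
# initialize the default group, so one of the two calls would fail. Now the
# agent creates its own ProcessGroupGloo and leaves the default group alone.
rpc.init_rpc(name="worker0", rank=0, world_size=2)
```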

Test Plan: waitforbuildbot

Differential Revision: D22011868

fbshipit-source-id: 7346a3fcb2821a0bc08e0bdc0625947abb5ae16f
2020-06-16 12:00:29 -07:00
Luca Wehrstedt
54046c1024 [TensorPipe] Implement join correctly (#38933)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38933

Based on what I could understand of how RPC shutdown operates and of what the ProcessGroup agent does, the join method is supposed to act as a barrier among all workers, waiting until they have all finished their pending work, including work that may be triggered by nested calls or by callbacks.

ghstack-source-id: 104760684

Test Plan: Before this diff, the `test_user_rrefs_confirmed` test of the RPC suite was flakily deadlocking. After this, I haven't been able to repro that.

Differential Revision: D21703020

fbshipit-source-id: 3d36c6544f1ba8e17ce27ef520ecfd30552045dd
2020-05-28 10:48:13 -07:00
Luca Wehrstedt
91f451a5e6 [TensorPipe] Do not require user to provide worker name-to-rank map (#38052)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38052

The initial version of the TensorPipe agent required the user to specify, on each worker, the full map between workers' names and their ids. However, it's enough for each worker to specify just its own name and id, as these can then be exchanged through the store.

Addresses #37784, although I think we can go further and use the store to also automatically assign ranks to workers, so that the user only needs to specify a name.
ghstack-source-id: 103741595

(Note: this ignores all push blocking failures!)

Test Plan:
On worker 0:
```
In [1]: import os
   ...: import torch
   ...: import torch.distributed.rpc as rpc
   ...: os.environ["MASTER_ADDR"] = "127.0.0.1"
   ...: os.environ["MASTER_PORT"] = "8765"

In [2]: rpc.init_rpc(name="foo", rank=0, backend=rpc.backend_registry.BackendType.TENSORPIPE, world_size=2)

In [3]: rpc.rpc_sync("bar", torch.add, args=(torch.full((2,2), 1), torch.full((2,2), 2)))
Out[3]:
tensor([[3., 3.],
        [3., 3.]])

In [4]: rpc.rpc_sync("bar", torch.add, args=(1, 2))
Out[4]: 3
```
On worker 1:
```
In [1]: import os
   ...: import torch
   ...: import torch.distributed.rpc as rpc
   ...: os.environ["MASTER_ADDR"] = "127.0.0.1"
   ...: os.environ["MASTER_PORT"] = "8765"

In [2]: rpc.init_rpc(name="bar", rank=1, backend=rpc.backend_registry.BackendType.TENSORPIPE, world_size=2)
```

Then also tested by adding `rpc_backend_options=rpc.TensorPipeRpcBackendOptions(init_method="file:///tmp/init/foo")` to `init_rpc`.

Differential Revision: D21463833

fbshipit-source-id: b53d7af6fc060789358ac845aa1898ddea6e8f31
2020-05-08 10:48:48 -07:00
Hongyi Jia
0549e1f384 [Tensorpipe/RPC] tensorpipe RPC agent (#35483)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35483

Implement the initial version of the TensorPipe RPC agent, and register it in the RPC registry to expose it to the Python interface. As a starter, it utilizes all available TensorPipe transports (shm, uv) and channels (basic, cma).

Test Plan:
https://our.intern.facebook.com/intern/diffusion/FBS/browse/master/fbcode/experimental/jiayisuse/tensorpipe_rpc
  export MASTER_ADDR=127.0.0.1
  export MASTER_PORT=28500
  buck build mode/dev-nosan mode/no-gpu //experimental/jiayisuse/tensorpipe_rpc:main
  ./buck-out/gen/experimental/jiayisuse/tensorpipe_rpc/main.par
  buck build mode/dev-nosan mode/no-gpu //experimental/jiayisuse/tensorpipe_rpc:benchmark
  ./buck-out/gen/experimental/jiayisuse/tensorpipe_rpc/benchmark.par

Multiple connections with async echo
  ./buck-out/gen/experimental/jiayisuse/tensorpipe_rpc/async_echo.par

Reviewed By: lw

Differential Revision: D20088366

fbshipit-source-id: 980f641af3321ca93583c62753e1c9174b7d4afc
2020-05-05 05:47:43 -07:00
Rohan Varma
c0a985fcd6 Allow customizing retryable message types in Faulty agent tests (#37450)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37450

We couldn't customize the retryable message types by passing
faulty_messages into dist_utils, because `FaultyRpcAgentTestFixture`
overrode the `rpc_backend_options` function and provided the default list of
retryable message types. This needed to be fixed as part of adding timeout
injection support, as mentioned in https://github.com/pytorch/pytorch/issues/36272
ghstack-source-id: 103287164

Test Plan: `buck test mode/dev-nosan //caffe2/test/distributed/rpc/faulty_agent:rpc_spawn_faulty -- --print-passing-details`

Differential Revision: D21270127

fbshipit-source-id: e5dd847dcf92f14b490f84e9ee79291698b85ffa
2020-05-01 12:00:36 -07:00
Rohan Varma
4ff4119d45 [rpc] Move _set_rpc_backend and RpcBackendOptions to use float instead of timedelta (#37027)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37027

After the change below, the RPC timeout passed into rpc_sync and
rpc_async is a float, so we should make these APIs consistent.
ghstack-source-id: 102971906
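
A before/after sketch of the type change (values illustrative):

```
from datetime import timedelta
import torch.distributed.rpc as rpc

# Before: rpc_timeout was a timedelta.
# opts = rpc.ProcessGroupRpcBackendOptions(rpc_timeout=timedelta(seconds=60))

# After: rpc_timeout is a float number of seconds, consistent with the
# timeout argument of rpc_sync and rpc_async.
opts = rpc.ProcessGroupRpcBackendOptions(rpc_timeout=60.0)
```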

Test Plan:
Existing unit tests; also added a unit test for a specific timeout set
in ProcessGroupRpcBackendOptions and for the dispatch of rpc backend options handling.

Differential Revision: D21125171

fbshipit-source-id: a5894b8ce31d2926f2c3d323d1cda4d54b30cef1
2020-04-27 19:38:06 -07:00
Shen Li
f1085a8e41 Improve ProcessGroup RpcBackendOptions Constructor API (#34081)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34081

Before this commit, applications had to do the following to configure
the number of threads in the ProcessGroup RPC backend:

```
op = ProcessGroupRpcBackendOptions()
op.rpc_timeout = rpc_timeout
op.init_method = init_method
op.num_send_recv_threads = 32
init_rpc(...., rpc_backend_options=op)
```

After this commit, it can be simplified to:

```
init_rpc(...., rpc_backend_options=ProcessGroupRpcBackendOptions(num_send_recv_threads=32))
```

Fixes #34075

Test Plan: Imported from OSS

Differential Revision: D20227344

Pulled By: mrshenli

fbshipit-source-id: def4318e987179b8c8ecca44d7ff935702c8a6e7
2020-03-03 16:43:29 -08:00
Rohan Varma
eb9b4b1f29 handle errors in ProcessGroupAgent::listenLoop(). (#32957)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32957

Closes https://github.com/pytorch/pytorch/issues/29703. If there is a
gloo timeout and `recvWork->wait()` times out in `listenLoop()`,
ProcessGroupAgent crashes since there is an unhandled exception in a
thread. This catches the exception and exits the listen loop. In a
follow-up diff, we will enhance these error conditions so that if users
attempt to send RPCs again, they are notified that the RPC agent was in
a bad state and has been shut down.

This PR also adds a new option, `processGroupTimeout`, to the PG agent's
backend options. This allows us to control the gloo timeout.
ghstack-source-id: 98236783
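
A sketch of the new knob; the Python-side spelling here is an assumption derived from the C++ name `processGroupTimeout` mentioned above:

```
from datetime import timedelta
import torch.distributed.rpc as rpc

opts = rpc.ProcessGroupRpcBackendOptions()
# Assumed attribute name; bounds how long recvWork->wait() in listenLoop()
# blocks before the gloo timeout fires and is now handled gracefully.
opts.process_group_timeout = timedelta(seconds=30)
```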

Test Plan: Added a unit test.

Differential Revision: D19678979

fbshipit-source-id: 3895ae754f407b84aca76c6ed3cb087d19178c40
2020-02-13 14:50:05 -08:00
Rohan Varma
5c6705e62c add default arg for init_method (#30208)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30208

Adds a default arg for init_method so users don't have to pass it in,
and moves it to the `RpcBackendOptions` struct. Removes the `init_method` arg from rpc.init_rpc. Also fixes some docs.
ghstack-source-id: 94500475
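
A sketch of the simplification (assuming the default init_method resolves via the usual env:// rendezvous):

```
import os
import torch.distributed.rpc as rpc

os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "29500"

# init_method no longer needs to be passed to init_rpc; it now lives on
# RpcBackendOptions and has a default.
rpc.init_rpc(name="worker0", rank=0, world_size=2)
```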

Test Plan: Unit tests pass.

Reviewed By: mrshenli

Differential Revision: D18630074

fbshipit-source-id: 04b7dd7ec96f4c4da311b71d250233f1f262135a
2019-11-25 14:52:48 -08:00
Rohan Varma
f41422121e default construct rpc agent options based on the backend type (#30201)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30201

Provide a default constructor so that users don't have to construct
RPC agent options themselves. Also rename this to `RpcBackendOptions` as suggested.
ghstack-source-id: 94411768
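
A sketch of what the default construction enables (backend spelling as used elsewhere in this log):

```
import torch.distributed.rpc as rpc

# Options matching the chosen backend are now default-constructed when the
# caller does not supply rpc_backend_options:
rpc.init_rpc(
    name="worker0",
    rank=0,
    world_size=2,
    backend=rpc.backend_registry.BackendType.PROCESS_GROUP,
)
```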

Test Plan: Unit tests pass.

Differential Revision: D18628698

fbshipit-source-id: 81fb45f124ad1006e628f6045162308093c9d446
2019-11-22 08:18:06 -08:00
Shen Li
fea963d3ae Fix BackendType repr in doc (#30243)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30243

Before this commit, the RPC docs showed init_rpc as the following:

```
torch.distributed.rpc.init_rpc(
   name,
   backend=<BackendType.PROCESS_GROUP: BackendValue(
     construct_rpc_agent_options_handler=<function _process_group_construct_rpc_agent_options_handler>,
     init_backend_handler=<function _process_group_init_backend_handler>)>,
   init_method=None,
   rank=-1,
   world_size=None,
   rpc_agent_options=None
)
```

It unnecessarily leaks implementation details. This commit adds a
`__repr__` function to the BackendType enum class to address this problem.

closes #29905
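
A minimal sketch of the kind of fix described (assumed, not the actual diff):

```
import enum

class BackendType(enum.Enum):
    PROCESS_GROUP = ...  # BackendValue(...)

    def __repr__(self):
        # Show just the member name instead of dumping handler internals.
        return "BackendType.{}".format(self.name)
```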

Test Plan: Imported from OSS

Differential Revision: D18641559

Pulled By: mrshenli

fbshipit-source-id: 19bf8a2d21c8207f026d097d8e3f077578d53106
2019-11-21 16:22:43 -08:00
Shihao Xu
80e3f17301 Resubmit "Add RpcAgentOptions struct type, which bundles different required arguments for different RpcAgents" (#30093)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30093

https://github.com/pytorch/pytorch/pull/28226 introduced the `worker_to_id` arg to the `init_rpc` function for other `RpcAgent`s, but it's not really used by `ProcessGroupAgent`. Cleanup is wanted for this, as described in https://github.com/pytorch/pytorch/issues/29031.

To accommodate the differences between `RpcAgent`s, this adds an `RpcAgentOptions` base class, which allows leveraging inheritance to add extra fields.
ghstack-source-id: 94197295
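
A sketch of the inheritance pattern described (field names are illustrative, drawn from options seen elsewhere in this log):

```
# Base options shared by all agents; subclasses add backend-specific fields.
class RpcAgentOptions:
    def __init__(self, rpc_timeout, init_method):
        self.rpc_timeout = rpc_timeout
        self.init_method = init_method

# A ProcessGroup-specific subclass can then add its own knobs.
class ProcessGroupAgentOptions(RpcAgentOptions):
    def __init__(self, num_send_recv_threads=4, **kwargs):
        super().__init__(**kwargs)
        self.num_send_recv_threads = num_send_recv_threads
```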

Test Plan:
### OSS RPC + RRef tests

```
buck test mode/dev-nosan //caffe2/test:rpc_fork
```

```
buck test mode/dev-nosan caffe2/torch/fb/distributed/thriftRpcBackend/test:thrift_rpc_fork_test -- test_sync_rpc
```

### Prototype RRef tests

```
buck test mode/dev-nosan caffe2/torch/fb/distributed/pytorch/tests:test_rpc
```

```
buck test mode/dev-nosan //caffe2/torch/fb/distributed/pytorch/tests:test_rpc_thrift_rpc_agent
```

### Dist autograd

```
buck test mode/dev-nosan caffe2/test:dist_autograd_fork
```

```
buck test mode/dev-nosan caffe2/torch/fb/distributed/thriftRpcBackend/test:thrift_dist_autograd_fork_test
```

Differential Revision: D18595578

fbshipit-source-id: 616fca3b844c171ed5277bbc6a2b1693bc3a8065
2019-11-19 18:52:30 -08:00
Edward Yang
1dda8186ae Revert D18549919: Add RpcAgentOptions struct type, which bundles different required arguments for different RpcAgents
Test Plan: revert-hammer

Differential Revision: D18549919

Original commit changeset: b9f3f1a41d1f

fbshipit-source-id: 2d5e578d18c0725b59eb99a0e942fbf7fe3341ee
2019-11-19 08:14:40 -08:00
Shihao Xu
21dc1d4543 Add RpcAgentOptions struct type, which bundles different required arguments for different RpcAgents (#29972)
Summary:
https://github.com/pytorch/pytorch/pull/28226 introduced the `worker_to_id` arg to the `init_rpc` function for other `RpcAgent`s, but it's not really used by `ProcessGroupAgent`. Cleanup is wanted for this, as described in https://github.com/pytorch/pytorch/issues/29031.

To accommodate the differences between `RpcAgent`s, this adds an `RpcAgentOptions` base class, which allows leveraging inheritance to add extra fields.

closes https://github.com/pytorch/pytorch/issues/29031
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29972

Differential Revision: D18549919

Pulled By: xush6528

fbshipit-source-id: b9f3f1a41d1ff18498734081870820b055d56f5b
2019-11-19 01:00:08 -08:00
Rohan Varma
639133d6d1 rename init_model_parallel to init_rpc (#29762)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29762

Rename this API as discussed, since its use cases extend beyond
model parallelism alone.
ghstack-source-id: 94020627

Test Plan: Unit tests pass

Differential Revision: D18491743

fbshipit-source-id: d07676bb14f072c64da0ce99ee818bcc582efc57
2019-11-18 06:07:44 -08:00
Rohan Varma
06ef4a757d Add docs for RPC, dist autograd, and RRef modules (#29276)
Summary:
Closes https://github.com/pytorch/pytorch/issues/28983. Documentation for the `torch.distributed.rpc` and `torch.distributed.autograd` modules. Also fixes/tidies up some of the docstrings in rpc/autograd, and makes some functions private so they don't show up in the documentation.

Note: much of the text describing/explaining the RPC/RRef layers is taken from the following RFCs: https://github.com/pytorch/pytorch/issues/23110, https://github.com/pytorch/pytorch/issues/26759
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29276

Differential Revision: D18478754

Pulled By: rohan-varma

fbshipit-source-id: e9a7089baf5275304e5408d319eb9bf98e53fff8
2019-11-14 14:32:03 -08:00
Shihao Xu
e66626ae5c Lift rpc_timeout to RpcAgent, for other RpcAgents to reuse. (#29341)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29341

So that other RpcAgents can use this timeout setting as well.

ghstack-source-id: 93481902

Differential Revision: D5681951

fbshipit-source-id: 569c768dc342e8a2d9faf142ceccf696e12e41dc
2019-11-07 17:05:45 -08:00
Pieter Noordhuis
b4df413712 Scope pybind11 functions to torch.distributed.{autograd,rpc}
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27529

Test Plan: Imported from OSS

Differential Revision: D17808209

Pulled By: pietern

fbshipit-source-id: 1e3e086085167320c3fc369467f5d75ce39fa4ea
2019-11-05 06:25:22 -08:00
Rohan Varma
fd0f9811ad add timeout for RPC futures, and ability to set timeout when initializing rpc (#28392)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28392

Per #25531, we want to clean up futures when we detect
failures/timeouts. As a first step, this diff adds timers to the future
object, provides functionality to check whether a future has timed out, and
allows specifying the timeout when initializing RPC. A future diff will check
for these timeouts and mark the future as completed with an exception
indicating that it timed out.
ghstack-source-id: 93192622
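
A sketch of setting the timeout at init time as described (class and argument spellings follow the options API seen elsewhere in this log; at this point the timeout was still a timedelta, before the float migration in #37027 above):

```
from datetime import timedelta
import torch.distributed.rpc as rpc

opts = rpc.ProcessGroupRpcBackendOptions()
# Futures created by rpc_async/rpc_sync record a start time so a later diff
# can complete them with a timeout exception once this duration elapses.
opts.rpc_timeout = timedelta(seconds=30)
rpc.init_rpc(name="worker0", rank=0, world_size=2, rpc_backend_options=opts)
```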

Test Plan: Added unit tests.

Differential Revision: D18025163

fbshipit-source-id: 195fb50c736caf5c7b2bada9a5f6116bb106ed33
2019-11-04 14:43:03 -08:00
Shihao Xu
8f1564b8ab Add enum type to rpc registry for consolidating RPC initialization code path (#28628)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28628

Consolidate the code paths for constructing ProcessGroupAgent and other RPC backends.
ghstack-source-id: 92845348

Differential Revision: D5516188

fbshipit-source-id: 151d9b7b74f68631d6673fecc74dec525949b8f0
2019-10-29 17:26:15 -07:00
Pieter Noordhuis
14f1629c4d Move RPC backend registry to torch.distributed.rpc
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27288

Test Plan: Imported from OSS

Differential Revision: D17808215

Pulled By: pietern

fbshipit-source-id: 489c031e02cd3141a861cf7ec2273aaa4c55b7d7
2019-10-08 11:31:16 -07:00