Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45353
Temporarily removing this feature; we will add it back after the branch cut.
Test Plan: Imported from OSS
Reviewed By: rohan-varma
Differential Revision: D23939865
Pulled By: mrshenli
fbshipit-source-id: 7dceaffea6b9a16512b5ba6036da73e7f8f83a8e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45065
To preserve backwards compatibility with applications that were passing in a ProcessGroupRpcBackendOptions but were not explicitly setting backend=BackendType.PROCESS_GROUP, we now infer the backend type from the options when only the options are passed. If neither is passed, we default to TensorPipe, as before this change.
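As a usage sketch of the intended behavior (the worker name, rank, world size, and init_method endpoint below are made up for illustration):
```
import torch.distributed.rpc as rpc

# Passing ProcessGroupRpcBackendOptions without an explicit backend:
# the backend type is now inferred to be BackendType.PROCESS_GROUP.
opts = rpc.ProcessGroupRpcBackendOptions(
    init_method="tcp://127.0.0.1:29500",  # example endpoint
    num_send_recv_threads=8,
)
rpc.init_rpc("worker0", rank=0, world_size=2, rpc_backend_options=opts)

# Passing neither a backend nor options keeps the TensorPipe default:
# rpc.init_rpc("worker0", rank=0, world_size=2)
```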
ghstack-source-id: 112586258
Test Plan: Added new unit tests.
Reviewed By: pritamdamania87
Differential Revision: D23814289
fbshipit-source-id: f4be7919e0817a4f539a50ab12216dc3178cb752
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45088
Fixes #45082
Found a few problems while working on #44983
1. We deliberately swallow RPC timeouts during shutdown, as we haven't
found a good way to handle those. When we converted `_wait_all_workers`
into `_all_gather`, the same logic was inherited. However, as
`_all_gather` is meant to be used in more general scenarios, we should
no longer stay silent about errors. This commit lets errors propagate
out of `_all_gather` and has `shutdown()` catch and log them (see the
sketch after this list).
2. After fixing (1), I found that `UnpickledPythonCall` needs to
acquire the GIL on destruction, and this can lead to a deadlock when
used in conjunction with `ProcessGroup`, because the `ProcessGroup`
ctor is a synchronization point that holds the GIL. In `init_rpc`,
followers (`rank != 0`) can exit before the leader (`rank == 0`). If
the two happen together, we can end up in the following state: a
follower exits `init_rpc` after running `_broadcast_to_followers` but
before reaching the dtor of `UnpickledPythonCall`; it then runs the
`ProcessGroup` ctor, which holds the GIL and waits for the leader to
join. Meanwhile, the leader is waiting for the response from
`_broadcast_to_followers`, which is blocked by the dtor of
`UnpickledPythonCall`, hence the deadlock. This commit drops the GIL
in the `ProcessGroup` ctor.
3. After fixing (2), I found that the `TensorPipe` backend
nondeterministically fails `test_local_shutdown`, for a reason similar
to (2), but this time because `shutdown()` on a follower runs before
the leader finishes `init_rpc`. This commit adds a join to the
`TensorPipe` backend's `init_rpc` after `_all_gather`.
The 3rd fix should solve the 2nd issue as well, but since I didn't see
a reason to hold the GIL during the `ProcessGroup` ctor, I made that
change too.
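For point (1), a minimal self-contained sketch of the error-handling shape described above; `_all_gather` here is a stand-in stub, not the real internal helper:
```
import logging

logger = logging.getLogger(__name__)

def _all_gather(obj):
    # Stand-in for the internal helper: after this change it raises
    # (e.g. a RuntimeError on an RPC timeout) instead of swallowing errors.
    raise RuntimeError("RPC timed out while waiting for all workers")

def shutdown_like_caller():
    # A shutdown()-style caller now catches and logs the error, while
    # general callers of _all_gather see it raised as usual.
    try:
        _all_gather(None)
    except RuntimeError as ex:
        logger.error("Encountered error while waiting for all workers: %s", ex)
```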
Test Plan: Imported from OSS
Reviewed By: pritamdamania87
Differential Revision: D23825592
Pulled By: mrshenli
fbshipit-source-id: 94920f2ad357746a6b8e4ffaa380dd56a7310976
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42637
This commit enables sending non-CPU tensors through RPC using the
TensorPipe backend. Users can configure device mappings by calling
set_map_location on `TensorPipeRpcBackendOptions`. Internally, the
`init_rpc` API verifies the correctness of the device mappings; it
shuts down RPC if the check fails, or proceeds and passes the global
mappings to `TensorPipeAgent` if the check succeeds. For serde, we
added a device-indices field to the TensorPipe read and write buffers,
which should be either empty (all tensors must be on CPU) or match
the tensors in the RPC message in order and number. This commit does
not yet avoid copies: the tensor is always moved to CPU on the sender
and then moved to the specified device on the receiver.
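As a usage sketch (worker names, device indices, and the exact signature of `set_map_location` are assumptions for illustration; the real argument format may differ):
```
import torch
import torch.distributed.rpc as rpc

options = rpc.TensorPipeRpcBackendOptions(num_worker_threads=8)
# Assumed mapping format: tensors on the local cuda:0 sent to "worker1"
# should be placed on its cuda:1.
options.set_map_location("worker1", {0: 1})

rpc.init_rpc(
    "worker0",
    rank=0,
    world_size=2,
    backend=rpc.backend_registry.BackendType.TENSORPIPE,
    rpc_backend_options=options,
)

# A CUDA tensor can now be passed in an RPC; per the note above it is still
# staged through CPU and re-placed on the mapped device on the receiver.
ret = rpc.rpc_sync("worker1", torch.add, args=(torch.ones(2, device="cuda:0"), 1))
```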
Test Plan: Imported from OSS
Reviewed By: izdeby
Differential Revision: D23011572
Pulled By: mrshenli
fbshipit-source-id: 62b617eed91237d4e9926bc8551db78b822a1187
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40222
Mention the TensorPipe agent in the RPC docs and give users the information they need to choose which agent to use.
ghstack-source-id: 106225711
Test Plan: Export to GitHub, build locally and try out the docs.
Differential Revision: D22116494
fbshipit-source-id: 30703ba8410c40f64e785f60d71dfd9faa8de4a1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40162
The only public option is `num_worker_threads`. The other ones are private (as indicated by the leading underscore; is that enough?) and allow specifying a different set and order of transports/channels. They can thus be used to disable a backend (by not listing it) or to force one (by raising its priority), and therefore to work around defective backends, in case we find any post-release.
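For illustration, a hedged sketch of how such overrides might be passed; the private keyword names (`_transports`, `_channels`) and the transport/channel strings are assumptions based on the leading-underscore convention mentioned above, not a documented API:
```
import torch.distributed.rpc as rpc

# Public option: size the worker thread pool.
options = rpc.TensorPipeRpcBackendOptions(num_worker_threads=16)

# Hypothetical private overrides: restrict TensorPipe to the "uv" transport
# and the "basic" channel, disabling the other backends by omission.
options_pinned = rpc.TensorPipeRpcBackendOptions(
    num_worker_threads=16,
    _transports=["uv"],
    _channels=["basic"],
)
```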
ghstack-source-id: 106103238
Test Plan: Built //caffe2:ifbpy and, using TensorPipe's verbose logging, verified that the transports/channels I specified were indeed the ones that were being registered.
Differential Revision: D22090661
fbshipit-source-id: 789bbe3bde4444cfa20c40276246e4ab67c50cd0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39909
As described in https://github.com/pytorch/pytorch/issues/33583,
ProcessGroupAgent initializes the default process group, and this causes issues
if the user initializes the default process group themselves: either the RPC
initialization or the user's process group initialization would fail.
To avoid this, I've changed ProcessGroupAgent init to create its own
ProcessGroupGloo and not use the default one at all.
Closes: https://github.com/pytorch/pytorch/issues/33583
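As a sketch of the pattern this unblocks (endpoints, ports, and worker names are made up; shown for a single rank out of two), both initializations can now coexist in the same process:
```
import torch.distributed as dist
import torch.distributed.rpc as rpc

# User-created default process group, with its own rendezvous endpoint.
dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29500", rank=0, world_size=2)

# RPC initialization no longer conflicts with it, since ProcessGroupAgent now
# creates its own ProcessGroupGloo instead of touching the default group.
rpc.init_rpc(
    "worker0",
    rank=0,
    world_size=2,
    rpc_backend_options=rpc.ProcessGroupRpcBackendOptions(
        init_method="tcp://127.0.0.1:29501",  # separate rendezvous for RPC
    ),
)

rpc.shutdown()
dist.destroy_process_group()
```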
ghstack-source-id: 105953303
Test Plan: waitforbuildbot
Differential Revision: D22011868
fbshipit-source-id: 7346a3fcb2821a0bc08e0bdc0625947abb5ae16f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38933
Based on what I could understand of how RPC shutdown operates and of what the ProcessGroup agent does, the join method is supposed to act as a barrier among all workers, waiting until they have all finished their pending work, including work that may be triggered by nested calls or by callbacks.
ghstack-source-id: 104760684
Test Plan: Before this diff, the `test_user_rrefs_confirmed` test of the RPC suite was flakily deadlocking. After this, I haven't been able to repro that.
Differential Revision: D21703020
fbshipit-source-id: 3d36c6544f1ba8e17ce27ef520ecfd30552045dd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38052
The initial version of the TensorPipe agent required the user to specify, on each worker, the full map between workers' names and their ids. However, it's enough for each worker to specify just its own name and id, as these can then be exchanged using the store.
Addresses #37784, although I think we can go further and use the store to also automatically assign ranks to workers, so that the user only needs to specify a name.
ghstack-source-id: 103741595
(Note: this ignores all push blocking failures!)
Test Plan:
On worker 0:
```
In [1]: import os
...: import torch
...: import torch.distributed.rpc as rpc
...: os.environ["MASTER_ADDR"] = "127.0.0.1"
...: os.environ["MASTER_PORT"] = "8765"
In [2]: rpc.init_rpc(name="foo", rank=0, backend=rpc.backend_registry.BackendType.TENSORPIPE, world_size=2)
In [3]: rpc.rpc_sync("bar", torch.add, args=(torch.full((2,2), 1), torch.full((2,2), 2)))
Out[3]:
tensor([[3., 3.],
        [3., 3.]])
In [4]: rpc.rpc_sync("bar", torch.add, args=(1, 2))
Out[4]: 3
```
On worker 1:
```
In [1]: import os
...: import torch
...: import torch.distributed.rpc as rpc
...: os.environ["MASTER_ADDR"] = "127.0.0.1"
...: os.environ["MASTER_PORT"] = "8765"
In [2]: rpc.init_rpc(name="bar", rank=1, backend=rpc.backend_registry.BackendType.TENSORPIPE, world_size=2)
```
Then also tested by adding `rpc_backend_options=rpc.TensorPipeRpcBackendOptions(init_method="file:///tmp/init/foo")` to `init_rpc`.
Differential Revision: D21463833
fbshipit-source-id: b53d7af6fc060789358ac845aa1898ddea6e8f31
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35483
Implement the initial version of the TensorPipe RPC agent, and register it with the RPC registry to expose it to the Python interface. As a starter, it utilizes all available TensorPipe transports (shm, uv) and channels (basic, cma).
Test Plan:
https://our.intern.facebook.com/intern/diffusion/FBS/browse/master/fbcode/experimental/jiayisuse/tensorpipe_rpc
```
export MASTER_ADDR=127.0.0.1
export MASTER_PORT=28500
buck build mode/dev-nosan mode/no-gpu //experimental/jiayisuse/tensorpipe_rpc:main
./buck-out/gen/experimental/jiayisuse/tensorpipe_rpc/main.par
buck build mode/dev-nosan mode/no-gpu //experimental/jiayisuse/tensorpipe_rpc:benchmark
./buck-out/gen/experimental/jiayisuse/tensorpipe_rpc/benchmark.par
```
Multiple connections with async echo:
```
./buck-out/gen/experimental/jiayisuse/tensorpipe_rpc/async_echo.par
```
Reviewed By: lw
Differential Revision: D20088366
fbshipit-source-id: 980f641af3321ca93583c62753e1c9174b7d4afc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37450
It doesn't seem like we could customize the retryable message types by
passing `faulty_messages` into dist_utils, as the `FaultyRpcAgentTestFixture`
overrode the `rpc_backend_options` function and provided the default list of
retryable message types. This needed to be fixed as part of adding timeout
injection support, as mentioned in https://github.com/pytorch/pytorch/issues/36272.
ghstack-source-id: 103287164
Test Plan: `buck test mode/dev-nosan //caffe2/test/distributed/rpc/faulty_agent:rpc_spawn_faulty -- --print-passing-details`
Differential Revision: D21270127
fbshipit-source-id: e5dd847dcf92f14b490f84e9ee79291698b85ffa
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37027
The RPC timeout passed into rpc_sync and rpc_async is now a float after the
change below in this stack, so we should make these APIs consistent.
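As a usage sketch of the now-consistent float timeouts (worker name and values are illustrative):
```
import torch
import torch.distributed.rpc as rpc

# Per-call timeouts, in seconds, as floats.
ret = rpc.rpc_sync("worker1", torch.add, args=(torch.ones(2), 1), timeout=5.0)
fut = rpc.rpc_async("worker1", torch.add, args=(torch.ones(2), 1), timeout=0.5)

# Backend-wide default timeout, also a float in seconds.
opts = rpc.ProcessGroupRpcBackendOptions(rpc_timeout=30.0)
```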
ghstack-source-id: 102971906
Test Plan:
Existing unit tests; also added a unit test that sets a specific timeout
in ProcessGroupRpcBackendOptions and exercises the rpc backend options dispatch handling.
Differential Revision: D21125171
fbshipit-source-id: a5894b8ce31d2926f2c3d323d1cda4d54b30cef1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34081
Before this commit, applications had to do the following to configure the
number of threads in the ProcessGroup RPC backend:
```
op = ProcessGroupRpcBackendOptions()
op.rpc_timeout = rpc_timeout
op.init_method = init_method
op.num_send_recv_threads = 32
init_rpc(...., rpc_backend_options=op)
```
After this commit, it can be simplified to:
```
init_rpc(...., rpc_backend_options=ProcessGroupRpcBackendOptions(num_send_recv_threads=32))
```
Fixes #34075
Test Plan: Imported from OSS
Differential Revision: D20227344
Pulled By: mrshenli
fbshipit-source-id: def4318e987179b8c8ecca44d7ff935702c8a6e7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32957
Closes https://github.com/pytorch/pytorch/issues/29703. If there is a
gloo timeout and `recvWork->wait()` times out in `listenLoop()`,
ProcessGroupAgent crashes since there is an unhandled exception in a thread.
This commit catches the exception and exits the listen loop. In a follow-up
diff, we will enhance these error conditions so that if users attempt to send
RPCs again, they are notified that the RPC agent was in a bad state and was
shut down.
This PR also adds a new option, `processGroupTimeout`, to the PG agent's
backend options, which allows us to control the gloo timeout.
ghstack-source-id: 98236783
Test Plan: Added a unit test.
Differential Revision: D19678979
fbshipit-source-id: 3895ae754f407b84aca76c6ed3cb087d19178c40
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30208
Adds a default value for `init_method` so users don't have to pass it in,
and moves it to the `RpcBackendOptions` struct. Removes the `init_method` arg from `rpc.init_rpc`. Also fixes some docs.
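As a usage sketch of the simplified call, assuming the default rendezvous falls back to the env:// style used elsewhere in these test plans (so MASTER_ADDR/MASTER_PORT must be set):
```
import os
import torch.distributed.rpc as rpc

os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "29500"

# init_method no longer needs to be passed to init_rpc; to override it,
# set it on the backend options instead.
rpc.init_rpc("worker0", rank=0, world_size=2)
rpc.shutdown()
```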
ghstack-source-id: 94500475
Test Plan: Unit tests pass.
Reviewed By: mrshenli
Differential Revision: D18630074
fbshipit-source-id: 04b7dd7ec96f4c4da311b71d250233f1f262135a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30201
Provide a default constructor so that users don't have to construct the
RPC agent options themselves. Also rename this to `RpcBackendOptions`, as suggested.
ghstack-source-id: 94411768
Test Plan: Unit tests pass.
Differential Revision: D18628698
fbshipit-source-id: 81fb45f124ad1006e628f6045162308093c9d446
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30243
Before this commit, the rpc docs showed init_rpc as the following:
```
torch.distributed.rpc.init_rpc(
    name,
    backend=<BackendType.PROCESS_GROUP: BackendValue(
        construct_rpc_agent_options_handler=<function _process_group_construct_rpc_agent_options_handler>,
        init_backend_handler=<function _process_group_init_backend_handler>)>,
    init_method=None,
    rank=-1,
    world_size=None,
    rpc_agent_options=None
)
```
It unnecessarily leaks implementation details. This commit adds a
`__repr__` function to the BackendType Enum class to address this problem.
Closes #29905
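A minimal sketch of the technique (the enum member, its value, and the repr format below are illustrative, not the exact ones in the commit):
```
from collections import namedtuple
from enum import Enum

BackendValue = namedtuple(
    "BackendValue", ["construct_rpc_backend_options_handler", "init_backend_handler"]
)

class BackendType(Enum):
    PROCESS_GROUP = BackendValue(lambda *a, **kw: None, lambda *a, **kw: None)

    def __repr__(self):
        # Show a short, stable name instead of dumping the handler functions.
        return f"BackendType.{self.name}"

# repr(BackendType.PROCESS_GROUP) -> "BackendType.PROCESS_GROUP",
# which is what now shows up in the init_rpc signature in the docs.
```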
Test Plan: Imported from OSS
Differential Revision: D18641559
Pulled By: mrshenli
fbshipit-source-id: 19bf8a2d21c8207f026d097d8e3f077578d53106
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30093
https://github.com/pytorch/pytorch/pull/28226 introduced the `worker_to_id` arg to the `init_rpc` function for other `RpcAgent`s, while it's not really used by `ProcessGroupAgent`. Cleanup is wanted for this, as described in https://github.com/pytorch/pytorch/issues/29031.
To accommodate the differences between `RpcAgent`s, this adds an `RpcAgentOptions` base class, which allows leveraging inheritance to add extra fields.
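A minimal sketch of the inheritance pattern described above, written as plain Python dataclasses rather than the actual pybind-backed classes; names and fields are illustrative:
```
from dataclasses import dataclass

@dataclass
class RpcAgentOptions:
    # Fields shared by every agent.
    rpc_timeout_seconds: float = 60.0
    init_method: str = "env://"

@dataclass
class ProcessGroupAgentOptions(RpcAgentOptions):
    # Agent-specific extras live on the subclass, so init_rpc can accept a
    # single options argument regardless of which agent is selected.
    num_send_recv_threads: int = 4

opts = ProcessGroupAgentOptions(num_send_recv_threads=16)
```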
ghstack-source-id: 94197295
Test Plan:
### OSS RPC + RRef tests
```
buck test mode/dev-nosan //caffe2/test:rpc_fork
```
```
buck test mode/dev-nosan caffe2/torch/fb/distributed/thriftRpcBackend/test:thrift_rpc_fork_test -- test_sync_rpc
```
### Prototype RRef tests
```
buck test mode/dev-nosan caffe2/torch/fb/distributed/pytorch/tests:test_rpc
```
```
buck test mode/dev-nosan //caffe2/torch/fb/distributed/pytorch/tests:test_rpc_thrift_rpc_agent
```
### Dist autograd
```
buck test mode/dev-nosan caffe2/test:dist_autograd_fork
```
```
buck test mode/dev-nosan caffe2/torch/fb/distributed/thriftRpcBackend/test:thrift_dist_autograd_fork_test
```
Differential Revision: D18595578
fbshipit-source-id: 616fca3b844c171ed5277bbc6a2b1693bc3a8065
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29762
Rename this API as discussed, since its use cases extend beyond
model parallelism.
ghstack-source-id: 94020627
Test Plan: Unit tests pass
Differential Revision: D18491743
fbshipit-source-id: d07676bb14f072c64da0ce99ee818bcc582efc57
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29341
So that other `RpcAgent`s can use this timeout setting as well.
ghstack-source-id: 93481902
Differential Revision: D5681951
fbshipit-source-id: 569c768dc342e8a2d9faf142ceccf696e12e41dc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28392
Per #25531, we want to clean up futures when we detect failures or timeouts.
As a first step, this diff adds timers to the future object, provides
functionality to check whether a future has timed out, and allows
specifying the timeout when initializing RPC. A future diff will check for these timeouts and mark the future as completed with an exception indicating that it has timed out.
ghstack-source-id: 93192622
Test Plan: Added unit tests.
Differential Revision: D18025163
fbshipit-source-id: 195fb50c736caf5c7b2bada9a5f6116bb106ed33