Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30330
This is now possible due to previous changes made in `gloo` and `ProcessGroupGloo`. We `abort` the listener thread that is waiting for a message, and join all other threads. The API is changed so that the previous `wait_all_workers` does not destroy the agent, and this is now done in a new `shutdown` method. All callsites are updated appropriately.
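The abort-and-join sequence can be sketched in pure Python, with a queue sentinel standing in for the gloo receive that the real agent aborts (MiniAgent and its members are illustrative names, not the actual ProcessGroupAgent internals):

```python
import threading
import queue

_ABORT = object()  # sentinel playing the role of the gloo abort

class MiniAgent:
    def __init__(self):
        self._inbox = queue.Queue()
        self._listener = threading.Thread(target=self._listen)
        self._listener.start()

    def _listen(self):
        while True:
            msg = self._inbox.get()  # blocks, like the recv being aborted
            if msg is _ABORT:
                return               # the abort unblocks us and exits the loop
            # ... handle msg ...

    def shutdown(self):
        # "abort" the blocking wait, then join the listener thread
        self._inbox.put(_ABORT)
        self._listener.join()

agent = MiniAgent()
agent.shutdown()
```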
ghstack-source-id: 94673884
Test Plan: Unit tests pass.
Reviewed By: mrshenli
Differential Revision: D18661775
fbshipit-source-id: 5aaa7c14603e18253394224994f6cd43234301c2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30217
Before this commit, RRefContext throws an error if it detects any
RRef leak during shutdown. However, this requires applications to
make sure that they have freed all references to RRefs in application
code, which can make for a bad debugging experience in large
applications. Besides, this also relies on Python GC to free things
up in time, which might not always happen. After this commit,
RRefContext ignores leaked RRefs during shutdown, as shutdown
is called when the application has finished training and no longer
cares about local state. Hence, it should be OK to just ignore
those leaks and destroy the OwnerRRefs. If an application would like to
enforce that there are no leaks, set torch.distributed.rpc.api._ignore_rref_leak
to False.
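A minimal sketch of the toggle's behavior; LeakChecker is a made-up stand-in (the real flag lives at torch.distributed.rpc.api._ignore_rref_leak):

```python
class LeakChecker:
    def __init__(self, ignore_rref_leak=True):
        self.ignore_rref_leak = ignore_rref_leak
        self.live_rrefs = set()  # OwnerRRefs still referenced at shutdown

    def shutdown(self):
        leaked = set(self.live_rrefs)
        if leaked and not self.ignore_rref_leak:
            # strict mode: surface the leak to the application
            raise RuntimeError(f"Leaked {len(leaked)} RRefs at shutdown")
        # default: ignore the leaks and destroy the OwnerRRefs
        self.live_rrefs.clear()

checker = LeakChecker()                 # default: ignore leaks
checker.live_rrefs.add("OwnerRRef(0)")  # simulate a leaked RRef
checker.shutdown()                      # no error; leaked RRef destroyed
```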
Test Plan: Imported from OSS
Differential Revision: D18632546
Pulled By: mrshenli
fbshipit-source-id: 2744b2401dafdd16de0e0a76cf8e07777bed0f38
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30208
Adds a default arg for init_method so users don't have to pass it in,
and moves it to the `RpcBackendOptions` struct. Removes the `init_method` arg from `rpc.init_rpc`. Also fixes some docs.
ghstack-source-id: 94500475
Test Plan: Unit tests pass.
Reviewed By: mrshenli
Differential Revision: D18630074
fbshipit-source-id: 04b7dd7ec96f4c4da311b71d250233f1f262135a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30241
We need an API to get all worker infos. This will be used by backend-agnostic `rpc.wait_all_workers()` API.
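A hypothetical sketch of how a backend-agnostic `wait_all_workers()` might consume the new API; MockAgent and the ping callback are illustrative stand-ins, not the real agent interface:

```python
class MockAgent:
    """Stand-in for an RpcAgent exposing get_worker_infos()."""
    def __init__(self, names):
        self._infos = [{"name": n, "id": i} for i, n in enumerate(names)]

    def get_worker_infos(self):
        return list(self._infos)

def wait_all_workers(agent, ping):
    # one synchronization round-trip per known worker, regardless of backend
    return [ping(info["name"]) for info in agent.get_worker_infos()]

agent = MockAgent(["worker0", "worker1"])
acks = wait_all_workers(agent, lambda name: f"ack:{name}")
```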
ghstack-source-id: 94454935
Test Plan:
# Unit tests
```
buck test mode/dev-nosan //caffe2/test:rpc_fork -- test_get_worker_infos
buck-out/gen/caffe2/test/rpc_fork\#binary.par -r test_get_worker_infos
```
```
buck test mode/dev-nosan //caffe2/test:rpc_fork_thrift -- test_get_worker_infos
buck-out/gen/caffe2/test/rpc_fork_thrift\#binary.par -r test_get_worker_infos
```
Differential Revision: D5693412
fbshipit-source-id: 5123c8248b6d44fd36b8a5f381dbabb2660e6f0f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30020
This is now possible due to previous changes made in `gloo` and `ProcessGroupGloo`. We `abort` the listener thread that is waiting for a message, and join all other threads. The destructor calls this same `localShutdown` method, but we ensure this is not called multiple times.
ghstack-source-id: 94415336
Test Plan: Unit tests pass.
Differential Revision: D5578006
fbshipit-source-id: 6258879fb44c9fca97fdfad64468c1488c16ac02
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30201
Provide a default constructor so that users don't have to construct
RPC agent options. Also rename this to `RpcBackendOptions` as suggested.
ghstack-source-id: 94411768
Test Plan: Unit tests pass.
Differential Revision: D18628698
fbshipit-source-id: 81fb45f124ad1006e628f6045162308093c9d446
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30261
With #29827, the flakiness should disappear for test_call_method_on_rref
Test Plan: Imported from OSS
Differential Revision: D18645036
Pulled By: mrshenli
fbshipit-source-id: 44d759062fc78b1a797266096dbb4ddd104f07eb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30050
Renames this API to wait_all_workers as discussed.
ghstack-source-id: 94273005
Test Plan: Unit tests pass
Differential Revision: D18581466
fbshipit-source-id: 4ff5d5fb2d528f17252d5b5f30c3047d2efb92bf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29930
Right now, rethrowing of remote exceptions from Python calls is coupled with deserialization.
For an owner RRef, setValue() and getValue() do not use serialization and deserialization, so when a user creates a ref to itself and calls ownerRef.to_here(), the remote exception from the Python call is not rethrown.
This diff moves the remote exception rethrow out of deserialization, so the exception can be handled for both ownerRef.localValue() and ownerRef.to_here().
Closes #29924
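The decoupling can be sketched like this: both the serde path (to_here) and the non-serde path (local value access) funnel through one rethrow helper. All names here are illustrative Python stand-ins, not the real implementation:

```python
class RemoteException(Exception):
    """Stand-in for an exception captured on the remote side."""

def _rethrow_if_exception(value):
    # rethrow lives here, independent of how the value arrived
    if isinstance(value, RemoteException):
        raise value
    return value

class OwnerRRef:
    def __init__(self, value):
        self._value = value

    def local_value(self):
        # no serialization on this path, but errors still surface
        return _rethrow_if_exception(self._value)

    def to_here(self):
        # serde round-trip elided; the same rethrow applies
        return _rethrow_if_exception(self._value)

ok = OwnerRRef(7)
bad = OwnerRRef(RemoteException("error on remote worker"))
```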
ghstack-source-id: 94210894
Test Plan: unit tests
Differential Revision: D18541916
fbshipit-source-id: 7cda93f623d52c740b3c1b1fa9a442f866984340
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30093
https://github.com/pytorch/pytorch/pull/28226 introduced the `worker_to_id` arg to the `init_rpc` function for other `RpcAgent`s, but it is not actually used by `ProcessGroupAgent`. Cleanup is wanted for this, as described in https://github.com/pytorch/pytorch/issues/29031.
To accommodate the differences between `RpcAgent`s, add an `RpcAgentOptions` base class, which allows leveraging inheritance to add extra fields.
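The inheritance pattern can be sketched with dataclasses; the field names below (including num_send_recv_threads and the "env://" default) are illustrative assumptions rather than the exact struct layout:

```python
from dataclasses import dataclass

@dataclass
class RpcAgentOptions:
    # fields common to every agent; defaults are assumed for illustration
    rpc_timeout: float = 60.0
    init_method: str = "env://"

@dataclass
class ProcessGroupRpcAgentOptions(RpcAgentOptions):
    # agent-specific extras live on the subclass
    num_send_recv_threads: int = 4

opts = ProcessGroupRpcAgentOptions(num_send_recv_threads=8)
```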
ghstack-source-id: 94197295
Test Plan:
### OSS RPC + RRef tests
```
buck test mode/dev-nosan //caffe2/test:rpc_fork
```
```
buck test mode/dev-nosan caffe2/torch/fb/distributed/thriftRpcBackend/test:thrift_rpc_fork_test -- test_sync_rpc
```
### Prototype RRef tests
```
buck test mode/dev-nosan caffe2/torch/fb/distributed/pytorch/tests:test_rpc
```
```
buck test mode/dev-nosan //caffe2/torch/fb/distributed/pytorch/tests:test_rpc_thrift_rpc_agent
```
### Dist autograd
```
buck test mode/dev-nosan caffe2/test:dist_autograd_fork
```
```
buck test mode/dev-nosan caffe2/torch/fb/distributed/thriftRpcBackend/test:thrift_dist_autograd_fork_test
```
Differential Revision: D18595578
fbshipit-source-id: 616fca3b844c171ed5277bbc6a2b1693bc3a8065
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30092
There is duplicated code across components that rely on RpcAgent. Extract it into a reusable test fixture class.
ghstack-source-id: 94196891
Test Plan:
### RPC + RRef
```
buck test mode/dev-nosan //caffe2/test:rpc_fork
buck test mode/dev-nosan //caffe2/test:rpc_spawn
```
```
buck test mode/dev-nosan //caffe2/test:rpc_fork_thrift
buck test mode/dev-nosan //caffe2/test:rpc_spawn_thrift
```
### Dist Autograd
```
buck test mode/dev-nosan //caffe2/test:dist_autograd_fork
buck test mode/dev-nosan //caffe2/test:dist_autograd_spawn
```
```
buck test mode/dev-nosan //caffe2/test:dist_autograd_fork_thrift
buck test mode/dev-nosan //caffe2/test:dist_autograd_spawn_thrift
```
### Dist Optimizer
```
buck test mode/dev-nosan //caffe2/test:dist_optimizer_fork
buck test mode/dev-nosan //caffe2/test:dist_optimizer_spawn
```
```
buck test mode/dev-nosan //caffe2/test:dist_optimizer_fork_thrift
buck test mode/dev-nosan //caffe2/test:dist_optimizer_spawn_thrift
```
Differential Revision: D18595408
fbshipit-source-id: 8360759c63e838fb19d4eb1aeacca0bf8eb4b55f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30100
Since, after #29827, we only test RPC using spawn, the
multi-thread/fork error should disappear.
Test Plan: Imported from OSS
Differential Revision: D18597002
Pulled By: mrshenli
fbshipit-source-id: 64aa6a59248e5d1b7e1ad1aebffb6a25248388d2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30099
Since, after #29827, we only test RPC using spawn, the
multi-thread/fork error should disappear.
Test Plan: Imported from OSS
Differential Revision: D18597003
Pulled By: mrshenli
fbshipit-source-id: ebfb1f6f3f961d98351e06ce4b951793a9b95398
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30098
Since, after #29827, we only test RPC using spawn, the
multi-thread/fork error should disappear.
Test Plan: Imported from OSS
Differential Revision: D18597001
Pulled By: mrshenli
fbshipit-source-id: 68256289085fac1a9ca76d5b4882e97e2f81d1f4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29601
Follow up from https://github.com/pytorch/pytorch/pull/28392. Adds a background thread to `ProcessGroupAgent` that polls for timed out RPCs at a pre-set interval, and marks them as completed with a timeout exception if they have timed out. Also deletes the futures from the corresponding maps `futures_` and `futureTimeouts`. Unit tests are added to ensure that timed out RPCs are appropriately cleaned up.
Also adds a `shutdown` variable to process group agent to control the shutting down of this background thread, which can eventually be extended to use for controlling a clean shutdown of process group agent.
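A pure-Python sketch of that watchdog: a background thread polls at a fixed interval, marks expired futures with a timeout error, deletes them from the tracking map, and exits when the shutdown flag is set. All names are illustrative, not the real C++ members:

```python
import threading
import time

class FutureMap:
    def __init__(self, poll_interval=0.01):
        self._lock = threading.Lock()
        self._futures = {}  # future id -> state holder
        self._shutdown = False
        self._watchdog = threading.Thread(target=self._poll, args=(poll_interval,))
        self._watchdog.start()

    def add(self, fid, timeout):
        with self._lock:
            self._futures[fid] = {"deadline": time.monotonic() + timeout, "error": None}
            return self._futures[fid]

    def _poll(self, interval):
        while not self._shutdown:
            now = time.monotonic()
            with self._lock:
                expired = [f for f, s in self._futures.items() if s["deadline"] <= now]
                for fid in expired:
                    # mark completed with a timeout exception, then drop it
                    self._futures[fid]["error"] = TimeoutError(f"RPC {fid} timed out")
                    del self._futures[fid]
            time.sleep(interval)

    def shutdown(self):
        # the shutdown flag also stops the watchdog cleanly
        self._shutdown = True
        self._watchdog.join()

fm = FutureMap()
holder = fm.add("f1", timeout=0.05)
time.sleep(0.2)  # let the deadline pass and the watchdog run
fm.shutdown()
```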
ghstack-source-id: 94175131
Test Plan: Added unit tests
Differential Revision: D18434215
fbshipit-source-id: c48abdb8759fe1447200ec66bb9d4b1c50ec4535
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29958
DistributedOptimizer relies on hashing WorkerInfo in order to coalesce fan-out RPCs. This will likely be a very common use case (EASGD will do the same, for example).
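The coalescing idea can be sketched without torch at all: group per-parameter RPCs by their owner's WorkerInfo, which requires WorkerInfo to be hashable. The namedtuple below is a stand-in for the real class:

```python
from collections import defaultdict, namedtuple

# hashable by construction, standing in for rpc.WorkerInfo
WorkerInfo = namedtuple("WorkerInfo", ["name", "id"])

def coalesce_by_owner(param_rrefs):
    """Group (owner, parameter) pairs into one bucket per owner worker."""
    grouped = defaultdict(list)
    for owner, param in param_rrefs:
        grouped[owner].append(param)  # hashing WorkerInfo happens here
    return grouped

w0, w1 = WorkerInfo("trainer0", 0), WorkerInfo("ps1", 1)
groups = coalesce_by_owner([(w0, "p0"), (w1, "p1"), (w0, "p2")])
```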
ghstack-source-id: 94169198
Test Plan: unit test.
Differential Revision: D18548257
fbshipit-source-id: 7d67d4e1b9bc60403c372164982a75ae8c1d8389
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30033
Removing this API for now since we don't have a concrete use case for
it yet, and exposing it as a public API might result in users
depending on it.
We can always add some variant of this API back if needed later.
ghstack-source-id: 94138302
Test Plan: waitforbuildbot
Differential Revision: D18578056
fbshipit-source-id: 078c62331725e03bd5702624afc16b1cdcdf26a4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29747
There is duplicated code across components that rely on RpcAgent. Extract it into a reusable test fixture class.
Test Plan:
### RPC + RRef
```
buck test mode/dev-nosan //caffe2/test:rpc_fork
buck test mode/dev-nosan //caffe2/test:rpc_spawn
```
```
buck test mode/dev-nosan //caffe2/test:rpc_fork_thrift
buck test mode/dev-nosan //caffe2/test:rpc_spawn_thrift
```
### Dist Autograd
```
buck test mode/dev-nosan //caffe2/test:dist_autograd_fork
buck test mode/dev-nosan //caffe2/test:dist_autograd_spawn
```
```
buck test mode/dev-nosan //caffe2/test:dist_autograd_fork_thrift
buck test mode/dev-nosan //caffe2/test:dist_autograd_spawn_thrift
```
### Dist Optimizer
```
buck test mode/dev-nosan //caffe2/test:dist_optimizer_fork
buck test mode/dev-nosan //caffe2/test:dist_optimizer_spawn
```
```
buck test mode/dev-nosan //caffe2/test:dist_optimizer_fork_thrift
buck test mode/dev-nosan //caffe2/test:dist_optimizer_spawn_thrift
```
Differential Revision: D5689636
fbshipit-source-id: f35eea1359addaaac9bd8d00d0a5df228a236511
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29762
Rename this API as discussed, since its use cases extend beyond
model parallelism.
ghstack-source-id: 94020627
Test Plan: Unit tests pass
Differential Revision: D18491743
fbshipit-source-id: d07676bb14f072c64da0ce99ee818bcc582efc57
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29634
This implementation supports rpc.remote calls to self by doing the
following steps:
1. Create an owner RRef.
2. Add the owner RRef to owners_ in RRefContext, and keep it alive
by using the RRefId as the ForkId.
3. Go through serde and insert the message into the caller's thread pool.
4. When the response message gets processed, remove the RRef from
the RRef fork map.
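The bookkeeping in those steps can be sketched in plain Python (the class and field names mirror the description but are illustrative, not the real C++ RRefContext):

```python
class RRefContext:
    def __init__(self):
        self.owners = {}  # rref_id -> owned value
        self.forks = {}   # rref_id -> set of fork ids keeping it alive

    def remote_to_self(self, rref_id, func, *args):
        # Steps 1-2: create the owner RRef and keep it alive by using
        # the RRefId itself as the ForkId.
        self.owners[rref_id] = None
        self.forks.setdefault(rref_id, set()).add(rref_id)
        # Step 3: serde and self-enqueue elided; run the call locally.
        self.owners[rref_id] = func(*args)
        # Step 4: processing the response removes the self-fork, so the
        # RRef now lives only as long as user references keep it alive.
        self.forks[rref_id].discard(rref_id)
        return self.owners[rref_id]

ctx = RRefContext()
result = ctx.remote_to_self("rref0", lambda x: x + 1, 41)
```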
Test Plan: Imported from OSS
Differential Revision: D18445812
Pulled By: mrshenli
fbshipit-source-id: e3b9aa98962c388acbc2ce294101a236d5cb2da6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28948
Add the constructor RRef(value) in Python. This allows wrapping a local object in an RRef and passing or returning this RRef to users.
This enables returning, for example, a list of RRefs containing the parameters of a module to the user of the module.
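A minimal sketch of that usage; LocalRRef is a stand-in for torch.distributed.rpc.RRef, which needs an initialized RPC framework to construct:

```python
class LocalRRef:
    """Stand-in for rpc.RRef(value): wraps a local object."""
    def __init__(self, value):
        self._value = value

    def local_value(self):
        return self._value

def parameter_rrefs(params):
    # e.g. called by a remote module to expose its parameters as RRefs
    return [LocalRRef(p) for p in params]

rrefs = parameter_rrefs(["weight", "bias"])
```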
ghstack-source-id: 93565010
Test Plan: unit test.
Differential Revision: D18241227
fbshipit-source-id: b9e9b958f40623348d62ee6fc9e7f0414b4215b7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29485
The flakiness is likely due to the problem with OMP and fork. We
should disable fork tests for good, but that would have a negative
impact on internal test coverage. This commit disables the most
buggy nested tests for now, until we find a way to turn fork test
off.
Test Plan: Imported from OSS
Differential Revision: D18407529
Pulled By: mrshenli
fbshipit-source-id: dcbe49a9d104fcf1eaf83107d58904d49dc18aff
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29253
Some operations can be simpler if a worker can send an RPC to itself.
The main reason for not supporting this previously was that Gloo doesn't
support sending to self.
This change makes the process_group_agent skip the assert
check and simply enqueue the RPC message in its own receiving queue.
ghstack-source-id: 93518076
Test Plan: buck test mode/dev-nosan caffe2/test/...
Differential Revision: D18339715
fbshipit-source-id: 08ade40e81da378b003a550c898a726e99d50e34
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29069
Distributed autograd was initialized after RPC, and this could cause a
race in some scenarios where one node has initialized distributed
autograd and calls backward(), but other nodes have not initialized
distributed autograd yet.
Moving this before `_init_rpc` fixes the problem since `_init_rpc` implicitly
has a sync between processes via the store.
ghstack-source-id: 93535922
Test Plan: waitforbuildbot
Differential Revision: D18280875
fbshipit-source-id: 739a1c22dec21df859738d074e6e497fa43257fd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29396
The return types of RRef.to_here()/local_value() were recently
changed to Future, which triggers flakiness as the RRef could be
deleted before the future.wait() finishes. While we are still
discussing how we'd like to solve it, this commit reverts the
return type to stop the bleeding in tests.
Closes #28885
Test Plan: Imported from OSS
Differential Revision: D18375571
Pulled By: mrshenli
fbshipit-source-id: 354dbf38b15ab804e44fc9968dd30888415c1fab
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29341
So that other RpcAgents can use this timeout setting as well.
ghstack-source-id: 93481902
Differential Revision: D5681951
fbshipit-source-id: 569c768dc342e8a2d9faf142ceccf696e12e41dc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29157
As reported, these tests are flaky and time out. Skip them
while we investigate further.
ghstack-source-id: 93287663
Test Plan: CI
Differential Revision: D18309204
fbshipit-source-id: 95f0ea5e0c1162b78da412a34db446a01dfc33bf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29139
Each test has a 100-second timeout.
Currently this test takes 90-110 seconds to finish, causing flakiness.
Halve the load so the test is not on the edge of timing out.
ghstack-source-id: 93203670
Differential Revision: D5644012
fbshipit-source-id: 2a85999cf1ae6d18e9a871cd76ce194e1ce7b3e8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28392
Per #25531, we want to clean up futures when we detect that there are
failures/timeouts. As a first step, this diff adds timers to the future object,
provides functionality to check if a future is timed out, and allows
specification of the timeout when initializing rpc. A future diff will check for these timeouts and mark the future completed with an exception indicating that it has timed out.
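The timer bookkeeping can be sketched as follows; TimedFuture is an illustrative stand-in, not the real future type:

```python
import time

class TimedFuture:
    def __init__(self, timeout_secs):
        # record creation time so a later check can detect expiry
        self.start = time.monotonic()
        self.timeout_secs = timeout_secs

    def timed_out(self, now=None):
        # the functionality a future diff will use to mark this future
        # completed with a timeout exception
        now = time.monotonic() if now is None else now
        return (now - self.start) > self.timeout_secs

fut = TimedFuture(timeout_secs=60.0)
```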
ghstack-source-id: 93192622
Test Plan: Added unit tests.
Differential Revision: D18025163
fbshipit-source-id: 195fb50c736caf5c7b2bada9a5f6116bb106ed33
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28909
This allows chaining calls on an RRef, as exemplified in the newly added test case.
ghstack-source-id: 92996018
Test Plan: unit test.
Differential Revision: D18231081
fbshipit-source-id: deeac044ef6d63f18ea241760ac17a3e644cb3d7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28934
These tests are flaky; skip them while we investigate the root cause.
ghstack-source-id: 92945898
Test Plan: tests pass
Differential Revision: D18235766
fbshipit-source-id: 9bff65653954b767e32bcc1d25c65b0cea2c4331
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27940
1) If we receive an error for outstanding rpcs, we enqueue an appropriate error
on the local autograd engine.
2) Add an `exit_on_error` mode for the local autograd engine, where the
computation stops if we see an error.
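The exit_on_error semantics can be sketched outside the engine: once a task errors, the error is recorded as the result and the remaining computation is skipped. This is a plain-Python illustration, not the real autograd engine:

```python
def run_tasks(tasks, exit_on_error=True):
    """Run callables in order; in exit_on_error mode, stop at the first error."""
    results = []
    for task in tasks:
        try:
            results.append(task())
        except Exception as exc:
            # enqueue the error as the result, mirroring point 1) above
            results.append(exc)
            if exit_on_error:
                break  # computation stops once we see an error
    return results

out = run_tasks([lambda: 1, lambda: 1 / 0, lambda: 3])
```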
ghstack-source-id: 92603377
Test Plan: Added unit tests to test failures.
Differential Revision: D17916844
fbshipit-source-id: 199a7832f1033c36a9bbcc1e80d86576c04965d0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28606
Without passing setup_model_parallel=True to dist_init, the
decorator actually takes the function object as the value for the
flag.
Test Plan: Imported from OSS
Differential Revision: D18120507
Pulled By: mrshenli
fbshipit-source-id: afbaa381647e8f284e28fa9dbdd2a7c411073b3f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28226
# Goal
Rendezvous should be the first step not only for `init_process_group` but also for `init_model_parallel`.
The roadblock is that there is a special step in `init_process_group` where the arguments `rank` and `world_size` passed to `init_process_group(..)` are appended to the `init_method` URL string.
We need to make this argument-appending step common and reusable for both `init_process_group` and `init_model_parallel`.
# Solution
- Put argument appending inside of `rendezvous` function.
- Remove manual `init_method` url construction. Delegate the responsibility to the `rendezvous` function.
- Use the `rendezvous` function for any `RpcAgent`.
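The appending step being moved inside rendezvous() can be sketched as a URL transformation (illustrative only; the real logic lives in torch.distributed.rendezvous):

```python
def append_rendezvous_args(init_method, rank, world_size):
    """Fold rank/world_size arguments into the init_method URL."""
    sep = "&" if "?" in init_method else "?"
    return f"{init_method}{sep}rank={rank}&world_size={world_size}"

url = append_rendezvous_args("tcp://localhost:29500", rank=0, world_size=2)
```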
Test Plan:
```
buck test mode/dev-nosan caffe2/test:c10d
```
```
buck test mode/dev-nosan caffe2/test:rpc_fork -- test_invalid_names
buck-out/gen/caffe2/test/rpc_fork\#binary.par -r test_worker_id
```
```
buck test mode/dev-nosan caffe2/torch/fb/distributed/pytorch/tests:test_rpc -- test_sync_rpc
```
```
buck test mode/dev-nosan caffe2/torch/fb/rendezvous:zeus_test
```
```
buck test mode/dev-nosan //caffe2/torch/fb/distributed/modules/tests:test_sharded_pairwise_attention_pooling -- test_single_trainer_multiple_pss
```
Differential Revision: D5524494
fbshipit-source-id: 50be58ec3c928621b0874b044ef4a1640534d8ef
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28025
Add a PyFuture type which is a wrapper of either an OwnerRRef or a
jit::Future. The difference between PyFuture and jit::Future is that
PyFuture can return a custom py::object type.
Test Plan: Imported from OSS
Differential Revision: D17936746
Pulled By: mrshenli
fbshipit-source-id: a7451af3993d98aeab462ffd5318fc6d28f915c8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27776
I think it's not worth equipping other `RpcAgent`s with collective communication capability, i.e. 1) having Gloo contained in `RpcAgent`, or 2) implementing ::barrier() and ::drain() on top of RPC messaging.
The only use case that does not have a master is the OSS unit test suite, caffe2/test/rpc_test.py.
I think having those unit tests use a master is simpler than equipping `RpcAgent` with collective communication capability.
Differential Revision: D5445858
fbshipit-source-id: 56ee24703abd8c5b366829430bef657e0f1dfeba
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28013
ProcessGroupAgent currently kicks off the listener thread in its
constructor. However, serving requests requires contexts to be
initialized, e.g., RRefContext and the agent_ global var in api.py,
which might not be done yet when the first request arrives.
ProcessGroupAgent does not know what the appropriate time to start
the listener thread would be, hence this exposes an API for higher-layer
code to explicitly start the listener.
Test Plan: Imported from OSS
Differential Revision: D17932271
Pulled By: mrshenli
fbshipit-source-id: 3b408477594d4d19319e7cd08dd6f383a7ed7670
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27761
# Problem
`rpc_test` currently only has test cases that put an equal amount of work on every worker node.
The problem is that even if `RpcAgent::sync` is implemented as an empty method, no termination misbehavior is detected.
# Solution
At least add one imbalanced-load test.
ghstack-source-id: 91785984
Differential Revision: D5361435
fbshipit-source-id: 92d1f7cad61b27cdeadc2825ceab6e88d5e4b459