67 Commits

Author SHA1 Message Date
Rohan Varma
1350b99de4 Add local shutdown to process group agent (#30330)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30330

This is now possible due to previous changes made in `gloo` and `ProcessGroupGloo`. We `abort` the listener thread that is waiting for a message, and join all other threads. The API is changed so that the previous `wait_all_workers` does not destroy the agent, and this is now done in a new `shutdown` method. All callsites are updated appropriately.

ghstack-source-id: 94673884

Test Plan: Unit tests pass.

Reviewed By: mrshenli

Differential Revision: D18661775

fbshipit-source-id: 5aaa7c14603e18253394224994f6cd43234301c2
2019-11-27 22:34:08 -08:00
Shen Li
efe1859ad9 By default ignore RRef leaks during shutdown (#30217)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30217

Before this commit, RRefContext throws an error if it detects any
RRef leak during shutdown. However, this requires applications to
make sure that they have freed all references to RRefs in application
code, which can be a bad debugging experience for large
applications. Besides, this also relies on Python GC to free things
up in time, which might not always be true. After this commit,
RRefContext ignores leaking RRefs during shutdown, as shutdown
is called when the application has finished training and no longer
cares about local states. Hence, it should be OK to just ignore
those leaks and destroy OwnerRRefs. If an application would like to
enforce no leaks, just set torch.distributed.rpc.api._ignore_rref_leak
to False.

Test Plan: Imported from OSS

Differential Revision: D18632546

Pulled By: mrshenli

fbshipit-source-id: 2744b2401dafdd16de0e0a76cf8e07777bed0f38
2019-11-26 06:53:58 -08:00
Rohan Varma
5c6705e62c add default arg for init_method (#30208)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30208

Adds default arg for init_method so users don't have to pass this in,
and moves it to `RpcBackendOptions` struct. Removes `init_method` arg from rpc.init_rpc. Also fixes some docs.
ghstack-source-id: 94500475

Test Plan: Unit tests pass.

Reviewed By: mrshenli

Differential Revision: D18630074

fbshipit-source-id: 04b7dd7ec96f4c4da311b71d250233f1f262135a
2019-11-25 14:52:48 -08:00
Shihao Xu
6a00191fc2 Add RpcAgent::getWorkerInfos() (#30241)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30241

We need an API to get all worker infos. This will be used by the backend-agnostic `rpc.wait_all_workers()` API.
ghstack-source-id: 94454935

Test Plan:
# Unit tests

```
buck test mode/dev-nosan //caffe2/test:rpc_fork -- test_get_worker_infos

buck-out/gen/caffe2/test/rpc_fork\#binary.par -r test_get_worker_infos
```

```
buck test mode/dev-nosan //caffe2/test:rpc_fork_thrift -- test_get_worker_infos

buck-out/gen/caffe2/test/rpc_fork_thrift\#binary.par -r test_get_worker_infos
```

Differential Revision: D5693412

fbshipit-source-id: 5123c8248b6d44fd36b8a5f381dbabb2660e6f0f
2019-11-22 18:26:30 -08:00
Shen Li
a9f3f48f88 Revert D5578006: Add local shutdown to process group agent
Test Plan: revert-hammer

Differential Revision:
D5578006

Original commit changeset: 6258879fb44c

fbshipit-source-id: 11b893b3a280a8383eeb20a0548626811616dca1
2019-11-22 11:31:04 -08:00
Rohan Varma
c478a92b93 Add local shutdown to process group agent (#30020)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30020
This is now possible due to previous changes made in `gloo` and `ProcessGroupGloo`. We `abort` the listener thread that is waiting for a message, and join all other threads. The destructor calls this same `localShutdown` method, but we ensure this is not called multiple times.
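
A minimal Python sketch of the shutdown pattern described above, assuming a single listener thread blocked on a queue. `SketchAgent`, `deliver`, and `local_shutdown` are hypothetical names for illustration and do not mirror the C++ `ProcessGroupAgent`:

```python
import queue
import threading


class SketchAgent:
    """Hypothetical stand-in for an RPC agent with one listener thread."""

    _ABORT = object()  # sentinel used to wake the blocked listener

    def __init__(self):
        self._inbox = queue.Queue()
        self._lock = threading.Lock()
        self._shut_down = False
        self.handled = []
        self._listener = threading.Thread(target=self._listen_loop)
        self._listener.start()

    def _listen_loop(self):
        while True:
            msg = self._inbox.get()  # blocks until a message or the sentinel arrives
            if msg is self._ABORT:
                return
            self.handled.append(msg)

    def deliver(self, msg):
        self._inbox.put(msg)

    def local_shutdown(self):
        # Guard so that repeated calls (explicit call, then a destructor) are no-ops.
        with self._lock:
            if self._shut_down:
                return False
            self._shut_down = True
        self._inbox.put(self._ABORT)  # "abort" the waiting listener
        self._listener.join()         # then join the thread
        return True
```

The first call to `local_shutdown()` stops and joins the listener; any later call returns without doing work, which is the property a destructor calling the same method would rely on.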

ghstack-source-id: 94415336

Test Plan: Unit tests pass.

Differential Revision: D5578006

fbshipit-source-id: 6258879fb44c9fca97fdfad64468c1488c16ac02
2019-11-22 10:03:00 -08:00
Rohan Varma
f41422121e default construct rpc agent options based on the backend type (#30201)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30201

Provide a default constructor so that users don't have to construct
RPC agent options. Also rename this to `RpcBackendOptions` as suggested.
ghstack-source-id: 94411768

Test Plan: Unit tests pass.

Differential Revision: D18628698

fbshipit-source-id: 81fb45f124ad1006e628f6045162308093c9d446
2019-11-22 08:18:06 -08:00
Shen Li
4609c626c5 Enable test_call_method_on_rref in rpc_test (#30261)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30261

With #29827, the flakiness should disappear for test_call_method_on_rref

Test Plan: Imported from OSS

Differential Revision: D18645036

Pulled By: mrshenli

fbshipit-source-id: 44d759062fc78b1a797266096dbb4ddd104f07eb
2019-11-21 19:38:19 -08:00
Wen Zhang
6e4c23b02f Add RPC internal helper that overrides the default pickler. (#30185)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30185

To enable share_memory over RPC, add an internal helper that overrides the default RPC pickler.
Replaces D18598974
ghstack-source-id: 94299660

Test Plan:
`python test/test_rpc_spawn RpcTestWithSpawn.test_use_rpc_pickler`

`buck test mode/dev-nosan //caffe2/test:rpc_spawn -- test_use_rpc_pickler`

Reviewed By: mrshenli

Differential Revision: D18621372

fbshipit-source-id: c680ef711b2c42524c47a5266e911fa8e0cd45ae
2019-11-21 10:01:02 -08:00
Rohan Varma
f304bd5062 rename join_rpc to wait_all_workers in public api (#30050)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30050

Renames this API to wait_all_workers as discussed.
ghstack-source-id: 94273005

Test Plan: Unit tests pass

Differential Revision: D18581466

fbshipit-source-id: 4ff5d5fb2d528f17252d5b5f30c3047d2efb92bf
2019-11-20 12:38:35 -08:00
Yanli Zhao
b410d864c9 make python remote exception to rethrow when using remote reference to itself (#29930)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29930
Right now, rethrowing a remote exception from a Python call is coupled with deserialization.
For an owner ref, setValue() and getValue() do not use serialization and deserialization, so when users create a ref to the calling worker itself and call ownerRef.to_here(), the remote exception from the Python call is not rethrown.

This diff moves the remote exception rethrow out of deserialization, so that the exception can be handled for ownerRef.localValue() or ownerRef.to_here().
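
A rough Python sketch of the idea, assuming the remote failure is stored as a wrapper object and checked when the value is read rather than when it is deserialized. `RemoteExceptionSketch`, `OwnerRefSketch`, and `raise_if_remote_exception` are hypothetical names, not the actual implementation:

```python
class RemoteExceptionSketch:
    """Hypothetical wrapper carrying the message of a failed remote call."""

    def __init__(self, message):
        self.message = message


def raise_if_remote_exception(value):
    # The check lives on the read path, so it also runs when no
    # serialization or deserialization was involved.
    if isinstance(value, RemoteExceptionSketch):
        raise RuntimeError(value.message)
    return value


class OwnerRefSketch:
    def __init__(self):
        self._value = None

    def set_value(self, value):
        self._value = value  # stored directly, no serde on the owner

    def local_value(self):
        return raise_if_remote_exception(self._value)

    # In this sketch the owner reads its own value the same way.
    to_here = local_value
```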

close #29924
ghstack-source-id: 94210894

Test Plan: unit tests

Differential Revision: D18541916

fbshipit-source-id: 7cda93f623d52c740b3c1b1fa9a442f866984340
2019-11-19 21:33:21 -08:00
Shihao Xu
80e3f17301 Resubmit "Add RpcAgentOptions struct type, which bundles different required arguments for different RpcAgents" (#30093)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30093

https://github.com/pytorch/pytorch/pull/28226 introduced the `worker_to_id` arg to the `def init_rpc` function for other `RpcAgent`s, although it is not really used by `ProcessGroupAgent`. Cleanup is wanted for this, as described in https://github.com/pytorch/pytorch/issues/29031.

To adapt to the differences between `RpcAgent`s, add an `RpcAgentOptions` base class, which allows leveraging inheritance to add extra fields.
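
The inheritance idea can be sketched in Python as follows. This is an illustration only: the class names and the `rpc_timeout_s` and `num_send_recv_threads` fields are assumptions, and only `worker_to_id` is taken from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class SketchAgentOptions:
    """Fields shared by every backend (hypothetical base class)."""
    rpc_timeout_s: float = 60.0


@dataclass
class SketchProcessGroupOptions(SketchAgentOptions):
    """Backend that has no use for a worker-to-id mapping."""
    num_send_recv_threads: int = 4


@dataclass
class SketchOtherBackendOptions(SketchAgentOptions):
    """Backend that needs the extra mapping, added via inheritance."""
    worker_to_id: dict = field(default_factory=dict)


def describe(options: SketchAgentOptions) -> str:
    # Code that only needs the shared fields can accept the base type.
    return f"{type(options).__name__}(timeout={options.rpc_timeout_s})"
```

With this shape, an init function can take one options argument instead of growing a new keyword argument for each backend.
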
ghstack-source-id: 94197295

Test Plan:
### OSS RPC + RRef tests

```
buck test mode/dev-nosan //caffe2/test:rpc_fork
```

```
buck test mode/dev-nosan caffe2/torch/fb/distributed/thriftRpcBackend/test:thrift_rpc_fork_test -- test_sync_rpc
```

### Prototype RRef tests

```
buck test mode/dev-nosan caffe2/torch/fb/distributed/pytorch/tests:test_rpc
```

```
buck test mode/dev-nosan //caffe2/torch/fb/distributed/pytorch/tests:test_rpc_thrift_rpc_agent
```

### Dist autograd

```
buck test mode/dev-nosan caffe2/test:dist_autograd_fork
```

```
buck test mode/dev-nosan caffe2/torch/fb/distributed/thriftRpcBackend/test:thrift_dist_autograd_fork_test
```

Differential Revision: D18595578

fbshipit-source-id: 616fca3b844c171ed5277bbc6a2b1693bc3a8065
2019-11-19 18:52:30 -08:00
Shihao Xu
868cb05a30 Resubmit "Add RpcAgentTestFixture to extract duplicate code" (#30092)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30092

There is duplicate code for components that rely on RpcAgent. Extract it into a reusable test fixture class.
ghstack-source-id: 94196891

Test Plan:
### RPC + RRef

```
buck test mode/dev-nosan //caffe2/test:rpc_fork

buck test mode/dev-nosan //caffe2/test:rpc_spawn
```

```
buck test mode/dev-nosan //caffe2/test:rpc_fork_thrift

buck test mode/dev-nosan //caffe2/test:rpc_spawn_thrift
```

### Dist Autograd

```
buck test mode/dev-nosan //caffe2/test:dist_autograd_fork

buck test mode/dev-nosan //caffe2/test:dist_autograd_spawn
```

```
buck test mode/dev-nosan //caffe2/test:dist_autograd_fork_thrift

buck test mode/dev-nosan //caffe2/test:dist_autograd_spawn_thrift
```

### Dist Optimizer

```
buck test mode/dev-nosan //caffe2/test:dist_optimizer_fork

buck test mode/dev-nosan //caffe2/test:dist_optimizer_spawn
```

```
buck test mode/dev-nosan //caffe2/test:dist_optimizer_fork_thrift

buck test mode/dev-nosan //caffe2/test:dist_optimizer_spawn_thrift
```

Differential Revision: D18595408

fbshipit-source-id: 8360759c63e838fb19d4eb1aeacca0bf8eb4b55f
2019-11-19 16:24:51 -08:00
Shen Li
5aa50c7f3c Enable test_nested_rref in rpc_test.py (#30100)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30100

As after #29827 we only test RPC using spawn, the multi-thread/fork
error should disappear.

Test Plan: Imported from OSS

Differential Revision: D18597002

Pulled By: mrshenli

fbshipit-source-id: 64aa6a59248e5d1b7e1ad1aebffb6a25248388d2
2019-11-19 13:28:05 -08:00
Shen Li
a243e0872e Enable test_nested_remote in rpc_test.py (#30099)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30099

As after #29827 we only test RPC using spawn, the multi-thread/fork
error should disappear.

Test Plan: Imported from OSS

Differential Revision: D18597003

Pulled By: mrshenli

fbshipit-source-id: ebfb1f6f3f961d98351e06ce4b951793a9b95398
2019-11-19 13:28:01 -08:00
Shen Li
8912e6caf5 Enable test_nested_rpc in rpc_test.py (#30098)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30098

As after #29827 we only test RPC using spawn, the multi-thread/fork
error should disappear.

Test Plan: Imported from OSS

Differential Revision: D18597001

Pulled By: mrshenli

fbshipit-source-id: 68256289085fac1a9ca76d5b4882e97e2f81d1f4
2019-11-19 13:27:57 -08:00
Edward Yang
7d287688eb Revert D5689636: Add RpcAgentTestFixture to extract duplicate code
Test Plan: revert-hammer

Differential Revision:
D5689636

Original commit changeset: f35eea1359ad

fbshipit-source-id: 31928fce5e96b3beceefbc9a03f54769f10b7e1a
2019-11-19 08:14:44 -08:00
Edward Yang
1dda8186ae Revert D18549919: Add RpcAgentOptions struct type, which bundles different required arguments for different RpcAgents
Test Plan: revert-hammer

Differential Revision:
D18549919

Original commit changeset: b9f3f1a41d1f

fbshipit-source-id: 2d5e578d18c0725b59eb99a0e942fbf7fe3341ee
2019-11-19 08:14:40 -08:00
Rohan Varma
83513506c3 poll for timed out futures in process group agent (#29601)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29601

Follow up from https://github.com/pytorch/pytorch/pull/28392. Adds a background thread to `ProcessGroupAgent` that polls for timed out RPCs at a pre-set interval, and marks them as completed with a timeout exception if they have timed out. Also deletes the futures from the corresponding maps `futures_` and `futureTimeouts`. Unit tests are added to ensure that timed out RPCs are appropriately cleaned up.
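
A minimal Python sketch of such a polling thread, assuming futures are tracked in two maps keyed by request id. `TimeoutPoller`, `RpcTimedOut`, and all method names are hypothetical and do not mirror the C++ members of `ProcessGroupAgent`:

```python
import threading
import time
from concurrent.futures import Future


class RpcTimedOut(Exception):
    """Hypothetical exception type used to complete timed-out futures."""


class TimeoutPoller:
    def __init__(self, poll_interval_s=0.01):
        self._futures = {}    # request id -> Future
        self._deadlines = {}  # request id -> absolute deadline (monotonic clock)
        self._lock = threading.Lock()
        self._stop = threading.Event()  # plays the role of the shutdown flag
        self._poll_interval_s = poll_interval_s
        self._thread = threading.Thread(target=self._poll_loop, daemon=True)
        self._thread.start()

    def track(self, request_id, timeout_s):
        fut = Future()
        with self._lock:
            self._futures[request_id] = fut
            self._deadlines[request_id] = time.monotonic() + timeout_s
        return fut

    def pending(self):
        with self._lock:
            return len(self._futures)

    def _poll_loop(self):
        # Event.wait returns False on timeout and True once shutdown is requested.
        while not self._stop.wait(self._poll_interval_s):
            now = time.monotonic()
            with self._lock:
                expired = [rid for rid, d in self._deadlines.items() if d <= now]
                timed_out = [(rid, self._futures.pop(rid)) for rid in expired]
                for rid in expired:
                    del self._deadlines[rid]
            # Complete the futures outside the lock, after removing them from both maps.
            for rid, fut in timed_out:
                fut.set_exception(RpcTimedOut(f"RPC {rid} timed out"))

    def shutdown(self):
        self._stop.set()
        self._thread.join()
```

A caller waiting on a tracked future sees `RpcTimedOut` once the deadline passes, and the entry is gone from both maps by then.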

Also adds a `shutdown` variable to process group agent to control the shutting down of this background thread, which can eventually be extended to use for controlling a clean shutdown of process group agent.
ghstack-source-id: 94175131

Test Plan: Added unit tests

Differential Revision: D18434215

fbshipit-source-id: c48abdb8759fe1447200ec66bb9d4b1c50ec4535
2019-11-19 06:42:04 -08:00
Shihao Xu
21dc1d4543 Add RpcAgentOptions struct type, which bundles different required arguments for different RpcAgents (#29972)
Summary:
https://github.com/pytorch/pytorch/pull/28226 introduced the `worker_to_id` arg to the `def init_rpc` function for other `RpcAgent`s, although it is not really used by `ProcessGroupAgent`. Cleanup is wanted for this, as described in https://github.com/pytorch/pytorch/issues/29031.

To adapt to the differences between `RpcAgent`s, add an `RpcAgentOptions` base class, which allows leveraging inheritance to add extra fields.

closes https://github.com/pytorch/pytorch/issues/29031
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29972

Differential Revision: D18549919

Pulled By: xush6528

fbshipit-source-id: b9f3f1a41d1ff18498734081870820b055d56f5b
2019-11-19 01:00:08 -08:00
Alisson Gusatti Azzolini
97156f548d Add hash and equality operators for WorkerInfo (#29958)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29958

DistributedOptimizer relies on hashing WorkerInfo in order to coalesce fan-out RPCs. This will likely be a very common use case (EASGD will do the same, for example).
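
To illustrate why hashing and equality matter here, a small Python sketch that groups parameters by owning worker so each owner would receive a single RPC. `SketchWorkerInfo` and `coalesce_by_owner` are hypothetical and only assume that a worker is identified by a name and an id:

```python
from collections import defaultdict


class SketchWorkerInfo:
    """Hypothetical worker descriptor with value-based equality and hashing."""

    def __init__(self, name, worker_id):
        self.name = name
        self.id = worker_id

    def __eq__(self, other):
        if not isinstance(other, SketchWorkerInfo):
            return NotImplemented
        return (self.name, self.id) == (other.name, other.id)

    def __hash__(self):
        # Must agree with __eq__ so equal workers land in the same dict bucket.
        return hash((self.name, self.id))


def coalesce_by_owner(params_with_owner):
    """Group parameter names by owner so one request per owner is enough."""
    grouped = defaultdict(list)
    for owner, param in params_with_owner:
        grouped[owner].append(param)
    return dict(grouped)
```

Without the two operators, two descriptors for the same worker would be distinct dictionary keys and the fan-out would not be coalesced.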
ghstack-source-id: 94169198

Test Plan: unit test.

Differential Revision: D18548257

fbshipit-source-id: 7d67d4e1b9bc60403c372164982a75ae8c1d8389
2019-11-18 20:47:13 -08:00
Pritam Damania
df6a1c0437 Remove rpc.sync_rpc from the public API. (#30033)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30033

Removing this API for now since we don't have a concrete use-case for
this yet and as a result exposing this as a public API might result in users
depending on this API.

We can always add some variant of this API back if needed later.
ghstack-source-id: 94138302

Test Plan: waitforbuildbot

Differential Revision: D18578056

fbshipit-source-id: 078c62331725e03bd5702624afc16b1cdcdf26a4
2019-11-18 18:02:07 -08:00
Shihao Xu
8dd67057f1 Add RpcAgentTestFixture to extract duplicate code (#29747)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29747

There is duplicate code for components that rely on RpcAgent. Extract it into a reusable test fixture class.

Test Plan:
### RPC + RRef

```
buck test mode/dev-nosan //caffe2/test:rpc_fork

buck test mode/dev-nosan //caffe2/test:rpc_spawn
```

```
buck test mode/dev-nosan //caffe2/test:rpc_fork_thrift

buck test mode/dev-nosan //caffe2/test:rpc_spawn_thrift
```

### Dist Autograd

```
buck test mode/dev-nosan //caffe2/test:dist_autograd_fork

buck test mode/dev-nosan //caffe2/test:dist_autograd_spawn
```

```
buck test mode/dev-nosan //caffe2/test:dist_autograd_fork_thrift

buck test mode/dev-nosan //caffe2/test:dist_autograd_spawn_thrift
```

### Dist Optimizer

```
buck test mode/dev-nosan //caffe2/test:dist_optimizer_fork

buck test mode/dev-nosan //caffe2/test:dist_optimizer_spawn
```

```
buck test mode/dev-nosan //caffe2/test:dist_optimizer_fork_thrift

buck test mode/dev-nosan //caffe2/test:dist_optimizer_spawn_thrift
```

Differential Revision: D5689636

fbshipit-source-id: f35eea1359addaaac9bd8d00d0a5df228a236511
2019-11-18 12:54:17 -08:00
Rohan Varma
639133d6d1 rename init_model_parallel to init_rpc (#29762)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29762

Rename this API as discussed, since its use cases extend beyond only
model parallelism.
ghstack-source-id: 94020627

Test Plan: Unit tests pass

Differential Revision: D18491743

fbshipit-source-id: d07676bb14f072c64da0ce99ee818bcc582efc57
2019-11-18 06:07:44 -08:00
Shen Li
4a1fcc0b83 Allow rpc.remote to create RRef on self (#29634)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29634

This implementation supports rpc.remote to self by doing the
following steps:

1. create an owner RRef
2. add the owner RRef to owners_ in RRefContext, and keep it alive
   by using RRefId as the ForkId.
3. Go through serde and insert the message to the caller's thread-pool
4. When the response message gets processed, remove itself from
   the RRef fork map.

Test Plan: Imported from OSS

Differential Revision: D18445812

Pulled By: mrshenli

fbshipit-source-id: e3b9aa98962c388acbc2ce294101a236d5cb2da6
2019-11-14 00:10:24 -08:00
Shen Li
c49b324cbf Enable test_stress_light_rpc in rpc_test.py
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/29473

Test Plan: Imported from OSS

Differential Revision: D18404820

Pulled By: mrshenli

fbshipit-source-id: de0f18db208d83794507c162483bb948056af533
2019-11-11 12:22:10 -08:00
Shen Li
bb90c18791 Enable test_py_rref_args_user_share in rpc_test.py
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/29472

Test Plan: Imported from OSS

Differential Revision: D18404818

Pulled By: mrshenli

fbshipit-source-id: 1fcd19b178dc20540a210601cbb2c974be14a7cc
2019-11-11 12:22:05 -08:00
Shen Li
b885eff4be Enable test_multi_py_udf_remote in rpc_test.py
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/29471

Test Plan: Imported from OSS

Differential Revision: D18404819

Pulled By: mrshenli

fbshipit-source-id: 8cf3e32d7980e34c48bfd8fb61cfd9a0acc9bd46
2019-11-11 12:22:01 -08:00
Shen Li
bc4457f5b6 Enable test_py_built_in in rpc_test.py
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/29470

Test Plan: Imported from OSS

Differential Revision: D18404822

Pulled By: mrshenli

fbshipit-source-id: 01cb87dee39c3579a2e0961d67b627ca1dc87fc2
2019-11-11 12:21:56 -08:00
Alisson Gusatti Azzolini
93b5c9d723 Allow to create local RRef with value (#28948)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28948

Add the constructor RRef(value) in python. This allows wrapping a local object with an RRef and passing or returning this RRef to users.
This enables returning, for example, a list of RRefs containing the parameters of a module to the user of the module.
ghstack-source-id: 93565010

Test Plan: unit test.

Differential Revision: D18241227

fbshipit-source-id: b9e9b958f40623348d62ee6fc9e7f0414b4215b7
2019-11-11 12:19:45 -08:00
Shen Li
3e5af22650 Disable flaky RPC tests (#29485)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29485

The flakiness is likely due to the problem with OMP and fork. We
should disable fork tests for good, but that would have negative
impact on internal test coverage. This commit disables the most
buggy nested tests for now, until we find a way to turn fork test
off.

Test Plan: Imported from OSS

Differential Revision: D18407529

Pulled By: mrshenli

fbshipit-source-id: dcbe49a9d104fcf1eaf83107d58904d49dc18aff
2019-11-10 21:33:27 -08:00
Jeremy Lilley
2cd4f86422 Support process_group_agent "sending to itself" (#29253)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29253

Some operations can be simpler if a worker can send an rpc to itself.
The main reason for not doing this previously was that Gloo doesn't support
self-sending.

That said, this changes the process_group_agent to skip the assert
check, and simply enqueue the rpc message in its receiving queue.
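
A minimal Python sketch of the self-send shortcut, assuming the transport is a callable that cannot deliver to the local rank. `SketchSelfSendAgent` and its attributes are hypothetical names, not the `process_group_agent` code:

```python
import queue


class SketchSelfSendAgent:
    def __init__(self, rank, transport_send):
        self.rank = rank
        self._transport_send = transport_send  # callable(dst_rank, message)
        self.recv_queue = queue.Queue()

    def send(self, dst_rank, message):
        if dst_rank == self.rank:
            # The transport cannot send to self, so hand the message
            # straight to our own receiving queue instead of asserting.
            self.recv_queue.put((self.rank, message))
        else:
            self._transport_send(dst_rank, message)
```

The receive side then handles a self-sent message like any other entry in its queue.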
ghstack-source-id: 93518076

Test Plan: buck test mode/dev-nosan caffe2/test/...

Differential Revision: D18339715

fbshipit-source-id: 08ade40e81da378b003a550c898a726e99d50e34
2019-11-08 12:11:55 -08:00
Pritam Damania
5e1983f90f Fix distributed autograd initialization. (#29069)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29069

Distributed autograd was initialized after RPC and this would cause a
race in some scenarios where one node has initialized distributed
autograd and calls backward(), but other nodes have not initialized distributed
autograd yet.

Moving this before `_init_rpc` fixes the problem since `_init_rpc` implicitly
has a sync between processes via the store.
ghstack-source-id: 93535922

Test Plan: waitforbuildbot

Differential Revision: D18280875

fbshipit-source-id: 739a1c22dec21df859738d074e6e497fa43257fd
2019-11-08 11:20:15 -08:00
Shen Li
63675b1969 Revert RRef.to_here()/local_value() return type (#29396)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29396

The return types of RRef.to_here()/local_value() were recently
changed to Future, which triggers flakiness as the RRef could be
deleted before the future.wait() finishes. While we are still
discussing how we'd like to solve it, this commit reverts the
return type to stop bleeding in tests.

closes #28885

Test Plan: Imported from OSS

Differential Revision: D18375571

Pulled By: mrshenli

fbshipit-source-id: 354dbf38b15ab804e44fc9968dd30888415c1fab
2019-11-08 08:31:18 -08:00
Shihao Xu
e66626ae5c Lift rpc_timeout to RpcAgent, for other RpcAgents to reuse. (#29341)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29341

So that other RpcAgents can use this timeout setting as well.

ghstack-source-id: 93481902

Differential Revision: D5681951

fbshipit-source-id: 569c768dc342e8a2d9faf142ceccf696e12e41dc
2019-11-07 17:05:45 -08:00
Rohan Varma
003cb8595b skip more flaky rpc tests (#29157)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29157

As reported, these tests are flaky and time out. Skip them
while we investigate further.
ghstack-source-id: 93287663

Test Plan: CI

Differential Revision: D18309204

fbshipit-source-id: 95f0ea5e0c1162b78da412a34db446a01dfc33bf
2019-11-05 15:49:13 -08:00
Shihao Xu
ac027d30d5 Half test time, test_asymmetric_load_with_join, to avoid flakiness (#29139)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29139

Each test has a 100 sec timeout.

Currently this test takes 90~110 secs to finish, causing flakiness.

Halve the load so that it is not on the edge of the timeout.
ghstack-source-id: 93203670

Differential Revision: D5644012

fbshipit-source-id: 2a85999cf1ae6d18e9a871cd76ce194e1ce7b3e8
2019-11-05 14:54:19 -08:00
Rohan Varma
fd0f9811ad add timeout for RPC futures, and ability to set timeout when initializing rpc (#28392)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28392

Per #25531, we want to clean up futures when we detect that there are
failures/timeouts. As a first step, this diff adds timers to the future object,
provides functionality to check if a future is timed out, and allows
specification of the timeout when initializing rpc. A future diff will check for these timeouts and mark the future completed with an exception indicating that it has timed out.
ghstack-source-id: 93192622

Test Plan: Added unit tests.

Differential Revision: D18025163

fbshipit-source-id: 195fb50c736caf5c7b2bada9a5f6116bb106ed33
2019-11-04 14:43:03 -08:00
Alisson Gusatti Azzolini
d3cd64d71d PyRRef.owner() to return WorkerInfo (#28909)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28909

This allows chaining calls on an RRef, as exemplified in the new test case.
ghstack-source-id: 92996018

Test Plan: unit test.

Differential Revision: D18231081

fbshipit-source-id: deeac044ef6d63f18ea241760ac17a3e644cb3d7
2019-10-31 17:11:24 -07:00
Rohan Varma
05e88dc4fe skip additional flaky rpc tests (#28934)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28934

These tests are flaky; skip them while we investigate the root cause.
ghstack-source-id: 92945898

Test Plan: tests pass

Differential Revision: D18235766

fbshipit-source-id: 9bff65653954b767e32bcc1d25c65b0cea2c4331
2019-10-31 10:12:59 -07:00
Shihao Xu
8f1564b8ab Add enum type to rpc registry for consolidating RPC initialization code path (#28628)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28628

Consolidate code paths of ProcessGroupAgent construction and other RPC Backend construction.
ghstack-source-id: 92845348

Differential Revision: D5516188

fbshipit-source-id: 151d9b7b74f68631d6673fecc74dec525949b8f0
2019-10-29 17:26:15 -07:00
Pritam Damania
1322daa506 Improve error handling for distributed autograd engine. (#27940)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27940

1) If we receive an error for outstanding rpcs, we enqueue an appropriate error
on the local autograd engine.
2) Add an `exit_on_error` mode for the local autograd engine, where the
computation stops if we see an error.
ghstack-source-id: 92603377

Test Plan: Added unit tests to test failures.

Differential Revision: D17916844

fbshipit-source-id: 199a7832f1033c36a9bbcc1e80d86576c04965d0
2019-10-25 12:07:27 -07:00
Shen Li
261a13a84b Enable dist autograd tests (#28606)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28606

Without passing setup_model_parallel=True to dist_init, the
decorator actually takes the function object as the value for the
flag.
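
The pitfall can be reproduced with a small Python sketch. `dist_init_sketch` is a hypothetical decorator factory written for this illustration; it is not the real `dist_init`, and `captured_flags` exists only to make the received value visible:

```python
import functools

captured_flags = []  # records what the factory received as its flag


def dist_init_sketch(setup_model_parallel=True):
    """Decorator factory: meant to be called, e.g. @dist_init_sketch(...)."""
    captured_flags.append(setup_model_parallel)

    def decorator(test_fn):
        @functools.wraps(test_fn)
        def wrapper(*args, **kwargs):
            return test_fn(*args, **kwargs)
        return wrapper

    return decorator


@dist_init_sketch(setup_model_parallel=True)
def correct_test():
    return "ran"


# No parentheses: Python calls dist_init_sketch(misapplied_test), so the
# test function itself is bound to setup_model_parallel.
@dist_init_sketch
def misapplied_test():
    return "ran"
```

In the second case the flag holds a function object, which is truthy, so a check such as `if setup_model_parallel:` would pass silently, and the name `misapplied_test` ends up bound to the inner `decorator` rather than to a wrapped test.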

Test Plan: Imported from OSS

Differential Revision: D18120507

Pulled By: mrshenli

fbshipit-source-id: afbaa381647e8f284e28fa9dbdd2a7c411073b3f
2019-10-24 15:30:27 -07:00
Shihao Xu
59402f51cf Make init_method url appending step re-usable by both init_process_group and init_model_parallel(init_rpc) (#28226)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28226

# Goal

Rendezvous step should be the first step not only for `init_process_group` but also for `init_model_parallel`.

The roadblock is that there is a special step in `init_process_group` where the arguments `rank` and `world_size` passed to `init_process_group(..)` are appended to the `init_method` url string.

We need to make this argument appending step common and re-usable for both `init_process_group` and `init_model_parallel`.

# Solution

- Put argument appending inside of `rendezvous` function.
- Remove manual `init_method` url construction. Delegate the responsibility to the `rendezvous` function.
- Use the `rendezvous` function for any `RpcAgent`.
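
The argument appending step can be sketched in Python as below. `append_rank_and_world_size` is a hypothetical helper written for illustration, not the actual `rendezvous` code, and the example URL in the usage note is an assumption:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def append_rank_and_world_size(init_method, rank, world_size):
    """Return init_method with rank and world_size set in its query string."""
    parts = urlsplit(init_method)
    query = dict(parse_qsl(parts.query))  # keep any existing query arguments
    query["rank"] = str(rank)
    query["world_size"] = str(world_size)
    return urlunsplit(parts._replace(query=urlencode(query)))
```

Calling it as `append_rank_and_world_size("tcp://127.0.0.1:29500", 0, 4)` yields a URL whose query carries both values, so any caller can pass plain arguments and leave the URL construction to one shared function.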

Test Plan:
```
buck test mode/dev-nosan caffe2/test:c10d
```

```
buck test mode/dev-nosan caffe2/test:rpc_fork -- test_invalid_names

buck-out/gen/caffe2/test/rpc_fork\#binary.par -r test_worker_id
```

```
buck test mode/dev-nosan caffe2/torch/fb/distributed/pytorch/tests:test_rpc -- test_sync_rpc
```

```
buck test mode/dev-nosan caffe2/torch/fb/rendezvous:zeus_test
```

```
buck test mode/dev-nosan //caffe2/torch/fb/distributed/modules/tests:test_sharded_pairwise_attention_pooling -- test_single_trainer_multiple_pss
```

Differential Revision: D5524494

fbshipit-source-id: 50be58ec3c928621b0874b044ef4a1640534d8ef
2019-10-23 21:51:08 -07:00
Shen Li
e31adeb4f3 Make RRef::LocalValue return Future (#28025)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28025

Add a PyFuture type which is a wrapper of either an OwnerRRef or a
jit::Future. The difference between PyFuture and jit::Future is that
PyFuture can return a custom py::object type.

Test Plan: Imported from OSS

Differential Revision: D17936746

Pulled By: mrshenli

fbshipit-source-id: a7451af3993d98aeab462ffd5318fc6d28f915c8
2019-10-23 17:07:16 -07:00
Shen Li
0ddb50010e enable test_invalid_names test in rpc_test
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/28376

Test Plan: Imported from OSS

Differential Revision: D18045158

Pulled By: mrshenli

fbshipit-source-id: 42821ef40afbdff8662abacd447e307ccf4853d3
2019-10-21 18:43:37 -07:00
Shihao Xu
3523e5427a Add master to OSS RPC test (#27776)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27776

I think it’s not worth it to equip other `RPCAgent`s with collective communication capability, i.e. 1) have GLOO contained in `RPCAgent`, or 2) implement ::barrier() and ::drain() based on RPC messaging.

The only use case that does not have a master is the OSS unit test suite, caffe2/test/rpc_test.py.

I think giving those unit tests a master is simpler than equipping `RPCAgent` with collective communication capability.

Differential Revision: D5445858

fbshipit-source-id: 56ee24703abd8c5b366829430bef657e0f1dfeba
2019-10-16 13:45:45 -07:00
Shen Li
59cd0faeff Defer pg agent listener thread until contexts are initialized (#28013)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28013

ProcessGroupAgent currently kicks off the listener thread in its
constructor. However, serving requests requires contexts to be
initialized, e.g., RRefContext and agent_ global var in api.py,
which might not be done yet when the first request arrives.
ProcessGroupAgent does not know what would be the appropriate time
to start the listener thread, hence this commit exposes an API for
higher layer code to explicitly start listeners.

Test Plan: Imported from OSS

Differential Revision: D17932271

Pulled By: mrshenli

fbshipit-source-id: 3b408477594d4d19319e7cd08dd6f383a7ed7670
2019-10-15 17:45:43 -07:00
Shihao Xu
871b1419de Test graceful termination of RPCAgent with asymmetric load (#27761)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27761

# Problem

`rpc_test` currently only has test cases that put an equal amount of work on every worker node.
The problem is that even if `RpcAgent::sync` is implemented as an empty method, no termination misbehavior is detected.

# Solution

Add at least one test with imbalanced load.
ghstack-source-id: 91785984

Differential Revision: D5361435

fbshipit-source-id: 92d1f7cad61b27cdeadc2825ceab6e88d5e4b459
2019-10-15 16:45:21 -07:00
Shen Li
f10ea7a2e1 Add test for requires_process_group_agent decorator
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27879

Test Plan: Imported from OSS

Differential Revision: D17924096

Pulled By: mrshenli

fbshipit-source-id: 91aaad12daf985768dfb05fb9630cee21a81a366
2019-10-15 06:57:34 -07:00