Commit Graph

30 Commits

Author SHA1 Message Date
Pritam Damania
c06f9023e5 Polish rpc docstring. (#30069)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30069

1) Fix rpc docstrings
2) Fix some links
ghstack-source-id: 94250890

Test Plan: waitforbuildbot

Differential Revision: D18588231

fbshipit-source-id: 33846ace1afa94d25f34b0370437abf6d9408f06
2019-11-19 23:10:14 -08:00
Shihao Xu
80e3f17301 Resubmit "Add RpcAgentOptions struct type, which bundles different required arguments for different RpcAgents" (#30093)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30093

https://github.com/pytorch/pytorch/pull/28226 introduced `worker_to_id` arg to the `def init_rpc` function for other `RpcAgent`. While it's not really used by `ProcessGroupAgent`. Cleanup is wanted for this, as described in https://github.com/pytorch/pytorch/issues/29031.

To adapt to the difference of different `RpcAgent`, adding a `RpcAgentOptions` base classes, which allow leveraging inheritance to add extra fields.
ghstack-source-id: 94197295

Test Plan:
### OSS RPC + RRef tests

```
buck test mode/dev-nosan //caffe2/test:rpc_fork
```

```
buck test mode/dev-nosan caffe2/torch/fb/distributed/thriftRpcBackend/test:thrift_rpc_fork_test -- test_sync_rpc
```

### Prototype RRef tests

```
buck test mode/dev-nosan caffe2/torch/fb/distributed/pytorch/tests:test_rpc
```

```
buck test mode/dev-nosan //caffe2/torch/fb/distributed/pytorch/tests:test_rpc_thrift_rpc_agent
```

### Dist autograd

```
buck test mode/dev-nosan caffe2/test:dist_autograd_fork
```

```
buck test mode/dev-nosan caffe2/torch/fb/distributed/thriftRpcBackend/test:thrift_dist_autograd_fork_test
```

Differential Revision: D18595578

fbshipit-source-id: 616fca3b844c171ed5277bbc6a2b1693bc3a8065
2019-11-19 18:52:30 -08:00
Edward Yang
1dda8186ae Revert D18549919: Add RpcAgentOptions struct type, which bundles different required arguments for different RpcAgents
Test Plan: revert-hammer

Differential Revision:
D18549919

Original commit changeset: b9f3f1a41d1f

fbshipit-source-id: 2d5e578d18c0725b59eb99a0e942fbf7fe3341ee
2019-11-19 08:14:40 -08:00
Rohan Varma
83513506c3 poll for timed out futures in process group agent (#29601)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29601

Follow up from https://github.com/pytorch/pytorch/pull/28392. Adds a background thread to `ProcessGroupAgent` that polls for timed out RPCs at a pre-set interval, and marks them as completed with a timeout exception if they have timed out. Also deletes the futures from the corresponding maps `futures_` and `futureTimeouts`. Unit tests are added to ensure that timed out RPCs are appropriately cleaned up.

Also adds a `shutdown` variable to process group agent to control the shutting down of this background thread, which can eventually be extended to use for controlling a clean shutdown of process group agent.
ghstack-source-id: 94175131

Test Plan: Added unit tests

Differential Revision: D18434215

fbshipit-source-id: c48abdb8759fe1447200ec66bb9d4b1c50ec4535
2019-11-19 06:42:04 -08:00
Shihao Xu
21dc1d4543 Add RpcAgentOptions struct type, which bundles different required arguments for different RpcAgents (#29972)
Summary:
https://github.com/pytorch/pytorch/pull/28226 introduced `worker_to_id` arg to the `def init_rpc` function for other `RpcAgent`. While it's not really used by `ProcessGroupAgent`. Cleanup is wanted for this, as described in https://github.com/pytorch/pytorch/issues/29031.

To adapt to the difference of different `RpcAgent`, adding a `RpcAgentOptions` base classes, which allow leveraging inheritance to add extra fields.

closes https://github.com/pytorch/pytorch/issues/29031
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29972

Differential Revision: D18549919

Pulled By: xush6528

fbshipit-source-id: b9f3f1a41d1ff18498734081870820b055d56f5b
2019-11-19 01:00:08 -08:00
Alisson Gusatti Azzolini
97156f548d Add hash and equality operators for WorkerInfo (#29958)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29958

DistributedOptimizer relies on hashing WorkerInfo in order to coalesce fan-out RPCs. This will likely be a very common use case (EASGD will do the same, for example).
ghstack-source-id: 94169198

Test Plan: unit test.

Differential Revision: D18548257

fbshipit-source-id: 7d67d4e1b9bc60403c372164982a75ae8c1d8389
2019-11-18 20:47:13 -08:00
Rohan Varma
371da6acef move get_rpc_timeout to pybind (#29765)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29765

instead of wrapping this C++ function with python that causes
unnecessary overhead, we can move this to pybind and use the `DefaultRpcAgent`
to get the timeout.
ghstack-source-id: 93879236

Test Plan: unit tests pass

Differential Revision: D18493195

fbshipit-source-id: fd0f1f13ee15acb5ea1ae7c696925c9b54304f6d
2019-11-14 19:39:22 -08:00
Rohan Varma
06ef4a757d Add docs for RPC, dist autograd, and RRef modules (#29276)
Summary:
Closes https://github.com/pytorch/pytorch/issues/28983. Documentation for `torch.distributed.rpc` and `torch.distributed.autograd` modules. Also fixes/tidies up some of the docstrings in rpc/autograd, and moves some functions to be private so they don't show up in the documentation.

Note: Much of the text to describe/explain the RPC/RRef layers are taken from the following RFCs: https://github.com/pytorch/pytorch/issues/23110, https://github.com/pytorch/pytorch/issues/26759
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29276

Differential Revision: D18478754

Pulled By: rohan-varma

fbshipit-source-id: e9a7089baf5275304e5408d319eb9bf98e53fff8
2019-11-14 14:32:03 -08:00
Alisson Gusatti Azzolini
93b5c9d723 Allow to create local RRef with value (#28948)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28948

Add the constructor RRef(value) in python. This allows to wrap a local object with RRef an pass or return this RRef to users.
This enables returning, for example, a list of RRefs containing the parameters of a module to the user of the module.
ghstack-source-id: 93565010

Test Plan: unit test.

Differential Revision: D18241227

fbshipit-source-id: b9e9b958f40623348d62ee6fc9e7f0414b4215b7
2019-11-11 12:19:45 -08:00
Shen Li
63675b1969 Revert RRef.to_here()/local_value() return type (#29396)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29396

The return types of RRef.to_here()/local_value() were recently
changed to Future, which triggers flakiness as the RRef could be
deleted before the future.wait() finishes. While we are still
discussing how we'd like to solve it, this commit reverts the
return type to stop bleeding in tests.

closes #28885

Test Plan: Imported from OSS

Differential Revision: D18375571

Pulled By: mrshenli

fbshipit-source-id: 354dbf38b15ab804e44fc9968dd30888415c1fab
2019-11-08 08:31:18 -08:00
Shihao Xu
e66626ae5c Lift rpc_timeout to RpcAgent, for other RpcAgents to reuse. (#29341)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29341

So that other RpcAgent could use this timeout setting as well.

ghstack-source-id: 93481902

Differential Revision: D5681951

fbshipit-source-id: 569c768dc342e8a2d9faf142ceccf696e12e41dc
2019-11-07 17:05:45 -08:00
Pieter Noordhuis
b4df413712 Scope pybind11 functions to torch.distributed.{autograd,rpc}
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27529

Test Plan: Imported from OSS

Differential Revision: D17808209

Pulled By: pietern

fbshipit-source-id: 1e3e086085167320c3fc369467f5d75ce39fa4ea
2019-11-05 06:25:22 -08:00
Rohan Varma
fd0f9811ad add timeout for RPC futures, and ability to set timeout when initializing rpc (#28392)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28392

Per #25531, we want to clean up futures when we detect that there are
failures/timeouts. As a first step, this diff adds timers to the future object,
provides functionality to check if a future is timed out, and allows
specification of the timeout when initializing rpc. A future diff will check for these timeouts and mark the future completed with an exception indicating that it has timed out.
ghstack-source-id: 93192622

Test Plan: Added unit tests.

Differential Revision: D18025163

fbshipit-source-id: 195fb50c736caf5c7b2bada9a5f6116bb106ed33
2019-11-04 14:43:03 -08:00
Shen Li
e31adeb4f3 Make RRef::LocalValue return Future (#28025)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28025

Add a PyFuture type which is wrapper of either an OwnerRRef or a
jit::Future. The difference between PyFuture and jit::Future is that
PyFuture can return an custom py::object type.

Test Plan: Imported from OSS

Differential Revision: D17936746

Pulled By: mrshenli

fbshipit-source-id: a7451af3993d98aeab462ffd5318fc6d28f915c8
2019-10-23 17:07:16 -07:00
Yanli Zhao
3214f134b6 fix python rpc handler exit crash (#27251)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27251

 Explicitly clean up py::objects to avoid segment faults when py::objects with CPython are cleaned up later at program exit.

See similar issues reported https://github.com/pybind/pybind11/issues/1598
and https://github.com/pybind/pybind11/issues/1493.

Our local tests also caught this segment faults if py::objects are cleaned
up at program exit. The explaination is: CPython cleans up most critical
utitlies before cleaning up PythonRpcHandler singleton, so when
PythonRpcHandler signleton cleans up py::objects and call dec_ref(), it
will crash.

The solution is to clean up py::objects earlier when Rpc agent join().
Be note that py::objects can not be cleaned up when Rpc agent is destroyed
as well, as Rpc agent is global variable and it will have same issue as
PythonRpcHandler.

close #27182
ghstack-source-id: 92035069

Test Plan: unit tests on python 3.6 and python 3.5

Differential Revision: D17727362

fbshipit-source-id: c254023f6a85acce35528ba756a4efabba9a519f
2019-10-16 16:57:38 -07:00
Shen Li
59cd0faeff Defer pg agent listener thread until contexts are initialized (#28013)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28013

ProcessGroupAgent currently kicks off the listener thread in its
constructor. However, serving requests requires contexts to be
initialized, e.g., RRefContext and agent_ global var in api.py,
which might not be done yet when the first request arrives.
ProcessGroupAgent does not know what would be the appropriate time
to start the listener thread, hence exposing an API for higher
layer code to explicitly start listeners.

Test Plan: Imported from OSS

Differential Revision: D17932271

Pulled By: mrshenli

fbshipit-source-id: 3b408477594d4d19319e7cd08dd6f383a7ed7670
2019-10-15 17:45:43 -07:00
Pritam Damania
3bccd3fc0d Distributed Autograd - FAST mode backward pass implementation. (#27022)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27022

This change implements the "FAST" mode distributed autograd backward
pass as described in https://github.com/pytorch/pytorch/issues/23110.

At a high level the backward pass works as follows:
1. We start by computing dependencies on the node that calls
`torch.distributed.backward`.
2. This node computes the dependencies starting from the root nodes provided in
the backward call and all the 'send' functions present in the current autograd
context. The "FAST" mode assumes all 'send' functions are part of the autograd
computation.
3. Once the dependency computation is done, the distributed autograd engine
calls the local autograd engine to execute the autograd graph. Note that the
autograd graph on a single node is not necessarily connected because of
inter-node communication. As a result, we have special handling to ensure the
local autograd engine ensures we execute the entire graph starting from the
provided roots and all 'send' functions on the node.
4. When the local autograd engine hits a 'recv' function, it performs an async
RPC to send the gradients over to the appropriate node and stores a future in
the autograd context to keep track of this RPC.
5. On the destination node, the appropriate 'send' function is looked up and
enqueued on the local autograd engine. If this is the first time the node is
hearing about this autograd context id on the backward pass, then the node
computes dependencies for the local autograd engine.
6. As part of compute dependencies, the distributed autograd engine discovers
all leaf nodes and ensures those are passed as 'outputs' to the local autograd
engine. This avoids running the 'AccumulateGrad' function.
7. The gradients computed for the leaf nodes are then actually accumulated in
`DistAutogradContext` for the appropriate autograd context id.
8. The distributed autograd engine waits for the local autograd engine
to complete and also waits for all the 'Futures' (stored in 4.) for respective
RPCs to finish.

We have made the following changes to the local autograd engine for this
purpose:

1. Expose GraphTask and NodeTask so that the distributed autograd engine can
use them.
2. Expose a `execute_with_graph_task` API which gives the distributed engine
to build a GraphTask and pass it to the local autograd engine.
3. Expose a `enqueue_on_cpu` API, which allows the distributed engine to build
a `NodeTask` for a 'send' function and enqueue it on the local autograd engine.

In addition to this a few general improvements:
1. Added a `PropagateGradients` RPC call for the 'recv' function to pass
gradients to the appropriate node during the backward pass.
2. Use IValues as much as possible in serialization for RpcWithAutograd.
3. If Future.wait(), contains a message type EXCEPTION, we throw an appropriate
exception instead of just returning the message. This is inline with what most
Future.wait() APIs do.
4. Added a `get_gradients(context_id)` API which allows users to retrieve a map
from Tensor to respective gradient for the provided context_id on the local
node.
ghstack-source-id: 91794926

Test Plan: unit tests.

Differential Revision: D17652615

fbshipit-source-id: 96f65c52adb2706ee29f4b49e1655afaa0a3bec3
2019-10-12 09:47:49 -07:00
Shen Li
2486b0ba82 Add Python RRef as args and return value (#25499)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25499

See #23110 for model parallel design details, and #26759 for the RRef
protocol. This commit add support for using RRef as Python UDF arguments
and return value. RRefs can now be shared from owner to user, from user to
owner, or from user to user.

Limitations:
1. No implicit type conversion yet. (#27099)
2. No failure handling and retry. (#26116)
3. UDF is not yet blocked until all RRefs are confirmed. (#27098)
4. Internal RRef control messages are not idempotent yet. (#26116)
5. Cannot delete RRefs correctly when there are circular dependencies. (#27096)

Main changes:

1. Added `SCRIPT_REMOTE_CALL` and `PYTHON_REMOTE_CALL` to `Message.h` to represent `dist.remote` invocations.
2. Added `SCRIPT_RREF_FETCH_CALL`, `PYTHON_RREF_FETCH_CALL`, `RREF_USER_ACCEPT`, `RREF_USER_DELETE`, `RREF_CHILD_ACCEPT`, and `RREF_FORK_REQUEST` to `Message.h` as internal RRef control messages.
3. New message request handling code is added to `functions.cpp`, and message format is added in `script_remote_call.h`, `python_remote_call.h`, and `rref_proto.h`.
4. Added a `PyRRef` type in `py_rref.h` and `py_rref.cpp` which holds a shared pointer to C++ `RRef` type. `PyRRef` wraps the C++ API and also implements RRef pickling and unpickling. RRef fork related control messages will be sent during RRef pickling/unpickling procedure.
5.  Update `RRef.h` and `RRef.cpp` accordingly to support `py::object` RRefs.
6. RRef context (reference count, etc.) are tracked in `rref_context.h` and `rref_context.cpp`.

Test Plan:
Imported from OSS

buck test mode/dev-nosan //caffe2/test:rpc_fork

Differential Revision: D17184146

Pulled By: mrshenli

fbshipit-source-id: a3a268efc087ac1ef489136ab957080382629265
2019-10-03 17:47:12 -07:00
Pritam Damania
fe4170bda8 Add send and recv backward functions for builtin operators RPC. (#25527)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25527

Master GH issue: https://github.com/pytorch/pytorch/issues/23110.

This change builds upon https://github.com/pytorch/pytorch/pull/24876 and
provides all the autograd hooks needed for a forward pass with distributed rpc
for builtin operators. This change does not address distributed rpc for python
UDFs and that will be addressed in follow up PRs.

Summary of changes:
1. Attach send autograd functions when a request is sent from the client and
response is sent from the server.
2. Attach receive autograd functions when a request is received on the server
and a response is received on the client.
3. Generate a globally unique autograd_message_id for each send/recv autograd
function pair to uniquely identify them.
ghstack-source-id: 91240466

Test Plan: unit tests.

Differential Revision: D17148077

fbshipit-source-id: 192d8a3f552ed7cc939f55dcca332965c9bd3233
2019-10-03 01:18:46 -07:00
Yanli Zhao
631e2ee7a4 make python udf serialization format to be binary plus tensor tables (#27136)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27136

make python udf serialization format to be binary plus tensor tables, so that tensors can be attached to autograd graph, handled in the same way as builtin operators
ghstack-source-id: 91156141

Test Plan: unit tests

Reviewed By: pritamdamania87

Differential Revision: D17405686

fbshipit-source-id: 4a8c9804f6ad239eb0655fa5daeb54580d4741fd
2019-10-02 00:10:32 -07:00
Rohan Varma
1f51051287 remove extra get_worker_id call in distributed rpc init (#26381)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26381

Was looking through this definition and saw that it has 2 identical
definitions of get_worker_id. Tested by ensuring that all tests in
`test/test_rpc.py` still pass.
ghstack-source-id: 90347452

Test Plan: See above

Differential Revision: D17439495

fbshipit-source-id: 9a78340f7aefa5797e0ae837fbcfe24ebe3a775d
2019-09-18 16:34:54 -07:00
Shen Li
197fd4f707 Adding RRef as return value for builtin operators (#25169)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25169

See #23110 for RRef design details. This commit only implements
RRef as return value for builtin operators, and RRef will communicate
between a user and the owner. More specifically, a RRef is first
created on the `dist.remote` caller, which is a user of the RRef.
Then the RRef user sends and notification to the owner to report
the fork to the owner, and the owner uses a shared_ptr to keep
the RRef alive. When the user RRef is destructed on the caller,
another notification will be sent to the owner, and the owner
can then drop it's RRef as well.

Test Plan: Imported from OSS

Differential Revision: D17048343

Pulled By: mrshenli

fbshipit-source-id: 9dd3b3d0e4fd214c76fecdbed746a6d3029b3efd
2019-09-05 15:14:17 -07:00
Pieter Noordhuis
5407241b4f Run clang-format on torch/csrc/distributed (#25647)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25647

TSIA

Test Plan: N/A

Differential Revision: D17182909

fbshipit-source-id: 22a6554693def0032a051cef5fe788f49de1d740
2019-09-04 10:08:09 -07:00
Shen Li
c881136215 Move worker name collection code from Python to C++ (#24260)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24260

This also simplifies ProcessGroupAgent constructor signature.

Test Plan: Imported from OSS

Differential Revision: D16789219

Pulled By: mrshenli

fbshipit-source-id: bbb69022435467fbb1c28da21dd03d3ab52fc521
2019-08-31 19:02:45 -07:00
Shen Li
1294e55c15 Assign each RpcAgent a unique ID, and use ID for sending RPC messages. (#24195)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24195

It is not efficient to use a string destination name in every
send. Moreover, when we add RRef later, RpcAgent will frequently check
RRef ownership. It will be slow as well if we have to go though string
comparison every time. This commit assigns each RpcAgent a unique
integer ID. In the Python send API, applications can provide either
destination name or id. If it is a string name, it will be converted to
id by calling the get_id(workerName) API.

Test Plan: Imported from OSS

Differential Revision: D16770241

Pulled By: mrshenli

fbshipit-source-id: fa56128a77a02a402dc6682474bc301dc1b7f43d
2019-08-29 19:19:11 -07:00
Pritam Damania
7818e7e5d4 Basic framework for Distributed Autograd context. (#24875)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24875

As per https://github.com/pytorch/pytorch/issues/23110, each autograd pass
would be assigned a unique autograd_context_id. In this change we introduce a
DistAutogradContainer per worker which holds information for each autograd pass
currently running.

DistAutogradContainer has a map from the autograd_context_id to
DistAutogradContext (which holds all the relevant information for the autograd
pass). DistAutogradContext currently only stores the autograd_context_id and
more information would be added to it later as we build out the rest of the
framework.

The autograd_context_id is a 64 bit globally unique integer where the first 16
bits are the worker_id and next 48 bits are auto-incrementing for uniqueness.

Sample python code on how this would be used for distributed autograd:

```
import torch.distributed.autograd as dist_autograd
worker_id = 0
dist_autograd.init(worker_id)
with dist_autograd.context() as context_id:
     # forward pass...
     # backward pass...
     # optimizer step...
```
ghstack-source-id: 89119248

Test Plan: unit tests.

Differential Revision: D16356694

fbshipit-source-id: d1a8678da0c2af611758dbb5d624d554212330ce
2019-08-28 18:51:56 -07:00
Shen Li
b6803d62fd Use snake names for all files in distributed.rpc (#24502)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24502

Files in distributed.rpc package mixes snake camel names. This
commit cleans that up and all files use snake names now.
ghstack-source-id: 88548990

Reviewed By: xush6528

Differential Revision: D16860155

fbshipit-source-id: 3a22a89bf6c4e11aac5849564fc53296a04d6a8b
2019-08-19 10:58:59 -07:00
Shen Li
99dea08e60 Use c10::ThreadPool to send and receive messages (#23968)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23968

Existing ProcessGroupAgent uses a single thread to send all messages, and
a single thread to listen and process all received messages. This causes
both performance issues and also prevents nested RPCs. For example, when
running nested RPC A->B->A->B, the second recv on B cannot start until
the first recv on B finishes. If the second recv is triggered by a nested
RPC in the first recv, it will deadlock. Ideally, we should expose sth like
responder or FutureResult to the Python land to support nested asynchronous
UDFs.

This diff adds a shared ThreadPool for send and recv. Send use it do send
out messages, and recv use it to process received messages. There is still
a dedicated thread to listen for incoming messages and add it to task queue.
There are two goals: 1) speed up ProcessGroupAgent 2) use ThreadPool as a
temporary solution for (a small number of) nested RPCs

ghstack-source-id: 88476246

Differential Revision: D16695091

fbshipit-source-id: fd18a5c65e7fcd1331b73d1287673e6e10d2dd86
2019-08-16 17:49:05 -07:00
Yanli Zhao
ab39a55331 python udf over rpc (#23569)
Summary:
This diff is to support python user defined function over rpc for https://github.com/pytorch/pytorch/issues/23110, work flow is like this:
1. pickle python udf
2. pass pickle to C++
3. C++ pass over rpc from client to server
4. server call runPythonUDF() python function to unpickle and run python udf and pickle the udf result using python embedder
6. pass back serialized result from server to client
7. client call loadPythonUDFResult() python function to unpickle result
7. return it to python

right now, put rpc_sync_builtin() and rpc_async_builtin() as temporary interfaces for builtin operator remote calls, they accept qualified name string, this interface can execute builtin operators in C++ land.

rpc_sync() and rpc_async() accept python callables only right now, it could be user define python functions or builtin operator python functions, the python functions will be executed in python land.

once we can resolve builtin operator python callables to qualified name string, we can merge rpc_sync_builtin() into rpc_sync() then
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23569

Test Plan: unit tests

Differential Revision: D16390764

Pulled By: zhaojuanmao

fbshipit-source-id: 2cf2c22a979646830b5581bd75eabf8b3cca564c
2019-08-14 23:13:33 -07:00
Shen Li
8b349073ce sync and async torch.distributed.rpc for builtin operators (#23228)
Summary:
Features:

* sync and async RPC for builtin operators
* RpcAgent API
* ProcessGroupAgent implementation

Goal:

* have a minimum working and testable RPC implementation
* make sure the RpcAgent API is sufficient for future ThriftAgent and TensorPipeAgent implementation
  * For tensor pipe implementation, it might allocate multiple underlying communication channels with different types, and might also use streaming serialization/deserialization for large tensors. To support this requirement, the current implementation only convert a BuiltinOp into a Message which contains a byte vector and a tensor table. It is up to the RpcAgent implementation to determine how it would like to serialize a Message object.
  * For ThriftAgent, as Thrift has it own request/response matching solution, the Message.id is no longer necessary. Hence the id can be dropped during serialization. All it needs to do is to pass the response Message object to the Future returned by send(...).
* support blocking and non-blocking RequestCallback
  * blocking means the callback won't return before sending out the response
  * non-blocking can be achieved by enqueue the `(from, request, RpcAgent&)` tuple and use a different thread to process them. That is why there is an `RpcAgent&` arg in the param list.

We are not exporting this diff until we finalize distributed autograd design and publish the API review publicly.

https://fb.quip.com/FabTAZKVgQpf

Pull Request resolved: https://github.com/pytorch/pytorch/pull/23228
ghstack-source-id: 87816717

Reviewed By: zhaojuanmao

Differential Revision: D15194693

fbshipit-source-id: 7adb600796613cde6073db6c227451b89940ecaf
2019-08-06 16:03:01 -07:00