Commit Graph

67 Commits

Shihao Xu
937e3f1db4 Enable RRef tests for other RPCAgents
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27789

Differential Revision: D5444828

fbshipit-source-id: a2fa5a603e4b2970755bc5d16f6b2c84d65b0811
2019-10-14 17:42:23 -07:00
Rohan Varma
b5e0fd4c56 add known worker ids to distributed autograd context (#26324)
Summary:
Per https://github.com/pytorch/pytorch/issues/25525, we want to clean up the distributed autograd context on all nodes, in addition to the local one. To do this, we want to send async RPCs to the other nodes telling them to clean up the context.

The first step for this is for a node's context to know about the other workers. This PR does two things:

1) Adds the necessary data structures and getter functions to `DistAutogradContext`
2) Refactors calls to `addSendRpcBackward` to take in the `worker_id` as an additional argument
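
A minimal Python sketch of the idea, with hypothetical names (the real change is in the C++ `DistAutogradContext`): the context records the id of every worker it has sent an RPC to, so cleanup messages can later be fanned out to exactly those workers.

```python
# Hedged sketch only; the actual implementation lives in C++
# (DistAutogradContext). All names here are hypothetical.
class DistAutogradContextSketch:
    def __init__(self, context_id):
        self.context_id = context_id
        self._known_worker_ids = set()  # workers involved in this context

    def add_known_worker_id(self, worker_id):
        # called wherever addSendRpcBackward records an outgoing RPC
        self._known_worker_ids.add(worker_id)

    def get_known_worker_ids(self):
        # later used to fan out async "clean up this context" RPCs
        return frozenset(self._known_worker_ids)
```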
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26324

Differential Revision: D17769411

Pulled By: rohan-varma

fbshipit-source-id: b7327d1209a574e2e88cb197edff3103024d51ad
2019-10-14 10:43:09 -07:00
Pritam Damania
3bccd3fc0d Distributed Autograd - FAST mode backward pass implementation. (#27022)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27022

This change implements the "FAST" mode distributed autograd backward
pass as described in https://github.com/pytorch/pytorch/issues/23110.

At a high level the backward pass works as follows:
1. We start by computing dependencies on the node that calls
`torch.distributed.backward`.
2. This node computes the dependencies starting from the root nodes provided in
the backward call and all the 'send' functions present in the current autograd
context. The "FAST" mode assumes all 'send' functions are part of the autograd
computation.
3. Once the dependency computation is done, the distributed autograd engine
calls the local autograd engine to execute the autograd graph. Note that the
autograd graph on a single node is not necessarily connected because of
inter-node communication. As a result, we have special handling to ensure the
local autograd engine executes the entire graph starting from the
provided roots and all 'send' functions on the node.
4. When the local autograd engine hits a 'recv' function, it performs an async
RPC to send the gradients over to the appropriate node and stores a future in
the autograd context to keep track of this RPC.
5. On the destination node, the appropriate 'send' function is looked up and
enqueued on the local autograd engine. If this is the first time the node is
hearing about this autograd context id on the backward pass, then the node
computes dependencies for the local autograd engine.
6. As part of computing dependencies, the distributed autograd engine discovers
all leaf nodes and ensures those are passed as 'outputs' to the local autograd
engine. This avoids running the 'AccumulateGrad' function.
7. The gradients computed for the leaf nodes are then actually accumulated in
`DistAutogradContext` for the appropriate autograd context id.
8. The distributed autograd engine waits for the local autograd engine
to complete and also waits for all the 'Futures' (stored in step 4) for the
respective RPCs to finish.
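
A hedged end-to-end sketch of how this backward pass is driven from Python. The module path and exact signatures follow the design in https://github.com/pytorch/pytorch/issues/23110 and the APIs that later stabilized; they may differ slightly from this commit.

```python
import torch
import torch.distributed.autograd as dist_autograd
import torch.distributed.rpc as rpc

# assumes rpc.init_rpc(...) has already been called on every worker
with dist_autograd.context() as context_id:
    t1 = torch.rand((3, 3), requires_grad=True)
    t2 = torch.rand((3, 3), requires_grad=True)
    # the forward pass attaches 'send'/'recv' functions across workers
    loss = rpc.rpc_sync("worker1", torch.add, args=(t1, t2)).sum()
    # FAST-mode backward: dependencies are computed from the roots and all
    # 'send' functions in the current autograd context (steps 1-8 above)
    dist_autograd.backward(context_id, [loss])
    # gradients accumulate per-context rather than in Tensor.grad
    grads = dist_autograd.get_gradients(context_id)
```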

We have made the following changes to the local autograd engine for this
purpose:

1. Expose GraphTask and NodeTask so that the distributed autograd engine can
use them.
2. Expose an `execute_with_graph_task` API which allows the distributed engine
to build a GraphTask and pass it to the local autograd engine.
3. Expose an `enqueue_on_cpu` API, which allows the distributed engine to build
a `NodeTask` for a 'send' function and enqueue it on the local autograd engine.

In addition to this a few general improvements:
1. Added a `PropagateGradients` RPC call for the 'recv' function to pass
gradients to the appropriate node during the backward pass.
2. Use IValues as much as possible in serialization for RpcWithAutograd.
3. If Future.wait() receives a message of type EXCEPTION, we throw an appropriate
exception instead of just returning the message. This is in line with what most
Future.wait() APIs do.
4. Added a `get_gradients(context_id)` API which allows users to retrieve a map
from Tensor to respective gradient for the provided context_id on the local
node.
ghstack-source-id: 91794926

Test Plan: unit tests.

Differential Revision: D17652615

fbshipit-source-id: 96f65c52adb2706ee29f4b49e1655afaa0a3bec3
2019-10-12 09:47:49 -07:00
Shihao Xu
130127ca59 Rename BACKEND to RPC_BACKEND to separate it from COMMUNICATION_BACKEND (e.g. gloo, nccl) in rpc_test.py (#27792)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27792

Close https://github.com/pytorch/pytorch/issues/27232
ghstack-source-id: 91807741

Differential Revision: D5474297

fbshipit-source-id: 5b230a6857813ec981e5056880abb5859655daa2
2019-10-11 19:49:46 -07:00
Rohan Varma
ccd460d415 use gloo enum instead of hardcoding string (#27652)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27652

Changes "gloo" to dist.Backend.GLOO in rpc_test.py.
ghstack-source-id: 91764460
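
The change itself is small; roughly (hedged sketch based on the commit description):

```python
import torch.distributed as dist

# before: backend = "gloo"  (hardcoded string)
# after: use the enum-like constant from torch.distributed instead
backend = dist.Backend.GLOO
```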

Test Plan: python test/test_rpc_fork.py && python test/test_rpc_spawn.py

Differential Revision: D17845067

fbshipit-source-id: b220d3672d1e0b237da474276663d157230a4fdb
2019-10-11 19:06:23 -07:00
Hong Xu
987e37b9c2 Enable EXE001 flake8 check. (#27560)
Summary:
According to https://github.com/pytorch/pytorch/issues/27285, it seems we do not intend to use shebangs as an indication of the Python version, so
we enable the EXE001 flake8 check.
For violations, we either remove the shebang from non-executable Python scripts or grant them executable permission.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27560

Differential Revision: D17831782

Pulled By: ezyang

fbshipit-source-id: 6282fd3617b25676a6d959af0d318faf05c09b26
2019-10-09 09:15:29 -07:00
Pieter Noordhuis
b4ce922b58 Move RPC API to torch.distributed.rpc
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27290

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D17808212

Pulled By: pietern

fbshipit-source-id: c79907940fe4888b2ceaaa1cda0078e39c89b454
2019-10-08 11:31:25 -07:00
Pieter Noordhuis
a6d26ce135 Move internal functions to torch.distributed.rpc
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27289

Test Plan: Imported from OSS

Differential Revision: D17808214

Pulled By: pietern

fbshipit-source-id: 4c453028e431c3e951d439784017ef07037ba1a9
2019-10-08 11:31:20 -07:00
Pieter Noordhuis
14f1629c4d Move RPC backend registry to torch.distributed.rpc
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27288

Test Plan: Imported from OSS

Differential Revision: D17808215

Pulled By: pietern

fbshipit-source-id: 489c031e02cd3141a861cf7ec2273aaa4c55b7d7
2019-10-08 11:31:16 -07:00
Shihao Xu
e166bcbbde Make RpcTest re-usable by other RPC backends by using init_method to initialize an RPC backend (#27320)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27320

https://github.com/pytorch/pytorch/pull/27208/

# Problem

Other RPC backends take an init_method.

# Solution

Set up init_method in rpc tests.
ghstack-source-id: 91335127
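
A hedged sketch of the test setup this enables, using the later-stabilized API names (`init_rpc`, `ProcessGroupRpcBackendOptions`); the exact signatures at this commit may differ.

```python
import torch.distributed.rpc as rpc

# each test worker rendezvous via an explicit init_method instead of
# assuming the default env:// initialization
rpc.init_rpc(
    name="worker0",
    rank=0,
    world_size=2,
    rpc_backend_options=rpc.ProcessGroupRpcBackendOptions(
        init_method="tcp://localhost:29500",
    ),
)
```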

Differential Revision: D17709219

fbshipit-source-id: 3184c6e9b922a6ff9f4d1cb9abfa118b23f43eeb
2019-10-04 09:20:05 -07:00
Shihao Xu
86a8971ebb Add a test case to RpcTest, check src/dst (#27322)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27322

# Problem

Existing test cases are too symmetric, so they didn't detect this error: a request sent to the wrong worker.

Because of a wrong `worker_names` setup, worker0 sent a request to itself when it should have sent it to worker1.

# Solution

Add a test case letting the dst side check whether the request came from the expected src.
ghstack-source-id: 91299312
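
A hedged sketch of such a check, with hypothetical helper names: the callee reports which worker actually executed the UDF, and the caller asserts it was the intended dst.

```python
import torch.distributed.rpc as rpc

def report_worker_name():
    # runs on the callee; returns the name of the worker executing the UDF
    return rpc.get_worker_info().name

# on worker0: if worker_names were wired up wrong, this would return
# "worker0" instead of the expected "worker1"
assert rpc.rpc_sync("worker1", report_worker_name) == "worker1"
```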

Reviewed By: satgera

Differential Revision: D17069062

fbshipit-source-id: ef7a532dd497bfc0f0ee8446fcd5d29656aaf175
2019-10-03 18:59:59 -07:00
Shen Li
2486b0ba82 Add Python RRef as args and return value (#25499)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25499

See #23110 for model parallel design details, and #26759 for the RRef
protocol. This commit adds support for using RRefs as Python UDF arguments
and return values. RRefs can now be shared from owner to user, from user to
owner, or from user to user.
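
A hedged usage sketch (signatures follow the design in #23110/#26759 and may differ slightly from this commit): an RRef created on one worker is passed as an argument to a UDF on another, covering the owner-to-user and user-to-user sharing paths.

```python
import torch
import torch.distributed.rpc as rpc

def double_it(rref):
    # user-to-user share: this worker received the RRef as a UDF argument
    return rref.to_here() * 2

# owner-to-user: create a remote value on worker1 and hold an RRef to it
rref = rpc.remote("worker1", torch.add, args=(torch.ones(2), 1))
# pass the RRef on to worker2; pickling it triggers RRef fork messages
result = rpc.rpc_sync("worker2", double_it, args=(rref,))
```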

Limitations:
1. No implicit type conversion yet. (#27099)
2. No failure handling and retry. (#26116)
3. UDF is not yet blocked until all RRefs are confirmed. (#27098)
4. Internal RRef control messages are not idempotent yet. (#26116)
5. Cannot delete RRefs correctly when there are circular dependencies. (#27096)

Main changes:

1. Added `SCRIPT_REMOTE_CALL` and `PYTHON_REMOTE_CALL` to `Message.h` to represent `dist.remote` invocations.
2. Added `SCRIPT_RREF_FETCH_CALL`, `PYTHON_RREF_FETCH_CALL`, `RREF_USER_ACCEPT`, `RREF_USER_DELETE`, `RREF_CHILD_ACCEPT`, and `RREF_FORK_REQUEST` to `Message.h` as internal RRef control messages.
3. New message request handling code is added to `functions.cpp`, and message format is added in `script_remote_call.h`, `python_remote_call.h`, and `rref_proto.h`.
4. Added a `PyRRef` type in `py_rref.h` and `py_rref.cpp` which holds a shared pointer to C++ `RRef` type. `PyRRef` wraps the C++ API and also implements RRef pickling and unpickling. RRef fork related control messages will be sent during RRef pickling/unpickling procedure.
5. Updated `RRef.h` and `RRef.cpp` accordingly to support `py::object` RRefs.
6. RRef context (reference count, etc.) is tracked in `rref_context.h` and `rref_context.cpp`.

Test Plan:
Imported from OSS

buck test mode/dev-nosan //caffe2/test:rpc_fork

Differential Revision: D17184146

Pulled By: mrshenli

fbshipit-source-id: a3a268efc087ac1ef489136ab957080382629265
2019-10-03 17:47:12 -07:00
Pritam Damania
fe4170bda8 Add send and recv backward functions for builtin operators RPC. (#25527)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25527

Master GH issue: https://github.com/pytorch/pytorch/issues/23110.

This change builds upon https://github.com/pytorch/pytorch/pull/24876 and
provides all the autograd hooks needed for a forward pass with distributed rpc
for builtin operators. This change does not address distributed rpc for Python
UDFs; that will be addressed in follow-up PRs.

Summary of changes:
1. Attach send autograd functions when a request is sent from the client and
response is sent from the server.
2. Attach receive autograd functions when a request is received on the server
and a response is received on the client.
3. Generate a globally unique autograd_message_id for each send/recv autograd
function pair to uniquely identify them.
ghstack-source-id: 91240466
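
A hedged Python sketch of one way to mint such globally unique ids (the real implementation is in C++ and may pack the bits differently): combine the worker id with a local monotonically increasing counter.

```python
import itertools

class AutogradMessageIdSketch:
    """Hypothetical illustration, not the actual implementation."""

    def __init__(self, worker_id: int):
        self._worker_id = worker_id
        self._counter = itertools.count()

    def next_id(self) -> int:
        # high bits: worker id (unique per node); low bits: local counter.
        # The combination is globally unique without any coordination.
        return (self._worker_id << 48) | next(self._counter)
```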

Test Plan: unit tests.

Differential Revision: D17148077

fbshipit-source-id: 192d8a3f552ed7cc939f55dcca332965c9bd3233
2019-10-03 01:18:46 -07:00
Shen Li
2491dd50ee Add ProcessGroupAgent termination detection algorithm (#26984)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26984

closes #26944

In the existing implementation, each worker exits when it sees
no send/recv tasks. However, as we add support for nested calls,
one RPC could trigger more RPCs in the UDF or in the response
callback. As a result, even if the worker does not see any send/recv
tasks for now, it does not mean there won't be any in the future.

In this commit, we added counters for all sent and received
messages between each pair of nodes, and then use allgather to collect
those counters so that all workers have the same view of the
global state. Workers only exit when all sent messages have been
received and processed.
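
A hedged single-round sketch of the counting idea (the actual C++ algorithm also has to handle messages still in flight, e.g. by requiring the totals to be stable across consecutive rounds):

```python
import torch
import torch.distributed as dist

def counts_balanced(num_sent: int, num_recv: int) -> bool:
    # allgather the (sent, received) counters so every worker sees the
    # same global view; termination is only safe once every sent
    # message has been received and processed somewhere
    local = torch.tensor([num_sent, num_recv])
    gathered = [torch.zeros_like(local) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, local)
    total_sent = sum(int(t[0]) for t in gathered)
    total_recv = sum(int(t[1]) for t in gathered)
    return total_sent == total_recv
```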

Test Plan: Imported from OSS

Differential Revision: D17633456

Pulled By: mrshenli

fbshipit-source-id: 813a155d3b2daf2226612eb17f6c698512e9beca
2019-10-02 13:15:03 -07:00
Yanli Zhao
631e2ee7a4 make Python UDF serialization format binary plus tensor tables (#27136)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27136

Make the Python UDF serialization format binary plus tensor tables, so that tensors can be attached to the autograd graph and handled in the same way as builtin operators.
ghstack-source-id: 91156141
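
A hedged sketch of the "binary plus tensor table" idea using pickle's persistent-id hook (names hypothetical; the real code paths live in the RPC internals): tensors are pulled out of the pickled byte stream into a side table, where the RPC and autograd layers can see them.

```python
import io
import pickle
import torch

def serialize_udf_payload(obj):
    """Hypothetical illustration of out-of-band tensor tables."""
    tensor_table = []

    class TensorPickler(pickle.Pickler):
        def persistent_id(self, o):
            if isinstance(o, torch.Tensor):
                tensor_table.append(o)        # keep the raw tensor out-of-band
                return len(tensor_table) - 1  # store only its table index
            return None                       # pickle everything else inline

    buf = io.BytesIO()
    TensorPickler(buf).dump(obj)
    return buf.getvalue(), tensor_table
```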

Test Plan: unit tests

Reviewed By: pritamdamania87

Differential Revision: D17405686

fbshipit-source-id: 4a8c9804f6ad239eb0655fa5daeb54580d4741fd
2019-10-02 00:10:32 -07:00
Shihao Xu
00e588290b Add test case for init_rpc_backend (#26997)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26997

Reverting accidental change in https://github.com/pytorch/pytorch/pull/26919
ghstack-source-id: 91126906

Reviewed By: zhaojuanmao

Differential Revision: D17637468

fbshipit-source-id: 9ffcf4b15b37effe6b5d5f82338ff89298c82a52
2019-10-01 15:44:34 -07:00
Yanli Zhao
1d2d59dd79 make rpc and dist-autograd multiprocess tests use both fork and spawn (#25656)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25656

Spawn multiprocessing can catch some issues that fork multiprocessing cannot,
while fork works properly with ASAN tests; spawn multiprocessing cannot yet
work with ASAN tests for some use cases.

So this diff adds support for launching both spawn and fork tests in the
MultiProcessTestCase class, and also lets test_rpc and test_dist_autograd run
both spawn and fork tests.
ghstack-source-id: 91096705
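
A hedged, self-contained sketch of the pattern (hypothetical names; the real change wires this into the shared multiprocess test case class): one test body, parameterized over both start methods.

```python
import multiprocessing as mp
import unittest

def _worker(queue):
    queue.put("pong")  # trivial stand-in for a real rpc/dist_autograd test body

class RpcTestBase:
    START_METHOD = None  # set by the concrete fork/spawn subclasses

    def test_ping(self):
        ctx = mp.get_context(self.START_METHOD)
        queue = ctx.Queue()
        proc = ctx.Process(target=_worker, args=(queue,))
        proc.start()
        self.assertEqual(queue.get(), "pong")
        proc.join()

class RpcTestWithFork(RpcTestBase, unittest.TestCase):
    START_METHOD = "fork"

class RpcTestWithSpawn(RpcTestBase, unittest.TestCase):
    START_METHOD = "spawn"
```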

Test Plan: unit tests

Reviewed By: xush6528

Differential Revision: D17086007

fbshipit-source-id: af2446e7abe948c37081cff24ed060fd87f84922
2019-10-01 11:15:22 -07:00