Commit Graph

12 Commits

Author SHA1 Message Date
Hong Xu
987e37b9c2 Enable EXE001 flake8 check. (#27560)
Summary:
According to https://github.com/pytorch/pytorch/issues/27285 , seems we do not intend to use shebang as an indication of Python version, thus
we enable EXE001 flake8 check.
For violations, we either remove shebang from non-executable Python scripts or grant them executable permission.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27560

Differential Revision: D17831782

Pulled By: ezyang

fbshipit-source-id: 6282fd3617b25676a6d959af0d318faf05c09b26
2019-10-09 09:15:29 -07:00
Pieter Noordhuis
b4ce922b58 Move RPC API to torch.distributed.rpc
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27290

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D17808212

Pulled By: pietern

fbshipit-source-id: c79907940fe4888b2ceaaa1cda0078e39c89b454
2019-10-08 11:31:25 -07:00
Pieter Noordhuis
a6d26ce135 Move internal functions to torch.distributed.rpc
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27289

Test Plan: Imported from OSS

Differential Revision: D17808214

Pulled By: pietern

fbshipit-source-id: 4c453028e431c3e951d439784017ef07037ba1a9
2019-10-08 11:31:20 -07:00
Pieter Noordhuis
14f1629c4d Move RPC backend registry to torch.distributed.rpc
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27288

Test Plan: Imported from OSS

Differential Revision: D17808215

Pulled By: pietern

fbshipit-source-id: 489c031e02cd3141a861cf7ec2273aaa4c55b7d7
2019-10-08 11:31:16 -07:00
Shihao Xu
e166bcbbde Make RpcTest re-usable by other RPC backends by using init_method to initialize a RPC backend (#27320)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27320

https://github.com/pytorch/pytorch/pull/27208/

# Problem

Other RPC backends take init_method.

# Solution

Set up init_method in rpc tests.
ghstack-source-id: 91335127

Differential Revision: D17709219

fbshipit-source-id: 3184c6e9b922a6ff9f4d1cb9abfa118b23f43eeb
2019-10-04 09:20:05 -07:00
Shihao Xu
86a8971ebb Add a test case to RpcTest, check src/dst (#27322)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27322

# Problem

Existing test cases are too symmetric, so that didn't detect this error, request sent to the wrong worker.

Because of wrong `worker_names` setup, worker0 sends request to itself, while it should had sent to worker1.

# Solution

Add a test case, letting the dst side to check if it's an request from the expected src.
ghstack-source-id: 91299312

Reviewed By: satgera

Differential Revision: D17069062

fbshipit-source-id: ef7a532dd497bfc0f0ee8446fcd5d29656aaf175
2019-10-03 18:59:59 -07:00
Shen Li
2486b0ba82 Add Python RRef as args and return value (#25499)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25499

See #23110 for model parallel design details, and #26759 for the RRef
protocol. This commit add support for using RRef as Python UDF arguments
and return value. RRefs can now be shared from owner to user, from user to
owner, or from user to user.

Limitations:
1. No implicit type conversion yet. (#27099)
2. No failure handling and retry. (#26116)
3. UDF is not yet blocked until all RRefs are confirmed. (#27098)
4. Internal RRef control messages are not idempotent yet. (#26116)
5. Cannot delete RRefs correctly when there are circular dependencies. (#27096)

Main changes:

1. Added `SCRIPT_REMOTE_CALL` and `PYTHON_REMOTE_CALL` to `Message.h` to represent `dist.remote` invocations.
2. Added `SCRIPT_RREF_FETCH_CALL`, `PYTHON_RREF_FETCH_CALL`, `RREF_USER_ACCEPT`, `RREF_USER_DELETE`, `RREF_CHILD_ACCEPT`, and `RREF_FORK_REQUEST` to `Message.h` as internal RRef control messages.
3. New message request handling code is added to `functions.cpp`, and message format is added in `script_remote_call.h`, `python_remote_call.h`, and `rref_proto.h`.
4. Added a `PyRRef` type in `py_rref.h` and `py_rref.cpp` which holds a shared pointer to C++ `RRef` type. `PyRRef` wraps the C++ API and also implements RRef pickling and unpickling. RRef fork related control messages will be sent during RRef pickling/unpickling procedure.
5.  Update `RRef.h` and `RRef.cpp` accordingly to support `py::object` RRefs.
6. RRef context (reference count, etc.) are tracked in `rref_context.h` and `rref_context.cpp`.

Test Plan:
Imported from OSS

buck test mode/dev-nosan //caffe2/test:rpc_fork

Differential Revision: D17184146

Pulled By: mrshenli

fbshipit-source-id: a3a268efc087ac1ef489136ab957080382629265
2019-10-03 17:47:12 -07:00
Pritam Damania
fe4170bda8 Add send and recv backward functions for builtin operators RPC. (#25527)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25527

Master GH issue: https://github.com/pytorch/pytorch/issues/23110.

This change builds upon https://github.com/pytorch/pytorch/pull/24876 and
provides all the autograd hooks needed for a forward pass with distributed rpc
for builtin operators. This change does not address distributed rpc for python
UDFs and that will be addressed in follow up PRs.

Summary of changes:
1. Attach send autograd functions when a request is sent from the client and
response is sent from the server.
2. Attach receive autograd functions when a request is received on the server
and a response is received on the client.
3. Generate a globally unique autograd_message_id for each send/recv autograd
function pair to uniquely identify them.
ghstack-source-id: 91240466

Test Plan: unit tests.

Differential Revision: D17148077

fbshipit-source-id: 192d8a3f552ed7cc939f55dcca332965c9bd3233
2019-10-03 01:18:46 -07:00
Shen Li
2491dd50ee Add ProcessGroupAgent termination detection algorithm (#26984)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26984

closes #26944

In the existing implementation, each worker exits when it sees
no send/recv tasks. However, as we adding support for nested calls,
one RPC could trigger more RPCs in the UDF or in the response
callback. As a result, even if the worker does not see any send/recv
tasks for now, it does not mean there won't be any in the future.

In this commit, we added a counters for all sent and received
messages between each pair of nodes, and then use allgather to collect
those counters, i.e., all workers would have the same view on the
global states. The workers would only exit when all sends are
received and processed.

Test Plan: Imported from OSS

Differential Revision: D17633456

Pulled By: mrshenli

fbshipit-source-id: 813a155d3b2daf2226612eb17f6c698512e9beca
2019-10-02 13:15:03 -07:00
Yanli Zhao
631e2ee7a4 make python udf serialization format to be binary plus tensor tables (#27136)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27136

make python udf serialization format to be binary plus tensor tables, so that tensors can be attached to autograd graph, handled in the same way as builtin operators
ghstack-source-id: 91156141

Test Plan: unit tests

Reviewed By: pritamdamania87

Differential Revision: D17405686

fbshipit-source-id: 4a8c9804f6ad239eb0655fa5daeb54580d4741fd
2019-10-02 00:10:32 -07:00
Shihao Xu
00e588290b Add test case for init_rpc_backend (#26997)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26997

Reverting accidental change in https://github.com/pytorch/pytorch/pull/26919
ghstack-source-id: 91126906

Reviewed By: zhaojuanmao

Differential Revision: D17637468

fbshipit-source-id: 9ffcf4b15b37effe6b5d5f82338ff89298c82a52
2019-10-01 15:44:34 -07:00
Yanli Zhao
1d2d59dd79 make rpc and dist-autograd multiprocess test to use both fork and spawn (#25656)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25656

spawn multiprocessing can catch some issues that fork multiprocessing can not
catch, meanwhile fork can work properly with asan tests, but spawn
multiprocessing can not work with asan tests for some use cases right now.

so this diff adding support to launch both spawn and fork tests in
multiProcessingTestCase class, also let test_rpc and test_dist_autograd to run
both spawn and fork tests
ghstack-source-id: 91096705

Test Plan: unit tests

Reviewed By: xush6528

Differential Revision: D17086007

fbshipit-source-id: af2446e7abe948c37081cff24ed060fd87f84922
2019-10-01 11:15:22 -07:00