pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
Shihao Xu	937e3f1db4	Enable RRef tests for other RPCAgent Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27789 Differential Revision: D5444828 fbshipit-source-id: a2fa5a603e4b2970755bc5d16f6b2c84d65b0811	2019-10-14 17:42:23 -07:00
Rohan Varma	b5e0fd4c56	add known worker ids to distributed autograd context (#26324 ) Summary: Per https://github.com/pytorch/pytorch/issues/25525 we want to clean up distributed autograd context on all nodes, in addition to the local one. To do this, we want to send async RPCs to the other nodes telling them to clean up the context. The first step for this is for a node's context to know about the other workers. This PR does two things: 1) Adds the necessary data structures and getter functions to `DistAutogradContext` 2) Refactors calls to `addSendRpcBackward` to take in the `worker_id` as an additional argument Pull Request resolved: https://github.com/pytorch/pytorch/pull/26324 Differential Revision: D17769411 Pulled By: rohan-varma fbshipit-source-id: b7327d1209a574e2e88cb197edff3103024d51ad	2019-10-14 10:43:09 -07:00
Pritam Damania	3bccd3fc0d	Distributed Autograd - FAST mode backward pass implementation. (#27022 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27022 This change implements the "FAST" mode distributed autograd backward pass as described in https://github.com/pytorch/pytorch/issues/23110. At a high level the backward pass works as follows: 1. We start by computing dependencies on the node that calls `torch.distributed.backward`. 2. This node computes the dependencies starting from the root nodes provided in the backward call and all the 'send' functions present in the current autograd context. The "FAST" mode assumes all 'send' functions are part of the autograd computation. 3. Once the dependency computation is done, the distributed autograd engine calls the local autograd engine to execute the autograd graph. Note that the autograd graph on a single node is not necessarily connected because of inter-node communication. As a result, we have special handling to ensure the local autograd engine executes the entire graph starting from the provided roots and all 'send' functions on the node. 4. When the local autograd engine hits a 'recv' function, it performs an async RPC to send the gradients over to the appropriate node and stores a future in the autograd context to keep track of this RPC. 5. On the destination node, the appropriate 'send' function is looked up and enqueued on the local autograd engine. If this is the first time the node is hearing about this autograd context id on the backward pass, then the node computes dependencies for the local autograd engine. 6. As part of compute dependencies, the distributed autograd engine discovers all leaf nodes and ensures those are passed as 'outputs' to the local autograd engine. This avoids running the 'AccumulateGrad' function. 7. The gradients computed for the leaf nodes are then actually accumulated in `DistAutogradContext` for the appropriate autograd context id. 8. The distributed autograd engine waits for the local autograd engine to complete and also waits for all the 'Futures' (stored in 4.) for respective RPCs to finish. We have made the following changes to the local autograd engine for this purpose: 1. Expose GraphTask and NodeTask so that the distributed autograd engine can use them. 2. Expose an `execute_with_graph_task` API which allows the distributed engine to build a GraphTask and pass it to the local autograd engine. 3. Expose an `enqueue_on_cpu` API, which allows the distributed engine to build a `NodeTask` for a 'send' function and enqueue it on the local autograd engine. In addition to this a few general improvements: 1. Added a `PropagateGradients` RPC call for the 'recv' function to pass gradients to the appropriate node during the backward pass. 2. Use IValues as much as possible in serialization for RpcWithAutograd. 3. If Future.wait() returns a message of type EXCEPTION, we throw an appropriate exception instead of just returning the message. This is in line with what most Future.wait() APIs do. 4. Added a `get_gradients(context_id)` API which allows users to retrieve a map from Tensor to respective gradient for the provided context_id on the local node. ghstack-source-id: 91794926 Test Plan: unit tests. Differential Revision: D17652615 fbshipit-source-id: 96f65c52adb2706ee29f4b49e1655afaa0a3bec3	2019-10-12 09:47:49 -07:00
Shihao Xu	130127ca59	Rename `BACKEND` to be `RPC_BACKEND` to be separated from `COMMUNICATION_BACKEND` like gloo, nccl, in `rpc_test.py` (#27792 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27792 Close https://github.com/pytorch/pytorch/issues/27232 ghstack-source-id: 91807741 Differential Revision: D5474297 fbshipit-source-id: 5b230a6857813ec981e5056880abb5859655daa2	2019-10-11 19:49:46 -07:00
Rohan Varma	ccd460d415	use gloo enum instead of hardcoding string (#27652 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27652 Changes "gloo" to dist.backend.GLOO in rpc_test.py. ghstack-source-id: 91764460 Test Plan: python test/test_rpc_fork.py && python test/test_rpc_spawn.py Differential Revision: D17845067 fbshipit-source-id: b220d3672d1e0b237da474276663d157230a4fdb	2019-10-11 19:06:23 -07:00
Hong Xu	987e37b9c2	Enable EXE001 flake8 check. (#27560 ) Summary: According to https://github.com/pytorch/pytorch/issues/27285 , it seems we do not intend to use the shebang as an indication of Python version, so we enable the EXE001 flake8 check. For violations, we either remove the shebang from non-executable Python scripts or grant them executable permission. Pull Request resolved: https://github.com/pytorch/pytorch/pull/27560 Differential Revision: D17831782 Pulled By: ezyang fbshipit-source-id: 6282fd3617b25676a6d959af0d318faf05c09b26	2019-10-09 09:15:29 -07:00
Pieter Noordhuis	b4ce922b58	Move RPC API to torch.distributed.rpc Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27290 Test Plan: Imported from OSS Reviewed By: mrshenli Differential Revision: D17808212 Pulled By: pietern fbshipit-source-id: c79907940fe4888b2ceaaa1cda0078e39c89b454	2019-10-08 11:31:25 -07:00
Pieter Noordhuis	a6d26ce135	Move internal functions to torch.distributed.rpc Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27289 Test Plan: Imported from OSS Differential Revision: D17808214 Pulled By: pietern fbshipit-source-id: 4c453028e431c3e951d439784017ef07037ba1a9	2019-10-08 11:31:20 -07:00
Pieter Noordhuis	14f1629c4d	Move RPC backend registry to torch.distributed.rpc Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27288 Test Plan: Imported from OSS Differential Revision: D17808215 Pulled By: pietern fbshipit-source-id: 489c031e02cd3141a861cf7ec2273aaa4c55b7d7	2019-10-08 11:31:16 -07:00
Shihao Xu	e166bcbbde	Make RpcTest re-usable by other RPC backends by using init_method to initialize a RPC backend (#27320 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27320 https://github.com/pytorch/pytorch/pull/27208/ # Problem Other RPC backends take init_method. # Solution Set up init_method in rpc tests. ghstack-source-id: 91335127 Differential Revision: D17709219 fbshipit-source-id: 3184c6e9b922a6ff9f4d1cb9abfa118b23f43eeb	2019-10-04 09:20:05 -07:00
Shihao Xu	86a8971ebb	Add a test case to RpcTest, check src/dst (#27322 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27322 # Problem Existing test cases are too symmetric, so they didn't detect this error: a request sent to the wrong worker. Because of a wrong `worker_names` setup, worker0 sends a request to itself when it should have sent it to worker1. # Solution Add a test case that lets the dst side check whether a request comes from the expected src. ghstack-source-id: 91299312 Reviewed By: satgera Differential Revision: D17069062 fbshipit-source-id: ef7a532dd497bfc0f0ee8446fcd5d29656aaf175	2019-10-03 18:59:59 -07:00
Shen Li	2486b0ba82	Add Python RRef as args and return value (#25499 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/25499 See #23110 for model parallel design details, and #26759 for the RRef protocol. This commit adds support for using RRef as Python UDF arguments and return value. RRefs can now be shared from owner to user, from user to owner, or from user to user. Limitations: 1. No implicit type conversion yet. (#27099) 2. No failure handling and retry. (#26116) 3. UDF is not yet blocked until all RRefs are confirmed. (#27098) 4. Internal RRef control messages are not idempotent yet. (#26116) 5. Cannot delete RRefs correctly when there are circular dependencies. (#27096) Main changes: 1. Added `SCRIPT_REMOTE_CALL` and `PYTHON_REMOTE_CALL` to `Message.h` to represent `dist.remote` invocations. 2. Added `SCRIPT_RREF_FETCH_CALL`, `PYTHON_RREF_FETCH_CALL`, `RREF_USER_ACCEPT`, `RREF_USER_DELETE`, `RREF_CHILD_ACCEPT`, and `RREF_FORK_REQUEST` to `Message.h` as internal RRef control messages. 3. New message request handling code is added to `functions.cpp`, and message format is added in `script_remote_call.h`, `python_remote_call.h`, and `rref_proto.h`. 4. Added a `PyRRef` type in `py_rref.h` and `py_rref.cpp` which holds a shared pointer to C++ `RRef` type. `PyRRef` wraps the C++ API and also implements RRef pickling and unpickling. RRef fork related control messages will be sent during RRef pickling/unpickling procedure. 5. Update `RRef.h` and `RRef.cpp` accordingly to support `py::object` RRefs. 6. RRef context (reference count, etc.) is tracked in `rref_context.h` and `rref_context.cpp`. Test Plan: Imported from OSS buck test mode/dev-nosan //caffe2/test:rpc_fork Differential Revision: D17184146 Pulled By: mrshenli fbshipit-source-id: a3a268efc087ac1ef489136ab957080382629265	2019-10-03 17:47:12 -07:00
Pritam Damania	fe4170bda8	Add send and recv backward functions for builtin operators RPC. (#25527 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/25527 Master GH issue: https://github.com/pytorch/pytorch/issues/23110. This change builds upon https://github.com/pytorch/pytorch/pull/24876 and provides all the autograd hooks needed for a forward pass with distributed rpc for builtin operators. This change does not address distributed rpc for python UDFs and that will be addressed in follow up PRs. Summary of changes: 1. Attach send autograd functions when a request is sent from the client and response is sent from the server. 2. Attach receive autograd functions when a request is received on the server and a response is received on the client. 3. Generate a globally unique autograd_message_id for each send/recv autograd function pair to uniquely identify them. ghstack-source-id: 91240466 Test Plan: unit tests. Differential Revision: D17148077 fbshipit-source-id: 192d8a3f552ed7cc939f55dcca332965c9bd3233	2019-10-03 01:18:46 -07:00
Shen Li	2491dd50ee	Add ProcessGroupAgent termination detection algorithm (#26984 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/26984 closes #26944 In the existing implementation, each worker exits when it sees no send/recv tasks. However, as we are adding support for nested calls, one RPC could trigger more RPCs in the UDF or in the response callback. As a result, even if the worker does not see any send/recv tasks for now, it does not mean there won't be any in the future. In this commit, we added counters for all sent and received messages between each pair of nodes, and then use allgather to collect those counters, i.e., all workers would have the same view of the global state. The workers would only exit when all sends are received and processed. Test Plan: Imported from OSS Differential Revision: D17633456 Pulled By: mrshenli fbshipit-source-id: 813a155d3b2daf2226612eb17f6c698512e9beca	2019-10-02 13:15:03 -07:00
Yanli Zhao	631e2ee7a4	make python udf serialization format to be binary plus tensor tables (#27136 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27136 make python udf serialization format to be binary plus tensor tables, so that tensors can be attached to autograd graph, handled in the same way as builtin operators ghstack-source-id: 91156141 Test Plan: unit tests Reviewed By: pritamdamania87 Differential Revision: D17405686 fbshipit-source-id: 4a8c9804f6ad239eb0655fa5daeb54580d4741fd	2019-10-02 00:10:32 -07:00
Shihao Xu	00e588290b	Add test case for init_rpc_backend (#26997 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/26997 Reverting accidental change in https://github.com/pytorch/pytorch/pull/26919 ghstack-source-id: 91126906 Reviewed By: zhaojuanmao Differential Revision: D17637468 fbshipit-source-id: 9ffcf4b15b37effe6b5d5f82338ff89298c82a52	2019-10-01 15:44:34 -07:00
Yanli Zhao	1d2d59dd79	make rpc and dist-autograd multiprocess test to use both fork and spawn (#25656 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/25656 Spawn multiprocessing can catch some issues that fork multiprocessing cannot. Meanwhile, fork works properly with asan tests, but spawn multiprocessing does not work with asan tests for some use cases right now. So this diff adds support for launching both spawn and fork tests in the multiProcessingTestCase class, and also lets test_rpc and test_dist_autograd run both spawn and fork tests. ghstack-source-id: 91096705 Test Plan: unit tests Reviewed By: xush6528 Differential Revision: D17086007 fbshipit-source-id: af2446e7abe948c37081cff24ed060fd87f84922	2019-10-01 11:15:22 -07:00
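
The termination check described in commit 2491dd50ee (per-pair send/receive counters, collected with allgather, exit only when every send has been received and processed) can be illustrated with a small single-process sketch. This is not the ProcessGroupAgent code: `all_gather` and `may_exit` are hypothetical names, and the allgather is simulated with plain lists rather than a real collective.

```python
from typing import List

Counters = List[List[int]]


def all_gather(rows: Counters) -> Counters:
    """Stand-in for a real allgather: every worker ends up with every row.

    Assumption: in a real deployment this would be a collective call;
    here it simply copies the per-worker rows.
    """
    return [list(row) for row in rows]


def may_exit(sent: Counters, received: Counters) -> bool:
    """Return True when every message sent has been received and processed.

    sent[i][j]     = number of messages worker i has sent to worker j
    received[j][i] = number of messages worker j has processed from worker i
    """
    global_sent = all_gather(sent)
    global_received = all_gather(received)
    n = len(global_sent)
    # Workers may exit only if the counters match for every ordered pair.
    return all(
        global_sent[i][j] == global_received[j][i]
        for i in range(n)
        for j in range(n)
    )
```

With two workers, `may_exit([[0, 2], [1, 0]], [[0, 1], [2, 0]])` holds because both pairwise counters match, while lowering worker 1's receive count to 1 means a message is still in flight and the workers must keep running.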


67 Commits