pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 00:21:07 +01:00

Author	SHA1	Message	Date
Hong Xu	987e37b9c2	Enable EXE001 flake8 check. (#27560) Summary: According to https://github.com/pytorch/pytorch/issues/27285, it seems we do not intend to use the shebang as an indication of Python version, so we enable the EXE001 flake8 check. For violations, we either remove the shebang from non-executable Python scripts or grant them executable permission. Pull Request resolved: https://github.com/pytorch/pytorch/pull/27560 Differential Revision: D17831782 Pulled By: ezyang fbshipit-source-id: 6282fd3617b25676a6d959af0d318faf05c09b26	2019-10-09 09:15:29 -07:00
Pieter Noordhuis	b4ce922b58	Move RPC API to torch.distributed.rpc Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27290 Test Plan: Imported from OSS Reviewed By: mrshenli Differential Revision: D17808212 Pulled By: pietern fbshipit-source-id: c79907940fe4888b2ceaaa1cda0078e39c89b454	2019-10-08 11:31:25 -07:00
Pieter Noordhuis	a6d26ce135	Move internal functions to torch.distributed.rpc Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27289 Test Plan: Imported from OSS Differential Revision: D17808214 Pulled By: pietern fbshipit-source-id: 4c453028e431c3e951d439784017ef07037ba1a9	2019-10-08 11:31:20 -07:00
Pieter Noordhuis	14f1629c4d	Move RPC backend registry to torch.distributed.rpc Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27288 Test Plan: Imported from OSS Differential Revision: D17808215 Pulled By: pietern fbshipit-source-id: 489c031e02cd3141a861cf7ec2273aaa4c55b7d7	2019-10-08 11:31:16 -07:00
Shihao Xu	e166bcbbde	Make RpcTest reusable by other RPC backends by using init_method to initialize an RPC backend (#27320) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27320 https://github.com/pytorch/pytorch/pull/27208/ # Problem Other RPC backends take init_method. # Solution Set up init_method in rpc tests. ghstack-source-id: 91335127 Differential Revision: D17709219 fbshipit-source-id: 3184c6e9b922a6ff9f4d1cb9abfa118b23f43eeb	2019-10-04 09:20:05 -07:00
Shihao Xu	86a8971ebb	Add a test case to RpcTest, check src/dst (#27322) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27322 # Problem Existing test cases are too symmetric, so they did not detect this error: a request sent to the wrong worker. Because of a wrong `worker_names` setup, worker0 sends the request to itself when it should have sent it to worker1. # Solution Add a test case that lets the dst side check whether a request comes from the expected src. ghstack-source-id: 91299312 Reviewed By: satgera Differential Revision: D17069062 fbshipit-source-id: ef7a532dd497bfc0f0ee8446fcd5d29656aaf175	2019-10-03 18:59:59 -07:00
Shen Li	2486b0ba82	Add Python RRef as args and return value (#25499) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/25499 See #23110 for model parallel design details, and #26759 for the RRef protocol. This commit adds support for using RRef as Python UDF arguments and return value. RRefs can now be shared from owner to user, from user to owner, or from user to user. Limitations: 1. No implicit type conversion yet. (#27099) 2. No failure handling and retry. (#26116) 3. UDF is not yet blocked until all RRefs are confirmed. (#27098) 4. Internal RRef control messages are not idempotent yet. (#26116) 5. Cannot delete RRefs correctly when there are circular dependencies. (#27096) Main changes: 1. Added `SCRIPT_REMOTE_CALL` and `PYTHON_REMOTE_CALL` to `Message.h` to represent `dist.remote` invocations. 2. Added `SCRIPT_RREF_FETCH_CALL`, `PYTHON_RREF_FETCH_CALL`, `RREF_USER_ACCEPT`, `RREF_USER_DELETE`, `RREF_CHILD_ACCEPT`, and `RREF_FORK_REQUEST` to `Message.h` as internal RRef control messages. 3. Added new message request handling code to `functions.cpp`, and message formats to `script_remote_call.h`, `python_remote_call.h`, and `rref_proto.h`. 4. Added a `PyRRef` type in `py_rref.h` and `py_rref.cpp` which holds a shared pointer to the C++ `RRef` type. `PyRRef` wraps the C++ API and also implements RRef pickling and unpickling. RRef fork-related control messages are sent during the RRef pickling/unpickling procedure. 5. Updated `RRef.h` and `RRef.cpp` accordingly to support `py::object` RRefs. 6. RRef context (reference count, etc.) is tracked in `rref_context.h` and `rref_context.cpp`. Test Plan: Imported from OSS buck test mode/dev-nosan //caffe2/test:rpc_fork Differential Revision: D17184146 Pulled By: mrshenli fbshipit-source-id: a3a268efc087ac1ef489136ab957080382629265	2019-10-03 17:47:12 -07:00
Pritam Damania	fe4170bda8	Add send and recv backward functions for builtin operators RPC. (#25527) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/25527 Master GH issue: https://github.com/pytorch/pytorch/issues/23110. This change builds upon https://github.com/pytorch/pytorch/pull/24876 and provides all the autograd hooks needed for a forward pass with distributed RPC for builtin operators. This change does not address distributed RPC for Python UDFs; that will be addressed in follow-up PRs. Summary of changes: 1. Attach send autograd functions when a request is sent from the client and a response is sent from the server. 2. Attach receive autograd functions when a request is received on the server and a response is received on the client. 3. Generate a globally unique autograd_message_id for each send/recv autograd function pair to uniquely identify them. ghstack-source-id: 91240466 Test Plan: unit tests. Differential Revision: D17148077 fbshipit-source-id: 192d8a3f552ed7cc939f55dcca332965c9bd3233	2019-10-03 01:18:46 -07:00
Shen Li	2491dd50ee	Add ProcessGroupAgent termination detection algorithm (#26984) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/26984 closes #26944 In the existing implementation, each worker exits when it sees no send/recv tasks. However, as we add support for nested calls, one RPC could trigger more RPCs in the UDF or in the response callback. As a result, even if the worker does not see any send/recv tasks for now, it does not mean there won't be any in the future. In this commit, we add counters for all sent and received messages between each pair of nodes, and then use allgather to collect those counters, i.e., all workers have the same view of the global state. The workers only exit when all sends are received and processed. Test Plan: Imported from OSS Differential Revision: D17633456 Pulled By: mrshenli fbshipit-source-id: 813a155d3b2daf2226612eb17f6c698512e9beca	2019-10-02 13:15:03 -07:00
Yanli Zhao	631e2ee7a4	Make Python UDF serialization format binary plus tensor tables (#27136) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27136 Make the Python UDF serialization format binary plus tensor tables, so that tensors can be attached to the autograd graph and handled in the same way as for builtin operators. ghstack-source-id: 91156141 Test Plan: unit tests Reviewed By: pritamdamania87 Differential Revision: D17405686 fbshipit-source-id: 4a8c9804f6ad239eb0655fa5daeb54580d4741fd	2019-10-02 00:10:32 -07:00
Shihao Xu	00e588290b	Add test case for init_rpc_backend (#26997) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/26997 Reverting accidental change in https://github.com/pytorch/pytorch/pull/26919 ghstack-source-id: 91126906 Reviewed By: zhaojuanmao Differential Revision: D17637468 fbshipit-source-id: 9ffcf4b15b37effe6b5d5f82338ff89298c82a52	2019-10-01 15:44:34 -07:00
Yanli Zhao	1d2d59dd79	Make rpc and dist-autograd multiprocess tests use both fork and spawn (#25656) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/25656 Spawn multiprocessing can catch some issues that fork multiprocessing cannot, while fork works properly with ASAN tests but spawn does not work with ASAN tests for some use cases right now. This diff therefore adds support for launching both spawn and fork tests in the multiProcessingTestCase class, and makes test_rpc and test_dist_autograd run both spawn and fork tests. ghstack-source-id: 91096705 Test Plan: unit tests Reviewed By: xush6528 Differential Revision: D17086007 fbshipit-source-id: af2446e7abe948c37081cff24ed060fd87f84922	2019-10-01 11:15:22 -07:00

12 Commits
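
The termination check described in commit 2491dd50ee (per-pair send/receive counters, gathered so that every worker sees the same global state) might look roughly like the sketch below. This is an illustration only, not the ProcessGroupAgent code: the names `allgather` and `can_terminate` are hypothetical, the counters are plain Python lists, and the collective is simulated in a single process rather than performed with a real process group.

```python
from typing import List

Matrix = List[List[int]]


def allgather(local_rows: List[List[int]]) -> Matrix:
    """Hypothetical stand-in for a collective allgather.

    In a real job each worker would contribute its own row and receive
    everyone else's; here we simply copy all rows in one process.
    """
    return [list(row) for row in local_rows]


def can_terminate(sent: Matrix, received: Matrix) -> bool:
    """Return True when every sent message has been received and processed.

    sent[i][j]: number of messages worker i has sent to worker j.
    received[j][i]: number of messages worker j has processed from worker i.
    """
    n = len(sent)
    return all(sent[i][j] == received[j][i] for i in range(n) for j in range(n))


# Two workers. Worker 0 has sent 2 messages to worker 1,
# but worker 1 has processed only 1 of them so far.
sent = allgather([[0, 2], [0, 0]])
received = allgather([[0, 0], [1, 0]])
done_early = can_terminate(sent, received)

# After worker 1 processes the second message, the counters match.
received = allgather([[0, 0], [2, 0]])
done_later = can_terminate(sent, received)
```

Because every worker evaluates the same gathered counters, all of them reach the same decision; a worker that currently has no local send/recv tasks still keeps running while any pair of counters disagrees.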