pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
Pritam Damania	c06f9023e5	Polish rpc docstring. (#30069 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30069 1) Fix rpc docstrings 2) Fix some links ghstack-source-id: 94250890 Test Plan: waitforbuildbot Differential Revision: D18588231 fbshipit-source-id: 33846ace1afa94d25f34b0370437abf6d9408f06	2019-11-19 23:10:14 -08:00
Shihao Xu	80e3f17301	Resubmit "Add `RpcAgentOptions` struct type, which bundles different required arguments for different `RpcAgent`s" (#30093 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30093 https://github.com/pytorch/pytorch/pull/28226 introduced `worker_to_id` arg to the `def init_rpc` function for other `RpcAgent`. While it's not really used by `ProcessGroupAgent`. Cleanup is wanted for this, as described in https://github.com/pytorch/pytorch/issues/29031. To adapt to the difference of different `RpcAgent`, adding a `RpcAgentOptions` base classes, which allow leveraging inheritance to add extra fields. ghstack-source-id: 94197295 Test Plan: ### OSS RPC + RRef tests ``` buck test mode/dev-nosan //caffe2/test:rpc_fork ``` ``` buck test mode/dev-nosan caffe2/torch/fb/distributed/thriftRpcBackend/test:thrift_rpc_fork_test -- test_sync_rpc ``` ### Prototype RRef tests ``` buck test mode/dev-nosan caffe2/torch/fb/distributed/pytorch/tests:test_rpc ``` ``` buck test mode/dev-nosan //caffe2/torch/fb/distributed/pytorch/tests:test_rpc_thrift_rpc_agent ``` ### Dist autograd ``` buck test mode/dev-nosan caffe2/test:dist_autograd_fork ``` ``` buck test mode/dev-nosan caffe2/torch/fb/distributed/thriftRpcBackend/test:thrift_dist_autograd_fork_test ``` Differential Revision: D18595578 fbshipit-source-id: 616fca3b844c171ed5277bbc6a2b1693bc3a8065	2019-11-19 18:52:30 -08:00
Edward Yang	1dda8186ae	Revert D18549919: Add `RpcAgentOptions` struct type, which bundles different required arguments for different `RpcAgent`s Test Plan: revert-hammer Differential Revision: D18549919 Original commit changeset: b9f3f1a41d1f fbshipit-source-id: 2d5e578d18c0725b59eb99a0e942fbf7fe3341ee	2019-11-19 08:14:40 -08:00
Rohan Varma	83513506c3	poll for timed out futures in process group agent (#29601 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/29601 Follow up from https://github.com/pytorch/pytorch/pull/28392. Adds a background thread to `ProcessGroupAgent` that polls for timed out RPCs at a pre-set interval, and marks them as completed with a timeout exception if they have timed out. Also deletes the futures from the corresponding maps `futures_` and `futureTimeouts`. Unit tests are added to ensure that timed out RPCs are appropriately cleaned up. Also adds a `shutdown` variable to process group agent to control the shutting down of this background thread, which can eventually be extended to use for controlling a clean shutdown of process group agent. ghstack-source-id: 94175131 Test Plan: Added unit tests Differential Revision: D18434215 fbshipit-source-id: c48abdb8759fe1447200ec66bb9d4b1c50ec4535	2019-11-19 06:42:04 -08:00
Shihao Xu	21dc1d4543	Add `RpcAgentOptions` struct type, which bundles different required arguments for different `RpcAgent`s (#29972 ) Summary: https://github.com/pytorch/pytorch/pull/28226 introduced `worker_to_id` arg to the `def init_rpc` function for other `RpcAgent`. While it's not really used by `ProcessGroupAgent`. Cleanup is wanted for this, as described in https://github.com/pytorch/pytorch/issues/29031. To adapt to the difference of different `RpcAgent`, adding a `RpcAgentOptions` base classes, which allow leveraging inheritance to add extra fields. closes https://github.com/pytorch/pytorch/issues/29031 Pull Request resolved: https://github.com/pytorch/pytorch/pull/29972 Differential Revision: D18549919 Pulled By: xush6528 fbshipit-source-id: b9f3f1a41d1ff18498734081870820b055d56f5b	2019-11-19 01:00:08 -08:00
Alisson Gusatti Azzolini	97156f548d	Add hash and equality operators for WorkerInfo (#29958 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/29958 DistributedOptimizer relies on hashing WorkerInfo in order to coalesce fan-out RPCs. This will likely be a very common use case (EASGD will do the same, for example). ghstack-source-id: 94169198 Test Plan: unit test. Differential Revision: D18548257 fbshipit-source-id: 7d67d4e1b9bc60403c372164982a75ae8c1d8389	2019-11-18 20:47:13 -08:00
Rohan Varma	371da6acef	move get_rpc_timeout to pybind (#29765 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/29765 instead of wrapping this C++ function with python that causes unnecessary overhead, we can move this to pybind and use the `DefaultRpcAgent` to get the timeout. ghstack-source-id: 93879236 Test Plan: unit tests pass Differential Revision: D18493195 fbshipit-source-id: fd0f1f13ee15acb5ea1ae7c696925c9b54304f6d	2019-11-14 19:39:22 -08:00
Rohan Varma	06ef4a757d	Add docs for RPC, dist autograd, and RRef modules (#29276 ) Summary: Closes https://github.com/pytorch/pytorch/issues/28983. Documentation for `torch.distributed.rpc` and `torch.distributed.autograd` modules. Also fixes/tidies up some of the docstrings in rpc/autograd, and moves some functions to be private so they don't show up in the documentation. Note: Much of the text to describe/explain the RPC/RRef layers are taken from the following RFCs: https://github.com/pytorch/pytorch/issues/23110, https://github.com/pytorch/pytorch/issues/26759 Pull Request resolved: https://github.com/pytorch/pytorch/pull/29276 Differential Revision: D18478754 Pulled By: rohan-varma fbshipit-source-id: e9a7089baf5275304e5408d319eb9bf98e53fff8	2019-11-14 14:32:03 -08:00
Alisson Gusatti Azzolini	93b5c9d723	Allow to create local RRef with value (#28948 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/28948 Add the constructor RRef(value) in python. This allows wrapping a local object with an RRef and passing or returning this RRef to users. This enables returning, for example, a list of RRefs containing the parameters of a module to the user of the module. ghstack-source-id: 93565010 Test Plan: unit test. Differential Revision: D18241227 fbshipit-source-id: b9e9b958f40623348d62ee6fc9e7f0414b4215b7	2019-11-11 12:19:45 -08:00
Shen Li	63675b1969	Revert RRef.to_here()/local_value() return type (#29396 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/29396 The return types of RRef.to_here()/local_value() were recently changed to Future, which triggers flakiness as the RRef could be deleted before the future.wait() finishes. While we are still discussing how we'd like to solve it, this commit reverts the return type to stop bleeding in tests. closes #28885 Test Plan: Imported from OSS Differential Revision: D18375571 Pulled By: mrshenli fbshipit-source-id: 354dbf38b15ab804e44fc9968dd30888415c1fab	2019-11-08 08:31:18 -08:00
Shihao Xu	e66626ae5c	Lift rpc_timeout to RpcAgent, for other RpcAgents to reuse. (#29341 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/29341 So that other RpcAgent could use this timeout setting as well. ghstack-source-id: 93481902 Differential Revision: D5681951 fbshipit-source-id: 569c768dc342e8a2d9faf142ceccf696e12e41dc	2019-11-07 17:05:45 -08:00
Pieter Noordhuis	b4df413712	Scope pybind11 functions to torch.distributed.{autograd,rpc} Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27529 Test Plan: Imported from OSS Differential Revision: D17808209 Pulled By: pietern fbshipit-source-id: 1e3e086085167320c3fc369467f5d75ce39fa4ea	2019-11-05 06:25:22 -08:00
Rohan Varma	fd0f9811ad	add timeout for RPC futures, and ability to set timeout when initializing rpc (#28392 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/28392 Per #25531, we want to clean up futures when we detect that there are failures/timeouts. As a first step, this diff adds timers to the future object, provides functionality to check if a future is timed out, and allows specification of the timeout when initializing rpc. A future diff will check for these timeouts and mark the future completed with an exception indicating that it has timed out. ghstack-source-id: 93192622 Test Plan: Added unit tests. Differential Revision: D18025163 fbshipit-source-id: 195fb50c736caf5c7b2bada9a5f6116bb106ed33	2019-11-04 14:43:03 -08:00
Shen Li	e31adeb4f3	Make RRef::LocalValue return Future (#28025 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/28025 Add a PyFuture type which is a wrapper of either an OwnerRRef or a jit::Future. The difference between PyFuture and jit::Future is that PyFuture can return a custom py::object type. Test Plan: Imported from OSS Differential Revision: D17936746 Pulled By: mrshenli fbshipit-source-id: a7451af3993d98aeab462ffd5318fc6d28f915c8	2019-10-23 17:07:16 -07:00
Yanli Zhao	3214f134b6	fix python rpc handler exit crash (#27251 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27251 Explicitly clean up py::objects to avoid segmentation faults when py::objects with CPython are cleaned up later at program exit. See similar issues reported in https://github.com/pybind/pybind11/issues/1598 and https://github.com/pybind/pybind11/issues/1493. Our local tests also caught these segmentation faults when py::objects are cleaned up at program exit. The explanation is: CPython cleans up most critical utilities before cleaning up the PythonRpcHandler singleton, so when the PythonRpcHandler singleton cleans up py::objects and calls dec_ref(), it will crash. The solution is to clean up py::objects earlier, when the Rpc agent calls join(). Note that py::objects cannot be cleaned up when the Rpc agent is destroyed either, as the Rpc agent is a global variable and it will have the same issue as PythonRpcHandler. close #27182 ghstack-source-id: 92035069 Test Plan: unit tests on python 3.6 and python 3.5 Differential Revision: D17727362 fbshipit-source-id: c254023f6a85acce35528ba756a4efabba9a519f	2019-10-16 16:57:38 -07:00
Shen Li	59cd0faeff	Defer pg agent listener thread until contexts are initialized (#28013 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/28013 ProcessGroupAgent currently kicks off the listener thread in its constructor. However, serving requests requires contexts to be initialized, e.g., RRefContext and agent_ global var in api.py, which might not be done yet when the first request arrives. ProcessGroupAgent does not know what would be the appropriate time to start the listener thread, hence exposing an API for higher layer code to explicitly start listeners. Test Plan: Imported from OSS Differential Revision: D17932271 Pulled By: mrshenli fbshipit-source-id: 3b408477594d4d19319e7cd08dd6f383a7ed7670	2019-10-15 17:45:43 -07:00
Pritam Damania	3bccd3fc0d	Distributed Autograd - FAST mode backward pass implementation. (#27022 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27022 This change implements the "FAST" mode distributed autograd backward pass as described in https://github.com/pytorch/pytorch/issues/23110. At a high level the backward pass works as follows: 1. We start by computing dependencies on the node that calls `torch.distributed.backward`. 2. This node computes the dependencies starting from the root nodes provided in the backward call and all the 'send' functions present in the current autograd context. The "FAST" mode assumes all 'send' functions are part of the autograd computation. 3. Once the dependency computation is done, the distributed autograd engine calls the local autograd engine to execute the autograd graph. Note that the autograd graph on a single node is not necessarily connected because of inter-node communication. As a result, we have special handling to ensure the local autograd engine executes the entire graph starting from the provided roots and all 'send' functions on the node. 4. When the local autograd engine hits a 'recv' function, it performs an async RPC to send the gradients over to the appropriate node and stores a future in the autograd context to keep track of this RPC. 5. On the destination node, the appropriate 'send' function is looked up and enqueued on the local autograd engine. If this is the first time the node is hearing about this autograd context id on the backward pass, then the node computes dependencies for the local autograd engine. 6. As part of compute dependencies, the distributed autograd engine discovers all leaf nodes and ensures those are passed as 'outputs' to the local autograd engine. This avoids running the 'AccumulateGrad' function. 7. The gradients computed for the leaf nodes are then actually accumulated in `DistAutogradContext` for the appropriate autograd context id. 8. The distributed autograd engine waits for the local autograd engine to complete and also waits for all the 'Futures' (stored in 4.) for respective RPCs to finish. We have made the following changes to the local autograd engine for this purpose: 1. Expose GraphTask and NodeTask so that the distributed autograd engine can use them. 2. Expose a `execute_with_graph_task` API which allows the distributed engine to build a GraphTask and pass it to the local autograd engine. 3. Expose a `enqueue_on_cpu` API, which allows the distributed engine to build a `NodeTask` for a 'send' function and enqueue it on the local autograd engine. In addition to this a few general improvements: 1. Added a `PropagateGradients` RPC call for the 'recv' function to pass gradients to the appropriate node during the backward pass. 2. Use IValues as much as possible in serialization for RpcWithAutograd. 3. If Future.wait() contains a message type EXCEPTION, we throw an appropriate exception instead of just returning the message. This is in line with what most Future.wait() APIs do. 4. Added a `get_gradients(context_id)` API which allows users to retrieve a map from Tensor to respective gradient for the provided context_id on the local node. ghstack-source-id: 91794926 Test Plan: unit tests. Differential Revision: D17652615 fbshipit-source-id: 96f65c52adb2706ee29f4b49e1655afaa0a3bec3	2019-10-12 09:47:49 -07:00
Shen Li	2486b0ba82	Add Python RRef as args and return value (#25499 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/25499 See #23110 for model parallel design details, and #26759 for the RRef protocol. This commit adds support for using RRef as Python UDF arguments and return value. RRefs can now be shared from owner to user, from user to owner, or from user to user. Limitations: 1. No implicit type conversion yet. (#27099) 2. No failure handling and retry. (#26116) 3. UDF is not yet blocked until all RRefs are confirmed. (#27098) 4. Internal RRef control messages are not idempotent yet. (#26116) 5. Cannot delete RRefs correctly when there are circular dependencies. (#27096) Main changes: 1. Added `SCRIPT_REMOTE_CALL` and `PYTHON_REMOTE_CALL` to `Message.h` to represent `dist.remote` invocations. 2. Added `SCRIPT_RREF_FETCH_CALL`, `PYTHON_RREF_FETCH_CALL`, `RREF_USER_ACCEPT`, `RREF_USER_DELETE`, `RREF_CHILD_ACCEPT`, and `RREF_FORK_REQUEST` to `Message.h` as internal RRef control messages. 3. New message request handling code is added to `functions.cpp`, and message format is added in `script_remote_call.h`, `python_remote_call.h`, and `rref_proto.h`. 4. Added a `PyRRef` type in `py_rref.h` and `py_rref.cpp` which holds a shared pointer to C++ `RRef` type. `PyRRef` wraps the C++ API and also implements RRef pickling and unpickling. RRef fork related control messages will be sent during RRef pickling/unpickling procedure. 5. Update `RRef.h` and `RRef.cpp` accordingly to support `py::object` RRefs. 6. RRef context (reference count, etc.) is tracked in `rref_context.h` and `rref_context.cpp`. Test Plan: Imported from OSS buck test mode/dev-nosan //caffe2/test:rpc_fork Differential Revision: D17184146 Pulled By: mrshenli fbshipit-source-id: a3a268efc087ac1ef489136ab957080382629265	2019-10-03 17:47:12 -07:00
Pritam Damania	fe4170bda8	Add send and recv backward functions for builtin operators RPC. (#25527 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/25527 Master GH issue: https://github.com/pytorch/pytorch/issues/23110. This change builds upon https://github.com/pytorch/pytorch/pull/24876 and provides all the autograd hooks needed for a forward pass with distributed rpc for builtin operators. This change does not address distributed rpc for python UDFs and that will be addressed in follow up PRs. Summary of changes: 1. Attach send autograd functions when a request is sent from the client and response is sent from the server. 2. Attach receive autograd functions when a request is received on the server and a response is received on the client. 3. Generate a globally unique autograd_message_id for each send/recv autograd function pair to uniquely identify them. ghstack-source-id: 91240466 Test Plan: unit tests. Differential Revision: D17148077 fbshipit-source-id: 192d8a3f552ed7cc939f55dcca332965c9bd3233	2019-10-03 01:18:46 -07:00
Yanli Zhao	631e2ee7a4	make python udf serialization format to be binary plus tensor tables (#27136 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27136 make python udf serialization format to be binary plus tensor tables, so that tensors can be attached to autograd graph, handled in the same way as builtin operators ghstack-source-id: 91156141 Test Plan: unit tests Reviewed By: pritamdamania87 Differential Revision: D17405686 fbshipit-source-id: 4a8c9804f6ad239eb0655fa5daeb54580d4741fd	2019-10-02 00:10:32 -07:00
Rohan Varma	1f51051287	remove extra get_worker_id call in distributed rpc init (#26381 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/26381 Was looking through this definition and saw that it has 2 identical definitions of get_worker_id. Tested by ensuring that all tests in `test/test_rpc.py` still pass. ghstack-source-id: 90347452 Test Plan: See above Differential Revision: D17439495 fbshipit-source-id: 9a78340f7aefa5797e0ae837fbcfe24ebe3a775d	2019-09-18 16:34:54 -07:00
Shen Li	197fd4f707	Adding RRef as return value for builtin operators (#25169 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/25169 See #23110 for RRef design details. This commit only implements RRef as return value for builtin operators, and RRef will communicate between a user and the owner. More specifically, a RRef is first created on the `dist.remote` caller, which is a user of the RRef. Then the RRef user sends a notification to the owner to report the fork, and the owner uses a shared_ptr to keep the RRef alive. When the user RRef is destructed on the caller, another notification will be sent to the owner, and the owner can then drop its RRef as well. Test Plan: Imported from OSS Differential Revision: D17048343 Pulled By: mrshenli fbshipit-source-id: 9dd3b3d0e4fd214c76fecdbed746a6d3029b3efd	2019-09-05 15:14:17 -07:00
Pieter Noordhuis	5407241b4f	Run clang-format on torch/csrc/distributed (#25647 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/25647 TSIA Test Plan: N/A Differential Revision: D17182909 fbshipit-source-id: 22a6554693def0032a051cef5fe788f49de1d740	2019-09-04 10:08:09 -07:00
Shen Li	c881136215	Move worker name collection code from Python to C++ (#24260 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/24260 This also simplifies ProcessGroupAgent constructor signature. Test Plan: Imported from OSS Differential Revision: D16789219 Pulled By: mrshenli fbshipit-source-id: bbb69022435467fbb1c28da21dd03d3ab52fc521	2019-08-31 19:02:45 -07:00
Shen Li	1294e55c15	Assign each RpcAgent a unique ID, and use ID for sending RPC messages. (#24195 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/24195 It is not efficient to use a string destination name in every send. Moreover, when we add RRef later, RpcAgent will frequently check RRef ownership. It will be slow as well if we have to go through string comparison every time. This commit assigns each RpcAgent a unique integer ID. In the Python send API, applications can provide either destination name or id. If it is a string name, it will be converted to id by calling the get_id(workerName) API. Test Plan: Imported from OSS Differential Revision: D16770241 Pulled By: mrshenli fbshipit-source-id: fa56128a77a02a402dc6682474bc301dc1b7f43d	2019-08-29 19:19:11 -07:00
Pritam Damania	7818e7e5d4	Basic framework for Distributed Autograd context. (#24875 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/24875 As per https://github.com/pytorch/pytorch/issues/23110, each autograd pass would be assigned a unique autograd_context_id. In this change we introduce a DistAutogradContainer per worker which holds information for each autograd pass currently running. DistAutogradContainer has a map from the autograd_context_id to DistAutogradContext (which holds all the relevant information for the autograd pass). DistAutogradContext currently only stores the autograd_context_id and more information would be added to it later as we build out the rest of the framework. The autograd_context_id is a 64 bit globally unique integer where the first 16 bits are the worker_id and next 48 bits are auto-incrementing for uniqueness. Sample python code on how this would be used for distributed autograd: ``` import torch.distributed.autograd as dist_autograd worker_id = 0 dist_autograd.init(worker_id) with dist_autograd.context() as context_id: # forward pass... # backward pass... # optimizer step... ``` ghstack-source-id: 89119248 Test Plan: unit tests. Differential Revision: D16356694 fbshipit-source-id: d1a8678da0c2af611758dbb5d624d554212330ce	2019-08-28 18:51:56 -07:00
Shen Li	b6803d62fd	Use snake names for all files in distributed.rpc (#24502 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/24502 Files in distributed.rpc package mixes snake camel names. This commit cleans that up and all files use snake names now. ghstack-source-id: 88548990 Reviewed By: xush6528 Differential Revision: D16860155 fbshipit-source-id: 3a22a89bf6c4e11aac5849564fc53296a04d6a8b	2019-08-19 10:58:59 -07:00
Shen Li	99dea08e60	Use c10::ThreadPool to send and receive messages (#23968 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/23968 Existing ProcessGroupAgent uses a single thread to send all messages, and a single thread to listen and process all received messages. This causes performance issues and also prevents nested RPCs. For example, when running nested RPC A->B->A->B, the second recv on B cannot start until the first recv on B finishes. If the second recv is triggered by a nested RPC in the first recv, it will deadlock. Ideally, we should expose something like responder or FutureResult to the Python land to support nested asynchronous UDFs. This diff adds a shared ThreadPool for send and recv. Send uses it to send out messages, and recv uses it to process received messages. There is still a dedicated thread to listen for incoming messages and add them to the task queue. There are two goals: 1) speed up ProcessGroupAgent 2) use ThreadPool as a temporary solution for (a small number of) nested RPCs ghstack-source-id: 88476246 Differential Revision: D16695091 fbshipit-source-id: fd18a5c65e7fcd1331b73d1287673e6e10d2dd86	2019-08-16 17:49:05 -07:00
Yanli Zhao	ab39a55331	python udf over rpc (#23569 ) Summary: This diff is to support python user defined function over rpc for https://github.com/pytorch/pytorch/issues/23110, work flow is like this: 1. pickle python udf 2. pass pickle to C++ 3. C++ pass over rpc from client to server 4. server call runPythonUDF() python function to unpickle and run python udf and pickle the udf result using python embedder 5. pass back serialized result from server to client 6. client call loadPythonUDFResult() python function to unpickle result 7. return it to python. Right now, rpc_sync_builtin() and rpc_async_builtin() are temporary interfaces for builtin operator remote calls; they accept a qualified name string, and this interface can execute builtin operators in C++ land. rpc_sync() and rpc_async() accept python callables only right now, which can be user defined python functions or builtin operator python functions; the python functions will be executed in python land. Once we can resolve builtin operator python callables to a qualified name string, we can merge rpc_sync_builtin() into rpc_sync(). Pull Request resolved: https://github.com/pytorch/pytorch/pull/23569 Test Plan: unit tests Differential Revision: D16390764 Pulled By: zhaojuanmao fbshipit-source-id: 2cf2c22a979646830b5581bd75eabf8b3cca564c	2019-08-14 23:13:33 -07:00
Shen Li	8b349073ce	sync and async torch.distributed.rpc for builtin operators (#23228 ) Summary: Features: * sync and async RPC for builtin operators * RpcAgent API * ProcessGroupAgent implementation Goal: * have a minimum working and testable RPC implementation * make sure the RpcAgent API is sufficient for future ThriftAgent and TensorPipeAgent implementation * For tensor pipe implementation, it might allocate multiple underlying communication channels with different types, and might also use streaming serialization/deserialization for large tensors. To support this requirement, the current implementation only converts a BuiltinOp into a Message which contains a byte vector and a tensor table. It is up to the RpcAgent implementation to determine how it would like to serialize a Message object. * For ThriftAgent, as Thrift has its own request/response matching solution, the Message.id is no longer necessary. Hence the id can be dropped during serialization. All it needs to do is to pass the response Message object to the Future returned by send(...). * support blocking and non-blocking RequestCallback * blocking means the callback won't return before sending out the response * non-blocking can be achieved by enqueueing the `(from, request, RpcAgent&)` tuple and using a different thread to process them. That is why there is an `RpcAgent&` arg in the param list. We are not exporting this diff until we finalize distributed autograd design and publish the API review publicly. https://fb.quip.com/FabTAZKVgQpf Pull Request resolved: https://github.com/pytorch/pytorch/pull/23228 ghstack-source-id: 87816717 Reviewed By: zhaojuanmao Differential Revision: D15194693 fbshipit-source-id: 7adb600796613cde6073db6c227451b89940ecaf	2019-08-06 16:03:01 -07:00
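
Commit 7818e7e5d4 above describes `autograd_context_id` as a 64-bit integer whose first 16 bits are the worker_id and whose next 48 bits auto-increment. The following is a minimal Python sketch of that layout, not the C++ code in the commit. It assumes "first" means the most significant bits, and the names `ContextIdGenerator` and `split_context_id` are hypothetical.

```python
import itertools

WORKER_ID_BITS = 16   # from the commit message
COUNTER_BITS = 48     # from the commit message
COUNTER_MASK = (1 << COUNTER_BITS) - 1


class ContextIdGenerator:
    """Hypothetical helper: hands out 64-bit ids unique across workers."""

    def __init__(self, worker_id):
        if not 0 <= worker_id < (1 << WORKER_ID_BITS):
            raise ValueError("worker_id must fit in 16 bits")
        self._worker_id = worker_id
        self._counter = itertools.count()  # auto-incrementing part

    def next_id(self):
        count = next(self._counter)
        if count > COUNTER_MASK:
            raise OverflowError("48-bit counter exhausted")
        # Assumption: worker_id occupies the high 16 bits.
        return (self._worker_id << COUNTER_BITS) | count


def split_context_id(context_id):
    """Recover (worker_id, counter) from a packed id."""
    return context_id >> COUNTER_BITS, context_id & COUNTER_MASK
```

Because the worker_id is embedded in every id, two workers can allocate ids independently without coordination, which is presumably why the commit chose this layout.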

30 Commits
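
Commit 83513506c3 above describes a background thread that polls for timed-out RPC futures at a pre-set interval, completes them with a timeout exception, removes them from the tracking maps, and stops when a `shutdown` variable is set. The sketch below illustrates that pattern in plain Python with the standard library. It is not the `ProcessGroupAgent` implementation, and the class `TimeoutPoller` and its methods are hypothetical names.

```python
import threading
import time
from concurrent.futures import Future


class TimeoutPoller:
    """Hypothetical sketch of a poller for timed-out futures."""

    def __init__(self, poll_interval=0.05):
        self._lock = threading.Lock()
        self._futures = {}  # request id -> (future, deadline)
        self._shutdown = threading.Event()  # plays the role of `shutdown`
        self._poll_interval = poll_interval
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def track(self, request_id, timeout):
        """Register a pending request and return its future."""
        future = Future()
        with self._lock:
            self._futures[request_id] = (future, time.monotonic() + timeout)
        return future

    def poll_once(self, now=None):
        """Expire overdue futures; returns how many were timed out."""
        now = time.monotonic() if now is None else now
        with self._lock:
            expired = [rid for rid, (_, deadline) in self._futures.items()
                       if deadline <= now]
            # Remove expired entries from the map while holding the lock.
            timed_out = [self._futures.pop(rid)[0] for rid in expired]
        for future in timed_out:
            if not future.done():
                future.set_exception(TimeoutError("RPC timed out"))
        return len(timed_out)

    def _run(self):
        # Event.wait returns True once shutdown is set, ending the loop.
        while not self._shutdown.wait(self._poll_interval):
            self.poll_once()

    def shutdown(self):
        self._shutdown.set()
        if self._thread.is_alive():
            self._thread.join()
```

Keeping the expiry logic in `poll_once` separates it from the thread loop, so it can be exercised deterministically by passing an explicit `now` value.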