Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45353
Temporarily removing this feature; we will add it back after the branch cut.
Test Plan: Imported from OSS
Reviewed By: rohan-varma
Differential Revision: D23939865
Pulled By: mrshenli
fbshipit-source-id: 7dceaffea6b9a16512b5ba6036da73e7f8f83a8e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45065
To preserve backwards compatibility with applications that were passing in a ProcessGroupRpcBackendOptions but were not explicitly setting backend=BackendType.PROCESS_GROUP, we now infer the backend type from the options when only the options are passed. If neither is passed, we default to TensorPipe, as before this change.
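As a usage sketch of the intended behavior (the worker name, rank, world size, and init_method endpoint below are made up for illustration):
```
import torch.distributed.rpc as rpc

# Passing ProcessGroupRpcBackendOptions without an explicit backend:
# the backend type is now inferred to be BackendType.PROCESS_GROUP.
opts = rpc.ProcessGroupRpcBackendOptions(
    init_method="tcp://127.0.0.1:29500",  # example endpoint
    num_send_recv_threads=8,
)
rpc.init_rpc("worker0", rank=0, world_size=2, rpc_backend_options=opts)

# Passing neither a backend nor options keeps the TensorPipe default:
# rpc.init_rpc("worker0", rank=0, world_size=2)
```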
ghstack-source-id: 112586258
Test Plan: Added new unit tests.
Reviewed By: pritamdamania87
Differential Revision: D23814289
fbshipit-source-id: f4be7919e0817a4f539a50ab12216dc3178cb752
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45088
Fixes #45082
Found a few problems while working on #44983
1. We deliberately swallow RPC timeouts during shutdown, as we haven't
found a good way to handle those. When we converted `_wait_all_workers`
into `_all_gather`, the same logic was inherited. However, as
`_all_gather` is meant to be used in more general scenarios, we should
no longer stay silent about errors. This commit lets errors propagate
out of `_all_gather` and has `shutdown()` catch and log them (see the
sketch after this list).
2. After fixing (1), I found that `UnpickledPythonCall` needs to
acquire the GIL on destruction, and this can lead to a deadlock when
used in conjunction with `ProcessGroup`, because the `ProcessGroup`
ctor is a synchronization point that holds the GIL. In `init_rpc`,
followers (`rank != 0`) can exit before the leader (`rank == 0`). If
the two happen together, we can end up in the following state: a
follower exits `init_rpc` after running `_broadcast_to_followers` but
before reaching the dtor of `UnpickledPythonCall`; it then runs the
`ProcessGroup` ctor, which holds the GIL and waits for the leader to
join. Meanwhile, the leader is waiting for the response from
`_broadcast_to_followers`, which is blocked by the dtor of
`UnpickledPythonCall`, hence the deadlock. This commit drops the GIL
in the `ProcessGroup` ctor.
3. After fixing (2), I found that the `TensorPipe` backend
nondeterministically fails `test_local_shutdown`, for a reason similar
to (2), but this time because `shutdown()` on a follower runs before
the leader finishes `init_rpc`. This commit adds a join to the
`TensorPipe` backend's `init_rpc` after `_all_gather`.
The 3rd fix should solve the 2nd issue as well, but since I didn't see
a reason to hold the GIL during the `ProcessGroup` ctor, I made that
change too.
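For point (1), a minimal self-contained sketch of the error-handling shape described above; `_all_gather` here is a stand-in stub, not the real internal helper:
```
import logging

logger = logging.getLogger(__name__)

def _all_gather(obj):
    # Stand-in for the internal helper: after this change it raises
    # (e.g. a RuntimeError on an RPC timeout) instead of swallowing errors.
    raise RuntimeError("RPC timed out while waiting for all workers")

def shutdown_like_caller():
    # A shutdown()-style caller now catches and logs the error, while
    # general callers of _all_gather see it raised as usual.
    try:
        _all_gather(None)
    except RuntimeError as ex:
        logger.error("Encountered error while waiting for all workers: %s", ex)
```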
Test Plan: Imported from OSS
Reviewed By: pritamdamania87
Differential Revision: D23825592
Pulled By: mrshenli
fbshipit-source-id: 94920f2ad357746a6b8e4ffaa380dd56a7310976
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42637
This commit enables sending non-CPU tensors through RPC using the
TensorPipe backend. Users can configure device mappings by calling
set_map_location on `TensorPipeRpcBackendOptions`. Internally, the
`init_rpc` API verifies the correctness of the device mappings; it
shuts down RPC if the check fails, or proceeds and passes the global
mappings to `TensorPipeAgent` if the check succeeds. For serde, we
added a device-indices field to the TensorPipe read and write buffers,
which should be either empty (all tensors must be on CPU) or match
the tensors in the RPC message in order and number. This commit does
not yet avoid copies: the tensor is always moved to CPU on the sender
and then moved to the specified device on the receiver.
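As a usage sketch (worker names, device indices, and the exact signature of `set_map_location` are assumptions for illustration; the real argument format may differ):
```
import torch
import torch.distributed.rpc as rpc

options = rpc.TensorPipeRpcBackendOptions(num_worker_threads=8)
# Assumed mapping format: tensors on the local cuda:0 sent to "worker1"
# should be placed on its cuda:1.
options.set_map_location("worker1", {0: 1})

rpc.init_rpc(
    "worker0",
    rank=0,
    world_size=2,
    backend=rpc.backend_registry.BackendType.TENSORPIPE,
    rpc_backend_options=options,
)

# A CUDA tensor can now be passed in an RPC; per the note above it is still
# staged through CPU and re-placed on the mapped device on the receiver.
ret = rpc.rpc_sync("worker1", torch.add, args=(torch.ones(2, device="cuda:0"), 1))
```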
Test Plan: Imported from OSS
Reviewed By: izdeby
Differential Revision: D23011572
Pulled By: mrshenli
fbshipit-source-id: 62b617eed91237d4e9926bc8551db78b822a1187
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40222
Mention the TensorPipe agent in the RPC docs and give users the information they need to choose which agent to use.
ghstack-source-id: 106225711
Test Plan: Export to GitHub, build locally and try out the docs.
Differential Revision: D22116494
fbshipit-source-id: 30703ba8410c40f64e785f60d71dfd9faa8de4a1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40162
The only public option is `num_worker_threads`. The other ones are private (as indicated by the leading underscore; is that enough?) and allow specifying a different set and order of transports/channels. They can thus be used to disable a backend (by not listing it) or to force one (by raising its priority), and therefore to work around defective backends, in case we find any post-release.
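For illustration, a hedged sketch of how such overrides might be passed; the private keyword names (`_transports`, `_channels`) and the transport/channel strings are assumptions based on the leading-underscore convention mentioned above, not a documented API:
```
import torch.distributed.rpc as rpc

# Public option: size the worker thread pool.
options = rpc.TensorPipeRpcBackendOptions(num_worker_threads=16)

# Hypothetical private overrides: restrict TensorPipe to the "uv" transport
# and the "basic" channel, disabling the other backends by omission.
options_pinned = rpc.TensorPipeRpcBackendOptions(
    num_worker_threads=16,
    _transports=["uv"],
    _channels=["basic"],
)
```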
ghstack-source-id: 106103238
Test Plan: Built //caffe2:ifbpy and, using TensorPipe's verbose logging, verified that the transports/channels I specified were indeed the ones that were being registered.
Differential Revision: D22090661
fbshipit-source-id: 789bbe3bde4444cfa20c40276246e4ab67c50cd0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39909
As described in https://github.com/pytorch/pytorch/issues/33583,
ProcessGroupAgent initializes the default process group, and this causes issues
if the user initializes the default process group themselves: either the RPC
initialization or the user's process group initialization would fail.
To avoid this, I've changed ProcessGroupAgent init to create its own
ProcessGroupGloo and not use the default one at all.
Closes: https://github.com/pytorch/pytorch/issues/33583
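As a sketch of the pattern this unblocks (endpoints, ports, and worker names are made up; shown for a single rank out of two), both initializations can now coexist in the same process:
```
import torch.distributed as dist
import torch.distributed.rpc as rpc

# User-created default process group, with its own rendezvous endpoint.
dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29500", rank=0, world_size=2)

# RPC initialization no longer conflicts with it, since ProcessGroupAgent now
# creates its own ProcessGroupGloo instead of touching the default group.
rpc.init_rpc(
    "worker0",
    rank=0,
    world_size=2,
    rpc_backend_options=rpc.ProcessGroupRpcBackendOptions(
        init_method="tcp://127.0.0.1:29501",  # separate rendezvous for RPC
    ),
)

rpc.shutdown()
dist.destroy_process_group()
```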
ghstack-source-id: 105953303
Test Plan: waitforbuildbot
Differential Revision: D22011868
fbshipit-source-id: 7346a3fcb2821a0bc08e0bdc0625947abb5ae16f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38933
Based on what I could understand of how RPC shutdown operates and of what the ProcessGroup agent does, the join method is supposed to act as a barrier among all workers, waiting until they have all finished their pending work, including work that may be triggered by nested calls or by callbacks.
ghstack-source-id: 104760684
Test Plan: Before this diff, the `test_user_rrefs_confirmed` test of the RPC suite was flakily deadlocking. After this, I haven't been able to repro that.
Differential Revision: D21703020
fbshipit-source-id: 3d36c6544f1ba8e17ce27ef520ecfd30552045dd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38052
The initial version of the TensorPipe agent required the user to specify, on each worker, the full map between workers' names and their ids. However, it's enough for each worker to specify just its own name and id, as these can then be exchanged using the store.
Addresses #37784, although I think we can go further and use the store to also automatically assign ranks to workers, so that the user only needs to specify a name.
ghstack-source-id: 103741595
(Note: this ignores all push blocking failures!)
Test Plan:
On worker 0:
```
In [1]: import os
...: import torch
...: import torch.distributed.rpc as rpc
...: os.environ["MASTER_ADDR"] = "127.0.0.1"
...: os.environ["MASTER_PORT"] = "8765"
In [2]: rpc.init_rpc(name="foo", rank=0, backend=rpc.backend_registry.BackendType.TENSORPIPE, world_size=2)
In [3]: rpc.rpc_sync("bar", torch.add, args=(torch.full((2,2), 1), torch.full((2,2), 2)))
Out[3]:
tensor([[3., 3.],
        [3., 3.]])
In [4]: rpc.rpc_sync("bar", torch.add, args=(1, 2))
Out[4]: 3
```
On worker 1:
```
In [1]: import os
...: import torch
...: import torch.distributed.rpc as rpc
...: os.environ["MASTER_ADDR"] = "127.0.0.1"
...: os.environ["MASTER_PORT"] = "8765"
In [2]: rpc.init_rpc(name="bar", rank=1, backend=rpc.backend_registry.BackendType.TENSORPIPE, world_size=2)
```
Then also tested by adding `rpc_backend_options=rpc.TensorPipeRpcBackendOptions(init_method="file:///tmp/init/foo")` to `init_rpc`.
Differential Revision: D21463833
fbshipit-source-id: b53d7af6fc060789358ac845aa1898ddea6e8f31
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35483
Implement the initial version of the TensorPipe RPC agent, and register it with the RPC registry to expose it to the Python interface. As a starter, it utilizes all available TensorPipe transports (shm, uv) and channels (basic, cma).
Test Plan:
https://our.intern.facebook.com/intern/diffusion/FBS/browse/master/fbcode/experimental/jiayisuse/tensorpipe_rpc
```
export MASTER_ADDR=127.0.0.1
export MASTER_PORT=28500
buck build mode/dev-nosan mode/no-gpu //experimental/jiayisuse/tensorpipe_rpc:main
./buck-out/gen/experimental/jiayisuse/tensorpipe_rpc/main.par
buck build mode/dev-nosan mode/no-gpu //experimental/jiayisuse/tensorpipe_rpc:benchmark
./buck-out/gen/experimental/jiayisuse/tensorpipe_rpc/benchmark.par
```
Multiple connections with async echo:
```
./buck-out/gen/experimental/jiayisuse/tensorpipe_rpc/async_echo.par
```
Reviewed By: lw
Differential Revision: D20088366
fbshipit-source-id: 980f641af3321ca93583c62753e1c9174b7d4afc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37450
It doesn't seem like we could customize the retryable message types by
passing `faulty_messages` into dist_utils, as the `FaultyRpcAgentTestFixture`
overrode the `rpc_backend_options` function and provided the default list of
retryable message types. This needed to be fixed as part of adding timeout
injection support, as mentioned in https://github.com/pytorch/pytorch/issues/36272.
ghstack-source-id: 103287164
Test Plan: `buck test mode/dev-nosan //caffe2/test/distributed/rpc/faulty_agent:rpc_spawn_faulty -- --print-passing-details`
Differential Revision: D21270127
fbshipit-source-id: e5dd847dcf92f14b490f84e9ee79291698b85ffa
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37027
The RPC timeout passed into rpc_sync and rpc_async is now a float after the
change below in this stack, so we should make these APIs consistent.
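As a usage sketch of the now-consistent float timeouts (worker name and values are illustrative):
```
import torch
import torch.distributed.rpc as rpc

# Per-call timeouts, in seconds, as floats.
ret = rpc.rpc_sync("worker1", torch.add, args=(torch.ones(2), 1), timeout=5.0)
fut = rpc.rpc_async("worker1", torch.add, args=(torch.ones(2), 1), timeout=0.5)

# Backend-wide default timeout, also a float in seconds.
opts = rpc.ProcessGroupRpcBackendOptions(rpc_timeout=30.0)
```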
ghstack-source-id: 102971906
Test Plan:
Existing unit tests; also added a unit test that sets a specific timeout
in ProcessGroupRpcBackendOptions and exercises the rpc backend options dispatch handling.
Differential Revision: D21125171
fbshipit-source-id: a5894b8ce31d2926f2c3d323d1cda4d54b30cef1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34081
Before this commit, applications had to do the following to configure the
number of threads in the ProcessGroup RPC backend:
```
op = ProcessGroupRpcBackendOptions()
op.rpc_timeout = rpc_timeout
op.init_method = init_method
op.num_send_recv_threads = 32
init_rpc(...., rpc_backend_options=op)
```
After this commit, it can be simplified to:
```
init_rpc(...., rpc_backend_options=ProcessGroupRpcBackendOptions(num_send_recv_threads=32))
```
Fixes #34075
Test Plan: Imported from OSS
Differential Revision: D20227344
Pulled By: mrshenli
fbshipit-source-id: def4318e987179b8c8ecca44d7ff935702c8a6e7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32957
Closes https://github.com/pytorch/pytorch/issues/29703. If there is a
gloo timeout and `recvWork->wait()` times out in `listenLoop()`,
ProcessGroupAgent crashes since there is an unhandled exception in a thread.
This commit catches the exception and exits the listen loop. In a follow-up
diff, we will enhance these error conditions so that if users attempt to send
RPCs again, they are notified that the RPC agent was in a bad state and was
shut down.
This PR also adds a new option, `processGroupTimeout`, to the PG agent's
backend options, which allows us to control the gloo timeout.
ghstack-source-id: 98236783
Test Plan: Added a unit test.
Differential Revision: D19678979
fbshipit-source-id: 3895ae754f407b84aca76c6ed3cb087d19178c40
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30208
Adds a default value for `init_method` so users don't have to pass it in,
and moves it to the `RpcBackendOptions` struct. Removes the `init_method` arg from `rpc.init_rpc`. Also fixes some docs.
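As a usage sketch of the simplified call, assuming the default rendezvous falls back to the env:// style used elsewhere in these test plans (so MASTER_ADDR/MASTER_PORT must be set):
```
import os
import torch.distributed.rpc as rpc

os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "29500"

# init_method no longer needs to be passed to init_rpc; to override it,
# set it on the backend options instead.
rpc.init_rpc("worker0", rank=0, world_size=2)
rpc.shutdown()
```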
ghstack-source-id: 94500475
Test Plan: Unit tests pass.
Reviewed By: mrshenli
Differential Revision: D18630074
fbshipit-source-id: 04b7dd7ec96f4c4da311b71d250233f1f262135a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30201
Provide a default constructor so that users don't have to construct the
RPC agent options themselves. Also rename this to `RpcBackendOptions`, as suggested.
ghstack-source-id: 94411768
Test Plan: Unit tests pass.
Differential Revision: D18628698
fbshipit-source-id: 81fb45f124ad1006e628f6045162308093c9d446
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30243
Before this commit, the rpc docs showed init_rpc as the following:
```
torch.distributed.rpc.init_rpc(
    name,
    backend=<BackendType.PROCESS_GROUP: BackendValue(
        construct_rpc_agent_options_handler=<function _process_group_construct_rpc_agent_options_handler>,
        init_backend_handler=<function _process_group_init_backend_handler>)>,
    init_method=None,
    rank=-1,
    world_size=None,
    rpc_agent_options=None
)
```
It unnecessarily leaks implementation details. This commit adds a
`__repr__` function to the BackendType Enum class to address this problem.
Closes #29905
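A minimal sketch of the technique (the enum member, its value, and the repr format below are illustrative, not the exact ones in the commit):
```
from collections import namedtuple
from enum import Enum

BackendValue = namedtuple(
    "BackendValue", ["construct_rpc_backend_options_handler", "init_backend_handler"]
)

class BackendType(Enum):
    PROCESS_GROUP = BackendValue(lambda *a, **kw: None, lambda *a, **kw: None)

    def __repr__(self):
        # Show a short, stable name instead of dumping the handler functions.
        return f"BackendType.{self.name}"

# repr(BackendType.PROCESS_GROUP) -> "BackendType.PROCESS_GROUP",
# which is what now shows up in the init_rpc signature in the docs.
```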
Test Plan: Imported from OSS
Differential Revision: D18641559
Pulled By: mrshenli
fbshipit-source-id: 19bf8a2d21c8207f026d097d8e3f077578d53106
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30093
https://github.com/pytorch/pytorch/pull/28226 introduced the `worker_to_id` arg to the `init_rpc` function for other `RpcAgent`s, while it's not really used by `ProcessGroupAgent`. Cleanup is wanted for this, as described in https://github.com/pytorch/pytorch/issues/29031.
To accommodate the differences between `RpcAgent`s, this adds an `RpcAgentOptions` base class, which allows leveraging inheritance to add extra fields.
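A minimal sketch of the inheritance pattern described above, written as plain Python dataclasses rather than the actual pybind-backed classes; names and fields are illustrative:
```
from dataclasses import dataclass

@dataclass
class RpcAgentOptions:
    # Fields shared by every agent.
    rpc_timeout_seconds: float = 60.0
    init_method: str = "env://"

@dataclass
class ProcessGroupAgentOptions(RpcAgentOptions):
    # Agent-specific extras live on the subclass, so init_rpc can accept a
    # single options argument regardless of which agent is selected.
    num_send_recv_threads: int = 4

opts = ProcessGroupAgentOptions(num_send_recv_threads=16)
```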
ghstack-source-id: 94197295
Test Plan:
### OSS RPC + RRef tests
```
buck test mode/dev-nosan //caffe2/test:rpc_fork
```
```
buck test mode/dev-nosan caffe2/torch/fb/distributed/thriftRpcBackend/test:thrift_rpc_fork_test -- test_sync_rpc
```
### Prototype RRef tests
```
buck test mode/dev-nosan caffe2/torch/fb/distributed/pytorch/tests:test_rpc
```
```
buck test mode/dev-nosan //caffe2/torch/fb/distributed/pytorch/tests:test_rpc_thrift_rpc_agent
```
### Dist autograd
```
buck test mode/dev-nosan caffe2/test:dist_autograd_fork
```
```
buck test mode/dev-nosan caffe2/torch/fb/distributed/thriftRpcBackend/test:thrift_dist_autograd_fork_test
```
Differential Revision: D18595578
fbshipit-source-id: 616fca3b844c171ed5277bbc6a2b1693bc3a8065
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29762
Rename this API as discussed, since its use cases extend beyond
model parallelism.
ghstack-source-id: 94020627
Test Plan: Unit tests pass
Differential Revision: D18491743
fbshipit-source-id: d07676bb14f072c64da0ce99ee818bcc582efc57
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29341
So that other `RpcAgent`s can use this timeout setting as well.
ghstack-source-id: 93481902
Differential Revision: D5681951
fbshipit-source-id: 569c768dc342e8a2d9faf142ceccf696e12e41dc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28392
Per #25531, we want to clean up futures when we detect failures or timeouts.
As a first step, this diff adds timers to the future object, provides
functionality to check whether a future has timed out, and allows
specifying the timeout when initializing RPC. A future diff will check for these timeouts and mark the future as completed with an exception indicating that it has timed out.
ghstack-source-id: 93192622
Test Plan: Added unit tests.
Differential Revision: D18025163
fbshipit-source-id: 195fb50c736caf5c7b2bada9a5f6116bb106ed33