Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30330
This is now possible due to previous changes made in `gloo` and `ProcessGroupGloo`. We `abort` the listener thread that is waiting for a message, and join all other threads. The API is changed so that the previous `wait_all_workers` does not destroy the agent, and this is now done in a new `shutdown` method. All callsites are updated appropriately.
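The abort-and-join sequence can be sketched in pure Python, with a queue sentinel standing in for the gloo receive that the real agent aborts (MiniAgent and its members are illustrative names, not the actual ProcessGroupAgent internals):

```python
import threading
import queue

_ABORT = object()  # sentinel playing the role of the gloo abort

class MiniAgent:
    def __init__(self):
        self._inbox = queue.Queue()
        self._listener = threading.Thread(target=self._listen)
        self._listener.start()

    def _listen(self):
        while True:
            msg = self._inbox.get()  # blocks, like the recv being aborted
            if msg is _ABORT:
                return               # the abort unblocks us and exits the loop
            # ... handle msg ...

    def shutdown(self):
        # "abort" the blocking wait, then join the listener thread
        self._inbox.put(_ABORT)
        self._listener.join()

agent = MiniAgent()
agent.shutdown()
```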
ghstack-source-id: 94673884
Test Plan: Unit tests pass.
Reviewed By: mrshenli
Differential Revision: D18661775
fbshipit-source-id: 5aaa7c14603e18253394224994f6cd43234301c2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30217
Before this commit, RRefContext throws an error if it detects any
RRef leak during shutdown. However, this requires applications to
make sure that they have freed all references to RRefs in application
code, which can make for a bad debugging experience in large
applications. Besides, this also relies on Python GC to free things
up in time, which might not always happen. After this commit,
RRefContext ignores leaked RRefs during shutdown, as shutdown
is called when the application has finished training and no longer
cares about local state. Hence, it should be OK to just ignore
those leaks and destroy the OwnerRRefs. If an application would like to
enforce that there are no leaks, set torch.distributed.rpc.api._ignore_rref_leak
to False.
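A minimal sketch of the toggle's behavior; LeakChecker is a made-up stand-in (the real flag lives at torch.distributed.rpc.api._ignore_rref_leak):

```python
class LeakChecker:
    def __init__(self, ignore_rref_leak=True):
        self.ignore_rref_leak = ignore_rref_leak
        self.live_rrefs = set()  # OwnerRRefs still referenced at shutdown

    def shutdown(self):
        leaked = set(self.live_rrefs)
        if leaked and not self.ignore_rref_leak:
            # strict mode: surface the leak to the application
            raise RuntimeError(f"Leaked {len(leaked)} RRefs at shutdown")
        # default: ignore the leaks and destroy the OwnerRRefs
        self.live_rrefs.clear()

checker = LeakChecker()                 # default: ignore leaks
checker.live_rrefs.add("OwnerRRef(0)")  # simulate a leaked RRef
checker.shutdown()                      # no error; leaked RRef destroyed
```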
Test Plan: Imported from OSS
Differential Revision: D18632546
Pulled By: mrshenli
fbshipit-source-id: 2744b2401dafdd16de0e0a76cf8e07777bed0f38
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30208
Adds a default arg for init_method so users don't have to pass it in,
and moves it to the `RpcBackendOptions` struct. Removes the `init_method` arg from `rpc.init_rpc`. Also fixes some docs.
ghstack-source-id: 94500475
Test Plan: Unit tests pass.
Reviewed By: mrshenli
Differential Revision: D18630074
fbshipit-source-id: 04b7dd7ec96f4c4da311b71d250233f1f262135a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30241
We need an API to get all worker infos. This will be used by backend-agnostic `rpc.wait_all_workers()` API.
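A hypothetical sketch of how a backend-agnostic `wait_all_workers()` might consume the new API; MockAgent and the ping callback are illustrative stand-ins, not the real agent interface:

```python
class MockAgent:
    """Stand-in for an RpcAgent exposing get_worker_infos()."""
    def __init__(self, names):
        self._infos = [{"name": n, "id": i} for i, n in enumerate(names)]

    def get_worker_infos(self):
        return list(self._infos)

def wait_all_workers(agent, ping):
    # one synchronization round-trip per known worker, regardless of backend
    return [ping(info["name"]) for info in agent.get_worker_infos()]

agent = MockAgent(["worker0", "worker1"])
acks = wait_all_workers(agent, lambda name: f"ack:{name}")
```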
ghstack-source-id: 94454935
Test Plan:
# Unit tests
```
buck test mode/dev-nosan //caffe2/test:rpc_fork -- test_get_worker_infos
buck-out/gen/caffe2/test/rpc_fork\#binary.par -r test_get_worker_infos
```
```
buck test mode/dev-nosan //caffe2/test:rpc_fork_thrift -- test_get_worker_infos
buck-out/gen/caffe2/test/rpc_fork_thrift\#binary.par -r test_get_worker_infos
```
Differential Revision: D5693412
fbshipit-source-id: 5123c8248b6d44fd36b8a5f381dbabb2660e6f0f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30020
This is now possible due to previous changes made in `gloo` and `ProcessGroupGloo`. We `abort` the listener thread that is waiting for a message, and join all other threads. The destructor calls this same `localShutdown` method, but we ensure this is not called multiple times.
ghstack-source-id: 94415336
Test Plan: Unit tests pass.
Differential Revision: D5578006
fbshipit-source-id: 6258879fb44c9fca97fdfad64468c1488c16ac02
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30201
Provide a default constructor so that users don't have to construct
RPC agent options. Also rename this to `RpcBackendOptions` as suggested.
ghstack-source-id: 94411768
Test Plan: Unit tests pass.
Differential Revision: D18628698
fbshipit-source-id: 81fb45f124ad1006e628f6045162308093c9d446
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30261
With #29827, the flakiness should disappear for test_call_method_on_rref
Test Plan: Imported from OSS
Differential Revision: D18645036
Pulled By: mrshenli
fbshipit-source-id: 44d759062fc78b1a797266096dbb4ddd104f07eb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30050
Renames this API to wait_all_workers as discussed.
ghstack-source-id: 94273005
Test Plan: Unit tests pass
Differential Revision: D18581466
fbshipit-source-id: 4ff5d5fb2d528f17252d5b5f30c3047d2efb92bf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29930
Right now, rethrowing of remote exceptions from Python calls is coupled with deserialization.
For an owner RRef, setValue() and getValue() do not use serialization and deserialization, so when a user creates a ref to itself and calls ownerRef.to_here(), the remote exception from the Python call is not rethrown.
This diff moves the remote exception rethrow out of deserialization, so the exception can be handled for both ownerRef.localValue() and ownerRef.to_here().
Closes #29924
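The decoupling can be sketched like this: both the serde path (to_here) and the non-serde path (local value access) funnel through one rethrow helper. All names here are illustrative Python stand-ins, not the real implementation:

```python
class RemoteException(Exception):
    """Stand-in for an exception captured on the remote side."""

def _rethrow_if_exception(value):
    # rethrow lives here, independent of how the value arrived
    if isinstance(value, RemoteException):
        raise value
    return value

class OwnerRRef:
    def __init__(self, value):
        self._value = value

    def local_value(self):
        # no serialization on this path, but errors still surface
        return _rethrow_if_exception(self._value)

    def to_here(self):
        # serde round-trip elided; the same rethrow applies
        return _rethrow_if_exception(self._value)

ok = OwnerRRef(7)
bad = OwnerRRef(RemoteException("error on remote worker"))
```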
ghstack-source-id: 94210894
Test Plan: unit tests
Differential Revision: D18541916
fbshipit-source-id: 7cda93f623d52c740b3c1b1fa9a442f866984340
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30093
https://github.com/pytorch/pytorch/pull/28226 introduced the `worker_to_id` arg to the `init_rpc` function for other `RpcAgent`s, but it is not actually used by `ProcessGroupAgent`. Cleanup is wanted for this, as described in https://github.com/pytorch/pytorch/issues/29031.
To accommodate the differences between `RpcAgent`s, add an `RpcAgentOptions` base class, which allows leveraging inheritance to add extra fields.
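The inheritance pattern can be sketched with dataclasses; the field names below (including num_send_recv_threads and the "env://" default) are illustrative assumptions rather than the exact struct layout:

```python
from dataclasses import dataclass

@dataclass
class RpcAgentOptions:
    # fields common to every agent; defaults are assumed for illustration
    rpc_timeout: float = 60.0
    init_method: str = "env://"

@dataclass
class ProcessGroupRpcAgentOptions(RpcAgentOptions):
    # agent-specific extras live on the subclass
    num_send_recv_threads: int = 4

opts = ProcessGroupRpcAgentOptions(num_send_recv_threads=8)
```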
ghstack-source-id: 94197295
Test Plan:
### OSS RPC + RRef tests
```
buck test mode/dev-nosan //caffe2/test:rpc_fork
```
```
buck test mode/dev-nosan caffe2/torch/fb/distributed/thriftRpcBackend/test:thrift_rpc_fork_test -- test_sync_rpc
```
### Prototype RRef tests
```
buck test mode/dev-nosan caffe2/torch/fb/distributed/pytorch/tests:test_rpc
```
```
buck test mode/dev-nosan //caffe2/torch/fb/distributed/pytorch/tests:test_rpc_thrift_rpc_agent
```
### Dist autograd
```
buck test mode/dev-nosan caffe2/test:dist_autograd_fork
```
```
buck test mode/dev-nosan caffe2/torch/fb/distributed/thriftRpcBackend/test:thrift_dist_autograd_fork_test
```
Differential Revision: D18595578
fbshipit-source-id: 616fca3b844c171ed5277bbc6a2b1693bc3a8065
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30092
There is duplicated code across components that rely on RpcAgent. Extract it into a reusable test fixture class.
ghstack-source-id: 94196891
Test Plan:
### RPC + RRef
```
buck test mode/dev-nosan //caffe2/test:rpc_fork
buck test mode/dev-nosan //caffe2/test:rpc_spawn
```
```
buck test mode/dev-nosan //caffe2/test:rpc_fork_thrift
buck test mode/dev-nosan //caffe2/test:rpc_spawn_thrift
```
### Dist Autograd
```
buck test mode/dev-nosan //caffe2/test:dist_autograd_fork
buck test mode/dev-nosan //caffe2/test:dist_autograd_spawn
```
```
buck test mode/dev-nosan //caffe2/test:dist_autograd_fork_thrift
buck test mode/dev-nosan //caffe2/test:dist_autograd_spawn_thrift
```
### Dist Optimizer
```
buck test mode/dev-nosan //caffe2/test:dist_optimizer_fork
buck test mode/dev-nosan //caffe2/test:dist_optimizer_spawn
```
```
buck test mode/dev-nosan //caffe2/test:dist_optimizer_fork_thrift
buck test mode/dev-nosan //caffe2/test:dist_optimizer_spawn_thrift
```
Differential Revision: D18595408
fbshipit-source-id: 8360759c63e838fb19d4eb1aeacca0bf8eb4b55f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30100
Since, after #29827, we only test RPC using spawn, the
multi-thread/fork error should disappear.
Test Plan: Imported from OSS
Differential Revision: D18597002
Pulled By: mrshenli
fbshipit-source-id: 64aa6a59248e5d1b7e1ad1aebffb6a25248388d2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30099
Since, after #29827, we only test RPC using spawn, the
multi-thread/fork error should disappear.
Test Plan: Imported from OSS
Differential Revision: D18597003
Pulled By: mrshenli
fbshipit-source-id: ebfb1f6f3f961d98351e06ce4b951793a9b95398
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30098
Since, after #29827, we only test RPC using spawn, the
multi-thread/fork error should disappear.
Test Plan: Imported from OSS
Differential Revision: D18597001
Pulled By: mrshenli
fbshipit-source-id: 68256289085fac1a9ca76d5b4882e97e2f81d1f4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29601
Follow up from https://github.com/pytorch/pytorch/pull/28392. Adds a background thread to `ProcessGroupAgent` that polls for timed out RPCs at a pre-set interval, and marks them as completed with a timeout exception if they have timed out. Also deletes the futures from the corresponding maps `futures_` and `futureTimeouts`. Unit tests are added to ensure that timed out RPCs are appropriately cleaned up.
Also adds a `shutdown` variable to process group agent to control the shutting down of this background thread, which can eventually be extended to use for controlling a clean shutdown of process group agent.
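A pure-Python sketch of that watchdog: a background thread polls at a fixed interval, marks expired futures with a timeout error, deletes them from the tracking map, and exits when the shutdown flag is set. All names are illustrative, not the real C++ members:

```python
import threading
import time

class FutureMap:
    def __init__(self, poll_interval=0.01):
        self._lock = threading.Lock()
        self._futures = {}  # future id -> state holder
        self._shutdown = False
        self._watchdog = threading.Thread(target=self._poll, args=(poll_interval,))
        self._watchdog.start()

    def add(self, fid, timeout):
        with self._lock:
            self._futures[fid] = {"deadline": time.monotonic() + timeout, "error": None}
            return self._futures[fid]

    def _poll(self, interval):
        while not self._shutdown:
            now = time.monotonic()
            with self._lock:
                expired = [f for f, s in self._futures.items() if s["deadline"] <= now]
                for fid in expired:
                    # mark completed with a timeout exception, then drop it
                    self._futures[fid]["error"] = TimeoutError(f"RPC {fid} timed out")
                    del self._futures[fid]
            time.sleep(interval)

    def shutdown(self):
        # the shutdown flag also stops the watchdog cleanly
        self._shutdown = True
        self._watchdog.join()

fm = FutureMap()
holder = fm.add("f1", timeout=0.05)
time.sleep(0.2)  # let the deadline pass and the watchdog run
fm.shutdown()
```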
ghstack-source-id: 94175131
Test Plan: Added unit tests
Differential Revision: D18434215
fbshipit-source-id: c48abdb8759fe1447200ec66bb9d4b1c50ec4535
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29958
DistributedOptimizer relies on hashing WorkerInfo in order to coalesce fan-out RPCs. This will likely be a very common use case (EASGD will do the same, for example).
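The coalescing idea can be sketched without torch at all: group per-parameter RPCs by their owner's WorkerInfo, which requires WorkerInfo to be hashable. The namedtuple below is a stand-in for the real class:

```python
from collections import defaultdict, namedtuple

# hashable by construction, standing in for rpc.WorkerInfo
WorkerInfo = namedtuple("WorkerInfo", ["name", "id"])

def coalesce_by_owner(param_rrefs):
    """Group (owner, parameter) pairs into one bucket per owner worker."""
    grouped = defaultdict(list)
    for owner, param in param_rrefs:
        grouped[owner].append(param)  # hashing WorkerInfo happens here
    return grouped

w0, w1 = WorkerInfo("trainer0", 0), WorkerInfo("ps1", 1)
groups = coalesce_by_owner([(w0, "p0"), (w1, "p1"), (w0, "p2")])
```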
ghstack-source-id: 94169198
Test Plan: unit test.
Differential Revision: D18548257
fbshipit-source-id: 7d67d4e1b9bc60403c372164982a75ae8c1d8389
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30033
Removing this API for now since we don't have a concrete use case for
it yet, and exposing it as a public API might result in users
depending on it.
We can always add some variant of this API back if needed later.
ghstack-source-id: 94138302
Test Plan: waitforbuildbot
Differential Revision: D18578056
fbshipit-source-id: 078c62331725e03bd5702624afc16b1cdcdf26a4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29747
There is duplicated code across components that rely on RpcAgent. Extract it into a reusable test fixture class.
Test Plan:
### RPC + RRef
```
buck test mode/dev-nosan //caffe2/test:rpc_fork
buck test mode/dev-nosan //caffe2/test:rpc_spawn
```
```
buck test mode/dev-nosan //caffe2/test:rpc_fork_thrift
buck test mode/dev-nosan //caffe2/test:rpc_spawn_thrift
```
### Dist Autograd
```
buck test mode/dev-nosan //caffe2/test:dist_autograd_fork
buck test mode/dev-nosan //caffe2/test:dist_autograd_spawn
```
```
buck test mode/dev-nosan //caffe2/test:dist_autograd_fork_thrift
buck test mode/dev-nosan //caffe2/test:dist_autograd_spawn_thrift
```
### Dist Optimizer
```
buck test mode/dev-nosan //caffe2/test:dist_optimizer_fork
buck test mode/dev-nosan //caffe2/test:dist_optimizer_spawn
```
```
buck test mode/dev-nosan //caffe2/test:dist_optimizer_fork_thrift
buck test mode/dev-nosan //caffe2/test:dist_optimizer_spawn_thrift
```
Differential Revision: D5689636
fbshipit-source-id: f35eea1359addaaac9bd8d00d0a5df228a236511
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29762
Rename this API as discussed, since its use cases extend beyond
model parallelism.
ghstack-source-id: 94020627
Test Plan: Unit tests pass
Differential Revision: D18491743
fbshipit-source-id: d07676bb14f072c64da0ce99ee818bcc582efc57
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29634
This implementation supports rpc.remote calls to self by doing the
following steps:
1. Create an owner RRef.
2. Add the owner RRef to owners_ in RRefContext, and keep it alive
by using the RRefId as the ForkId.
3. Go through serde and insert the message into the caller's thread pool.
4. When the response message gets processed, remove the RRef from
the RRef fork map.
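The bookkeeping in those steps can be sketched in plain Python (the class and field names mirror the description but are illustrative, not the real C++ RRefContext):

```python
class RRefContext:
    def __init__(self):
        self.owners = {}  # rref_id -> owned value
        self.forks = {}   # rref_id -> set of fork ids keeping it alive

    def remote_to_self(self, rref_id, func, *args):
        # Steps 1-2: create the owner RRef and keep it alive by using
        # the RRefId itself as the ForkId.
        self.owners[rref_id] = None
        self.forks.setdefault(rref_id, set()).add(rref_id)
        # Step 3: serde and self-enqueue elided; run the call locally.
        self.owners[rref_id] = func(*args)
        # Step 4: processing the response removes the self-fork, so the
        # RRef now lives only as long as user references keep it alive.
        self.forks[rref_id].discard(rref_id)
        return self.owners[rref_id]

ctx = RRefContext()
result = ctx.remote_to_self("rref0", lambda x: x + 1, 41)
```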
Test Plan: Imported from OSS
Differential Revision: D18445812
Pulled By: mrshenli
fbshipit-source-id: e3b9aa98962c388acbc2ce294101a236d5cb2da6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28948
Add the constructor RRef(value) in Python. This allows wrapping a local object in an RRef and passing or returning this RRef to users.
This enables returning, for example, a list of RRefs containing the parameters of a module to the user of the module.
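A minimal sketch of that usage; LocalRRef is a stand-in for torch.distributed.rpc.RRef, which needs an initialized RPC framework to construct:

```python
class LocalRRef:
    """Stand-in for rpc.RRef(value): wraps a local object."""
    def __init__(self, value):
        self._value = value

    def local_value(self):
        return self._value

def parameter_rrefs(params):
    # e.g. called by a remote module to expose its parameters as RRefs
    return [LocalRRef(p) for p in params]

rrefs = parameter_rrefs(["weight", "bias"])
```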
ghstack-source-id: 93565010
Test Plan: unit test.
Differential Revision: D18241227
fbshipit-source-id: b9e9b958f40623348d62ee6fc9e7f0414b4215b7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29485
The flakiness is likely due to the problem with OMP and fork. We
should disable fork tests for good, but that would have a negative
impact on internal test coverage. This commit disables the most
buggy nested tests for now, until we find a way to turn fork test
off.
Test Plan: Imported from OSS
Differential Revision: D18407529
Pulled By: mrshenli
fbshipit-source-id: dcbe49a9d104fcf1eaf83107d58904d49dc18aff
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29253
Some operations can be simpler if a worker can send an RPC to itself.
The main reason for not supporting this previously was that Gloo doesn't
support sending to self.
This change makes the process_group_agent skip the assert
check and simply enqueue the RPC message in its own receiving queue.
ghstack-source-id: 93518076
Test Plan: buck test mode/dev-nosan caffe2/test/...
Differential Revision: D18339715
fbshipit-source-id: 08ade40e81da378b003a550c898a726e99d50e34
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29069
Distributed autograd was initialized after RPC, and this could cause a
race in some scenarios where one node has initialized distributed
autograd and calls backward(), but other nodes have not initialized
distributed autograd yet.
Moving this before `_init_rpc` fixes the problem since `_init_rpc` implicitly
has a sync between processes via the store.
ghstack-source-id: 93535922
Test Plan: waitforbuildbot
Differential Revision: D18280875
fbshipit-source-id: 739a1c22dec21df859738d074e6e497fa43257fd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29396
The return types of RRef.to_here()/local_value() were recently
changed to Future, which triggers flakiness as the RRef could be
deleted before the future.wait() finishes. While we are still
discussing how we'd like to solve it, this commit reverts the
return type to stop the bleeding in tests.
Closes #28885
Test Plan: Imported from OSS
Differential Revision: D18375571
Pulled By: mrshenli
fbshipit-source-id: 354dbf38b15ab804e44fc9968dd30888415c1fab
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29341
So that other RpcAgents can use this timeout setting as well.
ghstack-source-id: 93481902
Differential Revision: D5681951
fbshipit-source-id: 569c768dc342e8a2d9faf142ceccf696e12e41dc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29157
As reported, these tests are flaky and time out. Skip them
while we investigate further.
ghstack-source-id: 93287663
Test Plan: CI
Differential Revision: D18309204
fbshipit-source-id: 95f0ea5e0c1162b78da412a34db446a01dfc33bf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29139
Each test has a 100-second timeout.
Currently this test takes 90-110 seconds to finish, causing flakiness.
Halve the load so the test is not on the edge of timing out.
ghstack-source-id: 93203670
Differential Revision: D5644012
fbshipit-source-id: 2a85999cf1ae6d18e9a871cd76ce194e1ce7b3e8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28392
Per #25531, we want to clean up futures when we detect that there are
failures/timeouts. As a first step, this diff adds timers to the future object,
provides functionality to check if a future is timed out, and allows
specification of the timeout when initializing rpc. A future diff will check for these timeouts and mark the future completed with an exception indicating that it has timed out.
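The timer bookkeeping can be sketched as follows; TimedFuture is an illustrative stand-in, not the real future type:

```python
import time

class TimedFuture:
    def __init__(self, timeout_secs):
        # record creation time so a later check can detect expiry
        self.start = time.monotonic()
        self.timeout_secs = timeout_secs

    def timed_out(self, now=None):
        # the functionality a future diff will use to mark this future
        # completed with a timeout exception
        now = time.monotonic() if now is None else now
        return (now - self.start) > self.timeout_secs

fut = TimedFuture(timeout_secs=60.0)
```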
ghstack-source-id: 93192622
Test Plan: Added unit tests.
Differential Revision: D18025163
fbshipit-source-id: 195fb50c736caf5c7b2bada9a5f6116bb106ed33
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28909
This allows chaining calls on an RRef, as exemplified in the newly added test case.
ghstack-source-id: 92996018
Test Plan: unit test.
Differential Revision: D18231081
fbshipit-source-id: deeac044ef6d63f18ea241760ac17a3e644cb3d7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28934
These tests are flaky; skip them while we investigate the root cause.
ghstack-source-id: 92945898
Test Plan: tests pass
Differential Revision: D18235766
fbshipit-source-id: 9bff65653954b767e32bcc1d25c65b0cea2c4331
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27940
1) If we receive an error for outstanding rpcs, we enqueue an appropriate error
on the local autograd engine.
2) Add an `exit_on_error` mode for the local autograd engine, where the
computation stops if we see an error.
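The exit_on_error semantics can be sketched outside the engine: once a task errors, the error is recorded as the result and the remaining computation is skipped. This is a plain-Python illustration, not the real autograd engine:

```python
def run_tasks(tasks, exit_on_error=True):
    """Run callables in order; in exit_on_error mode, stop at the first error."""
    results = []
    for task in tasks:
        try:
            results.append(task())
        except Exception as exc:
            # enqueue the error as the result, mirroring point 1) above
            results.append(exc)
            if exit_on_error:
                break  # computation stops once we see an error
    return results

out = run_tasks([lambda: 1, lambda: 1 / 0, lambda: 3])
```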
ghstack-source-id: 92603377
Test Plan: Added unit tests to test failures.
Differential Revision: D17916844
fbshipit-source-id: 199a7832f1033c36a9bbcc1e80d86576c04965d0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28606
Without passing setup_model_parallel=True to dist_init, the
decorator actually takes the function object as the value for the
flag.
Test Plan: Imported from OSS
Differential Revision: D18120507
Pulled By: mrshenli
fbshipit-source-id: afbaa381647e8f284e28fa9dbdd2a7c411073b3f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28226
# Goal
Rendezvous should be the first step not only for `init_process_group` but also for `init_model_parallel`.
The roadblock is that there is a special step in `init_process_group` where the arguments `rank` and `world_size` passed to `init_process_group(..)` are appended to the `init_method` URL string.
We need to make this argument-appending step common and reusable for both `init_process_group` and `init_model_parallel`.
# Solution
- Put argument appending inside of `rendezvous` function.
- Remove manual `init_method` url construction. Delegate the responsibility to the `rendezvous` function.
- Use the `rendezvous` function for any `RpcAgent`.
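The appending step being moved inside rendezvous() can be sketched as a URL transformation (illustrative only; the real logic lives in torch.distributed.rendezvous):

```python
def append_rendezvous_args(init_method, rank, world_size):
    """Fold rank/world_size arguments into the init_method URL."""
    sep = "&" if "?" in init_method else "?"
    return f"{init_method}{sep}rank={rank}&world_size={world_size}"

url = append_rendezvous_args("tcp://localhost:29500", rank=0, world_size=2)
```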
Test Plan:
```
buck test mode/dev-nosan caffe2/test:c10d
```
```
buck test mode/dev-nosan caffe2/test:rpc_fork -- test_invalid_names
buck-out/gen/caffe2/test/rpc_fork\#binary.par -r test_worker_id
```
```
buck test mode/dev-nosan caffe2/torch/fb/distributed/pytorch/tests:test_rpc -- test_sync_rpc
```
```
buck test mode/dev-nosan caffe2/torch/fb/rendezvous:zeus_test
```
```
buck test mode/dev-nosan //caffe2/torch/fb/distributed/modules/tests:test_sharded_pairwise_attention_pooling -- test_single_trainer_multiple_pss
```
Differential Revision: D5524494
fbshipit-source-id: 50be58ec3c928621b0874b044ef4a1640534d8ef
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28025
Add a PyFuture type which is a wrapper of either an OwnerRRef or a
jit::Future. The difference between PyFuture and jit::Future is that
PyFuture can return a custom py::object type.
Test Plan: Imported from OSS
Differential Revision: D17936746
Pulled By: mrshenli
fbshipit-source-id: a7451af3993d98aeab462ffd5318fc6d28f915c8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27776
I think it's not worth equipping other `RpcAgent`s with collective communication capability, i.e. 1) having Gloo contained in `RpcAgent`, or 2) implementing ::barrier() and ::drain() on top of RPC messaging.
The only use case that does not have a master is the OSS unit test suite, caffe2/test/rpc_test.py.
I think having those unit tests use a master is simpler than equipping `RpcAgent` with collective communication capability.
Differential Revision: D5445858
fbshipit-source-id: 56ee24703abd8c5b366829430bef657e0f1dfeba
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28013
ProcessGroupAgent currently kicks off the listener thread in its
constructor. However, serving requests requires contexts to be
initialized, e.g., RRefContext and the agent_ global var in api.py,
which might not be done yet when the first request arrives.
ProcessGroupAgent does not know what the appropriate time to start
the listener thread would be, hence this exposes an API for higher-layer
code to explicitly start the listener.
Test Plan: Imported from OSS
Differential Revision: D17932271
Pulled By: mrshenli
fbshipit-source-id: 3b408477594d4d19319e7cd08dd6f383a7ed7670
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27761
# Problem
`rpc_test` currently only has test cases that put an equal amount of work on every worker node.
The problem is that even if `RpcAgent::sync` is implemented as an empty method, no termination misbehavior is detected.
# Solution
At least add one imbalanced-load test.
ghstack-source-id: 91785984
Differential Revision: D5361435
fbshipit-source-id: 92d1f7cad61b27cdeadc2825ceab6e88d5e4b459