Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40461
It turned out that `:inherited-members:` (see [doc](https://www.sphinx-doc.org/en/master/usage/extensions/autodoc.html#directive-autoclass)) is not really usable, because pybind11 generates docstrings that annotate `self` with the parent-class type, `rpc.PyRRef`.
As a workaround, I am pulling the docstrings of the parent class, `PyRRef`, into the subclass, `RRef`, and doing surgery on the docstrings generated by pybind11.
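A minimal Python sketch of that approach, using a stand-in for the pybind11-generated class (class, method, and docstring below are illustrative, not the real ones):
```python
import inspect

class PyRRef:  # stand-in for the pybind11-generated parent class
    def to_here(self):
        """Blocking call that copies the value from the owner of this PyRRef."""
        return "value"

class RRef(PyRRef):
    pass

def _method_factory(name, docstring):
    def method(self, *args, **kwargs):
        return getattr(super(RRef, self), name)(*args, **kwargs)
    method.__doc__ = docstring
    return method

# Pull each public parent method into the subclass, patching the docstring
# that pybind11 generated so it names RRef instead of PyRRef.
for name, member in inspect.getmembers(PyRRef, predicate=inspect.isroutine):
    if name.startswith("_"):
        continue
    patched_doc = (member.__doc__ or "").replace("PyRRef", "RRef")
    setattr(RRef, name, _method_factory(name, patched_doc))

assert "PyRRef" not in RRef.to_here.__doc__
```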
ghstack-source-id: 106472496
Differential Revision: D7933834
fbshipit-source-id: c03a8a4c9d98888b64492a8caba1591595bfe247
Co-authored-by: Shihao Xu <shihaoxu@fb.com>
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40066
Builds on top of the previous PR to ensure that all remotely profiled events are prefixed with the key for the RPC that generated them.
The key is generated by `_build_rpc_profiling_key` in `rpc/internal.py` and prefixed onto the event name. In order to do this, we set the current key when creating the RPC in Python, retrieve the currently set key in C++, and save a GloballyUniqueId -> key mapping in an in-memory map. When we receive an RPC with profiling information, we expect to receive this ID back, and we look up the corresponding profiling key in the map.
The key is then added to all the remote events.
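A rough Python sketch of that bookkeeping (the real mapping lives in C++; the names and the `#` separator below are simplifications):
```python
import threading

_key_map_lock = threading.Lock()
_profiling_key_map = {}  # GloballyUniqueId -> profiling key

def save_key_for_rpc(globally_unique_id, profiling_key):
    # Called while the RPC is being created, after the current key is set.
    with _key_map_lock:
        _profiling_key_map[globally_unique_id] = profiling_key

def prefix_remote_events(globally_unique_id, remote_event_names):
    # Called when profiling info comes back with the same id attached.
    with _key_map_lock:
        key = _profiling_key_map.pop(globally_unique_id)
    return [f"{key}#{name}" for name in remote_event_names]
```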
Tested by adding tests to ensure the key is added to all the remote events. Also added a UT which tests this under a multi-threading scenario, to ensure that the mapping's correctness is maintained when several RPCs are in the process of being created at once.
ghstack-source-id: 106316106
Test Plan: Unit test
Differential Revision: D22040035
fbshipit-source-id: 9215feb06084b294edbfa6e03385e13c1d730c43
Summary:
Previously the module would log some data using `print()`. This can be
a problem when used in contexts where the process expects to write data to
stdout itself. This diff changes the log statements to use `logger` instead,
which makes them consistent with other log statements in the same module.
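For illustration, the pattern is the standard-library one (the message text below is invented):
```python
import logging

logger = logging.getLogger(__name__)

def instantiate(name):
    # Before: print("Instantiating remote module", name)  # pollutes stdout
    logger.info("Instantiating remote module %s", name)   # routed via logging
```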
Test Plan:
Confirmed nothing weird showed up when running:
buck test caffe2/test/distributed/nn/api:remote_module_fork
Differential Revision: D22136172
fbshipit-source-id: a3d144eba6c75925ed684981793c84b36eb45a5d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40222
Mention the TensorPipe agent in the RPC docs and give users the information they need to choose which agent to use.
ghstack-source-id: 106225711
Test Plan: Export to GitHub, build locally and try out the docs.
Differential Revision: D22116494
fbshipit-source-id: 30703ba8410c40f64e785f60d71dfd9faa8de4a1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38748
This diff contains the message scaffolding and profiler changes in order to be able to remotely run the profiler across different nodes and aggregate the results on a single node.
As discussed, we have implemented this by creating new message types that, similar to autograd messages, wrap the profiling information together with the original message and send this new message over the wire. On the receiving end, this wrapped message is detected, the original message is extracted from it, and the original message is processed with the profiler enabled. When sending a response with profiling information, we serialize the profiled `Event`s and send them back over RPC. When such a message is received, the events profiled on the remote node are stored (added back to the local profiler).
Changes in this PR:
- New message types (run_with_profiling_req, run_with_profiling_resp) to send profiling info over the wire. Message parsing logic is added to handle these wrapped types.
- Handling of sending profiler data over the wire, in particular, the attributes of the `ProfilerConfig` and the serialized profiled `Event`s
- The logic for wrapping RPC messages is deduped with that in `rpc_with_autograd`, and the common payload wrapping/unwrapping logic is moved to helper functions in `rpc/utils.cpp`
- Changes in `autograd/utils.cpp` to detect if we have enabled the profiler and are sending an RPC, if so, uses the above new message types
- Changes in request_callback to parse and turn on the profiler in a thread-local fashion
- Serialization and deserialization of profiling `Events`, and support to add the remote events to the thread-local profiler
- Introduction of the concept of `node_id`, which, as discussed with ilia-cher, will be used along with the `Event`'s handle attribute to distinguish between events. When there are events from different nodes, this node information is rendered in the profile output (e.g. when printing tables); otherwise it is not, since it is irrelevant.
- Some changes to profiler.cpp to add useful helper methods/guards
- toHere() is now profiled for RRefs
- Unittests
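A hedged Python sketch of the wrap/unwrap flow described above; the real implementation is C++ message types, and every name below is an illustrative stand-in:
```python
from dataclasses import dataclass
from typing import Any, List

@dataclass
class RunWithProfilingReq:
    profiler_config: dict  # serialized ProfilerConfig attributes
    wrapped: Any           # the original RPC message

@dataclass
class RunWithProfilingResp:
    events: List[str]      # serialized profiled Events
    wrapped: Any           # the original RPC response

def process_request(msg, handler):
    # Callee side: detect the wrapper, run the original message with the
    # (thread-local) profiler enabled, then ship the collected events back.
    if isinstance(msg, RunWithProfilingReq):
        events: List[str] = []  # stands in for the profiler's output
        result = handler(msg.wrapped)
        return RunWithProfilingResp(events, result)
    return handler(msg)

resp = process_request(RunWithProfilingReq({"state": "CPU"}, "add(1, 2)"), lambda m: m)
```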
ghstack-source-id: 106134626
Test Plan: Added unittests, existing profiler unittests.
Differential Revision: D19510010
fbshipit-source-id: 044347af992f19a9e3b357c9567f6fc73e988157
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40162
The only public option is `num_worker_threads`. The other ones are private (as indicated by the leading underscore; is that enough?) and allow specifying a different set and order of transports/channels. These can thus be used to disable a backend (by not specifying it) or to force one (by raising its priority). They can therefore be used to work around defective backends, in case we find any post-release.
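A usage sketch, assuming the private knobs keep the names implied above (leading underscore, so internal and subject to change):
```python
import torch.distributed.rpc as rpc

opts = rpc.TensorPipeRpcBackendOptions(
    num_worker_threads=16,     # the only public option
    _transports=["uv"],        # e.g. disable shm by omitting it
    _channels=["basic"],       # e.g. disable cma by omitting it
)
rpc.init_rpc("worker0", rank=0, world_size=2,
             backend=rpc.backend_registry.BackendType.TENSORPIPE,
             rpc_backend_options=opts)
```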
ghstack-source-id: 106103238
Test Plan: Built //caffe2:ifbpy and, using TensorPipe's verbose logging, verified that the transports/channels I specified were indeed the ones that were being registered.
Differential Revision: D22090661
fbshipit-source-id: 789bbe3bde4444cfa20c40276246e4ab67c50cd0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40173
- Avoid path sharing across runs and workers, so that even if test methods/workers run in parallel on the same host, they don't interfere with each other.
- In some environments (e.g. the fb internal CI platform), the torch package file tree is not writable. But the temporary folder chosen by the Python `tempfile` module is always writable; on Linux it's "/tmp".
Closes https://github.com/pytorch/pytorch/issues/40120
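For example, a unique per-run, per-worker folder can come from the standard `tempfile` module (the prefix below is illustrative):
```python
import tempfile

# Unique per call, so parallel workers/runs never share a path, and always
# under a writable location (e.g. /tmp on Linux) even when the torch
# package tree is read-only.
instantiated_dir = tempfile.mkdtemp(prefix="torch_remote_module_")
```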
ghstack-source-id: 106086340
Test Plan:
```
buck test mode/dev-nosan //caffe2/test/distributed/nn/jit:test_instantiator
buck build mode/dev-nosan //caffe2/test/distributed/nn/jit:test_instantiator && \
buck-out/gen/caffe2/test/distributed/nn/jit/test_instantiator\#binary.par -r test_instantiate_scripted_remote_module_template
buck build mode/dev-nosan //caffe2/test/distributed/nn/jit:test_instantiator && \
buck-out/gen/caffe2/test/distributed/nn/jit/test_instantiator\#binary.par -r test_instantiate_non_scripted_remote_module_template
```
```
buck test mode/dev-nosan //caffe2/test/distributed/nn/api:remote_module_fork
```
Differential Revision: D5708493
fbshipit-source-id: dd92695682433aaf79d1912c7956cef40a450eaf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39909
As described in https://github.com/pytorch/pytorch/issues/33583,
ProcessGroupAgent initializes the default process group, and this causes issues
if the user initializes the default process group themselves: either the RPC
initialization would fail or the user's process group initialization would
fail.
To avoid this, I've changed ProcessGroupAgent init to create its own
ProcessGroupGloo and not use the default one at all.
Closes: https://github.com/pytorch/pytorch/issues/33583
ghstack-source-id: 105953303
Test Plan: waitforbuildbot
Differential Revision: D22011868
fbshipit-source-id: 7346a3fcb2821a0bc08e0bdc0625947abb5ae16f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39535
This is my understanding of what could happen: on workerN (N != 0), `_wait_all_workers_sequence_id_to_states`, which is a `defaultdict`, is accessed twice: once in the body of `_wait_all_workers` (by the "main thread" of workerN) and once in `_set_proceed_shutdown_signal`, called by worker0 through an RPC call. I think the two could race and access `_wait_all_workers_sequence_id_to_states` at the same time, and thus create two separate copies of `WaitAllWorkersStates`. One of those threads would wait on the event of one copy, but the other thread would set the event of the other copy. This led to a deadlock, as the main thread would end up waiting forever.
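A minimal Python sketch of the race and the guarded-access fix (class and names simplified from the real `_wait_all_workers` code):
```python
import threading
from collections import defaultdict

class WaitAllWorkersStates:
    def __init__(self):
        self.proceed_signal = threading.Event()

_states_lock = threading.Lock()
_sequence_id_to_states = defaultdict(WaitAllWorkersStates)

def get_states(sequence_id):
    # Unguarded defaultdict access from two threads can each trigger
    # default-construction and hand back different instances; the Event one
    # thread sets is then not the Event the other waits on. The lock makes
    # both threads observe the same WaitAllWorkersStates instance.
    with _states_lock:
        return _sequence_id_to_states[sequence_id]
```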
ghstack-source-id: 105283327
Test Plan: I added additional logging in those functions, ran a stress test of the RPC test suite, based on the logs I suspected that this could be the issue, fixed it and re-run the stress test and didn't see the bug anymore. This is admittedly not very convincing evidence, as I may just have been lucky that second time...
Differential Revision: D21889752
fbshipit-source-id: 05ec710bd2930313e1480ae896b4b2f5f503aa17
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38590
This PR implements timeout semantics for RRef for parity with rpc_sync and rpc_async. How it works:
- A timeout parameter is added to `rpc.remote`. If the `rpc.remote` call times out, note that the error won't be raised to the user in that call, as it is not blocking (similar to `rpc_async`). Instead, the timeout error will be raised the next time the RRef is used (either by pickling or a `to_here` call).
- Error-handling semantics are added to RRef to deal with the timeout errors. Previously, if there was an error creating the OwnerRRef, the callback on the local user would throw an error in a callback, resulting in an `std::terminate`. Instead, the error is now caught and surfaced to the user the next time the RRef is used. As part of this, we have added an `RPCErrorType` enum and defined RRef error handlers to handle the `RPCErrorType`s (currently just timeout and unknown).
- A timeout parameter is added to `to_here()`, which gives the user control over the max amount of time it can block for.
- `ctx.prepareChildForFork()`, which is called when the RRef is pickled (i.e. used as an arg over RPC), checks if the `rpc.remote()` call had timed out, and if so, raises that error to the user.
- Tests are added, primarily via delay injection.
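A hedged usage sketch of these semantics (assumes RPC is already initialized and a peer named "worker1" exists):
```python
import torch
import torch.distributed.rpc as rpc

# rpc.remote returns immediately; a creation timeout is NOT raised here.
rref = rpc.remote("worker1", torch.add, args=(torch.ones(2), 1), timeout=5)

try:
    # A pending timeout error surfaces on first use instead, e.g. here:
    value = rref.to_here(timeout=3)  # also bounds how long this call blocks
except RuntimeError as err:
    print("rpc.remote or to_here timed out:", err)
```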
ghstack-source-id: 105232837
Test Plan: CI
Differential Revision: D21588165
fbshipit-source-id: c9f9e8aa3521012ea1de3e0f152a41afdf8b23f3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39267
When combined with `torch.jit.script`, the order of decorators matters:
`rpc.functions.async_execution` must be the outermost one. The
`async_execution` decorator will store the TorchScript function in
attribute `_wrapped_async_rpc_function` on the wrapper function, and
pass this wrapped TorchScript function (i.e., an instance of
`torch.jit.ScriptFunction`) to RPC. The caller will mark the ScriptCall
with `isAsyncExecution=true`, and the callee will extract the returned
`Future` in C++ and install subsequent processing as a callback to
that `Future`.
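A sketch of the resulting usage, modeled on the documented pattern (assumes RPC is initialized and the named workers exist):
```python
import torch
from torch import Tensor
from torch.futures import Future
import torch.distributed.rpc as rpc

@torch.jit.script
def script_add(x: Tensor, y: Tensor) -> Tensor:
    return x + y

# async_execution must be the outermost decorator:
@rpc.functions.async_execution
@torch.jit.script
def async_add(to: str, x: Tensor, y: Tensor) -> Future[Tensor]:
    return rpc.rpc_async(to, script_add, (x, y))

# ret = rpc.rpc_sync("worker1", async_add,
#                    args=("worker2", torch.ones(2), torch.ones(2)))
```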
Test Plan: Imported from OSS
Differential Revision: D21792688
fbshipit-source-id: de095eb148d21e9114a478e9e6047c707d34fd07
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39216
The `rpc.functions.async_execution` decorator specifies that the
wrapped function is guaranteed to return a `torch.futures.Future`.
The decorator adds a `_wrapped_async_rpc_function` attribute to
the wrapper function. The caller retrieves this information and
then sets the `isAsyncFunction` argument accordingly, which is later
added to the PythonCall RPC message as a field. On the callee side,
if the PythonCall carries an asynchronous function, it will cast
the function's return value to a jit::PythonFutureWrapper object,
and then install response creation and communication as a callback
on that jit::PythonFutureWrapper.
For applications, this feature is useful when a function needs to
wait for IO or additional signaling. In those cases, marking the
user function with `rpc.functions.async_execution` will prevent it
from blocking one thread on the callee for too long.
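A hedged example of the eager-mode usage (the timer stands in for real IO/signaling):
```python
import threading
import torch
import torch.distributed.rpc as rpc

@rpc.functions.async_execution
def slow_add(x, y):
    # Return a Future right away so the callee's RPC thread is freed; it is
    # completed later, when the awaited IO/signal (here: a timer) fires.
    fut = torch.futures.Future()
    threading.Timer(1.0, lambda: fut.set_result(x + y)).start()
    return fut

# Caller side looks like any other RPC:
# ret = rpc.rpc_sync("worker1", slow_add, args=(torch.ones(2), 1))
```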
Test Plan: Imported from OSS
Reviewed By: rohan-varma
Differential Revision: D21779962
fbshipit-source-id: 6b6aa698bf6f91dad6ed2a7ee433df429b59e941
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39353
This test failed with TSAN since the shortened timeout prevented all
messages from being processed within the timeout during Phase 1 of
wait_all_workers during RPC shutdown. Phase 2 already had a longer timeout, so
we extend this to Phase 1 as well.
ghstack-source-id: 105045926
Test Plan: Ran the test_get_and_set_timeout with TSAN
Differential Revision: D21826783
fbshipit-source-id: 7edfdeb50169b31e997dd36a3fd8eea0e9ae7189
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38847
See motivation and design in https://github.com/pytorch/pytorch/issues/38845.
Closes https://github.com/pytorch/pytorch/issues/38845.
Changes,
- Add pre-request and post-response hooks to the RPC "request_callback_impl.cpp". For a thread that executes an RPC handler, check whether server-side global profiling is on. If it is, enable profiling on this thread and, after the response, merge the thread-local profiling result into the global profiling state.
- Add a context-style Python API to parse the profiling Events into ranges represented by FunctionEvent.
- Add data structures that serve as the global profiling state, supporting nesting, and a container for consolidating results from multiple threads.
Test,
- Add a test that uses a nested profiling range and inspects the profiling events.
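A pure-Python stand-in for the hook/merge pattern described above (the real hooks are in C++, and all names below are invented):
```python
import threading
from contextlib import contextmanager

_global = {"on": False, "events": [], "lock": threading.Lock()}

@contextmanager
def server_global_profile():
    # Context-style API: RPCs handled inside this block are profiled.
    _global["on"] = True
    try:
        yield _global["events"]
    finally:
        _global["on"] = False

def handle_rpc(thread_local_events):
    if _global["on"]:                      # pre-request hook: profiling on?
        with _global["lock"]:              # post-response hook: consolidate
            _global["events"].extend(thread_local_events)
```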
ghstack-source-id: 104991517
Test Plan:
```
buck test mode/dev-nosan //caffe2/test/distributed/rpc/:rpc_fork
buck build mode/dev-nosan //caffe2/test/distributed/rpc/:rpc_fork && \
buck-out/gen/caffe2/test/distributed/rpc/rpc_fork\#binary.par -r test_server_process_global_profiler
```
Differential Revision: D5665992
fbshipit-source-id: 07f3bef5efd33d1214ef3404284c3803f5deca26
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38933
Based on what I could understand of how RPC shutdown operates and what the ProcessGroup agent does, the join method is supposed to act as a barrier among all workers, waiting until they have all finished their pending work, including work that may be triggered by nested calls or by callbacks.
ghstack-source-id: 104760684
Test Plan: Before this diff, the `test_user_rrefs_confirmed` test of the RPC suite was flakily deadlocking. After this, I haven't been able to repro that.
Differential Revision: D21703020
fbshipit-source-id: 3d36c6544f1ba8e17ce27ef520ecfd30552045dd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38577
We don't want to limit the timeout to 30 min, since there could be no
operations within that time frame. Bump it to 2^31 - 1 (int32 max).
ghstack-source-id: 104743727
Test Plan: CI
Differential Revision: D21602425
fbshipit-source-id: ab002262f01664b538761202b3bd7584fcee3c6b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38352
Fixes the RPC profiling by using the `then()` API added in https://github.com/pytorch/pytorch/pull/37311. Instead of adding a regular callback, we return a new future that completes when the profiling callback is finished. This is transparent to the user, as the future still completes with the value of the original future (i.e. the RPC's return value).
To make this work for RRef, we add a `_set_profiling_future` to set the profiling future, and `_get_profiling_future` to retrieve this future and wait on it in the tests.
Re-enabled profiling tests and stress tested them 1000 times to verify the fix
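A minimal sketch of the `then()`-based chaining (the bookkeeping callback is hypothetical):
```python
import torch

def with_profiling_callback(fut, record_events):
    def cb(completed_fut):
        record_events()              # profiling bookkeeping runs first...
        return completed_fut.wait()  # ...then the new future completes with
                                     # the original RPC return value
    return fut.then(cb)

f = torch.futures.Future()
g = with_profiling_callback(f, lambda: None)
f.set_result(42)
assert g.wait() == 42
```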
ghstack-source-id: 104086114
Test Plan: Re-enabled profiling tests
Differential Revision: D21506940
fbshipit-source-id: 35cde22f0551c825c9bc98ddc24cca412878a63a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35154
This is for issue https://github.com/pytorch/pytorch/issues/34999.
close https://github.com/pytorch/pytorch/issues/34999.
https://github.com/pytorch/pytorch/issues/34997 need more work.
This will make a few work items easier, like 1) Dist autograd profiler, 2) JIT annotation for Future.
Test Plan:
```
buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork
buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork -- test_rref_forward_chain --stress-runs 100
buck build mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork && \
buck-out/gen/caffe2/test/distributed/rpc/rpc_fork\#binary.par \
-r test_call_method_on_rref
```
buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork -- 'test_rref_proxy_class \(fb\.test_rpc_fork\.RpcTestWithFork\)' --stress-runs 100
test_rref_proxy_reuse
test_handle_send_exceptions
```
buck test mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork
buck build mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork && \
buck-out/gen/caffe2/test/distributed/rpc/jit/rpc_fork\#binary.par \
-r test_script_call_python_return_future
```
Differential Revision: D7722184
fbshipit-source-id: bd92b855bfea4913d6672700590c57622fa86e0e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38052
The initial version of the TensorPipe agent required the user to specify, on each worker, the full map between workers' names and their ids. However, it's enough for each worker to specify just its own name and id, as these can then be exchanged using the store.
Addresses #37784, although I think we can go further and use the store to also automatically assign ranks to workers, so that the user only needs to specify a name.
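A hedged sketch of the exchange through the store (the key naming is illustrative):
```python
from torch.distributed import TCPStore

def exchange_names(store: TCPStore, my_name: str, my_rank: int, world_size: int):
    # Each worker publishes only its own (rank -> name) pair...
    store.set(f"rpc.names.{my_rank}", my_name)
    # ...and reads everyone else's; get() blocks until the key is set.
    return {store.get(f"rpc.names.{r}").decode(): r for r in range(world_size)}
```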
ghstack-source-id: 103741595
Test Plan:
On worker 0:
```
In [1]: import os
...: import torch
...: import torch.distributed.rpc as rpc
...: os.environ["MASTER_ADDR"] = "127.0.0.1"
...: os.environ["MASTER_PORT"] = "8765"
In [2]: rpc.init_rpc(name="foo", rank=0, backend=rpc.backend_registry.BackendType.TENSORPIPE, world_size=2)
In [3]: rpc.rpc_sync("bar", torch.add, args=(torch.full((2,2), 1), torch.full((2,2), 2)))
Out[3]:
tensor([[3., 3.],
        [3., 3.]])
In [4]: rpc.rpc_sync("bar", torch.add, args=(1, 2))
Out[4]: 3
```
On worker 1:
```
In [1]: import os
...: import torch
...: import torch.distributed.rpc as rpc
...: os.environ["MASTER_ADDR"] = "127.0.0.1"
...: os.environ["MASTER_PORT"] = "8765"
In [2]: rpc.init_rpc(name="bar", rank=1, backend=rpc.backend_registry.BackendType.TENSORPIPE, world_size=2)
```
Then also tested by adding `rpc_backend_options=rpc.TensorPipeRpcBackendOptions(init_method="file:///tmp/init/foo")` to `init_rpc`.
Differential Revision: D21463833
fbshipit-source-id: b53d7af6fc060789358ac845aa1898ddea6e8f31
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35483
Implement the initial version of the TensorPipe RPC agent, and register it in the RPC registry to expose it to the Python interface. As a starter, it utilizes all available TensorPipe transports (shm, uv) and channels (basic, cma).
Test Plan:
https://our.intern.facebook.com/intern/diffusion/FBS/browse/master/fbcode/experimental/jiayisuse/tensorpipe_rpc
```
export MASTER_ADDR=127.0.0.1
export MASTER_PORT=28500
buck build mode/dev-nosan mode/no-gpu //experimental/jiayisuse/tensorpipe_rpc:main
./buck-out/gen/experimental/jiayisuse/tensorpipe_rpc/main.par
buck build mode/dev-nosan mode/no-gpu //experimental/jiayisuse/tensorpipe_rpc:benchmark
./buck-out/gen/experimental/jiayisuse/tensorpipe_rpc/benchmark.par
# Multiple connections with async echo
./buck-out/gen/experimental/jiayisuse/tensorpipe_rpc/async_echo.par
```
Reviewed By: lw
Differential Revision: D20088366
fbshipit-source-id: 980f641af3321ca93583c62753e1c9174b7d4afc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37485
Adds arbitrary timeout injection to the faulty RPC agent. This is to better test scenarios that involve long-running RPCs, such as properly testing RPC timeouts and the profiler in all scenarios.
This is done by overriding ProcessGroupAgent's `enqueueSend()` function to inject the delay. Determining which messages to delay is done similarly to the existing `faulty_messages` mechanism, by having the user specify a mapping from message type to timeout.
Added unit tests that verify RPC timeouts work with builtin + TorchScript functions, which was not tested before.
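Illustratively, the mapping might look like this (the message-type names and seconds values are examples only, not the fixture's actual defaults):
```python
# Delay injection config for the faulty agent: message type -> delay (s).
messages_to_delay = {
    "SCRIPT_CALL": 1.5,   # slow down TorchScript function calls
    "PYTHON_CALL": 1.0,   # slow down Python UDF calls
}
```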
ghstack-source-id: 103341662
Test Plan: Added unit tests in `FaultyRpcAgentTest`.
Differential Revision: D21296537
fbshipit-source-id: 1dbc21aee14e49780272634e9cbb2b5a448f2896
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37450
It doesn't seem like we could customize the retryable message types by
passing faulty_messages into dist_utils, as the `FaultyRpcAgentTestFixture`
overrode the `rpc_backend_options` function and provided the default list of
retryable message types. This needed to be fixed as part of adding timeout
injection support, as mentioned in https://github.com/pytorch/pytorch/issues/36272.
ghstack-source-id: 103287164
Test Plan: `buck test mode/dev-nosan //caffe2/test/distributed/rpc/faulty_agent:rpc_spawn_faulty -- --print-passing-details`
Differential Revision: D21270127
fbshipit-source-id: e5dd847dcf92f14b490f84e9ee79291698b85ffa
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37027
The RPC timeout passed into rpc_sync and rpc_async is, after the change
below, now a float, so we should make these APIs consistent.
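A usage sketch of the consistent float-seconds form (assuming the standard option names):
```python
import torch.distributed.rpc as rpc

# rpc_timeout is a float number of seconds, matching rpc_sync/rpc_async:
opts = rpc.ProcessGroupRpcBackendOptions(rpc_timeout=60.0, init_method="env://")
```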
ghstack-source-id: 102971906
Test Plan:
Existing unittests; also added a unittest testing a specific timeout set
in ProcessGroupRpcBackendOptions and the handling of the dispatched rpc backend options.
Differential Revision: D21125171
fbshipit-source-id: a5894b8ce31d2926f2c3d323d1cda4d54b30cef1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34650
Resubmit of https://github.com/pytorch/pytorch/pull/33840, which was overly eager in the sense that it deleted a lot of code that we didn't want to get rid of yet (default timeout handling).
This PR adds an optional argument into `rpc_sync` and `rpc_async` as well as `RpcAgent::send()` that allows the user to specify a timeout for an RPC to override the default set timeout. If the user does not specify this argument, then the currently set default RPC timeout given in the RPC constructor or by `rpc.set_rpc_timeout()` is used. Otherwise, we use the passed in timeout.
This diff does not address:
1) Timeout support when rpc.rpc_async is called as a JIT operator. For this to work, we would need to change the logic in `register_distributed_ops` to pass this timeout to `rpcTorchscript`. One more issue is that TorchScript doesn't support the timedelta object. This will be done in a follow-up PR, as it requires a fair amount of changes to the argument parsing logic.
2) Per-RPC timeouts for internal messages or `rpc.remote()`. A follow-up diff will address the latter with the approach of raising the timeout error to the user at the earliest possible time, such as the next time the RRef is forked or `to_here` is called.
Added unit tests to confirm the current behavior
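A usage sketch with the float-seconds form the API eventually settled on (a later diff; at the time of this one the timeout type was still in flux):
```python
import torch
import torch.distributed.rpc as rpc

t = torch.ones(2)
# Override the default RPC timeout for just these calls:
ret = rpc.rpc_sync("worker1", torch.add, args=(t, 1), timeout=2.0)
fut = rpc.rpc_async("worker1", torch.add, args=(t, 1), timeout=2.0)
```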
ghstack-source-id: 102622601
Test Plan: Added unit tests in rpc_test
Differential Revision: D20376953
fbshipit-source-id: 9fb3f147520588308ab50dd33286255658d76d47
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35615
Python 2 has reached end-of-life and is no longer supported by PyTorch.
Now we can clean up a lot of cruft that we put in place to support it.
These changes were all done manually, and I skipped anything that seemed
like it would take more than a few seconds, so I think it makes sense to
review it manually as well (though using side-by-side view and ignoring
whitespace change might be helpful).
Test Plan: CI
Differential Revision: D20842886
Pulled By: dreiss
fbshipit-source-id: 8cad4e87c45895e7ce3938a88e61157a79504aed
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36619
With this PR, applications no longer need to create dedicated helpers
to run functions on the object referenced by an RRef. Instead,
`rref.rpc_sync().some_func()` will use `rpc_sync` to run `some_func`
on the owner of the RRef using the object referenced by the RRef.
Similar helpers for `rref.rpc_async().some_func()` and
`rref.remote().some_func()` are also added.
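A usage sketch of the new helpers (assumes RPC is initialized and "worker1" exists):
```python
import torch
import torch.distributed.rpc as rpc

rref = rpc.remote("worker1", torch.nn.Linear, args=(4, 4))
x = torch.ones(1, 4)

out = rref.rpc_sync().forward(x)      # blocking call on the owner
fut = rref.rpc_async().forward(x)     # returns a torch.futures.Future
y_rref = rref.remote().forward(x)     # returns another RRef
```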
An alternative design is to expose PyRRef as RRefBase and then
implement everything in a new Python RRef class. However, the RRef
class cannot directly inherit from PyRRef/RRefBase; otherwise we
would need to let the pyRemote* C++ functions load RRef from Python
and return an RRef instance. It is possible to let RRef hold an
instance of PyRRef instead of inheriting from it, but this does not
look like an elegant design, as we would have RRef holding PyRRef and
PyRRef holding the C++ RRef. Another alternative is to use dynamic
method loading, by installing member methods on PyRRef instances.
However, this would require different solutions to handle
RRef(data) and rpc.remote(...). Based on the above thinking, we
decided to go with the current implementation for simplicity, and we
can also keep all RRef-related APIs in one place.
Test Plan: Imported from OSS
Differential Revision: D21028333
Pulled By: mrshenli
fbshipit-source-id: fe90f56ef7183d18874e357900093755e1601eb4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35055
This is the first step to improving the way RPCs are profiled, as suggested by Ilia. For now, since RPC can return two different types of futures, we have to implement two different code paths: one for the Python eager-mode future and one for the JIT future.
This diff implements the Python eager part. We have defined a method `_call_end_callbacks_on_future` that takes in a future and schedules a `RecordFunction` to be completed as a callback on the future.
Once https://github.com/pytorch/pytorch/pull/35039 lands, we can implement the JIT codepath by registering an operator that takes a `Future(t)` as well.
These code paths will be merged once the futures are merged.
ghstack-source-id: 102478180
Test Plan: Added unit tests
Differential Revision: D20452003
fbshipit-source-id: 1acdcb073bd1f63d6fb2e78277ac0be00fd6671d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36275
Calling a TorchScript function from within RPC was added after initial
support for the profiler with RPC; hence, we were not correctly recording
TorchScript functions invoked under RPC. This diff passes the `RecordFunction` to
the `_invoke_torchscript..` calls, similar to what is done for builtins and UDFs.
However, this is only a temporary solution. We will be removing the use of
`RecordFunction` as a standalone in the RPC code in
https://github.com/pytorch/pytorch/pull/35055. This diff is to unblock
recording of torchscript functions in the meantime.
ghstack-source-id: 101800134
Test Plan:
Added tests for calling a script function with builtin, sync, and
async. The output looks like below:
```
Name                                                                                                       Self CPU total %   Self CPU total   CPU total %   CPU total   CPU time avg   Number of Calls
---------------------------------------------------------------------------------------------------------  ----------------   --------------   -----------   ---------   ------------   ---------------
rpc_sync#__torch__.torch.testing._internal.distributed.rpc.rpc_test.my_script_func(worker1 -> worker2)     99.92%             1.056s           99.92%        1.056s      1.056s         1
select                                                                                                      0.04%              383.661us        0.04%         383.661us   95.915us       4
fill_                                                                                                       0.02%              210.966us        0.02%         210.966us   52.741us       4
to                                                                                                          0.00%              26.276us         0.00%         26.276us    26.276us       1
empty                                                                                                       0.02%              159.802us        0.02%         159.802us   79.901us       2
set_                                                                                                        0.01%              93.818us         0.01%         93.818us    93.818us       1
---------------------------------------------------------------------------------------------------------  ----------------   --------------   -----------   ---------   ------------   ---------------
Self CPU time total: 1.057s
```
Note that we use `torch.jit._qualified_name` to get the name of the script fn.
Differential Revision: D20930453
fbshipit-source-id: c6d940aa44fcd9dd8a1a29c156aa19e0d8428d60
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36113
As part of debugging https://github.com/pytorch/pytorch/issues/35863,
I discovered that the unit test would time out during clean shutdown.
Looking into this further, it looks like there is a race in
`_on_leader_follower_report_shutdown_intent` when multiple followers call the
same method on the leader.
To fix this, I've ensured we have an appropriate lock in
`_on_leader_follower_report_shutdown_intent` to guard against this.
I ran the test 500 times to validate that this fix works.
Closes #35863
ghstack-source-id: 101641463
Test Plan:
1) waitforbuildbot
2) Ran the test 500 times.
Differential Revision: D20884373
fbshipit-source-id: 9d580e9892adffc0c9a4c2e832881fb291a1ff16
Summary:
The original behavior of pytorch c10d only supports built-in c10d backends, such as
nccl/gloo/mpi. This patch is used to extend the c10d capability to support dynamically
loading 3rd party communication libraries which are derived from the ProcessGroup base class.
The related RFC is in: https://github.com/pytorch/pytorch/issues/27955
This way, users just need to specify a 3rd party c10d backend name when invoking
torch.distributed.init_process_group(). The proposed logic will try to load the corresponding
c10d backend cpp extension automatically. As for how to develop a new 3rd party c10d backend
through a cpp extension, please refer to test/cpp_extensions/cpp_c10d_extension.cpp.
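A hedged sketch of the registration/selection flow (`MyProcessGroup` and its factory are hypothetical stand-ins; see test/cpp_extensions/cpp_c10d_extension.cpp for a real extension):
```python
import torch.distributed as dist

# MyProcessGroup: hypothetical ProcessGroup subclass from a cpp extension.
dist.Backend.register_backend("my_backend", MyProcessGroup.create)

# The custom name can now be used like a built-in backend:
dist.init_process_group(backend="my_backend", rank=0, world_size=1,
                        init_method="tcp://127.0.0.1:23456")
```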
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28068
Differential Revision: D19174838
Pulled By: agolynski
fbshipit-source-id: 3409a504a43ce7260e6f9d1207c00e87471fac62