Commit Graph

30 Commits

Author SHA1 Message Date
Philip Meier
d5988c5eca remove unused type: ignore directives (#60006)
Summary:
During development it is common practice to put `type: ignore` comments on lines that are correct but that `mypy` doesn't recognize as such. This often stems from the fact that the `mypy` version in use wasn't able to handle the pattern.

With every new release `mypy` gets better at handling complex code. In addition to fixing all the previously accepted but now failing patterns, we should also revisit all `type: ignore` comments to see whether they are still needed. Fortunately, we don't need to do this manually: by adding `warn_unused_ignores = True` to the configuration, `mypy` will error out whenever it encounters a `type: ignore` that is no longer needed.
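A minimal illustration of the effect (the file contents and the specific ignore below are hypothetical, not taken from the codebase). With `warn_unused_ignores = True` under the `[mypy]` section of the config, an ignore that a newer `mypy` no longer needs is reported:

```
# mypy config (excerpt):
#   [mypy]
#   warn_unused_ignores = True

from typing import Optional

def first_or_none(xs: list) -> Optional[int]:
    # An older mypy needed the ignore on this line; a newer one accepts it,
    # so with warn_unused_ignores mypy reports:
    #   error: Unused "type: ignore" comment
    return xs[0] if xs else None  # type: ignore[return-value]
```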

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60006

Reviewed By: jbschlosser, malfet

Differential Revision: D29133237

Pulled By: albanD

fbshipit-source-id: 41e82edc5cd5affa7ccedad044b59b94dad4425a
2021-06-18 07:23:31 -07:00
Yi Wang
7989f2ac87 Clang format dist_utils.py and rpc/__init__.py (#56853)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56853

ghstack-source-id: 127412640

Test Plan: N/A

Reviewed By: rohan-varma

Differential Revision: D27984669

fbshipit-source-id: 8e89ba0c53107622b3ca29ea296226e260b251df
2021-04-26 11:33:42 -07:00
Richard Barnes
697b130374 Add some missing types to torch (#55184)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55184

Test Plan: Sandcastle

Reviewed By: ezyang

Differential Revision: D27515470

fbshipit-source-id: 264bc067db8fb430465d14bf9508ac8b1faf0f2f
2021-04-05 21:44:47 -07:00
Rohan Varma
27ab9bc0f9 [RPC profiling] Extend RPC profiling to support async function execution over RPC. (#44664)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44664

Closes https://github.com/pytorch/pytorch/issues/39971. This PR adds support for profiling functions decorated with `rpc.functions.async_execution` over RPC, as builtins, JIT functions, and blocking Python UDFs already can be. The goal is complete RPC profiling coverage across the various types of functions users can run.

To enable this, the PR below this one makes it safe to call `disableProfiler()` from another thread. We use that to defer disabling the profiler on the server until the future corresponding to the RPC request completes (rather than only until the blocking `processRPC` call returns, as was done previously). Because by the time that future completes the async function has been kicked off and its own future has completed, we are able to capture any RPCs the function called as well as the actual work done on the other node.

For example, if the following async function is run on a server over RPC:

```
import time

import torch
import torch.distributed.rpc as rpc

def slow_add(x, y):
    time.sleep(1)
    return torch.add(x, y)

@rpc.functions.async_execution
def slow_async_add(to, x, y):
    return rpc.rpc_async(to, slow_add, args=(x, y))
```

we expect to see the original RPC profiled, the nested RPC profiled, and the actual torch.add() work. All of these events should be recorded with the correct node id. Here is an example profiling output:

```
---------------------------------------------------------------------------------------------------------------------  ----------------  --------------  -----------  ---------  ------------  ---------------  -------
Name                                                                                                                     Self CPU total %  Self CPU total  CPU total %  CPU total  CPU time avg  Number of Calls  Node ID
---------------------------------------------------------------------------------------------------------------------  ----------------  --------------  -----------  ---------  ------------  ---------------  -------
rpc_async#slow_async_add(worker1 -> worker2)                                                                             0.00%             0.000us         0            1.012s     1.012s        1                1
aten::empty                                                                                                              7.02%             11.519us        7.02%        11.519us   11.519us      1                1
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)                           0.00%             0.000us         0            1.006s     1.006s        1                2
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: aten::empty                                                      7.21%             11.843us        7.21%        11.843us   11.843us      1                2
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)#remote_op: aten::add      71.94%            118.107us       85.77%       140.802us  140.802us     1                3
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)#remote_op: aten::empty    13.82%            22.695us        13.82%       22.695us   22.695us      1                3
---------------------------------------------------------------------------------------------------------------------  ----------------  --------------  -----------  ---------  ------------  ---------------  -------
Self CPU time total: 164.164us
```

This PR also moves a bunch of the profiling logic to `rpc/utils.cpp` to declutter `request_callback` code.
ghstack-source-id: 112868470

Test Plan:
```
rvarm1@devbig978:fbcode  (52dd34f6)$ buck test mode/no-gpu mode/dev-nosan //caffe2/test/distributed/rpc:process_group_agent -- test_rpc_profiling_async_function --print-passing-details --stress-runs 1
```

Reviewed By: mrshenli

Differential Revision: D23638387

fbshipit-source-id: eedb6d48173a4ecd41d70a9c64048920bd4807c4
2020-09-25 13:19:26 -07:00
gunandrose4u
f07ac6a004 Fix Windows build failure after DDP PR merged (#45335)
Summary:
Fixes #{issue number}
This is a resubmit of PR https://github.com/pytorch/pytorch/issues/42897, together with a fix for the Windows build issue introduced by PR https://github.com/pytorch/pytorch/issues/44344.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45335

Reviewed By: zou3519

Differential Revision: D23931471

Pulled By: mrshenli

fbshipit-source-id: f49b5a114944c1450b32934b3292170be064f494
2020-09-25 12:37:50 -07:00
Mike Ruberry
103fa3894a Revert D23841786: [pytorch][PR] Enable distributed package on windows, Gloo backend supported only
Test Plan: revert-hammer

Differential Revision:
D23841786 (0122299f9b)

Original commit changeset: 334ba1ed73ef

fbshipit-source-id: ec95432f9957df56a5a04e52661f5db920b7f57f
2020-09-24 22:44:33 -07:00
gunandrose4u
0122299f9b Enable distributed package on windows, Gloo backend supported only (#42897)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42095

The test case part will be committed to this PR later.

mrshenli, please help to review

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42897

Reviewed By: osalpekar

Differential Revision: D23841786

Pulled By: mrshenli

fbshipit-source-id: 334ba1ed73eff2f668857390fc32d1bc7f08e5f3
2020-09-24 21:13:55 -07:00
Sam Estep
c2cf6efd96 Enable type check for torch.testing._internal.dist_utils.* (#44832)
Summary:
Addresses a sub-task of https://github.com/pytorch/pytorch/issues/44752.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44832

Reviewed By: malfet

Differential Revision: D23744260

Pulled By: samestep

fbshipit-source-id: 46aede57b4fa66a770d5df382b0aea2bd6772b9b
2020-09-18 12:50:48 -07:00
Xiang Gao
20ac736200 Remove py2 compatible future imports (#44735)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44735

Reviewed By: mruberry

Differential Revision: D23731306

Pulled By: ezyang

fbshipit-source-id: 0ba009a99e475ddbe22981be8ac636f8a1c8b02f
2020-09-16 12:55:57 -07:00
Luca Wehrstedt
2e7b464c43 [RPC tests] Remove global TEST_CONFIG (#40822)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40822

Summary of the entire stack:
--

This diff is part of an attempt to refactor the RPC tests. They currently suffer from several problems:
- Several ways to specify the agent to use: there exists one "generic" fixture that uses the global variable TEST_CONFIG to look up the agent name, and is used for process group and Thrift, and then there are separate fixtures for the flaky agent and the TensorPipe one.
- These two ways lead to having two separate decorators (`requires_process_group_agent` and `@_skip_if_tensorpipe_agent`) which must both be specified, making it unclear what the effect of each of them is and what happens if only one is given.
- Thrift must override the TEST_CONFIG global variable before any other import (in order for the `requires_process_group_agent` decorator to work correctly) and for that it must use a "trap" file, which makes it even harder to track which agent is being used, and which is specific to Buck, and thus cannot be used in OSS by other agents.
- Even if the TensorPipe fixture doesn't use TEST_CONFIG, it still needs to set it to the right value for other parts of the code to work. (This is done in `dist_init`).
- There are a few functions in dist_utils.py that return some properties of the agent (e.g., a regexp to match against the error it returns in case of shutdown). These functions are effectively chained if/elses on the various agents, which has the effect of "leaking" some part of the Thrift agent into OSS.
- Each test suite (RPC, dist autograd/dist optimizer, their JIT versions, remote module, ...) must be run on each agent (or almost; the faulty one is an exception) in both fork and spawn mode. Each of these combinations is a separate file, which leads to a proliferation of scripts.
- There is no "master list" of what combinations make sense and should be run. Therefore it has happened that when adding new tests or new agents we forgot to enroll them into the right tests. (TensorPipe is still missing a few tests, it turns out).
- All of these tiny "entry point" files contain almost the same duplicated boilerplate. This makes it very easy to get the wrong content into one of them due to a bad copy-paste.

This refactoring aims to address these problems by:
- Avoiding global state, defaults/overrides, traps, if/elses, ... and having a single way to specify the agent, based on an abstract base class and several concrete subclasses which can be "mixed in" to any test suite.
- Instead of enabling/disabling tests using decorators, the tests that are specific to a certain agent are now in a separate class (which is a subclass of the "generic" test suite) so that they are only picked up by the agent they apply to.
- Instead of having one separate entry point script for each combination, it uses one entry point for each agent, and in that script it provides a list of all the test suites it wants to run on that agent. And it does that by trying to deduplicate the boilerplate as much as possible. (In fact, the various agent-suite combinations could be grouped in any way, not necessarily by agent as I did here).

It provides further advantages:
- It puts all the agents on equal standing, by not having any of them be the default, making it thus easier to migrate from process group to TensorPipe.
- It will make it easier to add more versions of the TensorPipe tests (e.g., one that disables the same-machine backends in order to test the TCP-based ones) without a further duplication of entry points, of boilerplate, ...

Summary of this commit
--
This is the last step of removing TEST_CONFIG. As there was no one left using it, there is really not much to it.
ghstack-source-id: 109229471

Test Plan: Sandcastle and CircleCI

Reviewed By: pritamdamania87

Differential Revision: D22307778

fbshipit-source-id: 0d9498d9367eec671e0a964ce693015f73c5638c
2020-08-05 15:10:20 -07:00
Luca Wehrstedt
e7c7eaab82 [RPC tests] Move some functions to methods of fixture (#40821)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40821

Summary of the entire stack:
--

This diff is part of an attempt to refactor the RPC tests. They currently suffer from several problems:
- Several ways to specify the agent to use: there exists one "generic" fixture that uses the global variable TEST_CONFIG to look up the agent name, and is used for process group and Thrift, and then there are separate fixtures for the flaky agent and the TensorPipe one.
- These two ways lead to having two separate decorators (`requires_process_group_agent` and `@_skip_if_tensorpipe_agent`) which must both be specified, making it unclear what the effect of each of them is and what happens if only one is given.
- Thrift must override the TEST_CONFIG global variable before any other import (in order for the `requires_process_group_agent` decorator to work correctly) and for that it must use a "trap" file, which makes it even harder to track which agent is being used, and which is specific to Buck, and thus cannot be used in OSS by other agents.
- Even if the TensorPipe fixture doesn't use TEST_CONFIG, it still needs to set it to the right value for other parts of the code to work. (This is done in `dist_init`).
- There are a few functions in dist_utils.py that return some properties of the agent (e.g., a regexp to match against the error it returns in case of shutdown). These functions are effectively chained if/elses on the various agents, which has the effect of "leaking" some part of the Thrift agent into OSS.
- Each test suite (RPC, dist autograd/dist optimizer, their JIT versions, remote module, ...) must be run on each agent (or almost; the faulty one is an exception) in both fork and spawn mode. Each of these combinations is a separate file, which leads to a proliferation of scripts.
- There is no "master list" of what combinations make sense and should be run. Therefore it has happened that when adding new tests or new agents we forgot to enroll them into the right tests. (TensorPipe is still missing a few tests, it turns out).
- All of these tiny "entry point" files contain almost the same duplicated boilerplate. This makes it very easy to get the wrong content into one of them due to a bad copy-paste.

This refactoring aims to address these problems by:
- Avoiding global state, defaults/overrides, traps, if/elses, ... and having a single way to specify the agent, based on an abstract base class and several concrete subclasses which can be "mixed in" to any test suite.
- Instead of enabling/disabling tests using decorators, the tests that are specific to a certain agent are now in a separate class (which is a subclass of the "generic" test suite) so that they are only picked up by the agent they apply to.
- Instead of having one separate entry point script for each combination, it uses one entry point for each agent, and in that script it provides a list of all the test suites it wants to run on that agent. And it does that by trying to deduplicate the boilerplate as much as possible. (In fact, the various agent-suite combinations could be grouped in any way, not necessarily by agent as I did here).

It provides further advantages:
- It puts all the agents on equal standing, by not having any of them be the default, making it thus easier to migrate from process group to TensorPipe.
- It will make it easier to add more versions of the TensorPipe tests (e.g., one that disables the same-machine backends in order to test the TCP-based ones) without a further duplication of entry points, of boilerplate, ...

Summary of this commit
--
This change continues the work towards removing TEST_CONFIG, by taking a few functions that were accepting the agent name (as obtained from TEST_CONFIG) and then did a bunch of if/elses on it, and replace them by new abstract methods on the fixtures, so that these functions become "decentralized".
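A sketch of the before/after shape of such a helper (the agent names and error strings below are purely illustrative, not the real values from the tree):

```
# Before: one central helper keyed on the agent name from TEST_CONFIG.
def get_shutdown_error_regex(agent_name):
    if agent_name == "PROCESS_GROUP":
        return "illustrative process-group shutdown error"
    elif agent_name == "TENSORPIPE":
        return "illustrative tensorpipe shutdown error"
    raise ValueError(f"unknown agent: {agent_name}")


# After: each fixture answers the question for its own agent,
# with no central if/else chain.
class RpcAgentTestFixture:
    def get_shutdown_error_regex(self):
        raise NotImplementedError


class TensorPipeAgentTestFixture(RpcAgentTestFixture):
    def get_shutdown_error_regex(self):
        return "illustrative tensorpipe shutdown error"
```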
ghstack-source-id: 109229472

Test Plan: Sandcastle and CircleCI

Reviewed By: pritamdamania87

Differential Revision: D22307776

fbshipit-source-id: 9e1f6edca79aacf0bcf9d83d50ce9e0d2beec0dd
2020-08-05 15:10:17 -07:00
Luca Wehrstedt
2acef69ce3 [RPC tests] Make generic fixture an abstract base class (#40820)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40820

Summary of the entire stack:
--

This diff is part of an attempt to refactor the RPC tests. They currently suffer from several problems:
- Several ways to specify the agent to use: there exists one "generic" fixture that uses the global variable TEST_CONFIG to look up the agent name, and is used for process group and Thrift, and then there are separate fixtures for the flaky agent and the TensorPipe one.
- These two ways lead to having two separate decorators (`requires_process_group_agent` and `@_skip_if_tensorpipe_agent`) which must both be specified, making it unclear what the effect of each of them is and what happens if only one is given.
- Thrift must override the TEST_CONFIG global variable before any other import (in order for the `requires_process_group_agent` decorator to work correctly) and for that it must use a "trap" file, which makes it even harder to track which agent is being used, and which is specific to Buck, and thus cannot be used in OSS by other agents.
- Even if the TensorPipe fixture doesn't use TEST_CONFIG, it still needs to set it to the right value for other parts of the code to work. (This is done in `dist_init`).
- There are a few functions in dist_utils.py that return some properties of the agent (e.g., a regexp to match against the error it returns in case of shutdown). These functions are effectively chained if/elses on the various agents, which has the effect of "leaking" some part of the Thrift agent into OSS.
- Each test suite (RPC, dist autograd/dist optimizer, their JIT versions, remote module, ...) must be run on each agent (or almost; the faulty one is an exception) in both fork and spawn mode. Each of these combinations is a separate file, which leads to a proliferation of scripts.
- There is no "master list" of what combinations make sense and should be run. Therefore it has happened that when adding new tests or new agents we forgot to enroll them into the right tests. (TensorPipe is still missing a few tests, it turns out).
- All of these tiny "entry point" files contain almost the same duplicated boilerplate. This makes it very easy to get the wrong content into one of them due to a bad copy-paste.

This refactoring aims to address these problems by:
- Avoiding global state, defaults/overrides, traps, if/elses, ... and having a single way to specify the agent, based on an abstract base class and several concrete subclasses which can be "mixed in" to any test suite.
- Instead of enabling/disabling tests using decorators, the tests that are specific to a certain agent are now in a separate class (which is a subclass of the "generic" test suite) so that they are only picked up by the agent they apply to.
- Instead of having one separate entry point script for each combination, it uses one entry point for each agent, and in that script it provides a list of all the test suites it wants to run on that agent. And it does that by trying to deduplicate the boilerplate as much as possible. (In fact, the various agent-suite combinations could be grouped in any way, not necessarily by agent as I did here).

It provides further advantages:
- It puts all the agents on equal standing, by not having any of them be the default, making it thus easier to migrate from process group to TensorPipe.
- It will make it easier to add more versions of the TensorPipe tests (e.g., one that disables the same-machine backends in order to test the TCP-based ones) without a further duplication of entry points, of boilerplate, ...

Summary of this commit
--
Now that no one is using the generic fixture anymore (i.e., the fixture that looks up the agent's name in the global TEST_CONFIG) we can make it abstract, i.e., have its methods become no-ops and add decorators that will require all subclasses to provide new implementations of those methods. This is a first step towards removing TEST_CONFIG.
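A minimal sketch of the abstract-fixture-plus-mixin pattern the stack converges on (class, property, and backend names are illustrative, not the actual ones in the tree):

```
import unittest
from abc import ABC, abstractmethod


class RpcAgentTestFixture(ABC):
    """Generic fixture: subclasses must say which agent they exercise."""

    @property
    @abstractmethod
    def rpc_backend(self):
        """Backend identifier for the agent under test."""

    @property
    @abstractmethod
    def rpc_backend_options(self):
        """Options used to construct the agent."""


class FakeAgentFixture(RpcAgentTestFixture):
    """Concrete fixture for an imaginary agent, for illustration only."""

    @property
    def rpc_backend(self):
        return "FAKE_AGENT"

    @property
    def rpc_backend_options(self):
        return {"num_worker_threads": 4}


class GenericRpcTest:
    """Agent-agnostic test suite written purely against the fixture API."""

    def test_backend_is_configured(self):
        self.assertIsNotNone(self.rpc_backend)


# One entry-point class per agent "mixes in" the suites it should run.
class FakeAgentRpcTest(FakeAgentFixture, GenericRpcTest, unittest.TestCase):
    pass


if __name__ == "__main__":
    unittest.main()
```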
ghstack-source-id: 109229475

Test Plan: Sandcastle and CircleCI

Reviewed By: pritamdamania87

Differential Revision: D22307777

fbshipit-source-id: e52abd915c37894933545eebdfdca3ecb9559926
2020-08-05 15:10:14 -07:00
Pritam Damania
872237c1f2 Output to stderr in distributed tests. (#42139)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42139

A bunch of tests were failing with buck since we would output to
stdout and buck would fail parsing stdout in some cases.

Moving these print statements to stderr fixes this issue.
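A minimal illustration of the change (the message text is hypothetical):

```
import sys

# Diagnostic output from the test goes to stderr so that buck can still
# parse whatever it expects on stdout.
print("INFO: initializing process group for rank 0", file=sys.stderr)
```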
ghstack-source-id: 108606579

Test Plan: Run the offending unit tests.

Reviewed By: mrshenli

Differential Revision: D22779135

fbshipit-source-id: 789af3b16a03b68a6cb12377ed852e5b5091bbad
2020-07-29 19:23:34 -07:00
Rohan Varma
b5e32528d0 Fix flaky test_udf_remote_message_delay_timeout_to_self (#41217)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41217

Fixes this flaky test. Because the callback `finishCreatingOwnerRRef` can run after `request_callback` has processed the request and created the owner RRef, we can actually end up with 0 owners on the node, since the callback removes the entry from the `owners_` map. In that case shutdown is fine, as there are no owners. On the other hand, if the callback runs first, there will be 1 owner, which we delete during shutdown once we detect it has no forks. Either way shutdown works fine, so we don't need to enforce that there is exactly 1 owner.
ghstack-source-id: 107883497

Test Plan: Ran the test 500 times with TSAN.

Reviewed By: ezyang

Differential Revision: D22469806

fbshipit-source-id: 02290d6d5922f91a9e2d5ede21d1cf1c4598cb46
2020-07-16 11:20:56 -07:00
Rohan Varma
8b2bb02e09 Implement timeout support for RRefs (#38590)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38590

This PR implements timeout semantics for RRef for parity with rpc_sync and rpc_async. How it works:

- A timeout parameter is added to `rpc.remote()`. If the `rpc.remote()` call times out, the error won't be raised to the user in that call, as it is not blocking (similar to `rpc_async`). Instead, the timeout error will be raised the next time the RRef is used, either when it is pickled or when `to_here()` is called (see the usage sketch after this list).
- Error handling semantics are added to RRef to deal with timeout errors. Previously, if there was an error creating the OwnerRRef, the callback on the local user would throw from within a callback, resulting in a `std::terminate`. Instead, the error is now caught and surfaced to the user the next time the RRef is used. As part of this, we have added an `RPCErrorType` enum and defined RRef error handlers to handle the `RPCErrorType`s (currently just timeout and unknown).
- A timeout parameter is added to `to_here()` which gives the user control over the max amount of time it can block for.
- `ctx.prepareChildForFork()` which is called when the RRef is pickled (i.e. used as an arg over RPC) checks if the `rpc.remote()` call had timed out, and if so, raises that error to the user.
- Tests are added, primarily via delay injection.
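A usage sketch of the semantics above (worker name, function, and timeout values are illustrative, timeouts are shown in seconds, and `rpc.init_rpc()` is assumed to have already been called on this worker):

```
import torch
import torch.distributed.rpc as rpc


def heavy_add(x, y):
    return x + y


# The remote call returns immediately; the 1-second timeout applies to
# creation of the owner RRef on "worker1".
rref = rpc.remote("worker1", heavy_add, args=(torch.ones(2), torch.ones(2)), timeout=1)

try:
    # If creation timed out, the error is surfaced here, i.e. the next
    # time the RRef is used; to_here() also takes its own timeout.
    result = rref.to_here(timeout=0.5)
except RuntimeError as exc:
    print(f"RRef timed out: {exc}")
```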
ghstack-source-id: 105232837

Test Plan: CI

Differential Revision: D21588165

fbshipit-source-id: c9f9e8aa3521012ea1de3e0f152a41afdf8b23f3
2020-06-04 02:14:42 -07:00
Luca Wehrstedt
f58cc4b444 [RPC] Fix flaky test by waiting for async rref calls (#39012)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39012

The `test_rref_context_debug_info` test was flaky with the TensorPipe agent, and I think the issue is the test itself.

What was happening is that on line 1826 the test was clearing a global variable on the remote side which was holding an RRef. Even though the RPC call that unset the global variable was synchronous, the messages that the RRef context needs to send around to delete that RRef are asynchronous. Therefore, sometimes, when we reached line 1845 we saw the following check fail:
```
        self.assertEqual(2, int(info["num_owner_rrefs"]))
```
because `num_owner_rrefs` was still 3, as the deletion hadn't yet been processed.

The only way I found to fix it is to add a synchronization step where we wait for all the futures from the rref context to complete. Since we must wait for this to happen on all workers, we synchronize with a barrier.
ghstack-source-id: 104810738

Test Plan: The test isn't flaky anymore.

Differential Revision: D21716070

fbshipit-source-id: e5a97e520c5b10b67c335abf2dc7187ee6227643
2020-05-28 10:48:34 -07:00
Luca Wehrstedt
7866854184 [TensorPipe] Add cases for TP in RPC test helpers (#38927)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38927

Since the regexes weren't matching, the RPC tests would never confirm that the remote end had correctly shut down and would thus keep retrying in a loop forever.
ghstack-source-id: 104760686

Test Plan: Ran the RPC test suite after re-enabling some of the TensorPipe tests

Differential Revision: D21703018

fbshipit-source-id: 3e4b8d22810e58c9d72c4317dcf5ba68d6e0b258
2020-05-28 10:47:44 -07:00
Omkar Salpekar
7492e98c7f [Tensorpipe Agent] RPC, RRef tests for Tensorpipe Agent (#38444)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38444

This enables the RPC/RRef test suites to run with the Tensorpipe RPC Agent. This creates a new fixture to ensure the backend/options used are Tensorpipe, as well as a decorator to skip tests that Tensorpipe currently cannot support due to missing functionality.

One small note: the decorator function is a class method of the test class so we can check whether `self.rpc_backend` is tensorpipe. In the class-scope, the `TEST_CONFIG.rpc_backend_name` string is set to Tensorpipe, but outside the class scope, it is PGA, possibly due to importing dist_utils which sets this config to PGA by default. The cleanest solution would be to refactor the backend selection to be more uniform (since currently every backend is set slightly differently), but that would be a longer-term fix.
ghstack-source-id: 104321885

Test Plan:
Note: A couple of these tests will fail right now due to missing features. I've skipped the ones that regularly fail, but there will be some flaky tests that still fail occasionally.

The decorator `@_skip_if_tensorpipe_agent` skips the tests that fail with the Tensorpipe Agent. Remove this decorator from above the tests once they are fixed.

Differential Revision: D21412016

fbshipit-source-id: 1e801ac5ccaf87974dd4df92d556895b01468bf3
2020-05-19 13:32:58 -07:00
Rohan Varma
d639418307 Add timeout injection to faulty agent for testing (#37485)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37485

Adds arbitrary timeout injection to the faulty RPC agent. This allows better testing of scenarios that involve long-running RPCs, such as properly exercising RPC timeouts and the profiler in all scenarios.

This is done by overriding ProcessGroupAgent's `enqueueSend()` function to inject the timeout. Determining which messages to time out is done similarly to the existing `faulty_messages` option, by having the user specify a mapping from message type to timeout.

Added unit tests that verify RPC timeouts work with builtin + TorchScript functions, which were not tested before.
ghstack-source-id: 103341662

Test Plan: Added unit tests in `FaultyRpcAgentTest`.

Differential Revision: D21296537

fbshipit-source-id: 1dbc21aee14e49780272634e9cbb2b5a448f2896
2020-05-01 23:48:28 -07:00
Rohan Varma
c0a985fcd6 Allow customizing retryable message types in Faulty agent tests (#37450)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37450

We couldn't customize the retryable message types by passing `faulty_messages` into dist_utils, because `FaultyRpcAgentTestFixture` overrode the `rpc_backend_options` function and always provided the default list of retryable message types. This needed to be fixed as part of adding timeout injection support, as mentioned in https://github.com/pytorch/pytorch/issues/36272
ghstack-source-id: 103287164

Test Plan: `buck test mode/dev-nosan //caffe2/test/distributed/rpc/faulty_agent:rpc_spawn_faulty -- --print-passing-details`

Differential Revision: D21270127

fbshipit-source-id: e5dd847dcf92f14b490f84e9ee79291698b85ffa
2020-05-01 12:00:36 -07:00
Rohan Varma
7bd2014eec [resubmit][rpc] per-RPC timeouts for rpc_sync and rpc_async (#34650)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34650

Resubmit of https://github.com/pytorch/pytorch/pull/33840, which was overly eager in the sense that it deleted a lot of code that we didn't want to get rid of yet (default timeout handling).

This PR adds an optional argument to `rpc_sync` and `rpc_async`, as well as to `RpcAgent::send()`, that allows the user to specify a per-RPC timeout overriding the default. If the user does not specify this argument, the currently set default RPC timeout (given in the RPC constructor or via `rpc.set_rpc_timeout()`) is used; otherwise, the passed-in timeout is used.
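A usage sketch of the new argument (worker and function names are illustrative, timeouts are shown in the present-day float-seconds form, and `rpc.init_rpc()` is assumed to have been called):

```
import torch
import torch.distributed.rpc as rpc


def add(x, y):
    return x + y


# Override the default RPC timeout for these calls only.
result = rpc.rpc_sync("worker1", add, args=(torch.ones(2), torch.ones(2)), timeout=5)

fut = rpc.rpc_async("worker1", add, args=(torch.ones(2), torch.ones(2)), timeout=0.5)
try:
    fut.wait()
except RuntimeError as exc:
    print(f"RPC did not complete within 0.5s: {exc}")
```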

This diff does not address:
1) timeout support when `rpc.rpc_async` is called as a JIT operator. For this to work, we would need to change the logic in `register_distributed_ops` to pass this timeout to `rpcTorchscript`. Another issue is that TorchScript doesn't support the timedelta object. This will be done in a follow-up PR, as it requires a fair amount of changes to the argument parsing logic.
2) Per-RPC timeouts for internal messages or `rpc.remote()`. A follow-up diff will address the latter by raising the timeout error to the user at the earliest possible time, such as the next time the RRef is forked or `to_here` is called.

Added unit tests to confirm the current behavior
ghstack-source-id: 102622601

Test Plan: Added unit tests in rpc_test

Differential Revision: D20376953

fbshipit-source-id: 9fb3f147520588308ab50dd33286255658d76d47
2020-04-22 13:00:42 -07:00
Rohan Varma
752d3c281a [profiler] Allow record_function ctx manager to profile futures (#35055)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35055

This is the first step to improving the way RPCs are profiled as suggested by Ilia. For now, since RPC can return two different types of futures, we have to implement two different code paths, one for the python eager mode future and one for the jit future.

This diff implements the python eager part. We have defined a method `_call_end_callbacks_on_future` that takes in a future and schedules a `RecordFunction` to be completed as a callback on the future.
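A hedged sketch of the eager-mode usage described above (worker and function names are illustrative; it assumes `rpc.init_rpc()` has run and that the helper keeps the `_call_end_callbacks_on_future` name mentioned in the summary):

```
import torch
import torch.distributed.rpc as rpc
from torch.autograd.profiler import profile, record_function


def add(x, y):
    return x + y


with profile() as prof:
    with record_function("rpc_block") as rf:
        fut = rpc.rpc_async("worker1", add, args=(torch.ones(2), torch.ones(2)))
        # Defer ending the RecordFunction until the future completes, so the
        # recorded range covers the full lifetime of the RPC.
        rf._call_end_callbacks_on_future(fut)
    fut.wait()
print(prof.key_averages().table())
```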

Once https://github.com/pytorch/pytorch/pull/35039 lands, we can implement the JIT codepath by registering an operator that takes a `Future(t)` as well.

These code paths will be merged once the futures are merged.
ghstack-source-id: 102478180

Test Plan: Added unit tests

Differential Revision: D20452003

fbshipit-source-id: 1acdcb073bd1f63d6fb2e78277ac0be00fd6671d
2020-04-20 12:37:54 -07:00
Omkar Salpekar
4a49ad0da7 Fixed error Regex Parsing for Node Failure Tests (#36620)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36620

Sending to a node that has been shut down in ProcessGroupAgent could throw several possible exceptions. This PR updates the tests to check for the right exceptions while waiting for other nodes in the gang to fail in `test_backward_node_failure` and `test_backward_node_failure_python_udf`.
ghstack-source-id: 102153944

Test Plan: Stress-tested `test_backward_node_failure` and `test_backward_node_failure_python_udf`. They were previously completely broken; this change makes `test_backward_node_failure` functional, while `test_backward_node_failure_python_udf` is still flaky but fails infrequently. A change to make the latter test work reliably is planned.

Differential Revision: D21027280

fbshipit-source-id: e85c2d219ee408483442bd9925fff7206c8efe4b
2020-04-15 10:54:59 -07:00
Omkar Salpekar
4025729e88 [1.5 Release][RPC Reliability] RRef Idempotency and RPC Retry enablement (#33636)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33636

Fixes https://github.com/pytorch/pytorch/issues/32119, https://github.com/pytorch/pytorch/issues/26116,
https://github.com/pytorch/pytorch/issues/33072

Makes RRef control messages idempotent and enables sending with retries for distributed autograd cleanup and RRef internal messages.

In order to effectively test whether RRef and distributed autograd cleanup work with network failures/retries, I implemented an RPC agent with a faulty send function, and enabled running tests using it as a third backend (in addition to Thrift and PGA). The tests using this backend are in a separate class (the test cases are similar, but with minor changes to ensure short-running tests wait for retried RPCs to finish).

This faulty RPC agent is quite configurable. The tests can configure which message types to fail and how many messages to fail, and going forward, other RPC functionality can be overridden with faulty methods to test with failures injected.

Differential Revision: D20019236

fbshipit-source-id: 540a977e96b2e29aa0393ff12621fa293fe92b48
2020-03-20 20:07:47 -07:00
Rohan Varma
ff3d205ee5 [rpc] handle exceptions in ProcessGroupAgent::enqueueRecv (#34413)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34413

In this diff we have made various improvements to ProcessGroupAgent in order to accommodate edge and error cases such as a "non-clean" shutdown (shutdowns in which we abort RPC as quickly as possible and don't wait for all pending work across all RPC agents to be completed):

1. Catch and log exceptions in `enqueueRecv`. This prevents us from calling `std::terminate()` in a different thread and logs an error message indicating the issue. With this we no longer have crashes caused by exceptions in this thread during non-graceful shutdown.

2. Provide cleaner error messages everywhere (and use `c10::str` where possible). One example is in `agent::send()`.

3. Add the ability to abort pending sends that cause blocking waits in `handleSend`. We need to abort these because, during a non-graceful shutdown, there is no guarantee the remote end is still active, so we could become blocked waiting on them, resulting in a long wait and an eventual timeout. We abort them by adding them to a map and going through this map during `shutdown()`.

4. Fix flaky tests: `test_handle_send_exceptions`, `test_backward_node_failure`, and `test_backward_node_failure_python_udf`. These tests were flaky since they dealt with non-graceful shutdown of workers, which is prone to the edge cases explained above.

We have also refactored `createExceptionResponse`, `enqueueRecv`, and some test functions for the above reasons in this diff.

For testing:
Ensured that the tests are no longer flaky across 500 test runs. Previously, these tests were flaky and disabled. Also added a unit test in the internal `ProcessGroupAgentTest.cpp`.
ghstack-source-id: 100311598

Test Plan: Ensured that the tests are no longer flaky across 500 test runs. Previously, these tests were flaky and disabled. Also added a unit test in the internal `ProcessGroupAgentTest.cpp`.

Reviewed By: mrshenli

Differential Revision: D20269074

fbshipit-source-id: de9cad7f7185f9864ffbb6b14cd8ca9f6ff8f465
2020-03-17 19:01:41 -07:00
Pritam Damania
7d9f611b64 Add worker_name helper to dist_utils. (#34162)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34162

This changes the `"worker{}".format(..)` pattern in our unit tests to something
cleaner.
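A sketch of the kind of helper described (the real one lives in `torch/testing/_internal/dist_utils.py`; this standalone version just illustrates the shape):

```
def worker_name(rank: int) -> str:
    """Canonical name for the worker with the given rank."""
    return "worker{}".format(rank)


# Call sites go from rpc.rpc_sync("worker{}".format(dst_rank), ...)
# to the cleaner rpc.rpc_sync(worker_name(dst_rank), ...).
```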
ghstack-source-id: 99713074

Test Plan: waitforbuildbot

Differential Revision: D20233533

fbshipit-source-id: 5cff952ca68af5a6d26dc5cc01463cf7756d83d9
2020-03-07 13:24:45 -08:00
Omkar Salpekar
ad78c0f4fc Fixed the flaky test_rref_context_debug_info (#32749)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32749

The test was flaky because the message from the owner RRef confirming the fork could arrive after the test checked whether the pending User RRefs map was empty, leading to an assertion error. This diff creates a utility function that should be used by any test to wait for this message to finish processing before making any assertions about the pending User RRefs map.

GitHub Issue: https://github.com/pytorch/pytorch/issues/30988

Test Plan: Stress tested `test_rref_context_debug_info` 200 times.

Differential Revision: D19612289

fbshipit-source-id: 57a7c19b1cf792b94c263d3efbbbb6da60c07d07
2020-01-31 16:53:18 -08:00
Rohan Varma
9de3208449 [rpc][flaky-tests] fix for test_handle_send_exceptions and (#32656)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32656

Fixes these flaky tests.

Test Plan: Run the test 500 times and verify that it succeeds every time.

Differential Revision: D19584453

fbshipit-source-id: 07cbc4914211f274182ac0fa74bb5ef6d43392d1
2020-01-28 12:40:12 -08:00
Shihao Xu
5c8535d5b0 Make C++ RpcAgent::currentRPCAgent_ the source of truth of current RPC Agent (#32633)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32633

There were two sources of truth for the current RPC agent.

- One is in the Python world, `torch.distributed.rpc.api._agent`.
- The other is in the C++ world, `RpcAgent::defaultRpcAgent_`.

Setting the Python `_agent` to `None` does not necessarily reset the C++ `defaultRpcAgent_` to `nullptr`.

i.e.
```
torch.distributed.rpc.api._agent = None
```
does not translate to
```
RpcAgent::defaultRpcAgent_ = nullptr
```

This PR removes this ambiguity and uses the C++ pointer as the source of truth.

The solution leverages a pybind11 behavior: it implicitly casts a C++ `shared_ptr<RpcAgent>(nullptr)` to Python `None`.
ghstack-source-id: 97293315

Test Plan:
```
buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork -- test_duplicate_name

buck build mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork

buck-out/gen/caffe2/test/distributed/rpc/rpc_fork\#binary.par -r test_process_group_debug_info
```

```
buck test mode/dev-nosan //caffe2/torch/fb/distributed/pytorch/tests:test_remote_module

buck test mode/dev-nosan //caffe2/torch/fb/distributed/modules/tests:test_sharded_embedding

buck test mode/dev-nosan //caffe2/torch/fb/distributed/modules/tests:test_sharded_pairwise_attention_pooling

buck test mode/dev-nosan //caffe2/torch/fb/distributed/pytorch/tests:test_rpc
```

Differential Revision: D5733066

fbshipit-source-id: b3e6032ee975f19ca556497edbbf40b517b25be8
2020-01-27 19:34:12 -08:00
Pritam Damania
f050b16dd9 Move pytorch distributed tests to separate folder for contbuild. (#30445)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30445

Create distributed and rpc directories under caffe2/test for better management
of unit tests.

Differential Revision: D18702786

fbshipit-source-id: e9daeed0cfb846ef68806f6decfcb57c0e0e3606
2020-01-22 21:16:59 -08:00