pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-06 12:20:52 +01:00

Author	SHA1	Message	Date
Rohan Varma	c0a985fcd6	Allow customizing retryable message types in Faulty agent tests (#37450 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37450 It doesn't seem like we could customize the retryable message types by passing faulty_messages into dist_utils, as the `FaultyRpcAgentTestFixture` overrode the `rpc_backend_options` function and provided the default list of retryable message types. Needed to fix this as part of adding timeout injection support as mentioned in https://github.com/pytorch/pytorch/issues/36272 ghstack-source-id: 103287164 Test Plan: `buck test mode/dev-nosan //caffe2/test/distributed/rpc/faulty_agent:rpc_spawn_faulty -- --print-passing-details` Differential Revision: D21270127 fbshipit-source-id: e5dd847dcf92f14b490f84e9ee79291698b85ffa	2020-05-01 12:00:36 -07:00
Rohan Varma	7bd2014eec	[resubmit][rpc] per-RPC timeouts for rpc_sync and rpc_async (#34650 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34650 Resubmit of https://github.com/pytorch/pytorch/pull/33840, which was overly eager in the sense that it deleted a lot of code that we didn't want to get rid of yet (default timeout handling). This PR adds an optional argument into `rpc_sync` and `rpc_async` as well as `RpcAgent::send()` that allows the user to specify a timeout for an RPC to override the default set timeout. If the user does not specify this argument, then the currently set default RPC timeout given in the RPC constructor or by `rpc.set_rpc_timeout()` is used. Otherwise, we use the passed in timeout. This diff does not address: 1) timeout support when called rpc.rpc_async is called as a JIT operator. For this to work, we would need to change the logic in `register_distributed_ops` to pass in this timeout to `rpcTorchscript`. One more issue is that torchscript doesn't support the timedelta object. This will be done in a follow up PR as it requires a fair amount of changes to the argument parsing logic. 2) Per-RPC timeouts for internal messages or `rpc.remote()`. A follow-up diff will address the latter with the approach of raising the timeout error at the earliest next possible time to the user, such as when the next time the RRef is forked or `to_here` is called Added unit tests to confirm the current behavior ghstack-source-id: 102622601 Test Plan: Added unit tests in rpc_test Differential Revision: D20376953 fbshipit-source-id: 9fb3f147520588308ab50dd33286255658d76d47	2020-04-22 13:00:42 -07:00
Rohan Varma	752d3c281a	[profiler] Allow record_function ctx manager to profile futures (#35055 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35055 This is the first step to improving the way RPCs are profiled as suggested by Ilia. For now, since RPC can return two different types of futures, we have to implement two different code paths, one for the python eager mode future and one for the jit future. This diff implements the python eager part. We have defined a method `_call_end_callbacks_on_future` that takes in a future and schedules a `RecordFunction` to be completed as a callback on the future. Once https://github.com/pytorch/pytorch/pull/35039 lands, we can implement the JIT codepath by registering an operator that takes a `Future(t)` as well. These code paths will be merged once the futures are merged. ghstack-source-id: 102478180 Test Plan: Added unit tests Differential Revision: D20452003 fbshipit-source-id: 1acdcb073bd1f63d6fb2e78277ac0be00fd6671d	2020-04-20 12:37:54 -07:00
Omkar Salpekar	4a49ad0da7	Fixed error Regex Parsing for Node Failure Tests (#36620 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36620 Sending to a node that has been shutdown in ProcessGroupAgent could throw several possible exceptions. This PR updates the tests to check for the right exceptions while waiting for other nodes in the gang to fail in `test_backward_node_failure` and `test_backward_node_failure_python_udf`. ghstack-source-id: 102153944 Test Plan: Stress-tested `test_backward_node_failure` and `test_backward_node_failure_python_udf`. They were previously completely broken, but this change makes `test_backward_node_failure` functional and `test_backward_node_failure_python_udf` is flaky but fails infrequently. A change to make the last test work reliably is planned. Differential Revision: D21027280 fbshipit-source-id: e85c2d219ee408483442bd9925fff7206c8efe4b	2020-04-15 10:54:59 -07:00
Omkar Salpekar	4025729e88	[1.5 Release][RPC Reliability] RRef Idempotency and RPC Retry enablement (#33636 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33636 Fixes https://github.com/pytorch/pytorch/issues/32119, https://github.com/pytorch/pytorch/issues/26116, https://github.com/pytorch/pytorch/issues/33072 Makes RRef control messages idempotent and enables sending with retries for distributed autograd cleanup and RRef internal messages. In order to effectively test whether these RRef and distributed autograd cleanup work with network failures/retries, I implemented an RPC Agent with a faulty send function, and enabled running tests using this as a third backend (in addition to Thrift and PGA). The tests using this backend are in a separate class (the test cases are similar but with minor changes to ensure short-running tests wait for retried RPCs to finish). This faulty RPC Agent is pretty configurable. The tests can configure which messages types to fail, and how many messages to fail, but going forward, other RPC functionality can be overriden with faulty methods to test with failures injected. Differential Revision: D20019236 fbshipit-source-id: 540a977e96b2e29aa0393ff12621fa293fe92b48	2020-03-20 20:07:47 -07:00
Rohan Varma	ff3d205ee5	[rpc] handle exceptions in ProcessGroupAgent::enqueueRecv (#34413 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34413 In this diff we have made various improvements to ProcessGroupAgent in order to accomodate edge and error cases such as a "non-clean" shutdown (shutdowns in which we abort RPC as quickly as possible, and don't wait for all pending work across all RPC agents to be completed): 1. Catch and log exceptions in `enqueueRecv`. This prevents us from calling `std::terminate()` in a different thread and logs an error message indicating the issue. With this we no longer have crashes caused by exceptions in this thread during non-graceful shutdown. 2. Provide cleaner error messages everywhere (and use `c10::str` where possible). One example is in `agent::send()`. 3. Add the ability to abort pending sends that cause blocking waits in `handleSend`. The reason we need to abort this is since during a non-graceful shutdown, we could become blocked waiting for these since there is no guarantee the remote end is still active and this would result in a long wait and eventual timeout. We abort these by adding them to a map, and go through this map during `shutdown()`. 4. Fix flaky tests: `test_handle_send_exceptions` and `test_backward_node_failure` and `test_backward_node_failure_python_udf`. These tests were flaky since they dealt with non-graceful shutdown of workers which has chances for a bunch of edge cases explained above. We have also refactored `createExceptionResponse`, `enqueueRecv`, and some test functions for the above reasons in this diff. For testing: Ensured that the tests are no longer flaky with 500 tests runs. Previously, these tests were flaky and disabled. Also added a unit test in the internal `ProcessGroupAgentTest.cpp`. ghstack-source-id: 100311598 Test Plan: Ensured that the tests are no longer flaky with 500 tests runs. Previously, these tests were flaky and disabled. Also added a unit test in the internal `ProcessGroupAgentTest.cpp`. Reviewed By: mrshenli Differential Revision: D20269074 fbshipit-source-id: de9cad7f7185f9864ffbb6b14cd8ca9f6ff8f465	2020-03-17 19:01:41 -07:00
Pritam Damania	7d9f611b64	Add worker_name helper to dist_utils. (#34162 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34162 This avoids the "worker{}".format(..) in our unit tests to something cleaner. ghstack-source-id: 99713074 Test Plan: waitforbuildbot Differential Revision: D20233533 fbshipit-source-id: 5cff952ca68af5a6d26dc5cc01463cf7756d83d9	2020-03-07 13:24:45 -08:00
Omkar Salpekar	ad78c0f4fc	Fixed the flaky test_rref_context_debug_info (#32749 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32749 The test was flaky since the message from owner RRef confirming fork would arrive after the test checked whether the pending User RRefs map was empty - leading to an assertion error. This diff creates a utility function that should be used by any test to wait for this message to complete processing before doing any assertions related to the pending User RRefs map. GitHub Issue: https://github.com/pytorch/pytorch/issues/30988 Test Plan: Stress tested `test_rref_context_debug_info` 200 times. Differential Revision: D19612289 fbshipit-source-id: 57a7c19b1cf792b94c263d3efbbbb6da60c07d07	2020-01-31 16:53:18 -08:00
Rohan Varma	9de3208449	[rpc][flaky-tests] fix for test_handle_send_exceptions and (#32656 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32656 Fixes these flaky tests. Test Plan: Run the test 500 times and verify that it succeeds every time. Differential Revision: D19584453 fbshipit-source-id: 07cbc4914211f274182ac0fa74bb5ef6d43392d1	2020-01-28 12:40:12 -08:00
Shihao Xu	5c8535d5b0	Make C++ RpcAgent::currentRPCAgent_ the source of truth of current RPC Agent (#32633 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32633 There were 2 sources of current RPC agent. - One is in Python world, `torch.distributedrpc.api._agent`. - The other is in C++ world, `RpcAgent::defaultRpcAgent_` Setting Python `_agent` to `None`, does not necessarily reset the C++ `defaultRpcAgent_` to `nullptr`. i.e. ``` torch.distributedrpc.api._agent = None ``` does not translate to ``` RpcAgent::defaultRpcAgent_ = nullptr ``` This PR is to remove this ambiguity, and use the C++ pointer as source of truth. The solution is to leverage a pybind11 behavior that it implicitly casts C++ `shared_ptr<RpcAgent>(nullptr)` to Python `None`. ghstack-source-id: 97293315 Test Plan: ``` buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork -- test_duplicate_name buck build mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork buck-out/gen/caffe2/test/distributed/rpc/rpc_fork\#binary.par -r test_process_group_debug_info ``` ``` buck test mode/dev-nosan //caffe2/torch/fb/distributed/pytorch/tests:test_remote_module buck test mode/dev-nosan //caffe2/torch/fb/distributed/modules/tests:test_sharded_embedding buck test mode/dev-nosan //caffe2/torch/fb/distributed/modules/tests:test_sharded_pairwise_attention_pooling buck test mode/dev-nosan //caffe2/torch/fb/distributed/pytorch/tests:test_rpc ``` Differential Revision: D5733066 fbshipit-source-id: b3e6032ee975f19ca556497edbbf40b517b25be8	2020-01-27 19:34:12 -08:00
Pritam Damania	f050b16dd9	Move pytorch distributed tests to separate folder for contbuild. (#30445 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30445 Create distributed and rpc directories under caffe/test for better management of unit tests. Differential Revision: D18702786 fbshipit-source-id: e9daeed0cfb846ef68806f6decfcb57c0e0e3606	2020-01-22 21:16:59 -08:00

11 Commits