Commit Graph

15 Commits

Author SHA1 Message Date
Luca Wehrstedt
f58cc4b444 [RPC] Fix flaky test by waiting for async rref calls (#39012)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39012

The `test_rref_context_debug_info` test was flaky with the TensorPipe agent, and I think the issue is the test itself.

What was happening is that on line 1826 the test was clearing a global variable on the remote side which was holding a rref. Even though the RPC call that unset the global variable was synchronous, the messages that the rref context needs to send around to delete that rref are asynchronous. Therefore, sometimes, when we reached line 1845 we saw the following check fail:
```
        self.assertEqual(2, int(info["num_owner_rrefs"]))
```
because `num_owner_rrefs` was still 3, as the deletion hadn't yet been processed.

The only way I found to fix it is to add a synchronization step where we wait for all the futures from the rref context to complete. Since we must wait for this to happen on all workers, we synchronize with a barrier.
ghstack-source-id: 104810738

Test Plan: The test isn't flaky anymore.

Differential Revision: D21716070

fbshipit-source-id: e5a97e520c5b10b67c335abf2dc7187ee6227643
2020-05-28 10:48:34 -07:00
Luca Wehrstedt
7866854184 [TensorPipe] Add cases for TP in RPC test helpers (#38927)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38927

Since the regexs weren't matching the RPC tests would never confirm that the remote end had correctly shut down and were thus retrying in a loop forever.
ghstack-source-id: 104760686

Test Plan: Ran the RPC test suite after re-enabling some of the TensorPipe tests

Differential Revision: D21703018

fbshipit-source-id: 3e4b8d22810e58c9d72c4317dcf5ba68d6e0b258
2020-05-28 10:47:44 -07:00
Omkar Salpekar
7492e98c7f [Tensorpipe Agent] RPC, RRef tests for Tensorpipe Agent (#38444)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38444

This enables the RPC/RRef test suites to run with the Tensorpipe RPC Agent. This creates a new fixture to ensure the backend/options used are Tensorpipe, as well as a decorator to skip tests that Tensorpipe currently cannot support due to missing functionality.

One small note: the decorator function is a class method of the test class so we can check whether `self.rpc_backend` is tensorpipe. In the class-scope, the `TEST_CONFIG.rpc_backend_name` string is set to Tensorpipe, but outside the class scope, it is PGA, possibly due to importing dist_utils which sets this config to PGA by default. The cleanest solution would be to refactor the backend selection to be more uniform (since currently every backend is set slightly differently), but that would be a longer-term fix.
ghstack-source-id: 104321885

Test Plan:
Note: A couple of these tests will fail right now due to missing features. I've skipped the ones that regularly fail, but there will be some flaky tests that still fail occasionally.

The decorator `@_skip_if_tensorpipe_agent` skips the tests that fail with the Tensorpipe Agent. Remove this decorator from above the tests once they are fixed.

Differential Revision: D21412016

fbshipit-source-id: 1e801ac5ccaf87974dd4df92d556895b01468bf3
2020-05-19 13:32:58 -07:00
Rohan Varma
d639418307 Add timeout injection to faulty agent for testing (#37485)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37485

Adds arbitrary timeout injection to faulty RPC agent. This is to better test scenarios that need information about how long-running RPCs, such as properly testing RPC timeouts and the profiler in all scenarios.

This is done by overriding ProcessGroupAgent's `enqueueSend()` function to inject the timeout. Determining which messages to timeout is done similar to the existing `faulty_messages` by having the user specify a mapping of message to timeout.

Added unit tests that verify RPC timeouts work with builtin + TorchScript functions, which was not tested before.
ghstack-source-id: 103341662

Test Plan: Added unit tests in `FaultyRpcAgentTest`.

Differential Revision: D21296537

fbshipit-source-id: 1dbc21aee14e49780272634e9cbb2b5a448f2896
2020-05-01 23:48:28 -07:00
Rohan Varma
c0a985fcd6 Allow customizing retryable message types in Faulty agent tests (#37450)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37450

It doesn't seem like we could customize the retryable message types by
passing faulty_messages into dist_utils, as the `FaultyRpcAgentTestFixture`
overrode the `rpc_backend_options` function and provided the default list of
retryable message types. Needed to fix this as part of adding timeout injection
support as mentioned in https://github.com/pytorch/pytorch/issues/36272
ghstack-source-id: 103287164

Test Plan: `buck test mode/dev-nosan //caffe2/test/distributed/rpc/faulty_agent:rpc_spawn_faulty -- --print-passing-details`

Differential Revision: D21270127

fbshipit-source-id: e5dd847dcf92f14b490f84e9ee79291698b85ffa
2020-05-01 12:00:36 -07:00
Rohan Varma
7bd2014eec [resubmit][rpc] per-RPC timeouts for rpc_sync and rpc_async (#34650)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34650

Resubmit of https://github.com/pytorch/pytorch/pull/33840, which was overly eager in the sense that it deleted a lot of code that we didn't want to get rid of yet (default timeout handling).

This PR adds an optional argument into `rpc_sync` and `rpc_async` as well as `RpcAgent::send()` that allows the user to specify a timeout for an RPC to override the default set timeout. If the user does not specify this argument, then the currently set default RPC timeout given in the RPC constructor or by `rpc.set_rpc_timeout()` is used. Otherwise, we use the passed in timeout.

This diff does not address:
1) timeout support when called rpc.rpc_async is called as a JIT operator. For this to work, we would need to change the logic in `register_distributed_ops` to pass in this timeout to `rpcTorchscript`. One more issue is that torchscript doesn't support the timedelta object. This will be done in a follow up PR as it requires a fair amount of changes to the argument parsing logic.
2) Per-RPC timeouts for internal messages or `rpc.remote()`. A follow-up diff will address the latter with the approach of raising the timeout error at the earliest next possible time to the user, such as when the next time the RRef is forked or `to_here` is called

Added unit tests to confirm the current behavior
ghstack-source-id: 102622601

Test Plan: Added unit tests in rpc_test

Differential Revision: D20376953

fbshipit-source-id: 9fb3f147520588308ab50dd33286255658d76d47
2020-04-22 13:00:42 -07:00
Rohan Varma
752d3c281a [profiler] Allow record_function ctx manager to profile futures (#35055)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35055

This is the first step to improving the way RPCs are profiled as suggested by Ilia. For now, since RPC can return two different types of futures, we have to implement two different code paths, one for the python eager mode future and one for the jit future.

This diff implements the python eager part. We have defined a method `_call_end_callbacks_on_future` that takes in a future and schedules a `RecordFunction` to be completed as a callback on the future.

Once https://github.com/pytorch/pytorch/pull/35039 lands, we can implement the JIT codepath by registering an operator that takes a `Future(t)` as well.

These code paths will be merged once the futures are merged.
ghstack-source-id: 102478180

Test Plan: Added unit tests

Differential Revision: D20452003

fbshipit-source-id: 1acdcb073bd1f63d6fb2e78277ac0be00fd6671d
2020-04-20 12:37:54 -07:00
Omkar Salpekar
4a49ad0da7 Fixed error Regex Parsing for Node Failure Tests (#36620)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36620

Sending to a node that has been shutdown in ProcessGroupAgent could throw several possible exceptions. This PR updates the tests to check for the right exceptions while waiting for other nodes in the gang to fail in `test_backward_node_failure` and `test_backward_node_failure_python_udf`.
ghstack-source-id: 102153944

Test Plan: Stress-tested `test_backward_node_failure` and `test_backward_node_failure_python_udf`. They were previously completely broken, but this change makes  `test_backward_node_failure`  functional and `test_backward_node_failure_python_udf` is flaky but fails infrequently. A change to make the last test work reliably is planned.

Differential Revision: D21027280

fbshipit-source-id: e85c2d219ee408483442bd9925fff7206c8efe4b
2020-04-15 10:54:59 -07:00
Omkar Salpekar
4025729e88 [1.5 Release][RPC Reliability] RRef Idempotency and RPC Retry enablement (#33636)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33636

Fixes https://github.com/pytorch/pytorch/issues/32119, https://github.com/pytorch/pytorch/issues/26116,
https://github.com/pytorch/pytorch/issues/33072

Makes RRef control messages idempotent and enables sending with retries for distributed autograd cleanup and RRef internal messages.

In order to effectively test whether these RRef and distributed autograd cleanup work with network failures/retries, I implemented an  RPC Agent with a faulty send function, and enabled running tests using this as a third backend (in addition to Thrift and PGA). The tests using this backend are in a separate class (the test cases are similar but with minor changes to ensure short-running tests wait for retried RPCs to finish).

This faulty RPC Agent is pretty configurable. The tests can configure which messages types to fail, and how many messages to fail, but going forward, other RPC functionality can be overriden with faulty methods to test with failures injected.

Differential Revision: D20019236

fbshipit-source-id: 540a977e96b2e29aa0393ff12621fa293fe92b48
2020-03-20 20:07:47 -07:00
Rohan Varma
ff3d205ee5 [rpc] handle exceptions in ProcessGroupAgent::enqueueRecv (#34413)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34413

In this diff we have made various improvements to ProcessGroupAgent in order to accomodate edge and error cases such as a "non-clean" shutdown (shutdowns in which we abort RPC as quickly as possible, and don't wait for all pending work across all RPC agents to be completed):

1. Catch and log exceptions in `enqueueRecv`. This prevents us from calling `std::terminate()` in a different thread and logs an error message indicating the issue. With this we no longer have crashes caused by exceptions in this thread during non-graceful shutdown.

2. Provide cleaner error messages everywhere (and use `c10::str` where possible). One example is in `agent::send()`.

3. Add the ability to abort pending sends that cause blocking waits in `handleSend`. The reason we need to abort this is since during a non-graceful shutdown, we could become blocked waiting for these since there is no guarantee the remote end is still active and this would result in a long wait and eventual timeout. We abort these by adding them to a map, and go through this map during `shutdown()`.

4. Fix flaky tests: `test_handle_send_exceptions` and `test_backward_node_failure` and `test_backward_node_failure_python_udf`. These tests were flaky since they dealt with non-graceful shutdown of workers which has chances for a bunch of edge cases explained above.

We have also refactored `createExceptionResponse`, `enqueueRecv`, and some test functions for the above reasons in this diff.

For testing:
Ensured that the tests are no longer flaky with 500 tests runs. Previously, these tests were flaky and disabled. Also added a unit test in the internal `ProcessGroupAgentTest.cpp`.
ghstack-source-id: 100311598

Test Plan: Ensured that the tests are no longer flaky with 500 tests runs. Previously, these tests were flaky and disabled. Also added a unit test in the internal `ProcessGroupAgentTest.cpp`.

Reviewed By: mrshenli

Differential Revision: D20269074

fbshipit-source-id: de9cad7f7185f9864ffbb6b14cd8ca9f6ff8f465
2020-03-17 19:01:41 -07:00
Pritam Damania
7d9f611b64 Add worker_name helper to dist_utils. (#34162)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34162

This avoids the "worker{}".format(..) in our unit tests to something
cleaner.
ghstack-source-id: 99713074

Test Plan: waitforbuildbot

Differential Revision: D20233533

fbshipit-source-id: 5cff952ca68af5a6d26dc5cc01463cf7756d83d9
2020-03-07 13:24:45 -08:00
Omkar Salpekar
ad78c0f4fc Fixed the flaky test_rref_context_debug_info (#32749)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32749

The test was flaky since the message from owner RRef confirming fork would arrive after the test checked whether the pending User RRefs map was empty - leading to an assertion error. This diff creates a utility function that should be used by any test to wait for this message to complete processing before doing any assertions related to the pending User RRefs map.

GitHub Issue: https://github.com/pytorch/pytorch/issues/30988

Test Plan: Stress tested `test_rref_context_debug_info` 200 times.

Differential Revision: D19612289

fbshipit-source-id: 57a7c19b1cf792b94c263d3efbbbb6da60c07d07
2020-01-31 16:53:18 -08:00
Rohan Varma
9de3208449 [rpc][flaky-tests] fix for test_handle_send_exceptions and (#32656)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32656

Fixes these flaky tests.

Test Plan: Run the test 500 times and verify that it succeeds every time.

Differential Revision: D19584453

fbshipit-source-id: 07cbc4914211f274182ac0fa74bb5ef6d43392d1
2020-01-28 12:40:12 -08:00
Shihao Xu
5c8535d5b0 Make C++ RpcAgent::currentRPCAgent_ the source of truth of current RPC Agent (#32633)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32633

There were 2 sources of current RPC agent.

- One is in Python world, `torch.distributedrpc.api._agent`.
- The other is in C++ world, `RpcAgent::defaultRpcAgent_`

Setting Python `_agent` to `None`, does not necessarily reset the C++ `defaultRpcAgent_` to `nullptr`.

i.e.
```
 torch.distributedrpc.api._agent = None
```
does not translate to
```
RpcAgent::defaultRpcAgent_ = nullptr
```

This PR is to remove this ambiguity, and use the C++ pointer as source of truth.

The solution is to leverage a pybind11 behavior that it implicitly casts C++ `shared_ptr<RpcAgent>(nullptr)` to Python `None`.
ghstack-source-id: 97293315

Test Plan:
```
buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork -- test_duplicate_name

buck build mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork

buck-out/gen/caffe2/test/distributed/rpc/rpc_fork\#binary.par -r test_process_group_debug_info
```

```
buck test mode/dev-nosan //caffe2/torch/fb/distributed/pytorch/tests:test_remote_module

buck test mode/dev-nosan //caffe2/torch/fb/distributed/modules/tests:test_sharded_embedding

buck test mode/dev-nosan //caffe2/torch/fb/distributed/modules/tests:test_sharded_pairwise_attention_pooling

buck test mode/dev-nosan //caffe2/torch/fb/distributed/pytorch/tests:test_rpc
```

Differential Revision: D5733066

fbshipit-source-id: b3e6032ee975f19ca556497edbbf40b517b25be8
2020-01-27 19:34:12 -08:00
Pritam Damania
f050b16dd9 Move pytorch distributed tests to separate folder for contbuild. (#30445)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30445

Create distributed and rpc directories under caffe/test for better management
of unit tests.

Differential Revision: D18702786

fbshipit-source-id: e9daeed0cfb846ef68806f6decfcb57c0e0e3606
2020-01-22 21:16:59 -08:00