Commit Graph

11 Commits

Author SHA1 Message Date
Rohan Varma
c0a985fcd6 Allow customizing retryable message types in Faulty agent tests (#37450)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37450

It doesn't seem like we could customize the retryable message types by
passing faulty_messages into dist_utils, as the `FaultyRpcAgentTestFixture`
overrode the `rpc_backend_options` function and provided the default list of
retryable message types. Needed to fix this as part of adding timeout injection
support as mentioned in https://github.com/pytorch/pytorch/issues/36272
ghstack-source-id: 103287164

Test Plan: `buck test mode/dev-nosan //caffe2/test/distributed/rpc/faulty_agent:rpc_spawn_faulty -- --print-passing-details`

Differential Revision: D21270127

fbshipit-source-id: e5dd847dcf92f14b490f84e9ee79291698b85ffa
2020-05-01 12:00:36 -07:00
Rohan Varma
7bd2014eec [resubmit][rpc] per-RPC timeouts for rpc_sync and rpc_async (#34650)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34650

Resubmit of https://github.com/pytorch/pytorch/pull/33840, which was overly eager in the sense that it deleted a lot of code that we didn't want to get rid of yet (default timeout handling).

This PR adds an optional argument into `rpc_sync` and `rpc_async` as well as `RpcAgent::send()` that allows the user to specify a timeout for an RPC to override the default set timeout. If the user does not specify this argument, then the currently set default RPC timeout given in the RPC constructor or by `rpc.set_rpc_timeout()` is used. Otherwise, we use the passed in timeout.

This diff does not address:
1) timeout support when called rpc.rpc_async is called as a JIT operator. For this to work, we would need to change the logic in `register_distributed_ops` to pass in this timeout to `rpcTorchscript`. One more issue is that torchscript doesn't support the timedelta object. This will be done in a follow up PR as it requires a fair amount of changes to the argument parsing logic.
2) Per-RPC timeouts for internal messages or `rpc.remote()`. A follow-up diff will address the latter with the approach of raising the timeout error at the earliest next possible time to the user, such as when the next time the RRef is forked or `to_here` is called

Added unit tests to confirm the current behavior
ghstack-source-id: 102622601

Test Plan: Added unit tests in rpc_test

Differential Revision: D20376953

fbshipit-source-id: 9fb3f147520588308ab50dd33286255658d76d47
2020-04-22 13:00:42 -07:00
Rohan Varma
752d3c281a [profiler] Allow record_function ctx manager to profile futures (#35055)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35055

This is the first step to improving the way RPCs are profiled as suggested by Ilia. For now, since RPC can return two different types of futures, we have to implement two different code paths, one for the python eager mode future and one for the jit future.

This diff implements the python eager part. We have defined a method `_call_end_callbacks_on_future` that takes in a future and schedules a `RecordFunction` to be completed as a callback on the future.

Once https://github.com/pytorch/pytorch/pull/35039 lands, we can implement the JIT codepath by registering an operator that takes a `Future(t)` as well.

These code paths will be merged once the futures are merged.
ghstack-source-id: 102478180

Test Plan: Added unit tests

Differential Revision: D20452003

fbshipit-source-id: 1acdcb073bd1f63d6fb2e78277ac0be00fd6671d
2020-04-20 12:37:54 -07:00
Omkar Salpekar
4a49ad0da7 Fixed error Regex Parsing for Node Failure Tests (#36620)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36620

Sending to a node that has been shutdown in ProcessGroupAgent could throw several possible exceptions. This PR updates the tests to check for the right exceptions while waiting for other nodes in the gang to fail in `test_backward_node_failure` and `test_backward_node_failure_python_udf`.
ghstack-source-id: 102153944

Test Plan: Stress-tested `test_backward_node_failure` and `test_backward_node_failure_python_udf`. They were previously completely broken, but this change makes  `test_backward_node_failure`  functional and `test_backward_node_failure_python_udf` is flaky but fails infrequently. A change to make the last test work reliably is planned.

Differential Revision: D21027280

fbshipit-source-id: e85c2d219ee408483442bd9925fff7206c8efe4b
2020-04-15 10:54:59 -07:00
Omkar Salpekar
4025729e88 [1.5 Release][RPC Reliability] RRef Idempotency and RPC Retry enablement (#33636)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33636

Fixes https://github.com/pytorch/pytorch/issues/32119, https://github.com/pytorch/pytorch/issues/26116,
https://github.com/pytorch/pytorch/issues/33072

Makes RRef control messages idempotent and enables sending with retries for distributed autograd cleanup and RRef internal messages.

In order to effectively test whether these RRef and distributed autograd cleanup work with network failures/retries, I implemented an  RPC Agent with a faulty send function, and enabled running tests using this as a third backend (in addition to Thrift and PGA). The tests using this backend are in a separate class (the test cases are similar but with minor changes to ensure short-running tests wait for retried RPCs to finish).

This faulty RPC Agent is pretty configurable. The tests can configure which messages types to fail, and how many messages to fail, but going forward, other RPC functionality can be overriden with faulty methods to test with failures injected.

Differential Revision: D20019236

fbshipit-source-id: 540a977e96b2e29aa0393ff12621fa293fe92b48
2020-03-20 20:07:47 -07:00
Rohan Varma
ff3d205ee5 [rpc] handle exceptions in ProcessGroupAgent::enqueueRecv (#34413)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34413

In this diff we have made various improvements to ProcessGroupAgent in order to accomodate edge and error cases such as a "non-clean" shutdown (shutdowns in which we abort RPC as quickly as possible, and don't wait for all pending work across all RPC agents to be completed):

1. Catch and log exceptions in `enqueueRecv`. This prevents us from calling `std::terminate()` in a different thread and logs an error message indicating the issue. With this we no longer have crashes caused by exceptions in this thread during non-graceful shutdown.

2. Provide cleaner error messages everywhere (and use `c10::str` where possible). One example is in `agent::send()`.

3. Add the ability to abort pending sends that cause blocking waits in `handleSend`. The reason we need to abort this is since during a non-graceful shutdown, we could become blocked waiting for these since there is no guarantee the remote end is still active and this would result in a long wait and eventual timeout. We abort these by adding them to a map, and go through this map during `shutdown()`.

4. Fix flaky tests: `test_handle_send_exceptions` and `test_backward_node_failure` and `test_backward_node_failure_python_udf`. These tests were flaky since they dealt with non-graceful shutdown of workers which has chances for a bunch of edge cases explained above.

We have also refactored `createExceptionResponse`, `enqueueRecv`, and some test functions for the above reasons in this diff.

For testing:
Ensured that the tests are no longer flaky with 500 tests runs. Previously, these tests were flaky and disabled. Also added a unit test in the internal `ProcessGroupAgentTest.cpp`.
ghstack-source-id: 100311598

Test Plan: Ensured that the tests are no longer flaky with 500 tests runs. Previously, these tests were flaky and disabled. Also added a unit test in the internal `ProcessGroupAgentTest.cpp`.

Reviewed By: mrshenli

Differential Revision: D20269074

fbshipit-source-id: de9cad7f7185f9864ffbb6b14cd8ca9f6ff8f465
2020-03-17 19:01:41 -07:00
Pritam Damania
7d9f611b64 Add worker_name helper to dist_utils. (#34162)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34162

This avoids the "worker{}".format(..) in our unit tests to something
cleaner.
ghstack-source-id: 99713074

Test Plan: waitforbuildbot

Differential Revision: D20233533

fbshipit-source-id: 5cff952ca68af5a6d26dc5cc01463cf7756d83d9
2020-03-07 13:24:45 -08:00
Omkar Salpekar
ad78c0f4fc Fixed the flaky test_rref_context_debug_info (#32749)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32749

The test was flaky since the message from owner RRef confirming fork would arrive after the test checked whether the pending User RRefs map was empty - leading to an assertion error. This diff creates a utility function that should be used by any test to wait for this message to complete processing before doing any assertions related to the pending User RRefs map.

GitHub Issue: https://github.com/pytorch/pytorch/issues/30988

Test Plan: Stress tested `test_rref_context_debug_info` 200 times.

Differential Revision: D19612289

fbshipit-source-id: 57a7c19b1cf792b94c263d3efbbbb6da60c07d07
2020-01-31 16:53:18 -08:00
Rohan Varma
9de3208449 [rpc][flaky-tests] fix for test_handle_send_exceptions and (#32656)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32656

Fixes these flaky tests.

Test Plan: Run the test 500 times and verify that it succeeds every time.

Differential Revision: D19584453

fbshipit-source-id: 07cbc4914211f274182ac0fa74bb5ef6d43392d1
2020-01-28 12:40:12 -08:00
Shihao Xu
5c8535d5b0 Make C++ RpcAgent::currentRPCAgent_ the source of truth of current RPC Agent (#32633)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32633

There were 2 sources of current RPC agent.

- One is in Python world, `torch.distributedrpc.api._agent`.
- The other is in C++ world, `RpcAgent::defaultRpcAgent_`

Setting Python `_agent` to `None`, does not necessarily reset the C++ `defaultRpcAgent_` to `nullptr`.

i.e.
```
 torch.distributedrpc.api._agent = None
```
does not translate to
```
RpcAgent::defaultRpcAgent_ = nullptr
```

This PR is to remove this ambiguity, and use the C++ pointer as source of truth.

The solution is to leverage a pybind11 behavior that it implicitly casts C++ `shared_ptr<RpcAgent>(nullptr)` to Python `None`.
ghstack-source-id: 97293315

Test Plan:
```
buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork -- test_duplicate_name

buck build mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork

buck-out/gen/caffe2/test/distributed/rpc/rpc_fork\#binary.par -r test_process_group_debug_info
```

```
buck test mode/dev-nosan //caffe2/torch/fb/distributed/pytorch/tests:test_remote_module

buck test mode/dev-nosan //caffe2/torch/fb/distributed/modules/tests:test_sharded_embedding

buck test mode/dev-nosan //caffe2/torch/fb/distributed/modules/tests:test_sharded_pairwise_attention_pooling

buck test mode/dev-nosan //caffe2/torch/fb/distributed/pytorch/tests:test_rpc
```

Differential Revision: D5733066

fbshipit-source-id: b3e6032ee975f19ca556497edbbf40b517b25be8
2020-01-27 19:34:12 -08:00
Pritam Damania
f050b16dd9 Move pytorch distributed tests to separate folder for contbuild. (#30445)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30445

Create distributed and rpc directories under caffe/test for better management
of unit tests.

Differential Revision: D18702786

fbshipit-source-id: e9daeed0cfb846ef68806f6decfcb57c0e0e3606
2020-01-22 21:16:59 -08:00