Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68128
Reland of D31762735 (0cbfd466d2).
This diff was originally reverted due to a failure in test_send_export_type_through_rpc_with_custom_pickler.
I updated rpc_pickler_test.py to prevent a race condition where processes were not registering their pickler before handling their rpc_sync calls.
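The actual change to rpc_pickler_test.py is not shown in this message. As a rough sketch of the ordering it enforces, the snippet below uses threads and a `threading.Barrier` as a stand-in for the test processes; `registry`, `worker`, and `run_workers` are hypothetical names used only for illustration.

```python
import threading

NUM_WORKERS = 4
registry = {}  # rank -> registered pickler (hypothetical stand-in)
all_registered = threading.Barrier(NUM_WORKERS)
results = []
results_lock = threading.Lock()


def worker(rank):
    # Step 1: register this worker's pickler.
    registry[rank] = f"pickler-{rank}"
    # Step 2: block until every worker has registered.
    all_registered.wait()
    # Step 3: only now "call" a peer; its pickler is guaranteed to exist.
    peer = (rank + 1) % NUM_WORKERS
    with results_lock:
        results.append((rank, registry[peer]))


def run_workers():
    registry.clear()
    results.clear()
    threads = [threading.Thread(target=worker, args=(r,)) for r in range(NUM_WORKERS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results)
```

Without the barrier, a worker could reach step 3 before its peer finished step 1, which is the kind of race described above.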
Test Plan:
rpc_pickler_test file:
buck test mode/dev-nosan -c 'cxx.coverage_only=caffe2' //caffe2/torch/fb/training_toolkit/backend/metrics/tests:rpc_pickler_test //caffe2/torch/fb/training_toolkit/backend/metrics/collectors/fbdata_aggregator/tests:batch_collector_test -- --run-disabled --collect-coverage '--code-coverage-session=test_session' --force-tpx
rpc_pickler stress test:
buck test mode/dev-nosan -c 'cxx.coverage_only=caffe2' //caffe2/torch/fb/training_toolkit/backend/metrics/tests:rpc_pickler_test -- --exact 'caffe2/torch/fb/training_toolkit/backend/metrics/tests:rpc_pickler_test - test_send_export_type_through_rpc_with_custom_pickler (caffe2.torch.fb.training_toolkit.backend.metrics.tests.rpc_pickler_test.CythonTypeRpcSpawnTest)' --run-disabled --collect-coverage '--code-coverage-session=test_session' --force-tpx --jobs 18 --stress-runs 10 --record-results
Reviewed By: mrshenli
Differential Revision: D32316077
fbshipit-source-id: e58de2335fbaa3ab46d46fe222c659197633a5e4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67924
This diff reverts the changes made in D31762735 (0cbfd466d2)
Test Plan: Wait for CI
Reviewed By: derekmod-fb
Differential Revision: D32214744
fbshipit-source-id: e0a65b6a31a88216ae1243549fcbc901ef812374
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62409
This is a reland of #61907, which is needed because removing process_group_agent.h / cpp broke Facebook-specific tests. I will remove the files and update the internal test code in a separate PR.
Test Plan: Imported from OSS
Reviewed By: mrshenli
Differential Revision: D29990001
Pulled By: H-Huang
fbshipit-source-id: 2ee333322247d8b72691152308c3297e8c0c006d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61907
Removes the code for the faulty process group agent, since it was replaced by the faulty tensorpipe agent.
Test Plan: Imported from OSS
Reviewed By: mrshenli
Differential Revision: D29794666
Pulled By: H-Huang
fbshipit-source-id: 0b35191cc07220b6774ecacc8d004f25fd2e87f0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61421
This PR adds the faulty tensorpipe agent implementation and replaces all faulty process group agent tests with it. The faulty tensorpipe agent code is very similar to that of faulty process group agent. It allows the user to fail or delay certain types of rpc messages, which is used in the faulty agent tests. These changes are needed to deprecate the process group rpc backend.
Summary of changes:
- Add faulty tensorpipe agent class
- Update the tensorpipe pipeWrite function so that it can be overridden and can add a delay
- Update test backend registry and faulty agent tests to use the FAULTY_TENSORPIPE_AGENT backend.
This affects all faulty agent tests. Here are a few of them as sample commands:
`pytest test/distributed/rpc/test_faulty_agent.py -vs -k test_verify_backend_options`
`pytest test/distributed/rpc/test_faulty_agent.py -vs -k test_no_faulty_messages`
`pytest test/distributed/rpc/test_faulty_agent.py -vs -k test_builtin_remote_message_dropped_timeout`
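The override pattern described in the summary of changes can be sketched in plain Python. This is not the C++ agent code: `Agent`, `FaultyAgent`, `pipe_write`, and the injectable `sleep` parameter are hypothetical names, and the message type strings are only illustrative.

```python
import time


class Agent:
    """Hypothetical base agent whose write step can be overridden."""

    def __init__(self):
        self.wire = []  # records what was "written" to the pipe

    def send(self, msg_type, payload):
        self.pipe_write(msg_type, payload)

    def pipe_write(self, msg_type, payload):
        self.wire.append((msg_type, payload))


class FaultyAgent(Agent):
    """Delays selected message types before delegating to the real write."""

    def __init__(self, messages_to_delay=None, sleep=time.sleep):
        super().__init__()
        self.messages_to_delay = dict(messages_to_delay or {})
        self.sleep = sleep  # injectable so tests need not really sleep

    def pipe_write(self, msg_type, payload):
        delay = self.messages_to_delay.get(msg_type, 0.0)
        if delay > 0:
            self.sleep(delay)
        super().pipe_write(msg_type, payload)
```

Because only the write step is overridden, the rest of the send path is shared between the regular and faulty agents in this sketch.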
Test Plan: Imported from OSS
Reviewed By: mrshenli
Differential Revision: D29773739
Pulled By: H-Huang
fbshipit-source-id: 6b2bc366735d70b79943d4207f454bc9555bbf5f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47531
This is part of a stack of PRs that fixes mypy typing errors in the torch.distributed.* directory.
Test Plan:
python test_type_hints.py -v TestTypeHints.test_run_mypy
Imported from OSS
Reviewed By: walterddr
Differential Revision: D24952499
fbshipit-source-id: b193171e28c2211a71d28a544fa44770bf938a1e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37485
Adds arbitrary timeout injection to the faulty RPC agent. This makes it easier to test scenarios that depend on how long an RPC runs, such as RPC timeouts and the profiler.
This is done by overriding ProcessGroupAgent's `enqueueSend()` function to inject the timeout. Which messages to delay is determined in the same way as the existing `faulty_messages`: the user specifies a mapping from message type to timeout.
Added unit tests that verify RPC timeouts work with builtin + TorchScript functions, which was not tested before.
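As an illustration of how a per-message delay mapping lets a test trigger a timeout on demand, here is a minimal pure-Python sketch. It does not use the RPC agent; `MESSAGES_TO_DELAY`, `delayed_send`, and `call_with_timeout` are hypothetical, and the message type strings are placeholders.

```python
import concurrent.futures
import time

# Hypothetical mapping of message type -> injected delay in seconds.
MESSAGES_TO_DELAY = {"SCRIPT_CALL": 0.3}


def delayed_send(msg_type, payload):
    # Sleep for the configured delay (0 for unlisted types), then "deliver".
    time.sleep(MESSAGES_TO_DELAY.get(msg_type, 0.0))
    return payload


def call_with_timeout(executor, msg_type, payload, timeout):
    # Submit the send and wait at most `timeout` seconds for the result.
    future = executor.submit(delayed_send, msg_type, payload)
    try:
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        return "timed out"
```

A call whose timeout is shorter than the injected delay times out deterministically, while message types that are not in the mapping complete normally.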
ghstack-source-id: 103341662
Test Plan: Added unit tests in `FaultyRpcAgentTest`.
Differential Revision: D21296537
fbshipit-source-id: 1dbc21aee14e49780272634e9cbb2b5a448f2896
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37450
It was not possible to customize the retryable message types by
passing faulty_messages into dist_utils, because the `FaultyRpcAgentTestFixture`
overrode the `rpc_backend_options` function and always provided the default list of
retryable message types. This needed to be fixed as part of adding timeout injection
support, as mentioned in https://github.com/pytorch/pytorch/issues/36272
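The intended behavior can be sketched as follows. This is a simplified stand-in, not the real fixture: `FixtureSketch`, the dict returned by `rpc_backend_options`, and the entries in `DEFAULT_RETRYABLE` are assumptions made for illustration.

```python
# Illustrative default list; the real defaults live in the test fixture.
DEFAULT_RETRYABLE = [
    "RREF_FORK_REQUEST",
    "RREF_CHILD_ACCEPT",
    "RREF_USER_DELETE",
    "CLEANUP_AUTOGRAD_CONTEXT_REQ",
]


class FixtureSketch:
    """Hypothetical fixture that honors a caller-supplied message list."""

    def __init__(self, faulty_messages=None):
        # Fall back to the defaults only when the caller passed nothing.
        # An explicit empty list means "fail no messages".
        if faulty_messages is None:
            faulty_messages = DEFAULT_RETRYABLE
        self.faulty_messages = list(faulty_messages)

    @property
    def rpc_backend_options(self):
        # The real fixture builds backend options; a dict stands in here.
        return {"messages_to_fail": self.faulty_messages}
```

Checking for `None` rather than truthiness keeps an explicitly empty list distinct from "use the defaults".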
ghstack-source-id: 103287164
Test Plan: `buck test mode/dev-nosan //caffe2/test/distributed/rpc/faulty_agent:rpc_spawn_faulty -- --print-passing-details`
Differential Revision: D21270127
fbshipit-source-id: e5dd847dcf92f14b490f84e9ee79291698b85ffa
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37027
After the change below, the RPC timeout passed into rpc_sync and
rpc_async is a float, so we should make these APIs consistent.
ghstack-source-id: 102971906
Test Plan:
Existing unit tests. Also added a unit test for a specific timeout set
in ProcessGroupRpcBackendOptions and for the dispatch RPC backend options handling.
Differential Revision: D21125171
fbshipit-source-id: a5894b8ce31d2926f2c3d323d1cda4d54b30cef1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33636
Fixes https://github.com/pytorch/pytorch/issues/32119, https://github.com/pytorch/pytorch/issues/26116,
https://github.com/pytorch/pytorch/issues/33072
Makes RRef control messages idempotent and enables sending with retries for distributed autograd cleanup and RRef internal messages.
In order to effectively test whether RRef and distributed autograd cleanup work with network failures/retries, I implemented an RPC Agent with a faulty send function, and enabled running tests using this as a third backend (in addition to Thrift and PGA). The tests using this backend are in a separate class (the test cases are similar but with minor changes to ensure short-running tests wait for retried RPCs to finish).
This faulty RPC Agent is quite configurable. The tests can configure which message types to fail and how many messages to fail. Going forward, other RPC functionality can be overridden with faulty methods to test with failures injected.
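A simplified model of the idea, in plain Python rather than the agent's C++: a channel drops the first few sends of selected message types, the sender retries, and the receiver's handler is idempotent so a repeated delivery is harmless. All names here (`FaultyChannel`, `send_with_retries`, `IdempotentDeleter`) are hypothetical, and the per-type failure counter is a simplifying assumption.

```python
class FaultyChannel:
    """Fails the first `num_fail_sends` sends of each configured type."""

    def __init__(self, messages_to_fail, num_fail_sends):
        self.messages_to_fail = set(messages_to_fail)
        self.num_fail_sends = num_fail_sends
        self.failed = {}  # msg_type -> number of sends failed so far

    def send(self, msg_type, msg_id, handler):
        count = self.failed.get(msg_type, 0)
        if msg_type in self.messages_to_fail and count < self.num_fail_sends:
            self.failed[msg_type] = count + 1
            raise ConnectionError(f"injected failure for {msg_type}")
        return handler(msg_id)


def send_with_retries(channel, msg_type, msg_id, handler, max_retries=5):
    """Returns the attempt number on which the send succeeded."""
    for attempt in range(1, max_retries + 1):
        try:
            channel.send(msg_type, msg_id, handler)
            return attempt
        except ConnectionError:
            continue
    raise RuntimeError(f"{msg_type} failed after {max_retries} attempts")


class IdempotentDeleter:
    """Applies each message id at most once, so duplicates are no-ops."""

    def __init__(self):
        self.seen = set()
        self.applied = 0

    def handle(self, msg_id):
        if msg_id not in self.seen:
            self.seen.add(msg_id)
            self.applied += 1
```

Idempotent handling is what makes retries safe in this sketch: if a retry results in the same message being delivered twice, the second delivery changes nothing.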
Differential Revision: D20019236
fbshipit-source-id: 540a977e96b2e29aa0393ff12621fa293fe92b48