15 Commits

Author SHA1 Message Date
Justin Chu
232b96b6e2 [BE] Enable ruff's UP rules and autoformat distributed/ (#105433)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105433
Approved by: https://github.com/albanD
2023-07-19 14:27:11 +00:00
Howard Huang
7b376bf844 Remove ProcessGroup from TensorPipeAgent initialization (#68128)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68128

Reland of D31762735 (0cbfd466d2).

This diff was originally reverted due to failure in test_send_export_type_through_rpc_with_custom_pickler.

I updated rpc_pickler_test.py to prevent a race condition where processes were not registering their pickler before handling their rpc_sync calls.

Test Plan:
rpc_pickler_test file:

buck test mode/dev-nosan -c 'cxx.coverage_only=caffe2' //caffe2/torch/fb/training_toolkit/backend/metrics/tests:rpc_pickler_test //caffe2/torch/fb/training_toolkit/backend/metrics/collectors/fbdata_aggregator/tests:batch_collector_test -- --run-disabled --collect-coverage '--code-coverage-session=test_session' --force-tpx

rpc_pickler stress test:

buck test mode/dev-nosan -c 'cxx.coverage_only=caffe2' //caffe2/torch/fb/training_toolkit/backend/metrics/tests:rpc_pickler_test -- --exact 'caffe2/torch/fb/training_toolkit/backend/metrics/tests:rpc_pickler_test - test_send_export_type_through_rpc_with_custom_pickler (caffe2.torch.fb.training_toolkit.backend.metrics.tests.rpc_pickler_test.CythonTypeRpcSpawnTest)' --run-disabled --collect-coverage '--code-coverage-session=test_session' --force-tpx --jobs 18 --stress-runs 10 --record-results

Reviewed By: mrshenli

Differential Revision: D32316077

fbshipit-source-id: e58de2335fbaa3ab46d46fe222c659197633a5e4
2021-11-11 12:28:55 -08:00
Howard Huang
9fb3ba9d7b Revert D31762735 (#67924)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67924

This diff reverts the changes made in D31762735 (0cbfd466d2)

Test Plan: Wait for CI

Reviewed By: derekmod-fb

Differential Revision: D32214744

fbshipit-source-id: e0a65b6a31a88216ae1243549fcbc901ef812374
2021-11-06 17:34:13 -07:00
Howard Huang
0cbfd466d2 Remove ProcessGroup from TensorPipeAgent initialization (#66708)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66708

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D31762735

Pulled By: H-Huang

fbshipit-source-id: 9f3879fca6b8258f7e6171b14d2c1d6cce21627d
2021-11-01 14:15:27 -07:00
Howard Huang
b3781f0244 Remove faulty process group agent logic (#62409)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62409

This is a reland of #61907, whose removal of process_group_agent.h / cpp broke Facebook-specific tests. I will remove the files and update the internal test code in a separate PR.

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D29990001

Pulled By: H-Huang

fbshipit-source-id: 2ee333322247d8b72691152308c3297e8c0c006d
2021-07-30 08:12:48 -07:00
Howard Huang
a15fff0a7f Revert D29794666: Remove faulty process group code
Test Plan: revert-hammer

Differential Revision:
D29794666 (afe3644321)

Original commit changeset: 0b35191cc072

fbshipit-source-id: 6467bc5100f4115f2fdb385e205740cd68c89743
2021-07-28 10:15:34 -07:00
Howard Huang
afe3644321 Remove faulty process group code (#61907)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61907

Removes the code for the faulty process group agent, since it was replaced by the faulty tensorpipe agent.

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D29794666

Pulled By: H-Huang

fbshipit-source-id: 0b35191cc07220b6774ecacc8d004f25fd2e87f0
2021-07-27 07:37:40 -07:00
Howard Huang
e8d2916b84 Add faulty tensorpipe implementation (#61421)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61421

This PR adds the faulty tensorpipe agent implementation and replaces all faulty process group agent tests with it. The faulty tensorpipe agent code is very similar to that of faulty process group agent. It allows the user to fail or delay certain types of rpc messages, which is used in the faulty agent tests. These changes are needed to deprecate the process group rpc backend.

Summary of changes:
- Add faulty tensorpipe agent class
- Update the tensorpipe pipeWrite function so that it can be overridden and can add a delay
- Update test backend registry and faulty agent tests to use the FAULTY_TENSORPIPE_AGENT backend.

This affects all faulty agent tests. A few sample commands:
`pytest test/distributed/rpc/test_faulty_agent.py -vs -k test_verify_backend_options`
`pytest test/distributed/rpc/test_faulty_agent.py -vs -k test_no_faulty_messages`
`pytest test/distributed/rpc/test_faulty_agent.py -vs -k test_builtin_remote_message_dropped_timeout`

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D29773739

Pulled By: H-Huang

fbshipit-source-id: 6b2bc366735d70b79943d4207f454bc9555bbf5f
2021-07-20 13:54:30 -07:00
Shen Li
c7b1979b6b Use Store collect and verify names in all RPC agents (#53209)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53209

closes #40048

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D26791524

Pulled By: mrshenli

fbshipit-source-id: fc75589f9707014334fcfae6f05af3c04217783b
2021-03-07 16:51:46 -08:00
Xu Zhao
49eb82a7b2 Fix type annotation errors in torch.distributed.* directory (#47531)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47531

This is part of a stack of PRs that fixes mypy typing errors in the torch.distributed.* directory.

Test Plan:
python test_type_hints.py -v TestTypeHints.test_run_mypy

Imported from OSS

Reviewed By: walterddr

Differential Revision: D24952499

fbshipit-source-id: b193171e28c2211a71d28a544fa44770bf938a1e
2020-11-16 23:23:13 -08:00
Xiang Gao
20ac736200 Remove py2 compatible future imports (#44735)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44735

Reviewed By: mruberry

Differential Revision: D23731306

Pulled By: ezyang

fbshipit-source-id: 0ba009a99e475ddbe22981be8ac636f8a1c8b02f
2020-09-16 12:55:57 -07:00
Rohan Varma
d639418307 Add timeout injection to faulty agent for testing (#37485)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37485

Adds arbitrary timeout injection to the faulty RPC agent. This makes it easier to test scenarios that depend on long-running RPCs, such as RPC timeouts and the profiler.

This is done by overriding ProcessGroupAgent's `enqueueSend()` function to inject the timeout. Which messages to time out is determined much like the existing `faulty_messages` option: the user specifies a mapping from message type to timeout.
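
As a rough illustration of the mapping described above, here is a minimal Python sketch. It is not the agent's actual code (which is C++): `DelayingSender`, `messages_to_delay`, and the `deliver` callback are hypothetical names introduced only for this example, and the message type strings are placeholders.

```python
import time


class DelayingSender:
    """Hypothetical stand-in for an agent whose send path can be delayed."""

    def __init__(self, deliver, messages_to_delay, sleep=time.sleep):
        # deliver: callable(msg_type, payload) that performs the real send.
        # messages_to_delay: mapping of message type -> delay in seconds.
        # sleep: injectable so tests do not have to wait in real time.
        self._deliver = deliver
        self._messages_to_delay = dict(messages_to_delay)
        self._sleep = sleep

    def send(self, msg_type, payload):
        # Only message types listed in the mapping are delayed.
        delay = self._messages_to_delay.get(msg_type, 0.0)
        if delay > 0:
            self._sleep(delay)
        return self._deliver(msg_type, payload)
```

A test can pass a recording function as `sleep` to check which sends would have been delayed without slowing the test down.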

Added unit tests that verify RPC timeouts work with builtin + TorchScript functions, which was not tested before.
ghstack-source-id: 103341662

Test Plan: Added unit tests in `FaultyRpcAgentTest`.

Differential Revision: D21296537

fbshipit-source-id: 1dbc21aee14e49780272634e9cbb2b5a448f2896
2020-05-01 23:48:28 -07:00
Rohan Varma
c0a985fcd6 Allow customizing retryable message types in Faulty agent tests (#37450)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37450

It doesn't seem like we could customize the retryable message types by
passing faulty_messages into dist_utils, as the `FaultyRpcAgentTestFixture`
overrode the `rpc_backend_options` function and provided the default list of
retryable message types. Needed to fix this as part of adding timeout injection
support as mentioned in https://github.com/pytorch/pytorch/issues/36272
ghstack-source-id: 103287164

Test Plan: `buck test mode/dev-nosan //caffe2/test/distributed/rpc/faulty_agent:rpc_spawn_faulty -- --print-passing-details`

Differential Revision: D21270127

fbshipit-source-id: e5dd847dcf92f14b490f84e9ee79291698b85ffa
2020-05-01 12:00:36 -07:00
Rohan Varma
4ff4119d45 [rpc] Move _set_rpc_backand and RpcBackendOptions to use float instead of timedelta (#37027)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37027

The RPC timeout passed into rpc_sync and rpc_async after the below
change is now float, so we should make these APIs consistent.
ghstack-source-id: 102971906
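
The unit change above can be pictured with a small sketch. This is not PyTorch's code: `to_seconds` is a hypothetical helper that only shows how a caller holding a `timedelta` could produce the float number of seconds that a float-based timeout expects.

```python
from datetime import timedelta


def to_seconds(timeout):
    """Return a timeout as float seconds (hypothetical helper).

    Accepts either a datetime.timedelta or a plain number of seconds.
    """
    if isinstance(timeout, timedelta):
        return timeout.total_seconds()
    return float(timeout)
```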

Test Plan:
Existing unittests, also added unittest testing specific timeout set
in ProcessGroupRpcBackendOptions and the dispatch rpc backend options handling.

Differential Revision: D21125171

fbshipit-source-id: a5894b8ce31d2926f2c3d323d1cda4d54b30cef1
2020-04-27 19:38:06 -07:00
Omkar Salpekar
4025729e88 [1.5 Release][RPC Reliability] RRef Idempotency and RPC Retry enablement (#33636)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33636

Fixes https://github.com/pytorch/pytorch/issues/32119, https://github.com/pytorch/pytorch/issues/26116,
https://github.com/pytorch/pytorch/issues/33072

Makes RRef control messages idempotent and enables sending with retries for distributed autograd cleanup and RRef internal messages.

To test whether RRef and distributed autograd cleanup work under network failures and retries, I implemented an RPC agent with a faulty send function and enabled running tests with it as a third backend (in addition to Thrift and PGA). The tests using this backend are in a separate class (the test cases are similar, with minor changes to ensure short-running tests wait for retried RPCs to finish).

This faulty RPC agent is configurable. Tests can choose which message types to fail and how many messages to fail. Going forward, other RPC functionality can be overridden with faulty methods to test with failures injected.
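
The interplay of injected failures, retries, and idempotent handling can be sketched in plain Python. This is a simplified model under stated assumptions, not the agent's implementation: `FlakySender`, `Receiver`, `send_with_retries`, and the `"FORK_REQUEST"` message type are all hypothetical. The sketch assumes the failure is a lost reply, so the receiver may see the same message more than once.

```python
class Receiver:
    """Hypothetical receiver whose handler is idempotent."""

    def __init__(self):
        self.forks = set()  # adding the same id twice has no extra effect
        self.deliveries = 0

    def handle(self, msg_type, payload):
        self.deliveries += 1
        if msg_type == "FORK_REQUEST":
            self.forks.add(payload)
        return "ACK"


class FlakySender:
    """Hypothetical sender that loses the reply for the first N faulty sends."""

    def __init__(self, deliver, faulty_types, num_fail_sends):
        self._deliver = deliver
        self._faulty_types = set(faulty_types)
        self._remaining_failures = num_fail_sends

    def send(self, msg_type, payload):
        reply = self._deliver(msg_type, payload)  # message reaches the receiver
        if msg_type in self._faulty_types and self._remaining_failures > 0:
            self._remaining_failures -= 1
            raise ConnectionError(f"injected failure for {msg_type}")
        return reply


def send_with_retries(sender, msg_type, payload, max_retries=5):
    """Retry a send until it succeeds or the retry budget is used up."""
    for attempt in range(max_retries + 1):
        try:
            return sender.send(msg_type, payload)
        except ConnectionError:
            if attempt == max_retries:
                raise
```

With two injected failures, the receiver handles the same message three times, yet its state records the fork only once, which is the property that makes retrying safe.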

Differential Revision: D20019236

fbshipit-source-id: 540a977e96b2e29aa0393ff12621fa293fe92b48
2020-03-20 20:07:47 -07:00