pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
Justin Chu	232b96b6e2	[BE] Enable ruff's UP rules and autoformat distributed/ (#105433) Pull Request resolved: https://github.com/pytorch/pytorch/pull/105433 Approved by: https://github.com/albanD	2023-07-19 14:27:11 +00:00
Howard Huang	7b376bf844	Remove ProcessGroup from TensorPipeAgent initialization (#68128) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68128 Reland of D31762735 (`0cbfd466d2`). This diff was originally reverted due to failure in test_send_export_type_through_rpc_with_custom_pickler. I updated rpc_pickler_test.py to prevent a race condition where processes were not registering their pickler before handling their rpc_sync calls. Test Plan: rpc_pickler_test file: buck test mode/dev-nosan -c 'cxx.coverage_only=caffe2' //caffe2/torch/fb/training_toolkit/backend/metrics/tests:rpc_pickler_test //caffe2/torch/fb/training_toolkit/backend/metrics/collectors/fbdata_aggregator/tests:batch_collector_test -- --run-disabled --collect-coverage '--code-coverage-session=test_session' --force-tpx rpc_pickler stress test: buck test mode/dev-nosan -c 'cxx.coverage_only=caffe2' //caffe2/torch/fb/training_toolkit/backend/metrics/tests:rpc_pickler_test -- --exact 'caffe2/torch/fb/training_toolkit/backend/metrics/tests:rpc_pickler_test - test_send_export_type_through_rpc_with_custom_pickler (caffe2.torch.fb.training_toolkit.backend.metrics.tests.rpc_pickler_test.CythonTypeRpcSpawnTest)' --run-disabled --collect-coverage '--code-coverage-session=test_session' --force-tpx --jobs 18 --stress-runs 10 --record-results Reviewed By: mrshenli Differential Revision: D32316077 fbshipit-source-id: e58de2335fbaa3ab46d46fe222c659197633a5e4	2021-11-11 12:28:55 -08:00
Howard Huang	9fb3ba9d7b	Revert D31762735 (#67924) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67924 This diff reverts the changes made in D31762735 (`0cbfd466d2`) Test Plan: Wait for CI Reviewed By: derekmod-fb Differential Revision: D32214744 fbshipit-source-id: e0a65b6a31a88216ae1243549fcbc901ef812374	2021-11-06 17:34:13 -07:00
Howard Huang	0cbfd466d2	Remove ProcessGroup from TensorPipeAgent initialization (#66708) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66708 cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang Test Plan: Imported from OSS Reviewed By: anjali411 Differential Revision: D31762735 Pulled By: H-Huang fbshipit-source-id: 9f3879fca6b8258f7e6171b14d2c1d6cce21627d	2021-11-01 14:15:27 -07:00
Howard Huang	b3781f0244	Remove faulty process group agent logic (#62409) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62409 This is a reland of #61907, because removing process_group_agent.h / cpp broke Facebook-specific tests. I will remove the files and update the internal test code in a separate PR. Test Plan: Imported from OSS Reviewed By: mrshenli Differential Revision: D29990001 Pulled By: H-Huang fbshipit-source-id: 2ee333322247d8b72691152308c3297e8c0c006d	2021-07-30 08:12:48 -07:00
Howard Huang	a15fff0a7f	Revert D29794666: Remove faulty process group code Test Plan: revert-hammer Differential Revision: D29794666 (`afe3644321`) Original commit changeset: 0b35191cc072 fbshipit-source-id: 6467bc5100f4115f2fdb385e205740cd68c89743	2021-07-28 10:15:34 -07:00
Howard Huang	afe3644321	Remove faulty process group code (#61907) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61907 Removing the code for the faulty process group agent, since it was replaced by the faulty tensorpipe agent. Test Plan: Imported from OSS Reviewed By: mrshenli Differential Revision: D29794666 Pulled By: H-Huang fbshipit-source-id: 0b35191cc07220b6774ecacc8d004f25fd2e87f0	2021-07-27 07:37:40 -07:00
Howard Huang	e8d2916b84	Add faulty tensorpipe implementation (#61421) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61421 This PR adds the faulty tensorpipe agent implementation and replaces all faulty process group agent tests with it. The faulty tensorpipe agent code is very similar to that of the faulty process group agent. It allows the user to fail or delay certain types of rpc messages, which is used in the faulty agent tests. These changes are needed to deprecate the process group rpc backend. Summary of changes: - Add faulty tensorpipe agent class - Update tensorpipe pipeWrite function to allow it to be overridden and to add a delay - Update test backend registry and faulty agent tests to use the FAULTY_TENSORPIPE_AGENT backend. This affects all faulty agent tests; here are a few of them as sample commands: `pytest test/distributed/rpc/test_faulty_agent.py -vs -k test_verify_backend_options` `pytest test/distributed/rpc/test_faulty_agent.py -vs -k test_no_faulty_messages` `pytest test/distributed/rpc/test_faulty_agent.py -vs -k test_builtin_remote_message_dropped_timeout` Test Plan: Imported from OSS Reviewed By: mrshenli Differential Revision: D29773739 Pulled By: H-Huang fbshipit-source-id: 6b2bc366735d70b79943d4207f454bc9555bbf5f	2021-07-20 13:54:30 -07:00
Shen Li	c7b1979b6b	Use Store collect and verify names in all RPC agents (#53209) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53209 closes #40048 Test Plan: Imported from OSS Reviewed By: H-Huang Differential Revision: D26791524 Pulled By: mrshenli fbshipit-source-id: fc75589f9707014334fcfae6f05af3c04217783b	2021-03-07 16:51:46 -08:00
Xu Zhao	49eb82a7b2	Fix type annotation errors in torch.distributed.* directory (#47531) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47531 This is part of a stack of PRs that fixes mypy typing errors in the torch.distributed.* directory. Test Plan: python test_type_hints.py -v TestTypeHints.test_run_mypy Imported from OSS Reviewed By: walterddr Differential Revision: D24952499 fbshipit-source-id: b193171e28c2211a71d28a544fa44770bf938a1e	2020-11-16 23:23:13 -08:00
Xiang Gao	20ac736200	Remove py2 compatible future imports (#44735) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44735 Reviewed By: mruberry Differential Revision: D23731306 Pulled By: ezyang fbshipit-source-id: 0ba009a99e475ddbe22981be8ac636f8a1c8b02f	2020-09-16 12:55:57 -07:00
Rohan Varma	d639418307	Add timeout injection to faulty agent for testing (#37485) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37485 Adds arbitrary timeout injection to the faulty RPC agent. This is to better test scenarios that involve long-running RPCs, such as properly testing RPC timeouts and the profiler in all scenarios. This is done by overriding ProcessGroupAgent's `enqueueSend()` function to inject the timeout. Determining which messages to time out is done similarly to the existing `faulty_messages`, by having the user specify a mapping of message to timeout. Added unit tests that verify RPC timeouts work with builtin + TorchScript functions, which was not tested before. ghstack-source-id: 103341662 Test Plan: Added unit tests in `FaultyRpcAgentTest`. Differential Revision: D21296537 fbshipit-source-id: 1dbc21aee14e49780272634e9cbb2b5a448f2896	2020-05-01 23:48:28 -07:00
Rohan Varma	c0a985fcd6	Allow customizing retryable message types in Faulty agent tests (#37450) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37450 It doesn't seem like we could customize the retryable message types by passing faulty_messages into dist_utils, as the `FaultyRpcAgentTestFixture` overrode the `rpc_backend_options` function and provided the default list of retryable message types. Needed to fix this as part of adding timeout injection support as mentioned in https://github.com/pytorch/pytorch/issues/36272 ghstack-source-id: 103287164 Test Plan: `buck test mode/dev-nosan //caffe2/test/distributed/rpc/faulty_agent:rpc_spawn_faulty -- --print-passing-details` Differential Revision: D21270127 fbshipit-source-id: e5dd847dcf92f14b490f84e9ee79291698b85ffa	2020-05-01 12:00:36 -07:00
Rohan Varma	4ff4119d45	[rpc] Move _set_rpc_backand and RpcBackendOptions to use float instead of timedelta (#37027) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37027 The RPC timeout passed into rpc_sync and rpc_async after the below change is now float, so we should make these APIs consistent. ghstack-source-id: 102971906 Test Plan: Existing unittests, also added unittest testing specific timeout set in ProcessGroupRpcBackendOptions and the dispatch rpc backend options handling. Differential Revision: D21125171 fbshipit-source-id: a5894b8ce31d2926f2c3d323d1cda4d54b30cef1	2020-04-27 19:38:06 -07:00
Omkar Salpekar	4025729e88	[1.5 Release][RPC Reliability] RRef Idempotency and RPC Retry enablement (#33636) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33636 Fixes https://github.com/pytorch/pytorch/issues/32119, https://github.com/pytorch/pytorch/issues/26116, https://github.com/pytorch/pytorch/issues/33072 Makes RRef control messages idempotent and enables sending with retries for distributed autograd cleanup and RRef internal messages. In order to effectively test whether RRef and distributed autograd cleanup work with network failures/retries, I implemented an RPC Agent with a faulty send function, and enabled running tests using this as a third backend (in addition to Thrift and PGA). The tests using this backend are in a separate class (the test cases are similar but with minor changes to ensure short-running tests wait for retried RPCs to finish). This faulty RPC Agent is pretty configurable. The tests can configure which message types to fail and how many messages to fail, but going forward, other RPC functionality can be overridden with faulty methods to test with failures injected. Differential Revision: D20019236 fbshipit-source-id: 540a977e96b2e29aa0393ff12621fa293fe92b48	2020-03-20 20:07:47 -07:00

15 Commits