Ke Wen
9beddde1d7
Enable NCCL_DESYNC_DEBUG when TORCH_DISTRIBUTED_DEBUG=DETAIL (#83881)
...
Automatically enable `NCCL_DESYNC_DEBUG` when `TORCH_DISTRIBUTED_DEBUG` is set to `DETAIL`.
This saves users from having to set two environment variables.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83881
Approved by: https://github.com/malfet, https://github.com/rohan-varma, https://github.com/H-Huang
2022-08-23 17:57:16 +00:00
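The inference this commit adds can be sketched in a few lines. This is a hypothetical pure-Python illustration (the helper name `desync_debug_enabled` and the dict-based env handling are assumptions for the sketch); the real logic lives in PyTorch's C++ ProcessGroupNCCL layer.

```python
import os

# Sketch of the commit's behavior: desync debugging turns on either via
# its own env var or implicitly when the distributed debug level is DETAIL.
def desync_debug_enabled(env=None):
    env = os.environ if env is None else env
    if env.get("NCCL_DESYNC_DEBUG", "0") == "1":
        return True
    # DETAIL now implies desync debugging, so one variable suffices.
    return env.get("TORCH_DISTRIBUTED_DEBUG", "").upper() == "DETAIL"
```

With this scheme, `TORCH_DISTRIBUTED_DEBUG=DETAIL` alone enables the desync debugger, while lower levels leave it controlled by `NCCL_DESYNC_DEBUG`.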
Michael Suo
30fb2c4aba
[lint] autoformat test/cpp and torch/csrc
...
Let's have some fun.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78828
Approved by: https://github.com/ezyang
2022-06-11 21:11:16 +00:00
Ke Wen
a15f78347b
Back out "[PyTorch Distributed] Consolidate NCCL_DESYNC_DEBUG and TORCH_DISTRIBUTED_DEBUG=INFO" (#74586)
...
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74586
Original commit changeset: 9cc71a8ab0d4
Original Phabricator Diff: D34232827 (2aade49a14)
Test Plan: Unit tests
Reviewed By: junesg
Differential Revision: D35055166
fbshipit-source-id: 6c2bcc0f1579f5f646afc9291722486b7a981269
(cherry picked from commit 48c2e544caec6fccb7f603e2126fc68abc5e5b5f)
2022-03-23 07:54:45 +00:00
Ke Wen
2aade49a14
[PyTorch Distributed] Consolidate NCCL_DESYNC_DEBUG and TORCH_DISTRIBUTED_DEBUG=INFO (#73257)
...
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73257
Infer whether to enable desync debugging from TORCH_DISTRIBUTED_DEBUG >= INFO
Test Plan:
1. When TORCH_DISTRIBUTED_DEBUG=INFO:
1.1 Catch mismatched collectives (e.g. broadcast vs reduce) - passed
1.2 Catch mismatched collective sizes - passed
2. QPS test: no performance regression - passed
Reviewed By: rohan-varma
Differential Revision: D34232827
fbshipit-source-id: 9cc71a8ab0d416a2037daca08930e590688e1d38
(cherry picked from commit 0322c80560736e173a5868e7077171a410116888)
2022-02-26 01:08:45 +00:00
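The threshold scheme this (later backed-out) commit describes relies on debug levels being ordered. A hedged sketch of that ordering, assuming a minimal `IntEnum` stand-in (the names mirror `TORCH_DISTRIBUTED_DEBUG` values; this is not the real C++ implementation):

```python
from enum import IntEnum

# Ordered debug levels, so "INFO or above" is a simple comparison.
class DebugLevel(IntEnum):
    OFF = 0
    INFO = 1
    DETAIL = 2

def parse_debug_level(value):
    # Unset or empty values fall back to OFF.
    return DebugLevel[value.upper()] if value else DebugLevel.OFF

def should_enable_desync_debug(env_value):
    # Under this commit's scheme, INFO or above turns on desync debugging.
    return parse_debug_level(env_value) >= DebugLevel.INFO
```

Both `INFO` and `DETAIL` pass the threshold, which is exactly why the later change (#83881) narrowed the implication to `DETAIL` only.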
Can Balioglu
e1db2f13ce
Refactor TORCH_DISTRIBUTED_DEBUG implementation (#73166)
...
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73166
This PR refactors, cleans up, and optimizes the implementation of `TORCH_DISTRIBUTED_DEBUG`. It also introduces three new user APIs: `get_debug_level()`, `set_debug_level()`, and `set_debug_level_from_env()` to retrieve and modify the debug level after a process has started.
ghstack-source-id: 149778566
Test Plan: Run the existing unit tests.
Reviewed By: rohan-varma
Differential Revision: D34371226
fbshipit-source-id: e18443b411adcbaf39b2ec999178c198052fcd5b
(cherry picked from commit 26d6bb1584b83a0490d8b766482656a5887fa21d)
2022-02-24 02:33:05 +00:00
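The semantics of the three new APIs can be sketched in pure Python. This is a hedged, self-contained stand-in using module-level state (the real `get_debug_level` / `set_debug_level` / `set_debug_level_from_env` are implemented in PyTorch's C++ layer and exposed via `torch.distributed`):

```python
import os

# Known TORCH_DISTRIBUTED_DEBUG values, in increasing verbosity.
_LEVELS = {"OFF": 0, "INFO": 1, "DETAIL": 2}
_debug_level = _LEVELS["OFF"]

def get_debug_level():
    """Return the current debug level after process start."""
    return _debug_level

def set_debug_level(level):
    """Modify the debug level at runtime instead of only via the env."""
    global _debug_level
    if level not in _LEVELS.values():
        raise ValueError(f"invalid debug level: {level!r}")
    _debug_level = level

def set_debug_level_from_env():
    """Re-read TORCH_DISTRIBUTED_DEBUG; unknown values fall back to OFF."""
    value = os.environ.get("TORCH_DISTRIBUTED_DEBUG", "OFF").upper()
    set_debug_level(_LEVELS.get(value, _LEVELS["OFF"]))
```

The point of the PR is the last two functions: previously the level was fixed at process start from the environment, whereas these APIs let callers inspect and change it afterwards.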