Commit Graph

5 Commits

Ke Wen
9beddde1d7 Enable NCCL_DESYNC_DEBUG when TORCH_DISTRIBUTED_DEBUG=DETAIL (#83881)
Automatically enable `NCCL_DESYNC_DEBUG` when `TORCH_DISTRIBUTED_DEBUG` is set to `DETAIL`, saving the user from having to set two environment variables.
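For illustration, a minimal Python sketch of the described behavior (the actual change lives in the C++ process-group code; the helper name below is hypothetical):

```python
import os

# Hypothetical helper mirroring the commit's behavior: desync debugging is
# treated as enabled either when NCCL_DESYNC_DEBUG=1 is set explicitly, or
# implicitly when TORCH_DISTRIBUTED_DEBUG=DETAIL.
def desync_debug_enabled() -> bool:
    if os.getenv("NCCL_DESYNC_DEBUG", "0") == "1":
        return True
    return os.getenv("TORCH_DISTRIBUTED_DEBUG", "").upper() == "DETAIL"
```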

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83881
Approved by: https://github.com/malfet, https://github.com/rohan-varma, https://github.com/H-Huang
2022-08-23 17:57:16 +00:00
Michael Suo
30fb2c4aba [lint] autoformat test/cpp and torch/csrc
Let's have some fun.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/78828

Approved by: https://github.com/ezyang
2022-06-11 21:11:16 +00:00
Ke Wen
a15f78347b Back out "[PyTorch Distributed] Consolidate NCCL_DESYNC_DEBUG and TORCH_DISTRIBUTED_DEBUG=INFO" (#74586)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74586

Original commit changeset: 9cc71a8ab0d4

Original Phabricator Diff: D34232827 (2aade49a14)

Test Plan: Unit tests

Reviewed By: junesg

Differential Revision: D35055166

fbshipit-source-id: 6c2bcc0f1579f5f646afc9291722486b7a981269
(cherry picked from commit 48c2e544caec6fccb7f603e2126fc68abc5e5b5f)
2022-03-23 07:54:45 +00:00
Ke Wen
2aade49a14 [PyTorch Distributed] Consolidate NCCL_DESYNC_DEBUG and TORCH_DISTRIBUTED_DEBUG=INFO (#73257)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73257

Infer desync debug from whether `TORCH_DISTRIBUTED_DEBUG` is set to `INFO` or higher.
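A sketch of that inference, assuming a simple ordered mapping of debug levels (the mapping and function name here are illustrative, not the actual implementation):

```python
import os

# Illustrative ordering of TORCH_DISTRIBUTED_DEBUG levels; desync debug is
# inferred whenever the configured level is INFO or higher.
_DEBUG_LEVELS = {"OFF": 0, "INFO": 1, "DETAIL": 2}

def infer_desync_debug() -> bool:
    level = os.getenv("TORCH_DISTRIBUTED_DEBUG", "OFF").upper()
    return _DEBUG_LEVELS.get(level, 0) >= _DEBUG_LEVELS["INFO"]
```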

Test Plan:
1. When `TORCH_DISTRIBUTED_DEBUG=INFO`:
   1.1 Catch mismatched collectives (e.g. broadcast vs reduce) - passed
   1.2 Catch mismatched collective sizes - passed
2. QPS test: no performance regression - passed

Reviewed By: rohan-varma

Differential Revision: D34232827

fbshipit-source-id: 9cc71a8ab0d416a2037daca08930e590688e1d38
(cherry picked from commit 0322c80560736e173a5868e7077171a410116888)
2022-02-26 01:08:45 +00:00
Can Balioglu
e1db2f13ce Refactor TORCH_DISTRIBUTED_DEBUG implementation (#73166)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73166

This PR refactors, cleans up, and optimizes the implementation of `TORCH_DISTRIBUTED_DEBUG`. It also introduces three new user APIs: `get_debug_level()`, `set_debug_level()`, and `set_debug_level_from_env()` to retrieve and modify the debug level after a process has started.
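A usage sketch of the three APIs named above (as exposed under `torch.distributed`; exact import paths may vary by release):

```python
import torch.distributed as dist

# Read the level parsed from TORCH_DISTRIBUTED_DEBUG at process startup.
level = dist.get_debug_level()

# Raise verbosity at runtime without restarting the process.
dist.set_debug_level(dist.DebugLevel.DETAIL)

# Re-parse TORCH_DISTRIBUTED_DEBUG from the environment.
dist.set_debug_level_from_env()
```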
ghstack-source-id: 149778566

Test Plan: Run the existing unit tests.

Reviewed By: rohan-varma

Differential Revision: D34371226

fbshipit-source-id: e18443b411adcbaf39b2ec999178c198052fcd5b
(cherry picked from commit 26d6bb1584b83a0490d8b766482656a5887fa21d)
2022-02-24 02:33:05 +00:00