9 Commits

Author SHA1 Message Date
Will Constable
fd60117051 [C10D] add _summarize_ranks util (#160284)
Prints ranges of ranks succinctly.

e.g.

For a strided list of ranks, summarizes down to start:stop:step
```
0:4096:512
```

Omits step if it's 1
```
0:8
```

Note: endpoints are exclusive. This may not be intuitive to everyone,
but in the first example above the last rank is 3584, and in the second
it is 7.

Currently, does not support combinations of striding _and_ range (e.g.
cannot generate a representation like "0:2, 4:6, ..., 12:14"). Is this
needed / useful? If so it could be added.
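
A minimal sketch of the summarization described above, in plain Python. The function name `summarize_ranks_sketch`, the single-rank output, and the comma-separated fallback for non-uniform input are assumptions made for illustration, not the behavior of the actual `_summarize_ranks` util.

```python
def summarize_ranks_sketch(ranks):
    """Hypothetical helper: render uniformly strided ranks as start:stop[:step]."""
    ranks = sorted(set(ranks))
    if not ranks:
        return ""
    if len(ranks) == 1:
        # Assumption: a single rank is printed as-is.
        return str(ranks[0])
    step = ranks[1] - ranks[0]
    if any(b - a != step for a, b in zip(ranks, ranks[1:])):
        # Assumption: mixed stride/range patterns are not summarized,
        # so fall back to an explicit comma-separated list.
        return ",".join(str(r) for r in ranks)
    stop = ranks[-1] + step  # exclusive endpoint, as in Python slices
    return f"{ranks[0]}:{stop}" if step == 1 else f"{ranks[0]}:{stop}:{step}"


print(summarize_ranks_sketch(range(0, 4096, 512)))  # 0:4096:512
print(summarize_ranks_sketch(range(8)))             # 0:8
```

Because the endpoint is exclusive, the output can be read the same way as a Python `range(start, stop, step)`.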

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160284
Approved by: https://github.com/XilunWu
2025-08-28 00:17:53 +00:00
Will Constable
6da11d9aaf [C10D] Add check_rng_sync util (#160283)
Debugs RNG desync by checking the current state on each rank in the group and summarizing the differences if any are detected.

Notes:
- Used allgather instead of gather since it's simpler to do this SPMD than to add conditional behavior, though I could be convinced we only want to log on rank0.

Usage:
`check_rng_sync(generator, group)`

Prints something like this:

(cuda):
```
[rank0]:E0808 ] Generator desync detected:
[rank0]:E0808 ] Ranks    (Seed, Offset) values
[rank0]:E0808 ] -------  -----------------------
[rank0]:E0808 ] 0        (456, 0)
[rank0]:E0808 ] 1        (123, 4)
[rank0]:E0808 ] 2-3      (123, 0)
```

(cpu):
```
[rank2]:E0810 ] Generator desync detected:
[rank2]:E0810 ] Ranks      Generator State Hash values
[rank2]:E0810 ] -------  -----------------------------
[rank2]:E0810 ] 0                  7633364531954955665
[rank2]:E0810 ] 1                  8807615394212033278
[rank2]:E0810 ] 2-3               -6150027303226666531
```
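
The grouping step behind a report like the ones above can be sketched in plain Python. This assumes the per-rank states have already been collected into a list indexed by rank (in a real job, presumably via an all-gather over the group). The names `group_ranks_by_state` and `format_desync_report` are hypothetical, and the column layout only approximates the log output shown.

```python
from collections import defaultdict


def group_ranks_by_state(states):
    """states[i] is the (hashable) RNG state reported by rank i."""
    groups = defaultdict(list)
    for rank, state in enumerate(states):
        groups[state].append(rank)
    return dict(groups)


def format_desync_report(states, header="(Seed, Offset) values"):
    """Return a table of rank groups per state, or "" if all ranks agree."""
    groups = group_ranks_by_state(states)
    if len(groups) <= 1:
        return ""  # every rank reported the same state
    lines = ["Generator desync detected:", f"{'Ranks':<8} {header}"]
    for state, ranks in groups.items():
        contiguous = ranks == list(range(ranks[0], ranks[-1] + 1))
        if len(ranks) > 1 and contiguous:
            label = f"{ranks[0]}-{ranks[-1]}"
        else:
            label = ",".join(str(r) for r in ranks)
        lines.append(f"{label:<8} {state}")
    return "\n".join(lines)


# Same values as the cuda example: ranks 2 and 3 agree, ranks 0 and 1 differ.
print(format_desync_report([(456, 0), (123, 4), (123, 0), (123, 0)]))
```

Each rank would compute the same report from the same gathered list, which matches the SPMD approach mentioned in the notes.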

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160283
Approved by: https://github.com/ezyang
2025-08-13 23:05:29 +00:00
Xuehai Pan
6d5c789ad5 [BE][PYFMT] migrate PYFMT for test/[a-h]*/ to ruff format (#144555)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144555
Approved by: https://github.com/ezyang
ghstack dependencies: #144551, #144554
2025-06-24 04:53:54 +00:00
Anthony Barbier
3f34d26040 Add __main__ guards to distributed tests (#154628)
This is the first PR of a series in an attempt to re-submit #134592 as smaller PRs.

In distributed tests:

- Ensure all files which should call run_tests do call run_tests.
- Raise a RuntimeError on tests which have been disabled (not run).
- Remove any remaining uses of "unittest.main()".

Cc @wconstab @clee2000

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154628
Approved by: https://github.com/Skylion007
2025-06-04 14:39:57 +00:00
Tom Ritchford
d25e6e623f Fix unused Python variables in test/[a-d]* (#134665)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134665
Approved by: https://github.com/albanD
2024-12-13 22:13:12 +00:00
Yuanhao Ji
e3effa5855 Enable UFMT on all of test/distributed (#123539)
Partially addresses #123062

Ran lintrunner on:

- `test/distributed`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123539
Approved by: https://github.com/ezyang
2024-04-17 06:46:02 +00:00
PyTorch MergeBot
52be63eb2c Revert "Enable UFMT on all of test/distributed (#123539)"
This reverts commit 89ac37fe91.

Reverted https://github.com/pytorch/pytorch/pull/123539 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/123539#issuecomment-2058329471))
2024-04-16 06:33:21 +00:00
Yuanhao Ji
89ac37fe91 Enable UFMT on all of test/distributed (#123539)
Partially addresses #123062

Ran lintrunner on:

- `test/distributed`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123539
Approved by: https://github.com/ezyang
2024-04-16 03:23:56 +00:00
Amr Elshennawy
968b7b5e0f Initial commit of collective_utils (#101037)
Summary:
Details in T133020932
First commit of collective utils library. Ported over from model store, removed scuba logging, error_trait and all dependencies on modelstore.

Test Plan: In the following diffs.

Differential Revision: D45545970

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101037
Approved by: https://github.com/H-Huang
2023-06-27 02:15:16 +00:00