Prints ranges of ranks succinctly.
For example, a strided list of ranks is summarized down to `start:stop:step`:
```
0:4096:512
```
Omits the step when it is 1:
```
0:8
```
Note: endpoints are exclusive. This may not be intuitive to everyone,
but in the first example above the last rank is 3584, and in the second
it is 7.
Currently, combinations of striding _and_ multiple ranges are not
supported (e.g. it cannot generate a representation like
"0:2, 4:6, ..., 12:14"). Is this needed / useful? If so, it could be added.
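The summarization described above can be sketched roughly as follows. This is a minimal illustration, not the actual PyTorch implementation; the helper name `summarize_ranks` is hypothetical, and the fallback for non-strided input is an assumption.

```python
def summarize_ranks(ranks):
    """Summarize a uniformly strided list of ranks as 'start:stop:step',
    omitting the step when it is 1. The stop endpoint is exclusive,
    matching Python slice semantics."""
    if len(ranks) == 1:
        return str(ranks[0])
    step = ranks[1] - ranks[0]
    # Fall back to a plain comma-separated list if the ranks are not
    # uniformly strided (the real utility may behave differently here).
    if any(b - a != step for a, b in zip(ranks, ranks[1:])):
        return ", ".join(map(str, ranks))
    stop = ranks[-1] + step  # exclusive endpoint
    return f"{ranks[0]}:{stop}" if step == 1 else f"{ranks[0]}:{stop}:{step}"

print(summarize_ranks(list(range(0, 4096, 512))))  # 0:4096:512
print(summarize_ranks(list(range(8))))             # 0:8
```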
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160284
Approved by: https://github.com/XilunWu
Debugs RNG desync by checking the current state on each rank in the group and summarizing the differences if any are detected.
Notes:
- Used allgather instead of gather since it's simpler to do this SPMD rather than add conditional behavior, though I could be convinced we only want to log on rank0.
Usage:
`check_rng_sync(generator, group)`
Prints something like this:
(cuda):
```
[rank0]:E0808 ] Generator desync detected:
[rank0]:E0808 ] Ranks (Seed, Offset) values
[rank0]:E0808 ] ------- -----------------------
[rank0]:E0808 ] 0 (456, 0)
[rank0]:E0808 ] 1 (123, 4)
[rank0]:E0808 ] 2-3 (123, 0)
```
(cpu):
```
[rank2]:E0810 ] Generator desync detected:
[rank2]:E0810 ] Ranks Generator State Hash values
[rank2]:E0810 ] ------- -----------------------------
[rank2]:E0810 ] 0 7633364531954955665
[rank2]:E0810 ] 1 8807615394212033278
[rank2]:E0810 ] 2-3 -6150027303226666531
```
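The SPMD flow above can be sketched as below. This is a hedged illustration of the approach, not the actual `check_rng_sync` implementation; the function names here are hypothetical, and the grouping logic is split out so it can run without a process group.

```python
from collections import defaultdict

def group_ranks_by_state(states):
    """Map each distinct generator-state value to the ranks holding it.
    A desync exists when more than one group is present."""
    groups = defaultdict(list)
    for rank, state in enumerate(states):
        groups[state].append(rank)
    return dict(groups)

def check_rng_sync_sketch(generator, group):
    # Each rank computes a hash of its own generator state, then all
    # ranks exchange the hashes via all_gather_object (simpler SPMD flow
    # than gather + rank0-only logic, as noted above).
    import torch.distributed as dist
    local = hash(bytes(generator.get_state().tolist()))
    states = [None] * dist.get_world_size(group)
    dist.all_gather_object(states, local, group=group)
    groups = group_ranks_by_state(states)
    if len(groups) > 1:
        print("Generator desync detected:")
        for state, ranks in groups.items():
            print(f"  ranks {ranks}: state hash {state}")
```

For example, gathered values `[456, 123, 123, 123]` would group rank 0 apart from ranks 1-3, mirroring the tables in the log output above.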
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160283
Approved by: https://github.com/ezyang
This is the first PR of a series in an attempt to re-submit #134592 as smaller PRs.
In distributed tests:
- Ensure all files which should call run_tests do call run_tests.
- Raise a RuntimeError on tests which have been disabled (not run)
- Remove any remaining uses of "unittest.main()"
Cc @wconstab @clee2000
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154628
Approved by: https://github.com/Skylion007
Summary:
Details in T133020932
First commit of the collective utils library. Ported over from modelstore; removed scuba logging, error_trait, and all dependencies on modelstore.
Test Plan: In the following diffs.
Differential Revision: D45545970
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101037
Approved by: https://github.com/H-Huang