Summary:
When a system has both an Ampere and a non-Ampere card, many tests fail because results differ between the cards.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52941
Reviewed By: albanD
Differential Revision: D26994287
Pulled By: mrshenli
fbshipit-source-id: 287537495fc13361104a4460f5bcd79a208b5d8d
Summary:
Currently there is some code that intends to skip distributed tests if
the distributed module is not built. However, this check is missing from
some test files, and in other test files it runs only after the
distributed module is imported, which leads to failures. This
causes a lot of headaches when testing minimal builds locally.
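The ordering problem described above can be illustrated with a minimal, generic sketch. It is not the code from this PR: the module name `some_optional_backend_that_is_missing` and the helper `module_available` are hypothetical stand-ins. In PyTorch itself the availability check would be `torch.distributed.is_available()`. The point is only that the check runs before any import that can fail, and that the test class is skipped rather than erroring.

```python
import importlib.util
import unittest


def module_available(name: str) -> bool:
    """Return True if `name` looks importable, without importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        return False


# Hypothetical optional module standing in for the distributed module.
OPTIONAL_MODULE = "some_optional_backend_that_is_missing"
HAS_OPTIONAL = module_available(OPTIONAL_MODULE)

# Import only after the availability check has passed.
optional = importlib.import_module(OPTIONAL_MODULE) if HAS_OPTIONAL else None


@unittest.skipUnless(HAS_OPTIONAL, f"{OPTIONAL_MODULE} is not available")
class OptionalBackendTest(unittest.TestCase):
    def test_uses_backend(self):
        self.assertIsNotNone(optional)


def run_and_collect() -> unittest.TestResult:
    """Run the test case and return the result object for inspection."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(OptionalBackendTest)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

With the check placed first, a build without the optional module reports the test as skipped instead of failing at import time.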
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52945
Reviewed By: anjali411
Differential Revision: D26848241
Pulled By: ezyang
fbshipit-source-id: 983a848844add40869a86f3c9413503a3659b115
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41769
Currently the tests in `test_distributed` only work with `fork` mode multiprocessing. This PR introduces support for `spawn` mode multiprocessing as well (while keeping the `fork` mode intact).
Motivations for the change:
1) Spawn multiprocessing is the default on macOS, so it better emulates how macOS users would use distributed
2) Spawn is also a supported start method on Linux (Python 3.8 changed the default to spawn only on macOS; Linux still defaults to fork), so we should have test coverage for this
3) PT multiprocessing suggests using spawn/forkserver over fork for sharing CUDA tensors: https://pytorch.org/docs/stable/multiprocessing.html
4) Spawn is better supported with respect to certain sanitizers such as TSAN, so adding this sanitizer coverage may help us uncover issues.
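For illustration, a start method can be selected explicitly with the standard library rather than relying on the platform default. This is a small sketch using only `multiprocessing`, not the test harness code from this PR. The helper name `available_contexts` is hypothetical.

```python
import multiprocessing as mp


def available_contexts(preferred=("fork", "spawn")):
    """Map each preferred start method supported on this platform to its context."""
    supported = mp.get_all_start_methods()
    return {name: mp.get_context(name) for name in preferred if name in supported}
```

A context obtained this way exposes `Process`, `Queue`, and so on, so the same test body can be run once per start method without changing the global default.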
How it is done:
1) Move `test_distributed` tests in `_DistTestBase` class to a shared file `distributed_test` (similar to how the RPC tests are structured)
2) For `Barrier`, refactor the setup of temp directories, as the current version did not work with spawn: each process would get a different randomly generated directory and thus would write to a different barrier.
3) Add all the relevant builds to run internally and in OSS.
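The temp directory issue in step 2 can be sketched as follows. This is an illustration under stated assumptions, not the refactored `Barrier` code: the environment variable `SKETCH_BARRIER_DIR` and the function `run_workers` are hypothetical, and fresh interpreters started via `subprocess` stand in for spawned workers (like spawn, they share no memory with the parent). The parent creates the directory once and passes its path down, so every worker writes to the same place.

```python
import os
import subprocess
import sys
import tempfile

# Hypothetical variable name, chosen for this sketch only.
ENV_VAR = "SKETCH_BARRIER_DIR"

# Source run by each fresh interpreter. It reads the shared directory from
# the environment instead of generating its own random one.
WORKER_SRC = """
import os, sys
barrier_dir = os.environ["SKETCH_BARRIER_DIR"]
rank = sys.argv[1]
with open(os.path.join(barrier_dir, rank), "w") as f:
    f.write("arrived")
"""


def run_workers(world_size: int) -> list:
    """Create one barrier directory in the parent and share it with all workers."""
    with tempfile.TemporaryDirectory() as barrier_dir:
        env = dict(os.environ, **{ENV_VAR: barrier_dir})
        for rank in range(world_size):
            subprocess.run(
                [sys.executable, "-c", WORKER_SRC, str(rank)],
                env=env,
                check=True,
            )
        # Every rank should have left a marker file in the same directory.
        return sorted(os.listdir(barrier_dir))
```

If each worker instead called `tempfile.mkdtemp()` itself, the marker files would land in different directories and no worker would ever see the others arrive.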
Running test_distributed with spawn mode in OSS can be done with:
`python test/run_test.py -i distributed/test_distributed_spawn -v`
Reviewed By: izdeby
Differential Revision: D22408023
fbshipit-source-id: e206be16961fd80438f995e221f18139d7e6d2a9