pytorch/test/distributed
Andrew Gu c30659ffcc [ZeRO] (Reland) Add ctor support for multiple param groups (#72932)
Summary:
Reland of https://github.com/pytorch/pytorch/pull/72578.

**Overview**
Windows CI was failing due to the multi-rank single-GPU case (see [here](https://github.com/pytorch/pytorch/runs/5204906995?check_suite_focus=true)).

To address this, I
- added `common_distributed.skip_if_no_gpu` for `test_multiple_param_groups()` to ensure that each rank can safely call `to(self.device)`. This targets the expected single-process single-device (SPSD) use case, where each rank has its own GPU;
- moved `test_constructor()` back to `TestZeroRedundancyOptimizerSingleRank` to check that constructing the optimizer with multiple parameter groups works even on a single rank.
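
The guard described in the first item can be illustrated with a minimal sketch. This is not the actual `common_distributed.skip_if_no_gpu` implementation; the decorator name `skip_if_fewer_gpus_than_ranks` and its `world_size` and `gpu_count` parameters are hypothetical and exist only to show the idea of skipping a test when some rank would not have its own GPU.

```python
import functools
import unittest


def skip_if_fewer_gpus_than_ranks(world_size, gpu_count):
    """Hypothetical guard: skip a test unless every rank can own a GPU.

    `gpu_count` is a zero-argument callable so the check happens when the
    test runs. In a real PyTorch test it could be `torch.cuda.device_count`;
    a plain callable keeps this sketch free of the torch dependency.
    """
    def decorator(test_fn):
        @functools.wraps(test_fn)
        def wrapper(*args, **kwargs):
            found = gpu_count()
            if found < world_size:
                # unittest reports a raised SkipTest as a skip, not a failure.
                raise unittest.SkipTest(
                    f"need {world_size} GPUs (one per rank), found {found}"
                )
            return test_fn(*args, **kwargs)
        return wrapper
    return decorator


class ExampleTest(unittest.TestCase):
    # Two ranks but only one visible GPU: the body must not run.
    @skip_if_fewer_gpus_than_ranks(world_size=2, gpu_count=lambda: 1)
    def test_needs_one_gpu_per_rank(self):
        self.fail("should have been skipped")
```

Under this sketch, the multi-rank single-GPU configuration that failed on Windows CI is reported as a skip rather than a failure, while a run where every rank has a GPU executes the test body normally.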

**Test Plan**
- I ran both tests on CPU and on 1, 2, 4, and 8 GPUs.
- I added the `ciflow/win` label to run the failing Windows CI test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/72932

Reviewed By: rohan-varma

Differential Revision: D34281482

Pulled By: awgu

fbshipit-source-id: c4fe604ddd9d2c123c3071249741e6b8a6454b6e
(cherry picked from commit 6bea9bcc63)
2022-02-22 16:29:55 +00:00

| Name | Last commit | Date |
| --- | --- | --- |
| .. | | |
| _shard | [PT-D][Sharded Tensor] new init api for local tensor and sharding spec auto inference (#72733) | 2022-02-16 17:42:39 +00:00 |
| algorithms | [BE] move init_multigpu_helper to common_distributed (#67050) | 2021-10-22 17:16:11 -07:00 |
| bin | Add test owner to distributed files starting with test_ (#66797) | 2021-10-19 10:55:20 -07:00 |
| elastic | Revise the socket implementation of c10d (#68226) | 2021-11-16 20:49:25 -08:00 |
| fsdp | Revert D33919683: [FSDP] Implement local_state_dict and load_local_state_dict | 2022-02-20 02:32:48 +00:00 |
| launcher | [torchelastic][1/n] Fix caffe2.test.distributed.launcher.api_test flaky tests (#68624) | 2021-11-19 15:23:30 -08:00 |
| nn/jit | Have test classes extend from common_utils.TestCase, not unittest.TestCase (#66900) | 2021-10-19 16:54:05 -07:00 |
| optim | [ZeRO] (Reland) Add ctor support for multiple param groups (#72932) | 2022-02-22 16:29:55 +00:00 |
| pipeline/sync | [skip ci] set more tests with owners for distributed and elastic (#67583) | 2021-11-01 12:26:03 -07:00 |
| rpc | Add test owner to distributed files starting with test_ (#66797) | 2021-10-19 10:55:20 -07:00 |
| argparse_util_test.py | [skip ci] set more tests with owners for distributed and elastic (#67583) | 2021-11-01 12:26:03 -07:00 |
| test_c10d_common.py | [BE] rename some tests in test_c10d_common (#67828) | 2021-11-18 17:14:58 -08:00 |
| test_c10d_gloo.py | no longer coalesce sparse COO tensors before comparison (#69751) | 2022-02-17 02:33:08 +00:00 |
| test_c10d_nccl.py | Implement scatter primitive for ProcessGroupNCCL (#70029) | 2022-01-27 19:37:55 +00:00 |
| test_c10d_spawn_gloo.py | [PyTorch][Distributed] Enable Reduce Scatter and modify all_to_all for sharded linear with more test cases. (#68786) | 2021-12-06 13:38:58 -08:00 |
| test_c10d_spawn_nccl.py | [PyTorch][Distributed] Enable Reduce Scatter and modify all_to_all for sharded linear with more test cases. (#68786) | 2021-12-06 13:38:58 -08:00 |
| test_c10d_spawn.py | [PyTorch][Distributed] Enable Reduce Scatter and modify all_to_all for sharded linear with more test cases. (#68786) | 2021-12-06 13:38:58 -08:00 |
| test_data_parallel.py | no longer coalesce sparse COO tensors before comparison (#69751) | 2022-02-17 02:33:08 +00:00 |
| test_distributed_spawn.py | Add test owner to distributed files starting with test_ (#66797) | 2021-10-19 10:55:20 -07:00 |
| test_launcher.py | Add test owner to distributed files starting with test_ (#66797) | 2021-10-19 10:55:20 -07:00 |
| test_nccl.py | [NCCL] Patch bfloat16 support (#67843) | 2021-11-09 13:46:13 -08:00 |
| test_pg_wrapper.py | Add test owner to distributed files starting with test_ (#66797) | 2021-10-19 10:55:20 -07:00 |
| test_store.py | Add support for deleteKey for FileStore (#69953) | 2022-01-07 06:20:59 -08:00 |