Mirror of https://github.com/zebrajr/pytorch.git (synced 2025-12-06 00:20:18 +01:00)
Summary:
Reland of https://github.com/pytorch/pytorch/pull/72578.
**Overview**
Windows CI was failing due to the multi-rank single-GPU case (see [here](https://github.com/pytorch/pytorch/runs/5204906995?check_suite_focus=true)).
To address this, I:
- added `common_distributed.skip_if_no_gpu` to `test_multiple_param_groups()` so that each rank can safely call `to(self.device)`. This targets the expected SPSD use case, where each rank has its own GPU;
- moved `test_constructor()` back to `TestZeroRedundancyOptimizerSingleRank` to check that constructing the optimizer from multiple parameter groups works even on a single rank.
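
The skip logic described in the first item can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the actual implementation of `common_distributed.skip_if_no_gpu`: the names `skip_unless_gpu_per_rank` and `run_or_skip` are hypothetical, and the GPU count is passed in as a plain integer (in a real test it would presumably come from something like `torch.cuda.device_count()`).

```python
import functools
import unittest


def skip_unless_gpu_per_rank(world_size, gpu_count):
    """Hypothetical stand-in for a GPU-availability skip decorator.

    Skips the wrapped test unless at least `world_size` GPUs are present,
    so that rank r can safely place its tensors on device r.
    """
    def decorator(test_fn):
        @functools.wraps(test_fn)
        def wrapper(*args, **kwargs):
            if gpu_count < world_size:
                # Fewer GPUs than ranks: two ranks would share a device,
                # which is the multi-rank single-GPU case that failed.
                raise unittest.SkipTest(
                    f"need {world_size} GPUs, found {gpu_count}"
                )
            return test_fn(*args, **kwargs)
        return wrapper
    return decorator


def run_or_skip(world_size, gpu_count):
    """Run a dummy test under the decorator and report what happened."""
    @skip_unless_gpu_per_rank(world_size, gpu_count)
    def fake_test():
        return "ran"

    try:
        return fake_test()
    except unittest.SkipTest:
        return "skipped"
```

Under these assumptions, `run_or_skip(2, 1)` models the failing Windows CI configuration (two ranks, one GPU) and is skipped, while `run_or_skip(2, 2)` runs normally.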
**Test Plan**
- I checked both tests on CPU and on 1, 2, 4, and 8 GPUs.
- I added the `ciflow/win` label to run the failing Windows CI test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72932
Reviewed By: rohan-varma
Differential Revision: D34281482
Pulled By: awgu
fbshipit-source-id: c4fe604ddd9d2c123c3071249741e6b8a6454b6e
(cherry picked from commit
| Name |
|---|
| _shard |
| algorithms |
| bin |
| elastic |
| fsdp |
| launcher |
| nn/jit |
| optim |
| pipeline/sync |
| rpc |
| argparse_util_test.py |
| test_c10d_common.py |
| test_c10d_gloo.py |
| test_c10d_nccl.py |
| test_c10d_spawn_gloo.py |
| test_c10d_spawn_nccl.py |
| test_c10d_spawn.py |
| test_data_parallel.py |
| test_distributed_spawn.py |
| test_launcher.py |
| test_nccl.py |
| test_pg_wrapper.py |
| test_store.py |