pytorch/torch/distributed
Andrew Gu c30659ffcc [ZeRO] (Reland) Add ctor support for multiple param groups (#72932)
Summary:
Reland of https://github.com/pytorch/pytorch/pull/72578.

**Overview**
Windows CI was failing due to the multi-rank single-GPU case (see [here](https://github.com/pytorch/pytorch/runs/5204906995?check_suite_focus=true)).

To address this, I
- added `common_distributed.skip_if_no_gpu` for `test_multiple_param_groups()` to ensure that each rank can safely call `to(self.device)` -- this targets the expected single-process single-device (SPSD) use case where each rank has its own GPU;
- moved `test_constructor()` back to `TestZeroRedundancyOptimizerSingleRank` to check that constructing with multiple parameter groups works even on a single rank (a minimal usage sketch follows this list).
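
For context, a minimal sketch of the constructor usage this reland enables, assuming a build that includes this change; the single-rank gloo setup, model, and hyperparameters are only illustrative:

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.optim import ZeroRedundancyOptimizer

# Single-rank gloo setup on CPU so the sketch runs standalone; in the SPSD
# use case each rank would instead move its module to its own GPU.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = nn.Sequential(nn.Linear(10, 10), nn.ReLU(), nn.Linear(10, 5))

# Multiple parameter groups passed directly to the constructor (the feature
# this PR relands); the per-group hyperparameters here are arbitrary.
param_groups = [
    {"params": model[0].parameters(), "weight_decay": 0.0},
    {"params": model[2].parameters(), "weight_decay": 0.01},
]
optimizer = ZeroRedundancyOptimizer(
    param_groups,
    optimizer_class=torch.optim.SGD,
    lr=0.01,
)

model(torch.randn(4, 10)).sum().backward()
optimizer.step()
dist.destroy_process_group()
```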

**Test Plan**
- I ran both tests on CPU and with 1, 2, 4, and 8 GPUs.
- I added the `ciflow/win` label to run the previously failing Windows CI test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/72932

Reviewed By: rohan-varma

Differential Revision: D34281482

Pulled By: awgu

fbshipit-source-id: c4fe604ddd9d2c123c3071249741e6b8a6454b6e
(cherry picked from commit 6bea9bcc63)
2022-02-22 16:29:55 +00:00
_shard Revert D34284271: [TLC][checkpoint] Add unit test for StatefulComponentCheckpointAgent 2022-02-19 21:28:55 +00:00
_sharded_tensor [reland] Create torch.distributed._shard package. (#72141) 2022-02-02 06:58:20 +00:00
_sharding_spec [reland] Create torch.distributed._shard package. (#72141) 2022-02-02 06:58:20 +00:00
algorithms [Join][BE] Fix typo; remove obsolete method (#72886) 2022-02-16 15:03:09 +00:00
autograd
benchmarks Add lint for unqualified type: ignore (#56290) 2021-04-21 08:07:23 -07:00
elastic [codemod][type-comments] Convert type comments in api.py (#73084) 2022-02-19 00:31:45 +00:00
fsdp Revert D33919683: [FSDP] Implement local_state_dict and load_local_state_dict 2022-02-20 02:32:48 +00:00
launcher (torch/elastic) fix scale down bug caused by calling rdzv_handler.shutdown() on premature agent failures (#67749) 2021-11-05 12:18:46 -07:00
nn Revert D33716716: [pytorch][PR] Added remove_duplicate parameter to nn.Module 2022-02-03 09:04:29 +00:00
optim [ZeRO] (Reland) Add ctor support for multiple param groups (#72932) 2022-02-22 16:29:55 +00:00
pipeline Remove dtype from torch.Storage and use only torch.ByteStorage (#62030) 2021-10-05 13:50:34 -07:00
rpc [distributed] Make rref_proxy._invoke_rpc truly async when needed. (#70206) 2022-01-19 23:37:15 +00:00
__init__.py Add pybind trampoline for ProcessGroup and Work (#66338) 2021-10-11 06:41:06 -07:00
argparse_util.py [19/n][torch/elastic][upstream] Replace pytorch.distributed.launch with torchelastic launcher (#56214) 2021-04-16 13:38:23 -07:00
constants.py make ProcessGroupDefaultTimeout the same as python (#56549) 2021-04-21 17:56:05 -07:00
CONTRIBUTING.md Update distributed contributing guide to show how to run one test in test_distributed_spawn (#67801) 2021-11-04 08:54:31 -07:00
distributed_c10d.py Stop writing logs to root logger (#72649) 2022-02-11 21:30:53 +00:00
launch.py Introduce the torchrun entrypoint (#64049) 2021-08-26 20:17:48 -07:00
remote_device.py Basic implementation of ShardedLinear using ShardedTensor. (#64128) 2021-09-20 18:31:11 -07:00
rendezvous.py Update _create_c10d_store to check port value (#71863) 2022-01-26 22:29:33 +00:00
run.py (torch/elastic) add fqdn hostname to error printout (#66182) 2021-10-07 01:40:02 -07:00