pytorch/torch/distributed
Andrew Gu 9012e8d65a [ZeRO][BE] Clean up ZeRO tests (#73842)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73842

**Overview**
This cleans up the `ZeroRedundancyOptimizer` tests. I apologize for the heavy formatting changes mixed in with the substantive ones; it was convenient to unify the formatting while doing a thorough pass through the full test file.

The main non-formatting changes include:
- Using `parametrize` instead of manually writing `for` loops over possible argument values (see the `parametrize` sketch after this list)
- Removing the `DEVICE` global variable, which was used only for the `TestZeroRedundancyOptimizerSingleRank` tests, in favor of consistent usage of `self.device` in both `TestZeroRedundancyOptimizerSingleRank` and `TestZeroRedundancyOptimizerDistributed`
- Moving `assert ... == ...` to `self.assertEqual(..., ...)` when the assert is part of the test's correctness
- Removing the `if self.rank >= self.world_size or (torch.cuda.is_available() and torch.cuda.device_count() < 2):` conditional guards in favor of `common_distributed.skip_if_no_gpu` for `TestZeroRedundancyOptimizerDistributed`
    - For `TestZeroRedundancyOptimizerDistributed`, `self.device` is `torch.device(self.rank)` if CUDA is available, while `self.world_size` is at least 2, even if `torch.cuda.device_count() == 1`.
    - The problematic case is exactly when `torch.cuda.device_count() == 1` but `self.world_size == 2` since then calling `self.device` on rank 1 will error. The existing conditional guard prevented this case for some tests, but it was not used consistently (e.g. `test_multiple_groups()`), which is most likely the reason for the hangs and resulting test flakiness. (From my experience landing the recent ZeRO constructor changes, the Windows environment uses a world size of 2 but only has 1 device available.)
    - A more robust solution is to always apply the `skip_if_no_gpu` decorator to any test that uses `self.device` when CUDA is available (see the `skip_if_no_gpu` sketch after this list). This is in line with the recommended single-process single-device (SPSD) usage of ZeRO.
- Renaming `test_multiple_groups()` to `test_nondefault_process_group()`
    - The existing `test_multiple_groups()` was slightly misnamed. Also, it is only nontrivial for a world size of at least 4 since it tests a process group containing only the even ranks. It was marked as flaky on Windows, and I believe this is because of the mismatch between the world size and `torch.cuda.device_count()`. Now, the test only uses GPUs if there are enough available and falls back to CPU otherwise, which is safe since the test uses the Gloo backend.
    - There was also a duplicated section, which I was unsure how to non-naively de-duplicate. The top half and bottom half are identical even though they claim to target the cases of fitting into the broadcast bucket and not fitting into it, respectively:
1d497114e7/test/distributed/optim/test_zero_redundancy_optimizer.py (L658-L684)
- Changing `_test_zero_model_parallel()` to not use CPU
    - This is my own fault, having introduced this inefficiency last summer. It makes more sense to simply designate one of the two GPUs for a process to be its default device rather than routing through CPU.
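As a reference for the `parametrize` item above, here is a minimal, self-contained sketch of the pattern; the test class, parameter name, and values are hypothetical and not taken from the ZeRO test file:

```python
# Hypothetical example of replacing a manual `for` loop over argument values
# with the `parametrize` decorator from torch.testing._internal.common_utils.
from torch.testing._internal.common_utils import (
    TestCase,
    instantiate_parametrized_tests,
    parametrize,
    run_tests,
)


class TestParametrizeExample(TestCase):
    # Before: a single test body looping over values, so a failure on one
    # value hides the results for the rest:
    #     def test_lr(self):
    #         for lr in (0.01, 0.1):
    #             ...
    # After: one generated test per value, each reported separately.
    @parametrize("lr", [0.01, 0.1])
    def test_lr(self, lr):
        self.assertGreater(lr, 0)


instantiate_parametrized_tests(TestParametrizeExample)

if __name__ == "__main__":
    run_tests()
```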
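And here is a rough sketch of the guard-versus-decorator change; the class, test name, and body are hypothetical, and only the `skip_if_no_gpu` decorator and the `self.device` convention mirror the description above:

```python
# Hypothetical multi-process test showing `common_distributed.skip_if_no_gpu`
# in place of a per-test rank/device-count guard.
import os

import torch
from torch.testing._internal import common_distributed, common_utils


class TestSkipIfNoGpuExample(common_distributed.MultiProcessTestCase):
    def setUp(self):
        super().setUp()
        # `skip_if_no_gpu` compares torch.cuda.device_count() against the
        # world size, so expose the world size in the environment before
        # spawning the per-rank processes.
        os.environ["WORLD_SIZE"] = str(self.world_size)
        self._spawn_processes()

    @property
    def device(self):
        # One GPU per rank when CUDA is available; fall back to CPU otherwise.
        return (
            torch.device(self.rank)
            if torch.cuda.is_available()
            else torch.device("cpu")
        )

    # Before: each test began with a guard that silently returned, e.g.
    #     if self.rank >= self.world_size or (
    #         torch.cuda.is_available() and torch.cuda.device_count() < 2
    #     ):
    #         return
    # After: skip the whole test unless every rank can own its own GPU, so
    # torch.device(self.rank) is always valid inside the test body.
    @common_distributed.skip_if_no_gpu
    def test_device_placement(self):
        t = torch.ones(1, device=self.device)
        self.assertEqual(t.item(), 1.0)


if __name__ == "__main__":
    common_utils.run_tests()
```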

**Questions**
- How might we limit the runs for `test_ddp_zero_overlap()`? Because it parametrizes over many values, it contributes significantly to the time-to-signal. However, it exercises an experimental feature, so it is not critical that the tests run every time.

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D34675709

Pulled By: awgu

fbshipit-source-id: 71ce9ac968fb34415cd65206855b4bb5e67754fb
(cherry picked from commit 34e3dd0a184318ea9f63a1ee20cd14b111af3501)
2022-03-08 13:15:20 +00:00
_shard [shard] fix init_from_local_shards issue with deepcopy (#73400) 2022-03-03 21:37:20 +00:00
_sharded_tensor [reland] Create torch.distributed._shard package. (#72141) 2022-02-02 06:58:20 +00:00
_sharding_spec [reland] Create torch.distributed._shard package. (#72141) 2022-02-02 06:58:20 +00:00
algorithms [Model Averaging] Add a reference to hierarchical SGD (#73823) 2022-03-08 05:56:17 +00:00
autograd
benchmarks Add lint for unqualified type: ignore (#56290) 2021-04-21 08:07:23 -07:00
elastic (torch/elastic) skip logging structured error info if error_file is not set (#73477) 2022-03-01 19:31:44 +00:00
fsdp [FSDP] Generalize fsdp_modules() (#73553) 2022-03-08 07:36:56 +00:00
launcher (torch/elastic) add documentation clarifying that torchrun is a console script to torch.distributed.run (#73598) 2022-03-03 08:35:50 +00:00
nn Revert D33716716: [pytorch][PR] Added remove_duplicate parameter to nn.Module 2022-02-03 09:04:29 +00:00
optim [ZeRO][BE] Clean up ZeRO tests (#73842) 2022-03-08 13:15:20 +00:00
pipeline Remove dtype from torch.Storage and use only torch.ByteStorage (#62030) 2021-10-05 13:50:34 -07:00
rpc Don't discard stacktrace when rewriting AttributeError (#73720) 2022-03-04 01:29:43 +00:00
__init__.py Refactor TORCH_DISTRIBUTED_DEBUG implementation (#73166) 2022-02-24 02:33:05 +00:00
argparse_util.py [19/n][torch/elastic][upstream] Replace pytorch.distributed.launch with torchelastic launcher (#56214) 2021-04-16 13:38:23 -07:00
constants.py make ProcessGroupDefaultTimeout the same as python (#56549) 2021-04-21 17:56:05 -07:00
CONTRIBUTING.md Update distributed contributing guide to show how to run one test in test_distributed_spawn (#67801) 2021-11-04 08:54:31 -07:00
distributed_c10d.py Refactor TORCH_DISTRIBUTED_DEBUG implementation (#73166) 2022-02-24 02:33:05 +00:00
launch.py Introduce the torchrun entrypoint (#64049) 2021-08-26 20:17:48 -07:00
remote_device.py Basic implementation of ShardedLinear using ShardedTensor. (#64128) 2021-09-20 18:31:11 -07:00
rendezvous.py Update _create_c10d_store to check port value (#71863) 2022-01-26 22:29:33 +00:00
run.py (torch/elastic) add documentation clarifying that torchrun is a console script to torch.distributed.run (#73598) 2022-03-03 08:35:50 +00:00