pytorch/torch/distributed
Andrew Gu 9012e8d65a [ZeRO][BE] Clean up ZeRO tests (#73842)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73842

**Overview**
This cleans up the `ZeroRedundancyOptimizer` tests. I apologize for the heavy formatting changes mixed in with the substantive ones; it was convenient to unify the formatting while doing a thorough pass through the full test file.

The main non-formatting changes include:
- Using `parametrize` instead of manually writing `for` loops over possible argument values (see the `parametrize` sketch after this list)
- Removing the `DEVICE` global variable, which was used only for the `TestZeroRedundancyOptimizerSingleRank` tests, in favor of consistent usage of `self.device` in both `TestZeroRedundancyOptimizerSingleRank` and `TestZeroRedundancyOptimizerDistributed`
- Moving `assert ... == ...` to `self.assertEqual(..., ...)` when the assert is part of the test's correctness
- Removing the `if self.rank >= self.world_size or (torch.cuda.is_available() and torch.cuda.device_count() < 2):` conditional guards in favor of `common_distributed.skip_if_no_gpu` for `TestZeroRedundancyOptimizerDistributed`
    - For `TestZeroRedundancyOptimizerDistributed`, `self.device` is `torch.device(self.rank)` if CUDA is available, while `self.world_size` is at least 2, even if `torch.cuda.device_count() == 1`.
    - The problematic case is exactly when `torch.cuda.device_count() == 1` but `self.world_size == 2` since then calling `self.device` on rank 1 will error. The existing conditional guard prevented this case for some tests, but it was not used consistently (e.g. `test_multiple_groups()`), which is most likely the reason for the hangs and resulting test flakiness. (From my experience landing the recent ZeRO constructor changes, the Windows environment uses a world size of 2 but only has 1 device available.)
    - A more robust solution is to always apply the `skip_if_no_gpu` decorator to any test that uses `self.device` when CUDA is available (see the `skip_if_no_gpu` sketch after this list). This is in line with the recommended single-process single-device (SPSD) usage of ZeRO.
- Renaming `test_multiple_groups()` to `test_nondefault_process_group()`
    - The existing `test_multiple_groups()` was slightly misnamed. Also, it is only nontrivial for a world size of at least 4 since it tests a process group containing only the even ranks. It was marked as flaky on Windows, and I believe this is because of the mismatch between the world size and `torch.cuda.device_count()`. Now, the test only uses GPUs if there are enough available and falls back to CPU otherwise, which is safe since the test uses the Gloo backend.
    - There was also a duplicated section, which I was unsure how to non-naively de-duplicate. The top half and bottom half are identical even though they claim to target the cases of fitting into the broadcast bucket and not fitting into it, respectively:
1d497114e7/test/distributed/optim/test_zero_redundancy_optimizer.py (L658-L684)
- Changing `_test_zero_model_parallel()` to not use CPU
    - This is my own fault, having introduced this inefficiency last summer. It makes more sense to simply designate one of the two GPUs for a process to be its default device rather than routing through CPU.
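As a reference for the `parametrize` item above, here is a minimal, self-contained sketch of the pattern; the test class, parameter name, and values are hypothetical and not taken from the ZeRO test file:

```python
# Hypothetical example of replacing a manual `for` loop over argument values
# with the `parametrize` decorator from torch.testing._internal.common_utils.
from torch.testing._internal.common_utils import (
    TestCase,
    instantiate_parametrized_tests,
    parametrize,
    run_tests,
)


class TestParametrizeExample(TestCase):
    # Before: a single test body looping over values, so a failure on one
    # value hides the results for the rest:
    #     def test_lr(self):
    #         for lr in (0.01, 0.1):
    #             ...
    # After: one generated test per value, each reported separately.
    @parametrize("lr", [0.01, 0.1])
    def test_lr(self, lr):
        self.assertGreater(lr, 0)


instantiate_parametrized_tests(TestParametrizeExample)

if __name__ == "__main__":
    run_tests()
```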
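And here is a rough sketch of the guard-versus-decorator change; the class, test name, and body are hypothetical, and only the `skip_if_no_gpu` decorator and the `self.device` convention mirror the description above:

```python
# Hypothetical multi-process test showing `common_distributed.skip_if_no_gpu`
# in place of a per-test rank/device-count guard.
import os

import torch
from torch.testing._internal import common_distributed, common_utils


class TestSkipIfNoGpuExample(common_distributed.MultiProcessTestCase):
    def setUp(self):
        super().setUp()
        # `skip_if_no_gpu` compares torch.cuda.device_count() against the
        # world size, so expose the world size in the environment before
        # spawning the per-rank processes.
        os.environ["WORLD_SIZE"] = str(self.world_size)
        self._spawn_processes()

    @property
    def device(self):
        # One GPU per rank when CUDA is available; fall back to CPU otherwise.
        return (
            torch.device(self.rank)
            if torch.cuda.is_available()
            else torch.device("cpu")
        )

    # Before: each test began with a guard that silently returned, e.g.
    #     if self.rank >= self.world_size or (
    #         torch.cuda.is_available() and torch.cuda.device_count() < 2
    #     ):
    #         return
    # After: skip the whole test unless every rank can own its own GPU, so
    # torch.device(self.rank) is always valid inside the test body.
    @common_distributed.skip_if_no_gpu
    def test_device_placement(self):
        t = torch.ones(1, device=self.device)
        self.assertEqual(t.item(), 1.0)


if __name__ == "__main__":
    common_utils.run_tests()
```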

**Questions**
- How might we limit the runs for `test_ddp_zero_overlap()`? Because it parametrizes over many values, it contributes significantly to the time-to-signal. However, it exercises an experimental feature, so it is not critical that the tests run every time.

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D34675709

Pulled By: awgu

fbshipit-source-id: 71ce9ac968fb34415cd65206855b4bb5e67754fb
(cherry picked from commit 34e3dd0a184318ea9f63a1ee20cd14b111af3501)
2022-03-08 13:15:20 +00:00
_shard [shard] fix init_from_local_shards issue with deepcopy (#73400) 2022-03-03 21:37:20 +00:00
_sharded_tensor [reland] Create torch.distributed._shard package. (#72141) 2022-02-02 06:58:20 +00:00
_sharding_spec [reland] Create torch.distributed._shard package. (#72141) 2022-02-02 06:58:20 +00:00
algorithms [Model Averaging] Add a reference to hierarchical SGD (#73823) 2022-03-08 05:56:17 +00:00
autograd
benchmarks Add lint for unqualified type: ignore (#56290) 2021-04-21 08:07:23 -07:00
elastic (torch/elastic) skip logging structured error info if error_file is not set (#73477) 2022-03-01 19:31:44 +00:00
fsdp [FSDP] Generalize fsdp_modules() (#73553) 2022-03-08 07:36:56 +00:00
launcher (torch/elastic) add documentation clarifying that torchrun is a console script to torch.distributed.run (#73598) 2022-03-03 08:35:50 +00:00
nn Revert D33716716: [pytorch][PR] Added remove_duplicate parameter to nn.Module 2022-02-03 09:04:29 +00:00
optim [ZeRO][BE] Clean up ZeRO tests (#73842) 2022-03-08 13:15:20 +00:00
pipeline Remove dtype from torch.Storage and use only torch.ByteStorage (#62030) 2021-10-05 13:50:34 -07:00
rpc Don't discard stacktrace when rewriting AttributeError (#73720) 2022-03-04 01:29:43 +00:00
__init__.py Refactor TORCH_DISTRIBUTED_DEBUG implementation (#73166) 2022-02-24 02:33:05 +00:00
argparse_util.py [19/n][torch/elastic][upstream] Replace pytorch.distributed.launch with torchelastic launcher (#56214) 2021-04-16 13:38:23 -07:00
constants.py make ProcessGroupDefaultTimeout the same as python (#56549) 2021-04-21 17:56:05 -07:00
CONTRIBUTING.md Update distributed contributing guide to show how to run one test in test_distributed_spawn (#67801) 2021-11-04 08:54:31 -07:00
distributed_c10d.py Refactor TORCH_DISTRIBUTED_DEBUG implementation (#73166) 2022-02-24 02:33:05 +00:00
launch.py Introduce the torchrun entrypoint (#64049) 2021-08-26 20:17:48 -07:00
remote_device.py Basic implementation of ShardedLinear using ShardedTensor. (#64128) 2021-09-20 18:31:11 -07:00
rendezvous.py Update _create_c10d_store to check port value (#71863) 2022-01-26 22:29:33 +00:00
run.py (torch/elastic) add documentation clarifying that torchrun is a console script to torch.distributed.run (#73598) 2022-03-03 08:35:50 +00:00