pytorch/torch/distributed
Andrew Gu c30659ffcc [ZeRO] (Reland) Add ctor support for multiple param groups (#72932)
Summary:
Reland of https://github.com/pytorch/pytorch/pull/72578.

**Overview**
Windows CI was failing due to the multi-rank single-GPU case (see [here](https://github.com/pytorch/pytorch/runs/5204906995?check_suite_focus=true)).

To address this, I
- added `common_distributed.skip_if_no_gpu` for `test_multiple_param_groups()` to ensure that each rank can safely call `to(self.device)` -- this targets the expected single-process single-device (SPSD) use case where each rank has its own GPU;
- moved `test_constructor()` back to `TestZeroRedundancyOptimizerSingleRank` to check that constructing with multiple parameter groups works even on a single rank (a minimal usage sketch follows this list).
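
For context, a minimal sketch of the constructor usage this reland enables, assuming a build that includes this change; the single-rank gloo setup, model, and hyperparameters are only illustrative:

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.optim import ZeroRedundancyOptimizer

# Single-rank gloo setup on CPU so the sketch runs standalone; in the SPSD
# use case each rank would instead move its module to its own GPU.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = nn.Sequential(nn.Linear(10, 10), nn.ReLU(), nn.Linear(10, 5))

# Multiple parameter groups passed directly to the constructor (the feature
# this PR relands); the per-group hyperparameters here are arbitrary.
param_groups = [
    {"params": model[0].parameters(), "weight_decay": 0.0},
    {"params": model[2].parameters(), "weight_decay": 0.01},
]
optimizer = ZeroRedundancyOptimizer(
    param_groups,
    optimizer_class=torch.optim.SGD,
    lr=0.01,
)

model(torch.randn(4, 10)).sum().backward()
optimizer.step()
dist.destroy_process_group()
```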

**Test Plan**
- I ran both tests on CPU and with 1, 2, 4, and 8 GPUs.
- I added the `ciflow/win` label to run the previously failing Windows CI test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/72932

Reviewed By: rohan-varma

Differential Revision: D34281482

Pulled By: awgu

fbshipit-source-id: c4fe604ddd9d2c123c3071249741e6b8a6454b6e
(cherry picked from commit 6bea9bcc63)
2022-02-22 16:29:55 +00:00
_shard Revert D34284271: [TLC][checkpoint] Add unit test for StatefulComponentCheckpointAgent 2022-02-19 21:28:55 +00:00
_sharded_tensor [reland] Create torch.distributed._shard package. (#72141) 2022-02-02 06:58:20 +00:00
_sharding_spec [reland] Create torch.distributed._shard package. (#72141) 2022-02-02 06:58:20 +00:00
algorithms [Join][BE] Fix typo; remove obsolete method (#72886) 2022-02-16 15:03:09 +00:00
autograd
benchmarks Add lint for unqualified type: ignore (#56290) 2021-04-21 08:07:23 -07:00
elastic [codemod][type-comments] Convert type comments in api.py (#73084) 2022-02-19 00:31:45 +00:00
fsdp Revert D33919683: [FSDP] Implement local_state_dict and load_local_state_dict 2022-02-20 02:32:48 +00:00
launcher (torch/elastic) fix scale down bug caused by calling rdzv_handler.shutdown() on premature agent failures (#67749) 2021-11-05 12:18:46 -07:00
nn Revert D33716716: [pytorch][PR] Added remove_duplicate parameter to nn.Module 2022-02-03 09:04:29 +00:00
optim [ZeRO] (Reland) Add ctor support for multiple param groups (#72932) 2022-02-22 16:29:55 +00:00
pipeline Remove dtype from torch.Storage and use only torch.ByteStorage (#62030) 2021-10-05 13:50:34 -07:00
rpc [distributed] Make rref_proxy._invoke_rpc truly async when needed. (#70206) 2022-01-19 23:37:15 +00:00
__init__.py Add pybind trampoline for ProcessGroup and Work (#66338) 2021-10-11 06:41:06 -07:00
argparse_util.py [19/n][torch/elastic][upstream] Replace pytorch.distributed.launch with torchelastic launcher (#56214) 2021-04-16 13:38:23 -07:00
constants.py make ProcessGroupDefaultTimeout the same as python (#56549) 2021-04-21 17:56:05 -07:00
CONTRIBUTING.md Update distributed contributing guide to show how to run one test in test_distributed_spawn (#67801) 2021-11-04 08:54:31 -07:00
distributed_c10d.py Stop writing logs to root logger (#72649) 2022-02-11 21:30:53 +00:00
launch.py Introduce the torchrun entrypoint (#64049) 2021-08-26 20:17:48 -07:00
remote_device.py Basic implementation of ShardedLinear using ShardedTensor. (#64128) 2021-09-20 18:31:11 -07:00
rendezvous.py Update _create_c10d_store to check port value (#71863) 2022-01-26 22:29:33 +00:00
run.py (torch/elastic) add fqdn hostname to error printout (#66182) 2021-10-07 01:40:02 -07:00