pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
fduwjj	85ae28b454	Reformat optim import (#90294 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/90294 Approved by: https://github.com/awgu	2022-12-07 07:11:12 +00:00
Ram Rachum	351d73b97f	Fix exception causes all over the codebase (#90271 ) This is the continuation to #90134 and hopefully the final PR in this series. Pull Request resolved: https://github.com/pytorch/pytorch/pull/90271 Approved by: https://github.com/kit1980	2022-12-07 04:29:00 +00:00
fduwjj	1abe264ef0	[Upstream _NamedOptimzer] Reland PR (89480) (#90293 ) Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): Reland https://github.com/pytorch/pytorch/pull/89480/ * #90294 * __->__ #90293 Pull Request resolved: https://github.com/pytorch/pytorch/pull/90293 Approved by: https://github.com/awgu	2022-12-06 21:47:12 +00:00
PyTorch MergeBot	176b962f4b	Revert "[PT-D][Composability][1/N] Upstream NamedOptimizer from TorchRec (KeyedOptimizer in TR) (#89480 )" This reverts commit `31ec1a1ef7`. Reverted https://github.com/pytorch/pytorch/pull/89480 on behalf of https://github.com/kit1980 due to Broke test_correct_module_names	2022-12-06 07:22:37 +00:00
fduwjj	31ec1a1ef7	[PT-D][Composability][1/N] Upstream NamedOptimizer from TorchRec (KeyedOptimizer in TR) (#89480 ) In PyTorch, the optimizer `state_dict` always uses numbers to index the per-parameter state. The composability workstream now needs an FQN-based (fully qualified name) way to index the optimizer `state_dict` by parameter. For example, an SGD optimizer might have something like this in its `state_dict`:
```
{'state': {0: {'momentum_buffer': tensor(...)}, 1: {'momentum_buffer': tensor(...)}, ...},
 'param_groups': [{'lr': 0.001, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'maximize': False, 'foreach': None, 'differentiable': False, 'params': [0, 1, 2, 3, 4, 5, 6, 7]}]}
```
In NamedOptimizer we want the `state_dict` to be:
```
{'state': {'net1.0.weight': {'momentum_buffer': tensor(...)}, 'net1.0.bias': {'momentum_buffer': tensor(...)}, ...},
 'param_groups': [{'lr': 0.001, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'maximize': False, 'foreach': None, 'differentiable': False, 'params': ['net1.0.weight', 'net1.0.bias', 'net2.0.weight', 'net2.0.bias', 'net3.weight', 'net3.bias', 'net4.1.weight', 'net4.1.bias']}]}
```
We also want to support load_state_dict so that the optimizer `state_dict` can be overridden for NamedOptimizer. For the next couple of PRs/diffs, we also need to: 1. Make `NamedOptimizer` work with FSDP (like registering a hook for a model wrapped with FSDP) and other PTD/PT components. 2. Make `NamedOptimizer` work well with apply_optim_in_backward. 3. Also upstream `CombinedOptimizer`. Differential Revision: [D41432088](https://our.internmc.facebook.com/intern/diff/D41432088/) NOTE FOR REVIEWERS: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D41432088/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/89480 Approved by: https://github.com/rohan-varma	2022-12-06 04:34:19 +00:00
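The re-keying described in #89480 can be pictured with a minimal sketch in plain Python. It assumes that index `i` in the optimizer `state_dict` corresponds to the `i`-th name in a list of fully qualified names; `key_state_dict_by_name` and its arguments are hypothetical names for illustration and are not the `NamedOptimizer` implementation.

```python
from typing import Any, Dict, List


def key_state_dict_by_name(
    state_dict: Dict[str, Any], param_names: List[str]
) -> Dict[str, Any]:
    """Re-key an index-based optimizer state_dict by parameter name.

    Assumption: index i in the state_dict refers to param_names[i].
    """
    named_state = {
        param_names[idx]: value for idx, value in state_dict["state"].items()
    }
    named_groups = []
    for group in state_dict["param_groups"]:
        new_group = dict(group)  # shallow copy; the input is left untouched
        new_group["params"] = [param_names[idx] for idx in group["params"]]
        named_groups.append(new_group)
    return {"state": named_state, "param_groups": named_groups}


# Placeholder strings stand in for the momentum tensors.
indexed = {
    "state": {0: {"momentum_buffer": "buf0"}, 1: {"momentum_buffer": "buf1"}},
    "param_groups": [{"lr": 0.001, "momentum": 0.9, "params": [0, 1]}],
}
named = key_state_dict_by_name(indexed, ["net1.0.weight", "net1.0.bias"])
```

Keying by name rather than by position means the saved state no longer depends on the order in which parameters were handed to the optimizer.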
Rohan Varma	404f254e20	Upstream apply_optim_in_backward from TorchRec (#87397 ) (#88539 ) Summary: Upstreaming this as part of sharing common APIs. This is just a plain move, any changes needed to support DDP / FSDP will come in follow up diffs. Test Plan: CI Reviewed By: zhaojuanmao Differential Revision: D40564646 fbshipit-source-id: 619c434e02196812f8d4db1e40d07290e08b18f9 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88539 Approved by: https://github.com/awgu	2022-11-05 18:28:07 +00:00
Rohan Varma	bd5b4e6504	[Easy] Unused var in functional_adam (#88292 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88292 Approved by: https://github.com/awgu	2022-11-02 16:31:16 +00:00
Masaki Kozuki	5f26df0345	resubmit: "resubmit: [mta] APEX style Fused Adam (#81705 ) (#85507 )" (#85739 ) Embarrassingly move the pow implementations around [ATen/native/cuda/PowKernel.cu#L21-L66](`849b08f14b/aten/src/ATen/native/cuda/PowKernel.cu (L21-L66)`) to a new header file and let FusedAdam use them to tame MSVC, hopefully. cc @ngimel @ptrblck Pull Request resolved: https://github.com/pytorch/pytorch/pull/85739 Approved by: https://github.com/ngimel	2022-09-29 16:58:59 +00:00
PyTorch MergeBot	7167996346	Revert "resubmit: [mta] APEX style Fused Adam (#81705 ) (#85507 )" This reverts commit `4615d1bcfa`. Reverted https://github.com/pytorch/pytorch/pull/85507 on behalf of https://github.com/atalman due to Break internal windows builds	2022-09-27 16:59:35 +00:00
Masaki Kozuki	4615d1bcfa	resubmit: [mta] APEX style Fused Adam (#81705 ) (#85507 ) This PR implements an APEX style FusedAdam in PyTorch. This is different from the APEX one in that this is compatible with `torch.cuda.amp.GradScaler` by setting `_step_supports_amp_scaling` to `True` and unscales gradients inside its CUDA kernel. related: https://github.com/pytorch/pytorch/issues/68041, https://github.com/pytorch/pytorch/issues/71274, https://github.com/pytorch/pytorch/issues/80167 possibly related to https://github.com/pytorch/pytorch/issues/80595#issuecomment-1178519436 Pull Request resolved: https://github.com/pytorch/pytorch/pull/81705 Approved by: https://github.com/ngimel cc @ptrblck @ngimel Pull Request resolved: https://github.com/pytorch/pytorch/pull/85507 Approved by: https://github.com/ngimel	2022-09-23 18:56:00 +00:00
PyTorch MergeBot	e505360eb8	Revert "[mta] APEX style Fused Adam (#81705 )" This reverts commit `7a6c4d0c50`. Reverted https://github.com/pytorch/pytorch/pull/81705 on behalf of https://github.com/dagitses due to broke internal builds, details to come	2022-09-22 19:37:29 +00:00
Masaki Kozuki	7a6c4d0c50	[mta] APEX style Fused Adam (#81705 ) This PR implements an APEX style FusedAdam in PyTorch. This is different from the APEX one in that this is compatible with `torch.cuda.amp.GradScaler` by setting `_step_supports_amp_scaling` to `True` and unscales gradients inside its CUDA kernel. related: https://github.com/pytorch/pytorch/issues/68041, https://github.com/pytorch/pytorch/issues/71274, https://github.com/pytorch/pytorch/issues/80167 possibly related to https://github.com/pytorch/pytorch/issues/80595#issuecomment-1178519436 cc @ptrblck @ngimel Pull Request resolved: https://github.com/pytorch/pytorch/pull/81705 Approved by: https://github.com/ngimel	2022-09-20 17:18:33 +00:00
Rodrigo Kumpera	65dc5dd3f3	[c10d] Introduce dist.get_local_rank, dist.get_global_rank and dist.get_global_ranks (#82134 ) Those functions enable membership introspection into a ProcessGroup. A common scenario that needs this is library code that consumes a PG but doesn't create it, which means it likely doesn't know the global ranks used to create it. Translating from local to global is necessary when using c10d collectives like broadcast, so if your library code adopts the convention of using local rank 0, it needs to do the following:
```python
import torch.distributed as dist

my_pg: dist.ProcessGroup = ...

def my_library_bcast(tensor):
    # Translate the group's local rank 0 into the global rank that broadcast expects.
    dist.broadcast(tensor, src=dist.get_global_rank(my_pg, 0), group=my_pg)
```
This implements some of the helpers needed to implement the `clone` API from: https://github.com/pytorch/pytorch/issues/81291 Pull Request resolved: https://github.com/pytorch/pytorch/pull/82134 Approved by: https://github.com/rohan-varma	2022-08-30 17:45:00 +00:00
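The local-to-global translation from #82134 can be pictured without any distributed setup. The sketch below models a process group as the ordered list of global ranks it was created from, which is an assumption made purely for illustration; `to_global_rank` and `to_local_rank` are hypothetical helpers, not the `torch.distributed` functions.

```python
from typing import List


def to_global_rank(group_ranks: List[int], local_rank: int) -> int:
    """Global rank of the member at position local_rank in the group."""
    return group_ranks[local_rank]


def to_local_rank(group_ranks: List[int], global_rank: int) -> int:
    """Position of global_rank within the group; raises if it is not a member."""
    if global_rank not in group_ranks:
        raise ValueError(f"global rank {global_rank} is not in the group")
    return group_ranks.index(global_rank)


# A group built from the odd ranks of an 8-process job.
odd_ranks = [1, 3, 5, 7]
src = to_global_rank(odd_ranks, 0)  # local rank 0 maps to global rank 1
```

Library code that only receives the group cannot assume that local rank 0 is global rank 0, which is why the lookup is needed before calling a collective that takes a global source rank.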
joncrall	b136f3f310	More doctest refinements. (#83317 ) Follow up to #82797 Now that the doctests themselves are in a better state, we should be able to enable xdoctest on the CI so they stay that way. @ezyang @vadimkantorov Pull Request resolved: https://github.com/pytorch/pytorch/pull/83317 Approved by: https://github.com/ezyang	2022-08-22 20:07:26 +00:00
Rob Zinkov	ff75562cff	Adding maximize to rprop (#81864 ) Added the maximize flag #68052 to rprop optimizer and updates the respective tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/81864 Approved by: https://github.com/albanD	2022-08-16 08:19:46 +00:00
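As a rough illustration of what a `maximize` flag does, the sketch below flips the sign of the gradient before the update so that the step ascends the objective. It deliberately uses a plain gradient step rather than Rprop to stay short, and `gradient_step` is a hypothetical function, not the code added in this commit.

```python
from typing import List


def gradient_step(
    params: List[float], grads: List[float], lr: float, maximize: bool = False
) -> List[float]:
    """One plain gradient step; with maximize=True the gradient sign is flipped."""
    sign = -1.0 if maximize else 1.0
    return [p - lr * sign * g for p, g in zip(params, grads)]


# f(x) = -(x - 3)^2 has its maximum at x = 3; its gradient is -2 * (x - 3).
x = [0.0]
for _ in range(200):
    grad = [-2.0 * (x[0] - 3.0)]
    x = gradient_step(x, grad, lr=0.1, maximize=True)
```

With `maximize=False` the same loop would move `x` away from 3, since the update would then descend the objective instead.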
joncrall	4618371da5	Integrate xdoctest - Rebased (#82797 ) This is a new version of #15648 based on the latest master branch. Unlike the previous PR where I fixed a lot of the doctests in addition to integrating xdoctest, I'm going to reduce the scope here. I'm simply going to integrate xdoctest, and then I'm going to mark all of the failing tests as "SKIP". This will let xdoctest run on the dashboards, provide some value, and still let the dashboards pass. I'll leave fixing the doctests themselves to another PR. In my initial commit, I do the bare minimum to get something running with failing dashboards. The few tests that I marked as skip are causing segfaults. Running xdoctest results in 293 failed, 201 passed tests. The next commits will be to disable those tests. (unfortunately I don't have a tool that will insert the `#xdoctest: +SKIP` directive over every failing test, so I'm going to do this mostly manually.) Fixes https://github.com/pytorch/pytorch/issues/71105 @ezyang Pull Request resolved: https://github.com/pytorch/pytorch/pull/82797 Approved by: https://github.com/ezyang	2022-08-12 02:08:01 +00:00
ProGamerGov	71d50f4f89	Change docstring type callable to Callable for consistency (#82487 ) ### Description Across PyTorch's docstrings, both `callable` and `Callable` are used for variable types. `Callable` should be capitalized because we are referring to the `Callable` type, not the Python `callable()` function. ### Testing There shouldn't be any testing required. Pull Request resolved: https://github.com/pytorch/pytorch/pull/82487 Approved by: https://github.com/albanD	2022-08-01 17:26:09 +00:00
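The distinction drawn in #82487 can be seen in a few lines of standard Python: `typing.Callable` is a type used in annotations, while the built-in `callable()` is a runtime check. The function names below are made up for the example.

```python
from typing import Callable


def apply_twice(fn: Callable[[int], int], value: int) -> int:
    """`Callable` is the type that annotations and docstrings refer to."""
    return fn(fn(value))


def increment(x: int) -> int:
    return x + 1


result = apply_twice(increment, 0)

# `callable()` is the built-in runtime check, not a type.
is_fn_callable = callable(increment)
is_int_callable = callable(3)
```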
Jerome	547e499731	Enable Zero1's ddp_with_overlap for hpu backend (#80438 ) Enable zero with ddp overlap feature along with a simple interface to insert functional optimizer to the map Signed-off-by: Jerome <janand@habana.ai> Pull Request resolved: https://github.com/pytorch/pytorch/pull/80438 Approved by: https://github.com/rohan-varma, https://github.com/awgu	2022-07-18 15:05:27 +00:00
anjali411	93912b1a73	Add __all__ to torch.distributed submodules (#80523 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/80523 Approved by: https://github.com/rohan-varma	2022-07-11 06:54:24 +00:00
PyTorch MergeBot	0b8a5ca01b	Revert "Adding maximize to rprop (#80335 )" This reverts commit `495aa9bc3a`. Reverted https://github.com/pytorch/pytorch/pull/80335 on behalf of https://github.com/albanD due to Broke rocm and windows test	2022-07-08 13:34:02 +00:00
Rob Zinkov	495aa9bc3a	Adding maximize to rprop (#80335 ) Added the maximize flag #68052 to rprop optimizer and updates the respective tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/80335 Approved by: https://github.com/albanD	2022-07-08 08:04:38 +00:00
Rob Zinkov	a1fd5b4273	Adding maximize to RMSprop (#80326 ) Added the maximize flag #68052 to RMSprop optimizer and updates the respective tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/80326 Approved by: https://github.com/albanD	2022-07-08 08:04:26 +00:00
wayi1	f76bb88205	fix docstring of PostLocalSGDOptimizer (#80855 ) As title. Pull Request resolved: https://github.com/pytorch/pytorch/pull/80855 Approved by: https://github.com/awgu, https://github.com/rohan-varma	2022-07-05 14:58:35 +00:00
Michael Carilli	ba27ee9e8f	[CUDA graphs] Allows Adam and AdamW to be capture-safe (#77862 ) Near term fix for https://github.com/pytorch/pytorch/issues/76368. Q. Why does the user need to request `capturable=True` in the optimizer constructor? Why can't capture safety be completely automatic? A. We need to set up capture-safe (device-side) state variables before capture. If we don't, and step() internally detects capture is underway, it's too late: the best we could do is create a device state variable and copy the current CPU value into it, which is not something we want baked into the graph. Q. Ok, why not just do the capture-safe approach with device-side state variables all the time? A. It incurs several more kernel launches per parameter, which could really add up and regress cpu overhead for ungraphed step()s. If the optimizer won't be captured, we should allow step() to stick with its current cpu-side state handling. Q. But cuda RNG is a stateful thing that maintains its state on the cpu outside of capture and replay, and we capture it automatically. Why can't we do the same thing here? A. The graph object can handle RNG generator increments because its capture_begin, capture_end, and replay() methods can see and access generator object. But the graph object has no explicit knowledge of or access to optimizer steps in its capture scope. We could let the user tell the graph object what optimizers will be stepped in its scope, ie something like ```python graph.will_use_optimizer(opt) graph.capture_begin() ... ``` but that seems clunkier than an optimizer constructor arg. I'm open to other ideas, but right now I think constructor arg is necessary and the least bad approach. Long term, https://github.com/pytorch/pytorch/issues/71274 is a better fix. Pull Request resolved: https://github.com/pytorch/pytorch/pull/77862 Approved by: https://github.com/ezyang	2022-06-13 01:56:47 +00:00
Olga Andreeva	b1ae519df9	Added functionality for post_local SGD (#78988 ) Fixes #74556 Added functionality to save and restore step counter for model averager. Added a unittest. Pull Request resolved: https://github.com/pytorch/pytorch/pull/78988 Approved by: https://github.com/rohan-varma, https://github.com/awgu	2022-06-09 17:47:04 +00:00
Rob Zinkov	2a496e2f80	Adding maximize to Adamax (#77409 ) Added the maximize flag #68052 to Adamax optimizer and updates the respective tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/77409 Approved by: https://github.com/albanD	2022-05-16 17:34:44 +00:00
Rob Zinkov	6642e88ad2	Adding maximize flag to Adagrad This adds maximize to Adagrad (#68052) along with updates the respective tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/75968 Approved by: https://github.com/albanD	2022-04-20 08:29:03 +00:00
Haijunlv	08f3b95857	fix PostLocalSGDOptimizer and ModelAverager average bug Fixes #74157 Pull Request resolved: https://github.com/pytorch/pytorch/pull/74894 Approved by: https://github.com/rohan-varma, https://github.com/wayi1	2022-04-13 11:41:27 +00:00
francescocastelli	58a44523c1	Add maximize flag to Adadelta Added the maximize flag to Adadelta optimizer (#68052) and adjusted tests to take maximize into account. Pull Request resolved: https://github.com/pytorch/pytorch/pull/75330 Approved by: https://github.com/cpuhrsch	2022-04-08 20:32:35 +00:00
wayi1	189e72babe	[Model Averaging] Fix post_localSGD_optimizer I find that the original implementation of `post_localSGD_optimizer.step()` is incorrect: Whenever `averager.average_parameters()` is called, the built-in step counter will be increased. Therefore, this should only be called exactly once per `optimizer.step()`. However, if a model has multiple param groups or params, the current implementation will call `averager.average_parameters()` multiple times and over-increase the step counter. Relevant proposals since hierarchical SGD can be supported on `post_localSGD_optimizer`: https://github.com/pytorch/pytorch/issues/73382, https://github.com/pytorch/pytorch/issues/71325 Pull Request resolved: https://github.com/pytorch/pytorch/pull/74737 Approved by: https://github.com/mrshenli	2022-04-05 21:10:24 +00:00
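The over-counting described in this commit message can be reproduced with a toy stand-in. `CountingAverager` below is a hypothetical class that only counts calls; it is not the real model averager, and the loops are a simplified picture of the buggy and fixed call patterns.

```python
from typing import List


class CountingAverager:
    """Stand-in for a model averager that only counts calls (hypothetical)."""

    def __init__(self) -> None:
        self.step = 0

    def average_parameters(self, params: List[List[float]]) -> None:
        # A real averager would communicate here; this one only counts.
        self.step += 1


param_groups = [[[1.0], [2.0]], [[3.0]]]  # two groups, three params in total

# Buggy pattern: one call per parameter inflates the step counter.
buggy = CountingAverager()
for group in param_groups:
    for param in group:
        buggy.average_parameters([param])

# Fixed pattern: gather every parameter and call once per optimizer step.
fixed = CountingAverager()
all_params = [param for group in param_groups for param in group]
fixed.average_parameters(all_params)
```

After a single optimizer step, the buggy pattern has advanced the counter by three while the fixed pattern has advanced it by one.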
Andrew Gu	522041a0fd	[FSDP] Add full optim state dict (#74215 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/74215
### Overview of API
This PR introduces full optimizer state dict checkpointing.
- This allows users to save the optimizer state for a `torch.nn.Module` (not necessarily a `FullyShardedDataParallel` instance) that contains `FullyShardedDataParallel` instances and later load that optimizer state.
- This supports loading to a module with a different world size, but the `FSDP` wrapping scheme must be the same.

To save the optimizer state, run the following (on all ranks):
```
model: torch.nn.Module = ...
optim = torch.optim.Adam(model.parameters(), ...)
# Train for some steps...
full_osd = FSDP.full_optim_state_dict(model, optim)  # returns non-empty dict only on rank 0
if rank == 0:
    torch.save(full_osd, ...)
```
To load the optimizer state, run the following (on all ranks):
```
new_model: torch.nn.Module = ...  # may use different world size
full_osd = torch.load(...)
sharded_osd = FSDP.shard_full_optim_state_dict(full_osd, new_model)
optim = torch.optim.Adam(new_model.parameters(), ...)
optim.load_state_dict(sharded_osd)
```
To support multiple parameter groups, we require using an additional argument `optim_input`, which is the first argument that the user passes into the optimizer constructor.
```
optim_input = ...
optim = torch.optim.Adam(optim_input, ...)
FSDP.full_optim_state_dict(model, optim, optim_input)  # one more argument
...
new_optim_input = ...
new_optim = torch.optim.Adam(new_optim_input, ...)
FSDP.shard_full_optim_state_dict(full_osd, new_model, new_optim_input)  # one more argument
```
One caveat is that the user should be careful with generators, which are exhausted after their first use. If generators are used, the `optim_input` passed into the `FSDP` APIs should be a refreshed version of the generator.
### Test Plan `full_optim_state_dict()` - [x] `full_optim_state_dict()` for a non-`FSDP` root model matches that of an equivalent local model, up to parameter IDs being rearranged, when optimizer input is `model.parameters()`. - [x] `full_optim_state_dict()` for a non-`FSDP` root model matches that of an equivalent local model, up to parameter IDs being rearranged, when optimizer input is multiple parameter groups (changing parameter order). `shard_full_optim_state_dict()` - [x] `shard_full_optim_state_dict()` for a non-`FSDP` root model matches the local `optim.state_dict()` of the same model with halved world size, when optimizer input is `model.parameters()`. - [x] `shard_full_optim_state_dict()` for a non-`FSDP` root model matches the local `optim.state_dict()` of the same model with halved world size, when optimizer input is multiple parameter groups (changing parameter order). - [x] `shard_full_optim_state_dict()` raises a `ValueError` when changing the `FSDP` wrapping scheme. On the AWS cluster, the TTS contribution for these tests is ~45 seconds. ### Developer Notes Relaxing the Problem For optimizer state checkpointing, we have relaxed the problem to not support changing the `FSDP` wrapping scheme between save and load time. It is unclear how to solve without this relaxation. This was the least restrictive way to relax the problem since it does not affect most expected use cases. Rather, the expected change between save and load time is the world size, which this implementation does support. Even with the relaxation, the `optim_input` argument is necessary to determine the `flat_param_id_to_param` mapping, which is important to know which parameter IDs in the flattened space correspond to `FlatParameter`s that hence need to be unflattened. Differences with Local Equivalent Suppose `full_osd = full_optim_state_dict()` and `local_osd = state_dict()` for a purely local equivalent. 
The difference between `full_osd` and `local_osd` is that the parameter IDs of unflattened parameters comprising a single flattened parameter are always consecutive in `full_osd`, while they may be non-consecutive in `local_osd`. Suppose in the following that each layer has 1 parameter `param`: ``` FSDP(model) layer1 FSDP(layer2) layer3 ``` `layer1.param` and `layer3.param` are flattened and attributed to `model`. `layer2.param` is flattened and attributed to itself. - In `local_osd`, the parameter IDs would be `0: layer1.param`, `1: layer2.param`, and `2: layer3.param`. - In `full_osd`, the parameter IDs would be `0: layer1.param`, `1: layer3.param`, and `2: layer2.param`. (Parameter IDs of unflattened parameters sharing a flattened parameter are consecutive.) The idea is that as long as `full_optim_state_dict()` and `shard_full_optim_state_dict()` are internally consistent, then there is no need to match the local equivalent (assuming no change in `FSDP` wrapping). ### Follow-Ups API - If needed, we can follow-up this PR by adding an argument `key_by_name: bool = False` to both methods that may be set to `True` to key parameters by `str` names instead of `int` parameter IDs. We still need to investigate if keying by name enables changing the `FSDP` wrapping scheme. Refactoring - In this optimizer state checkpointing, all optimizer state is saved to CPU on rank 0 (set as `OPTIM_TARGET_RANK`). We should unify and refactor these assumptions with model state checkpointing. Testing - The code path for unused parameters is not tested. The testing and any needed implementation fixes can be done in a follow-up. - The code path for non-tensor states (e.g. `Adam` `"step"` as `float` instead of as zero-dimension `FloatTensor`) is not tested. However, it is identical to that of zero-dimension tensor states, so I have some confidence. If needed, I can add tests for it in a follow-up. - Would I have to write my own optimizer? 
I do not want to introduce dependencies on third party libraries like Nvidia `apex`. - We may want to add end-to-end checkpointing tests that include both model state dict and optimizer state dict. Test Plan: Imported from OSS Reviewed By: zhaojuanmao Differential Revision: D35045121 Pulled By: awgu fbshipit-source-id: 33c650dc960acbd7613d4f444a852b9f76ca4a9b (cherry picked from commit 2bbc2e344296dc455cf686f3a9b097989504be81)	2022-03-30 14:15:23 +00:00
Andrew Gu	9012e8d65a	[ZeRO][BE] Clean up ZeRO tests (#73842 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73842 Overview This cleans up the `ZeroRedundancyOptimizer` tests. I apologize for strong formatting changes mixed in with actually-beneficial changes. It was convenient to unify the formatting while doing a deep comb through the full test file. The main non-formatting changes include: - Using `parametrize` instead of manually including `for` loops over possible argument values - Removing the `DEVICE` global variable, which was used only for the `TestZeroRedundancyOptimizerSingleRank` tests, in favor of consistent usage of `self.device` in both `TestZeroRedundancyOptimizerSingleRank` and `TestZeroRedundancyOptimizerDistributed` - Moving `assert ... == ...` to `self.assertEqual(..., ...)` when the assert is part of the test's correctness - Removing the `if self.rank >= self.world_size or (torch.cuda.is_available() and torch.cuda.device_count() < 2):` conditional guards in favor of `common_distributed.skip_if_no_gpu` for `TestZeroRedundancyOptimizerDistributed` - For `TestZeroRedundancyOptimizerDistributed`, `self.device` is `torch.device(self.rank)` if CUDA is available, while `self.world_size` is at least 2, even if `torch.cuda.device_count() == 1`. - The problematic case is exactly when `torch.cuda.device_count() == 1` but `self.world_size == 2` since then calling `self.device` on rank 1 will error. The existing conditional guard prevented this case for some tests, but it was not used consistently (e.g. `test_multiple_groups()`), which is most likely the reason for the hangs and resulting test flakiness. (From my experience landing the recent ZeRO constructor changes, the Windows environment uses a world size of 2 but only has 1 device available.) - A more robust solution is to always use the `skip_if_no_gpu` decorator as long as the test uses `self.device` and CUDA is available. 
This is in line with the recommended SPSD usage of ZeRO. - Renaming `test_multiple_groups()` to `test_nondefault_process_group()` - The existing `test_multiple_groups()` was slightly misnamed. Also, it is only nontrivial for a world size of (at least) 4 since it tests using a process group including only even ranks. It was marked as flaky on Windows, and I believe this is because of the world size and `torch.cuda.device_count()` mismatch. Now, the test only uses GPU if there are enough available and falls back to CPU otherwise, which is safe since the test uses Gloo backend. - There was also a duplicated section, which I was unsure how to non-naively de-duplicate. The top half and bottom half are identical even though they claim to target fitting into the broadcast bucket and not fitting into the broadcast bucket: `1d497114e7/test/distributed/optim/test_zero_redundancy_optimizer.py (L658-L684)` - Changing `_test_zero_model_parallel()` to not use CPU - This is my own fault, having introduced this inefficiency last summer. It makes more sense to simply designate one of the two GPUs for a process to be its default device rather than routing through CPU. Questions - How might we limit the runs for `test_ddp_zero_overlap()`? Because it parameterizes over many values, it contributes significantly to the time-to-signal. However, it is an experimental feature, so it is not critical that the tests run every time. Test Plan: Imported from OSS Reviewed By: rohan-varma Differential Revision: D34675709 Pulled By: awgu fbshipit-source-id: 71ce9ac968fb34415cd65206855b4bb5e67754fb (cherry picked from commit 34e3dd0a184318ea9f63a1ee20cd14b111af3501)	2022-03-08 13:15:20 +00:00
Can Balioglu	e1db2f13ce	Refactor TORCH_DISTRIBUTED_DEBUG implementation (#73166 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73166 This PR refactors, cleans up, and optimizes the implementation of `TORCH_DISTRIBUTED_DEBUG`. It also introduces three new user APIs: `get_debug_level()`, `set_debug_level()`, and `set_debug_level_from_env()` to retrieve and modify the debug level after a process has started. ghstack-source-id: 149778566 Test Plan: Run the existing unit tests. Reviewed By: rohan-varma Differential Revision: D34371226 fbshipit-source-id: e18443b411adcbaf39b2ec999178c198052fcd5b (cherry picked from commit 26d6bb1584b83a0490d8b766482656a5887fa21d)	2022-02-24 02:33:05 +00:00
Andrew Gu	c30659ffcc	[ZeRO] (Reland) Add ctor support for multiple param groups (#72932 ) Summary: Reland of https://github.com/pytorch/pytorch/pull/72578. Overview Windows CI was failing due to the multi-rank single-GPU case (see [here](https://github.com/pytorch/pytorch/runs/5204906995?check_suite_focus=true)). To address this, I - added `common_distributed.skip_if_no_gpu` for `test_multiple_param_groups()` to ensure that each rank can safely call `to(self.device)` -- this targets the expected SPSD use case where each rank has its own GPU; - moved `test_constructor()` back to `TestZeroRedundancyOptimizerSingleRank` to check that the multiple parameter group method for construction works even on a single rank. Test Plan - I checked both tests for CPU, 1 GPU, 2 GPUs, 4 GPUs, and 8 GPUs. - I added the `ciflow/win` label to run the failing Windows CI test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/72932 Reviewed By: rohan-varma Differential Revision: D34281482 Pulled By: awgu fbshipit-source-id: c4fe604ddd9d2c123c3071249741e6b8a6454b6e (cherry picked from commit `6bea9bcc63`)	2022-02-22 16:29:55 +00:00
Nikita Shulga	84cb810b3f	Revert D34106940: [ZeRO] Add ctor support for multiple param groups Test Plan: revert-hammer Differential Revision: D34106940 (`5dd0732457`) Original commit changeset: 7e70fc0b3cec Original Phabricator Diff: D34106940 (`5dd0732457`) fbshipit-source-id: 08f846c9c02be8756475f4e0b57eb381f10c27bd (cherry picked from commit `7675497d83`)	2022-02-16 03:45:15 +00:00
wayi1	8b08478115	Fix the doc of PostLocalSGDState (#72792 ) Summary: The first arg of the `PostLocalSGDState` ctor, `process_group`, cannot be empty. Here, to simplify the usage, the example does not even create a subgroup explicitly. See the example in unit test: `4feef6c970/torch/testing/_internal/distributed/distributed_test.py (L4260)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/72792 Reviewed By: samdow Differential Revision: D34213221 Pulled By: rohan-varma fbshipit-source-id: 078343f3ee138e175bf835897f190032eb970662 (cherry picked from commit `bf90af704f`)	2022-02-15 23:47:12 +00:00
Mikayla Gawarecki	2a5aaf1c49	Optim foreach cleanup for AdamW (#70484 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70484 Test Plan: Imported from OSS Reviewed By: anjali411 Differential Revision: D33767869 Pulled By: mikaylagawarecki fbshipit-source-id: 2f5273bbfeea3ed502c5d77da4bebe1674243e86 (cherry picked from commit `2dd9b77917`)	2022-02-15 18:02:08 +00:00
Mikayla Gawarecki	dff58d519f	Optim foreach cleanup for Rprop (#70483 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70483 Test Plan: Imported from OSS Reviewed By: anjali411 Differential Revision: D33767866 Pulled By: mikaylagawarecki fbshipit-source-id: ffc5ae68eeea8fa09385862b853b731554b77bcb (cherry picked from commit `3a0fe29580`)	2022-02-15 18:02:08 +00:00
Mikayla Gawarecki	ce3094f5f6	Optim foreach cleanup for Rmsprop (#70482 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70482 Test Plan: Imported from OSS Reviewed By: anjali411 Differential Revision: D33767862 Pulled By: mikaylagawarecki fbshipit-source-id: 8e2e9c986d5a3774093a79755940372945f1b3a9 (cherry picked from commit `baea537277`)	2022-02-15 18:02:08 +00:00
Mikayla Gawarecki	2cb03e926f	Optim foreach cleanup for SGD (#70481 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70481 Test Plan: Imported from OSS Reviewed By: anjali411 Differential Revision: D33767868 Pulled By: mikaylagawarecki fbshipit-source-id: 89b9227a4ddf99602855973cbc343c58ae3d5328 (cherry picked from commit `ffea8ddcfd`)	2022-02-15 18:02:08 +00:00
Mikayla Gawarecki	5f9590681d	Optim foreach cleanup for Adam (#70295 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70295 Test Plan: Imported from OSS Reviewed By: anjali411 Differential Revision: D33767870 Pulled By: mikaylagawarecki fbshipit-source-id: f922f15ecb0307458c8ecee737325c42c4f3ce8b (cherry picked from commit `66233a8a3e`)	2022-02-15 18:02:08 +00:00
Andrew Gu	5dd0732457	[ZeRO] Add ctor support for multiple param groups (#72578 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/72578 Overview This adds `ZeroRedundancyOptimizer` constructor support for multiple parameter groups (i.e. passing an `iterable` of `dict`s instead of an `iterable` of `torch.Tensor` as the `parameters` argument) to mirror the API for non-sharded optimizers. Fixes https://github.com/pytorch/pytorch/issues/71347 and https://github.com/pytorch/pytorch/issues/59973. This modifies `test_collect_shards()` to skip if ROCm. Test Plan I adjusted the existing constructor test, and I added a test for parity between constructing with two parameter groups up front versus constructor with one parameter group and adding the second parameter group after (via `add_param_group()`) versus a non-sharded optimizer. Test Plan: Imported from OSS Reviewed By: rohan-varma Differential Revision: D34106940 Pulled By: awgu fbshipit-source-id: 7e70fc0b3cec891646e0698eaedf02ff4354c128 (cherry picked from commit `40f2d45172`)	2022-02-15 16:51:30 +00:00
Yuxin Wu	1ed4653e89	Stop writing logs to root logger (#72649 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/72648 Pull Request resolved: https://github.com/pytorch/pytorch/pull/72649 Reviewed By: soulitzer Differential Revision: D34172113 Pulled By: mrshenli fbshipit-source-id: 98cb4140b978a0d9fa53876e427ea3b8bbe884cf (cherry picked from commit `c14297cee6`)	2022-02-11 21:30:53 +00:00
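The usual way for a library to avoid writing to the root logger is to log through a named logger. The sketch below uses only the standard `logging` module; the logger name `mylib.optim` and the `ListHandler` class are hypothetical and are not taken from the PyTorch change.

```python
import logging

# Named logger: records carry this name and the root logger is not
# configured as a side effect of logging.
logger = logging.getLogger("mylib.optim")  # hypothetical library name


class ListHandler(logging.Handler):
    """Collects records in a list so the example can inspect them."""

    def __init__(self) -> None:
        super().__init__()
        self.records = []

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append(record)


handler = ListHandler()
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("optimizer state loaded")  # goes through the named logger
# By contrast, logging.info("...") logs via the root logger and implicitly
# calls logging.basicConfig() if the root logger has no handlers yet.
```

Applications can then raise, lower, or silence the library's output by configuring the named logger without affecting their own root-level logging.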
Mikayla Gawarecki	d9acfef831	Optim foreach cleanup for Adamax (#69982 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69982 Test Plan: Imported from OSS Reviewed By: anjali411 Differential Revision: D33767865 Pulled By: mikaylagawarecki fbshipit-source-id: c5efd351e359825d38b71f57a2c61a2055c3c114 (cherry picked from commit `37bb80c2d7`)	2022-02-09 16:52:13 +00:00
Mikayla Gawarecki	dabfea8363	Optim foreach cleanup for Adagrad (#69981 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69981 Test Plan: Imported from OSS Reviewed By: anjali411 Differential Revision: D33767863 Pulled By: mikaylagawarecki fbshipit-source-id: 1c99abe4ac4eb2a9eb896dff4837b539b94f68e7 (cherry picked from commit `61c28d0645`)	2022-02-09 16:52:12 +00:00
Mikayla Gawarecki	8e8d170674	Optim foreach cleanup for Adadelta (#69980 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69980 - Merged `torch/optim/adadelta.py` and `torch/optim/_multi_tensor/adadelta.py` into `torch/optim/adadelta.py` - Moved adadelta functional forms from `torch/optim/_functional.py` and `torch/optim/_multi_tensor/_functional.py` to `torch/optim/adadelta.py` - `torch/optim/_functional.py` just imports from `torch/optim/adadelta.py` - Added a test `test_optimizers_foreach_flag` which replicates `test_multi_tensor_optimizers` in `test/test_optim.py` - Added a test `test_adadelta_new` that replicates the behavior of `test_adadelta` but with the `foreach` flag instead of the multitensor adadelta class. If we delete `_multi_tensor/` we could replace `test_adadelta` with this Remaining TODO: - [ ] single_tensor adadelta supports complex but multitensor does not, need to integrate the singletensor logic in multitensor and switch the `test_adadelta_complex` to test for foreach in [True, False] Test Plan: Imported from OSS Reviewed By: VitalyFedyunin, albanD Differential Revision: D33413059 Pulled By: mikaylagawarecki fbshipit-source-id: 92a9fa98705762bb6bd464261671e49aef40070e (cherry picked from commit `a008227d22`)	2022-02-09 16:52:12 +00:00
Mikayla Gawarecki	7176c92687	[optim] update step in functional and pass state_steps instead of state (#71333 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71333 Updated: - Adagrad - Adamax - Adam - AdamW - RAdam Changes: - Make multi_tensor functionals take `state_steps: List[Tensor]` instead of `states: List[Dict]` - Change `state_steps: List[int]` to `state_steps: List[Tensor]`, where each element is a singleton tensor, so that step can be updated within the functional (NAdam and ASGD were updated in separate diffs to fold their handling of state into the functionals) Test Plan: Imported from OSS Reviewed By: anjali411 Differential Revision: D33767872 Pulled By: mikaylagawarecki fbshipit-source-id: 9baa7cafb6375eab839917df9287c65a437891f2 (cherry picked from commit `831c02b3d0`)	2022-02-08 16:51:19 +00:00
Rohan Varma	bdcdf94bdd	[Opt Overlap] Clean up code in _OptimizerHookState (#71620 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71620 Remove from_functional_optim and make it the default constructor since that is the only way _OptimizerHookState is now being built. Also, no longer need to expose create_functional_optim helper function ghstack-source-id: 147577174 Test Plan: CI Reviewed By: cbalioglu Differential Revision: D33700593 fbshipit-source-id: ba089ce3bf66ccf8f71cffdd0f4d4bddc03e8b14 (cherry picked from commit `a50b2caf0e`)	2022-01-26 19:33:49 +00:00
Rohan Varma	f5a71ec2d6	[Opt Overlap] Implement as_functional_optim and create_functional_optim (#71604 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71604 Implement 2 helper functions: - as_functional_optim which takes in a torch.optim class type and arguments and creates the corresponding functional optimizer. - create_functional_optim which takes in the functional optimizer class type and constructs it. Note that as_functional_optim calls into create_functional_optim. The first will be used in future PRs as described in https://github.com/pytorch/pytorch/issues/67570 to create a functional optimizer from a traditional optimizer. The latter is used in _OptimizerHookState to create a functional optimizer. Both new helper functions are covered by unittests. ghstack-source-id: 147577170 Test Plan: CI Reviewed By: cbalioglu Differential Revision: D33688995 fbshipit-source-id: 8b2daafd1b914efa90877cc4313aa9a428546fc1 (cherry picked from commit `42fdae2991`)	2022-01-25 18:32:13 +00:00
Rohan Varma	541817628b	[Easy] Add comment explaining DistributedOptimizer gating (#71603 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71603 Small comment to clarify this. ghstack-source-id: 147577171 Test Plan: CI Reviewed By: cbalioglu Differential Revision: D33688994 fbshipit-source-id: 4c87e6ed48416a0aad695861893f183bee7c5252 (cherry picked from commit `f8868629c1`)	2022-01-25 18:32:13 +00:00
Rohan Varma	d8abe813bc	[LocalSGD] Move feature to Beta, clean up some docs (#71621 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71621 Moves this feature to beta as discussed, and cleans up some docs. Synced offline with wayi1 who mentioned that the current names are preferred as he works to prototype hierarchical allreduce as discussed in this RFC: https://github.com/pytorch/pytorch/issues/71325. ghstack-source-id: 147382940 Test Plan: CI Reviewed By: zhaojuanmao Differential Revision: D33700444 fbshipit-source-id: 8eb543f5b02a119d0790a5c0919e6def6383a067 (cherry picked from commit `656e9809b2`)	2022-01-21 21:10:42 +00:00
Adnios	a9c7d626e1	Add the `maximize` flag to AdamW (#70146 ) Summary: Related issue: https://github.com/pytorch/pytorch/issues/68052 cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang Pull Request resolved: https://github.com/pytorch/pytorch/pull/70146 Reviewed By: malfet Differential Revision: D33254561 Pulled By: albanD fbshipit-source-id: f190c836a4162f936c5953e076747c345df21421	2021-12-23 09:20:29 -08:00
oliver	3d358a7678	Adds a `maximize` flag to Adam (#68164 ) Summary: Solves the next most important use case in https://github.com/pytorch/pytorch/issues/68052. I have kept the style as close to that in SGD as seemed reasonable, given the slight differences in their internal implementations. All feedback welcome! cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang Pull Request resolved: https://github.com/pytorch/pytorch/pull/68164 Reviewed By: VitalyFedyunin Differential Revision: D32994129 Pulled By: albanD fbshipit-source-id: 65c57c3f3dbbd3e3e5338d51def54482503e8850	2021-12-13 05:53:53 -08:00
oliver	f8297d40fc	Adds a `maximize` flag to SGD. (#67847 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/46480 -- for SGD. ## Notes: - I have modified the existing tests to take a new `constructor_accepts_maximize` flag. When this is set to true, the `_test_basic_cases_template` function will test both maximizing and minimizing the sample function. - This was the clearest way I could think of testing the changes -- I would appreciate feedback on this strategy. ## Work to be done: [] I need to update the docs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/67847 Reviewed By: H-Huang Differential Revision: D32252631 Pulled By: albanD fbshipit-source-id: 27915a3cc2d18b7e4d17bfc2d666fe7d2cfdf9a4	2021-11-09 00:43:07 -08:00
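The effect of the `maximize` flag described in the three commits above can be illustrated with a minimal pure-Python sketch. This is not the `torch.optim` implementation: `sgd_step` and `run` are hypothetical helpers, and the sketch only assumes that `maximize=True` negates the gradient before a plain SGD update, so that descent becomes ascent.

```python
def sgd_step(params, grads, lr=0.1, maximize=False):
    """Return updated parameters after one plain SGD step (hypothetical helper)."""
    updated = []
    for p, g in zip(params, grads):
        # Assumption: maximize=True flips the gradient sign, so the same
        # update rule climbs the objective instead of descending it.
        g = -g if maximize else g
        updated.append(p - lr * g)
    return updated


def run(objective_grad, x0, steps, maximize):
    """Apply `steps` SGD updates to a single scalar parameter."""
    x = [x0]
    for _ in range(steps):
        x = sgd_step(x, [objective_grad(x[0])], maximize=maximize)
    return x[0]


# Minimizing f(x) = x**2 (gradient 2x) ...
x_min = run(lambda x: 2 * x, 1.0, 50, maximize=False)
# ... and maximizing g(x) = -x**2 (gradient -2x) both drive x toward 0.
x_max = run(lambda x: -2 * x, 1.0, 50, maximize=True)
```

Testing a minimized function next to its negated, maximized counterpart mirrors the strategy the SGD commit describes for `_test_basic_cases_template`.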
Pritam Damania	05e17e7ff6	Add API usage logging for several other RPC APIs. (#67722 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67722 ghstack-source-id: 142259452 Test Plan: waitforbuildbot Reviewed By: jaceyca, fduwjj Differential Revision: D32118872 fbshipit-source-id: 041ab5601221b1846c56ce4bb63364bec9ad28b0	2021-11-03 14:02:00 -07:00
Rohan Varma	b51731527d	[ez] [Docs] Missing import in example for post_local_sgd (#67047 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67047 Fix missing import ghstack-source-id: 141258423 Test Plan: CI Reviewed By: mrshenli Differential Revision: D31841837 fbshipit-source-id: 139e614517dcac7a53259ff7a0360bb5275bb53b	2021-10-24 01:44:06 -07:00
Yi Wang	c1415a0a72	[Reland] [Model Averaging] Simplify PostLocalSGD Optimizer API (#65197 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65197 1. The constructor accepts a local optimizer instance instead of the inputs of local optimizer constructor and the class type. 2. The parameters are read from local optimizer's param_groups instead of a separate input. Proposal: https://github.com/pytorch/pytorch/issues/59699 ghstack-source-id: 138307226 Test Plan: buck test mode/dev-nosan //caffe2/test/distributed:distributed_nccl_spawn -- test_post_localSGD_optimizer_parity Reviewed By: rohan-varma Differential Revision: D31007439 fbshipit-source-id: bbb0526e6763ef76775b85088571506b3942c722	2021-09-17 10:31:58 -07:00
Alban Desmaison	8800a8b428	Revert D30888794: [Model Averaging] Simplify PostLocalSGD Optimizer API Test Plan: revert-hammer Differential Revision: D30888794 (`3d312b3b8e`) Original commit changeset: 21261b480f6b fbshipit-source-id: 87abb7e8cd9ecaac909ec6c3ee053fa7c4ae1975	2021-09-16 06:39:57 -07:00
Yi Wang	3d312b3b8e	[Model Averaging] Simplify PostLocalSGD Optimizer API (#64885 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64885 1) The constructor accepts a local optimizer instance instead of the inputs of local optimizer constructor and the class type. 2) The parameters are read from local optimizer's `param_groups` instead of a separate input. Proposal: https://github.com/pytorch/pytorch/issues/59699 ghstack-source-id: 137865867 Test Plan: buck test mode/dev-nosan //caffe2/test/distributed:distributed_nccl_spawn -- test_post_localSGD_optimizer_parity Reviewed By: rohan-varma Differential Revision: D30888794 fbshipit-source-id: 21261b480f6bbb9b2333426020e3f350da3f73c2	2021-09-14 16:37:14 -07:00
Rohan Varma	5b8862abf1	[DDP] Support step_param for AdamW (#63382 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63382 Per title ghstack-source-id: 135966156 Test Plan: CI Reviewed By: SciPioneer Differential Revision: D30255446 fbshipit-source-id: e6ffbf339db0bc5b4702d02b74a462309df07c75	2021-08-17 17:16:11 -07:00
Yi Wang	068d6fec5c	[Model Averaging] Add a few member methods of PostLocalSGDOptimizer (#63340 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63340 Some methods are needed such as accessing optimizer states. These are necessary for integration with PyTorch Lightning. Proposal: https://github.com/pytorch/pytorch/issues/59699 ghstack-source-id: 135912246 Test Plan: buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_ddp_hook_parity_post_localSGD Reviewed By: rohan-varma Differential Revision: D30328794 fbshipit-source-id: e585b874313bd266fdc7c79936e2af98700c7bad	2021-08-16 16:39:01 -07:00
Andrew Gu	2d75703c6a	Remove req to call step() in training loop (#63164 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63164 Test Plan: Imported from OSS Reviewed By: mrshenli Differential Revision: D30284616 Pulled By: andwgu fbshipit-source-id: afdb677fb08851b139178a9f6d782196f26773e1	2021-08-13 08:22:44 -07:00
Andrew Gu	28f9e108b1	Pass `_allow_empty_param_list` into func opt ctor (#63163 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63163 Test Plan: Imported from OSS Reviewed By: mrshenli Differential Revision: D30284615 Pulled By: andwgu fbshipit-source-id: 4857f5b618ec5b007648737ab532ce605e5d70dc	2021-08-13 08:22:42 -07:00
Andrew Gu	bd81c9178a	Simplify data structures, add uniform approximation, fix mem leak (#63162 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63162 Test Plan: Imported from OSS Reviewed By: mrshenli Differential Revision: D30284617 Pulled By: andwgu fbshipit-source-id: 9bd9e5f89abcc0d3dac56b85d55cc88e843baa9f	2021-08-13 08:20:59 -07:00
Rohan Varma	39ec1da935	[reland] Gate DistributedOptimizers on RPC availability (#62937 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62937 reland due to windows + cuda failure, fix by running it on gloo on windows even with cuda. ghstack-source-id: 135306176 Test Plan: ci Reviewed By: mrshenli Differential Revision: D30177734 fbshipit-source-id: 7625746984c8f858648c1b3632394b98bd4518d2	2021-08-09 14:41:06 -07:00
Andrew Gu	1b1f1e36b4	Add ``allow_empty_param_list`` to functional optimizers (#62522 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62522 Addresses https://github.com/pytorch/pytorch/issues/62481 Test Plan: Imported from OSS Reviewed By: zou3519 Differential Revision: D30072074 Pulled By: andwgu fbshipit-source-id: 1a5da21f9636b8d74a6b00c0f029427f0edff0e3	2021-08-09 11:18:56 -07:00
Natalia Gimelshein	b45cf9b81b	Revert D30117838: [WIP] Gate DistributedOptimizers on RPC availability Test Plan: revert-hammer Differential Revision: D30117838 (`3f09485d7e`) Original commit changeset: e6365a910a3d fbshipit-source-id: f276b2b2bdf5f7bd27df473fca0eebaee9f7aef2	2021-08-06 22:10:41 -07:00
Rohan Varma	3f09485d7e	[WIP] Gate DistributedOptimizers on RPC availability (#62774 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62774 Gates DistributedOptimizer, which relies on RRef, based on whether RPC is available. This should enable ZeRO to work with Windows, as Windows should not try to import the DistributedOptimizer. If this works as expected we can enable the Windows tests for functional/local sgd optimizers as well. ghstack-source-id: 135216642 Test Plan: CI Reviewed By: pbelevich Differential Revision: D30117838 fbshipit-source-id: e6365a910a3d1ca40d95fa6777a7019c561957db	2021-08-06 10:59:00 -07:00
Rohan Varma	1dba329d20	Enable step_param for Adam functional optimizer (#62611 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62611 Enables optimizer overlap with backwards in DDP for Adam. Additional optimizers, especially Adagrad, will be done in follow-up diffs. 1. Implement `step_param` method based on `step` in _FunctionalAdam (perf permitting, we can later dedupe `step` to call `step_param`) 2. Modify tests to test all current functional optimizers. ghstack-source-id: 135207143 Test Plan: CI Reviewed By: SciPioneer Differential Revision: D29891783 fbshipit-source-id: 321915982afd5cb0a9c2e43d27550f433bff00d1	2021-08-06 10:53:55 -07:00
Andrew Gu	62a90c227f	Make _Join, _Joinable, _JoinHook public (#62605 ) Summary: Overview: This removes the preceding `_` from `_Join`, `_Joinable`, and `_JoinHook` in preparation for adding the generic join context manager tutorial (see [here](https://github.com/pytorch/tutorials/pull/1610)). This also adds a docs page, which can be linked from the tutorial. [Here](https://github.com/pytorch/pytorch/files/6919475/render.pdf) is a render of the docs page. Pull Request resolved: https://github.com/pytorch/pytorch/pull/62605 Test Plan: `DistributedDataParallel.join()`: ``` touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" gpurun python test/distributed/test_distributed_fork.py -- TestDistBackendWithFork.test_ddp_uneven_inputs TestDistBackendWithFork.test_ddp_uneven_inputs_stop_iteration_sync_bn TestDistBackendWithFork.test_ddp_grad_div_uneven_inputs TestDistBackendWithFork.test_ddp_uneven_input_join_disable TestDistBackendWithFork.test_ddp_uneven_input_exception ``` `ZeroRedundancyOptimizer`: ``` gpurun4 python test/distributed/optim/test_zero_redundancy_optimizer.py ``` NOTE: DDP overlap tests are failing due to a landing race. See https://github.com/pytorch/pytorch/pull/62592. Once the fix is landed, I will rebase, and tests should be passing. `Join`: ``` gpurun4 python test/distributed/algorithms/test_join.py ``` Reviewed By: mrshenli Differential Revision: D30055544 Pulled By: andwgu fbshipit-source-id: a5ce1f1d9f1904de3bdd4edd0b31b0a612d87026	2021-08-03 12:20:11 -07:00
Andrew Gu	43327cc197	Refactor commonalities between two approaches (#62624 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62624 Test Plan: Imported from OSS Reviewed By: mrshenli Differential Revision: D30058543 Pulled By: andwgu fbshipit-source-id: 73c794062b75e011868fae264f592549eed67482	2021-08-03 08:43:14 -07:00
Andrew Gu	e6a3967c2a	Add invariant check (bucket indices: 0, 1, ..., k-1) (#62623 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62623 Test Plan: Imported from OSS Reviewed By: mrshenli Differential Revision: D30058544 Pulled By: andwgu fbshipit-source-id: a56910f294c6a40118751eebe255b62700f42be9	2021-08-03 08:13:52 -07:00
Andrew Gu	51f687fd4b	Add overlap with DDP to ZeRO (two approaches) (#62157 ) Summary: Overview: This adds two approaches to overlapping `DistributedDataParallel.backward()` with `ZeroRedundancyOptimizer.step()` by providing two hook constructors: `hook_with_zero_step()` and `hook_with_zero_step_interleaved()`. The former waits for all backward computation to finish before starting optimizer computation, while the latter launches a partial optimizer computation using the contents of a gradient bucket once that bucket's all-reduce completes. The two approaches each suffer from their own weaknesses, and which one to use depends on the specific hardware configuration. Both approaches can share changes to `ZeroRedundancyOptimizer`. A user should pass `overlap_with_ddp=True` to `ZeroRedundancyOptimizer`, construct a DDP communication hook using either `hook_with_zero_step()` or `hook_with_zero_step_interleaved()`, and register that communication hook. `ZeroRedundancyOptimizer.step()` should still be called in the training loop, though the optimizer computation and communication will be offloaded to originate from the communication hook. Currently, the first two iterations are vacuous, meaning they do not result in parameter updates and the inputs are ignored. This is required to finalize the DDP bucket strategy and to then initialize the `ZeroRedundancyOptimizer`'s local optimizer based on that bucketing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/62157 Test Plan: The existing `ZeroRedundancyOptimizer` tests pass, and new unit tests for both hooks pass: - ~~`test_ddp_with_zero_step_parity_cpu`~~ (removed for now due to flakiness in CI -- under investigation, could possibly be similar Gloo issue as with `hook_with_zero_step_interleaved()`) - `test_ddp_with_zero_step_parity_gpu` - `test_ddp_with_zero_step_interleaved_parity_gpu` These were tested on the AI AWS cluster. 
An analogous `test_ddp_with_zero_step_interleaved_parity_cpu` is missing due to existing bugs with Gloo. See https://github.com/pytorch/pytorch/pull/62302. Both approaches have been verified using an internal accuracy benchmark. Reviewed By: mrshenli Differential Revision: D29971046 Pulled By: andwgu fbshipit-source-id: a7234c23c7ea253f144a698fd7e3c0fe039de5e8	2021-08-02 08:33:34 -07:00
Yi Wang	2eaf71d749	[Model Averaging] Update model averager API to avoid the redundant `params` arg needed by post-localSGD optimizer (#62132 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62132 as title Proposal: https://github.com/pytorch/pytorch/issues/59699 ghstack-source-id: 134560541 Test Plan: buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_post_localSGD_optimizer_parity buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_periodic_model_averager Reviewed By: rohan-varma Differential Revision: D29887751 fbshipit-source-id: 60dadb04790d800fdcc7cb8a08d060e411718739	2021-07-28 18:43:09 -07:00
Yi Wang	55bee44951	[Model Averaging] Post-localSGD optimizer (#62131 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62131 Wrap `PeriodicModelAverager` as an optimizer. Currently both the optimizer and averager require an input `params` arg, where the latter actually can read params from the optimizer wrapper. Will update averager class API in a follow-up PR. Proposal: https://github.com/pytorch/pytorch/issues/59699 ghstack-source-id: 134560248 Test Plan: buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_post_localSGD_optimizer_parity Reviewed By: rohan-varma Differential Revision: D29881465 fbshipit-source-id: b9634972f4d8bffd3b3eb94f5dbbb19db2bcd759	2021-07-28 18:42:06 -07:00
Wanchao Liang	af0f083d42	[dist_optim] fix the bug of none grads on functional optimizers (#62249 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62249 parameter and grads passed to torch.optim.functional should always match, we should skip the parameters that have none gradients to avoid the size mismatch ghstack-source-id: 134452467 Test Plan: test_dist_optim_none_grads Reviewed By: mrshenli Differential Revision: D29929653 fbshipit-source-id: 4ca6167fecdfe1db422236655edee3aa59b8b044	2021-07-27 18:10:51 -07:00
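The fix described in the commit above, skipping parameters whose gradient is `None` so that the parameter and gradient lists stay the same length, can be sketched in plain Python. `filter_params_with_grads` is a hypothetical helper for illustration, not a function from `torch.optim`, and strings and floats stand in for tensors.

```python
def filter_params_with_grads(params, grads):
    """Keep only (param, grad) pairs whose grad is not None (hypothetical helper)."""
    params_with_grad, present_grads = [], []
    for p, g in zip(params, grads):
        if g is not None:
            params_with_grad.append(p)
            present_grads.append(g)
    return params_with_grad, present_grads


# Stand-ins for tensors: names for parameters, floats for gradients.
params = ["w0", "w1", "w2"]
grads = [0.5, None, -1.0]
kept_params, kept_grads = filter_params_with_grads(params, grads)
# Both lists now have matching lengths and can be passed on together.
```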
Rohan Varma	6dc2c07304	[Reland] [DDP] Implement a hook which performs FunctionalSGD step. (#62177 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62177 Reland of https://github.com/pytorch/pytorch/pull/61678 Fix CI failure by gating including torchvision model on whether torchvision is available or not. ghstack-source-id: 134282165 Test Plan: CI Reviewed By: SciPioneer Differential Revision: D29904101 fbshipit-source-id: 47e799eb4a90acbbda91c5857ea00de3045d49f5	2021-07-26 11:56:56 -07:00
Rohan Varma	2299d6a013	Revert D29701447: [DDP] Implement a hook which performs FunctionalSGD step. Test Plan: revert-hammer Differential Revision: D29701447 (`bd95cf4473`) Original commit changeset: 183954593b82 fbshipit-source-id: 714e6a2b698147db9533a67783aed2a65d9d5bfe	2021-07-25 22:23:30 -07:00
Rohan Varma	bd95cf4473	[DDP] Implement a hook which performs FunctionalSGD step. (#61678 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61678 This diff makes the following changes: - Add `step_param` method to `_FunctionalSGD` class which is written similar to `step` but for a single param - Implement a communication hook wrapper that runs a given comm. hook and then applies functional SGD step - Verifies that this is equal to regular allreduce + SGD optimizer ghstack-source-id: 133567598 ghstack-source-id: 134263399 Test Plan: CI Reviewed By: SciPioneer Differential Revision: D29701447 fbshipit-source-id: 183954593b82a092414623292f9b10e675fef96e	2021-07-25 13:36:47 -07:00
Andrew Gu	3e3acf8a9a	Minor documentation fixes (#61785 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61785 Test Plan: Imported from OSS Reviewed By: mrshenli Differential Revision: D29746648 Pulled By: andwgu fbshipit-source-id: 435bbd8894f2ae5c814b9acd562673affea1daf6	2021-07-19 09:01:29 -07:00
Yu Guo	a50a389ca6	Revert D29701479: [pytorch][PR] Remove `_broadcast_object()` from `ZeroRedundancyOptimizer` Test Plan: revert-hammer Differential Revision: D29701479 (`9b5d9b4049`) Original commit changeset: c8d5f9057b32 fbshipit-source-id: 35ab1f399513fb9d1c4e73b1fa906e559d2a6994	2021-07-15 10:03:08 -07:00
Andrew Gu	9b5d9b4049	Remove `_broadcast_object()` from `ZeroRedundancyOptimizer` (#61539 ) Summary: Revised version of https://github.com/pytorch/pytorch/issues/60573. Overview: This makes two changes: - It introduces a `map_location` argument to `broadcast_object_list()`. The argument specifies the device to load tensors contained in objects received from the broadcast. This change requires modifying the implementation of `_object_to_tensor()` and `_tensor_to_object()` to use `torch.save()` and `torch.load()` respectively. - It removes all calls to `_broadcast_object()` in `ZeroRedundancyOptimizer` and the corresponding test file in favor of `broadcast_object_list()`. The default value of `map_location` is `None`, in which case `_object_to_tensor()` and hence `broadcast_object_list()` preserve their original behavior. Namely, contained tensors are loaded to their original device. In `consolidate_state_dict()`, I specify `map_location=torch.device("cpu")` instead of `self._default_device`. This slightly changes the behavior from before when using `_broadcast_object()`. The reason I do so is that it saves one GPU to CPU data transfer since the action immediately after receiving the broadcasted `local_state_dict` is to copy it to CPU. 
Explicitly, if `map_location=self._default_device`, then the data transfer path assuming NCCL backend is as follows: `source GPU --[before serialize]--> source CPU --[before broadcast]--> source GPU --[broadcast]--> destination GPU --[before deserialize]--> destination CPU --[deserialize]--> destination GPU --[copy]--> destination CPU` Hence, by setting `map_location=torch.device("cpu")` instead, the suffix becomes: `destination CPU --[deserialize]--> destination CPU --[copy]--> destination CPU` Pull Request resolved: https://github.com/pytorch/pytorch/pull/61539 Test Plan: I added a test `test_broadcast_object_list_map_location()` that checks for both `map_location` as CPU and GPU that (1) tensors contained in broadcasted objects are appropriately loaded onto the specified device and (2) that the contents of the tensors are correct. The existing `ZeroRedundancyOptimizer` tests pass. ``` gpurun4 python test/distributed/optim/test_zero_redundancy_optimizer.py ``` The existing `broadcast_object_list()` test passes: ``` touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" gpurun python test/distributed/test_distributed_fork.py -- TestDistBackendWithFork.test_broadcast_object_list ``` Reviewed By: zou3519 Differential Revision: D29701479 Pulled By: andwgu fbshipit-source-id: c8d5f9057b32e5e9f40e8edc5b2cc25fb21414a9	2021-07-14 17:36:30 -07:00
Andrew Gu	57feb35474	Refactor non-joined process computation (#61555 ) Summary: Overview: This refactors the computation on non-joined processes relating to the join context manager. The concept was inspired by a comment from pritamdamania. Changes: This introduces a `_Joinable` abstract base class, which requires a `_join_hook()` method and `_join_device()` and `_join_process_group()` property methods. Any class that we want to be compatible with the generic join context manager should inherit from `_Joinable` and implement `_join_hook()`, `_join_device()`, and `_join_process_group()`. (The `device` and `process_group` information has been moved from `_JoinHook` to `_Joinable`.) The generic join context manager now takes in a `List[_Joinable]` instead of `List[_JoinHook]`. The motivation for this is that previously, by passing the `_JoinHook`s into the context manager, the class providing a `_JoinHook` can modify the context manager's behavior, but the context manager cannot modify the class's behavior. This is solved by giving the context manager a reference to the class's instance. This implementation reserves the field `_join_config` in every `_Joinable` to store a `_JoinConfig` instance, which holds all dynamic fields needed from the `_Joinable` for the join context manager: `enable`, `throw_on_early_termination`, and `is_first_joinable`. ("dynamic" here means that for a given `_Joinable` instance, the values for those fields may change across different join context usages.) In particular, these fields are needed to implement a method `notify_join_context()`, which encapsulates the computation performed on non-joined processes relating to the join context manager --- (1) the all-reduce to indicate that the process has not yet joined and (2) the all-reduce to check whether to throw an exception if `throw_on_uneven_inputs=True`. 
The idea is that every `_Joinable` class only needs to make a call to `notify_join_context()` before its per-iteration collective communications; it is a simple one-line addition. Only the first `_Joinable` instance passed into the context manager actually performs the collective communications in `notify_join_context()`. In that case, the method returns an async work handle for the initial all-reduce indicating that the process has not yet joined. Otherwise, the method returns `None`. This conditional logic is handled internally without additional input from the user. New API: Now, the example usage would look like: ``` ddp_model = DistributedDataParallel(...) zero_optim = ZeroRedundancyOptimizer(ddp_model.parameters(), ...) with _Join([ddp_model, zero_optim]): ... ``` Any arguments meant for a join hook (e.g. `divide_by_initial_world_size`) must be specified as keyword arguments. For example: ``` with _Join([ddp_model, zero_optim], divide_by_initial_world_size=False): ... ``` They will be forwarded to every `_join_hook()` function via `kwargs`. This creates a clear separation between the variables needed by the context manager (`enable` and `throw_on_early_termination`) and those needed by the `_Joinable` class (e.g. `divide_by_initial_world_size`). Recap: After this change, the relevant information to use the generic join context manager looks like the following (omitting prefix `_` from names): - Suppose we have a class `C` (e.g. `DistributedDataParallel`) that we want to be able to use the `Join` context. - We make `C` inherit from `Joinable` and implement `join_hook() -> JoinHook`, `join_device()`, and `join_process_group()`. - To implement `join_hook()`, we define a `CJoinHook` class inheriting from `JoinHook` and implement `main_hook()` and `post_hook()` as needed. - We locate a place before `C`'s per-iteration collective communications and add a call to `Join.notify_join_context()`. - We call `Joinable.__init__(self)` in `C`'s constructor. 
- The `C.join_config` field will be used internally by the context manager. This does not affect `C`'s serializability. - Run time arguments for `C`'s join hook can be passed in as keyword arguments to the context manager: `with Join([C()], arg1=..., arg2=...):`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/61555 Test Plan: I ran the existing DDP join tests: ``` touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" gpurun python test/distributed/test_distributed_fork.py -- TestDistBackendWithFork.test_ddp_uneven_inputs TestDistBackendWithFork.test_ddp_uneven_inputs_stop_iteration_sync_bn TestDistBackendWithFork.test_ddp_grad_div_uneven_inputs TestDistBackendWithFork.test_ddp_uneven_input_join_disable TestDistBackendWithFork.test_ddp_uneven_input_exception ``` I ran the ZeRO join tests: ``` gpurun4 python test/distributed/optim/test_zero_redundancy_optimizer.py TestZeroRedundancyOptimizerDistributed.test_zero_join_gpu TestZeroRedundancyOptimizerDistributed.test_zero_join_cpu ``` Reviewed By: zou3519 Differential Revision: D29690359 Pulled By: andwgu fbshipit-source-id: 2950f78de755eb5fb13b95b803dd7c705879a9c7	2021-07-14 08:20:40 -07:00
Andrew Gu	4f4beb8286	Add Model Parallel Support to ZeRO (#61370 ) Summary: Overview: The existing `ZeroRedundancyOptimizer` implementation assumes that all model parameters are stored on the same device (due to the recent [refactor](https://github.com/pytorch/pytorch/pull/59834)). This change allows model parameters to be sharded across multiple devices, as in the DDP with Model Parallelism example [here](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html). The only logic affected is the bucketing strategy used when `parameters_as_bucket_view=True`. Let `n` denote the world size and `k` denote the number of devices per process. - Previously, `k = 1`, and `self._buckets` was a `List[torch.Tensor]`, where `self._buckets[j]` is a tensor (i.e. bucket) containing the parameters assigned to rank `j` for `j = 0, ..., n - 1`. - Now, `self._buckets` is a `List[List[torch.Tensor]]`, where `self._buckets[i][j]` is a tensor containing the parameters stored on device `i` assigned to rank `j` for `i = 0, ..., k - 1` and `j = 0, ..., n - 1`. This bucket construction uses an auxiliary data structure `self._device_to_per_rank_params`, which is a `Dict[torch.device, List[List[torch.Tensor]]]`. It maps: - `dev_0` to `[rank 0's assigned parameters on dev_0, rank 1's assigned parameters on dev_0, ...]`, - `...` - `dev_{k-1}` to `[rank 0's assigned parameters on dev_{k-1}, rank 1's assigned parameters on dev_{k-1}, ...]` I removed the invariant checker `_verify_same_param_device()` and its corresponding test since it is no longer an invariant. Pull Request resolved: https://github.com/pytorch/pytorch/pull/61370 Test Plan: I added a new test `test_zero_model_parallel()` that checks for parity between a DDP model with model parallelism using `ZeroRedundancyOptimizer` and a local model with the same architecture using a local optimizer. I also verified that the existing tests still pass. 
Reviewed By: soulitzer Differential Revision: D29637132 Pulled By: andwgu fbshipit-source-id: 07112959fa4e94a3f40e67e88cbb58ce3cd1e033	2021-07-09 14:27:47 -07:00
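The nested bucket layout described in this commit can be pictured with a small, torch-free sketch. Everything below is a hypothetical illustration rather than the code from the PR: flat lists of floats stand in for tensors, device names are plain strings, and `build_buckets` is an invented helper that only mirrors the indexing convention `buckets[i][j]` (device `i`, rank `j`).

```python
from typing import Dict, List

# Stand-in for a tensor: a flat list of floats (the real buckets are torch.Tensor).
FakeTensor = List[float]


def build_buckets(
    device_to_per_rank_params: Dict[str, List[List[FakeTensor]]],
) -> List[List[FakeTensor]]:
    """Return buckets where buckets[i][j] concatenates the parameters
    stored on device i that are assigned to rank j (hypothetical helper)."""
    buckets: List[List[FakeTensor]] = []
    for per_rank_params in device_to_per_rank_params.values():
        device_buckets = []
        for rank_params in per_rank_params:
            # Flatten every parameter of this (device, rank) pair into one bucket.
            bucket = [x for param in rank_params for x in param]
            device_buckets.append(bucket)
        buckets.append(device_buckets)
    return buckets


# Toy values: k = 2 devices per process, n = 2 ranks.
device_to_per_rank_params = {
    "dev_0": [[[1.0, 2.0]], [[3.0]]],        # rank 0 and rank 1 params on dev_0
    "dev_1": [[[4.0], [5.0, 6.0]], []],      # rank 1 owns nothing on dev_1
}
buckets = build_buckets(device_to_per_rank_params)
```

With `k = 1` the outer list has a single entry, which corresponds to the earlier `List[torch.Tensor]` layout wrapped in one more level of nesting.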
Andrew Gu	179249084b	Refactor DDP join() API, adding hooks (#60757) Summary: Targets https://github.com/pytorch/pytorch/issues/54318. Overview: DDP offers a `join()` context manager to accommodate training on uneven inputs. This creates a new generic `_Join()` API permitting custom hooks, refactors DDP `join()` to call this generic `_Join()`, and implements a hook for ZeRO. (For now, the generic `_Join()` is implemented as private, but this may change after design discussions are cleared.) There are two classes introduced: `_JoinHook`, the class defining the customizable join hook, and `_Join`, the generic join context manager. The `_JoinHook` provides two entry points: `main_hook()`, which is called repeatedly while there exists a non-joined process, and `post_hook()`, which is called once all processes have joined with the additional `bool` argument `is_last_joiner`. The class also requires `process_group` and `device` information by defining corresponding abstract property methods. Thus, to implement a join hook, (1) inherit from `_JoinHook`, (2) override `main_hook()` and `post_hook()` as appropriate, and (3) override `process_group()` and `device()` to provide process group and device information to be used by the join context manager implementation for collective communications. The `_Join` constructor requires `join_hooks: List[_JoinHook]` and optionally `enable: bool = True` and `throw_on_early_termination: bool = False`. A training loop only needs to be wrapped with `with _Join(join_hooks):` (using the appropriate `join_hooks`) to be able to train on uneven inputs without hanging/erroring. The context manager requires a `dist.all_reduce(torch.ones(1))` to be called on every non-joined process each time before it performs its collective communications in order to indicate that the process has not yet joined. It also requires that all `process_group` attributes in the `_JoinHook` objects are the same. 
Notes: - The argument `is_last_joiner` to `post_hook()` may be useful for finding an authoritative rank when synchronizing. - `enable` is a flag that can be set to `False` if the user knows the current training loop will not have uneven inputs. This may be used to disable join-related computation in the classes providing join hooks. - `throw_on_early_termination` is a flag that can be set to `True` to notify processes to terminate upon detecting uneven inputs (i.e. upon the first process joining when there exists a non-joined process). Notably, the notification requires an all-reduce, so to prevent hanging/erroring, non-joined processes must participate in the all-reduce. The first-joining process raises a `RuntimeError`, and the other processes are expected (but not required) to do the same. This may be used to implement training on uneven inputs in cases that do not conform to the generic join context manager (e.g. `SyncBatchNorm`). - Classes providing a join hook should do so via a `_join_hook()` method that returns a `_JoinHook` instance with the methods appropriately overridden. - If there are multiple join hooks, the device specified by the first is used by the join context manager implementation to perform its collective communications. - If there are multiple join hooks, both the main and post-hooks are iterated in the order in which the `_JoinHook` objects are passed into the context manager constructor. Pull Request resolved: https://github.com/pytorch/pytorch/pull/60757 Test Plan: The current implementation preserves backward compatibility by not changing the existing DDP `join()` API at all. 
To check this, I ran through the uneven input tests (`test_ddp_grad_div_uneven_inputs`, `test_ddp_uneven_inputs_stop_iteration_sync_bn`, `test_ddp_uneven_inputs`, `test_ddp_uneven_input_join_disable`, `test_ddp_uneven_input_exception`) on the AI AWS cluster: ``` touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" gpurun python test/distributed/test_distributed_fork.py -- ``` Because the existing DDP join logic does not provide correct gradients to the joined processes if `gradient_as_bucket_view=False` and a joined process requires those gradients to correctly update its shard of the parameters in `ZeroRedundancyOptimizer.step()`, DDP and ZeRO are not fully compatible at the moment. To work around this and to test ZeRO's join hook separately, I added a test `_test_zero_join()` (with `test_zero_join_gpu()` and `test_zero_join_cpu()` flavors), which compares DDP with a local optimizer on uneven inputs against ZeRO on uneven inputs with the gradients set manually. Reviewed By: iramazanli, mrshenli Differential Revision: D29624636 Pulled By: andwgu fbshipit-source-id: ec70a290e02518b0d8b683f9fed2126705b896c7	2021-07-09 08:29:20 -07:00
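The hook protocol described in this commit (a `main_hook()` that joined processes call while others are still training, then a `post_hook(is_last_joiner)` once everyone has joined) can be illustrated with a single-process simulation. This is only a sketch under loud assumptions: `ToyJoinHook`, `CountingHook`, and `simulate_join` are hypothetical names, there is no `torch.distributed` involved, and the count of non-joined ranks merely stands in for the all-reduce of ones mentioned above.

```python
from abc import ABC, abstractmethod
from typing import List, Optional


class ToyJoinHook(ABC):
    """Hypothetical stand-in for `_JoinHook` (no process group or device)."""

    @abstractmethod
    def main_hook(self) -> None: ...

    @abstractmethod
    def post_hook(self, is_last_joiner: bool) -> None: ...


class CountingHook(ToyJoinHook):
    """Records how often each entry point is invoked."""

    def __init__(self) -> None:
        self.main_calls = 0
        self.was_last_joiner: Optional[bool] = None

    def main_hook(self) -> None:
        self.main_calls += 1

    def post_hook(self, is_last_joiner: bool) -> None:
        self.was_last_joiner = is_last_joiner


def simulate_join(num_inputs: List[int], hooks: List[ToyJoinHook]) -> None:
    """Simulate ranks with uneven input counts inside one process."""
    remaining = list(num_inputs)
    while True:
        # Stand-in for the all-reduce of ones: count the non-joined ranks.
        num_nonjoined = sum(1 for r in remaining if r > 0)
        if num_nonjoined == 0:
            break
        for rank, hook in enumerate(hooks):
            if remaining[rank] > 0:
                remaining[rank] -= 1   # one training iteration on this rank
            else:
                hook.main_hook()       # joined rank shadows the others
    last = max(num_inputs)
    for rank, hook in enumerate(hooks):
        hook.post_hook(is_last_joiner=(num_inputs[rank] == last))


hooks = [CountingHook() for _ in range(3)]
simulate_join([3, 5, 2], hooks)
main_calls = [h.main_calls for h in hooks]
last_joiners = [h.was_last_joiner for h in hooks]
```

In this toy run the rank with 5 inputs never calls `main_hook()` and is the only one flagged as the last joiner, which matches the role the commit message suggests for `is_last_joiner` as an authoritative rank.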
Andrew Gu	b770c4b61a	Fix ZeRO sort to be by numel (#60556) Summary: Overview: This is a follow-up to [this PR](https://github.com/pytorch/pytorch/pull/59586) and corrects the ZeRO partitioning algorithm to sort by the number of elements in the tensor rather than the size of the first dimension. As context, that PR was meant to migrate from using a _naive greedy_ algorithm to a _sorted-greedy_ algorithm when partitioning parameters in ZeRO. Updated Results: The updated table for the partitions can be found [here](https://github.com/pytorch/pytorch/pull/59410#issuecomment-865203219). There, I also considered a third algorithm (sometimes known as multifit), which is more computationally expensive than the greedy and sorted-greedy algorithms but cannot perform worse. However, because of its increased complexity and lack of improved results, I chose to settle with the simpler sorted-greedy algorithm. The `step()` latencies show slight improvements, but the improvements may be in the noise. The values below are in seconds and were generated using NCCL backend (unlike in the previous PR which used Gloo):

Two processes:

| Model | Max `optimizer.step()` Time - Greedy (Std.) | Max `optimizer.step()` Time - Sorted-Greedy (Std.) |
| --- | --- | --- |
| ResNet-50 | 0.047 (0.00142) | 0.044 (0.00025) |
| ResNet-152 | 0.057 (0.00034) | 0.054 (0.00022) |
| BERT | 0.021 (0.00008) | 0.020 (0.00008) |

Four processes:

| Model | Max `optimizer.step()` Time - Greedy (Std.) | Max `optimizer.step()` Time - Sorted-Greedy (Std.) |
| --- | --- | --- |
| ResNet-50 | 0.019 (0.00065) | 0.013 (0.00040) |
| ResNet-152 | 0.045 (0.00024) | 0.045 (0.00025) |
| BERT | 0.019 (0.00022) | 0.018 (0.00016) |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60556 Test Plan: I verified that the ZeRO tests pass (via the AI AWS cluster): ``` srun -p $DEV_QUEUE --cpus-per-task=16 -t 5:00:00 --gpus-per-node=4 python test/distributed/optim/test_zero_redundancy_optimizer.py ``` Reviewed By: VitalyFedyunin Differential Revision: D29335260 Pulled By: andwgu fbshipit-source-id: 469d1c6e029b77c1b300a94cd1fd94b633cd28dd	2021-06-23 15:22:36 -07:00
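To see why the sort key matters, consider a toy comparison of the two orderings. The parameter names and shapes below are made up for illustration; shapes are plain tuples so the sketch runs without torch, and `math.prod` plays the role of `numel()`.

```python
import math

# Hypothetical parameter shapes (name -> shape).
shapes = {
    "conv.weight": (64, 3, 7, 7),   # 9408 elements, small first dimension
    "fc.weight": (1000, 2),         # 2000 elements, large first dimension
    "bias": (2048,),                # 2048 elements
}

# Sorting by the size of the first dimension (the behavior being corrected).
by_first_dim = sorted(shapes, key=lambda name: shapes[name][0], reverse=True)

# Sorting by the total number of elements (the corrected behavior).
by_numel = sorted(shapes, key=lambda name: math.prod(shapes[name]), reverse=True)
```

Here the convolution weight is the largest tensor by element count but would be placed last when sorting on the first dimension, so a greedy pass would see it only after the smaller tensors had already been assigned.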
Andrew Gu	f0e4e4be72	Clean Up ZeRO (#60285 ) Summary: Overview: Being relatively new to PyTorch and ZeRO, I found parts of the code slightly hard to follow. This change strives to clean up the `ZeroRedundancyOptimizer` code in `zero_redundancy_optimizer.py` by reorganizing some computations, making variable names more explicit and consistent, and unifying terminology in the documentation. The goal is for the code to be easier to extend afterwards. Changes: 1) `state_dict()`: The [logic](`85517a2b70/torch/distributed/optim/zero_redundancy_optimizer.py (L510)`) for updating the global `state_dict` with each rank's local `state_dict` is simplified and made more explicit. Notably, the `dict` [`local_index_to_param_id`](`85517a2b70/torch/distributed/optim/zero_redundancy_optimizer.py (L513)`) is unneeded. It maps `local_pg["params"][i]` to `id(global_pg["params"][i])`, so it is equivalent to make a single pass over both lists in tandem, effectively iterating over `i`, without a need for the explicit `dict`. 2) `_update_trainable()`: The function [initializes](`85517a2b70/torch/distributed/optim/zero_redundancy_optimizer.py (L597)`) the local optimizer if it does not exist. I am unaware of any reason for the local optimizer to be destroyed after initialization, so I moved that logic to its own function `_init_local_optimizer()`, which is called once in the constructor. After [discussion](https://github.com/pytorch/pytorch/pull/60285#discussion_r654706728), I removed the function `_update_trainable()` itself in favor of adding a check for `parameters_as_bucket_view` in `build_param_buckets()` directly. 3) `rank_local_state_dict()`: This [function](`85517a2b70/torch/distributed/optim/zero_redundancy_optimizer.py (L528)`) is currently broken. It appears to be legacy and relies on the input `state_dict` to have the key `"partitions"`. For now, I have removed it and added an [issue](https://github.com/pytorch/pytorch/issues/60284). 
Is it a notable use case to want to access another rank's `state_dict` in particular (as opposed to consolidating the entire state and then accessing)? 4) `local_state_dict():` After [discussion](https://github.com/pytorch/pytorch/pull/60285#discussion_r655571043), I removed the function. 5) `partition_parameters()`: After [discussion](https://github.com/pytorch/pytorch/pull/60285#discussion_r654708183), I renamed the function to `_partition_parameters()` to mark it as private. 6) `_param_to_index`: After [discussion](https://github.com/pytorch/pytorch/pull/60285#discussion_r654828100), I changed the key to be the parameter itself rather than its integer ID. 7) `buckets`: I renamed the data structure to `_buckets` to mark it as private. 8) Terminology: I tried to reduce the set of terms being used instead of juggling a number of synonyms. In particular, I made an effort to distinguish between "local" and "global" and to make names more indicative of typing. 9) Style: Per the [PyTorch contributing guide](https://github.com/pytorch/pytorch/blob/master/CONTRIBUTING.md#writing-documentation), I made all docstrings abide by the 80 character limit, except for the one [line](`554891f6fa/torch/distributed/optim/zero_redundancy_optimizer.py (L142)`) showing the example ZeRO usage. Some code lines violate the limit for readability. Also, I unified some of the minor stylistic usages out of habit. Pull Request resolved: https://github.com/pytorch/pytorch/pull/60285 Test Plan: The test suite passes as expected (on the AI AWS cluster): ``` gpurun python test/distributed/optim/test_zero_redundancy_optimizer.py ``` I visually inspected the generated HTML doc (as generated following [this](https://github.com/pytorch/pytorch/blob/master/CONTRIBUTING.md#writing-documentation)). Reviewed By: mrshenli Differential Revision: D29320726 Pulled By: andwgu fbshipit-source-id: 23f69a19ecc5e877a38fe1df0da11329428311dd	2021-06-23 07:21:40 -07:00
Andrew Gu	c6cdb4f113	Refactor ZeroRedundancyOptimizer Assuming SPSD (#59834 ) Summary: Overview: This refactors the `ZeroRedundancyOptimizer` implementation to assume single-process single-device (SPSD) instead of accommodating single-process multiple-device (SPMD). `DistributedDataParallel` [retired SPMD recently](https://github.com/pytorch/pytorch/issues/47012), so this change follows the same spirit. Changes: The parent-class `Optimizer` constructor permits the input argument `params` to be both an `iterable` of `torch.Tensor` and an `iterable` of `dict`. The latter usage is for initializing the optimizer with multiple `param_group`s to start. However, currently, `ZeroRedundancyOptimizer` only supports the former usage, requiring explicit calls to `add_param_group()` for multiple `param_group`s. Given the existing implementation, the type error would be silent and not manifest until much later (e.g. since `super().__init__()` would have no issue). Hence, I added a series of checks to begin the `__init__()` function (encapsulated in `_verify_and_init_params()`). A postcondition of this validation is that `self._all_params` is a non-empty list of all model parameters. Additionally, I added a check for SPSD usage assuming that all model parameters exist on the same device. This logic is included in `_verify_same_param_device()` and is called immediately after the `params` type-checking. Support for SPSD with model parameters sharded across devices may be added in the future. Related to that aforementioned post-condition on `self._all_params`, previously there was undefined behavior resulting from different typing of the passed in `params` input argument. If `params` was a `List`, then the usage of `self._reference_is_trainable_mask` was as expected. However, if `params` was a generator (e.g. as in the canonical usage of passing `model.parameters()`), then the ensuing behavior was divergent. This is because after a generator is iterated over, it is empty. 
As a result, when we set `self._all_params = params` [in the old code](`68d690ffbd/torch/distributed/optim/zero_redundancy_optimizer.py (L165)`), `self._all_params` is empty, reducing `training_mask` to always be the empty list. This causes missed calls to `_update_trainable()` in `step()`. (A consequence of this is that `test_pytorch_parity()`, which is renamed to `test_local_optimizer_parity()`, now outputs warnings about the trainable parameters changing.) The existing implementation assumes that all parameters share the same dense type when allocating the bucket buffers. This change preserves this assumption, which may be removed in the future. I added a check for this in `_verify_same_dense_param_type()` to avoid erroring silently later on. Note that it is insufficient to simply check for the same `dtype` since dense and sparse tensors may share the same `dtype` but require differing storage sizes. One solution is to use `torch.typename()` as the means for comparison. --- The primary change in this refactor is with respect to `self._per_device_params` and `self.buckets`. `self._per_device_params` mapped `torch.device` to `List[List[Parameter]]`. The keys were the devices that the model parameters exist on, and the values designated which ranks are assigned to updating those parameters. `self.buckets` mapped `torch.device` to `List[torch.Tensor]`. The keys were the same as `self._per_device_params`, and the values were the buckets for that device. The usage of these two data structures were confined to each other only. Hence, because the notions of device and rank are now in 1:1 correspondence, we can eliminate the former completely and only use rank. As such, I removed `self._per_device_params` and made `self.buckets` directly a list of buckets (i.e. `torch.Tensor`s). Iteration over the parameters of a rank for a given device could be simplified to just iteration over the parameters of a rank. 
Hence, I relied on `self.partition_parameters()` now for that iteration. Refer to `_setup_flat_buffers()` and `step()` for these changes. One convenient side effect of removing `self._per_device_params` is that there is no longer the re-computation of the parameter partitions mentioned at the end of this [PR](https://github.com/pytorch/pytorch/pull/59410). --- I changed the data structure `self._index_to_param_cache` from a `dict` to a `List` because the domain is `0`, `1`, ..., `k-1` where `k` is the number of parameters. This should yield marginal improvements in memory usage and access speed. `_sync_param_groups()` is a static method, meaning it can be called either via `self._sync_param_groups()` or `ZeroRedundancyOptimizer._sync_param_groups()` when inside the class. I made the usage consistently `self._sync_param_groups()` rather than have instances of both. Pull Request resolved: https://github.com/pytorch/pytorch/pull/59834 Test Plan: I ran through the existing test suite on an AI AWS cluster: ``` srun -p $DEV_QUEUE --cpus-per-task=16 -t 5:00:00 --gpus-per-node=4 python test/distributed/optim/test_zero_redundancy_optimizer.py ``` Note: The only test where `parameters_as_bucket_view` is `True` is `test_step_with_closure()`, meaning that that is the test that exercises the core changes of removing `self._per_device_params` and changing `self.buckets`. Also, I added tests for the `ZeroRedundancyOptimizer` constructor changes and the assumption checks. Reviewed By: mrshenli Differential Revision: D29177065 Pulled By: andwgu fbshipit-source-id: 0ff004ae3959d6d3b521024028c7156bfddc93d8	2021-06-16 20:52:13 -07:00
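The generator pitfall described in this commit is ordinary Python behavior and can be reproduced without torch. The sketch below uses a hypothetical `parameters()` generator as a stand-in for `model.parameters()`; the variable names are illustrative and do not come from the optimizer's source.

```python
def parameters():
    """Hypothetical stand-in for `model.parameters()`, which returns a generator."""
    yield "weight"
    yield "bias"


# Problematic pattern: the generator is consumed once (e.g. by a parent
# constructor), so a later pass over the same object sees nothing.
params = parameters()
seen_by_parent = list(params)
all_params_buggy = list(params)   # generator is already exhausted

# Safer pattern: materialize the iterable into a list once, then reuse the list.
all_params = list(parameters())
seen_by_parent_fixed = list(all_params)
```

Materializing the input up front is one way to obtain the postcondition stated above, namely that the stored list of all parameters is non-empty regardless of whether a list or a generator was passed in.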
andwgu	a4e0368c99	Comment on tests reliance on ZeRO's partitioning algo (#59713 ) Summary: Addresses https://github.com/pytorch/pytorch/issues/59548 Overview: Recently, we changed ZeRO's partitioning algorithm to first sort the parameters by decreasing size and then greedily allocate to shards. See [here](`ea1de87f4b`). The current tests `test_sharding()` and `test_add_param_group()` check for a uniform partitioning, which is not achieved with the old naive greedy partitioning algorithm for general world sizes but is achieved with the new sorted-greedy algorithm. This reliance is not ideal, but for now, we opt to simply add comments to document the dependency. Pull Request resolved: https://github.com/pytorch/pytorch/pull/59713 Test Plan: I tested for world sizes of 1, 2, 3, and 4 via the AI AWS cluster: ``` srun -p $DEV_QUEUE --cpus-per-task=16 -t 5:00:00 --gpus-per-node=1 python test/distributed/optim/test_zero_redundancy_optimizer.py -- TestZeroRedundancyOptimizerDistributed.test_sharding srun -p $DEV_QUEUE --cpus-per-task=16 -t 5:00:00 --gpus-per-node=2 python test/distributed/optim/test_zero_redundancy_optimizer.py -- TestZeroRedundancyOptimizerDistributed.test_sharding srun -p $DEV_QUEUE --cpus-per-task=16 -t 5:00:00 --gpus-per-node=3 python test/distributed/optim/test_zero_redundancy_optimizer.py -- TestZeroRedundancyOptimizerDistributed.test_sharding srun -p $DEV_QUEUE --cpus-per-task=16 -t 5:00:00 --gpus-per-node=4 python test/distributed/optim/test_zero_redundancy_optimizer.py -- TestZeroRedundancyOptimizerDistributed.test_sharding srun -p $DEV_QUEUE --cpus-per-task=16 -t 5:00:00 --gpus-per-node=1 python test/distributed/optim/test_zero_redundancy_optimizer.py -- TestZeroRedundancyOptimizerDistributed.test_add_param_group srun -p $DEV_QUEUE --cpus-per-task=16 -t 5:00:00 --gpus-per-node=2 python test/distributed/optim/test_zero_redundancy_optimizer.py -- TestZeroRedundancyOptimizerDistributed.test_add_param_group srun -p $DEV_QUEUE --cpus-per-task=16 -t 
5:00:00 --gpus-per-node=3 python test/distributed/optim/test_zero_redundancy_optimizer.py -- TestZeroRedundancyOptimizerDistributed.test_add_param_group srun -p $DEV_QUEUE --cpus-per-task=16 -t 5:00:00 --gpus-per-node=4 python test/distributed/optim/test_zero_redundancy_optimizer.py -- TestZeroRedundancyOptimizerDistributed.test_add_param_group ``` However, because the train queue (which offers instances with 8 GPUs) is not working at the moment, I was unable to test for world sizes of 5+. Nonetheless, I believe that they should still work. First, consider `test_sharding()`. Given the sorted-greedy algorithm, each shard will be assigned one of the parameters with size `9`, then one of the parameters with size `7`, then `5`, and finally `3`. Hence, each will have a uniform partition. Now, consider `test_add_param_group()`. Similarly, the same allocation behavior occurs, only the last shard is not assigned the final parameter with size `3` to begin. However, after adding the new `param_group` with the parameter with size `3`, a re-partitioning occurs. The first `param_group` is partitioned as before, and the parameter with size `3` in the new `param_group` is assigned to the last shard since it has the minimal total size. Thus, in the end, all shards have a uniform partition. Reviewed By: mrshenli Differential Revision: D28996460 Pulled By: andwgu fbshipit-source-id: 22bdc638d8569ed9a20836812eac046d628d6df2	2021-06-09 19:56:28 -07:00
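The uniformity argument above, including the untested world sizes of 5 and up, can be checked with a small simulation. This is a sketch under assumptions: it supposes the tests create one parameter of each size `9`, `7`, `5`, `3` per rank, it models parameters only by their sizes, and `sorted_greedy_loads` is a hypothetical helper that carries per-rank totals across parameter groups as the description suggests.

```python
from typing import List


def sorted_greedy_loads(param_groups: List[List[int]], world_size: int) -> List[int]:
    """Return the total size per rank after sorted-greedy partitioning.

    Each group is sorted by decreasing size, and every parameter goes to the
    rank with the smallest running total (ties broken by lowest rank).
    """
    loads = [0] * world_size
    for group in param_groups:
        for size in sorted(group, reverse=True):
            rank = loads.index(min(loads))
            loads[rank] += size
    return loads


sharding_loads = {}
add_group_loads = {}
for world_size in range(1, 9):
    sizes = [9, 7, 5, 3] * world_size
    # Scenario resembling `test_sharding()`: a single parameter group.
    sharding_loads[world_size] = sorted_greedy_loads([sizes], world_size)
    # Scenario resembling `test_add_param_group()`: the final size-3 parameter
    # arrives later in its own parameter group.
    add_group_loads[world_size] = sorted_greedy_loads([sizes[:-1], [3]], world_size)
```

Under these assumptions every rank ends with a total of `24` in both scenarios for world sizes 1 through 8, which is consistent with the expectation that larger world sizes should still work.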
Andrew Gu	ea1de87f4b	Sort params by size (decreasing) Summary: Pull Request: https://github.com/pytorch/pytorch/pull/59586 Task: https://www.internalfb.com/tasks/?t=90847711 Overview: Suppose we have `n` items with positive integer sizes and `k` buckets. We want to assign items to buckets with the goal of uniformity. The precise criteria for uniformity can vary: e.g. minimize the maximum size, maximize the minimum size, etc. This is known as [multiway number partitioning](https://en.wikipedia.org/wiki/Multiway_number_partitioning). ZeRO's partitioning task reduces to solving this problem. In particular, this is the subproblem to be solved for each `param_group` in `self.param_groups`, where the parameters are the items and the ranks give the buckets. The existing implementation uses the linear-time [greedy number partitioning algorithm](https://en.wikipedia.org/wiki/Greedy_number_partitioning#Linear-time_algorithm), which assigns the next tensor-parameter to the process with the smallest total parameter size so far. In this task, I explore the [extension](https://en.wikipedia.org/wiki/Greedy_number_partitioning#Improved_algorithm) where each parameter group is sorted by decreasing size before applying the greedy algorithm, requiring linearithmic time (as dominated by the sort). Experiments The mean number of parameters represents a perfectly uniform allocation and hence the ideal allocation (which may be even better than the optimal partition). In the following tables, I present the maximum number of parameters for any one process and the difference from the mean in parentheses for ResNet-50, ResNet-152, and BERT (the bare BERT model). The best-performing partitioning strategy for each model is bolded. 
Two processes:

| Model | Max Num Params - Greedy (Diff) | Max Num Params - Greedy-Sorted (Diff) | Mean Num Params |
| --- | --- | --- | --- |
| ResNet-50 | 13,249,600 (471,084) | 12,794,816 (16,300) | 12,778,516 |
| ResNet-152 | 30,567,488 (471,084) | 30,111,424 (15,020) | 30,096,404 |
| BERT | 54,749,184 (8,064) | 55,327,488 (586,368) | 54,741,120 |

Four processes:

| Model | Max Num Params - Greedy (Diff) | Max Num Params - Greedy-Sorted (Diff) | Mean Num Params |
| --- | --- | --- | --- |
| ResNet-50 | 7,524,864 (1,135,606) | 6,436,864 (47,606) | 6,389,258 |
| ResNet-152 | 16,232,192 (1,183,990) | 15,090,152 (41,950) | 15,048,202 |
| BERT | 28,151,040 (780,480) | 28,352,256 (981,696) | 27,370,560 |

--- I also investigated the latency of `optimizer.step()` for the different partitioning algorithms. I measured the latency for 30 iterations and took the mean latency per process (excluding the first iteration due to cache coldness). In the following tables, I present the maximum of those mean latencies over all processes and the standard deviation of the latencies contributing to that maximum. Again, the best-performing partitioning strategy for each model is bolded. All entries are presented in seconds and used `gloo` backend.

Two processes:

| Model | Max `optimizer.step()` Time - Greedy (Std.) | Max `optimizer.step()` Time - Greedy-Sorted (Std.) |
| --- | --- | --- |
| ResNet-50 | 0.060 (0.002) | 0.061 (0.002) |
| ResNet-152 | 0.166 (0.003) | 0.160 (0.004) |
| BERT | 0.220 (0.009) | 0.199 (0.006) |

Four processes:

| Model | Max `optimizer.step()` Time - Greedy (Std.) | Max `optimizer.step()` Time - Greedy-Sorted (Std.) |
| --- | --- | --- |
| ResNet-50 | 0.094 (0.004) | 0.093 (0.004) |
| ResNet-152 | 0.228 (0.011) | 0.231 (0.009) |
| BERT | 0.328 (0.015) | 0.329 (0.021) |

Based on the standard deviations, the differences in the latency measurements across the different algorithms appear to be within the uncertainty in the measurement itself. Hence, it is difficult to argue that one algorithm is clearly the fastest. --- `zero.py` is my experiment script, and I use the AI AWS cluster. The run command looks like: ``` srun -p $DEV_QUEUE --cpus-per-task=16 -t 5:00:00 --gpus-per-node=4 python zero.py -b nccl greedy 2 4 ``` This runs the experiment script on an instance with 4 GPUs using `nccl` backend, outputting to a directory named `greedy/`, and using world sizes of 2 and 4. An analogous command can be used after modifying `partition_parameters()`, e.g. replacing `greedy` with `greedy_sorted` as the output directory name. Then, to run the analysis script: ``` python analyze.py greedy greedy_sorted ``` For more details on the experiment code, refer to: https://www.internalfb.com/diff/D28946756 Notes: There exists an optimal solution to this partitioning problem. An algorithm that finds such a solution is the [complete greedy algorithm (CGA)](https://en.wikipedia.org/wiki/Greedy_number_partitioning#An_exact_algorithm), which reduces to the brute-force combinatorial search in the worst case. There exist heuristics to improve the `k = 2` case (i.e. when there are two processes); however, given that `n` in typical use cases is very large, any algorithm that is quadratic or slower is unrealistic. Other exact algorithms are similarly exponential in the worst case, rendering them intractable. 
Given this, I do not currently see a need for future proofing the partitioning algorithm against the introduction of algorithms beyond the naive greedy and the sorted greedy algorithms. --- In the current ZeRO implementation, the core `partition_parameters()` computation happens twice upon initialization (i.e. call to `__init__()`): first from a call to `_param_to_rank()` (i.e. an access to `_param_to_rank`) and then from a call to `_update_trainable()`. `_update_trainable()` sees that no optimizer has been constructed yet, so it clears the cache, eliminating the first `partition_parameters()` computation and performing a redundant re-computation. Here is a typical trace: - [The ZeRO optimizer object is initialized, calling `__init__()`.](`d125694d0b/torch/distributed/optim/zero_redundancy_optimizer.py (L142)`) - [In `__init__()`, `self._device` is set, so it accesses `self._per_device_params`.](`d125694d0b/torch/distributed/optim/zero_redundancy_optimizer.py (L182)`) - [`self._per_device_params` is not cached, so it accesses `self._param_to_rank`.](`d125694d0b/torch/distributed/optim/zero_redundancy_optimizer.py (L340)`) - [`self._param_to_rank` is not cached, so it calls `partition_parameters()`.](`d125694d0b/torch/distributed/optim/zero_redundancy_optimizer.py (L353)`) (first call to `partition_parameters()`) - [`__init__()` later calls `_update_trainable()`.](`d125694d0b/torch/distributed/optim/zero_redundancy_optimizer.py (L185)`) - [In `_update_trainable()`, `self` does not have `attr` `"optim"`, so it clears the cached objects (notably, `self._partition_parameters_cache`).](`d125694d0b/torch/distributed/optim/zero_redundancy_optimizer.py (L591)`) - [`_update_trainable()` calls `self.partition_parameters()`.](`d125694d0b/torch/distributed/optim/zero_redundancy_optimizer.py (L593)`) (second call to `partition_parameters()`) Based on the discussion [here](https://github.com/pytorch/pytorch/pull/59410), this recomputation is unintentional and should be addressed 
in a future diff. Test Plan: I verified that the total number of parameters across the processes was consistent after the partitioning algorithm change. Otherwise, no additional modifications were made to existing tests. Reviewed By: mrshenli Differential Revision: D28946755 fbshipit-source-id: 7ad66a21a963555b3b2e693ba8069d2dddc94c60	2021-06-08 09:47:35 -07:00
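The difference between the linear-time greedy algorithm and its sorted extension, as described in this commit, can be sketched in a few lines. The function name `greedy_partition` and the toy sizes are hypothetical; items are modeled as integers, and ties go to the lowest-numbered bucket, which is an assumption of this sketch rather than something stated in the commit.

```python
from typing import List


def greedy_partition(sizes: List[int], num_buckets: int, sort_first: bool) -> List[List[int]]:
    """Assign each item to the bucket with the smallest running total.

    With `sort_first=True`, items are visited in decreasing size order
    (the sorted-greedy variant); otherwise they are visited as given.
    """
    order = sorted(sizes, reverse=True) if sort_first else list(sizes)
    buckets: List[List[int]] = [[] for _ in range(num_buckets)]
    totals = [0] * num_buckets
    for size in order:
        idx = totals.index(min(totals))   # smallest total so far
        buckets[idx].append(size)
        totals[idx] += size
    return buckets


sizes = [1, 1, 1, 1, 4]
naive = greedy_partition(sizes, num_buckets=2, sort_first=False)
sorted_greedy = greedy_partition(sizes, num_buckets=2, sort_first=True)
naive_max = max(sum(bucket) for bucket in naive)
sorted_max = max(sum(bucket) for bucket in sorted_greedy)
```

On this toy input the naive pass places the large item last and ends with an uneven split, while the sorted pass reaches the mean of `4` per bucket. The BERT rows in the tables above are a reminder that sorting first is a heuristic and does not win on every input.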
Wanchao Liang	cb7c6a536b	[doc] update distributed optimizer doc (#58084 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58084 update the doc for distributed optimizer with TorchScript support. Test Plan: Imported from OSS Reviewed By: rohan-varma Differential Revision: D28363971 Pulled By: wanchaol fbshipit-source-id: df9d2acc1bbb2292d683d2231e1349b8d3946c8f	2021-05-13 23:37:00 -07:00
Sam Estep	75024e228c	Add lint for unqualified `type: ignore` (#56290 ) Summary: The other half of https://github.com/pytorch/pytorch/issues/56272. Pull Request resolved: https://github.com/pytorch/pytorch/pull/56290 Test Plan: CI should pass on the tip of this PR, and we know that the lint works because the following CI runs (before this PR was finished) failed: - https://github.com/pytorch/pytorch/runs/2384511062 - https://github.com/pytorch/pytorch/actions/runs/765036024 Reviewed By: seemethere Differential Revision: D27867219 Pulled By: samestep fbshipit-source-id: e648f07b6822867e70833e23ddafe7fb7eaca235	2021-04-21 08:07:23 -07:00
Wanchao Liang	4611387608	[optim] take kw-only argument for functional optim APIs (#56185 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56185 ghstack-source-id: 126670123 Reviewed By: albanD Differential Revision: D27802169 fbshipit-source-id: f5e1cb2046dcdeecf5f6b0f70892828bf0adb22f	2021-04-15 20:08:04 -07:00
Wanchao Liang	dd090e72b2	[dist_optim] add distributed functional rprop optimizer (#55834 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55834 ghstack-source-id: 126325536 Reviewed By: rohan-varma Differential Revision: D27703878 fbshipit-source-id: 5c8ec9a4ccb4442b2b51d48d75ea5cd506179f14	2021-04-15 15:19:44 -07:00
Wanchao Liang	4e9e7200f2	[dist_optim] Add distributed functional Adamax optimizer (#55833 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55833 Add distributed functional Adamax optimizer, to support in TorchScript ghstack-source-id: 126325538 Reviewed By: rohan-varma Differential Revision: D26696540 fbshipit-source-id: 6242faebd2476847831a05df7f8b0d616f2b5355	2021-04-15 15:19:43 -07:00
Sam Estep	8c798e0622	Forbid trailing whitespace (#53406 ) Summary: Context: https://github.com/pytorch/pytorch/pull/53299#discussion_r587882857 These are the only hand-written parts of this diff: - the addition to `.github/workflows/lint.yml` - the file endings changed in these four files (to appease FB-internal land-blocking lints): - `GLOSSARY.md` - `aten/src/ATen/core/op_registration/README.md` - `scripts/README.md` - `torch/csrc/jit/codegen/fuser/README.md` The rest was generated by running this command (on macOS): ``` git grep -I -l ' $' -- . ':(exclude)/contrib/' ':(exclude)third_party' \| xargs gsed -i 's/ *$//' ``` I looked over the auto-generated changes and didn't see anything that looked problematic. Pull Request resolved: https://github.com/pytorch/pytorch/pull/53406 Test Plan: This run (after adding the lint but before removing existing trailing spaces) failed: - https://github.com/pytorch/pytorch/runs/2043032377 This run (on the tip of this PR) succeeded: - https://github.com/pytorch/pytorch/runs/2043296348 Reviewed By: walterddr, seemethere Differential Revision: D26856620 Pulled By: samestep fbshipit-source-id: 3f0de7f7c2e4b0f1c089eac9b5085a58dd7e0d97	2021-03-05 17:22:55 -08:00
Benjamin Lefaudeux	43906f9b8b	[ZeroRedundancyOptimizer] Minor stub fix (#53165 ) Summary: Not sure how important that is Tied to https://github.com/pytorch/pytorch/issues/53108 Pull Request resolved: https://github.com/pytorch/pytorch/pull/53165 Reviewed By: albanD Differential Revision: D26781956 Pulled By: blefaudeux fbshipit-source-id: b7daca0ea95be190a5ffeae12123e301204ed4eb	2021-03-03 10:15:10 -08:00
Shen Li	29034b9487	[Reland] Update and expose ZeroRedundancyOptimizer docs (#53112 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53112 Test Plan: Imported from OSS Reviewed By: blefaudeux Differential Revision: D26752289 Pulled By: mrshenli fbshipit-source-id: 897257417b530e6e18788cb40c44e5cb7ac688d5	2021-03-02 14:16:12 -08:00
Shen Li	931100f829	Revert D26696938: Update and expose ZeroRedundancyOptimizer docs Test Plan: revert-hammer Differential Revision: D26696938 (`a586c02962`) Original commit changeset: dafb00e5c9f0 fbshipit-source-id: b08604d2009f4df7b620699dd6659dfed2b02792	2021-03-02 07:14:23 -08:00
Shen Li	a586c02962	Update and expose ZeroRedundancyOptimizer docs (#52937 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52937 Test Plan: Imported from OSS Reviewed By: blefaudeux Differential Revision: D26696938 Pulled By: mrshenli fbshipit-source-id: dafb00e5c9f0c0c602f471fdcb6416bde74f806b	2021-03-01 20:50:33 -08:00
Benjamin Lefaudeux	812339ca3d	[ZeroRedundancyOptimizer] Buckets as tensor view + minimize public interface (#52987 ) Summary: Updated version following https://github.com/pytorch/pytorch/issues/52764 (including comments from Shen), but this one I expect to be able to land. ZeroRedundancyOptimizer: - bucket as tensor views, optional - make a lot of attributes private - minor unit test refactor - adding coverage in the unit test for with and without bucket views Pull Request resolved: https://github.com/pytorch/pytorch/pull/52987 Reviewed By: mrshenli Differential Revision: D26728851 Pulled By: blefaudeux fbshipit-source-id: f8c745966719c9076c20a554ef56198fb838856c	2021-03-01 14:37:04 -08:00
Benjamin Lefaudeux	249c213462	[ZeroRedundancyOptimizer] Pytorch compliant state (#52960 ) Summary: Same as https://github.com/pytorch/pytorch/issues/52760 which I could not get to land. I just could not live with ghstack/ghimport/randomly broken things, I break enough of them myself, so this is a fresh copy without ghstack shenanigans. I'm hopeful that this can land relatively bug free, and am sorry for the duplications.. What this does: - call the common_utils test runner instead of unittest, because it seems that it's how it should be done - change the returned state from ZeroRedundancyOptimizer to be PyTorch compliant, which has the added benefit of being elastic (world size independent) Pull Request resolved: https://github.com/pytorch/pytorch/pull/52960 Reviewed By: mrshenli Differential Revision: D26710932 Pulled By: blefaudeux fbshipit-source-id: 1d914bc9221442ba1bb2b48f5df10c313e674ece	2021-02-27 11:54:08 -08:00
Benjamin Lefaudeux	7ae7768617	[ZeroRedundancyOptimizer] Remove pseudo futures handling, not needed (#52698 ) Summary: This was mostly needed for ShardedDDP, not used here, dead code removal Pull Request resolved: https://github.com/pytorch/pytorch/pull/52698 Reviewed By: mrshenli Differential Revision: D26617893 Pulled By: blefaudeux fbshipit-source-id: 9bcfca5135bf332ebc1240300978c138d2041146	2021-02-24 11:39:59 -08:00
Chester Liu	58eb23378f	Clean up usage of torch._six partially (#49785 ) Summary: See https://github.com/pytorch/pytorch/issues/42919 Pull Request resolved: https://github.com/pytorch/pytorch/pull/49785 Reviewed By: mruberry Differential Revision: D25963833 Pulled By: bugra fbshipit-source-id: 11c90d6b8d3f206c9d0a4d8621b773beb10c6ba2	2021-02-08 13:58:34 -08:00
Vincent Quenneville-Belair	50d903f19f	[optim] make functional api be private (#51316 ) (#51665 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51665 This reverts commit `896f82aa92`. Test Plan: Imported from OSS Reviewed By: gchanan Differential Revision: D26232608 Pulled By: vincentqb fbshipit-source-id: ca006baf4fb672c11c1bb003c39a29cbadb63dd3	2021-02-03 17:59:05 -08:00
Vincent Quenneville-Belair	896f82aa92	[optim] make functional api be private (#51316 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51316 Make optim functional API be private until we release with beta Test Plan: Imported from OSS Reviewed By: albanD Differential Revision: D26213469 fbshipit-source-id: b0fd001a8362ec1c152250bcd57c7205ed893107	2021-02-03 09:29:33 -08:00
Alban Desmaison	c311b8961a	Revert D26113953: [pytorch][PR] [ZeroRedundancyOptimizer] Elastic and pytorch compatible checkpoints Test Plan: revert-hammer Differential Revision: D26113953 (`bbe18e3527`) Original commit changeset: 030bfeee2c34 fbshipit-source-id: 6c1494ad01c2f96a15601329b4fce3fef4b38a01	2021-02-03 06:12:21 -08:00
Benjamin Lefaudeux	bbe18e3527	[ZeroRedundancyOptimizer] Elastic and pytorch compatible checkpoints (#50956 ) Summary: - Makes it possible to use non-sharded optimizer checkpoints (as long as the model/param groups are the same, of course) - Makes it possible to save with a given world size, and load with another world size - Use Torch Distributed built-in broadcast object list instead of an ad-hoc version Pull Request resolved: https://github.com/pytorch/pytorch/pull/50956 Reviewed By: malfet Differential Revision: D26113953 Pulled By: blefaudeux fbshipit-source-id: 030bfeee2c34c2d987590d45dc8efe05515f2e5c	2021-02-02 14:32:13 -08:00
Wanchao Liang	662b6d2115	[dist_optim] update the doc of DistributedOptimizer (#51314 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51314 updating the doc of DistributedOptimizer to include TorchScript enablement information Test Plan: Imported from OSS Reviewed By: pbelevich Differential Revision: D26156032 Pulled By: wanchaol fbshipit-source-id: 1f3841f55918a5c2ed531cf6aeeb3f6e3a09a6a8	2021-01-29 17:12:52 -08:00
Wanchao Liang	3562ca2da2	[dist_optim] add warning to distributed optimizer (#50630 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50630 Add a warning log to the distributed optimizer to warn the user that the optimizer is created without TorchScript support. Test Plan: Imported from OSS Reviewed By: rohan-varma Differential Revision: D25932777 Pulled By: wanchaol fbshipit-source-id: 8db3b98bdd27fc04c5a3b8d910b028c0c37f138d	2021-01-26 10:30:55 -08:00
Wanchao Liang	2c3c2a4b7a	[dist_optim] add distributed functional AdamW optimizer (#50620 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50620 Add TorchScript compatible AdamW functional optimizer to distributed optimizer Test Plan: Imported from OSS Reviewed By: rohan-varma Differential Revision: D25932774 Pulled By: wanchaol fbshipit-source-id: 64eb4aeaa3cab208d0ebbec7c4d91a9d43951947	2021-01-23 01:04:45 -08:00
Wanchao Liang	3f982e56b1	[dist_optim] add distributed functional RMSprop optimizer (#50619 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50619 Add TorchScript compatible RMSprop functional optimizer to distributed optimizer Test Plan: Imported from OSS Reviewed By: rohan-varma Differential Revision: D25932775 Pulled By: wanchaol fbshipit-source-id: bd4854f9f95a740e02a1bebe24f780488460ba4d	2021-01-23 01:04:41 -08:00
Wanchao Liang	6c81b4d917	[dist_optim] add distributed functional Adadelta optimizer (#50623 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50623 Add TorchScript compatible Adadelta functional optimizer to distributed optimizer Test Plan: Imported from OSS Reviewed By: rohan-varma Differential Revision: D25932772 Pulled By: wanchaol fbshipit-source-id: d59b04e5f0b6bab7e0d1c5f68e66249a65958e0b	2021-01-23 01:04:36 -08:00
Wanchao Liang	cd2067539e	[dist_optim] add distributed functional sgd optimizer (#50618 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50618 Test Plan: Imported from OSS Reviewed By: rohan-varma Differential Revision: D25932778 Pulled By: wanchaol fbshipit-source-id: 8df3567b477bc5ba3556b8c5294cd3da5db963ad	2021-01-23 01:04:32 -08:00
Wanchao Liang	5cbe1e4933	[dist_optim] add distributed functional Adam optimizer (#50624 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50624 Add TorchScript compatible Adam functional optimizer to distributed optimizer Test Plan: Imported from OSS Reviewed By: rohan-varma Differential Revision: D25932770 Pulled By: wanchaol fbshipit-source-id: cab3f1164c76186969c284a2c52481b79bbb7190	2021-01-23 01:01:37 -08:00
Benjamin Lefaudeux	87fb3707d9	ZeroRedundancyOptimizer: an implementation of a standalone sharded optimizer wrapper (#46750 ) Summary: Implement the first stage of ZeRO, sharding of the optimizer state, as described in [this blog post](https://www.microsoft.com/en-us/research/blog/zero-2-deepspeed-shattering-barriers-of-deep-learning-speed-scale/) and [this paper](https://arxiv.org/abs/1910.02054). This implementation is completely independent from the [DeepSpeed](https://github.com/microsoft/DeepSpeed) framework, and aims at providing ZeRO-compliant building blocks within the PyTorch scheme of things. This works by: - acting as a wrapper to a pytorch optimizer. ZeROptimizer does not optimize anything by itself, it only shards optimizers for distributed jobs - each rank distributes parameters according to a given partitioning scheme (could be updated), and owns the update of a given shard only - the .step() is called on each rank as expected, the fact that the optimizer actually works on a shard of the model is not visible from the outside - when the update is completed, each rank broadcasts the updated model shard to all the other ranks This can be used with DDP, although some communications are wasted in that case (gradients are all-reduced to all ranks). This implementation was initially developed in [Fairscale](https://github.com/facebookresearch/fairscale), and can also be used with an optimized DDP which only reduces to the relevant ranks. More context on ZeRO and PyTorch can be found in [this RFC](https://github.com/pytorch/pytorch/issues/42849) The API with respect to loading and saving the state is a known pain point and should probably be discussed and updated.
Other possible follow ups include integrating more closely to a [modularized DDP](https://github.com/pytorch/pytorch/issues/37002), [making the checkpoints partition-agnostic](https://github.com/facebookresearch/fairscale/issues/164), [exposing a gradient clipping option](https://github.com/facebookresearch/fairscale/issues/98) and making sure that mixed precision states are properly handled. original authors include msbaines, min-xu-ai and myself Pull Request resolved: https://github.com/pytorch/pytorch/pull/46750 Reviewed By: mruberry Differential Revision: D25958918 Pulled By: blefaudeux fbshipit-source-id: 14280f2fd90cf251eee8ef9ac0f1fa6025ae9c50	2021-01-20 14:36:16 -08:00
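The partitioning step described in the commit above (each rank owns the update of one shard of the parameters) can be illustrated with a minimal sketch. The commit message only says "a given partitioning scheme"; the greedy size-balancing rule, the function name `partition_by_size`, and the parameter names below are assumptions made for illustration, not the code from the PR.

```python
def partition_by_size(param_sizes, world_size):
    """Assign each parameter to the rank with the smallest load so far.

    param_sizes: dict mapping a (hypothetical) parameter name to its element count.
    Returns (shards, loads): shards[rank] is the list of names owned by that rank,
    loads[rank] is the total element count assigned to that rank.
    """
    shards = [[] for _ in range(world_size)]
    loads = [0] * world_size
    # Visit the largest parameters first so the greedy choice balances better.
    for name, numel in sorted(param_sizes.items(), key=lambda kv: kv[1], reverse=True):
        rank = loads.index(min(loads))  # least-loaded rank; ties go to the lowest rank
        shards[rank].append(name)
        loads[rank] += numel
    return shards, loads


# Hypothetical parameter sizes, split across two ranks.
sizes = {"a": 100, "b": 60, "c": 50, "d": 10}
shards, loads = partition_by_size(sizes, world_size=2)
```

In this sketch every rank would then run the wrapped optimizer only on its own shard and broadcast the updated parameters to the other ranks, as the commit message describes.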
Wanchao Liang	505be08c75	[dist_optim] serialize compilation when creating dist_optim (#45871 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45871 Attempt to fix https://github.com/pytorch/pytorch/issues/45845 Test Plan: Imported from OSS Reviewed By: pritamdamania87 Differential Revision: D24125209 Pulled By: wanchaol fbshipit-source-id: e3697dd6ef107d8153d2a82d78a17c66d109b4fa	2020-10-07 15:10:41 -07:00
Wanchao Liang	32c355af5b	[dist_optim] introduce distributed functional optimizer (#45221 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45221 This PR introduces a distributed functional optimizer, so that distributed optimizer can reuse the functional optimizer APIs and maintain their own states. This could enable the torchscript compatible functional optimizer when using distributed optimizer, which helps get rid of the GIL and improves the overall performance of training, especially distributed model parallel training Test Plan: Imported from OSS Reviewed By: ailzhang Differential Revision: D23935256 Pulled By: wanchaol fbshipit-source-id: 59b6d77ff4693ab24a6e1cbb6740bcf614cc624a	2020-09-25 17:13:10 -07:00
Shen Li	f05abd1259	Fix example block format in Distributed Optimizer API doc (#34919 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34919 Test Plan: Imported from OSS Differential Revision: D20500013 Pulled By: mrshenli fbshipit-source-id: d28cbdd1ec207e1e8501ce389b7040fb764f12ca	2020-03-17 17:44:09 -07:00
Rohan Varma	f933fa3613	[docs][1.5] update RPC docs to reflect correct use of dist_autograd backwards and dist_optim step() (#34670 ) Summary: - Clarify that `torch.distributed.autograd.backward()` does not use the current thread local autograd context, instead it looks it up based on the context_id passed in - Clarify the same for `torch.distributed.optimizer.optim.step()` Pull Request resolved: https://github.com/pytorch/pytorch/pull/34670 Differential Revision: D20427645 Pulled By: rohan-varma fbshipit-source-id: a1a88de346cdd4dbe65fb2b7627157f86fd2b6a3	2020-03-13 14:09:23 -07:00
Omkar Salpekar	24dd800e6a	[Dist Autograd] Functional API for Dist Autograd and Dist Optimizer (#33711 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33711 Fixed #33480 This makes `dist_autograd.backward` and `dist_optimizer.step` functional by making the user explicitly pass in the `context_id` as opposed to relying on the confusing thread_local context_id. This diff incorporates these API changes and all places where these functions are called. More concretely, this code: ``` with dist_autograd.context(): # Forward pass. dist_autograd.backward([loss.sum()]) dist_optim.step() ``` should now be written as follows: ``` with dist_autograd.context() as context_id: # Forward pass. dist_autograd.backward(context_id, [loss.sum()]) dist_optim.step(context_id) ``` Test Plan: Ensuring all existing dist_autograd and dist_optimizer tests pass with the new API. Also added a new test case for input checking. Differential Revision: D20011710 fbshipit-source-id: 216e12207934a2a79c7223332b97c558d89d4d65	2020-02-26 19:08:28 -08:00
Pritam Damania	359c39b3c2	Use global lock instead of per instance lock. (#31404 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31404 Multiple "trainers" could each create different instances of DistributedOptimizer, which means we can still have a race condition unless we do a truly global per worker lock. ghstack-source-id: 95874624 Test Plan: run unit tests -- unfortunately due to the non-deterministic behavior it's not clear how to unit test this properly. Differential Revision: D19154248 fbshipit-source-id: fab6286c17212f534f1bd1cbdf9f0de002d48c74	2019-12-18 09:22:54 -08:00
Alisson Gusatti Azzolini	07e14c7cd0	DistributedOptimizer: wait for all workers to finish _LocalOptimizer constructor (#30062 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30062 This allows catching exceptions during optimizer creation. ghstack-source-id: 94232436 Test Plan: new unit test. Differential Revision: D18586108 fbshipit-source-id: 71cfdf337fe803dbea8787b4c68e5a52b70a1f68	2019-11-19 18:30:00 -08:00
Pritam Damania	5d69bc1eda	Add docs for distributed optimizer. (#29971 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/29971 ghstack-source-id: 94132160 Test Plan: waitforbuildbot Differential Revision: D18554631 fbshipit-source-id: c4485f7cff5159f423d0f35d1caf71074b62dc28	2019-11-18 18:51:26 -08:00
Alisson Gusatti Azzolini	b0cf43b2dd	Simple distributed optimizer (#29304 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/29304 Implements a simple python distributed optimizer that takes rrefs to parameters that will be optimized. It keeps instances of optimizers remotely and calling step on distributed optimizer will call step on each of the remote optimizers in parallel. ghstack-source-id: 93564364 Test Plan: unit tests. Differential Revision: D18354586 fbshipit-source-id: 85d4c8bfec4aa38d2863cda704d024692511cff5	2019-11-11 12:02:24 -08:00
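
The idea in the commit above (one optimizer instance per worker, with a single `step` call fanning out to all of them in parallel) can be sketched without RPC. This is only an analogy under stated assumptions: local threads stand in for remote workers, and `LocalSGD` and `step_all` are hypothetical names invented here, not part of `torch.distributed.optim`.

```python
from concurrent.futures import ThreadPoolExecutor


class LocalSGD:
    """Hypothetical stand-in for an optimizer instance held on one worker."""

    def __init__(self, params, lr):
        self.params = params  # list of dicts with "value" and "grad" keys
        self.lr = lr

    def step(self):
        # Plain gradient descent update on the parameters this worker owns.
        for p in self.params:
            p["value"] -= self.lr * p["grad"]


def step_all(optimizers):
    """Call step() on every optimizer in parallel and wait for all of them."""
    with ThreadPoolExecutor(max_workers=len(optimizers)) as pool:
        futures = [pool.submit(opt.step) for opt in optimizers]
        for fut in futures:
            fut.result()  # re-raises any exception from a worker's step()


# Two "workers", each owning one parameter.
worker_a = LocalSGD([{"value": 1.0, "grad": 0.5}], lr=0.1)
worker_b = LocalSGD([{"value": 2.0, "grad": -1.0}], lr=0.1)
step_all([worker_a, worker_b])
```

Waiting on every future before returning mirrors the later commit in this list that makes `DistributedOptimizer` wait for all workers so that exceptions surface to the caller.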

225 Commits