pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
fduwjj	1a48ae96ba	[PT-D][Easy] Reformat the optim code within PTD code base (#90399 ) Just run two commands: ``` ufmt format torch/distributed/optim/ ufmt format test/distributed/optim/ ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/90399 Approved by: https://github.com/awgu	2022-12-08 06:38:59 +00:00
fduwjj	85ae28b454	Reformat optim import (#90294 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/90294 Approved by: https://github.com/awgu	2022-12-07 07:11:12 +00:00
Ram Rachum	351d73b97f	Fix exception causes all over the codebase (#90271 ) This is the continuation to #90134 and hopefully the final PR in this series. Pull Request resolved: https://github.com/pytorch/pytorch/pull/90271 Approved by: https://github.com/kit1980	2022-12-07 04:29:00 +00:00
fduwjj	1abe264ef0	[Upstream _NamedOptimzer] Reland PR (89480) (#90293 ) Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): Reland https://github.com/pytorch/pytorch/pull/89480/ * #90294 * __->__ #90293 Pull Request resolved: https://github.com/pytorch/pytorch/pull/90293 Approved by: https://github.com/awgu	2022-12-06 21:47:12 +00:00
PyTorch MergeBot	176b962f4b	Revert "[PT-D][Composability][1/N] Upstream NamedOptimizer from TorchRec (KeyedOptimizer in TR) (#89480 )" This reverts commit `31ec1a1ef7`. Reverted https://github.com/pytorch/pytorch/pull/89480 on behalf of https://github.com/kit1980 due to Broke test_correct_module_names	2022-12-06 07:22:37 +00:00
fduwjj	31ec1a1ef7	[PT-D][Composability][1/N] Upstream NamedOptimizer from TorchRec (KeyedOptimizer in TR) (#89480 ) In PyTorch, the optimizer `state_dict` always uses integers to index the per-parameter state. The composability workstream now needs an FQN-based (fully qualified name) way to index the optimizer `state_dict` for parameters. For example, an SGD optimizer might have something like this in its `state_dict`: ``` {'state': {0: {'momentum_buffer': tensor(...)}, 1: {'momentum_buffer': tensor(...)}, ...}, 'param_groups': [{'lr': 0.001, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'maximize': False, 'foreach': None, 'differentiable': False, 'params': [0, 1, 2, 3, 4, 5, 6, 7]}]} ``` With NamedOptimizer we want the `state_dict` to be: ``` {'state': {'net1.0.weight': {'momentum_buffer': tensor(...)}, 'net1.0.bias': {'momentum_buffer': tensor(...)}, ...}, 'param_groups': [{'lr': 0.001, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'maximize': False, 'foreach': None, 'differentiable': False, 'params': ['net1.0.weight', 'net1.0.bias', 'net2.0.weight', 'net2.0.bias', 'net3.weight', 'net3.bias', 'net4.1.weight', 'net4.1.bias']}]} ``` We also want to support `load_state_dict` so that the optimizer `state_dict` can be overridden for NamedOptimizer. For the next couple of PRs/diffs, we also need to: 1. Make `NamedOptimizer` work with FSDP (like registering a hook for a model wrapped with FSDP) and other PTD/PT components. 2. Make `NamedOptimizer` work well with apply_optim_in_backward. 3. Also upstream `CombinedOptimizer`. Differential Revision: [D41432088](https://our.internmc.facebook.com/intern/diff/D41432088/) NOTE FOR REVIEWERS: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D41432088/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/89480 Approved by: https://github.com/rohan-varma	2022-12-06 04:34:19 +00:00
Rohan Varma	404f254e20	Upstream apply_optim_in_backward from TorchRec (#87397 ) (#88539 ) Summary: Upstreaming this as part of sharing common APIs. This is just a plain move, any changes needed to support DDP / FSDP will come in follow up diffs. Test Plan: CI Reviewed By: zhaojuanmao Differential Revision: D40564646 fbshipit-source-id: 619c434e02196812f8d4db1e40d07290e08b18f9 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88539 Approved by: https://github.com/awgu	2022-11-05 18:28:07 +00:00
Rohan Varma	bd5b4e6504	[Easy] Unused var in functional_adam (#88292 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/88292 Approved by: https://github.com/awgu	2022-11-02 16:31:16 +00:00
Masaki Kozuki	5f26df0345	resubmit: "resubmit: [mta] APEX style Fused Adam (#81705 ) (#85507 )" (#85739 ) Embarrassingly move the pow implementations around [ATen/native/cuda/PowKernel.cu#L21-L66](`849b08f14b/aten/src/ATen/native/cuda/PowKernel.cu (L21-L66)`) to a new header file and let FusedAdam use them to tame MSVC, hopefully. cc @ngimel @ptrblck Pull Request resolved: https://github.com/pytorch/pytorch/pull/85739 Approved by: https://github.com/ngimel	2022-09-29 16:58:59 +00:00
PyTorch MergeBot	7167996346	Revert "resubmit: [mta] APEX style Fused Adam (#81705 ) (#85507 )" This reverts commit `4615d1bcfa`. Reverted https://github.com/pytorch/pytorch/pull/85507 on behalf of https://github.com/atalman due to Break internal windows builds	2022-09-27 16:59:35 +00:00
Masaki Kozuki	4615d1bcfa	resubmit: [mta] APEX style Fused Adam (#81705 ) (#85507 ) This PR implements an APEX style FusedAdam in PyTorch. This is different from the APEX one in that this is compatible with `torch.cuda.amp.GradScaler` by setting `_step_supports_amp_scaling` to `True` and unscales gradients inside its CUDA kernel. related: https://github.com/pytorch/pytorch/issues/68041, https://github.com/pytorch/pytorch/issues/71274, https://github.com/pytorch/pytorch/issues/80167 possibly related to https://github.com/pytorch/pytorch/issues/80595#issuecomment-1178519436 Pull Request resolved: https://github.com/pytorch/pytorch/pull/81705 Approved by: https://github.com/ngimel cc @ptrblck @ngimel Pull Request resolved: https://github.com/pytorch/pytorch/pull/85507 Approved by: https://github.com/ngimel	2022-09-23 18:56:00 +00:00
PyTorch MergeBot	e505360eb8	Revert "[mta] APEX style Fused Adam (#81705 )" This reverts commit `7a6c4d0c50`. Reverted https://github.com/pytorch/pytorch/pull/81705 on behalf of https://github.com/dagitses due to broke internal builds, details to come	2022-09-22 19:37:29 +00:00
Masaki Kozuki	7a6c4d0c50	[mta] APEX style Fused Adam (#81705 ) This PR implements an APEX style FusedAdam in PyTorch. This is different from the APEX one in that this is compatible with `torch.cuda.amp.GradScaler` by setting `_step_supports_amp_scaling` to `True` and unscales gradients inside its CUDA kernel. related: https://github.com/pytorch/pytorch/issues/68041, https://github.com/pytorch/pytorch/issues/71274, https://github.com/pytorch/pytorch/issues/80167 possibly related to https://github.com/pytorch/pytorch/issues/80595#issuecomment-1178519436 cc @ptrblck @ngimel Pull Request resolved: https://github.com/pytorch/pytorch/pull/81705 Approved by: https://github.com/ngimel	2022-09-20 17:18:33 +00:00
Rodrigo Kumpera	65dc5dd3f3	[c10d] Introduce dist.get_local_rank, dist.get_global_rank and dist.get_global_ranks (#82134 ) These functions enable membership introspection into a ProcessGroup. A common scenario that needs this is library code that consumes a PG but doesn't create it, which means it likely doesn't know the global ranks used to create it. Translating from local to global ranks is necessary when using c10d collectives like broadcast, so if your library code adopts the convention of using local rank 0, it needs to do the following: ```python import torch.distributed as dist my_pg: dist.ProcessGroup = ... def my_library_bcast(tensor): dist.broadcast(tensor, src=dist.get_global_rank(my_pg, 0), group=my_pg) ``` This implements some of the helpers needed to implement the `clone` API from: https://github.com/pytorch/pytorch/issues/81291 Pull Request resolved: https://github.com/pytorch/pytorch/pull/82134 Approved by: https://github.com/rohan-varma	2022-08-30 17:45:00 +00:00
joncrall	b136f3f310	More doctest refinements. (#83317 ) Follow up to #82797 Now that the doctests themselves are in a better state, we should be able to enable xdoctest on the CI so they stay that way. @ezyang @vadimkantorov Pull Request resolved: https://github.com/pytorch/pytorch/pull/83317 Approved by: https://github.com/ezyang	2022-08-22 20:07:26 +00:00
Rob Zinkov	ff75562cff	Adding maximize to rprop (#81864 ) Added the maximize flag #68052 to rprop optimizer and updates the respective tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/81864 Approved by: https://github.com/albanD	2022-08-16 08:19:46 +00:00
joncrall	4618371da5	Integrate xdoctest - Rebased (#82797 ) This is a new version of #15648 based on the latest master branch. Unlike the previous PR where I fixed a lot of the doctests in addition to integrating xdoctest, I'm going to reduce the scope here. I'm simply going to integrate xdoctest, and then I'm going to mark all of the failing tests as "SKIP". This will let xdoctest run on the dashboards, provide some value, and still let the dashboards pass. I'll leave fixing the doctests themselves to another PR. In my initial commit, I do the bare minimum to get something running with failing dashboards. The few tests that I marked as skip are causing segfaults. Running xdoctest results in 293 failed, 201 passed tests. The next commits will be to disable those tests. (unfortunately I don't have a tool that will insert the `#xdoctest: +SKIP` directive over every failing test, so I'm going to do this mostly manually.) Fixes https://github.com/pytorch/pytorch/issues/71105 @ezyang Pull Request resolved: https://github.com/pytorch/pytorch/pull/82797 Approved by: https://github.com/ezyang	2022-08-12 02:08:01 +00:00
ProGamerGov	71d50f4f89	Change docstring type callable to Callable for consistency (#82487 ) ### Description Across PyTorch's docstrings, both `callable` and `Callable` are used for variable types. `Callable` should be capitalized because we are referring to the `Callable` type, not the Python `callable()` function. ### Testing There shouldn't be any testing required. Pull Request resolved: https://github.com/pytorch/pytorch/pull/82487 Approved by: https://github.com/albanD	2022-08-01 17:26:09 +00:00
Jerome	547e499731	Enable Zero1's ddp_with_overlap for hpu backend (#80438 ) Enable zero with ddp overlap feature along with a simple interface to insert functional optimizer to the map Signed-off-by: Jerome <janand@habana.ai> Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/80438 Approved by: https://github.com/rohan-varma, https://github.com/awgu	2022-07-18 15:05:27 +00:00
anjali411	93912b1a73	Add __all__ to torch.distributed submodules (#80523 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/80523 Approved by: https://github.com/rohan-varma	2022-07-11 06:54:24 +00:00
PyTorch MergeBot	0b8a5ca01b	Revert "Adding maximize to rprop (#80335 )" This reverts commit `495aa9bc3a`. Reverted https://github.com/pytorch/pytorch/pull/80335 on behalf of https://github.com/albanD due to Broke rocm and windows test	2022-07-08 13:34:02 +00:00
Rob Zinkov	495aa9bc3a	Adding maximize to rprop (#80335 ) Added the maximize flag #68052 to rprop optimizer and updates the respective tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/80335 Approved by: https://github.com/albanD	2022-07-08 08:04:38 +00:00
Rob Zinkov	a1fd5b4273	Adding maximize to RMSprop (#80326 ) Added the maximize flag #68052 to RMSprop optimizer and updates the respective tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/80326 Approved by: https://github.com/albanD	2022-07-08 08:04:26 +00:00
wayi1	f76bb88205	fix docstring of PostLocalSGDOptimizer (#80855 ) As title. Pull Request resolved: https://github.com/pytorch/pytorch/pull/80855 Approved by: https://github.com/awgu, https://github.com/rohan-varma	2022-07-05 14:58:35 +00:00
Michael Carilli	ba27ee9e8f	[CUDA graphs] Allows Adam and AdamW to be capture-safe (#77862 ) Near term fix for https://github.com/pytorch/pytorch/issues/76368. Q. Why does the user need to request `capturable=True` in the optimizer constructor? Why can't capture safety be completely automatic? A. We need to set up capture-safe (device-side) state variables before capture. If we don't, and step() internally detects capture is underway, it's too late: the best we could do is create a device state variable and copy the current CPU value into it, which is not something we want baked into the graph. Q. Ok, why not just do the capture-safe approach with device-side state variables all the time? A. It incurs several more kernel launches per parameter, which could really add up and regress cpu overhead for ungraphed step()s. If the optimizer won't be captured, we should allow step() to stick with its current cpu-side state handling. Q. But cuda RNG is a stateful thing that maintains its state on the cpu outside of capture and replay, and we capture it automatically. Why can't we do the same thing here? A. The graph object can handle RNG generator increments because its capture_begin, capture_end, and replay() methods can see and access generator object. But the graph object has no explicit knowledge of or access to optimizer steps in its capture scope. We could let the user tell the graph object what optimizers will be stepped in its scope, ie something like ```python graph.will_use_optimizer(opt) graph.capture_begin() ... ``` but that seems clunkier than an optimizer constructor arg. I'm open to other ideas, but right now I think constructor arg is necessary and the least bad approach. Long term, https://github.com/pytorch/pytorch/issues/71274 is a better fix. Pull Request resolved: https://github.com/pytorch/pytorch/pull/77862 Approved by: https://github.com/ezyang	2022-06-13 01:56:47 +00:00
Olga Andreeva	b1ae519df9	Added functionality for post_local SGD (#78988 ) Fixes #74556 Added functionality to save and restore step counter for model averager. Added a unittest. Pull Request resolved: https://github.com/pytorch/pytorch/pull/78988 Approved by: https://github.com/rohan-varma, https://github.com/awgu	2022-06-09 17:47:04 +00:00
Rob Zinkov	2a496e2f80	Adding maximize to Adamax (#77409 ) Added the maximize flag #68052 to Adamax optimizer and updates the respective tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/77409 Approved by: https://github.com/albanD	2022-05-16 17:34:44 +00:00
Rob Zinkov	6642e88ad2	Adding maximize flag to Adagrad This adds maximize to Adagrad (#68052) along with updates the respective tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/75968 Approved by: https://github.com/albanD	2022-04-20 08:29:03 +00:00
Haijunlv	08f3b95857	fix PostLocalSGDOptimizer and ModelAverager average bug Fixes #74157 Pull Request resolved: https://github.com/pytorch/pytorch/pull/74894 Approved by: https://github.com/rohan-varma, https://github.com/wayi1	2022-04-13 11:41:27 +00:00
francescocastelli	58a44523c1	Add maximize flag to Adadelta Added the maximize flag to Adadelta optimizer (#68052) and adjusted tests to take maximize into account. Pull Request resolved: https://github.com/pytorch/pytorch/pull/75330 Approved by: https://github.com/cpuhrsch	2022-04-08 20:32:35 +00:00
wayi1	189e72babe	[Model Averaging] Fix post_localSGD_optimizer I find that the original implementation of `post_localSGD_optimizer.step()` is incorrect: Whenever `averager.average_parameters()` is called, the built-in step counter will be increased. Therefore, this should only be called exactly once per `optimizer.step()`. However, if a model has multiple param groups or params, the current implementation will call `averager.average_parameters()` multiple times and over-increase the step counter. Relevant proposals since hierarchical SGD can be supported on `post_localSGD_optimizer`: https://github.com/pytorch/pytorch/issues/73382, https://github.com/pytorch/pytorch/issues/71325 Pull Request resolved: https://github.com/pytorch/pytorch/pull/74737 Approved by: https://github.com/mrshenli	2022-04-05 21:10:24 +00:00
Andrew Gu	522041a0fd	[FSDP] Add full optim state dict (#74215 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/74215 ### Overview of API This PR introduces full optimizer state dict checkpointing. - This allows users to save the optimizer state for a `torch.nn.Module` (not necessarily a `FullyShardedDataParallel` instance) that contains `FullyShardedDataParallel` instances and later load that optimizer state. - This supports loading to a module with a different world size, but the `FSDP` wrapping scheme must be the same. To save the optimizer state, run the following (on all ranks): ``` model: torch.nn.Module = ... optim = torch.optim.Adam(model.parameters(), ...) # Train for some steps... full_osd = FSDP.full_optim_state_dict(model, optim) # returns non-empty dict only on rank 0 if rank == 0: torch.save(full_osd, ...) ``` To load the optimizer state, run the following (on all ranks): ``` new_model: torch.nn.Module = ... # may use different world size full_osd = torch.load(...) sharded_osd = FSDP.shard_full_optim_state_dict(full_osd, new_model) optim = torch.optim.Adam(new_model.parameters(), ...) optim.load_state_dict(sharded_osd) ``` To support multiple parameter groups, we require using an additional argument `optim_input`, which is the first argument that the user passes into the optimizer constructor. ``` optim_input = ... optim = torch.optim.Adam(optim_input, ...) FSDP.full_optim_state_dict(model, optim, optim_input) # one more argument ... new_optim_input = ... new_optim = torch.optim.Adam(new_optim_input, ...) FSDP.shard_full_optim_state_dict(full_osd, new_model, new_optim_input) # one more argument ``` One caveat is that the user should be careful of generators, which are exhausted after their first use. The `optim_input` passed into the `FSDP` APIs should be refreshed version of the generator if using generators. 
### Test Plan `full_optim_state_dict()` - [x] `full_optim_state_dict()` for a non-`FSDP` root model matches that of an equivalent local model, up to parameter IDs being rearranged, when optimizer input is `model.parameters()`. - [x] `full_optim_state_dict()` for a non-`FSDP` root model matches that of an equivalent local model, up to parameter IDs being rearranged, when optimizer input is multiple parameter groups (changing parameter order). `shard_full_optim_state_dict()` - [x] `shard_full_optim_state_dict()` for a non-`FSDP` root model matches the local `optim.state_dict()` of the same model with halved world size, when optimizer input is `model.parameters()`. - [x] `shard_full_optim_state_dict()` for a non-`FSDP` root model matches the local `optim.state_dict()` of the same model with halved world size, when optimizer input is multiple parameter groups (changing parameter order). - [x] `shard_full_optim_state_dict()` raises a `ValueError` when changing the `FSDP` wrapping scheme. On the AWS cluster, the TTS contribution for these tests is ~45 seconds. ### Developer Notes Relaxing the Problem For optimizer state checkpointing, we have relaxed the problem to not support changing the `FSDP` wrapping scheme between save and load time. It is unclear how to solve without this relaxation. This was the least restrictive way to relax the problem since it does not affect most expected use cases. Rather, the expected change between save and load time is the world size, which this implementation does support. Even with the relaxation, the `optim_input` argument is necessary to determine the `flat_param_id_to_param` mapping, which is important to know which parameter IDs in the flattened space correspond to `FlatParameter`s that hence need to be unflattened. Differences with Local Equivalent Suppose `full_osd = full_optim_state_dict()` and `local_osd = state_dict()` for a purely local equivalent. 
The difference between `full_osd` and `local_osd` is that the parameter IDs of unflattened parameters comprising a single flattened parameter are always consecutive in `full_osd`, while they may be non-consecutive in `local_osd`. Suppose in the following that each layer has 1 parameter `param`: ``` FSDP(model) layer1 FSDP(layer2) layer3 ``` `layer1.param` and `layer3.param` are flattened and attributed to `model`. `layer2.param` is flattened and attributed to itself. - In `local_osd`, the parameter IDs would be `0: layer1.param`, `1: layer2.param`, and `2: layer3.param`. - In `full_osd`, the parameter IDs would be `0: layer1.param`, `1: layer3.param`, and `2: layer2.param`. (Parameter IDs of unflattened parameters sharing a flattened parameter are consecutive.) The idea is that as long as `full_optim_state_dict()` and `shard_full_optim_state_dict()` are internally consistent, then there is no need to match the local equivalent (assuming no change in `FSDP` wrapping). ### Follow-Ups API - If needed, we can follow-up this PR by adding an argument `key_by_name: bool = False` to both methods that may be set to `True` to key parameters by `str` names instead of `int` parameter IDs. We still need to investigate if keying by name enables changing the `FSDP` wrapping scheme. Refactoring - In this optimizer state checkpointing, all optimizer state is saved to CPU on rank 0 (set as `OPTIM_TARGET_RANK`). We should unify and refactor these assumptions with model state checkpointing. Testing - The code path for unused parameters is not tested. The testing and any needed implementation fixes can be done in a follow-up. - The code path for non-tensor states (e.g. `Adam` `"step"` as `float` instead of as zero-dimension `FloatTensor`) is not tested. However, it is identical to that of zero-dimension tensor states, so I have some confidence. If needed, I can add tests for it in a follow-up. - Would I have to write my own optimizer? 
I do not want to introduce dependencies on third party libraries like Nvidia `apex`. - We may want to add end-to-end checkpointing tests that include both model state dict and optimizer state dict. Test Plan: Imported from OSS Reviewed By: zhaojuanmao Differential Revision: D35045121 Pulled By: awgu fbshipit-source-id: 33c650dc960acbd7613d4f444a852b9f76ca4a9b (cherry picked from commit 2bbc2e344296dc455cf686f3a9b097989504be81)	2022-03-30 14:15:23 +00:00
Andrew Gu	9012e8d65a	[ZeRO][BE] Clean up ZeRO tests (#73842 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73842 Overview This cleans up the `ZeroRedundancyOptimizer` tests. I apologize for strong formatting changes mixed in with actually-beneficial changes. It was convenient to unify the formatting while doing a deep comb through the full test file. The main non-formatting changes include: - Using `parametrize` instead of manually including `for` loops over possible argument values - Removing the `DEVICE` global variable, which was used only for the `TestZeroRedundancyOptimizerSingleRank` tests, in favor of consistent usage of `self.device` in both `TestZeroRedundancyOptimizerSingleRank` and `TestZeroRedundancyOptimizerDistributed` - Moving `assert ... == ...` to `self.assertEqual(..., ...)` when the assert is part of the test's correctness - Removing the `if self.rank >= self.world_size or (torch.cuda.is_available() and torch.cuda.device_count() < 2):` conditional guards in favor of `common_distributed.skip_if_no_gpu` for `TestZeroRedundancyOptimizerDistributed` - For `TestZeroRedundancyOptimizerDistributed`, `self.device` is `torch.device(self.rank)` if CUDA is available, while `self.world_size` is at least 2, even if `torch.cuda.device_count() == 1`. - The problematic case is exactly when `torch.cuda.device_count() == 1` but `self.world_size == 2` since then calling `self.device` on rank 1 will error. The existing conditional guard prevented this case for some tests, but it was not used consistently (e.g. `test_multiple_groups()`), which is most likely the reason for the hangs and resulting test flakiness. (From my experience landing the recent ZeRO constructor changes, the Windows environment uses a world size of 2 but only has 1 device available.) - A more robust solution is to always use the `skip_if_no_gpu` decorator as long as the test uses `self.device` and CUDA is available. 
This is in line with the recommended SPSD usage of ZeRO. - Renaming `test_multiple_groups()` to `test_nondefault_process_group()` - The existing `test_multiple_groups()` was slightly misnamed. Also, it is only nontrivial for a world size of (at least) 4 since it tests using a process group including only even ranks. It was marked as flaky on Windows, and I believe this is because of the world size and `torch.cuda.device_count()` mismatch. Now, the test only uses GPU if there are enough available and falls back to CPU otherwise, which is safe since the test uses Gloo backend. - There was also a duplicated section, which I was unsure how to non-naively de-duplicate. The top half and bottom half are identical even though they claim to target fitting into the broadcast bucket and not fitting into the broadcast bucket: `1d497114e7/test/distributed/optim/test_zero_redundancy_optimizer.py (L658-L684)` - Changing `_test_zero_model_parallel()` to not use CPU - This is my own fault, having introduced this inefficiency last summer. It makes more sense to simply designate one of the two GPUs for a process to be its default device rather than routing through CPU. Questions - How might we limit the runs for `test_ddp_zero_overlap()`? Because it parameterizes over many values, it contributes significantly to the time-to-signal. However, it is an experimental feature, so it is not critical that the tests run every time. Test Plan: Imported from OSS Reviewed By: rohan-varma Differential Revision: D34675709 Pulled By: awgu fbshipit-source-id: 71ce9ac968fb34415cd65206855b4bb5e67754fb (cherry picked from commit 34e3dd0a184318ea9f63a1ee20cd14b111af3501)	2022-03-08 13:15:20 +00:00
Can Balioglu	e1db2f13ce	Refactor TORCH_DISTRIBUTED_DEBUG implementation (#73166 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73166 This PR refactors, cleans up, and optimizes the implementation of `TORCH_DISTRIBUTED_DEBUG`. It also introduces three new user APIs: `get_debug_level()`, `set_debug_level()`, and `set_debug_level_from_env()` to retrieve and modify the debug level after a process has started. ghstack-source-id: 149778566 Test Plan: Run the existing unit tests. Reviewed By: rohan-varma Differential Revision: D34371226 fbshipit-source-id: e18443b411adcbaf39b2ec999178c198052fcd5b (cherry picked from commit 26d6bb1584b83a0490d8b766482656a5887fa21d)	2022-02-24 02:33:05 +00:00
Andrew Gu	c30659ffcc	[ZeRO] (Reland) Add ctor support for multiple param groups (#72932 ) Summary: Reland of https://github.com/pytorch/pytorch/pull/72578. Overview Windows CI was failing due to the multi-rank single-GPU case (see [here](https://github.com/pytorch/pytorch/runs/5204906995?check_suite_focus=true)). To address this, I - added `common_distributed.skip_if_no_gpu` for `test_multiple_param_groups()` to ensure that each rank can safely call `to(self.device)` -- this targets the expected SPSD use case where each rank has its own GPU; - moved `test_constructor()` back to `TestZeroRedundancyOptimizerSingleRank` to check that the multiple parameter group method for construction works even on a single rank. Test Plan - I checked both tests for CPU, 1 GPU, 2 GPUs, 4 GPUs, and 8 GPUs. - I added the `ciflow/win` label to run the failing Windows CI test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/72932 Reviewed By: rohan-varma Differential Revision: D34281482 Pulled By: awgu fbshipit-source-id: c4fe604ddd9d2c123c3071249741e6b8a6454b6e (cherry picked from commit `6bea9bcc63`)	2022-02-22 16:29:55 +00:00
Nikita Shulga	84cb810b3f	Revert D34106940: [ZeRO] Add ctor support for multiple param groups Test Plan: revert-hammer Differential Revision: D34106940 (`5dd0732457`) Original commit changeset: 7e70fc0b3cec Original Phabricator Diff: D34106940 (`5dd0732457`) fbshipit-source-id: 08f846c9c02be8756475f4e0b57eb381f10c27bd (cherry picked from commit `7675497d83`)	2022-02-16 03:45:15 +00:00
wayi1	8b08478115	Fix the doc of PostLocalSGDState (#72792 ) Summary: The first arg of `PostLocalSGDState` ctor, `process_group`, cannot be empty. Here to simplify the usage, does not even create a subgroup explicitly. See the example in unit test: `4feef6c970/torch/testing/_internal/distributed/distributed_test.py (L4260)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/72792 Reviewed By: samdow Differential Revision: D34213221 Pulled By: rohan-varma fbshipit-source-id: 078343f3ee138e175bf835897f190032eb970662 (cherry picked from commit `bf90af704f`)	2022-02-15 23:47:12 +00:00
Mikayla Gawarecki	2a5aaf1c49	Optim foreach cleanup for AdamW (#70484 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70484 Test Plan: Imported from OSS Reviewed By: anjali411 Differential Revision: D33767869 Pulled By: mikaylagawarecki fbshipit-source-id: 2f5273bbfeea3ed502c5d77da4bebe1674243e86 (cherry picked from commit `2dd9b77917`)	2022-02-15 18:02:08 +00:00
Mikayla Gawarecki	dff58d519f	Optim foreach cleanup for Rprop (#70483 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70483 Test Plan: Imported from OSS Reviewed By: anjali411 Differential Revision: D33767866 Pulled By: mikaylagawarecki fbshipit-source-id: ffc5ae68eeea8fa09385862b853b731554b77bcb (cherry picked from commit `3a0fe29580`)	2022-02-15 18:02:08 +00:00
Mikayla Gawarecki	ce3094f5f6	Optim foreach cleanup for Rmsprop (#70482 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70482 Test Plan: Imported from OSS Reviewed By: anjali411 Differential Revision: D33767862 Pulled By: mikaylagawarecki fbshipit-source-id: 8e2e9c986d5a3774093a79755940372945f1b3a9 (cherry picked from commit `baea537277`)	2022-02-15 18:02:08 +00:00
Mikayla Gawarecki	2cb03e926f	Optim foreach cleanup for SGD (#70481 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70481 Test Plan: Imported from OSS Reviewed By: anjali411 Differential Revision: D33767868 Pulled By: mikaylagawarecki fbshipit-source-id: 89b9227a4ddf99602855973cbc343c58ae3d5328 (cherry picked from commit `ffea8ddcfd`)	2022-02-15 18:02:08 +00:00
Mikayla Gawarecki	5f9590681d	Optim foreach cleanup for Adam (#70295 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70295 Test Plan: Imported from OSS Reviewed By: anjali411 Differential Revision: D33767870 Pulled By: mikaylagawarecki fbshipit-source-id: f922f15ecb0307458c8ecee737325c42c4f3ce8b (cherry picked from commit `66233a8a3e`)	2022-02-15 18:02:08 +00:00
Andrew Gu	5dd0732457	[ZeRO] Add ctor support for multiple param groups (#72578 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/72578 Overview This adds `ZeroRedundancyOptimizer` constructor support for multiple parameter groups (i.e. passing an `iterable` of `dict`s instead of an `iterable` of `torch.Tensor` as the `parameters` argument) to mirror the API for non-sharded optimizers. Fixes https://github.com/pytorch/pytorch/issues/71347 and https://github.com/pytorch/pytorch/issues/59973. This modifies `test_collect_shards()` to skip if ROCm. Test Plan I adjusted the existing constructor test, and I added a test for parity between constructing with two parameter groups up front versus constructor with one parameter group and adding the second parameter group after (via `add_param_group()`) versus a non-sharded optimizer. Test Plan: Imported from OSS Reviewed By: rohan-varma Differential Revision: D34106940 Pulled By: awgu fbshipit-source-id: 7e70fc0b3cec891646e0698eaedf02ff4354c128 (cherry picked from commit `40f2d45172`)	2022-02-15 16:51:30 +00:00
Yuxin Wu	1ed4653e89	Stop writing logs to root logger (#72649 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/72648 Pull Request resolved: https://github.com/pytorch/pytorch/pull/72649 Reviewed By: soulitzer Differential Revision: D34172113 Pulled By: mrshenli fbshipit-source-id: 98cb4140b978a0d9fa53876e427ea3b8bbe884cf (cherry picked from commit `c14297cee6`)	2022-02-11 21:30:53 +00:00
Mikayla Gawarecki	d9acfef831	Optim foreach cleanup for Adamax (#69982 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69982 Test Plan: Imported from OSS Reviewed By: anjali411 Differential Revision: D33767865 Pulled By: mikaylagawarecki fbshipit-source-id: c5efd351e359825d38b71f57a2c61a2055c3c114 (cherry picked from commit `37bb80c2d7`)	2022-02-09 16:52:13 +00:00
Mikayla Gawarecki	dabfea8363	Optim foreach cleanup for Adagrad (#69981 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69981 Test Plan: Imported from OSS Reviewed By: anjali411 Differential Revision: D33767863 Pulled By: mikaylagawarecki fbshipit-source-id: 1c99abe4ac4eb2a9eb896dff4837b539b94f68e7 (cherry picked from commit `61c28d0645`)	2022-02-09 16:52:12 +00:00
Mikayla Gawarecki	8e8d170674	Optim foreach cleanup for Adadelta (#69980 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69980 - Merged `torch/optim/adadelta.py` and `torch/optim/_multi_tensor/adadelta.py` into `torch/optim/adadelta.py` - Moved adadelta functional forms from `torch/optim/_functional.py` and `torch/optim/_multi_tensor/_functional.py` to `torch/optim/adadelta.py` - `torch/optim/_functional.py` just imports from `torch/optim/adadelta.py` - Added a test `test_optimizers_foreach_flag` which replicates `test_multi_tensor_optimizers` in `test/test_optim.py` - Added a test `test_adadelta_new` that replicates the behavior of `test_adadelta` but with the `foreach` flag instead of the multitensor adadelta class. If we delete `_multi_tensor/` we could replace `test_adadelta` with this Remaining TODO: - [ ] single_tensor adadelta supports complex but multitensor does not, need to integrate the singletensor logic in multitensor and switch the `test_adadelta_complex` to test for foreach in [True, False] Test Plan: Imported from OSS Reviewed By: VitalyFedyunin, albanD Differential Revision: D33413059 Pulled By: mikaylagawarecki fbshipit-source-id: 92a9fa98705762bb6bd464261671e49aef40070e (cherry picked from commit `a008227d22`)	2022-02-09 16:52:12 +00:00
Mikayla Gawarecki	7176c92687	[optim] update step in functional and pass state_steps instead of state (#71333 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71333 Updated: - Adagrad - Adamax - Adam - AdamW - RAdam. Make multi_tensor functionals take `state_steps: List[Tensor]` instead of `states: List[Dict]`. Change `state_steps: List[int]` to `state_steps: List[Tensor]`, where each element is a singleton tensor, so that step can be updated within the functional. (NAdam and ASGD were updated in separate diffs to fold their handling of state into the functionals.) Test Plan: Imported from OSS Reviewed By: anjali411 Differential Revision: D33767872 Pulled By: mikaylagawarecki fbshipit-source-id: 9baa7cafb6375eab839917df9287c65a437891f2 (cherry picked from commit `831c02b3d0`)	2022-02-08 16:51:19 +00:00
Rohan Varma	bdcdf94bdd	[Opt Overlap] Clean up code in _OptimizerHookState (#71620 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71620 Remove from_functional_optim and make it the default constructor since that is the only way _OptimizerHookState is now being built. Also, no longer need to expose create_functional_optim helper function ghstack-source-id: 147577174 Test Plan: CI Reviewed By: cbalioglu Differential Revision: D33700593 fbshipit-source-id: ba089ce3bf66ccf8f71cffdd0f4d4bddc03e8b14 (cherry picked from commit `a50b2caf0e`)	2022-01-26 19:33:49 +00:00
Rohan Varma	f5a71ec2d6	[Opt Overlap] Implement as_functional_optim and create_functional_optim (#71604 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71604 Implement 2 helper functions: - as_functional_optim which takes in a torch.optim class type and arguments and creates the corresponding functional optimizer. - create_functional_optim which takes in the functional optimizer class type and constructs it. Note that as_functional_optim calls into create_functional_optim. The first will be used in future PRs as described in https://github.com/pytorch/pytorch/issues/67570 to create a functional optimizer from a traditional optimizer. The latter is used in _OptimizerHookState to create a functional optimizer. Both new helper functions are covered by unittests. ghstack-source-id: 147577170 Test Plan: CI Reviewed By: cbalioglu Differential Revision: D33688995 fbshipit-source-id: 8b2daafd1b914efa90877cc4313aa9a428546fc1 (cherry picked from commit `42fdae2991`)	2022-01-25 18:32:13 +00:00

126 Commits