pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
Matt Pitkin	8a5dd7f59b	Allow SequentialLR to include ChainedScheduler (#133450 ) This fixes #132745 and allows a `SequentialLR` to include schedulers that are compound scheduler types (i.e., a `ChainedScheduler`), which contain a list of schedulers in a `_schedulers` attribute. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133450 Approved by: https://github.com/janeyx99	2024-10-18 02:29:38 +00:00
Daniel Velkov	4abe38bc94	RMSprop docs: add missing input "epsilon" (#137854 ) Adding a missing input argument in the docs for RMSprop. Like in the doc for AdamW https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html Pull Request resolved: https://github.com/pytorch/pytorch/pull/137854 Approved by: https://github.com/janeyx99	2024-10-15 16:40:42 +00:00
ErezYosef	197601eeea	Add Support for Tracking Parameter Names (named_parameters) in Optimizer State Dict (#134107)

A proposal addressing Issue #1489: Optimizer should track parameter names and not id. (also mentioned here: [[RFC] Introducing FQNs/clarity to optim state_dict](https://dev-discuss.pytorch.org/t/rfc-introducing-fqns-clarity-to-optim-state-dict/1552))

## Summary

This PR introduces a backward-compatible enhancement where optimizers track parameter names instead of just their id. Optimizers can be initialized with `named_parameters()` as:

```python
optimizer = optim.SGD(model.named_parameters(), lr=0.01, momentum=0.9)
```

This allows for greater clarity and ease when handling optimizers, as the parameters' names are preserved within the optimizer’s `state_dict` as:

```
state_dict = {
    'state': {
        0: {'momentum_buffer': tensor(...), ...},
        1: {'momentum_buffer': tensor(...), ...},
    },
    'param_groups': [
        {
            'lr': 0.01,
            'weight_decay': 0,
            ...
            'params': [0, 1],
            'param_names': ['layer.weight', 'layer.bias'],  # optional
        }
    ]
}
```

Loading `state_dict` is not changed (backward-compatible) and the `param_names` key will be ignored.

## Key Features

#### Named Parameters in Optimizer Initialization:
Optimizers can accept the output of `model.named_parameters()` during initialization, allowing them to store parameter names directly.

#### Parameter Names in `state_dict`:
The parameter names are saved as a list in the optimizer’s `state_dict` with key `param_names`, alongside the `params` indices, ensuring seamless tracking of both names and parameters.

## Backward Compatibility

#### No Breaking Changes:
This change is fully backward-compatible. The added `param_names` key in the optimizer's `state_dict` is ignored when loading a state to the optimizer.

#### Customization with Hooks:
For more control, the loaded state_dict can be modified using a custom `register_load_state_dict_pre_hook`, providing flexibility for different design needs.

## Documentation Updates

Please refer to the documentation changes for more details on how this feature is implemented and how it can be used effectively.

## Solution Example:

A suggested solution to the problem mentioned in #1489, for the same parameters but in a different order. The following `register_load_state_dict_pre_hook` should be added to the optimizer before loading to enable loading the state dict:

```python
def adapt_state_dict_ids(optimizer, state_dict):
    # Assumes a single param group with the same number of params and the
    # same names in both state dicts; only the ordering differs.
    current_state_group = optimizer.state_dict()['param_groups'][0]
    loaded_state_group = state_dict['param_groups'][0]

    # Mapping from param name to its id in the loaded state dict.
    loaded_name_to_id = dict(
        zip(loaded_state_group['param_names'], loaded_state_group['params'])
    )

    # Reorder the loaded ids to follow the current optimizer's name order,
    # because loading pairs saved ids with current params by position.
    loaded_state_group['params'] = [
        loaded_name_to_id[name] for name in current_state_group['param_names']
    ]
    loaded_state_group['param_names'] = list(current_state_group['param_names'])
    return state_dict
```

In this code, the loaded `state_dict` ids are adapted to match the order of the current optimizer `state_dict`. Both the previous and the current optimizers are required to be initialized with `named_parameters()` to have the `param_names` key in the dict.

### Note

This is my first contribution to PyTorch, and I wish to receive feedback or suggestions for improvement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134107
Approved by: https://github.com/janeyx99
Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>	2024-10-14 19:24:44 +00:00
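The name-based id remapping described in the #134107 entry can be sketched on plain dictionaries, without an optimizer. This is a minimal illustration, assuming the loader pairs saved ids with current parameters by position within a group; the helper name `remap_param_ids_by_name` and the sample parameter names are hypothetical and not part of PyTorch.

```python
from copy import deepcopy


def remap_param_ids_by_name(current_group, loaded_group):
    """Return a copy of loaded_group whose 'params' follow current_group's name order.

    Hypothetical helper: both arguments are plain dicts shaped like one entry of
    state_dict['param_groups'], each with 'params' and 'param_names' lists.
    """
    # Look up each saved id by the parameter name it was saved under.
    loaded_name_to_id = dict(zip(loaded_group["param_names"], loaded_group["params"]))

    remapped = deepcopy(loaded_group)
    # Position i must hold the saved id of the parameter that sits at
    # position i in the current optimizer.
    remapped["params"] = [loaded_name_to_id[name] for name in current_group["param_names"]]
    remapped["param_names"] = list(current_group["param_names"])
    return remapped


# Same three parameters, saved in a different order (illustrative names).
current = {"params": [0, 1, 2], "param_names": ["a.weight", "a.bias", "b.weight"]}
loaded = {"params": [0, 1, 2], "param_names": ["b.weight", "a.weight", "a.bias"]}

remapped = remap_param_ids_by_name(current, loaded)
print(remapped["params"])
```

A three-parameter rotation is used on purpose: with only two swapped parameters, mapping in either direction gives the same list, so a two-parameter test would not catch an inverted lookup.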
Jane Xu	f9ed39c989	Autoupdate min_lrs for ReduceLROnPlateau if possible, fixes #104361 (#137637 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137637 Approved by: https://github.com/albanD	2024-10-10 01:23:30 +00:00
Jane Xu	972822dea1	Minorly reorder optim kwargs in docs, fixes #137391 (#137531 ) Closes #137391 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137531 Approved by: https://github.com/albanD	2024-10-09 04:14:45 +00:00
Jane Xu	ddc7b6d0b4	Removes confusing note, addresses #38006 (#137535) Fixes #38006 The note was originally added in https://github.com/pytorch/pytorch/pull/30257, which tried to ensure that the gradient wasn't modified in the optimizer. This note creates more confusion than is helpful, so removing it is better than leaving it in, especially because most uses of closure that I know of _do_ modify the grads. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137535 Approved by: https://github.com/albanD	2024-10-09 04:00:38 +00:00
Jane Xu	b16167874d	Minor SGD docs clarification fixing #137356 , #137352 (#137528 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137528 Approved by: https://github.com/albanD	2024-10-08 23:05:08 +00:00
Sunishchal Dev	a8ed873ba2	Add missing input "eps" to adam docs (#135191 ) Minor fix for missing input argument in the Adam optimizer docs page. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135191 Approved by: https://github.com/janeyx99	2024-09-25 20:17:23 +00:00
Mauricio Villegas	ece8267d2c	Add back optim type hints that were lost when .pyi files were removed (#136185) When stub files (`.pyi`) were removed from `optim` (#125556, #125452), some types that existed were no longer available. This pull request adds them back. Just for reference, these types are used in `pytorch-lightning`'s `LightningCLI`. Command line interfaces are created automatically, and having type hints makes them nicer. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136185 Approved by: https://github.com/janeyx99	2024-09-17 15:45:15 +00:00
Aaron Gokaslan	31715be72a	[BE]: Update mypy to 1.11.2 (#133816) Updates mypy to 1.11.2 to improve type inference Pull Request resolved: https://github.com/pytorch/pytorch/pull/133816 Approved by: https://github.com/ezyang	2024-09-16 19:44:11 +00:00
PyTorch MergeBot	3117f2cf67	Revert "[BE]: Update mypy to 1.11.2 (#133816 )" This reverts commit `55299cfc22`. Reverted https://github.com/pytorch/pytorch/pull/133816 on behalf of https://github.com/jeanschmidt due to seems to have broken https://github.com/pytorch/pytorch/actions/runs/10865710499/job/30155699792 on main ([comment](https://github.com/pytorch/pytorch/pull/133816#issuecomment-2352377684))	2024-09-16 09:11:16 +00:00
Aaron Gokaslan	55299cfc22	[BE]: Update mypy to 1.11.2 (#133816) Updates mypy to 1.11.2 to improve type inference Pull Request resolved: https://github.com/pytorch/pytorch/pull/133816 Approved by: https://github.com/ezyang	2024-09-14 21:40:36 +00:00
Jane Xu	b1612569f6	[BE] Clarify defaulting behavior in optimizer (#135384 ) Fixes #135340 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135384 Approved by: https://github.com/drisspg, https://github.com/jainapurva	2024-09-06 21:52:55 +00:00
Nikita Shulga	fb1c580892	[BE][optim] Make pyright recognize exported symbols (#135043 ) Follows pattern introduced by https://github.com/pytorch/pytorch/pull/80955 which [pyright](https://github.com/microsoft/pyright) prefers over `__all__` symbol, see https://github.com/microsoft/pylance-release/issues/2953#issuecomment-1168956296 Fixes https://github.com/pytorch/pytorch/issues/134985 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135043 Approved by: https://github.com/janeyx99	2024-09-04 21:53:46 +00:00
Wil Kong	de06345e9b	Avoid Host & Device Sync In LR Scheduler (#133663 ) Fixes #133662. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133663 Approved by: https://github.com/janeyx99, https://github.com/eqy Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>	2024-08-22 03:52:43 +00:00
Sahdev Zala	06cc2e83f0	Make optim.swa_utils content accessible from the torch.optim doc (#133393) Link various classes and functions of `optim.swa_utils` to make the doc content accessible from the `torch.optim` doc. Currently, if you click the link https://pytorch.org/docs/stable/optim.html#module-torch.optim.swa_utils, it goes to a blank section at the bottom of the `torch.optim` page. Also, the `torch.optim.swa_utils.AveragedModel` and `torch.optim.swa_utils.SWALR` classes, as well as `torch.optim.swa_utils.update_bn()` and `optim.swa_utils.get_ema_multi_avg_fn`, are not linked to the doc. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133393 Approved by: https://github.com/janeyx99	2024-08-21 00:43:46 +00:00
Masaki Kozuki	702c810780	move param's device check to `_init_group` for fused (#131153) There could be some cases where the params are on the meta device when the optimizer's `__init__` is called and are only materialized in the first computation. This change allows such a situation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131153 Approved by: https://github.com/mlazos, https://github.com/janeyx99 Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>	2024-08-17 04:49:47 +00:00
Jane Xu	c23dceb8f1	Add Adafactor foreach impl (#132336 ) This PR adds the foreach impl for Adafactor knowing that there are many ways to improve its runtime perf today (by adding more foreach support). After this PR: - we have a foreach flag for Adafactor - It is NOT the default. Why not? It is only slightly faster + uses O(n) more memory where n is the number of params in your max param group. People tend to use Adafactor for memory efficiency. Next steps: - make torch.compile possible on it - make it faster (by adding more foreach apis) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132336 Approved by: https://github.com/albanD ghstack dependencies: #133360	2024-08-15 17:00:33 +00:00
Xuehai Pan	758a0a88a2	[BE][Easy] enable `ruff` rule `PIE790`: unnecessary `pass` statement (#133200) This PR removes unnecessary `pass` statements. This is semantically safe because the bytecode for the Python code does not change. Note that if there is a docstring in the function, an empty function does not need a `pass` statement as a placeholder. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133200 Approved by: https://github.com/malfet, https://github.com/eqy, https://github.com/kit1980	2024-08-15 15:50:19 +00:00
Jane Xu	14750dd737	Correct return type of grouping helper function in Optimizer (#133360 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133360 Approved by: https://github.com/albanD	2024-08-14 01:56:02 +00:00
Pierre Chapuis	0e4c0ef29f	fix type of `eta_min` parameter in CosineAnnealing (int -> float) (#132482 ) This fixes errors with type checkers such as `pyright`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132482 Approved by: https://github.com/janeyx99	2024-08-12 18:22:26 +00:00
PyTorch MergeBot	cbee9c1fd2	Revert "Deprecate `torch._utils.is_compiling()` and `torch._dynamo.external_utils.is_compiling()` (#127690 )" This reverts commit `0e7e61f7ce`. Reverted https://github.com/pytorch/pytorch/pull/127690 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/127690#issuecomment-2272370386))	2024-08-07 00:05:20 +00:00
Xuehai Pan	0e7e61f7ce	Deprecate `torch._utils.is_compiling()` and `torch._dynamo.external_utils.is_compiling()` (#127690 ) This PR is split from PR #126898. - #126898 ------ Pull Request resolved: https://github.com/pytorch/pytorch/pull/127690 Approved by: https://github.com/Skylion007, https://github.com/malfet	2024-08-03 09:43:38 +00:00
xinyu-intel	2ee9895304	Support optimizer capturable on hpu and xpu (#132119 ) as title Pull Request resolved: https://github.com/pytorch/pytorch/pull/132119 Approved by: https://github.com/jgong5, https://github.com/janeyx99	2024-08-02 08:19:52 +00:00
Xuehai Pan	30293319a8	[BE][Easy][19/19] enforce style for empty lines in import segments in `torch/[o-z]*/` (#129771 ) See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter. You can review these PRs via: ```bash git diff --ignore-all-space --ignore-blank-lines HEAD~1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129771 Approved by: https://github.com/justinchuby, https://github.com/janeyx99	2024-08-01 17:07:14 +00:00
Joel Schlosser	e6cddc9271	Fix public API tests (#131386 ) This PR fixes a bug in `test_correct_module_names` introduced in #130497. It also addresses post-fix test failures in: * `torch/ao/quantization/__init__.py` - set the correct `__module__` for several public API helpers * `torch/library.py` - add `register_vmap` to `__all__` * `torch/nn/attention/flex_attention.py` - make `round_up_to_multiple` private by prepending an underscore * `torch/storage.py` - introduce `__all__` to avoid `Self` being re-exported as a public API * `torch/distributed/pipelining/schedules.py` - add `ZeroBubbleAlgorithm` to `__all__` Pull Request resolved: https://github.com/pytorch/pytorch/pull/131386 Approved by: https://github.com/albanD	2024-07-30 18:42:54 +00:00
Jane Xu	3816f6420a	[BE] remove unnecessary _dispatch_sqrt by using ** 0.5 (#131358) Based on the discussion here where ** 0.5 is not slower than math.sqrt. https://github.com/pytorch/pytorch/pull/129905#discussion_r1675605075 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131358 Approved by: https://github.com/albanD	2024-07-30 18:08:17 +00:00
PyTorch MergeBot	8f5cf46405	Revert "Fix public API tests (#131386 )" This reverts commit `91fcfd8760`. Reverted https://github.com/pytorch/pytorch/pull/131386 on behalf of https://github.com/clee2000 due to reverting this to revert something else, only action you should need to do is to rebase and merge again, sorry for the churn ([comment](https://github.com/pytorch/pytorch/pull/131386#issuecomment-2254327487))	2024-07-28 03:23:04 +00:00
Matthew Hoffman	fdf1451bfa	Add `__all__` to torch.optim to define public interface (#131959 ) There was a regression in the public interface for `torch.optim` introduced in #125452 when `torch/optim/__init__.pyi` was merged into `torch/optim/__init__.py`. [The import aliases were not preserved and so now `pyright` thinks that these classes are not publicly exported from `torch/optim/__init__.py`.](https://github.com/pytorch/pytorch/pull/125452/files#diff-941595c1e1aa06bec94578499dd3654532a5183d0bc1bcd94d1f33b47e0d0adfL1-L15) ``` error: "SGD" is not exported from module "torch.optim" ``` Adding these classes/modules to `__all__` fixes this. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131959 Approved by: https://github.com/ezyang	2024-07-27 01:03:25 +00:00
Joel Schlosser	91fcfd8760	Fix public API tests (#131386 ) This PR fixes a bug in `test_correct_module_names` introduced in #130497. It also addresses post-fix test failures in: * `torch/ao/quantization/__init__.py` - set the correct `__module__` for several public API helpers * `torch/library.py` - add `register_vmap` to `__all__` * `torch/nn/attention/flex_attention.py` - make `round_up_to_multiple` private by prepending an underscore * `torch/storage.py` - introduce `__all__` to avoid `Self` being re-exported as a public API * `torch/distributed/pipelining/schedules.py` - add `ZeroBubbleAlgorithm` to `__all__` Pull Request resolved: https://github.com/pytorch/pytorch/pull/131386 Approved by: https://github.com/albanD	2024-07-26 23:38:43 +00:00
PyTorch MergeBot	e4db5dc1c4	Revert "[BE] remove unnecessary _dispatch_sqrt by using ** 0.5 (#131358 )" This reverts commit `4c7f22dee2`. Reverted https://github.com/pytorch/pytorch/pull/131358 on behalf of https://github.com/janeyx99 due to Internal uses this private API and landing that has been a pain so we're reverting this first ([comment](https://github.com/pytorch/pytorch/pull/131358#issuecomment-2253190654))	2024-07-26 17:35:27 +00:00
PyTorch MergeBot	c9888c2739	Revert "[BE] typing for decorators - optim/optimizer (#131583 )" This reverts commit `a1dad77dfa`. Reverted https://github.com/pytorch/pytorch/pull/131583 on behalf of https://github.com/atalman due to Breaks CI: [GH job link](https://github.com/pytorch/pytorch/actions/runs/10105959146/job/27947741162) [HUD commit link](`a1dad77dfa`) ([comment](https://github.com/pytorch/pytorch/pull/131583#issuecomment-2252784280))	2024-07-26 13:41:22 +00:00
Aaron Orenstein	a1dad77dfa	[BE] typing for decorators - optim/optimizer (#131583 ) See #131429 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131583 Approved by: https://github.com/janeyx99 ghstack dependencies: #131568, #131569, #131570, #131571, #131572, #131573, #131574, #131575, #131576, #131577, #131578, #131579, #131580, #131581, #131582	2024-07-26 05:00:07 +00:00
Jane Xu	9c4cf866c2	Adafactor forloop basic impl (#129905)

#109581

At this point, the vanilla implementation (the default) is good.

Docs: https://docs-preview.pytorch.org/pytorch/pytorch/129905/generated/torch.optim.Adafactor.html#torch.optim.Adafactor

Specifically, the impl in this PR, which attempts to replicate the paper,

```
optim = torch.optim.Adafactor([weight])
```

is close enough to https://pytorch-optimizers.readthedocs.io/en/latest/optimizer/#pytorch_optimizer.AdaFactor

```
optim_c = AdaFactor([weight], betas=(0, 0.999), scale_parameter=False)
```

is close enough to https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adafactor

```
optim = keras.optimizers.Adafactor(learning_rate=0.01)
```

The three results respectively for the same randomly generated weights:

```
# ours
tensor([[ 0.3807594, -0.3912092],
        [ 0.0762539,  0.5377805],
        [ 0.2459473,  0.4662207]])
# pytorch-optimizer
tensor([[ 0.3807592, -0.3912172],
        [ 0.0762507,  0.5377818],
        [ 0.2459457,  0.4662213]])
# keras
array([[ 0.38076326, -0.39121315],
       [ 0.0762547 ,  0.5377859 ],
       [ 0.24594972,  0.46622536]], dtype=float32)
```

This gives me confidence to move forward in speeding up the implementation now that a baseline has been established.

If you're curious about differences:
* keras assigns step_size (rho_t in their code) to `min(lr, 1 / sqrt(step))` whereas the OG impl uses a hardcoded 0.01 instead of lr. We do the same thing as keras, but our lr default is 0.01.
* We differ from the pytorch-optimizers default in that our default will not track momentum (thus `beta1=0`) and we do not apply parameter scaling.

<details>

Keras colab: https://colab.research.google.com/drive/1i3xF8ChL7TWKJGV_5v_5nMhXKnYmQQ06?usp=sharing

My script repro:

```
import torch
from pytorch_optimizer import AdaFactor

torch.set_printoptions(precision=7)

weight = torch.tensor([[ 0.37697506, -0.39500135],
                       [ 0.07246649,  0.53399765],
                       [ 0.24216151,  0.46243715]], dtype=torch.float32)
# bias = torch.tensor([0, 0], dtype=torch.float32)

weight.grad = torch.tensor([[-0.5940447, -0.7743838],
                            [-0.5940447, -0.7743838],
                            [-0.5940447, -0.7743838]], dtype=torch.float32)
# bias.grad = torch.tensor([-2.5027974, 1.5422692], dtype=torch.float32)

weight_c = weight.clone()
weight_c.grad = weight.grad.clone()

optim = torch.optim.Adafactor([weight])
optim.step()
print(weight)

optim_c = AdaFactor([weight_c], betas=(0, 0.999), scale_parameter=False)
optim_c.step()
print(weight_c)
```

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129905
Approved by: https://github.com/albanD	2024-07-25 13:17:19 +00:00
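The step-size rule quoted in the #129905 entry, `min(lr, 1 / sqrt(step))`, can be worked through numerically. This is only a sketch of that one formula as the entry states it, not the optimizer's implementation; the function name `relative_step_size` is hypothetical, and `lr=0.01` is the default the entry mentions.

```python
import math


def relative_step_size(lr: float, step: int) -> float:
    """Step size rho_t = min(lr, 1 / sqrt(step)), as stated in the commit message.

    Hypothetical helper for illustration; `step` is assumed to start at 1.
    """
    return min(lr, 1.0 / math.sqrt(step))


# With lr = 0.01, the cap applies until 1 / sqrt(step) falls below lr,
# which happens once step exceeds 1 / lr**2 = 10,000.
for step in (1, 100, 10_000, 1_000_000):
    print(step, relative_step_size(0.01, step))
```

Under this rule, `lr` acts as an upper bound on the step size in early training, and the `1 / sqrt(step)` decay takes over later.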
Jane Xu	4c7f22dee2	[BE] remove unnecessary _dispatch_sqrt by using ** 0.5 (#131358) Based on the discussion here where ** 0.5 is not slower than math.sqrt. https://github.com/pytorch/pytorch/pull/129905#discussion_r1675605075 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131358 Approved by: https://github.com/albanD	2024-07-24 14:58:57 +00:00
hxwang	276b5238ef	[bug] Add is_compiling check for optimizers to avoid untracked tensor during graph tracing (#130909 ) Hey folks, I was using the `stateless_func` [here](`7c45476d38/torch/distributed/_spmd/api.py (L435)`), which worked well before [this commit](https://github.com/pytorch/pytorch/pull/111084) but then introduced a `_tensor_constant0` and made this func non-stateless. Since there is no way to retrieve this constant tensor before compilation and performance is not an issue when tracing a graph, I think it might be good to fall back to the other branch. ![image](https://github.com/user-attachments/assets/6ee4487d-456b-47e0-8c1d-66cb5a641d47) ![image](https://github.com/user-attachments/assets/1ed46502-e50e-45c4-9751-49aa5a4590ae) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130909 Approved by: https://github.com/mlazos	2024-07-24 08:29:27 +00:00
Aaron Orenstein	5a0068cc69	[BE] mypy: disallow untyped decorators (#131428 ) Untyped decorators strip the types from their decorated function so even if the underlying function is fully typed then callers to it don't get any benefit from type annotations. Step 1 - Enable the error and override in all the offending files. #131429 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131428 Approved by: https://github.com/justinchuby, https://github.com/oulgen	2024-07-23 21:50:55 +00:00
Li-Huai (Allan) Lin	99d9b369f4	[Optim] Support tensor lr for all optimizers and check it is 1-element (#131065 ) Fixes: #130980 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131065 Approved by: https://github.com/janeyx99	2024-07-23 04:27:05 +00:00
Tianyi Tao	3477ee38e4	fix the use of initial learning rate in the OneCycleLR example (#130306 ) Fixes #127649 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130306 Approved by: https://github.com/janeyx99	2024-07-09 18:58:07 +00:00
Li-Huai (Allan) Lin	8ec5ba960f	[MPS] Add tensor_lr overloads to fused adam & adamw (#129451 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129451 Approved by: https://github.com/janeyx99	2024-07-02 19:46:30 +00:00
Michael Lazos	aa7ea6b45c	Add wraps back (#129933 ) Fixes https://github.com/pytorch/pytorch/issues/129922 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129933 Approved by: https://github.com/eqy, https://github.com/janeyx99	2024-07-02 18:24:02 +00:00
Sahdev Zala	9795dba1e0	Optim package docstring fix (#129086) Fix docstrings in various files in the optim package. This is the last remaining fix for issue #112593. The fix can be verified by running `pydocstyle path-to-file --count`. Fixes #112593 Related #128248 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129086 Approved by: https://github.com/janeyx99	2024-06-21 14:30:53 +00:00
Li-Huai (Allan) Lin	9a7e2519d3	[MPS] Fused Adam & AdamW (#127242)

Summary: This PR adds fused Adam and AdamW implementations.

Benchmark on Macbook Pro with M1 Max chip and 64GB unified memory:

Fast math enabled:

```
[---------------------------------------------- Fused Adam ----------------------------------------------]
                                                                     | Fused: True | Fused: False
1 threads: -----------------------------------------------------------------------------------------------
amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 100 | 10 | 100
amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 100 | 9 | 89
amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 100 | 9 | 90
amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 100 | 9 | 83
amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 100 | 12 | 94
amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 100 | 11 | 88
amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 100 | 12 | 90
amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 100 | 11 | 100
amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 100 | 27 | 100
amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 100 | 23 | 100
amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 100 | 27 | 100
amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 100 | 23 | 98
amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 500 | 82 | 480
amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 500 | 72 | 450
amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 500 | 82 | 450
amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 500 | 73 | 420
amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 500 | 91 | 500
amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 500 | 83 | 400
amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 500 | 94 | 500
amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 500 | 78 | 400
amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 500 | 170 | 500
amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 500 | 140 | 600
amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 500 | 170 | 600
amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 500 | 140 | 500
amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 1000 | 250 | 890
amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 1000 | 220 | 850
amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 1000 | 250 | 830
amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 1000 | 220 | 770
amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 1000 | 270 | 870
amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 1000 | 230 | 840
amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 1000 | 270 | 810
amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 1000 | 240 | 800
amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 1000 | 400 | 1000
amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 1000 | 360 | 2000
amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 1000 | 430 | 2000
amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 1000 | 360 | 1300

Times are in milliseconds (ms).
```

Fast math disabled:

```
[---------------------------------------------- Fused Adam ----------------------------------------------]
                                                                     | Fused: True | Fused: False
1 threads: -----------------------------------------------------------------------------------------------
amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 100 | 10 | 100
amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 100 | 9 | 84
amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 100 | 9 | 84
amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 100 | 9 | 79
amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 100 | 11 | 93
amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 100 | 10 | 90
amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 100 | 11 | 91
amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 100 | 11 | 81
amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 100 | 34 | 100
amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 100 | 31 | 100
amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 100 | 34 | 95
amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 100 | 31 | 100
amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 500 | 94 | 500
amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 500 | 82 | 430
amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 500 | 92 | 430
amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 500 | 81 | 390
amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 500 | 98 | 500
amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 500 | 88 | 430
amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 500 | 100 | 500
amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 500 | 88 | 400
amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 500 | 210 | 500
amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 500 | 190 | 610
amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 500 | 210 | 510
amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 500 | 190 | 500
amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 1000 | 300 | 900
amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 1000 | 260 | 850
amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 1000 | 295 | 900
amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 1000 | 260 | 800
amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 1000 | 320 | 910
amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 1000 | 280 | 900
amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 1000 | 320 | 900
amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 1000 | 300 | 900
amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 1000 | 500 | 2000
amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 1000 | 480 | 2000
amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 1000 | 540 | 1500
amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 1000 | 480 | 1200

Times are in milliseconds (ms).
```

```python
def profile_fused_adam():
    import itertools

    import torch  # needed for torch.mps, torch.arange and torch.tensor below
    import torch.utils.benchmark as benchmark
    from torch.optim import adam, adamw

    def profile(fn, params, grads, exp_avgs, exp_avg_sqs, max_exp_avg_sqs, state_steps, amsgrad, fused):
        fn(
            params,
            grads,
            exp_avgs,
            exp_avg_sqs,
            max_exp_avg_sqs,
            state_steps,
            foreach=False,
            capturable=False,
            fused=fused,
            amsgrad=amsgrad,
            beta1=0.9,
            beta2=0.99,
            lr=1e-3,
            weight_decay=.0,
            eps=1e-5,
            maximize=False,
            grad_scale=None,
            found_inf=None,
        )
        # Wait for the MPS work to finish so the timing covers the whole step.
        torch.mps.synchronize()

    device = "mps"

    results = []

    for num_tensors, numel, adamWflag, amsgrad in itertools.product([100, 500, 1000], [1024, 65536, 1048576], [True, False], [True, False]):
        print(f"amsgrad: {amsgrad}, adamWflag: {adamWflag}, numel: {numel}, num_tensors: {num_tensors}")
        params, grads, exp_avgs, exp_avg_sqs = [[torch.arange(numel, dtype=torch.float32, device=device) + (numel * i) for i in range(num_tensors)] for _ in range(4)]
        max_exp_avg_sqs = [torch.arange(numel, dtype=torch.float32, device=device) for _ in range(num_tensors)] if amsgrad else []
        state_steps = [torch.tensor([5], dtype=torch.float32, device=device) for _ in range(num_tensors)]
        if adamWflag:
            fn = adamw.adamw
        else:
            fn = adam.adam

        for fused in [True, False]:
            t = benchmark.Timer(
                stmt='profile(fn, params, grads, exp_avgs, exp_avg_sqs, max_exp_avg_sqs, state_steps, amsgrad, fused)',
                label='Fused Adam',
                sub_label=f"amsgrad: {amsgrad}, adamWflag: {adamWflag}, numel: {numel}, num_tensors: {num_tensors}",
                globals=locals(),
                description=f"Fused: {fused}",
            ).blocked_autorange(min_run_time=5)
            results.append(t)

    compare = benchmark.Compare(results)
    compare.trim_significant_figures()
    compare.colorize(rowwise=True)
    compare.print()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127242
Approved by: https://github.com/kulinseth, https://github.com/janeyx99	2024-06-18 19:59:50 +00:00
Ahmed Gheith	3dd5f0ecbb	Remove circular import (#128875 ) Summary: A spurious import is causing circular dependency errors Test Plan: phabricator signals Differential Revision: D58685676 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128875 Approved by: https://github.com/kit1980	2024-06-18 12:30:13 +00:00
Sahdev Zala	4ccbf711e2	Learning Rate Scheduler docstring fix (#128679 ) Fix docstrings in Learning Rate Scheduler. The fix can be verified by running pydocstyle path-to-file --count Related #112593 BEFORE the PR: pydocstyle torch/optim/lr_scheduler.py --count  92  AFTER the PR: pydocstyle torch/optim/lr_scheduler.py --count  0 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128679 Approved by: https://github.com/janeyx99	2024-06-15 05:30:35 +00:00
Xuehai Pan	dcc0093dba	[BE][Easy] export explicitly imported public submodules (#127703 ) Add top-level submodules `torch.{storage,serialization,functional,amp,overrides,types}` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127703 Approved by: https://github.com/ezyang	2024-06-12 05:52:18 +00:00
PyTorch MergeBot	90bb510ece	Revert "Deprecate `torch._utils.is_compiling()` and `torch._dynamo.external_utils.is_compiling()` (#127690 )" This reverts commit `348b181a97`. Reverted https://github.com/pytorch/pytorch/pull/127690 on behalf of https://github.com/clee2000 due to sorry I think https://github.com/pytorch/pytorch/pull/126898#issuecomment-2142884456 is still relevant, I will reach out to them to see what needs to be done in internal to get this remerged ([comment](https://github.com/pytorch/pytorch/pull/127690#issuecomment-2159248859))	2024-06-10 20:44:42 +00:00
Aaron Orenstein	27f9d3b0a1	Flip default value for mypy disallow_untyped_defs [8/11] (#127845 ) See #127836 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127845 Approved by: https://github.com/oulgen ghstack dependencies: #127842, #127843, #127844	2024-06-08 18:49:56 +00:00
Xuehai Pan	348b181a97	Deprecate `torch._utils.is_compiling()` and `torch._dynamo.external_utils.is_compiling()` (#127690 ) This PR is split from PR #126898. - #126898 ------ Pull Request resolved: https://github.com/pytorch/pytorch/pull/127690 Approved by: https://github.com/Skylion007	2024-06-08 15:25:03 +00:00
joncrall	80fa2778ed	Update types for verbose in lr_scheduler (#127943 ) I'm currently locked into jsonargparse version 4.19.0, and it complains when used in combination with LightningCLI (v2.0.8). This is because it cares about the types declared in google style docstrings. This causes a problem when it tries to parse how it should cast arguments to construct an instance of an LRScheduler class because the docstrings declare the "verbose" parameter as a bool, but the defaults recently changed to a string "deprecated". This means the type should really be `bool \| str`. This PR adds a `\| str` to the docstring type in each learning rate scheduler class. This will prevent jsonargparse from complaining. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127943 Approved by: https://github.com/janeyx99	2024-06-06 21:59:22 +00:00
Adam J. Stewart	80d34217c6	Typo fixes: et al. (#127811 ) "et al." is short for _et alia_ and should be abbreviated with a period on the second word. Noticed this typo when reading through the SGD docs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127811 Approved by: https://github.com/janeyx99	2024-06-06 01:03:25 +00:00
GdoongMathew	3437177e2b	Quick Fix on #126854 , deepcopy `lr` and other possible `base_parameters` (#127190 ) * Apply `deepcopy` to every base parameters (`initial_lr`, `max_lr`) when instantiating `LRScheduler`. Fixes #126854 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127190 Approved by: https://github.com/janeyx99	2024-06-03 18:06:31 +00:00
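The aliasing problem that a `deepcopy` of `initial_lr` guards against can be reproduced without PyTorch. The sketch below assumes the learning rate is a mutable object that is updated in place; `MutableLR` is a hypothetical stand-in for such a value and the dictionaries only mimic parameter groups.

```python
import copy


class MutableLR:
    # Hypothetical mutable learning-rate holder (stand-in for a tensor LR).
    def __init__(self, value):
        self.value = value

    def scale_(self, factor):
        # In-place update, as an optimizer or scheduler might do.
        self.value *= factor
        return self


# Storing the base value by reference: it silently follows later updates.
group_aliased = {"lr": MutableLR(0.1)}
group_aliased["initial_lr"] = group_aliased["lr"]
group_aliased["lr"].scale_(0.5)

# Storing a deep copy: the base value stays fixed.
group_copied = {"lr": MutableLR(0.1)}
group_copied["initial_lr"] = copy.deepcopy(group_copied["lr"])
group_copied["lr"].scale_(0.5)
```

After the in-place update, the aliased `initial_lr` has changed along with `lr`, while the deep-copied one still holds the starting value.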
PyTorch MergeBot	033e733021	Revert "[BE] wrap deprecated function/class with `typing_extensions.deprecated` (#126898 )" This reverts commit `749a132fb0`. Reverted https://github.com/pytorch/pytorch/pull/126898 on behalf of https://github.com/fbgheith due to switching typing-extensions=4.3.0 to 4.9.0 causes internal failure ([comment](https://github.com/pytorch/pytorch/pull/126898#issuecomment-2142884456))	2024-05-31 19:47:24 +00:00
feifan	da9fb670d2	Nadam support the flag for "maximize" (#127214 ) Fixes https://github.com/pytorch/pytorch/issues/126642 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127214 Approved by: https://github.com/janeyx99	2024-05-31 01:11:16 +00:00
Xuehai Pan	749a132fb0	[BE] wrap deprecated function/class with `typing_extensions.deprecated` (#126898 ) Use `typing_extensions.deprecated` for deprecation annotation if possible. Otherwise, add `category=FutureWarning` to `warnings.warn("message")` if the category is missing. Note that only warnings that their messages contain `[Dd]eprecat(ed\|ion)` are updated in this PR. UPDATE: Use `FutureWarning` instead of `DeprecationWarning`. Resolves #126888 - #126888 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126898 Approved by: https://github.com/albanD	2024-05-29 12:09:27 +00:00
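The category change described in #126898 can be illustrated with the standard `warnings` module alone. This is a minimal sketch, not PyTorch code: `old_api` and `collect_categories` are hypothetical names, and the point is only that passing `category=FutureWarning` makes the warning class explicit instead of falling back to the default `UserWarning`.

```python
import warnings


def old_api():
    # Hypothetical deprecated function, used only for illustration.
    warnings.warn(
        "old_api() is deprecated, use new_api() instead",
        category=FutureWarning,
        stacklevel=2,
    )
    return 42


def collect_categories(fn):
    # Record every warning raised while calling fn and return their classes.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        fn()
    return [w.category for w in caught]


categories = collect_categories(old_api)
```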
feifan	22712ba5c5	Radam support the flag for "maximize" (#126765 ) Fixes #[126642](https://github.com/pytorch/pytorch/issues/126642) I reference the maximize in `Adam` and add `Radam's` maximize flag. If this pr is OK, I will add another pr for `Nadam`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126765 Approved by: https://github.com/janeyx99	2024-05-27 06:34:50 +00:00
Xuehai Pan	ba3b05fdf3	[1/N][Easy] fix typo for `usort` config in `pyproject.toml` (`kown` -> `known`): sort stdlib (#127122 ) The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in the PR. Except for `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127122 Approved by: https://github.com/kit1980	2024-05-25 08:25:50 +00:00
Jane Xu	665637714f	Remove SparseAdam weird allowance of raw Tensor input (#127081 ) This continues the full deprecation after https://github.com/pytorch/pytorch/pull/114425. It's been 6 months! And I'm fairly certain no one is going to yell at me as this patch is not really used.
------
# BC Breaking note
As of this PR, SparseAdam will become consistent with the rest of our optimizers in that it will only accept containers of Tensors/Parameters/param groups and fully complete deprecation of this path. Hitherto, the SparseAdam constructor had allowed raw tensors as the params argument to the constructor. Now, if you write the following code, there will be an error similar to every other optim: "params argument given to the optimizer should be an iterable of Tensors or dicts"
```
import torch
param = torch.rand(16, 32)
optimizer = torch.optim.SparseAdam(param)
```
Instead you should replace the last line with
```
optimizer = torch.optim.SparseAdam([param])
```
to no longer error. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127081 Approved by: https://github.com/soulitzer	2024-05-25 02:58:24 +00:00
Aart Bik	ff82e2e7cf	[traced-graph][sparse] propagate sparsity metadata into traced graph (#117907 ) Propagate sparsity metadata from sparse tensors of torch.sparse into the traced graph representation (which would be useful for a JIT backend that supports a "sparse compiler"). This is a first careful attempt, since the actual "meta" feature still seems incomplete for coo and completely lacking for csr/csc/bsr/bsc. For background see forum postings (with examples): https://discuss.pytorch.org/t/connecting-pytorch-sparse-tensors-with-mlir/195145 https://dev-discuss.pytorch.org/t/connecting-pytorch-sparse-tensors-with-mlir/1803 And feature request: https://github.com/pytorch/pytorch/issues/117188 Pull Request resolved: https://github.com/pytorch/pytorch/pull/117907 Approved by: https://github.com/pearu, https://github.com/ezyang	2024-05-23 22:46:46 +00:00
David Chiu	7e166e8057	[optim] Fix: wrong ASGD implementation (#126375 ) This PR is based on #125440, additionally merging the latest main branch and fixing the lint failures from #126361. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126375 Approved by: https://github.com/janeyx99	2024-05-17 15:46:39 +00:00
PyTorch MergeBot	e3c5d1b7d7	Revert "[optim] Fix: wrong ASGD implementation (#125440 )" This reverts commit `2c5ad9a3d7`. Reverted https://github.com/pytorch/pytorch/pull/125440 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it looks like there is a linter failure coming from this change ([comment](https://github.com/pytorch/pytorch/pull/125440#issuecomment-2113833108))	2024-05-16 02:12:29 +00:00
haozhe.zhu	f9d107af66	[optim] add fused_adagrad support for CPU device (#124905 ) Support fused_sgd_kernel support for CPU. ## Bench result: 32 core/sockets ICX Test Scripts: https://gist.github.com/zhuhaozhe/79e842e0a6e25d6d7fa1e4598807272c https://gist.github.com/zhuhaozhe/b4c6998a509dcea1796dd05b3005c969 ``` Tensor Size: 262144, Num Tensor 4, Num Threads: 1 _single_tensor_adagrad time: 0.2500 seconds _fused_adagrad time: 0.0933 seconds Tensor Size: 4194304, Num Tensor 32, Num Threads: 32 _single_tensor_adagrad time: 2.8819 seconds _fused_adagrad time: 1.7591 seconds ``` ## Test Plan: ``` python test_optim.py -k test_fused_matches_forloop python test_optim.py -k test_fused_large_tensor python test_optim.py -k test_can_load_older_state_dict python test_optim.py -k test_grad_scaling_autocast_fused_optimizers python test_torch.py -k test_grad_scaling_autocast_fused python test_torch.py -k test_params_invalidated_with_grads_invalidated_between_unscale_and_step ``` Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124905 Approved by: https://github.com/jgong5, https://github.com/janeyx99	2024-05-16 01:11:51 +00:00
David Chiu	2c5ad9a3d7	[optim] Fix: wrong ASGD implementation (#125440 ) > previous: Originally, the variables `new_eta` and `new_mu` would be constructed `len(grouped_mus)` times, but each of their values is the same and won't be changed. Therefore, it can be simplified using Python list multiplication, which only constructs one tensor. - [X] Ill assumption that every param will have the same step. - [x] Different implementation between `foreach=True` and `foreach=False`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125440 Approved by: https://github.com/janeyx99	2024-05-15 22:52:15 +00:00
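The list-multiplication remark in #125440 relies on ordinary Python semantics: `[x] * n` builds a list of `n` references to one object rather than `n` separate objects. A minimal sketch, using a hypothetical `Box` class in place of a tensor:

```python
class Box:
    # Hypothetical stand-in for a tensor holding a single value.
    def __init__(self, value):
        self.value = value


n = 4

# One construction per element: n distinct objects.
separate = [Box(0.5) for _ in range(n)]

# List multiplication: one object, referenced n times.
shared = [Box(0.5)] * n

distinct_separate = len({id(b) for b in separate})
distinct_shared = len({id(b) for b in shared})
```

Sharing one object this way is only safe when the value is never modified in place, which matches the commit's remark that the values "won't be changed".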
David Chiu	1a28f731dc	[optim] Merge the pyi files into py files of optimizer (#125452 ) Continue the work of pytorch/pytorch#125153 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125452 Approved by: https://github.com/janeyx99	2024-05-14 18:24:50 +00:00
David Chiu	9641a8db25	[optim] deprecate `LRScheduler.print_lr` (#126105 ) Fixes #99270 Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126105 Approved by: https://github.com/janeyx99	2024-05-14 14:13:03 +00:00
daitian1995	b805d3cbcb	Modify device check in capturable optimizer to support more devices (#124919 ) Fixes #124830 Modify the device check in the capturable optimizer to support more devices. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124919 Approved by: https://github.com/janeyx99	2024-05-14 05:56:00 +00:00
PyTorch MergeBot	bd3cbdba2f	Revert "[optim] add fused_adagrad support for CPU device (#124905 )" This reverts commit `1c3fe84033`. Reverted https://github.com/pytorch/pytorch/pull/124905 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing distributed multigpu test in trunk `1c3fe84033` ([comment](https://github.com/pytorch/pytorch/pull/124905#issuecomment-2108777063))	2024-05-13 20:53:22 +00:00
haozhe.zhu	1c3fe84033	[optim] add fused_adagrad support for CPU device (#124905 ) Support fused_sgd_kernel support for CPU. ## Bench result: 32 core/sockets ICX Test Scripts: https://gist.github.com/zhuhaozhe/79e842e0a6e25d6d7fa1e4598807272c https://gist.github.com/zhuhaozhe/b4c6998a509dcea1796dd05b3005c969 ``` Tensor Size: 262144, Num Tensor 4, Num Threads: 1 _single_tensor_adagrad time: 0.2500 seconds _fused_adagrad time: 0.0933 seconds Tensor Size: 4194304, Num Tensor 32, Num Threads: 32 _single_tensor_adagrad time: 2.8819 seconds _fused_adagrad time: 1.7591 seconds ``` ## Test Plan: ``` python test_optim.py -k test_fused_matches_forloop python test_optim.py -k test_fused_large_tensor python test_optim.py -k test_can_load_older_state_dict python test_optim.py -k test_grad_scaling_autocast_fused_optimizers python test_torch.py -k test_grad_scaling_autocast_fused python test_torch.py -k test_params_invalidated_with_grads_invalidated_between_unscale_and_step ``` Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124905 Approved by: https://github.com/jgong5, https://github.com/janeyx99	2024-05-13 01:16:20 +00:00
Michael Lazos	b833fc0ecb	Tighten fallback conditions for compiled optim (#125825 ) Since we now will support `capturable=False` when it's valid, narrow the eager fallback conditions to the cases where `compile` will fail. The lone case here is when the user deletes the capturable flag; `state_steps` are on cuda and `capturable` is `False`. Because a cuda tensor is not supported in the `value` kwarg for foreach ops this results in an error. The fallback wrapper is changed to check the device of `state_steps` if `capturable=False`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125825 Approved by: https://github.com/janeyx99	2024-05-11 06:29:51 +00:00
David Chiu	31946c10d0	Add missing parameter doc of Adagrad (#125886 ) Add the missing documentation for `initial_accumulator_value` parameter in Adagrad, and update the algorithm description in the documentation (adjusted to reflect the implementation). Pull Request resolved: https://github.com/pytorch/pytorch/pull/125886 Approved by: https://github.com/janeyx99	2024-05-10 22:55:22 +00:00
David Chiu	c520929c83	add typing in torch.optim.lr_scheduler (#125556 ) Merge torch/optim/lr_scheduler.pyi into torch/optim/lr_scheduler.py Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/125556 Approved by: https://github.com/janeyx99	2024-05-10 19:28:00 +00:00
Michael Lazos	69eeef0727	Update LRScheduler to handle tensor LR (#123753 ) Enables LRScheduler to handle tensor LRs. Note on test changes: For the test modifications I just removed itertools.product and created two loops. This allows us to create a new set of optim_inputs on each iteration to prevent mutations on the tensor LR carrying over across iterations. Nothing else in those tests was modified. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123753 Approved by: https://github.com/janeyx99 ghstack dependencies: #123751, #123752	2024-05-09 00:52:43 +00:00
Michael Lazos	7b36b4a765	Fix user warning for tensor LR (#123752 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123752 Approved by: https://github.com/janeyx99 ghstack dependencies: #123751	2024-05-09 00:52:43 +00:00
Michael Lazos	0ea6ffc613	Swap warning counter to flag in LRScheduler (#123751 ) This was a counter previously, this should be a flag to indicate whether or not the optimizer step has been called. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123751 Approved by: https://github.com/janeyx99	2024-05-09 00:52:43 +00:00
Michael Lazos	0f02e0aa39	Disable dynamo on functional optims if capturable=False (#123619 ) This resolves a bug in eager where if an old state dict is loaded (without the capturable flag) but the original dict had the capturable flag, then state_steps would be on cuda but we would take the non-capturable path. We now fall back to eager if capturable=False. Current design doc and discussion: https://docs.google.com/document/d/1DmmbiaSp16CDZtGw1qzXKHFTY_0gqc0xpnBdviXq0vk/edit#heading=h.871u7bvwz7ze Note on the actual fallback logic - there was an issue with torchscript originally not handling `*args, **kwargs` properly; after rectifying that by using `functools.wraps`, there was an additional bug with scoping which required the single tensor implementation to be in the global scope at the time of the fallback closure being created. I pass in the single tensor function to the `_disable_dynamo_if_unsupported` decorator to work around this bug. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123619 Approved by: https://github.com/janeyx99	2024-05-07 22:17:01 +00:00
David Chiu	a60fa960e5	refactor: extract `get_lr` warning (#125545 ) Extract the `_get_lr_called_within_step` checking in the `get_lr()` of every LRSchedulers. Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/125545 Approved by: https://github.com/janeyx99	2024-05-07 03:15:58 +00:00
David Chiu	b1b03992d0	Merge the pyi files into py files of optimizer (#125153 ) Merge the interfaces in pyi files into py files in `torch/optim`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125153 Approved by: https://github.com/janeyx99	2024-05-02 21:29:31 +00:00
Michael Lazos	787afc5180	Add LR as tensor tests (#123750 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123750 Approved by: https://github.com/janeyx99	2024-05-01 04:46:49 +00:00
Alex Morehead	9aed5dcfe6	Clarify wording in docstring for `CosineAnnealingWarmRestarts` within `lr_scheduler.py` (#125161 ) - Clarifies wording in the docstring for `CosineAnnealingWarmRestarts` within `lr_scheduler.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/125161 Approved by: https://github.com/janeyx99	2024-04-30 14:01:22 +00:00
haozhe.zhu	3c964ad1ca	add fused_sgd_kernel support for CPU device (#123629 ) Support fused_sgd_kernel support for CPU. ## Bench result: 32 core/sockets ICX Test Scripts: https://gist.github.com/zhuhaozhe/688763e17e93e4c5e12f25f676ec90d9 https://gist.github.com/zhuhaozhe/ad9938694bc7fae8b66d376f4dffc6c9 ``` Tensor Size: 262144, Num Tensor 4, Num Threads: 1 _single_tensor_sgd time: 0.2301 seconds _fused_sgd time: 0.0925 seconds Tensor Size: 4194304, Num Tensor 32, Num Threads: 32 _single_tensor_sgd time: 2.6195 seconds _fused_sgd time: 1.7543 seconds ``` ## Test Plan: ``` python test_optim.py -k test_fused_matches_forloop python test_optim.py -k test_fused_large_tensor python test_optim.py -k test_can_load_older_state_dict python test_optim.py -k test_grad_scaling_autocast_fused_optimizers python test_torch.py -k test_grad_scaling_autocast_fused python test_torch.py -k test_params_invalidated_with_grads_invalidated_between_unscale_and_step ``` Looks like we already have some PRs under this issue https://github.com/pytorch/pytorch/issues/123451 to unified the UTs, I did not modified UT in this PR. Co-authored-by: Jane Xu <janeyx@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/123629 Approved by: https://github.com/jgong5, https://github.com/janeyx99	2024-04-23 08:28:19 +00:00
GdoongMathew	8b1ad51881	Better Error Message in `ChainedScheduler` and `SequentialLR` (#121633 ) Fixes #121577 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121633 Approved by: https://github.com/janeyx99	2024-04-19 13:37:41 +00:00
Jane Xu	b412b75b42	[optim] add fused_adam/adamw_kernel support for CPU device (#123074 ) On par with `CUDA` implementation. For `autocast` logic, same with `CUDA` + `Fused Adam`: - check inf in `gradscalar.step` - In fused kernel, if there is `inf`, do nothing. If not, unscale the grad ( also write back) and update the param. TestPlan: ``` # extend CUDA only test for CPU fused adagrad python test_optim.py -k test_fused_matches_forloop python test_optim.py -k test_fused_large_tensor python test_torch.py -k test_grad_scaling_autocast_fused # extend fused test python test_torch.py -k test_params_invalidated_with_grads_invalidated_between_unscale_and_step python test_optim.py -k test_can_load_older_state_dict # newly added test (follow `6b1f13ea2f/test/test_cuda.py (L1108)`) python test_optim.py -k test_grad_scaling_autocast_fused_optimizers ``` Benchmark: 5.1x on 56 core SPR Parameter-size=1M Nparams=10 [test script](https://gist.github.com/zhuhaozhe/ef9a290ad3f8f4067b3373a3bdaa33e7) ``` numactl -C 0-55 -m 0 python bench_adam.py non-fused 6.0174267292022705 s fused 1.1787631511688232 s ``` Note: Fused kernel accuracy The accuracy failure in CI shows a little higher than default tolerance ``` 2024-04-02T06:09:16.2213887Z Mismatched elements: 21 / 64 (32.8%) 2024-04-02T06:09:16.2214339Z Greatest absolute difference: 1.5735626220703125e-05 at index (6, 6) (up to 1e-05 allowed) 2024-04-02T06:09:16.2214813Z Greatest relative difference: 1.0073336852656212e-05 at index (4, 1) (up to 1.3e-06 allowed) ``` I have debug it step by step and unfortunately we may not able to make the `fused kernel` exactly same with `non fused` one due to compiler optimizations. 
For example, in non-fused impl ``` exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj(), value=1 - beta2) ``` and in fused impl ``` exp_avg_sq_ptr[d] = scalar_t(beta2) * exp_avg_sq_ptr[d]; // std::cout << "exp_avg_sq " << exp_avg_sq_ptr[d] << std::endl; exp_avg_sq_ptr[d] = exp_avg_sq_ptr[d] + scalar_t(exp_avg_sq_grad_coefficient) * grad_val * grad_val; ``` If I keep `std::cout`, I can get exactly same results in UT ``` ===============param 0.6796758770942688 0.6796758770942688 ``` But when I comment out it, there will be a difference ``` ===============param 0.6796758770942688 0.6796759366989136 ``` So I will make the tolerance a little higher than default one. Co-authored-by: Jane Xu <janeyx@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/123074 Approved by: https://github.com/jgong5, https://github.com/janeyx99	2024-04-19 11:14:04 +00:00
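The last-digit mismatch reported in #123074 is the kind of difference that appears when the same arithmetic is rounded at different points. The sketch below is only an illustration under an assumption: the commit attributes the mismatch to "compiler optimizations" without naming one, and fused multiply-add contraction is used here as one plausible example. It emulates float32 in pure Python; `f32`, `mul_then_add`, and `fused_mul_add` are hypothetical helpers, not part of PyTorch.

```python
import struct


def f32(x):
    # Round a Python float (float64) to the nearest float32 value.
    return struct.unpack("f", struct.pack("f", x))[0]


def mul_then_add(a, b, c):
    # Round after the multiply and again after the add.
    return f32(f32(a * b) + c)


def fused_mul_add(a, b, c):
    # Keep the product unrounded and round once at the end,
    # emulating a fused multiply-add for these inputs.
    return f32(a * b + c)


# Inputs chosen so that the exact product lies halfway between two float32 values.
a = f32(1 + 2**-12)
b = a
c = f32(-(1 + 2**-11))

two_step = mul_then_add(a, b, c)
one_step = fused_mul_add(a, b, c)
```

For these inputs the two-step result is `0.0` while the single-rounding result is `2**-24`, so neither path is wrong; they simply round at different places, which is why a slightly looser tolerance is a reasonable response.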
Michael Lazos	57a3dc56d4	Small Adamax fix (#123498 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123498 Approved by: https://github.com/janeyx99	2024-04-18 00:50:03 +00:00
Mikayla Gawarecki	383d2d1f6c	Add testing and fix issues for weights_only load for LRScheduler (#123775 ) Fixes https://github.com/pytorch/pytorch/issues/98921 There were two issues detected: - `MultiStepLR`: issue is described in https://github.com/pytorch/pytorch/issues/98921, this is resolved by allowlisting `collections.Counter` - `OneCycleLR`: `state_dict['anneal_func']` is either `<function OneCycleLR._annealing_cos at 0x7f364186f5b0>` or `<function OneCycleLR._annealing_linear at 0x7f39aa483640>` depending on the `anneal_func` kwarg. This leads to `WeightsUnpickler error: Unsupported class __builtin__.getattr` from the `weights_only` Unpickler. Fixed the above in a BC-compatible manner by adding `OneCycleLR._anneal_func_type` as a string attribute and removing `OneCycleLR.anneal_func` Pull Request resolved: https://github.com/pytorch/pytorch/pull/123775 Approved by: https://github.com/albanD, https://github.com/malfet	2024-04-16 20:29:27 +00:00
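The second fix in #123775 replaces a stored function object with a string that names it. A minimal sketch of that pattern follows, using a hypothetical `TinyScheduler` class rather than the real scheduler; the constructor argument, the `anneal` method, and the state layout are assumptions for illustration.

```python
import json
import math


class TinyScheduler:
    # Hypothetical scheduler that records *which* annealing function to use
    # as a plain string, so its state contains no function objects.
    def __init__(self, anneal_strategy="cos"):
        if anneal_strategy not in ("cos", "linear"):
            raise ValueError(f"unknown anneal_strategy: {anneal_strategy!r}")
        self._anneal_func_type = anneal_strategy

    @staticmethod
    def _annealing_cos(start, end, pct):
        # Cosine interpolation from start (pct=0) to end (pct=1).
        return end + (start - end) / 2.0 * (math.cos(math.pi * pct) + 1)

    @staticmethod
    def _annealing_linear(start, end, pct):
        # Linear interpolation from start (pct=0) to end (pct=1).
        return (end - start) * pct + start

    def anneal(self, start, end, pct):
        # Resolve the function from the stored string at call time.
        if self._anneal_func_type == "cos":
            return self._annealing_cos(start, end, pct)
        return self._annealing_linear(start, end, pct)

    def state_dict(self):
        return dict(self.__dict__)


sched = TinyScheduler("linear")
state_json = json.dumps(sched.state_dict())
midpoint = sched.anneal(1.0, 0.0, 0.5)
```

Because the state holds only a string, it survives any serializer that accepts plain data (JSON here stands in for a restricted unpickler), and the function is looked up again when it is needed.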
FFFrog	791e5db705	Part 3: UFMT fix the rest files in torch/optim due to the pr-sanity-checks (#124055 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124055 Approved by: https://github.com/ezyang ghstack dependencies: #124048, #124053, #124054	2024-04-16 03:22:39 +00:00
FFFrog	ac74a6783b	Part 2: UFMT fix 2 files in torch/optim due to the pr-sanity-checks (#124054 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124054 Approved by: https://github.com/ezyang ghstack dependencies: #124048, #124053	2024-04-16 03:20:21 +00:00
FFFrog	560efaa471	Part 1: UFMT partial files in torch/optim due to the pr-sanity-checks (#124053 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124053 Approved by: https://github.com/ezyang ghstack dependencies: #124048	2024-04-16 03:17:18 +00:00
FFFrog	f30704f5f3	add preparatory work for torch/optim/lr_scheduler.py (#124048 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124048 Approved by: https://github.com/albanD	2024-04-16 03:17:18 +00:00
David Chiu	ab647bd325	Add missing interfaces of `torch.optim.swa_utils` (#117036 ) Add type hints for the function/class interfaces that appear in torch/optim/swa_utils.py but are missing in torch/optim/swa_utils.pyi. - get_ema_multi_avg_fn - get_swa_multi_avg_fn - get_ema_avg_fn - get_swa_avg_fn - AveragedModel.__init__(multi_avg_fn) - SWALR.get_lr Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/117036 Approved by: https://github.com/janeyx99	2024-04-12 17:17:36 +00:00
Michael Lazos	2ac99d539b	Only initialize state if needed in SGD (#123757 ) Fixes [T184381726](https://www.internalfb.com/intern/tasks/?t=184381726) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123757 Approved by: https://github.com/janeyx99	2024-04-11 08:56:06 +00:00
Michael Lazos	aa16c0163f	Only update momentum buffers for SGD if momentum is enabled (#122349 ) As title [benchmark](https://gist.github.com/mlazos/1171f035a2392c33778aaa3d7bf24370) Helps compiled vanilla SGD execution time by 2x on certain models with large number of small params (ex. ElectraForQuestionAnswering goes from 1090us -> 554us) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122349 Approved by: https://github.com/janeyx99	2024-04-03 18:29:55 +00:00
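The idea in #122349, skipping momentum-buffer work when momentum is disabled, can be sketched for scalar parameters in plain Python. This is an illustrative sketch under stated assumptions, not the PyTorch implementation: `sgd_step` and its arguments are hypothetical, and dampening, weight decay, and Nesterov momentum are omitted.

```python
def sgd_step(params, grads, buffers, lr, momentum=0.0):
    # params, grads: lists of floats; buffers: list of floats or None,
    # updated in place. Returns the new parameter values.
    new_params = []
    for i, (p, g) in enumerate(zip(params, grads)):
        if momentum != 0.0:
            # Momentum enabled: create or update the buffer for this param.
            if buffers[i] is None:
                buffers[i] = g
            else:
                buffers[i] = momentum * buffers[i] + g
            g = buffers[i]
        # Momentum disabled: no buffer is read or written.
        new_params.append(p - lr * g)
    return new_params


buffers = [None, None]
plain = sgd_step([1.0, 2.0], [0.5, 0.5], buffers, lr=0.5)
```

With `momentum=0.0` the buffers stay `None`, so no per-parameter state is allocated or touched.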
Jane Xu	d7fe0603a1	Move sparse tests to TestOptimRenewed (#123146 ) This is the last of the old TestOptim! With this change, everything will be migrated to use OptimizerInfo. Our sparse support is...well, sparse, and the tests try to best encapsulate which configs actually work. Note that support_sparse is actually just supports sparse grads...we don't test sparse params. 1. This PR fixes a bug in Adagrad multi_tensor with maximize by passing the correct value of maximize (vs False everytime) when sparse values are present. 2. This PR does improve coverage. There used to only be 2 configs each, and now we have the following configs for: Adagrad: ``` python test/test_optim.py -k test_rosenbrock_sparse_with_lrsched_False_Adagrad /home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead. _torch_pytree._register_pytree_node( {'maximize': True, 'lr': 0.1} {'initial_accumulator_value': 0.1, 'lr': 0.1} <--- this and above are CPU .{'foreach': False, 'lr': 0.1} {'foreach': True, 'lr': 0.1} {'maximize': True, 'foreach': False, 'lr': 0.1} {'maximize': True, 'foreach': True, 'lr': 0.1} {'initial_accumulator_value': 0.1, 'foreach': False, 'lr': 0.1} {'initial_accumulator_value': 0.1, 'foreach': True, 'lr': 0.1} . ---------------------------------------------------------------------- Ran 2 tests in 227.744s OK ``` SGD ``` (pytorch-3.10) [janeyx@devgpu023.odn1 /data/users/janeyx/pytorch (bff23193)]$ python test/test_optim.py -k test_rosenbrock_sparse_with_lrsched_False_SGD /home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead. 
_torch_pytree._register_pytree_node( {'dampening': 0.5, 'lr': 0.0048} .{'foreach': False, 'lr': 0.0048} {'foreach': True, 'lr': 0.0048} {'dampening': 0.5, 'foreach': False, 'lr': 0.0048} {'dampening': 0.5, 'foreach': True, 'lr': 0.0048} . ---------------------------------------------------------------------- Ran 2 tests in 112.801s OK ``` SparseAdam ``` (pytorch-3.10) [janeyx@devgpu023.odn1 /data/users/janeyx/pytorch (bff23193)]$ python test/test_optim.py -k test_rosenbrock_sparse_with_lrsched_False_Sparse /home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead. _torch_pytree._register_pytree_node( {'maximize': True, 'lr': 0.04} .{'maximize': True, 'lr': 0.04} . ---------------------------------------------------------------------- Ran 2 tests in 35.113s OK ``` Fixes #103322. A side quest in this migration was to re-enable and track dynamo issues as they trigger on the optim tests, which will be complete from this PR. New tests may add more things to track in dynamo, but there is now an established system for doing so, and dynamo is either enabled or a bug is tracked for every migrated test in TestOptimRenewed. Next steps: Remove the hyperparameter constraints in common_optimizer.py defined by metadata_for_sparse (other than LR, which seems handpicked for the tests to actually pass). Doing this requires adding more sparse functionality. Add more tests! Maybe add more optimizers! Pull Request resolved: https://github.com/pytorch/pytorch/pull/123146 Approved by: https://github.com/albanD ghstack dependencies: #123134, #123139	2024-04-02 22:51:02 +00:00
Michael Lazos	16771747c2	Add tensor step and capturable support to rprop (#122261 ) Towards fixing https://github.com/pytorch/pytorch/issues/115679 Fixes Rprop step update while compiling Also adds capturable support + testing Pull Request resolved: https://github.com/pytorch/pytorch/pull/122261 Approved by: https://github.com/janeyx99	2024-03-28 23:31:18 +00:00
Michael Lazos	caa57e4fcd	Add tensor step and capturable support to rmsprop (#122264 ) Towards fixing https://github.com/pytorch/pytorch/issues/115679 Fixes RMSprop step update while compiling Adds capturable support to RMSprop Pull Request resolved: https://github.com/pytorch/pytorch/pull/122264 Approved by: https://github.com/janeyx99	2024-03-28 03:39:28 +00:00
PyTorch MergeBot	f140309e9c	Revert "Only update momentum buffers for SGD if momentum is enabled (#122349 )" This reverts commit `a333b080c1`. Reverted https://github.com/pytorch/pytorch/pull/122349 on behalf of https://github.com/atalman due to Broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/122349#issuecomment-2023001467))	2024-03-27 15:04:52 +00:00
Michael Lazos	a333b080c1	Only update momentum buffers for SGD if momentum is enabled (#122349 ) As title [benchmark](https://gist.github.com/mlazos/1171f035a2392c33778aaa3d7bf24370) Helps compiled vanilla SGD execution time by 2x on certain models with large number of small params (ex. ElectraForQuestionAnswering goes from 1090us -> 554us) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122349 Approved by: https://github.com/janeyx99	2024-03-26 04:19:39 +00:00
Michael Lazos	365e89a591	Add tensor step to adadelta (#122252 ) Towards fixing https://github.com/pytorch/pytorch/issues/115679 Fixes Adadelta step update while compiling Pull Request resolved: https://github.com/pytorch/pytorch/pull/122252 Approved by: https://github.com/janeyx99	2024-03-21 07:28:47 +00:00
Jane Xu	9d6c5be781	Add ASGD capturable API for forloop (#121264 ) @tfsingh I got to it first--wanted to land this stack and close the gap ASAP. This PR also fixes a discrepancy between `_init_group` and `__set_state__` because we have the constants live on params' device always. There are some next steps though: - ASGD can be made faster by making etas, mus, steps be on CPU when NOT capturable. (I had mistakenly thought foreachifying was faster and so we landed https://github.com/pytorch/pytorch/pull/107857, but it is slower). No one has complained yet though. ¯\_(ツ)_/¯ Pull Request resolved: https://github.com/pytorch/pytorch/pull/121264 Approved by: https://github.com/albanD ghstack dependencies: #121260	2024-03-08 00:00:30 +00:00
Jane Xu	24821fec26	Add RAdam capturable API for forloop (#121260 ) Implementation thanks to @MarouaneMaatouk in https://github.com/pytorch/pytorch/pull/118697, though I've since cleaned it up a lot to save perf on the rect < 5 eager case. It also just looks better now :) Added tests and the cudagraph health check. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121260 Approved by: https://github.com/mlazos	2024-03-08 00:00:30 +00:00
Jane Xu	53bdae736d	Add capturable single tensor Adamax (#121183 ) Finishes the work started in https://github.com/pytorch/pytorch/pull/118697. Thanks @MarouaneMaatouk for the attempt, but due to inactivity I have opened this PR for Adamax. Note that the new capturable implementation is much simpler and I've modified the foreach capturable impl--it now calls fewer kernels and is more easily comparable to forloop. Next steps: * This PR discovered two bugs: #121178 and #121238. * Move the now hefty graph optim tests in test_cuda to use OptimInfo. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121183 Approved by: https://github.com/albanD	2024-03-07 17:57:02 +00:00


721 Commits