Commit Graph

721 Commits

Author SHA1 Message Date
Matt Pitkin
8a5dd7f59b Allow SequentialLR to include ChainedScheduler (#133450)
This fixes #132745 and allows a `SequentialLR` to include schedulers that are compound scheduler types (i.e., a `ChainedScheduler`), which contain a list of schedulers in a `_schedulers` attribute.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133450
Approved by: https://github.com/janeyx99
2024-10-18 02:29:38 +00:00
Daniel Velkov
4abe38bc94 RMSprop docs: add missing input "epsilon" (#137854)
Adding a missing input argument in the docs for RMSprop. Like in the doc for AdamW https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137854
Approved by: https://github.com/janeyx99
2024-10-15 16:40:42 +00:00
ErezYosef
197601eeea Add Support for Tracking Parameter Names (named_parameters) in Optimizer State Dict (#134107)
A proposal addressing Issue #1489: **Optimizer should track parameter names and not id.**

(also mentioned in here: [[RFC] Introducing FQNs/clarity eyeglasses to optim state_dict](https://dev-discuss.pytorch.org/t/rfc-introducing-fqns-clarity-to-optim-state-dict/1552)

## Summary
This PR introduces a backward-compatible enhancement where optimizers track parameter names instead of just their id.
Optimizers can be initialized with `named_parameters()` as:
```python
optimizer = optim.SGD(model.named_parameters(), lr=0.01, momentum=0.9)
```
This allows for greater clarity and ease when handling optimizers, as the parameters' names are preserved within the optimizer’s `state_dict` as:
```
state_dict =
{
    'state': {
    0: {'momentum_buffer': tensor(...), ...},
    1: {'momentum_buffer': tensor(...), ...},
    },
    'param_groups': [
        {
        'lr': 0.01,
        'weight_decay': 0,
        ...
        'params': [0,1]
        'param_names' ['layer.weight', 'layer.bias']  (optional)
        }
    ]
}
```
Loading `state_dict` is not changed (backward-compatible) and the `param_names` key will be ignored.

## Key Features
#### Named Parameters in Optimizer Initialization:
Optimizers can accept the output of `model.named_parameters()` during initialization, allowing them to store parameter names directly.
#### Parameter Names in `state_dict`:
The parameter names are saved as a list in the optimizer’s `state_dict` with key `param_names`, alongside the `params` indices, ensuring seamless tracking of both names and parameters.

## Backward Compatibility
#### No Breaking Changes:
This change is fully backward-compatible. The added `param_names` key in the optimizer's `state_dict` is ignored when loading a state to the optimizer.

#### Customization with Hooks:
For more control, the loaded state_dict can be modified using a custom `register_load_state_dict_pre_hook`, providing flexibility for different design needs.

## Documentation Updates
Please refer to the documentation changes for more details on how this feature is implemented and how it can be used effectively.

## Solution Example:

A suggested solution to the problem mentioned in #1489, for the same parameters but in a different order.
The following `register_load_state_dict_pre_hook` should be added to the optimizer before loading to enable loading the state dict :
```python
def adapt_state_dict_ids(optimizer, state_dict):
    # assuming a single param group.
    current_state_group = optimizer.state_dict()['param_groups'][0]
    loaded_state_group = state_dict['param_groups'][0]

    # same number of params, same names, only different ordering
    current_state_name_to_id_mapping = {}  # mapping --  param_name: id
    for i, name in enumerate(current_state_group['param_names']):
        current_state_name_to_id_mapping[name] = current_state_group['params'][i]

    # changing the ids of the loaded state dict to match the order of the given state dict.
    for i, name in enumerate(current_state_group['param_names']):
        loaded_state_group['params'][i] = current_state_name_to_id_mapping[name]

    return state_dict
```
In this code, the loaded `state_dict` ids are adapted to match the order of the current optimizer `state_dict`.
Both the previous and the current optimizers are required to be initiated with `named_parameters()` to have the 'param_names' key in the dict.

### Note
This is my first contribution to PyTorch, and I wish to receive feedback or suggestions for improvement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134107
Approved by: https://github.com/janeyx99

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2024-10-14 19:24:44 +00:00
Jane Xu
f9ed39c989 Autoupdate min_lrs for ReduceLROnPlateau if possible, fixes #104361 (#137637)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137637
Approved by: https://github.com/albanD
2024-10-10 01:23:30 +00:00
Jane Xu
972822dea1 Minorly reorder optim kwargs in docs, fixes #137391 (#137531)
Closes #137391

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137531
Approved by: https://github.com/albanD
2024-10-09 04:14:45 +00:00
Jane Xu
ddc7b6d0b4 Removes confusing note, addresses #38006 (#137535)
Fixes #38006

The note was originally added in https://github.com/pytorch/pytorch/pull/30257, which tried to ensure that the gradient wasn't modified in the optimizer. This note creates more confusion than is helpful, so removing it is better than leaving it in, especially because most uses of closure that I know _does_ modify the grads.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137535
Approved by: https://github.com/albanD
2024-10-09 04:00:38 +00:00
Jane Xu
b16167874d Minor SGD docs clarification fixing #137356, #137352 (#137528)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137528
Approved by: https://github.com/albanD
2024-10-08 23:05:08 +00:00
Sunishchal Dev
a8ed873ba2 Add missing input "eps" to adam docs (#135191)
Minor fix for missing input argument in the Adam optimizer docs page.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135191
Approved by: https://github.com/janeyx99
2024-09-25 20:17:23 +00:00
Mauricio Villegas
ece8267d2c Add back optim type hints that were lost when *.pyi files were removed (#136185)
When stub files (`*.pyi`) were removed from `optim` (#125556, #125452), some types that existed are no longer available. This pull request adds them back.

Just for reference, these types are used in `pytorch-lightning`'s `LightningCLI`. Command line interfaces are created automatically, and having type hints make them nicer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136185
Approved by: https://github.com/janeyx99
2024-09-17 15:45:15 +00:00
Aaron Gokaslan
31715be72a [BE]: Update mypy to 1.11.2 (#133816)
Updates mypy to 1.11.1 to improve type inference

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133816
Approved by: https://github.com/ezyang
2024-09-16 19:44:11 +00:00
PyTorch MergeBot
3117f2cf67 Revert "[BE]: Update mypy to 1.11.2 (#133816)"
This reverts commit 55299cfc22.

Reverted https://github.com/pytorch/pytorch/pull/133816 on behalf of https://github.com/jeanschmidt due to seems to have broken https://github.com/pytorch/pytorch/actions/runs/10865710499/job/30155699792 on main ([comment](https://github.com/pytorch/pytorch/pull/133816#issuecomment-2352377684))
2024-09-16 09:11:16 +00:00
Aaron Gokaslan
55299cfc22 [BE]: Update mypy to 1.11.2 (#133816)
Updates mypy to 1.11.1 to improve type inference

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133816
Approved by: https://github.com/ezyang
2024-09-14 21:40:36 +00:00
Jane Xu
b1612569f6 [BE] Clarify defaulting behavior in optimizer (#135384)
Fixes #135340

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135384
Approved by: https://github.com/drisspg, https://github.com/jainapurva
2024-09-06 21:52:55 +00:00
Nikita Shulga
fb1c580892 [BE][optim] Make pyright recognize exported symbols (#135043)
Follows pattern introduced by https://github.com/pytorch/pytorch/pull/80955 which [pyright](https://github.com/microsoft/pyright) prefers over `__all__` symbol, see https://github.com/microsoft/pylance-release/issues/2953#issuecomment-1168956296
Fixes https://github.com/pytorch/pytorch/issues/134985

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135043
Approved by: https://github.com/janeyx99
2024-09-04 21:53:46 +00:00
Wil Kong
de06345e9b Avoid Host & Device Sync In LR Scheduler (#133663)
Fixes #133662.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133663
Approved by: https://github.com/janeyx99, https://github.com/eqy

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2024-08-22 03:52:43 +00:00
Sahdev Zala
06cc2e83f0 Make optim.swa.util content accessible from the torch.optim doc (#133393)
Link various classes and functions of the `optim.swa.util` to make doc content accessible from the `torch.optim` doc.

Currently, if you click the link,
https://pytorch.org/docs/stable/optim.html#module-torch.optim.swa_utils it goes to a blank, bottom of the page section of `torch.optim`.
Also,
`torch.optim.swa_utils.AveragedModel` and `torch.optim.swa_utils.SWALR` classes as well as `torch.optim.swa_utils.update_bn()` and `optim.swa_utils.get_ema_multi_avg_fn` are not linked to doc.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133393
Approved by: https://github.com/janeyx99
2024-08-21 00:43:46 +00:00
Masaki Kozuki
702c810780 move param's device check to _init_group for fused (#131153)
There could be some cases where the params have the meta device when calling optimizer's dunder init and those params are materialized in the first computation. This change would allow such situation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131153
Approved by: https://github.com/mlazos, https://github.com/janeyx99

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2024-08-17 04:49:47 +00:00
Jane Xu
c23dceb8f1 Add Adafactor foreach impl (#132336)
This PR adds the foreach impl for Adafactor knowing that there are many ways to improve its runtime perf today (by adding more foreach support). After this PR:
- we have a foreach flag for Adafactor
- It is NOT the default. Why not? It is only slightly faster + uses O(n) more memory where n is the number of params in your max param group. People tend to use Adafactor for memory efficiency.

Next steps:
- make torch.compile possible on it
- make it faster (by adding more foreach apis)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132336
Approved by: https://github.com/albanD
ghstack dependencies: #133360
2024-08-15 17:00:33 +00:00
Xuehai Pan
758a0a88a2 [BE][Easy] enable ruff rule PIE790: unnecessary pass statement (#133200)
This PR removes unnecessary `pass` statement. This is semanticly safe because the bytecode for the Python code does not change.

Note that if there is a docstring in the function, a empty function does not need a `pass` statement as placeholder.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133200
Approved by: https://github.com/malfet, https://github.com/eqy, https://github.com/kit1980
2024-08-15 15:50:19 +00:00
Jane Xu
14750dd737 Correct return type of grouping helper function in Optimizer (#133360)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133360
Approved by: https://github.com/albanD
2024-08-14 01:56:02 +00:00
Pierre Chapuis
0e4c0ef29f fix type of eta_min parameter in CosineAnnealing (int -> float) (#132482)
This fixes errors with type checkers such as `pyright`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132482
Approved by: https://github.com/janeyx99
2024-08-12 18:22:26 +00:00
PyTorch MergeBot
cbee9c1fd2 Revert "Deprecate torch._utils.is_compiling() and torch._dynamo.external_utils.is_compiling() (#127690)"
This reverts commit 0e7e61f7ce.

Reverted https://github.com/pytorch/pytorch/pull/127690 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/127690#issuecomment-2272370386))
2024-08-07 00:05:20 +00:00
Xuehai Pan
0e7e61f7ce Deprecate torch._utils.is_compiling() and torch._dynamo.external_utils.is_compiling() (#127690)
This PR is split from PR #126898.

- #126898

------

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127690
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-08-03 09:43:38 +00:00
xinyu-intel
2ee9895304 Support optimizer capturable on hpu and xpu (#132119)
as title
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132119
Approved by: https://github.com/jgong5, https://github.com/janeyx99
2024-08-02 08:19:52 +00:00
Xuehai Pan
30293319a8 [BE][Easy][19/19] enforce style for empty lines in import segments in torch/[o-z]*/ (#129771)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129771
Approved by: https://github.com/justinchuby, https://github.com/janeyx99
2024-08-01 17:07:14 +00:00
Joel Schlosser
e6cddc9271 Fix public API tests (#131386)
This PR fixes a bug in `test_correct_module_names` introduced in #130497. It also addresses post-fix test failures in:
* `torch/ao/quantization/__init__.py` - set the correct `__module__` for several public API helpers
* `torch/library.py` - add `register_vmap` to `__all__`
* `torch/nn/attention/flex_attention.py` - make `round_up_to_multiple` private by prepending an underscore
* `torch/storage.py` - introduce `__all__` to avoid `Self` being re-exported as a public API
* `torch/distributed/pipelining/schedules.py` - add `ZeroBubbleAlgorithm` to `__all__`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131386
Approved by: https://github.com/albanD
2024-07-30 18:42:54 +00:00
Jane Xu
3816f6420a [BE] remove unnecessary _dispatch_sqrt by using ** 0.5 (#131358)
Based on the discussion here where ** 0.5 is not slower than math.sqrt. https://github.com/pytorch/pytorch/pull/129905#discussion_r1675605075

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131358
Approved by: https://github.com/albanD
2024-07-30 18:08:17 +00:00
PyTorch MergeBot
8f5cf46405 Revert "Fix public API tests (#131386)"
This reverts commit 91fcfd8760.

Reverted https://github.com/pytorch/pytorch/pull/131386 on behalf of https://github.com/clee2000 due to reverting this to revert something else, only action you should need to do is to rebase and merge again, sorry for the churn ([comment](https://github.com/pytorch/pytorch/pull/131386#issuecomment-2254327487))
2024-07-28 03:23:04 +00:00
Matthew Hoffman
fdf1451bfa Add __all__ to torch.optim to define public interface (#131959)
There was a regression in the public interface for `torch.optim` introduced in #125452 when `torch/optim/__init__.pyi` was merged into `torch/optim/__init__.py`. [The import aliases were not preserved and so now `pyright` thinks that these classes are not publicly exported from `torch/optim/__init__.py`.](https://github.com/pytorch/pytorch/pull/125452/files#diff-941595c1e1aa06bec94578499dd3654532a5183d0bc1bcd94d1f33b47e0d0adfL1-L15)

```
error: "SGD" is not exported from module "torch.optim"
```

Adding these classes/modules to `__all__` fixes this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131959
Approved by: https://github.com/ezyang
2024-07-27 01:03:25 +00:00
Joel Schlosser
91fcfd8760 Fix public API tests (#131386)
This PR fixes a bug in `test_correct_module_names` introduced in #130497. It also addresses post-fix test failures in:
* `torch/ao/quantization/__init__.py` - set the correct `__module__` for several public API helpers
* `torch/library.py` - add `register_vmap` to `__all__`
* `torch/nn/attention/flex_attention.py` - make `round_up_to_multiple` private by prepending an underscore
* `torch/storage.py` - introduce `__all__` to avoid `Self` being re-exported as a public API
* `torch/distributed/pipelining/schedules.py` - add `ZeroBubbleAlgorithm` to `__all__`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131386
Approved by: https://github.com/albanD
2024-07-26 23:38:43 +00:00
PyTorch MergeBot
e4db5dc1c4 Revert "[BE] remove unnecessary _dispatch_sqrt by using ** 0.5 (#131358)"
This reverts commit 4c7f22dee2.

Reverted https://github.com/pytorch/pytorch/pull/131358 on behalf of https://github.com/janeyx99 due to Internal uses this private API and landing that has been a pain so we're reverting this first ([comment](https://github.com/pytorch/pytorch/pull/131358#issuecomment-2253190654))
2024-07-26 17:35:27 +00:00
PyTorch MergeBot
c9888c2739 Revert "[BE] typing for decorators - optim/optimizer (#131583)"
This reverts commit a1dad77dfa.

Reverted https://github.com/pytorch/pytorch/pull/131583 on behalf of https://github.com/atalman due to Breaks CI: [GH job link](https://github.com/pytorch/pytorch/actions/runs/10105959146/job/27947741162) [HUD commit link](a1dad77dfa) ([comment](https://github.com/pytorch/pytorch/pull/131583#issuecomment-2252784280))
2024-07-26 13:41:22 +00:00
Aaron Orenstein
a1dad77dfa [BE] typing for decorators - optim/optimizer (#131583)
See #131429
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131583
Approved by: https://github.com/janeyx99
ghstack dependencies: #131568, #131569, #131570, #131571, #131572, #131573, #131574, #131575, #131576, #131577, #131578, #131579, #131580, #131581, #131582
2024-07-26 05:00:07 +00:00
Jane Xu
9c4cf866c2 Adafactor forloop basic impl (#129905)
#109581

At this point, the vanilla implementation (the default) is good.
Docs: https://docs-preview.pytorch.org/pytorch/pytorch/129905/generated/torch.optim.Adafactor.html#torch.optim.Adafactor

Specifically, the impl in this PR, which attempts to replicate the paper,
```
optim = torch.optim.Adafactor([weight])
```
is close enough to https://pytorch-optimizers.readthedocs.io/en/latest/optimizer/#pytorch_optimizer.AdaFactor
```
optim_c = AdaFactor([weight], betas=(0, 0.999), scale_parameter=False)
```
is close enough to https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adafactor
```
optim = keras.optimizers.Adafactor(learning_rate=0.01)
```

The three results respectively for the same randomly generated weights:
```
# ours
tensor([[ 0.3807594, -0.3912092],
        [ 0.0762539,  0.5377805],
        [ 0.2459473,  0.4662207]])

# pytorch-optimizer
tensor([[ 0.3807592, -0.3912172],
        [ 0.0762507,  0.5377818],
        [ 0.2459457,  0.4662213]])

# keras
array([[ 0.38076326, -0.39121315],
        [ 0.0762547 ,  0.5377859 ],
        [ 0.24594972,  0.46622536]], dtype=float32)
```

This gives me confidence to move forward in speeding up the implementation now that a baseline has been established. If you're curious about differences:
* keras assigns step_size (rho_t in their code) to `min(lr, 1 / sqrt(step)` whereas the OG impl uses a hardcoded 0.01 instead of lr. We do the same thing as keras, but our lr default is 0.01.
* We differ from the pytorch-optimizers default in that our default will not track momentum (thus `beta1=0`) and we do not apply parameter scaling.

<details>

Keras collab: https://colab.research.google.com/drive/1i3xF8ChL7TWKJGV_5v_5nMhXKnYmQQ06?usp=sharing

My script repro:

```
import torch
from pytorch_optimizer import AdaFactor
torch.set_printoptions(precision=7)

weight = torch.tensor([[ 0.37697506, -0.39500135],
        [ 0.07246649,  0.53399765],
        [ 0.24216151,  0.46243715]], dtype=torch.float32)
# bias = torch.tensor([0, 0], dtype=torch.float32)

weight.grad = torch.tensor([[-0.5940447, -0.7743838],
        [-0.5940447, -0.7743838],
        [-0.5940447, -0.7743838]], dtype=torch.float32)
# bias.grad = torch.tensor([-2.5027974,  1.5422692], dtype=torch.float32)

weight_c = weight.clone()
weight_c.grad = weight.grad.clone()

optim = torch.optim.Adafactor([weight])
optim.step()
print(weight)

optim_c = AdaFactor([weight_c], betas=(0, 0.999), scale_parameter=False)
optim_c.step()
print(weight_c)
```

<details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129905
Approved by: https://github.com/albanD
2024-07-25 13:17:19 +00:00
Jane Xu
4c7f22dee2 [BE] remove unnecessary _dispatch_sqrt by using ** 0.5 (#131358)
Based on the discussion here where ** 0.5 is not slower than math.sqrt. https://github.com/pytorch/pytorch/pull/129905#discussion_r1675605075

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131358
Approved by: https://github.com/albanD
2024-07-24 14:58:57 +00:00
hxwang
276b5238ef [bug] Add is_compiling check for optimizers to avoid untracked tensor during graph tracing (#130909)
Hey folks, I was using the `stateless_func` [here](7c45476d38/torch/distributed/_spmd/api.py (L435)), which worked well before [this commit](https://github.com/pytorch/pytorch/pull/111084) but then introduced a `_tensor_constant0` and made this func non-stateless. Since there is no way to retrieve this constant tensor before compilation and performance is not an issue when tracing a graph, I think it might be good to fall back to the other branch.
![image](https://github.com/user-attachments/assets/6ee4487d-456b-47e0-8c1d-66cb5a641d47)

![image](https://github.com/user-attachments/assets/1ed46502-e50e-45c4-9751-49aa5a4590ae)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130909
Approved by: https://github.com/mlazos
2024-07-24 08:29:27 +00:00
Aaron Orenstein
5a0068cc69 [BE] mypy: disallow untyped decorators (#131428)
Untyped decorators strip the types from their decorated function so even if the underlying function is fully typed then callers to it don't get any benefit from type annotations.

Step 1 - Enable the error and override in all the offending files.

#131429

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131428
Approved by: https://github.com/justinchuby, https://github.com/oulgen
2024-07-23 21:50:55 +00:00
Li-Huai (Allan) Lin
99d9b369f4 [Optim] Support tensor lr for all optimizers and check it is 1-element (#131065)
Fixes: #130980
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131065
Approved by: https://github.com/janeyx99
2024-07-23 04:27:05 +00:00
Tianyi Tao
3477ee38e4 fix the use of initial learning rate in the OneCycleLR example (#130306)
Fixes #127649

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130306
Approved by: https://github.com/janeyx99
2024-07-09 18:58:07 +00:00
Li-Huai (Allan) Lin
8ec5ba960f [MPS] Add tensor_lr overloads to fused adam & adamw (#129451)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129451
Approved by: https://github.com/janeyx99
2024-07-02 19:46:30 +00:00
Michael Lazos
aa7ea6b45c Add wraps back (#129933)
Fixes https://github.com/pytorch/pytorch/issues/129922

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129933
Approved by: https://github.com/eqy, https://github.com/janeyx99
2024-07-02 18:24:02 +00:00
Sahdev Zala
9795dba1e0 Optim package docstring fix (#129086)
Fix docstrings in various files in optim package. This is a last remaining fix for the issue #112593

The fix can be verified by running pydocstyle path-to-file --count

Fixes #112593

Related #128248

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129086
Approved by: https://github.com/janeyx99
2024-06-21 14:30:53 +00:00
Li-Huai (Allan) Lin
9a7e2519d3 [MPS] Fused Adam & AdamW (#127242)
Summary:

This PR adds fused Adam and AdamW implementations.

Benchmark on Macbook Pro with M1 Max chip and 64GB unified memory:
**Fast math enabled:**
```
[---------------------------------------------- Fused Adam ----------------------------------------------]
                                                                           |  Fused: True  |  Fused: False
1 threads: -----------------------------------------------------------------------------------------------
      amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 100        |       10      |       100
      amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 100       |        9      |        89
      amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 100       |        9      |        90
      amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 100      |        9      |        83
      amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 100       |       12      |        94
      amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 100      |       11      |        88
      amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 100      |       12      |        90
      amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 100     |       11      |       100
      amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 100     |       27      |       100
      amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 100    |       23      |       100
      amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 100    |       27      |       100
      amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 100   |       23      |        98
      amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 500        |       82      |       480
      amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 500       |       72      |       450
      amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 500       |       82      |       450
      amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 500      |       73      |       420
      amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 500       |       91      |       500
      amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 500      |       83      |       400
      amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 500      |       94      |       500
      amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 500     |       78      |       400
      amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 500     |      170      |       500
      amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 500    |      140      |       600
      amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 500    |      170      |       600
      amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 500   |      140      |       500
      amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 1000       |      250      |       890
      amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 1000      |      220      |       850
      amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 1000      |      250      |       830
      amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 1000     |      220      |       770
      amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 1000      |      270      |       870
      amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 1000     |      230      |       840
      amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 1000     |      270      |       810
      amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 1000    |      240      |       800
      amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 1000    |      400      |      1000
      amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 1000   |      360      |      2000
      amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 1000   |      430      |      2000
      amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 1000  |      360      |      1300

Times are in milliseconds (ms).
```

**Fast math disabled:**
```
[---------------------------------------------- Fused Adam ----------------------------------------------]
                                                                           |  Fused: True  |  Fused: False
1 threads: -----------------------------------------------------------------------------------------------
      amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 100        |       10      |       100
      amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 100       |        9      |        84
      amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 100       |        9      |        84
      amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 100      |        9      |        79
      amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 100       |       11      |        93
      amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 100      |       10      |        90
      amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 100      |       11      |        91
      amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 100     |       11      |        81
      amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 100     |       34      |       100
      amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 100    |       31      |       100
      amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 100    |       34      |        95
      amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 100   |       31      |       100
      amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 500        |       94      |       500
      amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 500       |       82      |       430
      amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 500       |       92      |       430
      amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 500      |       81      |       390
      amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 500       |       98      |       500
      amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 500      |       88      |       430
      amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 500      |      100      |       500
      amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 500     |       88      |       400
      amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 500     |      210      |       500
      amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 500    |      190      |       610
      amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 500    |      210      |       510
      amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 500   |      190      |       500
      amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 1000       |      300      |       900
      amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 1000      |      260      |       850
      amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 1000      |      295      |       900
      amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 1000     |      260      |       800
      amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 1000      |      320      |       910
      amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 1000     |      280      |       900
      amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 1000     |      320      |       900
      amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 1000    |      300      |       900
      amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 1000    |      500      |      2000
      amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 1000   |      480      |      2000
      amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 1000   |      540      |      1500
      amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 1000  |      480      |      1200

Times are in milliseconds (ms).
```

```python
def profile_fused_adam():
    from torch.optim import adam, adamw
    import torch.utils.benchmark as benchmark

    import itertools

    def profile(fn, params, grads, exp_avgs, exp_avg_sqs, max_exp_avg_sqs, state_steps, amsgrad, fused):
        fn(
            params,
            grads,
            exp_avgs,
            exp_avg_sqs,
            max_exp_avg_sqs,
            state_steps,
            foreach=False,
            capturable=False,
            fused=fused,
            amsgrad=amsgrad,
            beta1=0.9,
            beta2=0.99,
            lr=1e-3,
            weight_decay=.0,
            eps=1e-5,
            maximize=False,
            grad_scale=None,
            found_inf=None,
        )
        torch.mps.synchronize()

    device = "mps"

    results = []

    for num_tensors, numel, adamWflag, amsgrad in itertools.product([100, 500, 1000], [1024, 65536, 1048576], [True, False], [True, False]):
        print(f"amsgrad: {amsgrad}, adamWflag: {adamWflag}, numel: {numel}, num_tensors: {num_tensors}")
        params, grads, exp_avgs, exp_avg_sqs = [[torch.arange(numel, dtype=torch.float32, device=device) + (numel * i) for i in range(num_tensors)] for _ in range(4)]
        max_exp_avg_sqs = [torch.arange(numel, dtype=torch.float32, device=device) for _ in range(num_tensors)] if amsgrad else []
        state_steps = [torch.tensor([5], dtype=torch.float32, device=device) for _ in range(num_tensors)]
        if adamWflag:
            fn = adamw.adamw
        else:
            fn = adam.adam

        for fused in [True, False]:

            t = benchmark.Timer(
                    stmt='profile(fn, params, grads, exp_avgs, exp_avg_sqs, max_exp_avg_sqs, state_steps, amsgrad, fused)',
                    label='Fused Adam',
                    sub_label=f"amsgrad: {amsgrad}, adamWflag: {adamWflag}, numel: {numel}, num_tensors: {num_tensors}",
                    globals=locals(),
                    description= f"Fused: {fused}",
                ).blocked_autorange(min_run_time=5)
            results.append(t)

    compare = benchmark.Compare(results)
    compare.trim_significant_figures()
    compare.colorize(rowwise=True)
    compare.print()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127242
Approved by: https://github.com/kulinseth, https://github.com/janeyx99
2024-06-18 19:59:50 +00:00
Ahmed Gheith
3dd5f0ecbb Remove circular import (#128875)
Summary: A spurious import is causing circular dependency errors

Test Plan: phabricator signals

Differential Revision: D58685676

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128875
Approved by: https://github.com/kit1980
2024-06-18 12:30:13 +00:00
Sahdev Zala
4ccbf711e2 Learning Rate Scheduler docstring fix (#128679)
Fix docstrings in Learning Rate Scheduler.

The fix can be verified by running pydocstyle path-to-file --count

Related #112593

**BEFORE the PR:**
pydocstyle torch/optim/lr_scheduler.py --count

92


**AFTER the PR:**
pydocstyle torch/optim/lr_scheduler.py --count

0

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128679
Approved by: https://github.com/janeyx99
2024-06-15 05:30:35 +00:00
Xuehai Pan
dcc0093dba [BE][Easy] export explicitly imported public submodules (#127703)
Add top-level submodules `torch.{storage,serialization,functional,amp,overrides,types}`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127703
Approved by: https://github.com/ezyang
2024-06-12 05:52:18 +00:00
PyTorch MergeBot
90bb510ece Revert "Deprecate torch._utils.is_compiling() and torch._dynamo.external_utils.is_compiling() (#127690)"
This reverts commit 348b181a97.

Reverted https://github.com/pytorch/pytorch/pull/127690 on behalf of https://github.com/clee2000 due to sorry I think https://github.com/pytorch/pytorch/pull/126898#issuecomment-2142884456 is still relevant, I will reach out to them to see what needs to be done in internal to get this remerged ([comment](https://github.com/pytorch/pytorch/pull/127690#issuecomment-2159248859))
2024-06-10 20:44:42 +00:00
Aaron Orenstein
27f9d3b0a1 Flip default value for mypy disallow_untyped_defs [8/11] (#127845)
See #127836 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127845
Approved by: https://github.com/oulgen
ghstack dependencies: #127842, #127843, #127844
2024-06-08 18:49:56 +00:00
Xuehai Pan
348b181a97 Deprecate torch._utils.is_compiling() and torch._dynamo.external_utils.is_compiling() (#127690)
This PR is split from PR #126898.

- #126898

------

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127690
Approved by: https://github.com/Skylion007
2024-06-08 15:25:03 +00:00
joncrall
80fa2778ed Update types for verbose in lr_scheduler (#127943)
I'm currently locked into jsonargparse version 4.19.0, and it complains when used in combination with LightningCLI (v2.0.8). This is because it cares about the types declared in google style docstrings. This causes a problem when it tries to parse how it should cast arguments to construct an instance of an LRScheduler class because the docstrings declare the "verbose" parameter as a bool, but the defaults recently changed to a string "deprecated". This means the type should really be `bool | str`.

This PR adds a `| str` to the docstring type in each learning rate scheduler class. This will prevent jsonargparse from complaining.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127943
Approved by: https://github.com/janeyx99
2024-06-06 21:59:22 +00:00
Adam J. Stewart
80d34217c6 Typo fixes: et al. (#127811)
"et al." is short for _et alia_ and should be abbreviated with a period on the second word. Noticed this typo when reading through the SGD docs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127811
Approved by: https://github.com/janeyx99
2024-06-06 01:03:25 +00:00
GdoongMathew
3437177e2b Quick Fix on #126854, deepcopy lr and other possible base_parameters (#127190)
* Apply `deepcopy` to every base parameters (`initial_lr`, `max_lr`) when instantiating `LRScheduler`.

Fixes #126854

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127190
Approved by: https://github.com/janeyx99
2024-06-03 18:06:31 +00:00
PyTorch MergeBot
033e733021 Revert "[BE] wrap deprecated function/class with typing_extensions.deprecated (#126898)"
This reverts commit 749a132fb0.

Reverted https://github.com/pytorch/pytorch/pull/126898 on behalf of https://github.com/fbgheith due to switching typing-extensions=4.3.0 to 4.9.0 causes internal failure ([comment](https://github.com/pytorch/pytorch/pull/126898#issuecomment-2142884456))
2024-05-31 19:47:24 +00:00
feifan
da9fb670d2 Nadam support the flag for "maximize" (#127214)
Fixes https://github.com/pytorch/pytorch/issues/126642

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127214
Approved by: https://github.com/janeyx99
2024-05-31 01:11:16 +00:00
Xuehai Pan
749a132fb0 [BE] wrap deprecated function/class with typing_extensions.deprecated (#126898)
Use `typing_extensions.deprecated` for deprecation annotation if possible. Otherwise, add `category=FutureWarning` to `warnings.warn("message")` if the category is missing.

Note that only warnings that their messages contain `[Dd]eprecat(ed|ion)` are updated in this PR.

UPDATE: Use `FutureWarning` instead of `DeprecationWarning`.

Resolves #126888

- #126888

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126898
Approved by: https://github.com/albanD
2024-05-29 12:09:27 +00:00
feifan
22712ba5c5 Radam support the flag for "maximize" (#126765)
Fixes #[126642](https://github.com/pytorch/pytorch/issues/126642)

I reference the maximize in `Adam` and add `Radam's` maximize flag. If this pr is OK, I will add another pr for `Nadam`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126765
Approved by: https://github.com/janeyx99
2024-05-27 06:34:50 +00:00
Xuehai Pan
ba3b05fdf3 [1/N][Easy] fix typo for usort config in pyproject.toml (kown -> known): sort stdlib (#127122)
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo make `usort` do more and generate the changes in the PR. Except `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127122
Approved by: https://github.com/kit1980
2024-05-25 08:25:50 +00:00
Jane Xu
665637714f Remove SparseAdam weird allowance of raw Tensor input (#127081)
This continues the full deprecation after https://github.com/pytorch/pytorch/pull/114425. It's been 6 months! And I'm fairly certain no one is going to yell at me as this patch is not really used.

------

# BC Breaking note

As of this PR, SparseAdam will become consistent with the rest of our optimizers in that it will only accept containers of Tensors/Parameters/param groups and fully complete deprecation of this path. Hitherto, the SparseAdam constructor had allowed raw tensors as the params argument to the constructor. Now, if you write the following code, there will be an error similar to every other optim: "params argument given to the optimizer should be an iterable of Tensors or dicts"

```
import torch
param = torch.rand(16, 32)
optimizer = torch.optim.SparseAdam(param)
```

Instead you should replace the last line with
```
optimizer = torch.optim.SparseAdam([param])
```
to no longer error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127081
Approved by: https://github.com/soulitzer
2024-05-25 02:58:24 +00:00
Aart Bik
ff82e2e7cf [traced-graph][sparse] propagate sparsity metadata into traced graph (#117907)
Propagate sparsity metadata from sparse tensors of torch.sparse into the traced graph representation (with would be useful for a JIT backend that supports a "sparse compiler"). This is a first careful attempt, since the actual "meta" feature seem still incomplete for coo and completely lacking for csr/csc/bsr/bsc.

For background see forum postings (with examples):
  https://discuss.pytorch.org/t/connecting-pytorch-sparse-tensors-with-mlir/195145
  https://dev-discuss.pytorch.org/t/connecting-pytorch-sparse-tensors-with-mlir/1803

And feature request:
  https://github.com/pytorch/pytorch/issues/117188

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117907
Approved by: https://github.com/pearu, https://github.com/ezyang
2024-05-23 22:46:46 +00:00
David Chiu
7e166e8057 [optim] Fix: wrong ASGD implementation (#126375)
This PR is based on #125440, additionally merging the latest main branch and fixing the lint failures from #126361.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126375
Approved by: https://github.com/janeyx99
2024-05-17 15:46:39 +00:00
PyTorch MergeBot
e3c5d1b7d7 Revert "[optim] Fix: wrong ASGD implementation (#125440)"
This reverts commit 2c5ad9a3d7.

Reverted https://github.com/pytorch/pytorch/pull/125440 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it looks like there is a linter failure coming from this change ([comment](https://github.com/pytorch/pytorch/pull/125440#issuecomment-2113833108))
2024-05-16 02:12:29 +00:00
haozhe.zhu
f9d107af66 [optim] add fused_adagrad support for CPU device (#124905)
Support fused_sgd_kernel support for CPU.

## Bench result:
32 core/sockets ICX
Test Scripts:
https://gist.github.com/zhuhaozhe/79e842e0a6e25d6d7fa1e4598807272c
https://gist.github.com/zhuhaozhe/b4c6998a509dcea1796dd05b3005c969
```
Tensor Size: 262144, Num Tensor 4, Num Threads: 1
_single_tensor_adagrad time: 0.2500 seconds
_fused_adagrad time: 0.0933 seconds
Tensor Size: 4194304, Num Tensor 32, Num Threads: 32
_single_tensor_adagrad time: 2.8819 seconds
_fused_adagrad time: 1.7591 seconds
```
## Test Plan:
```
python test_optim.py -k test_fused_matches_forloop
python test_optim.py -k test_fused_large_tensor
python test_optim.py -k test_can_load_older_state_dict
python test_optim.py -k test_grad_scaling_autocast_fused_optimizers
python test_torch.py -k test_grad_scaling_autocast_fused
python test_torch.py -k test_params_invalidated_with_grads_invalidated_between_unscale_and_step
```

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124905
Approved by: https://github.com/jgong5, https://github.com/janeyx99
2024-05-16 01:11:51 +00:00
David Chiu
2c5ad9a3d7 [optim] Fix: wrong ASGD implementation (#125440)
> previous: Originally, the variables `new_eta` and `new_mu` would be constructed `len(grouped_mus)` times, but each of their values is the same and won't be changed. Therefore, it can be simplified using Python list multiplication, which only constructs one tensor.

- [X] Ill assumption that every param will have the same step.
- [x] DIfferent implementation between `foreach=Ture` and `foreach=False`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125440
Approved by: https://github.com/janeyx99
2024-05-15 22:52:15 +00:00
David Chiu
1a28f731dc [optim] Merge the pyi files into py files of optimizer (#125452)
Continue the work of pytorch/pytorch#125153
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125452
Approved by: https://github.com/janeyx99
2024-05-14 18:24:50 +00:00
David Chiu
9641a8db25 [optim] deprecate LRScheduler.print_lr (#126105)
Fixes #99270

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126105
Approved by: https://github.com/janeyx99
2024-05-14 14:13:03 +00:00
daitian1995
b805d3cbcb Modify device check in capturable optimizer to support more devices (#124919)
Fixes #124830

Modify device check in capturable optimizer to support more device

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124919
Approved by: https://github.com/janeyx99
2024-05-14 05:56:00 +00:00
PyTorch MergeBot
bd3cbdba2f Revert "[optim] add fused_adagrad support for CPU device (#124905)"
This reverts commit 1c3fe84033.

Reverted https://github.com/pytorch/pytorch/pull/124905 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing distributed multigpu test in trunk 1c3fe84033 ([comment](https://github.com/pytorch/pytorch/pull/124905#issuecomment-2108777063))
2024-05-13 20:53:22 +00:00
haozhe.zhu
1c3fe84033 [optim] add fused_adagrad support for CPU device (#124905)
Support fused_sgd_kernel support for CPU.

## Bench result:
32 core/sockets ICX
Test Scripts:
https://gist.github.com/zhuhaozhe/79e842e0a6e25d6d7fa1e4598807272c
https://gist.github.com/zhuhaozhe/b4c6998a509dcea1796dd05b3005c969
```
Tensor Size: 262144, Num Tensor 4, Num Threads: 1
_single_tensor_adagrad time: 0.2500 seconds
_fused_adagrad time: 0.0933 seconds
Tensor Size: 4194304, Num Tensor 32, Num Threads: 32
_single_tensor_adagrad time: 2.8819 seconds
_fused_adagrad time: 1.7591 seconds
```
## Test Plan:
```
python test_optim.py -k test_fused_matches_forloop
python test_optim.py -k test_fused_large_tensor
python test_optim.py -k test_can_load_older_state_dict
python test_optim.py -k test_grad_scaling_autocast_fused_optimizers
python test_torch.py -k test_grad_scaling_autocast_fused
python test_torch.py -k test_params_invalidated_with_grads_invalidated_between_unscale_and_step
```

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124905
Approved by: https://github.com/jgong5, https://github.com/janeyx99
2024-05-13 01:16:20 +00:00
Michael Lazos
b833fc0ecb Tighten fallback conditions for compiled optim (#125825)
Since we now will support `capturable=False` when it's valid, narrow the eager fallback conditions to the cases where `compile` will fail. The lone case here is when the user deletes the capturable flag; `state_steps` are on cuda and `capturable` is `False`. Because a cuda tensor is not supported in the `value` kwarg for foreach ops this results in an error.

The fallback wrapper is changed to check the device of `state_steps` if `capturable=False`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125825
Approved by: https://github.com/janeyx99
2024-05-11 06:29:51 +00:00
David Chiu
31946c10d0 Add missing parameter doc of Adagrad (#125886)
Add the missing documentation for `initial_accumulator_value` parameter in Adagrad, and update the algorithm description in the documentation (adjusted to reflect the implementation).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125886
Approved by: https://github.com/janeyx99
2024-05-10 22:55:22 +00:00
David Chiu
c520929c83 add typing in torch.optim.lr_scheduler (#125556)
Merge torch/optim/lr_scheduler.pyi into torch/optim/lr_scheduler.py
Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125556
Approved by: https://github.com/janeyx99
2024-05-10 19:28:00 +00:00
Michael Lazos
69eeef0727 Update LRScheduler to handle tensor LR (#123753)
Enables LRScheduler to handle tensor LRs.

Note on test changes:
For the test modifications I just removed itertools.product and created two loops. This allows us to create a new set of optim_inputs on each iteration to prevent mutations on the tensor LR carrying over across iterations. Nothing else in those tests was modified.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123753
Approved by: https://github.com/janeyx99
ghstack dependencies: #123751, #123752
2024-05-09 00:52:43 +00:00
Michael Lazos
7b36b4a765 Fix user warning for tensor LR (#123752)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123752
Approved by: https://github.com/janeyx99
ghstack dependencies: #123751
2024-05-09 00:52:43 +00:00
Michael Lazos
0ea6ffc613 Swap warning counter to flag in LRScheduler (#123751)
This was a counter previously, this should be a flag to indicate whether or not the optimizer step has been called.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123751
Approved by: https://github.com/janeyx99
2024-05-09 00:52:43 +00:00
Michael Lazos
0f02e0aa39 Disable dynamo on functional optims if capturable=False (#123619)
This resolves a bug in eager where if an old state dict is loaded (without the capturable flag) but the original dict had the capturable flag, then state_steps would be on cuda but we would take the non-capturable path. We now fallback to eager if capturable=False.

Current design doc and discussion: https://docs.google.com/document/d/1DmmbiaSp16CDZtGw1qzXKHFTY_0gqc0xpnBdviXq0vk/edit#heading=h.871u7bvwz7ze

Note on the actual fallback logic - there was an issue with torchscript originally not handling *args, **kwargs properly, after rectifying that by using `functools.wraps`, there was an additional bug with scoping which required the single tensor implementation to be in the global scope at the time of the fallback closure being created. I pass in the single tensor function to the `_disable_dynamo_if_unsupported` decorator to workaround this bug.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123619
Approved by: https://github.com/janeyx99
2024-05-07 22:17:01 +00:00
David Chiu
a60fa960e5 refactor: extract get_lr warning (#125545)
Extract the `_get_lr_called_within_step` checking in the `get_lr()` of every LRSchedulers.
Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125545
Approved by: https://github.com/janeyx99
2024-05-07 03:15:58 +00:00
David Chiu
b1b03992d0 Merge the pyi files into py files of optimizer (#125153)
Merge the interfaces in pyi files into py files in `torch/optim`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125153
Approved by: https://github.com/janeyx99
2024-05-02 21:29:31 +00:00
Michael Lazos
787afc5180 Add LR as tensor tests (#123750)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123750
Approved by: https://github.com/janeyx99
2024-05-01 04:46:49 +00:00
Alex Morehead
9aed5dcfe6 Clarify wording in docstring for CosineAnnealingWarmRestarts within lr_scheduler.py (#125161)
- Clarifies wording in the docstring for `CosineAnnealingWarmRestarts` within `lr_scheduler.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125161
Approved by: https://github.com/janeyx99
2024-04-30 14:01:22 +00:00
haozhe.zhu
3c964ad1ca add fused_sgd_kernel support for CPU device (#123629)
Support fused_sgd_kernel support for CPU.

## Bench result:
32 core/sockets ICX
Test Scripts:
https://gist.github.com/zhuhaozhe/688763e17e93e4c5e12f25f676ec90d9
https://gist.github.com/zhuhaozhe/ad9938694bc7fae8b66d376f4dffc6c9
```
Tensor Size: 262144, Num Tensor 4, Num Threads: 1
_single_tensor_sgd time: 0.2301 seconds
_fused_sgd time: 0.0925 seconds
Tensor Size: 4194304, Num Tensor 32, Num Threads: 32
_single_tensor_sgd time: 2.6195 seconds
_fused_sgd time: 1.7543 seconds
```
## Test Plan:
```
python test_optim.py -k test_fused_matches_forloop
python test_optim.py -k test_fused_large_tensor
python test_optim.py -k test_can_load_older_state_dict
python test_optim.py -k test_grad_scaling_autocast_fused_optimizers
python test_torch.py -k test_grad_scaling_autocast_fused
python test_torch.py -k test_params_invalidated_with_grads_invalidated_between_unscale_and_step
```
Looks like we already have some PRs under this issue https://github.com/pytorch/pytorch/issues/123451 to unified the UTs, I did not modified UT in this PR.

Co-authored-by: Jane Xu <janeyx@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123629
Approved by: https://github.com/jgong5, https://github.com/janeyx99
2024-04-23 08:28:19 +00:00
GdoongMathew
8b1ad51881 Better Error Message in ChainedScheduler and SequentialLR (#121633)
Fixes #121577

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121633
Approved by: https://github.com/janeyx99
2024-04-19 13:37:41 +00:00
Jane Xu
b412b75b42 [optim] add fused_adam/adamw_kernel support for CPU device (#123074)
On par with `CUDA` implementation.

For `autocast` logic, same with `CUDA` + `Fused Adam`:
 - check inf in `gradscalar.step`
 - In fused kernel, if there is `inf`, do nothing. If not, unscale the grad ( also write back) and update the param.

**TestPlan**:
```
# extend CUDA only test for CPU fused adagrad
python test_optim.py -k test_fused_matches_forloop
python test_optim.py -k test_fused_large_tensor
python test_torch.py -k test_grad_scaling_autocast_fused

# extend fused test
python test_torch.py -k test_params_invalidated_with_grads_invalidated_between_unscale_and_step
python test_optim.py -k test_can_load_older_state_dict

# newly added test (follow 6b1f13ea2f/test/test_cuda.py (L1108))
python test_optim.py -k test_grad_scaling_autocast_fused_optimizers
```

**Benchmark**:
**5.1x** on 56 core SPR
**Parameter-size=1M**
**Nparams=10**
[test script](https://gist.github.com/zhuhaozhe/ef9a290ad3f8f4067b3373a3bdaa33e7)

```
numactl -C 0-55 -m 0 python bench_adam.py
non-fused 6.0174267292022705 s
fused 1.1787631511688232 s
```

**Note: Fused kernel accuracy**
The accuracy failure in CI shows a little higher than default tolerance
```
2024-04-02T06:09:16.2213887Z Mismatched elements: 21 / 64 (32.8%)
2024-04-02T06:09:16.2214339Z Greatest absolute difference: 1.5735626220703125e-05 at index (6, 6) (up to 1e-05 allowed)
2024-04-02T06:09:16.2214813Z Greatest relative difference: 1.0073336852656212e-05 at index (4, 1) (up to 1.3e-06 allowed)
```
I have debug it step by step and unfortunately we may not able to make the `fused kernel` exactly same with `non fused` one due to compiler optimizations.
For example, in non-fused impl
```
exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj(), value=1 - beta2)
```
and in fused impl
```
  exp_avg_sq_ptr[d] = scalar_t(beta2) * exp_avg_sq_ptr[d];
  //  std::cout << "exp_avg_sq " <<   exp_avg_sq_ptr[d] << std::endl;
  exp_avg_sq_ptr[d] = exp_avg_sq_ptr[d] +
      scalar_t(exp_avg_sq_grad_coefficient) * grad_val * grad_val;
```
If I keep `std::cout`, I can get exactly same results in UT
```
===============param
0.6796758770942688
0.6796758770942688
```
But when I comment out it, there will be a difference
```
===============param
0.6796758770942688
0.6796759366989136
```
So I will make the tolerance a little higher than default one.

Co-authored-by: Jane Xu <janeyx@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123074
Approved by: https://github.com/jgong5, https://github.com/janeyx99
2024-04-19 11:14:04 +00:00
Michael Lazos
57a3dc56d4 Small Adamax fix (#123498)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123498
Approved by: https://github.com/janeyx99
2024-04-18 00:50:03 +00:00
Mikayla Gawarecki
383d2d1f6c Add testing and fix issues for weights_only load for LRScheduler (#123775)
Fixes https://github.com/pytorch/pytorch/issues/98921

There were two issues detected:
- `MultiStepLR`: issue is described in https://github.com/pytorch/pytorch/issues/98921, this is resolved by allowlisting `collections.Counter`
- `OneCycleLR`: `state_dict['anneal_func']` is either `<function OneCycleLR._annealing_cos at 0x7f364186f5b0>` or
`<function OneCycleLR._annealing_linear at 0x7f39aa483640>` depending on the `anneal_func` kwarg.
   This leads to `WeightsUnpickler error: Unsupported class __builtin__.getattr` from the `weights_only` Unpickler.

  Fixed the above in a BC-compatible manner by adding `OneCyclicLR._anneal_func_type` as a string attribute and removing `OneCyclicLR.anneal_func`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123775
Approved by: https://github.com/albanD, https://github.com/malfet
2024-04-16 20:29:27 +00:00
FFFrog
791e5db705 Part 3: UFMT fix the rest files in torch/optim due to the pr-sanity-checks (#124055)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124055
Approved by: https://github.com/ezyang
ghstack dependencies: #124048, #124053, #124054
2024-04-16 03:22:39 +00:00
FFFrog
ac74a6783b Part 2: UFMT fix 2 files in torch/optim due to the pr-sanity-checks (#124054)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124054
Approved by: https://github.com/ezyang
ghstack dependencies: #124048, #124053
2024-04-16 03:20:21 +00:00
FFFrog
560efaa471 Part 1: UFMT partial files in torch/optim due to the pr-sanity-checks (#124053)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124053
Approved by: https://github.com/ezyang
ghstack dependencies: #124048
2024-04-16 03:17:18 +00:00
FFFrog
f30704f5f3 add preparatory work for torch/optim/lr_scheduler.py (#124048)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124048
Approved by: https://github.com/albanD
2024-04-16 03:17:18 +00:00
David Chiu
ab647bd325 Add missing interfaces of torch.optim.swa_utils (#117036)
Add type hints for the function/class interfaces that appear in torch/optim/swa_utils.py but are missing in torch/optim/swa_utils.pyi.

- get_ema_multi_avg_fn
- get_swa_multi_avg_fn
- get_ema_avg_fn
- get_swa_avg_fn
- AveragedModel.__init__(multi_avg_fn)
- SWALR.get_lr

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117036
Approved by: https://github.com/janeyx99
2024-04-12 17:17:36 +00:00
Michael Lazos
2ac99d539b Only initialize state if needed in SGD (#123757)
Fixes [T184381726](https://www.internalfb.com/intern/tasks/?t=184381726)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123757
Approved by: https://github.com/janeyx99
2024-04-11 08:56:06 +00:00
Michael Lazos
aa16c0163f Only update momentum buffers for SGD if momentum is enabled (#122349)
As title

[benchmark](https://gist.github.com/mlazos/1171f035a2392c33778aaa3d7bf24370)

Helps compiled vanilla SGD execution time by 2x on certain models with large number of small params (ex.
ElectraForQuestionAnswering goes from 1090us -> 554us)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122349
Approved by: https://github.com/janeyx99
2024-04-03 18:29:55 +00:00
Jane Xu
d7fe0603a1 Move sparse tests to TestOptimRenewed (#123146)
This is the last of the old TestOptim! With this change, everything will be migrated to use OptimizerInfo. Our sparse support is...well, sparse, and the tests try to best encapsulate which configs actually work. Note that support_sparse is actually just supports sparse grads...we don't test sparse params.

1. This PR fixes a bug in Adagrad multi_tensor with maximize by passing the correct value of maximize (vs False everytime) when sparse values are present.

2. This PR does improve coverage. There used to only be 2 configs each, and now we have the following configs for:

Adagrad:
```
python test/test_optim.py -k test_rosenbrock_sparse_with_lrsched_False_Adagrad
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
{'maximize': True, 'lr': 0.1}
{'initial_accumulator_value': 0.1, 'lr': 0.1}    <--- this and above are CPU
.{'foreach': False, 'lr': 0.1}
{'foreach': True, 'lr': 0.1}
{'maximize': True, 'foreach': False, 'lr': 0.1}
{'maximize': True, 'foreach': True, 'lr': 0.1}
{'initial_accumulator_value': 0.1, 'foreach': False, 'lr': 0.1}
{'initial_accumulator_value': 0.1, 'foreach': True, 'lr': 0.1}
.
----------------------------------------------------------------------
Ran 2 tests in 227.744s

OK
```

SGD
```
(pytorch-3.10) [janeyx@devgpu023.odn1 /data/users/janeyx/pytorch (bff23193)]$ python test/test_optim.py -k test_rosenbrock_sparse_with_lrsched_False_SGD
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
{'dampening': 0.5, 'lr': 0.0048}
.{'foreach': False, 'lr': 0.0048}
{'foreach': True, 'lr': 0.0048}
{'dampening': 0.5, 'foreach': False, 'lr': 0.0048}
{'dampening': 0.5, 'foreach': True, 'lr': 0.0048}
.
----------------------------------------------------------------------
Ran 2 tests in 112.801s

OK
```

SparseAdam
```
(pytorch-3.10) [janeyx@devgpu023.odn1 /data/users/janeyx/pytorch (bff23193)]$ python test/test_optim.py -k test_rosenbrock_sparse_with_lrsched_False_Sparse
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
{'maximize': True, 'lr': 0.04}
.{'maximize': True, 'lr': 0.04}
.
----------------------------------------------------------------------
Ran 2 tests in 35.113s

OK
```

Fixes #103322. A side quest in this migration was to re-enable and track dynamo issues as they trigger on the optim tests, which will be complete from this PR. New tests may add more things to track in dynamo, but there is now an established system for doing so, and dynamo is either enabled or a bug is tracked for every migrated test in TestOptimRenewed.

Next steps:
Remove the hyperparameter constraints in common_optimizer.py defined by metadata_for_sparse (other than LR, which seems handpicked for the tests to actually pass). Doing this requires adding more sparse functionality.

Add more tests!

Maybe add more optimizers!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123146
Approved by: https://github.com/albanD
ghstack dependencies: #123134, #123139
2024-04-02 22:51:02 +00:00
Michael Lazos
16771747c2 Add tensor step and capturable support to rprop (#122261)
Towards fixing https://github.com/pytorch/pytorch/issues/115679
Fixes Rprop step update while compiling

Also adds capturable support + testing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122261
Approved by: https://github.com/janeyx99
2024-03-28 23:31:18 +00:00
Michael Lazos
caa57e4fcd Add tensor step and capturable support to rmsprop (#122264)
Towards fixing https://github.com/pytorch/pytorch/issues/115679
Fixes RMSprop step update while compiling

Adds capturable support to RMSprop

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122264
Approved by: https://github.com/janeyx99
2024-03-28 03:39:28 +00:00
PyTorch MergeBot
f140309e9c Revert "Only update momentum buffers for SGD if momentum is enabled (#122349)"
This reverts commit a333b080c1.

Reverted https://github.com/pytorch/pytorch/pull/122349 on behalf of https://github.com/atalman due to Broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/122349#issuecomment-2023001467))
2024-03-27 15:04:52 +00:00
Michael Lazos
a333b080c1 Only update momentum buffers for SGD if momentum is enabled (#122349)
As title

[benchmark](https://gist.github.com/mlazos/1171f035a2392c33778aaa3d7bf24370)

Helps compiled vanilla SGD execution time by 2x on certain models with large number of small params (ex.
ElectraForQuestionAnswering goes from 1090us -> 554us)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122349
Approved by: https://github.com/janeyx99
2024-03-26 04:19:39 +00:00
Michael Lazos
365e89a591 Add tensor step to adadelta (#122252)
Towards fixing https://github.com/pytorch/pytorch/issues/115679
Fixes Adadelta step update while compiling

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122252
Approved by: https://github.com/janeyx99
2024-03-21 07:28:47 +00:00
Jane Xu
9d6c5be781 Add ASGD capturable API for forloop (#121264)
@tfsingh I got to it first--wanted to land this stack and close the gap ASAP.

This PR also fixes a discrepancy between `_init_group` and `__set_state__` because we have the constants live on params' device always.

There are some next steps though:
- ASGD can be made faster by making etas, mus, steps be on CPU when NOT capturable. (I had mistakenly thought foreachifying was faster and so we landed https://github.com/pytorch/pytorch/pull/107857, but it is slower). No one has complained yet though.  ¯\_(ツ)_/¯

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121264
Approved by: https://github.com/albanD
ghstack dependencies: #121260
2024-03-08 00:00:30 +00:00
Jane Xu
24821fec26 Add RAdam capturable API for forloop (#121260)
Implementation thanks to @MarouaneMaatouk in https://github.com/pytorch/pytorch/pull/118697, though I've since cleaned it up a lot to save perf on the rect < 5 eager case. It also just looks better now :) Added tests and the cudagraph health check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121260
Approved by: https://github.com/mlazos
2024-03-08 00:00:30 +00:00
Jane Xu
53bdae736d Add capturable single tensor Adamax (#121183)
Finishes the work started in https://github.com/pytorch/pytorch/pull/118697. Thanks @MarouaneMaatouk for the attempt, but due to inactivity I have opened this PR for Adamax. Note that the new capturable implementation is much simpler and I've modified the foreach capturable impl--it now calls fewer kernels and is more easily comparable to forloop.

Next steps:
* This PR discovered two bugs: #121178 and #121238.
* Move the now hefty graph optim tests in test_cuda to use OptimInfo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121183
Approved by: https://github.com/albanD
2024-03-07 17:57:02 +00:00