Improves typing so that all the optimizer subclasses (which all of them that subtype step) do not erase their type signature when this decorator is used. Now *kwarg values and returns will propogate
This complements @tsunghsienlee PR #153367 as the type signature of step() was being erased on all the optimizer subclasses by this untyped decorator
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153374
Approved by: https://github.com/janeyx99, https://github.com/tsunghsienlee
## Summary
This PR updates the docstring for `CosineAnnealingLR` to accurately reflect its recursive learning rate schedule. The previous docstring displayed only the SGDR closed-form expression, which doesn't match the actual recursive implementation in code.
Changes:
- Added the recursive update formula used in `get_lr()`
- Retained the original closed-form SGDR expression for reference
- Clarified that warm restarts are not implemented in this scheduler
This addresses confusion raised in issue #152081.
## Related issue
[#152081](https://github.com/pytorch/pytorch/issues/152081)
## Testing
Doc-only change. Ran pre-commit to verify formatting.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152936
Approved by: https://github.com/janeyx99
Fixes#142397
Basic implementation is done. What's left:
- [x] Different dtype/device tensors in the TensorList
- [x] fast path for grouping the foreach kernel
- [x] Tests
Regarding tests, I found some tests in `test/test_torch.py` for GradScaler but I couldn't figure out what is the best way to enable the test for MPS device.
By removing `@onlyNativeDeviceTypes`, one enables the tests for MPS but also enables tests for all other devices which are not included in the native device types. If I put:
`instantiate_device_type_tests(TestTorchDeviceType, globals(), allow_mps=True)`
This enables lots of tests in that class for MPS which were not(?) being tested before? This part needs some clarification
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150255
Approved by: https://github.com/malfet
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
As per this [discussion](https://discuss.pytorch.org/t/a-question-about-requiredparameter/137977), I figured that `_RequiredParameter` is no longer used.
The `required` object was initially introduced in this [PR](4db6667923) as the `SGD` optimizer did not offer a default value for the learning rate. However there isn't a single place in the code base using `_RequiredParameter`, nor `required`. I am therefore removing unused `_RequiredParameter` and `required`.
Everything not included in this PR is Not a Contribution.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144771
Approved by: https://github.com/janeyx99
Third PR in a series of PRs to broaden differentiable optimizer support w/ @janeyx99 (sorry for pinging over the holidays! I just wanted to put this one out but I am definitely not asking for review or anything like that rn)
This is also going to probably be my last PR before the holidays!
Note: This is a branch of #143710 -- I've never worked on a branch of a branch before so I wasn't sure about the protocol so I thought I'd just made the PR and wait until that one gets merged.
This is adding support for differentiable lr, weight_decay, and betas to Adam and AdamW (but after refactoring AdamW into an Adam subclass, it's really just changing code in torch/optim/adam.py)
I had one main thing I was wondering about, which is that adam already has a differentiable flag built in, so I have code like this
```py
if differentiable and isinstance(beta2, Tensor):
if beta2.requires_grad:
exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj().mul(1 - beta2))
else:
exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj(), value=1 - beta2)
else:
exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj(), value=1 - beta2)
```
That I could definitely simplify to just
```py
if differentiable and isinstance(beta2, Tensor):
exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj().mul(1 - beta2))
else:
exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj(), value=1 - beta2)
```
It would definitely be a little slower in the case that it's differentiable but doesn't need a grad for beta2, but the code would also be a lot more clear and I'm debating speed vs future code usability.
Also the line in the above example:
```py
exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj().mul(1 - beta2))
```
was concerning to me because it is considerably more expensive than `value=1 - beta2`, but I couldn't think of a better way to do it.
Further work on #141832
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143726
Approved by: https://github.com/janeyx99
(Actual) second PR in a larger project to broaden support for differentiable optimizers with @janeyx99!
In this PR, I did a lot of pattern matching from the previous PR to add support for differentiable weight_decay.
And also added a single new line on line 359 (previously line 352) to make the code from the last PR a little easier to read
Continuation of progress on #141832
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143679
Approved by: https://github.com/janeyx99
Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
Fixes#104899
Refactors AdamW into Adam by making AdamW a subclass of Adam. Additionally adds a test to assert that the added parameter `decoupled_weight_decay` is True in AdamW and also updates test_defaults_changed_to_foreach to account for the differences in module location for AdamW.
Heavily heavily inspired by #118857 by @tfsingh
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143710
Approved by: https://github.com/janeyx99
Second PR in a larger project to broader support for differentiable optimizers with @janeyx99 ! The first one had an issue near the end so this is the second PR on that subject. See #143122 for the development up until this point.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143510
Approved by: https://github.com/janeyx99
Over time, a large number of the existing type ignores have become irrelevant/unused/dead as a result of improvements in annotations and type checking.
Having these `# type: ignore` linger around is not ideal for two reasons:
- They are verbose/ugly syntatically.
- They could hide genuine bugs in the future, if a refactoring would actually introduce a bug but it gets hidden by the ignore.
I'm counting over 1500 unused ignores already. This is a first PR that removes some of them. Note that I haven't touched type ignores that looked "conditional" like the import challenge mentioned in https://github.com/pytorch/pytorch/pull/60006#issuecomment-2480604728. I will address these at a later point, and eventually would enable `warn_unused_ignores = True` in the mypy configuration as discussed in that comment to prevent accumulating more dead ignores going forward.
This PR should have no effect on runtime at all.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142325
Approved by: https://github.com/Skylion007, https://github.com/janeyx99
According to the documentation, decay is a number in [0,1] range,[ i.e.](https://pytorch.org/docs/stable/optim.html)
```
Decay is a parameter between 0 and 1 that controls how fast the averaged parameters are decayed. If not provided to get_ema_multi_avg_fn, the default is 0.999.
```
An inspection of `swa_utils.py` indicates there are no checks for invalid values of `decay`. Adding asserts as suggested in this PR ensures valid compute range (one way to enforce correct behavior, there are perhaps more suitable ones). Papers `torch` cites for reference idea/implementation also consider exclusively this range (e.g., https://arxiv.org/pdf/2310.04415).
Fixes https://github.com/pytorch/pytorch/issues/133772
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133773
Approved by: https://github.com/janeyx99