Commit Graph

117 Commits

Adnios
a9c7d626e1 Add the maximize flag to AdamW (#70146)
Summary:
Related issue: https://github.com/pytorch/pytorch/issues/68052
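
A minimal usage sketch (assuming the flag follows the same interface already exposed by the other optimizers that support `maximize`): with `maximize=True`, the optimizer steps along the gradient to maximize the objective instead of minimizing it.

```python
import torch

# Sketch only: maximize a scalar objective with AdamW's `maximize` flag.
param = torch.zeros(3, requires_grad=True)
opt = torch.optim.AdamW([param], lr=1e-2, weight_decay=1e-2, maximize=True)

for _ in range(100):
    opt.zero_grad()
    objective = -(param - 1.0).pow(2).sum()  # maximized at param == 1
    objective.backward()
    opt.step()  # with maximize=True this ascends the objective
```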

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70146

Reviewed By: malfet

Differential Revision: D33254561

Pulled By: albanD

fbshipit-source-id: f190c836a4162f936c5953e076747c345df21421
2021-12-23 09:20:29 -08:00
Ilqar Ramazanli
5b21f172a4 [doc][hackathon] To add AdamW Optimizer to the documentation (#63252)
Summary:
It has been discussed before that adding descriptions of optimization algorithms to the PyTorch core documentation may result in a nice optimization research tutorial. The tracking issue https://github.com/pytorch/pytorch/issues/63236 lists all the necessary algorithms together with links to the originally published papers.

In this PR we add a description of the AdamW algorithm to the documentation. For more details, we refer to the paper https://arxiv.org/abs/1711.05101

<img width="442" alt="AdamWalgo" src="https://user-images.githubusercontent.com/73658284/132589957-6d381e96-cb62-40d0-990f-82a32ec455be.png">
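
For reference, a condensed transcription of the update shown above (following the decoupled weight decay formulation of the paper; the `amsgrad` branch is omitted here):

```latex
% AdamW update at step t: \gamma is the learning rate, \lambda the weight decay,
% g_t the gradient, (m_t, v_t) the moment estimates, \epsilon the stability term.
\begin{aligned}
\theta_t  &\leftarrow \theta_{t-1} - \gamma \lambda \theta_{t-1} \\
m_t       &\leftarrow \beta_1 m_{t-1} + (1-\beta_1)\, g_t \\
v_t       &\leftarrow \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2 \\
\hat{m}_t &= m_t / (1-\beta_1^{\,t}), \qquad \hat{v}_t = v_t / (1-\beta_2^{\,t}) \\
\theta_t  &\leftarrow \theta_t - \gamma\, \hat{m}_t / \bigl(\sqrt{\hat{v}_t} + \epsilon\bigr)
\end{aligned}
```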

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63252

Reviewed By: datumbox

Differential Revision: D30839685

Pulled By: iramazanli

fbshipit-source-id: 1a426c874ab86408d286a34f41aefcf5b21167c0
2021-09-09 07:05:31 -07:00
Wanchao Liang
4611387608 [optim] take kw-only argument for functional optim APIs (#56185)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56185

ghstack-source-id: 126670123

Reviewed By: albanD

Differential Revision: D27802169

fbshipit-source-id: f5e1cb2046dcdeecf5f6b0f70892828bf0adb22f
2021-04-15 20:08:04 -07:00
Wanchao Liang
f8238d7917 [optim] bugfix when all parameters have no grad (#52944)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52944

This fixes a bug introduced while refactoring the optimizers in https://github.com/pytorch/pytorch/pull/50411. When all parameters have no grads, we should still allow `beta`-like hyperparameters to be defined.
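
A hedged sketch of the pattern involved (variable names are illustrative, not lifted from the diff): the per-parameter lists are only populated for parameters that actually have gradients, but hyperparameters such as `betas` must still be read from the param group even when those lists end up empty.

```python
import torch

group = {"params": [torch.zeros(2, requires_grad=True)],
         "lr": 1e-3, "betas": (0.9, 0.999), "eps": 1e-8, "weight_decay": 1e-2}

# Only parameters with gradients are gathered for the functional update...
params_with_grad, grads = [], []
for p in group["params"]:
    if p.grad is not None:
        params_with_grad.append(p)
        grads.append(p.grad)

# ...but beta-like hyperparameters must still be unpacked unconditionally,
# so the step remains well-defined even when params_with_grad is empty.
beta1, beta2 = group["betas"]
```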

Reviewed By: ngimel

Differential Revision: D26699827

fbshipit-source-id: 8a7074127704c7a4a1fbc17d48a81e23a649f280
2021-03-03 11:56:09 -08:00
Vincent Quenneville-Belair
50d903f19f [optim] make functional api be private (#51316) (#51665)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51665

This reverts commit 896f82aa92.

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D26232608

Pulled By: vincentqb

fbshipit-source-id: ca006baf4fb672c11c1bb003c39a29cbadb63dd3
2021-02-03 17:59:05 -08:00
Vincent Quenneville-Belair
896f82aa92 [optim] make functional api be private (#51316)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51316

Make the optim functional API private until we release it as beta

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D26213469

fbshipit-source-id: b0fd001a8362ec1c152250bcd57c7205ed893107
2021-02-03 09:29:33 -08:00
Wanchao Liang
df96344968 [optimizer] refactor AdamW to use functional API (#50411)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50411
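
A rough sketch of the refactoring direction (simplified; the functional signature below is an assumption, not the actual `torch.optim._functional.adamw` one): `AdamW.step()` only gathers per-parameter state and delegates all of the update math to a stateless functional routine.

```python
import math
import torch

def adamw_functional(params, grads, exp_avgs, exp_avg_sqs, state_steps, *,
                     beta1, beta2, lr, weight_decay, eps):
    """Stateless AdamW math (no amsgrad branch; illustrative only)."""
    for p, g, m, v, step in zip(params, grads, exp_avgs, exp_avg_sqs, state_steps):
        p.mul_(1 - lr * weight_decay)                   # decoupled weight decay
        m.mul_(beta1).add_(g, alpha=1 - beta1)          # first-moment estimate
        v.mul_(beta2).addcmul_(g, g, value=1 - beta2)   # second-moment estimate
        bias_correction1 = 1 - beta1 ** step
        bias_correction2 = 1 - beta2 ** step
        denom = (v.sqrt() / math.sqrt(bias_correction2)).add_(eps)
        p.addcdiv_(m, denom, value=-lr / bias_correction1)

# step() then reduces to collecting (param, grad, state) lists and calling:
p, g = torch.randn(4), torch.randn(4)
m, v = torch.zeros(4), torch.zeros(4)
adamw_functional([p], [g], [m], [v], [1],
                 beta1=0.9, beta2=0.999, lr=1e-3, weight_decay=1e-2, eps=1e-8)
```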

Test Plan: Imported from OSS

Reviewed By: izdeby

Differential Revision: D25932776

Pulled By: wanchaol

fbshipit-source-id: e8e1696b3390ba7909b36fd0107c58b892520432
2021-01-21 11:00:45 -08:00
Samuel Marks
e6779d4357 [*.py] Rename "Arguments:" to "Args:" (#49736)
Summary:
I've written custom parsers and emitters for everything from docstrings to classes and functions. However, I recently came across an issue when I was parsing/generating from the TensorFlow codebase: inconsistent use of `Args:` and `Arguments:` in its docstrings.

```sh
(pytorch#c348fae)$ for name in 'Args:' 'Arguments:'; do
    printf '%-10s %04d\n' "$name" "$(rg -IFtpy --count-matches "$name" | paste -s -d+ -- | bc)"; done
Args:      1095
Arguments: 0336
```

It is easy enough to extend my parsers to support both variants; however, it looks like `Arguments:` is wrong anyway, as per:

  - https://google.github.io/styleguide/pyguide.html#doc-function-args @ [`ddccc0f`](https://github.com/google/styleguide/blob/ddccc0f/pyguide.md)

  - https://chromium.googlesource.com/chromiumos/docs/+/master/styleguide/python.md#describing-arguments-in-docstrings @ [`9fc0fc0`](https://chromium.googlesource.com/chromiumos/docs/+/9fc0fc0/styleguide/python.md)

  - https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html @ [`c0ae8e3`](https://github.com/sphinx-contrib/napoleon/blob/c0ae8e3/docs/source/example_google.rst)

Therefore, only `Args:` is valid. This PR replaces `Arguments:` with `Args:` throughout the codebase.
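
For context, the Google docstring style that the references above converge on looks like this (a generic example, not taken from the PR):

```python
def clamp(value, low, high):
    """Clamp `value` to the closed interval [low, high].

    Args:
        value: The number to clamp.
        low: Lower bound of the interval.
        high: Upper bound of the interval.

    Returns:
        The clamped value.
    """
    return max(low, min(high, value))
```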

PS: For related PRs, see tensorflow/tensorflow/pull/45420

PPS: The trackbacks automatically appearing below are sending the same changes to other repositories in the [PyTorch](https://github.com/pytorch) organisation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49736

Reviewed By: albanD

Differential Revision: D25710534

Pulled By: soumith

fbshipit-source-id: 61e8ff01abb433e9f78185c2d1d0cbd7c22c1619
2020-12-28 09:34:47 -08:00
Xiang Gao
6bc77f4d35 Use amax/maximum instead of max in optimizers (#43797)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43797
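
Roughly, the change favours the explicitly elementwise `torch.maximum` (and the reduction `torch.amax`) over the overloaded `torch.max`; a small sketch of the amsgrad-style bookkeeping this touches (names are illustrative):

```python
import torch

exp_avg_sq = torch.tensor([0.4, 0.1, 0.9])
max_exp_avg_sq = torch.tensor([0.3, 0.2, 0.5])

# Elementwise running maximum, as kept by the amsgrad variant of Adam/AdamW.
# torch.maximum is unambiguously elementwise; torch.max also has reduction overloads.
torch.maximum(max_exp_avg_sq, exp_avg_sq, out=max_exp_avg_sq)  # tensor([0.4, 0.2, 0.9])

# For an explicit reduction, torch.amax states the intent directly.
torch.amax(exp_avg_sq, dim=0)  # tensor(0.9000)
```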

Reviewed By: malfet

Differential Revision: D23406641

Pulled By: mruberry

fbshipit-source-id: 0cd075124aa6533b21375fe2c90c44a5d05ad6e6
2020-09-15 10:39:40 -07:00
Masaki Kozuki
7403545518 Fix exception message of torch.optim.AdamW. (#36088)
Summary:
PyTorch does not implement `SparseAdamW`.
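
A hedged sketch of the guard in question (the exact wording in the fix may differ): AdamW should reject sparse gradients with a message about AdamW itself rather than point at a `SparseAdamW` class that does not exist.

```python
import torch

def reject_sparse_grad(p: torch.Tensor) -> torch.Tensor:
    # Per-parameter check performed inside the optimizer's step() loop (sketch).
    if p.grad is not None and p.grad.is_sparse:
        raise RuntimeError("AdamW does not support sparse gradients")
    return p.grad
```
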
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36088

Differential Revision: D20932357

Pulled By: gchanan

fbshipit-source-id: 49e5b72c34ff8ce0deb6b3807662b8b7d67d959f
2020-04-09 08:02:10 -07:00
albanD
6e2bb1c054 End of the .data removal in torch/optim (#34211)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34211

Test Plan: Imported from OSS

Differential Revision: D20248684

Pulled By: albanD

fbshipit-source-id: 2294bfa41b82ff47f000bc98860780f59d7d4421
2020-03-09 06:40:39 -07:00
Eleanor Dwight Holland
6a97777f72 Remove use of .data from optimizers (#33640)
Summary:
Removes all uses of `.data` from optimizers.

Or tries to.
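
The general pattern (a simplified before/after, not the exact diff): instead of mutating `p.data` to dodge autograd, the optimizer updates `p` in place under `torch.no_grad()`.

```python
import torch

p = torch.ones(3, requires_grad=True)
p.grad = torch.full((3,), 0.5)
lr = 0.1

# Old style (being removed): bypass autograd through the .data attribute.
#   p.data.add_(p.grad.data, alpha=-lr)

# New style: perform the in-place update on the parameter itself under no_grad.
with torch.no_grad():
    p.add_(p.grad, alpha=-lr)
```
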
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33640

Reviewed By: vincentqb

Differential Revision: D20203216

Pulled By: albanD

fbshipit-source-id: 9bfe78bbed00fd4aaa690801cff0201f0bd680a0
2020-03-03 13:21:55 -08:00
Xiao Wang
c1dd70688a Fix deprecated python "add" calls (#33428)
Summary:
This PR fixes the Python `add` calls that use the deprecated signature `add(Scalar, Tensor)`. The alternative signature `add(Tensor, alpha=Scalar)` is used instead.
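
For illustration, the two spellings side by side (generic tensors, not the specific call sites in the diff):

```python
import torch

p = torch.ones(3)
grad = torch.tensor([0.1, 0.2, 0.3])
lr = 0.01

# Deprecated spelling: the scalar passed positionally before the tensor.
#   p.add_(-lr, grad)

# Preferred spelling: tensor first, scalar multiplier via the alpha keyword.
p.add_(grad, alpha=-lr)
```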

cc csarofeen zasdfgbnm ptrblck ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33428

Differential Revision: D20002534

Pulled By: vincentqb

fbshipit-source-id: 81f2dd6170a47a9b53a17e5817c26e70d8afa130
2020-02-26 09:02:31 -08:00
Nikolay Novik
d19a50bf27 Add missing weight_decay parameter validation for Adam and AdamW (#33126)
Summary:
Adam and AdamW are missing parameter validation for `weight_decay`; the other optimisers already have this check.
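
The added check presumably mirrors the constructor validation the other optimizers already do; a minimal sketch:

```python
def validate_weight_decay(weight_decay: float) -> None:
    # Constructor-style hyperparameter check (illustrative; not the exact diff).
    if not 0.0 <= weight_decay:
        raise ValueError("Invalid weight_decay value: {}".format(weight_decay))
```
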
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33126

Differential Revision: D19860366

Pulled By: vincentqb

fbshipit-source-id: 286d7dc90e2f4ccf6540638286d2fe17939648fc
2020-02-20 11:11:51 -08:00
Vitaly Fedyunin
877c96cddf explicitly provide memory format when calling to *_like operators
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30008
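
In the optimizers this means the `*_like` factory calls that allocate state buffers now pass an explicit `memory_format`; a generic sketch:

```python
import torch

p = torch.randn(8, 3, 4, 5)

# State buffers created with *_like spell out the memory format explicitly,
# preserving the layout of the parameter they shadow.
exp_avg = torch.zeros_like(p, memory_format=torch.preserve_format)
exp_avg_sq = torch.zeros_like(p, memory_format=torch.preserve_format)
```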

Test Plan: Imported from OSS

Differential Revision: D18575981

Pulled By: VitalyFedyunin

fbshipit-source-id: ec3418257089ad57913932be1a8608cd20ce054c
2019-11-19 16:19:29 -08:00
Farhad Ramezanghorbani
fed5ca192c Adam/AdamW implementation minor fix (#22628)
Summary:
I have noticed a small discrepancy between the theory and the implementation of AdamW, and of Adam in general. The epsilon in the denominator of the following Adam update should not be scaled by the bias correction [(Algorithm 2, L9-12)](https://arxiv.org/pdf/1711.05101.pdf). Only the running averages of the gradient (_m_) and of the squared gradients (_v_) should be scaled by their corresponding bias corrections.

![adam_update](https://user-images.githubusercontent.com/13050245/60894105-11117f00-a230-11e9-9ba0-adad2ae2e0ae.png)

In the current implementation, the epsilon is effectively scaled by the square root of `bias_correction2`. I have plotted this scaling factor as a function of the step, given `beta2 = 0.999` and `eps = 1e-8`. In the early steps of optimization, it deviates slightly from theory (denoted by the horizontal red line).

![plot](https://user-images.githubusercontent.com/13050245/60893952-cabc2000-a22f-11e9-8dc2-6353ad5d674d.png)
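
In code terms, the fix amounts to adding `eps` after dividing by the bias correction rather than before; a hedged before/after sketch (variable names are illustrative):

```python
import math
import torch

exp_avg_sq = torch.tensor([1e-4, 4e-4])
bias_correction2 = 1 - 0.999 ** 10   # beta2 = 0.999 at step 10
eps = 1e-8

# Before (effectively): eps enters the denominator prior to the bias correction,
# so it is rescaled by 1/sqrt(bias_correction2) during the early steps.
denom_before = exp_avg_sq.sqrt().add_(eps) / math.sqrt(bias_correction2)

# After: apply the bias correction first, then add eps, matching Algorithm 2.
denom_after = (exp_avg_sq.sqrt() / math.sqrt(bias_correction2)).add_(eps)
```
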
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22628

Differential Revision: D16589914

Pulled By: vincentqb

fbshipit-source-id: 8791eb338236faea9457c0845ccfdba700e5f1e7
2019-08-01 11:42:04 -07:00
Michael Acar
a4b2f3e213 Implement AdamW optimizer (#21250)
Summary:
# What is this?
This is an implementation of the AdamW optimizer as implemented in [the fastai library](803894051b/fastai/callback.py) and as initially introduced in the paper [Decoupled Weight Decay Regularization](https://arxiv.org/abs/1711.05101). It decouples the weight decay regularization step from the optimization step during training.

There have already been several abortive attempts to push this into pytorch in some form or fashion: https://github.com/pytorch/pytorch/pull/17468, https://github.com/pytorch/pytorch/pull/10866, https://github.com/pytorch/pytorch/pull/3740, https://github.com/pytorch/pytorch/pull/4429. Hopefully this one goes through.
# Why is this important?
Via a simple reparameterization, it can be shown that L2 regularization has a weight decay effect in the case of SGD optimization. Because of this, L2 regularization became synonymous with the concept of weight decay. However, it can be shown that the equivalence of L2 regularization and weight decay breaks down for more complex adaptive optimization schemes. It was shown in the paper [Decoupled Weight Decay Regularization](https://arxiv.org/abs/1711.05101) that this is the reason why models trained with SGD achieve better generalization than those trained with Adam. Weight decay is a very effective regularizer. L2 regularization, in and of itself, is much less effective. By explicitly decaying the weights, we can achieve state-of-the-art results while also taking advantage of the quick convergence properties that adaptive optimization schemes have.
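
The distinction in a nutshell (a schematic sketch, not the PR's code): L2 regularization folds the decay into the gradient, which the adaptive machinery then rescales, whereas decoupled weight decay shrinks the weights directly.

```python
import torch

p = torch.ones(3)
grad = torch.tensor([0.1, 0.2, 0.3])
lr, weight_decay = 1e-3, 1e-2

# L2 regularization (classic "weight decay" in Adam): the decay term is added to
# the gradient, so the adaptive per-parameter denominator rescales it as well.
grad_l2 = grad + weight_decay * p

# Decoupled weight decay (AdamW): shrink the weights directly, independently of
# the adaptive step computed from the gradient alone.
p.mul_(1 - lr * weight_decay)
```
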
# How was this tested?
Test cases were added to `test_optim.py`, and I also ran a [little experiment](https://gist.github.com/mjacar/0c9809b96513daff84fe3d9938f08638) to validate that this implementation is equivalent to the fastai implementation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21250

Differential Revision: D16060339

Pulled By: vincentqb

fbshipit-source-id: ded7cc9cfd3fde81f655b9ffb3e3d6b3543a4709
2019-07-02 09:09:10 -07:00