Commit Graph

549 Commits

shibo19
b088ff4677 add foreach support for custom device (#102047)
Fixes #ISSUE_NUMBER
For custom devices, we want to support foreach, so this adds a function that lets us set another device type; the default value is cuda.
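Hypothetical sketch only; the names below are made up to illustrate making the foreach-capable device type configurable, with cuda as the default:

```
_foreach_supported_device = "cuda"  # default device type for the foreach fast path

def _set_foreach_supported_device(device_type: str) -> None:
    # hypothetical helper: a custom backend could register its own device type here
    global _foreach_supported_device
    _foreach_supported_device = device_type

def _uses_foreach_device(tensor) -> bool:
    # hypothetical check used when deciding whether a tensor may take the foreach path
    return tensor.device.type == _foreach_supported_device
```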

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102047
Approved by: https://github.com/janeyx99
2023-06-01 06:22:44 +00:00
vfdev
76af22103b Fixed type hints for CosineAnnealingWarmRestarts (#102067)
Fixed type hints for CosineAnnealingWarmRestarts:
- `T_mult` is not `Optional[int]` but just `int`
- `eta_min` is not `Optional[float]` but just `float`
- removed the `step`-method-specific annotation, as it is compatible with the base class

e132f09e88/torch/optim/lr_scheduler.py (L1365-L1375)

Otherwise, a computation like `self.T_i * self.T_mult` in `self.step` is rejected by the type checker:
```
error: Unsupported operand types for * ("int" and "None")
```
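A minimal sketch of the corrected stub, assuming the standard `CosineAnnealingWarmRestarts` constructor signature:

```
from torch.optim import Optimizer
from torch.optim.lr_scheduler import LRScheduler

class CosineAnnealingWarmRestarts(LRScheduler):
    def __init__(
        self,
        optimizer: Optimizer,
        T_0: int,
        T_mult: int = 1,       # plain int, not Optional[int]
        eta_min: float = 0.0,  # plain float, not Optional[float]
        last_epoch: int = -1,
    ) -> None: ...
    # no separate `step` annotation: it is compatible with the base class
```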
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102067
Approved by: https://github.com/janeyx99
2023-05-23 19:06:07 +00:00
Jane Xu
3135bec4a0 [docs] Clarify when to use SparseAdam (#101465)
![image](https://github.com/pytorch/pytorch/assets/31798555/ff19a522-2630-4578-bc0e-6a704aa94d4e)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101465
Approved by: https://github.com/albanD
2023-05-17 21:16:20 +00:00
Jane Xu
f558af2a55 [adam] Use the right params in weight_decay, rename for clarity, fixes #100707 (#100973)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100973
Approved by: https://github.com/Skylion007, https://github.com/albanD
2023-05-09 17:00:27 +00:00
milesial
45bf3f6216 Optimized EMA implementation (#94820)
This PR proposes an optimized way to do Exponential Moving Average (EMA), which is faster than the current way using `swa_utils.AveragedModel` described in https://pytorch.org/docs/stable/optim.html#custom-averaging-strategies.

This implementation is asynchronous and is built as an optimizer wrapper, so the EMA weight update happens right after optimizer steps, without any additional CPU/GPU sync and with limited code changes.

Example usage:
```
model = Model().to(device)
opt = torch.optim.Adam(model.parameters())

opt = EMAOptimizer(opt, device, 0.9999)

for epoch in range(epochs):
    training_loop(model, opt)

    regular_eval_accuracy = evaluate(model)

    with opt.swap_ema_weights():
        ema_eval_accuracy = evaluate(model)
```
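The update the wrapper performs after each optimizer step is a plain exponential moving average; a minimal sketch of that update using the foreach ops (the decay value and tensor lists are placeholders):

```
import torch

@torch.no_grad()
def ema_update(ema_params, model_params, decay=0.9999):
    # ema = decay * ema + (1 - decay) * param, applied across whole tensor lists
    torch._foreach_mul_(ema_params, decay)
    torch._foreach_add_(ema_params, model_params, alpha=1.0 - decay)
```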

Here are some benchmarks (time per iteration) on various torchvision models:

|model|this PR iteration time                      |swa_utils.AveragedModel iteration time| iteration speedup                                      |
|-----|-----------------------------|-----------------------|---------------------------------------------|
|regnet_x_1_6gf|62.73                        |67.998                 |1.08                                         |
|regnet_x_3_2gf|101.75                       |109.422                |1.08                                         |
|regnet_x_400mf|25.13                        |32.005                 |1.27                                         |
|regnet_x_800mf|33.01                        |37.466                 |1.13                                         |
|regnet_x_8gf|128.13                       |134.868                |1.05                                         |
|regnet_y_16gf|252.91                       |261.292                |1.03                                         |
|regnet_y_1_6gf|72.14                        |84.22                  |1.17                                         |
|regnet_y_3_2gf|99.99                        |109.296                |1.09                                         |
|regnet_y_400mf|29.53                        |36.506                 |1.24                                         |
|regnet_y_800mf|37.82                        |43.634                 |1.15                                         |
|regnet_y_8gf|196.63                       |203.317                |1.03                                         |
|resnet101|128.80                       |137.434                |1.07                                         |
|resnet152|182.85                       |196.498                |1.07                                         |
|resnet18|29.06                        |29.975                 |1.03                                         |
|resnet34|50.73                        |53.443                 |1.05                                         |
|resnet50|76.88                        |80.602                 |1.05                                         |
|resnext101_32x8d|277.29                       |280.759                |1.01                                         |
|resnext101_64x4d|269.56                       |281.052                |1.04                                         |
|resnext50_32x4d|100.73                       |101.102                |1.00                                         |
|shufflenet_v2_x0_5|10.56                        |15.419                 |1.46                                         |
|shufflenet_v2_x1_0|13.11                        |18.525                 |1.41                                         |
|shufflenet_v2_x1_5|18.05                        |23.132                 |1.28                                         |
|shufflenet_v2_x2_0|25.04                        |30.008                 |1.20                                         |
|squeezenet1_1|14.26                        |14.325                 |1.00                                         |
|swin_b|264.52                       |274.613                |1.04                                         |
|swin_s|180.66                       |188.914                |1.05                                         |
|swin_t|108.62                       |112.632                |1.04                                         |
|swin_v2_s|220.29                       |231.153                |1.05                                         |
|swin_v2_t|127.27                       |133.586                |1.05                                         |
|vgg11|95.52                        |103.714                |1.09                                         |
|vgg11_bn|106.49                       |120.711                |1.13                                         |
|vgg13|132.94                       |147.063                |1.11                                         |
|vgg13_bn|149.73                       |165.256                |1.10                                         |
|vgg16|158.19                       |172.865                |1.09                                         |
|vgg16_bn|177.04                       |192.888                |1.09                                         |
|vgg19|184.76                       |194.194                |1.05                                         |
|vgg19_bn|203.30                       |213.334                |1.05                                         |
|vit_b_16|217.31                       |219.748                |1.01                                         |
|vit_b_32|69.47                        |75.692                 |1.09                                         |
|vit_l_32|223.20                       |258.487                |1.16                                         |
|wide_resnet101_2|267.38                       |279.836                |1.05                                         |
|wide_resnet50_2|145.06                       |154.918                |1.07                                         |

You can see that in all cases it is faster than using `AveragedModel`. In fact, in many cases adding EMA adds no overhead at all, since the computation is hidden behind the usual iteration flow.

This is a similar implementation to the one currently in [NVIDIA NeMo](https://github.com/NVIDIA/NeMo).

If the team is interested in merging this, let me know and I'll add some documentation similar to `swa_utils` and tests.

Credits to @szmigacz for the implementation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94820
Approved by: https://github.com/janeyx99
2023-04-26 18:02:11 +00:00
Arthur
7a8d0ccddf Correct LBFGS tolerance_grad doc string (#99792)
LBFGS' `tolerance_grad` parameter has had a default value of `1e-7` since #25240. The doc string wasn't updated in that PR to match the change https://github.com/pytorch/pytorch/blob/main/torch/optim/lbfgs.py#L207.

There is no open issue for it; I just happened to set it to 1e-7 and was surprised my results didn't change :-) I eventually noticed the inconsistency in the doc, and it seemed like an easy opportunity to figure out how to contribute.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99792
Approved by: https://github.com/janeyx99
2023-04-22 20:19:01 +00:00
PyTorch MergeBot
4637c5ae5b Revert "Simplify _use_grad_for_differentiable (#98706)"
This reverts commit b9da79d280.

Reverted https://github.com/pytorch/pytorch/pull/98706 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but a bunch of inductor tests are failing after this commit, so reverting the PR just to be sure
2023-04-22 00:35:56 +00:00
Jason Ansel
b9da79d280 Simplify _use_grad_for_differentiable (#98706)
This makes it so dynamo can trace through it

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98706
Approved by: https://github.com/janeyx99
2023-04-21 20:47:19 +00:00
Masaki Kozuki
22ea21da3d Change 1D Tensor of 1 element to 0D Tensor (#96994)
Add 0D tensor coverage to the graphed Adam/AdamW tests.

Affected:
- `torch.cuda.amp.GradScaler`'s `found_inf`, `_scale`, and `_growth_tracker`
- `step` of Adam & AdamW of `capturable`

Fixes #96776 🤞

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96994
Approved by: https://github.com/janeyx99
2023-03-21 18:24:19 +00:00
Jane Xu
aacbf091db Allow fused optimizers to call _foreach_zero_ in zero_grad (#97159)
Fixes #97032

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97159
Approved by: https://github.com/Skylion007
2023-03-20 19:03:26 +00:00
Aaron Gokaslan
5471621497 [BE] Remove unnecessary dict comprehensions (#97116)
Removes unnecessary dict comprehensions, optimizing the creation of dicts from iterables.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97116
Approved by: https://github.com/kit1980
2023-03-20 00:56:57 +00:00
Aaron Gokaslan
dd9ade6377 Remove unnecessary items() call in zero_grad (#97040)
Micro-optimization to zero_grad(), which is performance-critical.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97040
Approved by: https://github.com/ezyang, https://github.com/albanD
2023-03-17 21:34:14 +00:00
David
e8b0f504e2 Fix unpicklable object in AveragedModel (#95979)
Fixes #95376

Don't store the callable `avg_fn`; instead, check whether `avg_fn` is None and call the default implementation when it is.
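A rough sketch of the approach (not the exact PyTorch code): keep `avg_fn` as `None` by default instead of storing a callable on the module, and fall back to the default running-average update when it is `None`:

```
def averaged_update(averaged_param, model_param, num_averaged, avg_fn=None):
    if avg_fn is None:
        # default running-average update, kept as plain code so nothing
        # unpicklable needs to be stored on the AveragedModel instance
        return averaged_param + (model_param - averaged_param) / (num_averaged + 1)
    return avg_fn(averaged_param, model_param, num_averaged)
```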
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95979
Approved by: https://github.com/janeyx99
2023-03-12 05:13:22 +00:00
Masaki Kozuki
7d765cdc66 Fix wrong handling of grad_scale & found_inf in fused optimizers (#95847)
Fixes #95781.
The cause seems to be that the current implementation doesn't correctly pass `found_inf` when `grad_scale` is `None`. Therefore, parameters can get mistakenly updated by gradients in which some elements are invalid, i.e. NaN or inf.

Related #94060

I forgot about this incorrect handling after #94344.
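A hypothetical illustration of the fix (not the actual kernel call): `found_inf` should be forwarded to the fused kernel independently of whether `grad_scale` is set, instead of being dropped when `grad_scale` is `None`:

```
def build_fused_kwargs(grad_scale=None, found_inf=None):
    kwargs = {}
    if grad_scale is not None:
        kwargs["grad_scale"] = grad_scale
    if found_inf is not None:
        # previously this was skipped whenever grad_scale was None,
        # so nan/inf gradients could still update the parameters
        kwargs["found_inf"] = found_inf
    return kwargs
```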

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95847
Approved by: https://github.com/janeyx99
2023-03-04 01:21:21 +00:00
Jane Xu
75cb99e549 [optim] Widen the cases for defaulting to foreach (#95820)
Big OOPS correction continued. Also added a test this time to verify the defaulting was as expected.

The key here is realizing that the grouping for foreach already assumes that the non-param tensorlists follow suit in dtype and device, so it is too narrow to check that _all_ tensors were on CUDA. The main leeway this allowed was state_steps, which are sometimes CPU tensors. Since foreach _can_ handle CPU tensors, this should not introduce breakage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95820
Approved by: https://github.com/albanD
2023-03-02 04:15:33 +00:00
Jane Xu
2bcf863fad [optim] include nn.Parameter as foreach supported (#95811)
This PR is the result of realizing that models were NOT subscribed to the foreach defaulting, as had been claimed in our documentation for months now. BIG OOPS.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95811
Approved by: https://github.com/albanD
2023-03-02 04:15:33 +00:00
Xuehai Pan
1fd119948e [3/3] Update .pyi Python stub files and enable 'UFMT' linter (#95268)
Changes:

- #95200

1. Recognize `.py.in` and `.pyi.in` files as Python in VS Code for a better development experience.
2. Fix deep setting merge in `tools/vscode_settings.py`.

- #95267

3. Use `NamedTuple` rather than `namedtuple + __annotations__` for `torch.nn.utils.rnn.PackedSequence_`:

    `namedtuple + __annotations__`:

    ```python
    PackedSequence_ = namedtuple('PackedSequence_',
                                 ['data', 'batch_sizes', 'sorted_indices', 'unsorted_indices'])

    # type annotation for PackedSequence_ to make it compatible with TorchScript
    PackedSequence_.__annotations__ = {'data': torch.Tensor, 'batch_sizes': torch.Tensor,
                                       'sorted_indices': Optional[torch.Tensor],
                                       'unsorted_indices': Optional[torch.Tensor]}
    ```

    `NamedTuple` (Python 3.6+):

    ```python
    class PackedSequence_(NamedTuple):
        data: torch.Tensor
        batch_sizes: torch.Tensor
        sorted_indices: Optional[torch.Tensor]
        unsorted_indices: Optional[torch.Tensor]
    ```

- => this PR: #95268

4. Sort import statements and remove unnecessary imports in `.pyi`, `.pyi.in` files.
5. Format `.pyi`, `.pyi.in` files and remove unnecessary ellipsis `...` in type stubs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95268
Approved by: https://github.com/huydhn
2023-03-01 23:50:56 +00:00
Kiersten Stokes
60a1d29585 Correct OneCycleLR doc example code to explicitly call optimizer.step() (#95730)
Fixes #89358 as suggested in the issue comment

A screenshot of the example code in the built docs:
<img width="1168" alt="Screenshot 2023-02-28 at 4 46 45 PM" src="https://user-images.githubusercontent.com/31816267/221999156-02b28f2a-85b3-4aa8-841d-e4c66a39a33f.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95730
Approved by: https://github.com/janeyx99
2023-03-01 02:15:50 +00:00
Jane Xu
e5b9d98752 Rephrase zero_grad docs (#95643)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95643
Approved by: https://github.com/albanD
2023-02-28 22:04:23 +00:00
Jane Xu
097679478e [optim] Set defaults to foreach, NOT fused (#95241)
Rolling back the default change for Adam and rectifying the docs to reflect that AdamW never defaulted to fused.

Since our fused implementations are relatively newer, let's give them a longer bake-in time before flipping the switch for every user.
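Users who still want the fused implementation can opt in explicitly; a minimal usage sketch, assuming CUDA floating-point parameters (which fused requires):

```
import torch

model = torch.nn.Linear(8, 8).cuda()

# foreach remains the default path on CUDA; fused must be requested explicitly
opt_foreach = torch.optim.Adam(model.parameters())
opt_fused = torch.optim.Adam(model.parameters(), fused=True)
```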

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95241
Approved by: https://github.com/ngimel
2023-02-22 04:47:32 +00:00
Masaki Kozuki
3e9df622fb [mta] implement _foreach_pow (#92303)
Mainly for the foreach path of `Adam` and `AdamW`.

rel: https://github.com/pytorch/pytorch/issues/58833
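A small usage sketch of the op added here; like the other foreach ops, it takes a list of tensors and applies the operation across the whole list:

```
import torch

tensors = [torch.tensor([1.0, 2.0]), torch.tensor([3.0, 4.0])]
squared = torch._foreach_pow(tensors, 2.0)  # [tensor([1., 4.]), tensor([ 9., 16.])]
```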
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92303
Approved by: https://github.com/albanD
2023-02-16 02:28:26 +00:00
Xuehai Pan
b005ec62b9 [BE] Remove dependency on six and future (#94709)
Remove the Python 2 and 3 compatibility libraries [six](https://pypi.org/project/six) and [future](https://pypi.org/project/future), along with `torch._six`. We only support Python 3.8+ now; it's time to retire them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94709
Approved by: https://github.com/malfet, https://github.com/Skylion007
2023-02-14 09:14:14 +00:00
Xuehai Pan
5b1cedacde [BE] [2/3] Rewrite super() calls in functorch and torch (#94588)
Rewrite Python built-in class `super()` calls. Only non-semantic changes should be applied.

- #94587
- #94588
- #94592

Also, methods with only a `super()` call are removed:

```diff
class MyModule(nn.Module):
-   def __init__(self):
-       super().__init__()
-
    def forward(self, ...):
        ...
```

Some cases that change the semantics should be kept unchanged. E.g.:

f152a79be9/caffe2/python/net_printer.py (L184-L190)

f152a79be9/test/test_jit_fuser_te.py (L2628-L2635)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94588
Approved by: https://github.com/ezyang, https://github.com/albanD
2023-02-10 21:16:33 +00:00
Aaron Gokaslan
1e2d82b8e4 [BE] Merge isinstance calls together (#94419)
Simplifies and speeds up isinstance calls by checking for multiple types at the same time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94419
Approved by: https://github.com/ezyang
2023-02-09 00:47:26 +00:00
Aaron Gokaslan
3ce1ebb6fb Apply some safe comprehension optimizations (#94323)
Removes unnecessary collection casts and unnecessary calls to list, tuple, and dict, and simplifies calls to the sorted builtin. This should strictly improve both speed and readability.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94323
Approved by: https://github.com/albanD
2023-02-07 23:53:46 +00:00
Aaron Gokaslan
8fce9a09cd [BE]: pyupgrade Python to 3.8 - imports and object inheritance only (#94308)
Apply parts of pyupgrade to torch (starting with the safest changes).
This PR only does two things: removes the need to inherit from object and removes unused future imports.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94308
Approved by: https://github.com/ezyang, https://github.com/albanD
2023-02-07 21:10:56 +00:00
Masaki Kozuki
6ba041fcae Look up group["capturable"], not defaults["capturable"] in Adam(W) (#94149)
Different values can be set for each `param_group` when calling `__init__` of `torch.optim` optimizers, as in e.g. https://github.com/pytorch/pytorch/issues/89987.

So check `capturable` per `param_group`, rather than relying only on `defaults["capturable"]`.
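A rough sketch of the idea (not the exact PyTorch code), reading the flag per `param_group`:

```
def any_group_capturable(param_groups):
    # each param_group dict may carry its own "capturable" value, so
    # defaults["capturable"] alone is not a reliable source of truth
    return any(group.get("capturable", False) for group in param_groups)
```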
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94149
Approved by: https://github.com/albanD
2023-02-07 00:24:35 +00:00
Masaki Kozuki
a23ed38f9a [mta][foreach] Implement fused adamw (#88015)
related: https://github.com/pytorch/pytorch/issues/68041, https://github.com/pytorch/pytorch/issues/71274, https://github.com/pytorch/pytorch/issues/80167
possibly related to https://github.com/pytorch/pytorch/issues/80595#issuecomment-1178519436

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88015
Approved by: https://github.com/albanD, https://github.com/ngimel
2023-02-01 19:32:29 +00:00
Masaki Kozuki
d7a3f2128f pass None instead of False inside Adam.__setstate__ (#93289)
With a061f139dc, `fused`'s type hint is `Optional[bool]` and its default value is `None`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93289
Approved by: https://github.com/janeyx99, https://github.com/Skylion007
2023-01-31 09:41:35 +00:00
Jane Xu
4fc19e1a71 [optim][adam] use fastest impl whenever possible, add util (#93184)
This makes it so that ONLY when the user doesn't set anything for foreach or fused do we switch the default, and it cascades Adam so that we default to fused, then foreach, then single-tensor.

To clarify:
* if the user puts True in foreach _only_, it will run the foreach implementation.
* if the user puts True in fused _only_, it will run the fused implementation.
* if the user puts True in foreach AND for fused, it will run the fused implementation.

And:
* if the user puts False in foreach _only_, it will run the single tensor implementation.
* if the user puts False in fused _only_, it will still run the single tensor implementation.
* if the user puts False in foreach AND for fused, it will run the single tensor implementation.

I also didn't trust myself that much with the helper function, so I ran some local asserts on _default_to_fused_or_foreach. The only point left to really test is the `type(p) == torch.Tensor` check, but I think the distributed tests will catch that in CI.
```
cuda_only_fp_list = [
    torch.rand((1, 2), device="cuda", dtype=torch.float32),
    torch.rand((1, 2), device="cuda", dtype=torch.float64),
    torch.rand((1, 2), device="cuda", dtype=torch.float16),
    torch.rand((1, 2), device="cuda", dtype=torch.bfloat16),
]

cuda_only_int_list = [
    torch.randint(1024, (1, 2), device="cuda", dtype=torch.int64),
]

cpu_list = [
    torch.rand((1, 2), device="cpu", dtype=torch.float32),
    torch.rand((1, 2), device="cpu", dtype=torch.float64),
    torch.rand((1, 2), device="cpu", dtype=torch.float16),
]

none_list = [None]

# differentiable should always make it return false for both
assert _default_to_fused_or_foreach([cuda_only_fp_list], True, True) == (False, False)
assert _default_to_fused_or_foreach([cuda_only_fp_list], True, False) == (False, False)

# cpu lists should always make it return false for both
assert _default_to_fused_or_foreach([cuda_only_fp_list, cpu_list], False, True) == (False, False)
assert _default_to_fused_or_foreach([cpu_list], False, True) == (False, False)
assert _default_to_fused_or_foreach([cuda_only_fp_list, cpu_list], False, False) == (False, False)
assert _default_to_fused_or_foreach([cpu_list], False, False) == (False, False)

# has fused triggers correctly
assert _default_to_fused_or_foreach([cuda_only_fp_list], False, True) == (True, False)
assert _default_to_fused_or_foreach([cuda_only_fp_list], False, False) == (False, True)

# ints always goes to foreach
assert _default_to_fused_or_foreach([cuda_only_fp_list, cuda_only_int_list], False, True) == (False, True)
assert _default_to_fused_or_foreach([cuda_only_fp_list, cuda_only_int_list], False, False) == (False, True)

# Nones don't error
assert _default_to_fused_or_foreach([cuda_only_fp_list, none_list], False, True) == (True, False)
assert _default_to_fused_or_foreach([cuda_only_fp_list, cuda_only_int_list, none_list], False, True) == (False, True)
assert _default_to_fused_or_foreach([none_list], False, True) == (True, False)
assert _default_to_fused_or_foreach([none_list], False, False) == (False, True)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93184
Approved by: https://github.com/albanD
2023-01-30 19:58:55 +00:00
Jane Xu
e714e37a06 [optim][sgd] default to foreach when CUDA + differentiable=False (#92730)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92730
Approved by: https://github.com/albanD
2023-01-26 04:52:58 +00:00
Jane Xu
8c9f745af1 [foreach] guard default support on native tensors only (#92923)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92923
Approved by: https://github.com/ngimel, https://github.com/crcrpar
2023-01-26 04:52:58 +00:00
Jane Xu
b90496eef5 [nn] zero_grad() set_to_none default True (#92731)
Attempts to fix #92656

BC-breaking! This changes the default of zero_grad in optim and in nn so that grads are set to None instead of zero tensors by default. We are changing the default because there are proven perf wins and existing code has typically not regressed due to this change. (This note will probably need to be fleshed out more.)
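A minimal before/after usage sketch; the old behavior remains available by passing the flag explicitly:

```
import torch

model = torch.nn.Linear(4, 4)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

opt.zero_grad()                   # new default: grads are set to None
opt.zero_grad(set_to_none=False)  # old behavior: grads become zero-filled tensors
```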

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92731
Approved by: https://github.com/ngimel
2023-01-26 01:04:28 +00:00
Jane Xu
0d870b50d3 [optim][nadam] group tensors in foreach, make it default (#92715)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92715
Approved by: https://github.com/albanD
2023-01-21 05:43:37 +00:00
Jane Xu
9ccf9362c2 [optim][rprop] default to foreach when CUDA + differentiable=False (#92728)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92728
Approved by: https://github.com/albanD
2023-01-21 05:31:22 +00:00
Jane Xu
c628654724 [optim][rmsprop] default to foreach when CUDA + differentiable=False (#92727)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92727
Approved by: https://github.com/albanD
2023-01-21 05:31:22 +00:00
Jane Xu
7277247a8c [optim][radam] default to foreach when CUDA + differentiable=False (#92726)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92726
Approved by: https://github.com/albanD
2023-01-21 05:31:22 +00:00
Jane Xu
9f356568ab [optim][asgd] default to foreach when CUDA + differentiable=False (#92724)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92724
Approved by: https://github.com/albanD
2023-01-21 05:31:22 +00:00
Jane Xu
30bda6b12b [optim][adamax] default to foreach when CUDA + differentiable=False (#92723)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92723
Approved by: https://github.com/albanD
2023-01-21 05:31:22 +00:00
Jane Xu
9b4a778420 [optim][adagrad] default to foreach when CUDA + differentiable=False (#92716)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92716
Approved by: https://github.com/albanD
2023-01-21 05:31:22 +00:00
Jane Xu
de0375e79d [optim][foreach] Do NOT inplace modify gradients (#92706)
SGD and ASGD already had out-of-place grads.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92706
Approved by: https://github.com/ngimel, https://github.com/albanD
2023-01-21 00:12:28 +00:00
Jane Xu
2b885e1f6c [optim][NAdam] Fix discrepancy between mt vs st impl (#92699)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92699
Approved by: https://github.com/albanD
2023-01-21 00:12:28 +00:00
milesial
e4d83d54a6 Foreach gradient clipping (#91846)
Faster gradient clipping using the foreach functions

```
[------------------------ (tensors, scalar) -------------------------]
                                   |  without foreach  |  with foreach |    apex
1 threads: ----------------------------------------------------------------------
      10 tensors of size 4         |         120.5     |       61.1    |     50.3
      100 tensors of size 4        |         946.2     |      239.5    |    136.3
      1000 tensors of size 4       |        9808.5     |     2151.1    |   1006.9
      10000 tensors of size 4      |       96871.2     |    22637.4    |  10119.1
      10 tensors of size 16        |         121.0     |       64.1    |     52.5
      100 tensors of size 16       |         993.4     |      252.6    |    136.7
      1000 tensors of size 16      |        9427.7     |     2151.2    |   1049.5
      10000 tensors of size 16     |       97437.1     |    22203.1    |  10340.0
      10 tensors of size 256       |         118.9     |       62.3    |     51.5
      100 tensors of size 256      |         955.2     |      243.1    |    134.2
      1000 tensors of size 256     |        9374.9     |     2140.7    |   1009.6
      10000 tensors of size 256    |       95302.5     |    21849.4    |  10215.5
      10 tensors of size 65536     |         118.5     |       62.4    |     51.1
      100 tensors of size 65536    |        1740.7     |      243.3    |    225.3
      1000 tensors of size 65536   |       17364.1     |     2228.7    |   2004.5
      10000 tensors of size 65536  |      177510.1     |    25410.4    |  20678.2
```
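A minimal usage sketch, assuming the fast path is exposed through a `foreach` flag on `torch.nn.utils.clip_grad_norm_`:

```
import torch

model = torch.nn.Linear(16, 16)
loss = model(torch.randn(2, 16)).sum()
loss.backward()

# foreach=True requests the multi-tensor (foreach) implementation
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0, foreach=True)
```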
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91846
Approved by: https://github.com/janeyx99
2023-01-20 21:43:29 +00:00
Jane Xu
b2ca2c8662 [optim][adagrad] group tensors in foreach to maximize perf (#92362)
another one
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92362
Approved by: https://github.com/albanD
2023-01-20 16:24:39 +00:00
Jane (Yuan) Xu
3ba5eae72a [optim][radam] fix eps discrepancy for foreach (#92551)
Will likely race with https://github.com/pytorch/pytorch/pull/92365

eps was not being used at all in the mta/foreach impl. There was also a discrepancy between the docs and the implementation: the implementation was doing sqrt(x) + eps while the docs were doing sqrt(x + eps).

I've fixed the docs + extended the current multi_tensor test case to capture this issue.
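The two forms being compared, based only on the description above (variable names are illustrative):

```
import torch

adaptive_term = torch.tensor([1e-8, 4.0])
eps = 1e-6

impl_denom = adaptive_term.sqrt() + eps    # sqrt(x) + eps: what the implementation does
docs_denom = (adaptive_term + eps).sqrt()  # sqrt(x + eps): what the docs previously said
# the two differ most when adaptive_term is tiny relative to eps
```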

![image](https://user-images.githubusercontent.com/31798555/213300617-61cbb763-da2d-48e0-b3b6-0190594dd049.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92551
Approved by: https://github.com/albanD
2023-01-19 14:38:59 +00:00
Jane Xu
c5cb46ecdb [optim][asgd] group tensors in foreach to maximize perf (#92364)
faster foreach
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92364
Approved by: https://github.com/albanD
2023-01-18 23:09:55 +00:00
Jane Xu
fbafcecf8d [optim][radam] group tensors in foreach to maximize perf (#92365)
Also noticed that eps is neither used nor tested at all in the mta impl of RAdam.

Will fix in a follow-up PR before making foreach the default!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92365
Approved by: https://github.com/albanD
2023-01-18 22:32:27 +00:00
Jane Xu
de459bdfaa [optim][rmsprop] group tensors in foreach to maximize perf (#92369)
Test plan:
CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92369
Approved by: https://github.com/albanD
2023-01-18 22:28:52 +00:00
Jane Xu
07800c52af [optim][adam] group tensors in foreach to maximize perf (#92349)
same idea as https://github.com/pytorch/pytorch/pull/92338
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92349
Approved by: https://github.com/albanD
2023-01-18 22:05:42 +00:00
Jane (Yuan) Xu
e2433e420c [optim][adamax] group tensors in foreach to maximize perf (#92363)
make foreach faster
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92363
Approved by: https://github.com/albanD
2023-01-18 21:32:28 +00:00
Jane Xu
bb34461f00 [optim][rprop] group tensors in foreach to maximize perf (#92372)
This one had a few more for loops than I was expecting.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92372
Approved by: https://github.com/albanD
2023-01-18 20:03:11 +00:00
Jane Xu
0070c546b5 [BE][optim] abstract out docstrings, add differentiable docs (#92336)
1. Abstract out common docstrings; I'm sure there are more, but let this be a first step.
2. Add differentiable docs to the optimizers that are actually differentiable.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92336
Approved by: https://github.com/albanD
2023-01-18 15:09:28 +00:00
Jane Xu
a41f00ed70 [optim][sgd] group tensors in foreach to maximize perf (#92338)
Make foreach faster for SGD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92338
Approved by: https://github.com/albanD
2023-01-18 04:02:41 +00:00
Jane Xu
0157e2ef4e [optim][adamw] default to foreach when CUDA + differentiable=False (#92306)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92306
Approved by: https://github.com/albanD
2023-01-18 00:13:50 +00:00
Jane Xu
4fc796daf9 [optim] abstract out _default_to_foreach_util (#92305)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92305
Approved by: https://github.com/albanD
2023-01-17 19:42:20 +00:00
Jane Xu
d41b5d7c14 [adam] Add not torch.jit.is_scripting() as a requirement for switching to fused (#92181)
A "fix" following https://github.com/pytorch/pytorch/pull/90865. Realized that fused is not compatible with torch.jit.is_scripting() when looking at a later line.

Took the opportunity to make the code cleaner/slightly more performant (with the extends) as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92181
Approved by: https://github.com/albanD
2023-01-14 19:05:27 +00:00
Jane Xu
d3765509df [optim][adadelta] default to foreach when CUDA + differentiable=False (#91896)
following up to https://github.com/pytorch/pytorch/pull/90865 and https://github.com/pytorch/pytorch/pull/92048
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91896
Approved by: https://github.com/albanD
2023-01-14 01:21:33 +00:00
Jane Xu
4af5939d7a [optim] Improve adadelta foreach, group tensors to maximize fast path (#92048)
The old behavior would have the Adadelta foreach impl send tensors to the slow path if they were not all of the same dtype or on the same device.

This PR adds grouping for adadelta optimizer so that it would run foreach in batches, allowing more users to benefit from foreach perf.

Of course, we should ensure that the new implementation works, so there are new tests to ensure this behavior is not broken.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92048
Approved by: https://github.com/albanD
2023-01-14 00:35:14 +00:00
Anupam Bhatnagar
f4b804eeaa Call profiler step via optimizer post hook (#90101)
This PR adds the `_profile_using_dynolog` function to `torch/__init__.py`. The `_profile_using_dynolog` method allows registering the optimizer step post hook. This is required to collect iteration-based traces using dynolog.

Other related changes for tests to pass:
1. Updated `optimizer.pyi`
1. Updated `overrides.py`
1. The test `test_kineto_profiler_multiple_steppers` in `test_profiler.py` has been broken down into two cases:
     - `test_kineto_profiler_multiple_steppers_with_override_True` : this test uses the override argument
     - `test_kineto_profiler_multiple_steppers_with_override_False` : this test uses the environment variable
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90101
Approved by: https://github.com/albanD
2023-01-13 18:07:40 +00:00
Nouran Ali
a60125e298 add docstring for adam differentiable parameter (#91881)
Fixes #90467

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91881
Approved by: https://github.com/janeyx99
2023-01-13 17:08:27 +00:00
albanD
60e37a6e08 Update sgd doc to insist on momentum buffer initial value (#92111)
Following the discussion in https://github.com/pytorch/pytorch/pull/91108
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92111
Approved by: https://github.com/soumith, https://github.com/janeyx99
2023-01-13 15:50:57 +00:00
milesial
9412778d51 Fix OneCycleLR error log (#92040)
If we call the scheduler 11 times but the number of expected steps is 10, we should print `Tried to step 11 times`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92040
Approved by: https://github.com/janeyx99
2023-01-13 02:46:59 +00:00
Jane Xu
ed7885c254 [utils][foreach] Add group tensor by device and dtype util (#92014)
Add util that will be commonly used throughout optim
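A rough sketch of what such a grouping util does (not the exact PyTorch helper): bucket tensors by `(device, dtype)` so each bucket can go through a single foreach call:

```
from collections import defaultdict

import torch

def group_by_device_and_dtype(tensors):
    groups = defaultdict(list)
    for t in tensors:
        groups[(t.device, t.dtype)].append(t)
    return dict(groups)

tensors = [torch.ones(2), torch.ones(2, dtype=torch.float64), torch.zeros(2)]
buckets = group_by_device_and_dtype(tensors)  # two buckets: float32 and float64 on CPU
```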
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92014
Approved by: https://github.com/albanD
2023-01-11 23:37:20 +00:00
PyTorch MergeBot
7f2b5ea1e1 Revert "Avoid device casting for all singleton tensors in optimizer states (#91454)"
This reverts commit 1e725c9747.

Reverted https://github.com/pytorch/pytorch/pull/91454 on behalf of https://github.com/janeyx99 due to Likely caused regression where checkpoint resume fails during training
2023-01-10 18:57:50 +00:00
Joel Schlosser
1e725c9747 Avoid device casting for all singleton tensors in optimizer states (#91454)
Fixes #75224
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91454
Approved by: https://github.com/janeyx99
2023-01-04 17:55:00 +00:00
joncrall
ad782ff7df Enable xdoctest runner in CI for real this time (#83816)
Builds on #83317 and enables running the doctests. Just need to figure out what is causing the failures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83816
Approved by: https://github.com/ezyang, https://github.com/malfet
2022-12-29 05:32:42 +00:00
Adrian Wälchli
f5e20d6060 Make the state dict of CyclicLR scheduler pickleable (#91400)
Fixes #90414

This PR drops the CyclicLR scheduler's unpicklable `weakref.WeakMethod` object from the state dict, and re-initializes it once the state dict gets loaded. This makes the state picklable so you can include it in your checkpoint. Also fixes https://github.com/Lightning-AI/lightning/issues/15901

A simple test was added that calls `pickle.dumps(state)` on the state.
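A rough sketch of the approach (attribute and helper names here are assumptions, not the actual PyTorch ones): drop the `WeakMethod` from the serialized state and rebuild it after loading:

```
import weakref

class SchedulerSketch:
    def __init__(self):
        self._scale_fn_ref = weakref.WeakMethod(self._default_scale_fn)  # unpicklable

    def _default_scale_fn(self, x):
        return 1.0

    def state_dict(self):
        # exclude the weakref so the state can be pickled into a checkpoint
        return {k: v for k, v in self.__dict__.items() if k != "_scale_fn_ref"}

    def load_state_dict(self, state_dict):
        self.__dict__.update(state_dict)
        # re-create the weakref once the state has been restored
        self._scale_fn_ref = weakref.WeakMethod(self._default_scale_fn)
```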

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91400
Approved by: https://github.com/albanD
2022-12-28 18:05:24 +00:00
Jane Xu
a061f139dc [optim] Adam defaults to fused when CUDA + differentiable=False (#90865)
Step 1 in faster default optimizers.

Preliminary benchmarks show gaps in improvement on CUDA for BERT_pytorch and resnet18:
![image](https://user-images.githubusercontent.com/31798555/207707118-14221802-77ce-4ee0-96e3-04638c07924c.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90865
Approved by: https://github.com/albanD
2022-12-27 01:28:47 +00:00
richardachen
dafd0432ee Update __init__.py (#91196)
Fixes #91080

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91196
Approved by: https://github.com/janeyx99
2022-12-20 23:38:25 +00:00
Michael Lazos
1accd915a4 Re-enable optimizers (#90709)
Fixes
https://github.com/pytorch/pytorch/issues/90165
https://github.com/pytorch/torchdynamo/issues/328

Re-enables optimizer capture + compilation now that the dynamo slowdowns have been fixed.

It also has speedups; numbers to come soon.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90709
Approved by: https://github.com/anijain2305, https://github.com/jansel, https://github.com/yanboliang
2022-12-19 04:07:41 +00:00
Soumith Chintala
06326a7721 [optim] skip .item calls in all optimizers when compiling with dynamo (#88173)
@mlazos: skips `item()` calls if compiling with dynamo, by defining a helper function `_get_value` which either returns the result of `.item()` or the scalar cpu tensor if compiling with dynamo. This was done because removing `item()` calls significantly regresses eager perf. Additionally, `_dispatch_sqrt` calls the appropriate sqrt function (math.sqrt, or torch.sqrt).

Fixes https://github.com/pytorch/torchdynamo/issues/1083

This PR will no longer be needed once symint support is default.

This PR closes all remaining graph breaks in the optimizers (!!)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88173
Approved by: https://github.com/albanD
2022-12-12 17:32:35 +00:00
Mauricio Villegas
aacafd2cba Fixed a couple of mistakes in type annotations in optim package (#90216)
While doing some tests with all Optimizer and LRScheduler classes in the optim package, I noticed a couple of mistakes in type annotations, so I created a pull request to fix them.

- In the Optimizer class, the parameter is incorrectly named `default` instead of `defaults` in the pyi file
- In the SGD class, types for `maximize` and `differentiable` are not available in either the py or pyi files

I don't know if there is a plan to move all types from pyi to py files, so I wasn't too sure where to fix what.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90216
Approved by: https://github.com/janeyx99
2022-12-09 03:20:21 +00:00
Anupam Bhatnagar
6f4dea562d Implement post and pre hooks for optimizer (#89176)
Fixes #88446

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89176
Approved by: https://github.com/albanD
2022-12-02 07:03:45 +00:00
Michael Lazos
c63afb283c Disable dynamo on optimizer lazy initialization (#89902)
Helps with https://github.com/pytorch/torchdynamo/issues/1803

Separate out the group initialization and disable dynamo on it

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89902
Approved by: https://github.com/soumith, https://github.com/albanD
2022-12-02 01:15:11 +00:00
Michael Lazos
3d47c74cfe Update code style for optimizer code (#89862)
Separating out whitespace-only changes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89862
Approved by: https://github.com/albanD, https://github.com/soumith
2022-11-30 00:53:05 +00:00
albanD
c3e85d879c Mention discrepancy between original impl and our impl of RAdam (#89575)
Fixes https://github.com/pytorch/pytorch/issues/88836

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89575
Approved by: https://github.com/mruberry
2022-11-24 17:11:42 +00:00
Jane Xu
310335de48 Update lr_scheduler.pyi to match lr_scheduler.py (#88818)
Following #88503, we should also update the pyi file

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88818
Approved by: https://github.com/soulitzer
2022-11-11 04:02:44 +00:00
Jane Xu
0a69c50a46 Publicly expose _LRScheduler to LRScheduler (#88503)
Fixes #61232

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88503
Approved by: https://github.com/soulitzer
2022-11-07 21:15:10 +00:00
Kazuaki Ishizaki
2ddefbdc3c Fix typos used in documents under torch directory (#88300)
This PR fixes typos in comments of Python files, found via the search box at https://pytorch.org/docs/master/search.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88300
Approved by: https://github.com/lezcano
2022-11-02 09:38:13 +00:00
RangiLyu
512a3a48e3 sync AveragedModel buffers when use_buffers=False (#84054)
Fixes #84053

As described in the issue, AveragedModel deep-copies the model during initialization, which means that the buffers in the averaged model are not updated together with the model.

One solution is to set the buffers equal to the source model's buffers every time `update_parameters` is called.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84054
Approved by: https://github.com/samdow
2022-10-24 16:03:14 +00:00
Emilio Castillo
1b43883fd6 Make AdamW, NAdam & RAdam differentiable (#86183)
Blocked by #86096
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86183
Approved by: https://github.com/albanD
2022-10-17 04:32:08 +00:00
mikael10j
7dcfbedce0 Fix LinearLR scheduler start_factor (#86695)
Fixes #86454

The `start_factor` must lie in the interval (0, 1] rather than [0, 1] to avoid division by 0. This PR changes the lower-limit check for the parameter.
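A small usage sketch of the tightened check, assuming the standard `LinearLR` signature: a strictly positive `start_factor` is accepted, while 0 should now be rejected:

```
import torch

model = torch.nn.Linear(4, 4)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

sched = torch.optim.lr_scheduler.LinearLR(opt, start_factor=0.5, total_iters=4)  # ok
# torch.optim.lr_scheduler.LinearLR(opt, start_factor=0.0)  # now raises ValueError
```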

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86695
Approved by: https://github.com/albanD
2022-10-13 17:31:36 +00:00
Emilio Castillo
cb4867a71a Make ASGD & RProp differentiable (#86258)
Blocked by #86183
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86258
Approved by: https://github.com/albanD
2022-10-13 04:06:13 +00:00
Emilio Castillo
aacb9f3ac6 Make Adadelta,Adagrad & Adamax differentiable (#86096)
Continuing the differentiable optimizers support

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86096
Approved by: https://github.com/janeyx99
2022-10-12 23:16:29 +00:00
kshitij12345
82229d1e33 [optim] fix: empty grad support for SparseAdam (#86459)
Fixes #82486

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86459
Approved by: https://github.com/albanD
2022-10-07 19:24:59 +00:00
Check Deng
b3fdb02fb2 Fix memory leak in _LRScheduler.step() (#85602)
Fixes #85410

This diff removed the cyclic references in `_LRScheduler.step()`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85602
Approved by: https://github.com/albanD
2022-10-07 15:55:55 +00:00
Tongzhou Wang
5ed75ec1d7 Fix SparseAdam consuming iterator (#86210)
Fixes https://github.com/pytorch/pytorch/issues/86209
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86210
Approved by: https://github.com/cpuhrsch
2022-10-06 23:11:25 +00:00
PyTorch MergeBot
233d6f195a Revert "Fix memory leak in _LRScheduler.step() (#85602)"
This reverts commit eb32330d6b.

Reverted https://github.com/pytorch/pytorch/pull/85602 on behalf of https://github.com/albanD due to newly added test is flaky
2022-10-06 22:02:02 +00:00
Chengqi Deng
eb32330d6b Fix memory leak in _LRScheduler.step() (#85602)
Fixes #85410

This diff removed the cyclic references in `_LRScheduler.step()`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85602
Approved by: https://github.com/albanD
2022-10-06 17:07:36 +00:00
Masaki Kozuki
5f26df0345 resubmit: "resubmit: [mta] APEX style Fused Adam (#81705) (#85507)" (#85739)
Embarrassingly move the pow implementations around [ATen/native/cuda/PowKernel.cu#L21-L66](849b08f14b/aten/src/ATen/native/cuda/PowKernel.cu (L21-L66)) to a new header file and let FusedAdam use them to tame MSVC, hopefully.

cc @ngimel @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85739
Approved by: https://github.com/ngimel
2022-09-29 16:58:59 +00:00
Seonglyong Gong
f80ef73d1c [Profiler] tracking Optimizer (part 2 of Record Optimizer) (#84920)
Summary:
Part 2 of Record Optimizer param_groups and states (https://github.com/pytorch/pytorch/pull/84063)
- hooking from optimizer step
- PyOptCall Type
- declare data type for collection
- python binding
- simple unit test case

Test Plan: buck run mode/opt //caffe2/test:profiler

Differential Revision: D39402667

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84920
Approved by: https://github.com/robieta
2022-09-28 02:48:07 +00:00
Peter Jung
9f1468ae6c CyclicLR memory leak fix (#85462)
Hi, we noticed in our team that using CyclicLR causes a problem with memory clearance on the GPU (it would probably be the case without the GPU as well, but that was our use case). After initializing CyclicLR, GPU memory is not cleared even after the model, optimizer, and scheduler are out of scope (i.e. their reference count is zero). This is because the `__init__` method inside `CyclicLR` creates references to its own methods, and they will not be removed until `gc.collect()` is called manually. This is a problem if people want to test multiple models in one run of a script: after testing the first model, the second one will fail with a `CUDA out of memory` error because the first one has not been cleared from memory.

I propose a simple fix using `weakref`, similarly to the `_LRScheduler` base class, but if you have any comments I am happy to change it.

Here is the code to reproduce the bug:

```
import torch
import weakref
from transformers import DetrForObjectDetection

class X:
    def __init__(self, optimizer):
        self.optimizer = optimizer

        # Will cause cyclic reference.
        self.func = self.dummy

        # Will work as expected, memory cleared after instance count is zero.
        # self.func = weakref.WeakMethod(self.dummy)

    def dummy(self, x):
        return 1.

def test():
    model = DetrForObjectDetection.from_pretrained('facebook/detr-resnet-50')
    model.to('cuda')
    optimizer = torch.optim.Adam(model.parameters())
    x = X(optimizer)

test()
print(f'{torch.cuda.memory_reserved()}, {torch.cuda.memory_allocated()}')  # Should print (<some memory>, 0), but with cyclic reference, it will print (<some memory>, <some memory>).
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85462
Approved by: https://github.com/albanD
2022-09-27 17:41:58 +00:00
PyTorch MergeBot
7167996346 Revert "resubmit: [mta] APEX style Fused Adam (#81705) (#85507)"
This reverts commit 4615d1bcfa.

Reverted https://github.com/pytorch/pytorch/pull/85507 on behalf of https://github.com/atalman due to Break internal windows builds
2022-09-27 16:59:35 +00:00
Masaki Kozuki
4615d1bcfa resubmit: [mta] APEX style Fused Adam (#81705) (#85507)
This PR implements an APEX style FusedAdam in PyTorch. This is different from the APEX one in that this is compatible with `torch.cuda.amp.GradScaler` by setting `_step_supports_amp_scaling` to `True` and unscales gradients inside its CUDA kernel.

related: https://github.com/pytorch/pytorch/issues/68041, https://github.com/pytorch/pytorch/issues/71274, https://github.com/pytorch/pytorch/issues/80167 possibly related to https://github.com/pytorch/pytorch/issues/80595#issuecomment-1178519436
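A minimal usage sketch of combining this with AMP gradient scaling, assuming the fused path is selected via `fused=True` on `torch.optim.Adam` and a CUDA device is available:

```
import torch

model = torch.nn.Linear(16, 16).cuda()
opt = torch.optim.Adam(model.parameters(), fused=True)
scaler = torch.cuda.amp.GradScaler()

for _ in range(3):
    opt.zero_grad()
    with torch.cuda.amp.autocast():
        loss = model(torch.randn(8, 16, device="cuda")).sum()
    scaler.scale(loss).backward()
    scaler.step(opt)   # _step_supports_amp_scaling lets unscaling happen inside the kernel
    scaler.update()
```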

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81705
Approved by: https://github.com/ngimel

cc @ptrblck @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85507
Approved by: https://github.com/ngimel
2022-09-23 18:56:00 +00:00
PyTorch MergeBot
e505360eb8 Revert "[mta] APEX style Fused Adam (#81705)"
This reverts commit 7a6c4d0c50.

Reverted https://github.com/pytorch/pytorch/pull/81705 on behalf of https://github.com/dagitses due to broke internal builds, details to come
2022-09-22 19:37:29 +00:00
Masaki Kozuki
7a6c4d0c50 [mta] APEX style Fused Adam (#81705)
This PR implements an APEX style FusedAdam in PyTorch.
This is different from the APEX one in that this is compatible with `torch.cuda.amp.GradScaler` by setting `_step_supports_amp_scaling` to `True` and unscales gradients inside its CUDA kernel.

related: https://github.com/pytorch/pytorch/issues/68041, https://github.com/pytorch/pytorch/issues/71274, https://github.com/pytorch/pytorch/issues/80167
possibly related to https://github.com/pytorch/pytorch/issues/80595#issuecomment-1178519436

cc @ptrblck @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81705
Approved by: https://github.com/ngimel
2022-09-20 17:18:33 +00:00
F-G Fernandez
7243264c61 fix: Allowed optimizers with more than 2 betas (#84486)
Hello there 👋

As discussed in #84485, this PR enables more flexibility in the optimizers that can be wrapped by LR schedulers in PyTorch. Currently, they are incompatible with optimizers whose number of betas differs from 2. This PR fixes that with minimal modifications.

Fixes #84485

Any feedback is welcome!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84486
Approved by: https://github.com/Lezcano, https://github.com/soulitzer
2022-09-06 19:24:10 +00:00
kshitij12345
faac3dbce2 [optim] asgd : handle complex params as independent real params (#84472)
Ref: #65711
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84472
Approved by: https://github.com/Lezcano, https://github.com/soulitzer
2022-09-06 16:58:42 +00:00
kshitij12345
7c20ad3dfa [optim] rprop: handle complex params as independent real params (#83858)
Ref #65711

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83858
Approved by: https://github.com/albanD
2022-08-23 08:39:35 +00:00
Kshiteej K
09331c947c [optim] rmsprop: handle complex params as independent real params (#83860)
Ref: #65711
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83860
Approved by: https://github.com/albanD
2022-08-22 21:55:01 +00:00
joncrall
b136f3f310 More doctest refinements. (#83317)
Follow up to #82797

Now that the doctests themselves are in a better state, we should be able to enable xdoctest on the CI so they stay that way.

@ezyang @vadimkantorov
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83317
Approved by: https://github.com/ezyang
2022-08-22 20:07:26 +00:00
Emilio Castillo
f0eb841d20 Make torch.optim.RMSprop differentiable (#83578)
Blocked by #82205
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83578
Approved by: https://github.com/albanD
2022-08-22 03:37:10 +00:00
albanD
84c4b07932 Make sure that we can load old optimizer checkpoint (#83588)
We want to make sure that we can load checkpoints that were saved with an older version of the code (which doesn't contain the differentiable attribute).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83588
Approved by: https://github.com/mikaylagawarecki
2022-08-17 15:08:05 +00:00
Emilio Castillo
5aab57e112 Make Adam optimizer differentiable (#82205)
Continues [80938](https://github.com/pytorch/pytorch/pull/80938)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82205
Approved by: https://github.com/albanD
2022-08-17 07:20:37 +00:00
Rob Zinkov
ff75562cff Adding maximize to rprop (#81864)
Added the maximize flag (#68052) to the Rprop optimizer and updated the respective tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81864
Approved by: https://github.com/albanD
2022-08-16 08:19:46 +00:00
joncrall
4618371da5 Integrate xdoctest - Rebased (#82797)
This is a new version of #15648 based on the latest master branch.

Unlike the previous PR where I fixed a lot of the doctests in addition to integrating xdoctest, I'm going to reduce the scope here. I'm simply going to integrate xdoctest, and then I'm going to mark all of the failing tests as "SKIP". This will let xdoctest run on the dashboards, provide some value, and still let the dashboards pass. I'll leave fixing the doctests themselves to another PR.

In my initial commit, I do the bare minimum to get something running with failing dashboards. The few tests that I marked as skip are causing segfaults. Running xdoctest results in 293 failed, 201 passed tests. The next commits will be to disable those tests. (unfortunately I don't have a tool that will insert the `#xdoctest: +SKIP` directive over every failing test, so I'm going to do this mostly manually.)

Fixes https://github.com/pytorch/pytorch/issues/71105

@ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82797
Approved by: https://github.com/ezyang
2022-08-12 02:08:01 +00:00
Federico Pozzi
f8a10a7f79 feat: add PolynomialLR scheduler (#82769)
### Description
<!-- What did you change and why was it needed? -->

Add PolynomialLR scheduler.

### Issue
Closes #79511.

### Testing
I added tests for PolynomialLR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/82769
Approved by: https://github.com/datumbox
2022-08-10 18:21:00 +00:00
Rob Zinkov
c54d18dbc7 Handle complex optimization in Adamax by treating complex numbers as 2D real numbers (#80319)
This commit partially addresses #65711

Pull Request resolved: https://github.com/pytorch/pytorch/pull/80319
Approved by: https://github.com/albanD
2022-08-05 21:03:18 +00:00
Rob Zinkov
dcbe9ce2ad Handle complex optimization in AdamW by treating complex numbers as 2D real numbers (#80280)
This commit partially addresses #65711

Pull Request resolved: https://github.com/pytorch/pytorch/pull/80280
Approved by: https://github.com/albanD
2022-08-05 13:47:14 +00:00
Masaki Kozuki
3139722679 [foreach][mta] Inplace maximum and minimum (#82523)
### Description
<!-- What did you change and why was it needed? -->
Implement `torch._foreach_maximum_` and `torch._foreach_minimum_` mainly for `_multi_tensor_adam` and `_multi_tensor_adamw` with `amsgrad=True` to correctly update their `max_exp_avg_sqs`.

### Issue
<!-- Link to Issue ticket or RFP -->
- https://github.com/pytorch/pytorch/issues/78807
- https://github.com/pytorch/pytorch/pull/81894
- https://github.com/pytorch/pytorch/pull/81348
- https://github.com/pytorch/pytorch/pull/81705
- https://github.com/pytorch/pytorch/issues/58833
- https://github.com/pytorch/pytorch/issues/68041

### Testing
<!-- How did you test your change? -->
Updated `test_foreach.py::TestForeach::_minmax_test` to compare the outputs of `_foreach_maximum_` (and `_foreach_minimum_`) against those of `[torch.maximum(a, b) for a, b in zip(tensors1, tensors2)]`
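A small sketch of the equivalence the test checks; the in-place op writes the elementwise maxima back into the first tensor list:

```
import torch

tensors1 = [torch.tensor([1.0, 5.0]), torch.tensor([2.0])]
tensors2 = [torch.tensor([3.0, 4.0]), torch.tensor([1.0])]

expected = [torch.maximum(a, b) for a, b in zip(tensors1, tensors2)]
torch._foreach_maximum_(tensors1, tensors2)  # tensors1 now holds the elementwise maxima
assert all(torch.equal(a, e) for a, e in zip(tensors1, expected))
```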

cc @ngimel @albanD @mikaylagawarecki
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82523
Approved by: https://github.com/albanD
2022-08-03 03:40:42 +00:00
ProGamerGov
71d50f4f89 Change docstring type callable to Callable for consistency (#82487)
### Description

Across PyTorch's docstrings, both `callable` and `Callable` are used for variable types. Callable should be capitalized, as we are referring to the `Callable` type and not the Python `callable()` function.

### Testing

There shouldn't be any testing required.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/82487
Approved by: https://github.com/albanD
2022-08-01 17:26:09 +00:00
ProGamerGov
357b7d589c Fix docstring inconsistencies: string -> str, boolean -> bool (#82410)
### Description

Throughout the PyTorch docs and codebase, the `string` type in docstrings is referred to by two separate names. This leads to inconsistent docs, like you can see here: https://pytorch.org/docs/stable/generated/torch.nn.Conv3d.html#torch.nn.Conv3d

This PR fixes this issue by ensuring that all mentions of the string type in docstrings, are using the same format that Sphinx generates hyperlinks for.

### Testing
No testing should be required for this change

Pull Request resolved: https://github.com/pytorch/pytorch/pull/82410
Approved by: https://github.com/jbschlosser
2022-07-28 21:29:57 +00:00
Rob Zinkov
f9ef363982 Modifying Adam to support complex numbers as 2d real numbers (#80279)
This commit addresses issues in #65711
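The trick, roughly, is to operate on a real view of each complex parameter so the existing real-valued update math applies; a minimal sketch:

```
import torch

p = torch.randn(3, dtype=torch.complex64)  # stands in for a complex parameter
g = torch.randn(3, dtype=torch.complex64)  # and its gradient

p_real = torch.view_as_real(p)  # shape (3, 2): real and imaginary parts as reals
g_real = torch.view_as_real(g)

p_real.add_(g_real, alpha=-0.01)  # a plain real-valued update; p is updated in place
```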

Pull Request resolved: https://github.com/pytorch/pytorch/pull/80279
Approved by: https://github.com/albanD
2022-07-27 18:39:40 +00:00
Sudarshan Raghunathan
52aae5aa19 [Sparse Adam] Fix error in loading serialized models due to introduction of new parameter (#82273)
### Description
PR #80336 introduced a new parameter to the SparseAdam optimizer. The new parameter is accessed inside the `step` method of the optimizer. If we deserialize and run a version of the optimizer serialized before this change was introduced, it fails in the step that tries to access the missing parameter.

I have added a workaround to set a default value in case the parameter is unavailable in the optimizer.

### Issue
<!-- Link to Issue ticket or RFP -->

### Testing
* Testing on PyTorch CI
* Manual validation against existing serialized models to make sure they continue to work
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82273
Approved by: https://github.com/mehtanirav, https://github.com/albanD
2022-07-27 12:48:38 +00:00
albanD
312ece7f65 fix sgd maximize when momentum is involved (#81859)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81859
Approved by: https://github.com/jbschlosser
2022-07-26 16:48:32 +00:00
Emilio Castillo
49b4f45781 Add initial support for differentiable optimizers (#80938)
Adds the `differentiable` argument, a method for updating parameters in an existing optimizer, and a template for testing the differentiability of multiple optimizers.

This is all based on discussions with @albanD & @jbschlosser
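For context, a minimal hedged sketch of turning the flag on (just the constructor; a full example of differentiating through `step()` typically operates on non-leaf copies of the parameters and is omitted here):

```python
import torch

model = torch.nn.Linear(4, 2)
# With differentiable=True the optimizer skips the usual no_grad fast path so
# that the parameter update itself can participate in autograd (e.g. for
# meta-learning / unrolled optimization).
opt = torch.optim.SGD(model.parameters(), lr=0.1, differentiable=True)
```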
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80938
Approved by: https://github.com/albanD
2022-07-25 13:37:08 +00:00
Rob Zinkov
50c655d5e3 Adding maximize to ASGD (#81875)
Added the maximize flag (#68052) to the ASGD optimizer and updated the respective tests.
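As a hedged illustration of what the flag does in general (ascent on the objective instead of descent; this is my toy example, not a test from the PR):

```python
import torch

x = torch.tensor([0.5], requires_grad=True)
opt = torch.optim.ASGD([x], lr=0.1, maximize=True)

for _ in range(200):
    opt.zero_grad()
    objective = -(x - 3.0).pow(2).sum()   # maximized at x = 3
    objective.backward()
    opt.step()

print(x)  # moves toward 3 under gradient ascent
```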
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81875
Approved by: https://github.com/albanD
2022-07-22 17:05:41 +00:00
PyTorch MergeBot
135af0fe30 Revert "Adding maximize to ASGD (#80323)"
This reverts commit 14bd5bd6ee.

Reverted https://github.com/pytorch/pytorch/pull/80323 on behalf of https://github.com/albanD due to Broke rocm test
2022-07-08 13:35:31 +00:00
PyTorch MergeBot
0b8a5ca01b Revert "Adding maximize to rprop (#80335)"
This reverts commit 495aa9bc3a.

Reverted https://github.com/pytorch/pytorch/pull/80335 on behalf of https://github.com/albanD due to Broke rocm and windows test
2022-07-08 13:34:02 +00:00
Rob Zinkov
f24c94d7ae Adding maximize to SparseAdam (#80336)
Added the maximize flag (#68052) to the SparseAdam optimizer and updated the respective tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80336
Approved by: https://github.com/albanD
2022-07-08 12:17:27 +00:00
Rob Zinkov
495aa9bc3a Adding maximize to rprop (#80335)
Added the maximize flag (#68052) to the Rprop optimizer and updated the respective tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80335
Approved by: https://github.com/albanD
2022-07-08 08:04:38 +00:00
Rob Zinkov
a1fd5b4273 Adding maximize to RMSprop (#80326)
Added the maximize flag (#68052) to the RMSprop optimizer and updated the respective tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/80326
Approved by: https://github.com/albanD
2022-07-08 08:04:26 +00:00
Rob Zinkov
14bd5bd6ee Adding maximize to ASGD (#80323)
Added the maximize flag (#68052) to the ASGD optimizer and updated the respective tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80323
Approved by: https://github.com/albanD
2022-07-08 08:03:36 +00:00
albanD
9d20af5060 remove overly restrictive checks for cudagraph (#80881)
Finish fixing https://github.com/pytorch/pytorch/issues/80809
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80881
Approved by: https://github.com/jbschlosser
2022-07-06 18:08:49 +00:00
Edward Z. Yang
57f001f35a Don't error if _warned_capturable_if_run_uncaptured not set (#80345)
This can happen if an optimizer was pickled.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80345
Approved by: https://github.com/malfet, https://github.com/albanD
2022-06-29 03:46:22 +00:00
anjali411
bda04e9f5e Add __all__ for torch.optim and torch.nn.modules modules (#80237)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80237
Approved by: https://github.com/albanD
2022-06-24 21:34:10 +00:00
Sergii Dymchenko
de7219e8a7 Use generators with all/any in torch/optim (#78142)
Generator comprehensions with any/all are less verbose and potentially help to save memory/CPU : https://eklitzke.org/generator-comprehensions-and-using-any-and-all-in-python

To make JIT work with this change, I added code to convert GeneratorExp to ListComp, so the whole PR is basically a no-op for JIT, but a potential memory and speed improvement for eager mode.

I also removed a test from test/jit/test_parametrization.py. The test was incomplete (it had a TODO to actually implement it) and only checked that UnsupportedNodeError is thrown; with GeneratorExp support, a different error would be thrown.
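A small hedged illustration of the pattern (not a line taken from this PR):

```python
import torch

params = [torch.randn(3, requires_grad=True) for _ in range(4)]

# List comprehension: materializes the whole list before any() runs.
has_sparse = any([p.grad is not None and p.grad.is_sparse for p in params])

# Generator expression: evaluates lazily and short-circuits at the first True.
has_sparse = any(p.grad is not None and p.grad.is_sparse for p in params)
```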
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78142
Approved by: https://github.com/malfet, https://github.com/albanD
2022-06-24 17:23:45 +00:00
albanD
375668cd96 Remove overly restrictive assert in adam (#80222)
This was causing issues for users who keep `step` on CUDA for a good reason.

These asserts made code that used to run just fine start to fail.
Note that keeping `step` on CUDA is a pretty bad thing to do for performance, so it is ok to try to push users away from doing it.

For the 1.12.1 milestone: this alone does not call for a dot release (the pattern is bad practice anyway), but it would be a great thing to include if we do one: it is very low risk and prevents breakage for users.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80222
Approved by: https://github.com/jbschlosser, https://github.com/ngimel
2022-06-24 17:08:34 +00:00
Antonio Kim
765b6a8fab Fix SequentialLR initialization (#72856)
When we have multiple learning rate schedulers, the order in which they are initialized was not being taken into account. This is a problem if they are initialized in sequential order (as one might intuitively do).

Each scheduler calls `step()` on initialization and sets the `lr` in its optimizer's `param_groups`. However, this means that step 0 will use the `lr` set by the very last scheduler (when schedulers are initialized sequentially) instead of the first one.

The fix in this PR addresses the above bug by calling the appropriate scheduler on initialization after decrementing the `last_epoch` values, in order to keep them the same post-step. This ensures that the correct scheduler is the one setting the `lr` values for the optimizer's `param_groups`.
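An illustrative sketch of the symptom (the scheduler choices and values here are my own, not from the PR):

```python
import torch

opt = torch.optim.SGD(torch.nn.Linear(2, 2).parameters(), lr=1.0)

# Constructed sequentially; each constructor calls step() and writes an lr.
warmup = torch.optim.lr_scheduler.ConstantLR(opt, factor=0.1, total_iters=2)
decay = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.9)
sched = torch.optim.lr_scheduler.SequentialLR(
    opt, schedulers=[warmup, decay], milestones=[2]
)

# Before the fix, the lr at step 0 reflected the last constructed scheduler;
# with the fix it comes from `warmup` (0.1 * 1.0 here).
print(opt.param_groups[0]["lr"])
```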
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72856
Approved by: https://github.com/jbschlosser
2022-06-21 20:21:13 +00:00
Janosh Riebesell
660d9ddef4 Fix SWALR doc string (#79836)
In `torch/optim/swa_utils.py`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79836
Approved by: https://github.com/albanD
2022-06-20 12:57:07 +00:00
Sebastian Brodehl
fb9d8de379 Make LR scheduler stub complete, including OneCycleLR and class attributes. (#59476)
This PR completes the stub file for the LR schedulers, includes a previously missing scheduler, namely `OneCycleLR`, and adds additional class attributes and methods for all LR schedulers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59476
Approved by: https://github.com/jbschlosser
2022-06-17 16:39:13 +00:00
Madhushan B
9acbaaaf05 Fix typo in ChainedScheduler docstring (#79775)
### Goal
Fixes https://github.com/pytorch/pytorch/issues/79720

### Approach
Replace `belong` with `belonging`, so that the docstring reads: `Chains list of learning rate schedulers. It takes a list of chainable learning rate schedulers and performs consecutive step() functions belonging to them by just one call.`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79775
Approved by: https://github.com/albanD
2022-06-17 14:18:42 +00:00
Michael Carilli
ba27ee9e8f [CUDA graphs] Allows Adam and AdamW to be capture-safe (#77862)
Near term fix for https://github.com/pytorch/pytorch/issues/76368.

Q. Why does the user need to request `capturable=True` in the optimizer constructor? Why can't capture safety be completely automatic?
A. We need to set up capture-safe (device-side) state variables before capture. If we don't, and step() internally detects capture is underway, it's too late: the best we could do is create a device state variable and copy the current CPU value into it, which is not something we want baked into the graph.

Q. Ok, why not just do the capture-safe approach with device-side state variables all the time?
A. It incurs several more kernel launches per parameter, which could really add up and regress cpu overhead for ungraphed step()s. If the optimizer won't be captured, we should allow step() to stick with its current cpu-side state handling.

Q. But cuda RNG is a stateful thing that maintains its state on the cpu outside of capture and replay, and we capture it automatically. Why can't we do the same thing here?
A. The graph object can handle RNG generator increments because its capture_begin, capture_end, and replay() methods can see and access the generator object. But the graph object has no explicit knowledge of or access to optimizer steps in its capture scope. We could let the user tell the graph object which optimizers will be stepped in its scope, i.e. something like
```python
graph.will_use_optimizer(opt)
graph.capture_begin()
...
```
but that seems clunkier than an optimizer constructor arg.

I'm open to other ideas, but right now I think constructor arg is necessary and the least bad approach.

Long term, https://github.com/pytorch/pytorch/issues/71274 is a better fix.
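A hedged usage sketch (needs a CUDA device; the tiny model and shapes are placeholders), roughly following the whole-network capture recipe from the CUDA Graphs docs:

```python
import torch

model = torch.nn.Linear(16, 16, device="cuda")
# capturable=True keeps the step counter and related state on the GPU so that
# step() is safe to capture.
opt = torch.optim.Adam(model.parameters(), lr=1e-3, capturable=True)
static_input = torch.randn(8, 16, device="cuda")

# Warmup on a side stream before capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        opt.zero_grad(set_to_none=True)
        model(static_input).sum().backward()
        opt.step()
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
opt.zero_grad(set_to_none=True)
with torch.cuda.graph(g):
    static_loss = model(static_input).sum()
    static_loss.backward()
    opt.step()

g.replay()  # replays forward, backward, and the captured Adam step
```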
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77862
Approved by: https://github.com/ezyang
2022-06-13 01:56:47 +00:00
Rob Zinkov
2a496e2f80 Adding maximize to Adamax (#77409)
Added the maximize flag (#68052) to the Adamax optimizer and updated the respective tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77409
Approved by: https://github.com/albanD
2022-05-16 17:34:44 +00:00
James Reed
57b54dfec5 Fix Optimizer.zero_grad type annotation (#76998)
`Optimizer.zero_grad()` defines the `set_to_none` argument as `bool`, not `Optional[bool]`

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76998
Approved by: https://github.com/albanD
2022-05-11 00:05:26 +00:00
tomMoral
ff94c9dee4 DOC fix momentum equation for nesterov
Fix https://github.com/pytorch/pytorch/issues/72395

This is a small fix in the docs for an index in this equation:

![image](https://user-images.githubusercontent.com/3321081/166165461-140855b5-96b5-4417-85fc-2a170f95700a.png)

I think the index should not be `t-1` but `t`. This is consistent with [the implementation](https://github.com/pytorch/pytorch/blob/master/torch/optim/sgd.py#L236) and with what is done, for instance, in [Keras](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/SGD).
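For reference, my reading of the documented update (hedged; momentum is $\mu$, dampening is $\tau$) is that the Nesterov branch should use the freshly computed buffer:

```math
b_t \leftarrow \mu\, b_{t-1} + (1 - \tau)\, g_t, \qquad
g_t \leftarrow g_t + \mu\, b_t \quad \text{(previously written with } \mu\, b_{t-1}\text{)}
```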
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76639
Approved by: https://github.com/albanD
2022-05-04 20:40:21 +00:00
Emilio Castillo
e5ee6f5cf7 Fix CosineAnnealingLR on restart
Fixes #60265

The initial LR for this scheduler is not consistent when a new instance is created with `last_epoch != -1`

Maybe we can refactor the testing code to test `last_epoch != -1` in schedulers that can recreate their state from the current epoch?
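A hedged sketch of the restart scenario (the values here are illustrative, not from the PR):

```python
import torch

opt = torch.optim.SGD(torch.nn.Linear(2, 2).parameters(), lr=0.1)

sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=10)
for _ in range(5):
    opt.step()
    sched.step()
print(sched.get_last_lr())   # lr at epoch 5 from the original run

# Restart: a new instance pointed at the same epoch (param_groups already
# carry 'initial_lr'). With the fix, the lr it sets should match the value
# above instead of drifting.
resumed = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=10, last_epoch=4)
print(resumed.get_last_lr())
```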

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60339
Approved by: https://github.com/albanD
2022-04-20 13:35:01 +00:00
Rob Zinkov
6642e88ad2 Adding maximize flag to Adagrad
This adds the maximize flag (#68052) to Adagrad and updates the respective tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75968
Approved by: https://github.com/albanD
2022-04-20 08:29:03 +00:00
Jake Tae
3b18bc36f3 Docs: Add missing zero-ing step in Rprop algorithm
Fixes #70418.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75555
Approved by: https://github.com/albanD
2022-04-11 21:57:13 +00:00
francescocastelli
58a44523c1 Add maximize flag to Adadelta
Added the maximize flag to Adadelta optimizer (#68052) and adjusted tests to take maximize into account.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75330
Approved by: https://github.com/cpuhrsch
2022-04-08 20:32:35 +00:00
Mikayla Gawarecki
10bb0ffe69 Fix casting bug in state_step for optimizers when loading state dict
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75214

Approved by: https://github.com/albanD
2022-04-05 01:27:18 +00:00
Jan Zikes
715a0dc5c0 [PyTorch/d2go] fix optim _multi_tensor (#73215)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73215

Fixes an issue in the `_multi_tensor` optimizers, for `sgd_mt`, introduced in 2cb03e926f

Reviewed By: mikaylagawarecki

Differential Revision: D34389034

fbshipit-source-id: ede153d52dca15909c6c022853589707f18dc8d1
(cherry picked from commit cc8a58e584)
2022-02-23 10:29:48 +00:00
Sergii Dymchenko
313557a613 Add missing import (#72840)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/72840

Reviewed By: H-Huang

Differential Revision: D34242612

Pulled By: albanD

fbshipit-source-id: 3dd34de96dbf1ae8f3c3ea45888d211d95862c49
(cherry picked from commit d2650ffa75)
2022-02-15 19:43:54 +00:00
Mikayla Gawarecki
2a5aaf1c49 Optim foreach cleanup for AdamW (#70484)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70484

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D33767869

Pulled By: mikaylagawarecki

fbshipit-source-id: 2f5273bbfeea3ed502c5d77da4bebe1674243e86
(cherry picked from commit 2dd9b77917)
2022-02-15 18:02:08 +00:00
Mikayla Gawarecki
dff58d519f Optim foreach cleanup for Rprop (#70483)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70483

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D33767866

Pulled By: mikaylagawarecki

fbshipit-source-id: ffc5ae68eeea8fa09385862b853b731554b77bcb
(cherry picked from commit 3a0fe29580)
2022-02-15 18:02:08 +00:00
Mikayla Gawarecki
ce3094f5f6 Optim foreach cleanup for Rmsprop (#70482)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70482

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D33767862

Pulled By: mikaylagawarecki

fbshipit-source-id: 8e2e9c986d5a3774093a79755940372945f1b3a9
(cherry picked from commit baea537277)
2022-02-15 18:02:08 +00:00
Mikayla Gawarecki
2cb03e926f Optim foreach cleanup for SGD (#70481)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70481

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D33767868

Pulled By: mikaylagawarecki

fbshipit-source-id: 89b9227a4ddf99602855973cbc343c58ae3d5328
(cherry picked from commit ffea8ddcfd)
2022-02-15 18:02:08 +00:00
Mikayla Gawarecki
5f9590681d Optim foreach cleanup for Adam (#70295)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70295

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D33767870

Pulled By: mikaylagawarecki

fbshipit-source-id: f922f15ecb0307458c8ecee737325c42c4f3ce8b
(cherry picked from commit 66233a8a3e)
2022-02-15 18:02:08 +00:00
Mikayla Gawarecki
0972db5b7d Optim foreach cleanup for ASGD (#70231)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70231

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D33767867

Pulled By: mikaylagawarecki

fbshipit-source-id: 4406824acbb6f427d52c1ced2d8a02a98c943b86
(cherry picked from commit cbd9a4da15)
2022-02-09 16:52:13 +00:00
Mikayla Gawarecki
5948522e9c Optim foreach cleanup for RAdam (#70230)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70230

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D33767874

Pulled By: mikaylagawarecki

fbshipit-source-id: 9379db24266a7bbcc2c23849f87ae0af2e6729c0
(cherry picked from commit ecf7b31fc3)
2022-02-09 16:52:13 +00:00