Commit Graph

36 Commits

Author SHA1 Message Date
Jon Chuang
c99de9f37c fix(optim): adagrad sparse multitensor incorrect early exit (#110454)
Fixes https://github.com/pytorch/pytorch/issues/110444#issuecomment-1745181530

With this PR, `test_adagrad_sparse` passes (a minimal repro sketch follows the failure output below).

On main, the test fails:
```
test/optim/test_optim.py::TestOptim::test_adagrad_sparse FAILED [0.0058s]

============================================ FAILURES ============================================
____________________________________ TestOptim.test_adagrad_sparse ____________________________________
Traceback (most recent call last):
  File "/home/jonch/Desktop/Programming/mlsys/pytorch/test/optim/test_optim.py", line 1448, in test_adagrad_sparse
    self._test_rosenbrock_sparse(
  File "/home/jonch/Desktop/Programming/mlsys/pytorch/test/optim/test_optim.py", line 128, in _test_rosenbrock_sparse
    self.assertEqual(params, params_c, atol=1e-6, rtol=1e-6)
  File "/home/jonch/Desktop/Programming/mlsys/pytorch/torch/testing/_internal/common_utils.py", line 3309, in assertEqual
    raise error_metas.pop()[0].to_error(
AssertionError: Tensor-likes are not close!

Mismatched elements: 1 / 2 (50.0%)
Greatest absolute difference: 0.09999999999993325 at index (1,) (up to 1e-06 allowed)
Greatest relative difference: 0.06249999999996089 at index (1,) (up to 1e-06 allowed)

```
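
The minimal parity sketch mentioned above (an illustration, not the repo's `_test_rosenbrock_sparse` test): the single-tensor and foreach Adagrad paths should agree on a parameter that receives a sparse gradient.

```python
import torch

# Illustrative parity check: single-tensor vs foreach Adagrad with a sparse grad.
p_single = torch.zeros(4, requires_grad=True)
p_multi = p_single.detach().clone().requires_grad_(True)

opt_single = torch.optim.Adagrad([p_single], lr=0.1, foreach=False)
opt_multi = torch.optim.Adagrad([p_multi], lr=0.1, foreach=True)

# Sparse gradient touching only index 1 of the 1-D parameter.
sparse_grad = torch.sparse_coo_tensor(torch.tensor([[1]]), torch.tensor([1.0]), (4,))

for _ in range(3):
    p_single.grad = sparse_grad
    p_multi.grad = sparse_grad
    opt_single.step()
    opt_multi.step()

torch.testing.assert_close(p_single, p_multi, atol=1e-6, rtol=1e-6)
```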

CC: @janeyx99
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110454
Approved by: https://github.com/janeyx99
2023-10-05 20:37:57 +00:00
Jane Xu
9f40ffeec6 [optim] disable large_tensor tests for ROCm (#110559)
Closes #105825, #105820, #105754 by replacing the disables with an in-code skip.

Fixes #105825, fixes #105820, fixes #105754
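
For illustration, an in-code skip of this kind might look like the following sketch (hypothetical test name; `TEST_WITH_ROCM` is the existing flag in `torch.testing._internal.common_utils`):

```python
import unittest
from torch.testing._internal.common_utils import TEST_WITH_ROCM

class TestOptimLarge(unittest.TestCase):
    # Skip only on ROCm; the test still runs on other backends.
    @unittest.skipIf(TEST_WITH_ROCM, "large-tensor optim test is flaky on ROCm")
    def test_large_tensor(self):
        ...
```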

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110559
Approved by: https://github.com/albanD
2023-10-05 01:21:21 +00:00
Michael Lazos
b193f295b6 Add capturable ASGD impl (#107857)
Add capturable ASGD impl + test
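
A hedged usage sketch of the new flag (assumes a CUDA device; not taken from the PR's tests):

```python
import torch

# capturable=True keeps the ASGD state on the params' device so the step can
# later run inside a CUDA graph.
params = [torch.randn(8, device="cuda", requires_grad=True)]
opt = torch.optim.ASGD(params, lr=1e-2, capturable=True)

params[0].grad = torch.randn_like(params[0])
opt.step()  # warm-up step before any torch.cuda.graph capture
```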

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107857
Approved by: https://github.com/janeyx99
2023-09-07 06:30:30 +00:00
bilzard
18a58f0bd6 Implement "RAdamW" optimizer (#107507)
Fixes #107282

## Overview

- The basic design decisions follow those made in #103881 (tensor operations, test cases, order and position of arguments, etc.).
- For the decoupled weight decay algorithm, I referred to [1, 2].

## backwards-incompatible changes

- positional argument `decoupled_weight_decay` is added to:
    -  `torch.optim.radam`

Existing code that refers to this API may be affected.

Note: the positional argument `decoupled_weight_decay` is added to `torch.optim.RAdam` as well; however, since it was added in the last position with a default value, existing code is not affected (usage sketch below).
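
A hedged usage sketch of the new flag (model and hyperparameters are illustrative):

```python
import torch

# decoupled_weight_decay=True selects the RAdamW (decoupled weight decay)
# variant; the default False keeps plain RAdam.
model = torch.nn.Linear(4, 2)
opt = torch.optim.RAdam(
    model.parameters(),
    lr=1e-3,
    weight_decay=1e-2,
    decoupled_weight_decay=True,
)
```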

## Reference

- [1] [Decoupled Weight Decay Regularization](https://arxiv.org/abs/1711.05101)
- [2] https://github.com/LiyuanLucasLiu/RAdam/blob/master/radam/radam.py#L5-L94

## TODO

- [x] implement tensor operation
- [x] implement test cases
- [x] modify doc-string
- [x] pass unit test code locally `python test/test_optim.py -k test_radam`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107507
Approved by: https://github.com/janeyx99
2023-08-28 20:50:25 +00:00
PyTorch MergeBot
3a3cf0e09d Revert "[optim] Make casting to match params a hook (#106725)"
This reverts commit 9f86d85172.

Reverted https://github.com/pytorch/pytorch/pull/106725 on behalf of https://github.com/janeyx99 due to We acknowledge this is a huge risk because people do not remember to call super().__init__ from their Optimizer subclasses and so this will break lots of load_state_dict behavior ([comment](https://github.com/pytorch/pytorch/pull/106725#issuecomment-1693386137))
2023-08-25 13:47:19 +00:00
Jane Xu
9f86d85172 [optim] Make casting to match params a hook (#106725)
Moves the logic that casts state to match parameters into a hook, so that users can choose to run their own hooks before or after the casting has happened.

With this, there is a bit of redundancy in the id_map building and in the check that the param groups are still aligned in length.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106725
Approved by: https://github.com/albanD
2023-08-23 22:25:33 +00:00
Jane Xu
1641d671e5 [optim] FusedAdam/W accepts lr: Tensor without h2ds (#106916)
Starts addressing #106802

This PR also conveniently does some BE:
- Fixes a bug in adamw where we used a single `amsgrad` value instead of each param group's `amsgrad`
- Brings the impls of adamw and adam closer to correctness and to each other

I couldn't fully remove the .pyi files because mypy would then complain about the entire files, which is beyond the scope of this PR anyway.

Test plan:
- Add tests to ensure that lr could be passed as a Tensor
- Did some profiling of the below code (runs 1k iterations of step for Adam)

```
import torch

# One parameter with a pre-set gradient, stepped 1000 times under the profiler.
param = torch.rand(2, 3, dtype=torch.float, device='cuda:0', requires_grad=True)
param.grad = torch.rand_like(param)

# lr is a 0-dim CUDA tensor handed to the fused Adam implementation.
lr = torch.tensor(.001, device='cuda:0')
opt = torch.optim.Adam([param], lr=lr, fused=True)

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ]
) as p:
    for _ in range(1000):
        opt.step()

print(p.key_averages().table(sort_by="cpu_time_total"))

```

Before my change:
![image](https://github.com/pytorch/pytorch/assets/31798555/cfc5175a-0f41-4829-941f-342554f3b152)

After my change (notice there are no d2h syncs and the CPU time is lower!):
![image](https://github.com/pytorch/pytorch/assets/31798555/726d7e66-dcff-4a4f-8a75-e84329961989)

Next steps long term:
- have all capturable foreach + forloop impls in Adam(W) handle tensor LR
- have all capturable impls handle tensor LR
- have all impls handle tensor LR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106916
Approved by: https://github.com/albanD
2023-08-21 23:00:44 +00:00
Jane Xu
c0f80c6696 [forward-fix] Fix multigpu varying tensor optim tests (#106887)
Forward-fixes https://github.com/pytorch/pytorch/pull/106615 by increasing the tolerance in the test.

The capturable implementation for foreach simply varies due to a different order of operations when updating params. I had also attempted to compare against fp64 but that introduced more disparity in the other optimizer configs. It is worth trying the fp64 comparison at a later point, but let's get the test passing first.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106887
Approved by: https://github.com/izaitsevfb
2023-08-10 16:35:38 +00:00
Jane Xu
0208574db9 [NAdam] Add capturable API and tests + fix differentiable (#106615)
This PR:
- adds a capturable API for NAdam similar to Adam(W) (usage sketch below)
- adds tests accordingly
- fixes bugs discovered in the differentiable implementation (now exercised through the capturable codepath)
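
A hedged usage sketch of the new flag (assumes a CUDA device; illustrative, not from the PR's tests):

```python
import torch

# capturable=True keeps NAdam's step/state on the params' device, mirroring Adam(W).
params = [torch.randn(8, device="cuda", requires_grad=True)]
opt = torch.optim.NAdam(params, lr=1e-3, capturable=True)

params[0].grad = torch.randn_like(params[0])
opt.step()
```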

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106615
Approved by: https://github.com/albanD
2023-08-07 19:49:11 +00:00
Jane Xu
59d0dea90f Only make a shallow copy when loading optimizer state_dict (#106082)
The one thing we still deep-copy is the param_groups, which is much lighter weight. This should also save memory when loading from a checkpoint.

The deepcopy was introduced in ecfcf39f30, but module.py had only a shallow copy at that point so it did not actually bring parity.

Incorporates an XLA fix, which is why I'm updating the pin to ca5eab87a7.
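
An illustrative sketch of the affected path (not from the PR): loading optimizer state from a checkpoint, where the loaded `state` entries are no longer deep-copied:

```python
import torch

# Build some optimizer state, checkpoint it, and load it back.
model = torch.nn.Linear(4, 4)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
model(torch.randn(2, 4)).sum().backward()
opt.step()

torch.save({"optim": opt.state_dict()}, "ckpt.pt")
ckpt = torch.load("ckpt.pt")
opt.load_state_dict(ckpt["optim"])  # state is shallow-copied; param_groups still copied
```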

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106082
Approved by: https://github.com/albanD, https://github.com/Skylion007
2023-08-01 05:33:31 +00:00
Jane Xu
23f47f746b [optim][rprop] Minimize intermediates=1 for foreach to save memory (#105193)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105193
Approved by: https://github.com/albanD
2023-07-31 20:59:26 +00:00
Jane Xu
dffa4e14b9 Add Optimizer state_dict hooks (#105953)
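
A hedged sketch of the hook API this adds, using the registration methods on `torch.optim.Optimizer` (hook bodies are illustrative):

```python
import torch

model = torch.nn.Linear(2, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

def post_state_dict(optimizer, state_dict):
    # May modify the state_dict in place or return a new one.
    print("saving", len(state_dict["param_groups"]), "param group(s)")

def pre_load_state_dict(optimizer, state_dict):
    print("loading", len(state_dict["param_groups"]), "param group(s)")

opt.register_state_dict_post_hook(post_state_dict)
opt.register_load_state_dict_pre_hook(pre_load_state_dict)

opt.load_state_dict(opt.state_dict())
```
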
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105953
Approved by: https://github.com/albanD
2023-07-28 11:52:41 +00:00
janEbert
b0708654c0 Implement NAdamW optimizer (#103881)
NAdamW, which is simply NAdam with the AdamW weight decay term, has shown strong performance in optimizer comparisons such as:
1. https://arxiv.org/abs/2211.09760
2. https://arxiv.org/abs/2306.07179

[The VeLO paper](https://arxiv.org/abs/2211.09760) argues its power lies in its ability to act as a superset of other popular optimizers.

This PR adds NAdamW by ~~copying and making very small adaptations to the NAdam implementation (just like AdamW and Adam). To see the small changes in better detail, you can `diff torch/optim/nadam.py torch/optim/nadamw.py`.~~ adding a boolean flag `decoupled_weight_decay` to NAdam (default `False`) that activates NAdamW behavior.
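
A hedged usage sketch (model and hyperparameters are illustrative):

```python
import torch

# With decoupled_weight_decay=True, NAdam applies AdamW-style (decoupled)
# weight decay, i.e. NAdamW; the default False keeps plain NAdam.
model = torch.nn.Linear(4, 2)
opt = torch.optim.NAdam(
    model.parameters(),
    lr=2e-3,
    weight_decay=1e-2,
    decoupled_weight_decay=True,
)
```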

Interest in the optimizer has also been shown in the PyTorch forums:
https://discuss.pytorch.org/t/nadamw-and-demon-optimizers/179778

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103881
Approved by: https://github.com/janeyx99
2023-07-24 19:29:26 +00:00
Jane Xu
1959802548 [AdamW] Fix complex x amsgrad support (#104990)
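
A hedged sketch of the case being fixed (illustrative shapes and loss): a complex parameter optimized with `amsgrad=True`.

```python
import torch

# Optimizers treat complex params via their real/imag views; amsgrad must do the same.
param = torch.randn(4, dtype=torch.complex64, requires_grad=True)
opt = torch.optim.AdamW([param], lr=1e-3, amsgrad=True)

param.abs().pow(2).sum().backward()  # real-valued loss over a complex parameter
opt.step()
```
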
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104990
Approved by: https://github.com/albanD
2023-07-21 23:43:26 +00:00
Jane Xu
e1296a7f8d [Adam] Fix complex x amsgrad support (#104989)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104989
Approved by: https://github.com/albanD
2023-07-21 23:43:26 +00:00
Jane Xu
e855348cdf [foreach][SGD] minimize intermediates=1 to decrease peak memory (#105599)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105599
Approved by: https://github.com/albanD
2023-07-20 17:06:52 +00:00
Justin Chu
3721fa5612 [BE] Enable ruff's UP rules and autoformat optim/ (#105426)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105426
Approved by: https://github.com/malfet, https://github.com/albanD, https://github.com/aaronenyeshi, https://github.com/janeyx99
2023-07-18 21:07:43 +00:00
Jane Xu
cd15229950 [foreach][RMSprop] Minimize intermediates=2 to decrease peak memory (#105161)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105161
Approved by: https://github.com/albanD
2023-07-13 23:18:54 +00:00
Jane Xu
219cf2a1c8 [foreach][ASGD] Minimize intermediates=1 to decrease peak memory (#105146)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105146
Approved by: https://github.com/Skylion007, https://github.com/albanD
2023-07-13 23:18:54 +00:00
Jane Xu
0bc382ea55 [foreach][Adamax] Minimize intermediates=1 to decrease peak memory (#104991)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104991
Approved by: https://github.com/albanD, https://github.com/Skylion007
2023-07-12 03:09:17 +00:00
Jane Xu
ea6a563a8c [foreach][Adagrad] Minimize intermediates=2 to decrease peak memory (#104988)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104988
Approved by: https://github.com/albanD, https://github.com/Skylion007
2023-07-12 03:09:17 +00:00
Jane Xu
455f495f04 [foreach][Adadelta] Minimize intermediates=3 to decrease peak memory (#104983)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104983
Approved by: https://github.com/albanD, https://github.com/Skylion007
2023-07-12 03:09:15 +00:00
Jane Xu
15aa401baa [foreach][NAdam] Minimize use of intermediates to decrease peak memory (#104910)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104910
Approved by: https://github.com/Skylion007, https://github.com/albanD
2023-07-11 17:08:07 +00:00
Jane Xu
6878d3a157 [foreach][RAdam] Minimize use of intermediates to decrease peak memory (#104904)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104904
Approved by: https://github.com/albanD, https://github.com/Skylion007
2023-07-11 17:08:07 +00:00
Jane Xu
7e9c891056 [foreach][AdamW] Minimize intermediates to save peak memory (#104898)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104898
Approved by: https://github.com/albanD
2023-07-10 23:40:52 +00:00
Jane Xu
35f0e35529 [foreach][Adam] Minimize use of intermediates to decrease peak memory (#104780)
Starts addressing https://github.com/pytorch/pytorch/issues/97712 by:
- minimizing intermediates usage for foreach Adam
- documenting the extra memory usage
- adding comments within the code for clarity, now that we reuse intermediates
- adding tests
- doing some refactoring

Next steps involve doing this for all other foreach implementations. Note that even after this change, foreach memory usage will be higher than forloop, because we keep a minimum budget of one intermediate (to avoid clobbering the input values) and the intermediate will be larger. For capturable, memory usage is higher still due to moving more tensors to CUDA.
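
As a rough way to see the gap this targets, one might compare peak memory of a single step under the forloop and foreach paths (a hedged measurement sketch assuming a CUDA device; `peak_step_mem` is a hypothetical helper):

```python
import torch

def peak_step_mem(foreach: bool) -> int:
    torch.cuda.empty_cache()
    params = [torch.randn(1024, 1024, device="cuda", requires_grad=True) for _ in range(8)]
    for p in params:
        p.grad = torch.randn_like(p)
    opt = torch.optim.Adam(params, lr=1e-3, foreach=foreach)
    torch.cuda.reset_peak_memory_stats()
    opt.step()  # peak includes the step's intermediates
    return torch.cuda.max_memory_allocated()

print("forloop:", peak_step_mem(False), "foreach:", peak_step_mem(True))
```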

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104780
Approved by: https://github.com/albanD
2023-07-10 17:38:46 +00:00
Jane Xu
038cb4075a Add capturable/maximize tests to Adam(W) optim configs (#104669)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104669
Approved by: https://github.com/albanD
2023-07-10 17:38:46 +00:00
cyy
54cb61f7d9 enable ASAN on some tests (#103647)
This enables more tests under ASAN; meanwhile, we disable float-divide-by-zero and float-cast-overflow, both of which are also disabled by default in recent Clang.
The following cited doc explains the reasons.
```
-fsanitize=float-cast-overflow: Conversion to, from, or between floating-point
types which would overflow the destination. Because the range of representable
values for all floating-point types supported by Clang is [-inf, +inf], the only
cases detected are conversions from floating point to integer types.

-fsanitize=float-divide-by-zero: Floating point division by zero. This is
undefined per the C and C++ standards, but is defined by Clang (and by
ISO/IEC/IEEE 60559 / IEEE 754) as producing either an infinity or NaN value, so
is not included in -fsanitize=undefined.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103647
Approved by: https://github.com/kit1980
2023-06-28 02:17:14 +00:00
Jane Xu
fa893f3f58 Fix optim state_dict casting to allow step to cast to CPU (#102619)
I'm guessing this should fix https://github.com/pytorch/pytorch/pull/88015#issuecomment-1569523106, but I am waiting on @ychfan to supply more details so I can write a good test case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102619
Approved by: https://github.com/albanD
2023-06-13 00:46:40 +00:00
Jane Xu
4a5d56b74c Disable dynamo'd test_optim entirely (#103323)
See issue https://github.com/pytorch/pytorch/issues/103322.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103323
Approved by: https://github.com/DanilBaibak, https://github.com/atalman, https://github.com/malfet
2023-06-09 16:06:36 +00:00
Michael Lazos
0769a50a5f Disable dynamo on some opt methods and differentiable optimizer tests (#103066)
- Disables dynamo on the differentiable optimizer tests
- Disables dynamo on some test methods which expose a very rare dynamo edge case
- Disables dynamo on export/save optimizer state methods because it shouldn't trace those anyway.

I have a draft PR to fix the two tests marked skip due to unsupported mutation of step.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103066
Approved by: https://github.com/janeyx99, https://github.com/malfet
2023-06-07 03:50:42 +00:00
Catherine Lee
08c4a442fd Dont run test files that are already run in test_optim (#103017)
They were getting run twice by accident.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103017
Approved by: https://github.com/janeyx99
2023-06-06 17:31:21 +00:00
Michael Lazos
c75e064dd6 Disallow _foreach_utils.py, but allow it to be inlined (#102221)
The functions in `_foreach_utils.py` should not be on Dynamo's allowed list, but they should be inlineable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102221
Approved by: https://github.com/anijain2305
2023-06-02 05:14:09 +00:00
Masaki Kozuki
401109a243 Use int64_t for indexing in multi_tensor_apply (#101760)
Fixes #101449

In the future, it may be better to either imitate the combination of `TensorIterator::can_use_32bit_indexing` and `TensorIterator::with_32bit_indexing`, or to choose the index type based on `Tensor::numel`.

---

Used `nsys nvprof` to get a rough sense of the effect of `int64_t` indexing:

```python
import torch

# Two param groups of 100 small tensors each, to exercise multi_tensor_apply's
# chunking and bookkeeping.
params = [
    {"params": [torch.randn(32, 32, device="cuda") for _ in range(100)]},
    {"params": [torch.randn(32, 32, device="cuda") for _ in range(100)]},
]
grads = [
    [torch.randn(32, 32, device="cuda") for _ in range(100)],
    [torch.randn(32, 32, device="cuda") for _ in range(100)],
]
optimizer = torch.optim.Adam(params, fused=True)

for _ in range(100):
    # Assign grads for one group at a time, then step the fused optimizer.
    for i, param_group in enumerate(params):
        for p, g in zip(param_group["params"], grads[i]):
            p.grad = g
        optimizer.step()
        optimizer.zero_grad()
```

Environment
```
Collecting environment information...
PyTorch version: 2.1.0a0+gitf994d0b
Is debug build: False
CUDA used to build PyTorch: 12.1

Python version: 3.10.9 (main, May 17 2023, 00:46:40) [GCC 11.3.0] (64-bit runtime)
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-80GB
```

---

- `multi_tensor_apply_kernel<at::native::<unnamed>::FusedOptimizerTensor` -> 1.02x runtime relative to main
- `multi_tensor_apply_kernel<at::native::<unnamed>::TensorListMetadata<(in…` -> 1.04x runtime relative to main

Current main branch:

```
 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)                                                  Name
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     64.9          5787610        600    9646.0    9632.0      9503      9888         52.9  void at::native::<unnamed>::multi_tensor_apply_kernel<at::native::<unnamed>::FusedOptimizerTensorLi…
...
      8.1           720575        200    3602.9    3584.0      3551      4320         63.4  void at::native::<unnamed>::multi_tensor_apply_kernel<at::native::<unnamed>::TensorListMetadata<(in…
```

this PR:

```
 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)                                                  Name
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     65.0          5876847        600    9794.7    9792.0      9632     10080         58.1  void at::native::<unnamed>::multi_tensor_apply_kernel<at::native::<unnamed>::FusedOptimizerTensorLi…
...
      8.3           748313        200    3741.6    3744.0      3711      4479         60.0  void at::native::<unnamed>::multi_tensor_apply_kernel<at::native::<unnamed>::TensorListMetadata<(in…
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101760
Approved by: https://github.com/ngimel
2023-06-01 20:55:09 +00:00
Masaki Kozuki
2fcc2002fa Handle tail 0-size tensor appropriately in MultiTensorApply (#100811)
Fixes #100701

It seems like we don't call `multi_tensor_apply_kernel` at all if the input tensor lists are small and their last tensors are zero-size as per e.g. ca9f55f79d/aten/src/ATen/native/cuda/MultiTensorApply.cuh (L100-L102)
which was introduced in 05943712a4.

This PR special-cases trailing zero-size tensors so that the preceding tensors are not silently skipped.
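
A hedged repro sketch of the case being fixed (uses the private `torch._foreach_add_` op for brevity; assumes a CUDA device):

```python
import torch

# A list whose last tensor is zero-size should still update the earlier tensors.
xs = [torch.ones(3, device="cuda"), torch.empty(0, device="cuda")]
torch._foreach_add_(xs, 1.0)
print(xs[0])  # expected tensor([2., 2., 2.], device='cuda:0')
```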

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100811
Approved by: https://github.com/ngimel
2023-05-12 20:26:45 +00:00
Jane Xu
a53cda1ddc [optim][BE] split test file into logical parts: SWA, LR, optim (#101100)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101100
Approved by: https://github.com/albanD
2023-05-12 16:41:44 +00:00