Today, our param_group testing amounts to giving weight and bias different optimizer hyperparams and then checking that the overall result moves in the right direction based on `maximize`.
This PR introduces two tests to round out coverage:
1. For every optimizer input (excluding differentiable), always force bias to have 0 weight_decay, and then check that the direction is expected. This is basically a replica of today's tests, but it is more methodical, as the test reflects a real use case.
2. To ensure that the different groups have distinct behavior, I added another test where lr is essentially 0 in the default group, and ensure that the param in the default group doesn't move while the loss does (a sketch follows below).
Together, these tests do a better job of testing param groups than today's tests, **though we do lose some flavors**. For example, RMSprop also pits centered=True vs False across the param_groups, Adadelta has a variation on rho, and ASGD has a variation on t0. I don't think this is really a loss, as the previous tests only checked direction, whereas the new tests check stronger guarantees.
The leftover param group configs are used in conjunction with LRSchedulers.
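A minimal sketch of the second test (hedged: the names and the choice of SGD are illustrative, not the test as written):
```
import torch

# The default group's lr is effectively 0, so its param must not move even
# though the param in the other group still gets real updates.
weight = torch.randn(5, 5, requires_grad=True)
bias = torch.randn(5, requires_grad=True)
optim = torch.optim.SGD(
    [{"params": [weight]}, {"params": [bias], "lr": 1e-2}],
    lr=1e-30,  # default group: effectively frozen
)
old_weight = weight.detach().clone()
loss = weight.mv(bias).pow(2).sum()
loss.backward()
optim.step()
assert torch.equal(weight, old_weight)  # default-group param did not move
```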
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117675
Approved by: https://github.com/albanD
Updates flake8 to v6.1.0 and fixes a few lints using sed and some ruff tooling.
- Replace `assert(0)` with `raise AssertionError()`
- Remove extraneous parentheses, e.g.:
- `assert(a == b)` -> `assert a == b`
- `if(x > y or y < z):`->`if x > y or y < z:`
- And `return('...')` -> `return '...'`
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116591
Approved by: https://github.com/albanD, https://github.com/malfet
Removes part of the SparseAdam test and the following three tests: `test_fused_optimizer_raises`, `test_duplicate_params_across_param_groups`, `test_duplicate_params_in_one_param_group`
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (d2d129de)]$ python test/test_optim.py -k test_fused_optimizer_raises -k test_duplicate_params_across_param_groups -k test_duplicate_params_in_one_param_group
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
...
----------------------------------------------------------------------
Ran 3 tests in 0.023s
OK
```
Increases coverage by running the duplicate-param tests on ALL the optims instead of just one each. Also fixes a SparseAdam bug where calling `list(...)` on a raw Tensor was accidentally unbinding it (torch.unbind semantics) instead of putting the param in a list. This bug was caught by migrating the scattered warning handling to one warning context manager, which checks that nothing else gets raised.
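For reference, the `list(...)` pitfall looks like this (a standalone illustration, not the test code):
```
import torch

# Calling list() on a Tensor iterates over dim 0 (same result as
# torch.unbind), rather than producing a one-element list of the param.
param = torch.randn(2, 3, requires_grad=True)
rows = list(param)   # two tensors of shape (3,)
wrapped = [param]    # the intended one-element container
assert len(rows) == 2 and rows[0].shape == (3,)
assert len(wrapped) == 1
```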
The new test_errors does not run slower than before; overhead is still king:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (d2d129de)]$ python test/test_optim.py -k test_errors
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
..........................
----------------------------------------------------------------------
Ran 26 tests in 10.337s
OK
```
Compared to test_errors BEFORE my commit :p
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (b47aa696)]$ python test/test_optim.py -k test_errors
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
.............sssssssssssss
----------------------------------------------------------------------
Ran 26 tests in 11.980s
OK (skipped=13)
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (b47aa696)]$
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116315
Approved by: https://github.com/mikaylagawarecki
As stated. I do notice there is perhaps an opportunity to abstract, but the tests as written are super understandable, and more abstraction might not be desirable.
This PR _increases coverage_. The original tests each tested 12 default configs (they left out Rprop). Now the tests cover ~80 configs, with foreach + fused on top of that! Test time increases over 10-fold, but this test is tiny, so we are not worried:
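For reference, a hedged sketch of the property under test (SGD shown for concreteness; the real test sweeps every optim plus the foreach/fused variants):
```
import torch

# Stepping an optimizer whose params have no .grad should be a no-op.
param = torch.randn(2, 3, requires_grad=True)  # param.grad is None
opt = torch.optim.SGD([param], lr=0.1)
before = param.detach().clone()
opt.step()  # no grads anywhere, so nothing should change
assert torch.equal(param, before)
```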
Old:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (5ca9672c)]$ python test/test_optim.py -k test_step_is_noop_when_params_have_no_grad
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
.
----------------------------------------------------------------------
Ran 1 test in 0.028s
OK
```
New (includes the old test):
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (5ca9672c)]$ python test/test_optim.py -k test_step_is_noop_when_params_have_no_grad
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
...........................
----------------------------------------------------------------------
Ran 27 tests in 0.456s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115299
Approved by: https://github.com/albanD
ghstack dependencies: #114802, #115023, #115025
Removing 4 tests:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (7539011b)]$ python test/test_optim.py -v -k test_fused_optimizers_with_large_tensors -k test_fused_optimizers_with_varying_tensors -k test_multi_tensor_optimizers_with_large_tensors -k test_multi_tensor_optimizers_with_varying_tensors
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
test_fused_optimizers_with_large_tensors (optim.test_optim.TestOptim) ... ok
test_fused_optimizers_with_varying_tensors (optim.test_optim.TestOptim) ... ok
test_multi_tensor_optimizers_with_large_tensors (optim.test_optim.TestOptim) ... ok
test_multi_tensor_optimizers_with_varying_tensors (optim.test_optim.TestOptim) ... ok
----------------------------------------------------------------------
Ran 4 tests in 22.731s
OK
```
And the more granular replacements for the same 4:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (7539011b)]$ python test/test_optim.py -v -k test_fused_large_tensor -k test_fused_mixed_device_dtype -k test_foreach_large_tensor -k test_foreach_mixed_device_dtype
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
test_foreach_large_tensor_ASGD_cpu_float16 (__main__.TestOptimRenewedCPU) ... skipped 'Only runs on cuda'
....
test_fused_mixed_device_dtype_Adam_cpu_float32 (__main__.TestOptimRenewedCPU) ... skipped 'Only runs on cuda'
test_foreach_large_tensor_ASGD_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_Adadelta_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_Adagrad_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_AdamW_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_Adam_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_NAdam_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_RAdam_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_RMSprop_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_Rprop_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_SGD_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_ASGD_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_Adadelta_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_Adagrad_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_AdamW_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_Adam_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_Adamax_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_NAdam_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_RAdam_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_RMSprop_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_Rprop_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_SGD_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_fused_large_tensor_AdamW_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_fused_large_tensor_Adam_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_fused_mixed_device_dtype_AdamW_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_fused_mixed_device_dtype_Adam_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
----------------------------------------------------------------------
Ran 50 tests in 50.785s
OK (skipped=25)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115025
Approved by: https://github.com/albanD
ghstack dependencies: #114802, #115023
Replaces the following:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (1bbf1c6f)]$ python test/test_optim.py -k test_peak_mem_multi_tensor_optimizers
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
.
----------------------------------------------------------------------
Ran 1 test in 38.599s
OK
```
with 11 tests (one for each foreach optim :))
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (1bbf1c6f)]$ python test/test_optim.py -k TestOptimRenewedCUDA.test_foreach_memory
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
...........
----------------------------------------------------------------------
Ran 11 tests in 39.293s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115023
Approved by: https://github.com/albanD
ghstack dependencies: #114802
New tests look like:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (af8fca04)]$ python test/test_optim.py -v -k TestOptimRenewedCUDA.test_fused
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
test_fused_AdamW_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_fused_Adam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
----------------------------------------------------------------------
Ran 2 tests in 34.591s
OK
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (af8fca04)]$ python test/test_optim.py -v -k test_set_default_dtype_works_with_foreach
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
test_set_default_dtype_works_with_foreach_ASGD_cpu_float64 (__main__.TestOptimRenewedCPU) ... skipped 'Only runs on cuda'
...
test_set_default_dtype_works_with_foreach_ASGD_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_Adadelta_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_Adagrad_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_AdamW_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_Adam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_Adamax_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_NAdam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_RAdam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_RMSprop_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_Rprop_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_SGD_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
----------------------------------------------------------------------
Ran 22 tests in 32.915s
OK (skipped=11)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114802
Approved by: https://github.com/albanD
This PR aims for parity+ compared to the old testing for the simplest foreach test case.
Test coverage increase: we now test foreach optimizers on CPU as well as on GPU.
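For context, opting into the foreach implementation is just a constructor flag; a minimal, hedged example:
```
import torch

# foreach=True selects the multi-tensor (horizontally fused) implementation,
# which this PR now exercises on CPU in addition to CUDA.
params = [torch.randn(2, 3, requires_grad=True) for _ in range(4)]
for p in params:
    p.grad = torch.rand_like(p)
opt = torch.optim.Adam(params, lr=1e-3, foreach=True)
opt.step()
```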
Before:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (19136605)]$ python test/test_optim.py -v -k test_multi_tensor_optimizers
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
test_multi_tensor_optimizers (optim.test_optim.TestOptim) ... ok
----------------------------------------------------------------------
Ran 1 test in 7.253s
OK
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (19136605)]$
```
Now, we get granular test cases at the cost of overhead!
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (19136605)]$ python test/test_optim.py -v -k test_foreach
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
test_foreach_ASGD_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_Adadelta_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_Adagrad_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_AdamW_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_Adam_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_Adamax_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_NAdam_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_RAdam_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_RMSprop_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_Rprop_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_SGD_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_ASGD_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_Adadelta_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_Adagrad_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_AdamW_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_Adam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_Adamax_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_NAdam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_RAdam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_RMSprop_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_Rprop_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_SGD_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
----------------------------------------------------------------------
Ran 22 tests in 30.954s
OK
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (19136605)]$
```
Why the increase in time?
Two reasons:
1. Overhead. Any _CUDA_ *Info test (OpInfo, ModuleInfo, OptimizerInfo) will wrap itself with the `CudaNonDefaultStream` policy, and `CudaNonDefaultStream.__enter__`, when called for the first time, will go through all visible CUDA devices and synchronize each of them, thus forcing the CUDA context to be initialized on each. Doing this for all 8 devices takes ~10-15s. Test parametrization costs a little overhead too, but not at the level of initializing CUDA contexts.
2. We test more! We now have 72 configs (in the foreach optimizer world), whereas we only had 59 before.
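As a rough illustration of reason 1 (this is not the actual `CudaNonDefaultStream` internals, just the expensive part):
```
import torch

# Synchronizing each visible device forces its CUDA context to be lazily
# initialized on first touch, which is where the ~10-15s goes on 8 devices.
if torch.cuda.is_available():
    for d in range(torch.cuda.device_count()):
        torch.cuda.synchronize(d)
```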
Next steps:
- consider adding more Tensor LR configs (like a Tensor LR without capturable in the single tensor case)
- this is likely the next PR or 2: migrate all uses of _test_derived_optimizers in test_optim to TestOptimRenewed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114797
Approved by: https://github.com/albanD
Introduce OptimizerInfos + use them to refactor out the error testing.
Why OptimizerInfos?
- cleaner, easier way to test all configs of optimizers
- would plug in well with the device-type test framework to auto-enable tests for devices like MPS and meta
- would allow for more granular testing. currently, lots of functionality is tested in `_test_basic_cases` and some of that should be broken down more.
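To make that concrete, here is a hedged sketch of a parametrized test consuming an OptimizerInfo (names such as `optim_db`, `optims`, and `optim_inputs_func` reflect torch.testing._internal.common_optimizers as it exists today; treat the details as illustrative rather than the exact API from this PR):
```
import torch
from torch.testing._internal.common_device_type import instantiate_device_type_tests
from torch.testing._internal.common_optimizers import optim_db, optims
from torch.testing._internal.common_utils import TestCase, run_tests

class TestOptimRenewed(TestCase):
    # One decorated body runs once per optimizer in optim_db, per device/dtype.
    @optims(optim_db, dtypes=[torch.float32])
    def test_constructor_smoke(self, device, dtype, optim_info):
        # optim_info supplies the class plus representative kwargs per config.
        for optim_input in optim_info.optim_inputs_func(device=device):
            p = torch.randn(2, 3, device=device, dtype=dtype, requires_grad=True)
            optim_info.optim_cls([p], **optim_input.kwargs)

instantiate_device_type_tests(TestOptimRenewed, globals())

if __name__ == "__main__":
    run_tests()
```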
What did I do for error testing?
- I moved out some error cases from `_test_basic_cases` into a new test_errors parametrized test.
- The new test has to live in TestOptimRenewed (bikeshedding welcome) because the parametrized tests need to take in device and dtype and hook into the device-type test machinery correctly, and not all tests in TestOptim do that.
- TestOptimRenewed is also migrating to the top-level test/test_optim.py now, because importing TestOptimRenewed does not work (through test instantiation, TestOptimRenewed gets replaced with a per-device class, such as TestOptimRenewedCPU and TestOptimRenewedCUDA, for each available device).
Is there any change in test coverage?
- INCREASE: The error case where a single Parameter (vs a container of them) is passed in has now been expanded to all optims instead of only LBFGS (see the sketch below).
- DECREASE: Not much. The only thing is that we no longer test two of the error cases for both foreach=True AND foreach=False, which I think was redundant. (Highlighted in comments)
Possible but not urgent next step: test ALL possible error cases by going through all the constructors.
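Concretely, the expanded INCREASE case checks behavior like the following (the TypeError comes from the Optimizer base class):
```
import torch

param = torch.randn(2, 3, requires_grad=True)
try:
    torch.optim.Adam(param)  # a bare Tensor/Parameter, not an iterable
except TypeError as e:
    # "params argument given to the optimizer should be an iterable ..."
    print(e)
```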
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114178
Approved by: https://github.com/albanD
Context:
https://github.com/pytorch/pytorch/pull/47724 fixed the problem that SparseAdam could not handle generators by using the `list(...)` construct. However, this meant that SparseAdam deviated from other optimizers in that it could _accept_ a raw Tensor/Parameter instead of requiring a container of them. This is not really a big deal.
So why this PR?
I do think this PR is cleaner. It uses the fact that the Optimizer parent class already containerizes parameters into param_groups, so we can reuse that here by calling `super().__init__` first and then filtering the param_groups after. This change also makes SparseAdam consistent with the rest of our optimizers in that only containerized params are accepted, which technically is BC-breaking, SO I've added a deprecation warning that we should remove in May 2024.
(But is it really BC breaking when we've said in the docs that params should be an iterable this whole time? Maybe this is just a bug fix....😛)
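A hedged sketch of the containerize-then-filter idea (`SparseAdamLike` is a made-up name; this is not the landed diff):
```
import warnings
import torch
from torch.optim.optimizer import Optimizer

class SparseAdamLike(Optimizer):
    def __init__(self, params, lr=1e-3):
        # Deprecation path: accept a raw Tensor for now, but warn.
        if isinstance(params, torch.Tensor):
            warnings.warn(
                "Passing in a raw Tensor as ``params`` is deprecated; "
                "please wrap it in an iterable instead.",
                UserWarning,
            )
            params = [params]
        # Let the parent class containerize params into param_groups...
        super().__init__(params, defaults=dict(lr=lr))
        # ...then filter/validate self.param_groups here as needed.
```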
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114425
Approved by: https://github.com/drisspg
Fixes #107282
## Overview
- basic design decisions follow those made in #103881 (tensor operations, test cases, order & position of arguments, etc.)
- for the algorithm for decoupled weight decay, I referred to [1, 2]
## Backwards-incompatible changes
- positional argument `decoupled_weight_decay` is added to:
  - `torch.optim.radam`
Existing code that refers to these APIs can be affected.
Note: The positional argument `decoupled_weight_decay` is also added to `torch.optim.RAdam`; however, since it was added in the last position and with a default value, existing code is not affected.
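For intuition, a minimal sketch of what the flag toggles, simplified to a plain gradient step (the real RAdam update is adaptive), following [1]:
```
def coupled_wd_step(param, grad, lr, wd):
    # Classic L2 regularization: decay folded into the gradient.
    grad = grad + wd * param
    return param - lr * grad

def decoupled_wd_step(param, grad, lr, wd):
    # Decoupled weight decay: shrink the parameter directly, independent
    # of the gradient-based update.
    param = param * (1 - lr * wd)
    return param - lr * grad

# For this plain gradient step the two coincide; they diverge once the
# update is adaptive (per-coordinate scaling, as in Adam(W)/RAdam), which
# is exactly the setting [1] analyzes.
```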
## Reference
- [1] [Decoupled Weight Decay Regularization](https://arxiv.org/abs/1711.05101)
- [2] https://github.com/LiyuanLucasLiu/RAdam/blob/master/radam/radam.py#L5-L94
## TODO
- [x] implement tensor operation
- [x] implement test cases
- [x] modify doc-string
- [x] pass unit test code locally `python test/test_optim.py -k test_radam`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107507
Approved by: https://github.com/janeyx99
Moves the logic for casting state to match parameters into a hook, so that users can choose to run their hooks before or after the casting has happened.
With this, there is a bit of redundancy in the id_map building and in the check that the param groups are still aligned in length.
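For example, a user hook can now be ordered relative to the casting (a hedged example; `register_load_state_dict_pre_hook` is the public hook API on torch.optim.Optimizer):
```
import torch

param = torch.randn(2, 3, requires_grad=True)
opt = torch.optim.Adam([param])

def inspect_raw_state(optimizer, state_dict):
    # Runs on the checkpoint as loaded, before state is cast to match params.
    print(sorted(state_dict["param_groups"][0]))
    return state_dict  # returning a dict replaces the input

opt.register_load_state_dict_pre_hook(inspect_raw_state)
opt.load_state_dict(opt.state_dict())
```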
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106725
Approved by: https://github.com/albanD
Starts addressing #106802
This PR also conveniently does some BE:
- Fixes a bug in adamw where we used the top-level amsgrad instead of the per-group amsgrad
- Brings the impls of adamw and adam closer to correctness and to each other
I couldn't fully remove the .pyi's because mypy was going to complain about the entire files, which scared me and shouldn't go in this PR anyway.
Test plan:
- Add tests to ensure that lr can be passed as a Tensor
- Did some profiling of the below code (runs 1k iterations of step for Adam)
```
import torch

param = torch.rand(2, 3, dtype=torch.float, device='cuda:0', requires_grad=True)
param.grad = torch.rand_like(param)
lr = torch.tensor(.001, device='cuda:0')
opt = torch.optim.Adam([param], lr=lr, fused=True)

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ]
) as p:
    for _ in range(1000):
        opt.step()

print(p.key_averages().table(sort_by="cpu_time_total"))
```
Before my change:
<img width="1381" alt="image" src="https://github.com/pytorch/pytorch/assets/31798555/cfc5175a-0f41-4829-941f-342554f3b152">
After my change (notice there are no d2h syncs and the CPU time is lower!):

Long-term next steps:
- have all capturable foreach + forloop impls in Adam(W) handle tensor LR
- have all capturable impls handle tensor LR
- have all impls handle tensor LR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106916
Approved by: https://github.com/albanD
Forward fixes https://github.com/pytorch/pytorch/pull/106615 by increasing tolerance in the test.
The capturable implementation for foreach simply differs due to a different order of operations when updating params. I had also attempted to compare against fp64, but that introduced more disparity in the other optimizer configs. It is worth trying the fp64 comparison at a later point, but let's get the test passing first.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106887
Approved by: https://github.com/izaitsevfb
This PR:
- adds a capturable API for NAdam similar to Adam(W)
- adds tests accordingly
- discovered and fixed bugs in the differentiable implementation (now tested through the capturable codepath).
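A usage sketch of the new flag (assumes a CUDA device is available):
```
import torch

# capturable=True keeps step/state on-device so the NAdam update can run
# inside a CUDA graph capture instead of syncing back to the CPU.
params = [torch.randn(2, 3, device="cuda", requires_grad=True)]
opt = torch.optim.NAdam(params, lr=1e-3, capturable=True)
```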
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106615
Approved by: https://github.com/albanD
The one thing we do still deepcopy is the param_groups, which is much lighter weight. This should also save memory when loading from a checkpoint.
The deepcopy was introduced in ecfcf39f30, but module.py had only a shallow copy at that point, so it did not actually bring parity.
Incorporates an XLA fix, which is why I'm updating the pin to ca5eab87a7
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106082
Approved by: https://github.com/albanD, https://github.com/Skylion007