Third PR in a series of PRs to broaden differentiable optimizer support w/ @janeyx99 (sorry for pinging over the holidays! I just wanted to put this one out but I am definitely not asking for review or anything like that rn)
This is also going to probably be my last PR before the holidays!
Note: This is a branch of #143710 -- I've never worked on a branch of a branch before and wasn't sure about the protocol, so I thought I'd just make the PR and wait until that one gets merged.
This adds support for differentiable lr, weight_decay, and betas to Adam and AdamW (though after refactoring AdamW into an Adam subclass, it really just changes code in torch/optim/adam.py).
One main thing I was wondering about: Adam already has a `differentiable` flag built in, so I currently have code like this:
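For context, here's a rough sketch of the usage this series is working toward (hedged: Adam's `differentiable=True` flag already exists; treating Tensor hyperparameters like `lr`/`weight_decay` as differentiable is exactly what this PR adds, so the sketch assumes that support is in place):
```py
import torch

# Sketch only: assumes this PR's Tensor-hyperparameter support lands as described.
param = torch.nn.Parameter(torch.randn(3))
lr = torch.tensor(1e-2, requires_grad=True)   # differentiable hyperparameter
wd = torch.tensor(1e-4, requires_grad=True)

opt = torch.optim.Adam([param], lr=lr, weight_decay=wd, differentiable=True)

loss = (param ** 2).sum()
loss.backward(create_graph=True)  # keep the graph so step() stays differentiable
opt.step()
# A downstream (meta) objective could now backprop through the update into lr/wd.
```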
```py
if differentiable and isinstance(beta2, Tensor):
    if beta2.requires_grad:
        exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj().mul(1 - beta2))
    else:
        exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj(), value=1 - beta2)
else:
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj(), value=1 - beta2)
```
That I could definitely simplify to just
```py
if differentiable and isinstance(beta2, Tensor):
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj().mul(1 - beta2))
else:
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj(), value=1 - beta2)
```
The simplified version would definitely be a little slower in the case where it's differentiable but beta2 doesn't need a grad, but the code would also be a lot clearer, so I'm debating speed vs. future code usability.
Also the line in the above example:
```py
exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj().mul(1 - beta2))
```
was concerning to me because it is considerably more expensive than `value=1 - beta2`, but I couldn't think of a better way to do it.
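For reference, the extra cost is mainly the temporary tensor the size of `grad` that `.mul(...)` materializes. A minimal sketch (using a plain float `beta2` just to show the two forms agree in value; in the differentiable path `beta2` would be a Tensor requiring grad):
```py
import torch

grad = torch.randn(4, 4)
exp_avg_sq = torch.rand(4, 4)
beta2 = 0.999  # plain float here; in the differentiable path it is a Tensor with requires_grad

# value= form: the (1 - beta2) scale is applied as a scalar inside addcmul_,
# so no temporary the size of `grad` is allocated.
fast = exp_avg_sq.clone().mul_(beta2).addcmul_(grad, grad.conj(), value=1 - beta2)

# .mul(...) form: allocates a full-size temporary grad.conj() * (1 - beta2),
# but when beta2 is a Tensor requiring grad this keeps the scale on the
# autograd graph, which a plain Python `value=` scalar cannot do.
slow = exp_avg_sq.clone().mul_(beta2).addcmul_(grad, grad.conj().mul(1 - beta2))

torch.testing.assert_close(fast, slow)
```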
Further work on #141832
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143726
Approved by: https://github.com/janeyx99
(Actual) second PR in a larger project to broaden support for differentiable optimizers with @janeyx99!
In this PR, I did a lot of pattern matching from the previous PR to add support for differentiable weight_decay.
I also added a single new line on line 359 (previously line 352) to make the code from the last PR a little easier to read.
Continuation of progress on #141832
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143679
Approved by: https://github.com/janeyx99
Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
Second PR in a larger project to broaden support for differentiable optimizers with @janeyx99! The first one had an issue near the end, so this is the second PR on that subject. See #143122 for the development up until this point.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143510
Approved by: https://github.com/janeyx99
This is the last of the old TestOptim! With this change, everything will be migrated to use OptimizerInfo. Our sparse support is...well, sparse, and the tests try to best encapsulate which configs actually work. Note that `support_sparse` really just means the optimizer supports sparse grads; we don't test sparse params.
1. This PR fixes a bug in Adagrad multi_tensor with maximize by passing the correct value of maximize (instead of False every time) when sparse values are present.
2. This PR does improve coverage. There used to be only 2 configs each; now we have the following configs:
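(For anyone unfamiliar with the distinction, here is a minimal example of a dense parameter with a sparse gradient -- the "sparse grads" case; truly sparse params are a separate thing that we don't test.)
```py
import torch

# Dense parameter, sparse gradient: nn.Embedding with sparse=True has a dense
# weight, but backward produces a sparse COO gradient for it.
emb = torch.nn.Embedding(10, 3, sparse=True)
emb(torch.tensor([1, 2, 4])).sum().backward()

print(emb.weight.is_sparse)       # False -- the parameter itself is dense
print(emb.weight.grad.is_sparse)  # True  -- its gradient is sparse
```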
Adagrad:
```
python test/test_optim.py -k test_rosenbrock_sparse_with_lrsched_False_Adagrad
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_torch_pytree._register_pytree_node(
{'maximize': True, 'lr': 0.1}
{'initial_accumulator_value': 0.1, 'lr': 0.1} <--- this and above are CPU
.{'foreach': False, 'lr': 0.1}
{'foreach': True, 'lr': 0.1}
{'maximize': True, 'foreach': False, 'lr': 0.1}
{'maximize': True, 'foreach': True, 'lr': 0.1}
{'initial_accumulator_value': 0.1, 'foreach': False, 'lr': 0.1}
{'initial_accumulator_value': 0.1, 'foreach': True, 'lr': 0.1}
.
----------------------------------------------------------------------
Ran 2 tests in 227.744s
OK
```
SGD
```
(pytorch-3.10) [janeyx@devgpu023.odn1 /data/users/janeyx/pytorch (bff23193)]$ python test/test_optim.py -k test_rosenbrock_sparse_with_lrsched_False_SGD
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_torch_pytree._register_pytree_node(
{'dampening': 0.5, 'lr': 0.0048}
.{'foreach': False, 'lr': 0.0048}
{'foreach': True, 'lr': 0.0048}
{'dampening': 0.5, 'foreach': False, 'lr': 0.0048}
{'dampening': 0.5, 'foreach': True, 'lr': 0.0048}
.
----------------------------------------------------------------------
Ran 2 tests in 112.801s
OK
```
SparseAdam
```
(pytorch-3.10) [janeyx@devgpu023.odn1 /data/users/janeyx/pytorch (bff23193)]$ python test/test_optim.py -k test_rosenbrock_sparse_with_lrsched_False_Sparse
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_torch_pytree._register_pytree_node(
{'maximize': True, 'lr': 0.04}
.{'maximize': True, 'lr': 0.04}
.
----------------------------------------------------------------------
Ran 2 tests in 35.113s
OK
```
Fixes #103322. A side quest in this migration was to re-enable and track dynamo issues as they trigger on the optim tests, which is complete as of this PR. New tests may add more things to track in dynamo, but there is now an established system for doing so, and dynamo is either enabled or a bug is tracked for every migrated test in TestOptimRenewed.
Next steps:
- Remove the hyperparameter constraints in common_optimizer.py defined by metadata_for_sparse (other than LR, which seems handpicked for the tests to actually pass). Doing this requires adding more sparse functionality.
- Add more tests!
- Maybe add more optimizers!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123146
Approved by: https://github.com/albanD
ghstack dependencies: #123134, #123139
Note that this increases coverage from 1 config (vanilla SGD) to all the configs (13 optimizers at around 6-7 each). The test time seems fine though!
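For context, a minimal sketch of the per-optimizer step hooks these tests exercise (using the public `register_step_pre_hook`/`register_step_post_hook` APIs):
```py
import torch

def pre_hook(optimizer, args, kwargs):
    # runs before every optimizer.step()
    print("about to step:", type(optimizer).__name__)

def post_hook(optimizer, args, kwargs):
    # runs after every optimizer.step()
    print("finished step")

param = torch.nn.Parameter(torch.randn(2))
opt = torch.optim.SGD([param], lr=0.1)
opt.register_step_pre_hook(pre_hook)
opt.register_step_post_hook(post_hook)

param.grad = torch.randn(2)
opt.step()  # pre_hook fires, the update runs, then post_hook fires
```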
With the torch cuda synchronization:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (b6093c03)]$ python test/test_optim.py -k test_step_pre_hook -k test_step_post_hook
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_torch_pytree._register_pytree_node(
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
....................................................
----------------------------------------------------------------------
Ran 52 tests in 13.680s
OK
```
Excluding the torch cuda synchronization:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (916f6fe3)]$ python test/test_optim.py -k test_step_pre_hook -k test_step_post_hook
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_torch_pytree._register_pytree_node(
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
....................................................
----------------------------------------------------------------------
Ran 52 tests in 1.038s
OK
```
The old tests:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (916f6fe3)]$ python test/test_optim.py -k test_pre_hook -k test_post_hook
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_torch_pytree._register_pytree_node(
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
..
----------------------------------------------------------------------
Ran 2 tests in 0.518s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119288
Approved by: https://github.com/mikaylagawarecki
ghstack dependencies: #119283
This PR does what it says and more.
1. We increase coverage by a LOT! Previously, complex was not tested for many, many configs, including foreach + maximize at the same time, or the fused impls, or just random configs people forgot about.
2. I rearranged the maximize conditional and the `_view_as_real` call to preserve list-ness, which is needed for `_view_as_real` to function properly (see the sketch after this list); I added a comment in the Files Changed. This new order also just makes more aesthetic sense.
3. Note that LBFGS and SparseAdam are skipped--they don't support complex and now we know.
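For background on (2): complex params are handled by viewing them as real tensors with a trailing size-2 dimension. A minimal illustration with the public `torch.view_as_real` (the `_view_as_real` helper applies this in place across lists of params/states):
```py
import torch

z = torch.randn(3, dtype=torch.complex64)
r = torch.view_as_real(z)  # shape (3, 2): real/imag parts, shares storage with z

# In-place math on the real view is reflected in the complex tensor,
# which is what lets the real-only optimizer kernels handle complex params.
r.mul_(0.5)
print(z)  # both real and imaginary parts halved
```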
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118160
Approved by: https://github.com/mikaylagawarecki
This PR is another step towards modernizing our optimizer tests by tackling the simplest foreach tests. The replaced tests are now removed in `test/optim/test_optim.py`.
**Changes in coverage?** Yes!
- This PR _decreases_ coverage (!!!!) by only checking the direction on the forloop implementations vs both the forloop and foreach. Why? I believe it should be sufficient to check the forloop only, as the foreach parity is already checked in the `foreach_matches_forloop` test (a rough sketch of that check is below).
- This PR also _increases_ coverage for SparseAdam with contiguous params on CUDA, which was previously forbidden due to an old old bug that has since been fixed.
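(Re the parity point above, a rough sketch of the kind of forloop-vs-foreach check being relied on; the actual test is more thorough:)
```py
import torch

def run(foreach):
    # run identical updates with the forloop (foreach=False) or foreach
    # (foreach=True) implementation and return the resulting params
    torch.manual_seed(0)
    p = torch.nn.Parameter(torch.randn(5))
    opt = torch.optim.Adam([p], lr=0.1, foreach=foreach)
    for _ in range(3):
        opt.zero_grad()
        (p ** 2).sum().backward()
        opt.step()
    return p.detach().clone()

# both implementations should land on (numerically) the same parameters
torch.testing.assert_close(run(foreach=False), run(foreach=True))
```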
What will it take to fully remove `test_basic_cases`?
- We need to flavor the tests with LRSchedulers
- Testing for param groups --> which all just distinguish between lrs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117410
Approved by: https://github.com/albanD
Today, our param_group testing does the equivalent of pitting weight and bias against each other with different optimizer hyperparams and then checking that the overall result moves in the right direction based on maximize.
This PR introduces two tests to encompass coverage:
1. For every optimizer input (no differentiable), always force bias to have 0 weight_decay, and then check that the direction is expected. This is basically a replica of today's tests, but more methodical, as the test is a real use case.
2. To ensure that the different groups have distinct behavior, I added another test where lr is essentially 0 in the default group, and checked that the param in the default group doesn't move while the loss does (sketched below).
Together, these tests do a better job of testing param groups than today's tests, **though we do lose some flavors**. For example, RMSProp also pits centered=True vs False across the param_groups, Adadelta has a variation on rho, and ASGD has a variation for t0. I don't think this is really a loss, as the previous test was just testing for direction and our new tests test stronger guarantees.
The leftover param group configs are used in conjunction with LRSchedulers.
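A rough sketch of test (2) above, to make the setup concrete (illustrative only, not the actual test code):
```py
import torch

frozen = torch.nn.Parameter(torch.randn(3))
active = torch.nn.Parameter(torch.randn(3))

# Default group gets lr=0, the second group gets a real lr.
opt = torch.optim.SGD(
    [{"params": [frozen]}, {"params": [active], "lr": 0.1}],
    lr=0.0,
)

frozen_before = frozen.detach().clone()
active_before = active.detach().clone()

for _ in range(5):
    opt.zero_grad()
    ((frozen ** 2).sum() + (active ** 2).sum()).backward()
    opt.step()

torch.testing.assert_close(frozen.detach(), frozen_before)  # default group did not move
assert not torch.equal(active.detach(), active_before)      # the other group did
```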
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117675
Approved by: https://github.com/albanD
Updates flake8 to v6.1.0 and fixes a few lints using sed and some ruff tooling.
- Replace `assert(0)` with `raise AssertionError()`
- Remove extraneous parentheses, i.e.
- `assert(a == b)` -> `assert a == b`
- `if(x > y or y < z):`->`if x > y or y < z:`
- And `return('...')` -> `return '...'`
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116591
Approved by: https://github.com/albanD, https://github.com/malfet
Removes a part of the SparseAdam test and the following three tests: `test_fused_optimizer_raises`, `test_duplicate_params_across_param_groups`, `test_duplicate_params_in_one_param_group`
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (d2d129de)]$ python test/test_optim.py -k test_fused_optimizer_raises -k test_duplicate_params_across_param_groups -k test_duplicate_params_in_one_param_group
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
...
----------------------------------------------------------------------
Ran 3 tests in 0.023s
OK
```
Increases coverage by running the duplicate-param tests on ALL the optims instead of just one each. Also fixes a SparseAdam bug where calling `list(...)` on a bare Tensor accidentally unbound it (effectively `torch.unbind`) instead of putting the param in a list. This bug was caught by migrating the weird warning handling to one easy warning context manager, which checks that nothing else gets raised.
The new test_errors does not run slower than before; overhead is still king:
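For the curious, the shape of that bug in miniature:
```py
import torch

t = torch.randn(3, 2)

print(len(list(t)))  # 3 -- iterating a Tensor unbinds it into its rows (like torch.unbind)
print(len([t]))      # 1 -- what was intended: a single-element list holding the param
```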
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (d2d129de)]$ python test/test_optim.py -k test_errors
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
..........................
----------------------------------------------------------------------
Ran 26 tests in 10.337s
OK
```
Compared to test_errors BEFORE my commit :p
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (b47aa696)]$ python test/test_optim.py -k test_errors
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
.............sssssssssssss
----------------------------------------------------------------------
Ran 26 tests in 11.980s
OK (skipped=13)
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (b47aa696)]$
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116315
Approved by: https://github.com/mikaylagawarecki
As stated. I do notice there is perhaps an opportunity to abstract, but the tests as written are also super understandable, and more abstraction might not be desirable.
This PR _increases coverage_. The original tests each tested 12 default configs (leaving out Rprop). Now the tests cover ~80 configs, and then foreach + fused on top of that! Test time basically increases over 10-fold, but this test is tiny so we are not worried:
Old:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (5ca9672c)]$ python test/test_optim.py -k test_step_is_noop_when_params_have_no_grad
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
.
----------------------------------------------------------------------
Ran 1 test in 0.028s
OK
```
New (includes the old test):
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (5ca9672c)]$ python test/test_optim.py -k test_step_is_noop_when_params_have_no_grad
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
...........................
----------------------------------------------------------------------
Ran 27 tests in 0.456s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115299
Approved by: https://github.com/albanD
ghstack dependencies: #114802, #115023, #115025
Removing 4 tests:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (7539011b)]$ python test/test_optim.py -v -k test_fused_optimizers_with_large_tensors -k test_fused_optimizers_with_varying_tensors -k test_multi_tensor_optimizers_with_large_tensors -k test_multi_tensor_optimizers_with_varying_tensors
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
test_fused_optimizers_with_large_tensors (optim.test_optim.TestOptim) ... ok
test_fused_optimizers_with_varying_tensors (optim.test_optim.TestOptim) ... ok
test_multi_tensor_optimizers_with_large_tensors (optim.test_optim.TestOptim) ... ok
test_multi_tensor_optimizers_with_varying_tensors (optim.test_optim.TestOptim) ... ok
----------------------------------------------------------------------
Ran 4 tests in 22.731s
OK
```
For the same 4 but more granular:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (7539011b)]$ python test/test_optim.py -v -k test_fused_large_tensor -k test_fused_mixed_device_dtype -k test_foreach_large_tensor -k test_foreach_mixed_device_dtype
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
test_foreach_large_tensor_ASGD_cpu_float16 (__main__.TestOptimRenewedCPU) ... skipped 'Only runs on cuda'
....
test_fused_mixed_device_dtype_Adam_cpu_float32 (__main__.TestOptimRenewedCPU) ... skipped 'Only runs on cuda'
test_foreach_large_tensor_ASGD_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_Adadelta_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_Adagrad_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_AdamW_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_Adam_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_NAdam_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_RAdam_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_RMSprop_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_Rprop_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_SGD_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_ASGD_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_Adadelta_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_Adagrad_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_AdamW_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_Adam_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_Adamax_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_NAdam_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_RAdam_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_RMSprop_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_Rprop_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_SGD_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_fused_large_tensor_AdamW_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_fused_large_tensor_Adam_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_fused_mixed_device_dtype_AdamW_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_fused_mixed_device_dtype_Adam_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
----------------------------------------------------------------------
Ran 50 tests in 50.785s
OK (skipped=25)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115025
Approved by: https://github.com/albanD
ghstack dependencies: #114802, #115023
Replace the following:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (1bbf1c6f)]$ python test/test_optim.py -k test_peak_mem_multi_tensor_optimizers
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
.
----------------------------------------------------------------------
Ran 1 test in 38.599s
OK
```
with 11 tests (one for each foreach optim :))
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (1bbf1c6f)]$ python test/test_optim.py -k TestOptimRenewedCUDA.test_foreach_memory
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
...........
----------------------------------------------------------------------
Ran 11 tests in 39.293s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115023
Approved by: https://github.com/albanD
ghstack dependencies: #114802
New tests look like:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (af8fca04)]$ python test/test_optim.py -v -k TestOptimRenewedCUDA.test_fused
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
test_fused_AdamW_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_fused_Adam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
----------------------------------------------------------------------
Ran 2 tests in 34.591s
OK
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (af8fca04)]$ python test/test_optim.py -v -k test_set_default_dtype_works_with_foreach
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
test_set_default_dtype_works_with_foreach_ASGD_cpu_float64 (__main__.TestOptimRenewedCPU) ... skipped 'Only runs on cuda'
...
test_set_default_dtype_works_with_foreach_ASGD_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_Adadelta_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_Adagrad_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_AdamW_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_Adam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_Adamax_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_NAdam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_RAdam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_RMSprop_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_Rprop_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_SGD_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
----------------------------------------------------------------------
Ran 22 tests in 32.915s
OK (skipped=11)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114802
Approved by: https://github.com/albanD
This PR aims for parity+ compared to the old testing for the simplest foreach test case.
Test coverage increases: we now test foreach optimizers on CPU as well as on GPU.
Before:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (19136605)]$ python test/test_optim.py -v -k test_multi_tensor_optimizers
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
test_multi_tensor_optimizers (optim.test_optim.TestOptim) ... ok
----------------------------------------------------------------------
Ran 1 test in 7.253s
OK
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (19136605)]$
```
Now, we get granular test cases at the cost of overhead!
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (19136605)]$ python test/test_optim.py -v -k test_foreach
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
test_foreach_ASGD_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_Adadelta_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_Adagrad_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_AdamW_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_Adam_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_Adamax_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_NAdam_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_RAdam_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_RMSprop_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_Rprop_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_SGD_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_ASGD_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_Adadelta_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_Adagrad_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_AdamW_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_Adam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_Adamax_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_NAdam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_RAdam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_RMSprop_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_Rprop_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_SGD_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
----------------------------------------------------------------------
Ran 22 tests in 30.954s
OK
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (19136605)]$
```
Why the increase in time?
Two reasons:
1. Overhead. Any _CUDA_ *Info test (OpInfo, ModuleInfo, OptimizerInfo) wraps itself with the `CudaNonDefaultStream` policy, and `CudaNonDefaultStream.__enter__`, when called for the first time, goes through all visible CUDA devices and synchronizes each of them, thus forcing the CUDA context to be init'd (see the sketch after this list). Doing this for all 8 devices takes ~10-15s. Test parametrization also costs a little overhead, but not at the level of init'ing the CUDA context.
2. We test more! Now, we have 72 configs (in the foreach optimizer world) whereas we only had 59 before.
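Roughly what the per-device synchronization in (1) amounts to (sketch only; the real policy lives in the test framework):
```py
import torch

# Sketch of the cost described in (1): synchronizing every visible device
# forces each CUDA context to be initialized, which dominates the ~10-15s.
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        torch.cuda.synchronize(i)
```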
Next steps for the future:
- consider adding more Tensor LR configs (like a Tensor LR without capturable in the single tensor case)
- this is likely the next PR or 2: migrate all uses of _test_derived_optimizers in test_optim to TestOptimRenewed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114797
Approved by: https://github.com/albanD
Introduce OptimizerInfos + use them to refactor out the error testing.
Why OptimizerInfos?
- cleaner, easier way to test all configs of optimizers
- would plug in well with devicetype to auto-enable tests for devices like MPS, meta
- would allow for more granular testing. Currently, lots of functionality is tested in `_test_basic_cases`, and some of that should be broken down more.
What did I do for error testing?
- I moved out some error cases from `_test_basic_cases` into a new test_errors parametrized test.
- The new test has to live in TestOptimRenewed (bikeshedding welcome) because the parametrized tests need to take in device and dtype and hook correctly, and not all tests in TestOptim do that.
- TestOptimRenewed is also migrating to the top-level test/test_optim.py now, because importing TestOptimRenewed does not work (due to test instantiation, TestOptimRenewed gets replaced with TestOptimRenewedDevice for CPU, CUDA, and whatever other device).
Is there any change in test coverage?
- INCREASE: The error case where a single Parameter (vs a container of them) is passed in has now been expanded to all optims instead of only LBFGS (see the sketch below).
- DECREASE: Not much. The only thing is we no longer test two error cases for foreach=True AND foreach=False, which I think is redundant. (Highlighted in comments)
Possible but not urgent next step: test ALL possible error cases by going through all the constructors.
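(For reference, the single-Parameter error case now checked for all optims looks roughly like this:)
```py
import torch

param = torch.nn.Parameter(torch.randn(2))

try:
    torch.optim.SGD(param, lr=0.1)  # bare Parameter, not a container of them
except TypeError as e:
    print(e)  # "params argument given to the optimizer should be an iterable ..."
```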
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114178
Approved by: https://github.com/albanD
Context:
https://github.com/pytorch/pytorch/pull/47724 fixed the problem that SparseAdam could not handle generators by using the `list(...)` construct. However, this meant that SparseAdam deviated from other optimizers in that it could _accept_ a raw Tensor/Parameter vs requiring a container of them. This is not really a big deal.
So why this PR?
I do think this PR is cleaner. It uses the fact that the Optimizer parent class already containerizes parameters into parameter groups, so we can reuse that here by calling `super().__init__` first and then filtering the param_groups after. This change also makes SparseAdam consistent with the rest of our optimizers in that only containerized params are accepted, which technically is BC-breaking, SO I've added a deprecation warning that we should remove in May 2024.
(But is it really BC breaking when we've said in the docs that params should be an iterable this whole time? Maybe this is just a bug fix....😛)
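A rough sketch of the constructor pattern described above (simplified and hypothetical, not the actual SparseAdam code):
```py
import torch
from torch.optim.optimizer import Optimizer

class SparseAdamSketch(Optimizer):
    def __init__(self, params, lr=1e-3):
        # Let the Optimizer parent containerize `params` into param_groups first...
        super().__init__(params, dict(lr=lr))
        # ...then walk the already-built groups to validate/filter, e.g. rejecting
        # sparse *parameters* (SparseAdam wants dense params with sparse grads).
        for group in self.param_groups:
            for p in group["params"]:
                if p.is_sparse:
                    raise ValueError("sketch: sparse parameters are not supported")
```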
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114425
Approved by: https://github.com/drisspg
Fixes #107282
## Overview
- The basic design decisions follow those made in #103881 (tensor operations, test cases, order & position of arguments, etc.)
- For the decoupled weight decay algorithm, I referred to [1, 2]
## backwards-incompatible changes
- Positional argument `decoupled_weight_decay` is added to:
  - `torch.optim.radam`
Existing code that refers to these APIs may be affected.
Note: The positional argument `decoupled_weight_decay` is also added to `torch.optim.RAdam`. However, since it was added in the last position and with a default value, existing code is not affected.
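A hedged usage sketch of the new flag (the decoupled path follows AdamW-style decay per [1]):
```py
import torch

params = [torch.nn.Parameter(torch.randn(4))]

# Default: weight decay acts as L2 regularization (added into the gradient).
opt_l2 = torch.optim.RAdam(params, lr=1e-3, weight_decay=1e-2)

# New flag: AdamW-style decoupled decay applied directly to the parameters.
opt_decoupled = torch.optim.RAdam(
    params, lr=1e-3, weight_decay=1e-2, decoupled_weight_decay=True
)
```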
## Reference
- [1] [Decoupled Weight Decay Regularization](https://arxiv.org/abs/1711.05101)
- [2] https://github.com/LiyuanLucasLiu/RAdam/blob/master/radam/radam.py#L5-L94
## TODO
- [x] implement tensor operation
- [x] implement test cases
- [x] modify doc-string
- [x] pass unit test code locally `python test/test_optim.py -k test_radam`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107507
Approved by: https://github.com/janeyx99