Removed a bunch of skips. I also updated test_forloop_goes_right_direction to *not* use the closure when dynamo is tracing, because testing the disabled optimizer doesn't actually test anything.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123322
Approved by: https://github.com/janeyx99
ghstack dependencies: #123498
Automatic fixes that replace certain list comprehensions with generator expressions where appropriate so that they are immediately consumed. This is preview functionality in ruff for rule C419, and it was applied automatically.
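A minimal before/after illustration of the kind of rewrite C419 performs (illustrative snippet, not taken from the actual diff):
```
nums = [1, 2, 3, 4]

# Before: the list comprehension materializes a throwaway list.
has_even = any([n % 2 == 0 for n in nums])

# After: the generator expression is consumed lazily by any().
has_even = any(n % 2 == 0 for n in nums)
```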
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123960
Approved by: https://github.com/malfet
This is the last of the old TestOptim! With this change, everything will be migrated to use OptimizerInfo. Our sparse support is...well, sparse, and the tests try to best encapsulate which configs actually work. Note that support_sparse really means supporting sparse grads; we don't test sparse params.
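For context, a minimal sketch of what supporting sparse grads (as opposed to sparse params) means; the values below are made up for illustration:
```
import torch

# Dense parameter, sparse gradient: this is the combination these tests exercise.
param = torch.zeros(10, requires_grad=True)
param.grad = torch.sparse_coo_tensor(
    indices=torch.tensor([[0, 3]]), values=torch.tensor([1.0, 2.0]), size=(10,)
)

# Adagrad is one of the optimizers that accepts sparse grads.
opt = torch.optim.Adagrad([param], lr=0.1)
opt.step()
```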
1. This PR fixes a bug in Adagrad multi_tensor with maximize by passing the correct value of maximize (instead of False every time) when sparse values are present.
2. This PR also improves coverage. There used to be only 2 configs each; now we have the following configs for each optimizer:
Adagrad:
```
python test/test_optim.py -k test_rosenbrock_sparse_with_lrsched_False_Adagrad
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_torch_pytree._register_pytree_node(
{'maximize': True, 'lr': 0.1}
{'initial_accumulator_value': 0.1, 'lr': 0.1} <--- this and above are CPU
.{'foreach': False, 'lr': 0.1}
{'foreach': True, 'lr': 0.1}
{'maximize': True, 'foreach': False, 'lr': 0.1}
{'maximize': True, 'foreach': True, 'lr': 0.1}
{'initial_accumulator_value': 0.1, 'foreach': False, 'lr': 0.1}
{'initial_accumulator_value': 0.1, 'foreach': True, 'lr': 0.1}
.
----------------------------------------------------------------------
Ran 2 tests in 227.744s
OK
```
SGD
```
(pytorch-3.10) [janeyx@devgpu023.odn1 /data/users/janeyx/pytorch (bff23193)]$ python test/test_optim.py -k test_rosenbrock_sparse_with_lrsched_False_SGD
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_torch_pytree._register_pytree_node(
{'dampening': 0.5, 'lr': 0.0048}
.{'foreach': False, 'lr': 0.0048}
{'foreach': True, 'lr': 0.0048}
{'dampening': 0.5, 'foreach': False, 'lr': 0.0048}
{'dampening': 0.5, 'foreach': True, 'lr': 0.0048}
.
----------------------------------------------------------------------
Ran 2 tests in 112.801s
OK
```
SparseAdam
```
(pytorch-3.10) [janeyx@devgpu023.odn1 /data/users/janeyx/pytorch (bff23193)]$ python test/test_optim.py -k test_rosenbrock_sparse_with_lrsched_False_Sparse
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_torch_pytree._register_pytree_node(
{'maximize': True, 'lr': 0.04}
.{'maximize': True, 'lr': 0.04}
.
----------------------------------------------------------------------
Ran 2 tests in 35.113s
OK
```
Fixes #103322. A side quest in this migration was to re-enable dynamo and track dynamo issues as they trigger on the optim tests, which is complete as of this PR. New tests may add more things to track in dynamo, but there is now an established system for doing so, and for every migrated test in TestOptimRenewed, dynamo is either enabled or a bug is tracked.
Next steps:
Remove the hyperparameter constraints in common_optimizers.py defined by metadata_for_sparse (other than LR, which seems handpicked for the tests to actually pass); doing this requires adding more sparse functionality.
Add more tests!
Maybe add more optimizers!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123146
Approved by: https://github.com/albanD
ghstack dependencies: #123134, #123139
Finishes the work started in https://github.com/pytorch/pytorch/pull/118697. Thanks @MarouaneMaatouk for the attempt, but due to inactivity I have opened this PR for Adamax. Note that the new capturable implementation is much simpler, and I've modified the foreach capturable impl: it now calls fewer kernels and is more easily comparable to forloop.
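A rough sketch of what capturable support means in practice (hedged; assumes a CUDA device and this PR's changes):
```
import torch

if torch.cuda.is_available():
    p = torch.randn(8, device="cuda", requires_grad=True)
    # capturable=True keeps all optimizer state, including `step`, on the GPU,
    # which is what makes the update safe to capture in a CUDA graph.
    opt = torch.optim.Adamax([p], lr=1e-3, foreach=True, capturable=True)
    p.grad = torch.randn_like(p)
    opt.step()
    step = opt.state_dict()["state"][0]["step"]
    print(step.device)  # a CUDA tensor rather than a Python float
```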
Next steps:
* This PR discovered two bugs: #121178 and #121238.
* Move the now hefty graph optim tests in test_cuda to use OptimInfo.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121183
Approved by: https://github.com/albanD
Note that this increases coverage from 1 config (vanilla SGD) to all the configs (13 optimizers at around 6-7 each). The test time seems fine though!
With the torch cuda synchronization:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (b6093c03)]$ python test/test_optim.py -k test_step_pre_hook -k test_step_post_hook
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_torch_pytree._register_pytree_node(
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
....................................................
----------------------------------------------------------------------
Ran 52 tests in 13.680s
OK
```
Excluding the torch cuda synchronization:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (916f6fe3)]$ python test/test_optim.py -k test_step_pre_hook -k test_step_post_hook
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_torch_pytree._register_pytree_node(
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
....................................................
----------------------------------------------------------------------
Ran 52 tests in 1.038s
OK
```
The old tests:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (916f6fe3)]$ python test/test_optim.py -k test_pre_hook -k test_post_hook
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_torch_pytree._register_pytree_node(
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
..
----------------------------------------------------------------------
Ran 2 tests in 0.518s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119288
Approved by: https://github.com/mikaylagawarecki
ghstack dependencies: #119283
This PR fixes several bugs, listed in priority:
1. `load_state_dict` with a non-tensor step was incorrect for the capturable and fused implementations, since we didn't create the tensors on the right device in `__setstate__`. This has been fixed (a sketch of the scenario follows below).
2. The most recently added capturable implementations forgot the check that all tensors should be on CUDA for eager mode. We've now added those checks.
3. The most recent change in Adamax only added capturable for foreach and would be silently incorrect for the forloop/single-tensor path. I've added erroring and modified the tests, with many, many skips for that. Honestly, this PR has only further cemented my preference that we implement the single-tensor and multi-tensor capturable variants together in the future. @mlazos
4. The conditional for adding cuda-supported configs for the optimizer infos was incorrect! So we hadn't been testing capturable! This also stands rectified and was the trigger for this PR in the first place.
5. In a similar way, the conditional for `_get_optim_inputs_including_global_cliquey_kwargs` was incorrect sometimes as well. This has also been corrected.
The following is not a bug, but just something to make life simpler by not needing to handle Nones: `optim_input_funcs` must now take in a `device`, which could be a string or a torch.device.
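A hedged sketch of the scenario behind bug 1 (assumes a CUDA device): a checkpoint whose `step` is a plain Python number gets loaded into a capturable optimizer, and the step must be recreated as a tensor on the params' device:
```
import torch

if torch.cuda.is_available():
    p = torch.randn(4, device="cuda", requires_grad=True)
    opt = torch.optim.Adam([p], capturable=True)
    p.grad = torch.randn_like(p)
    opt.step()

    sd = opt.state_dict()
    sd["state"][0]["step"] = 1  # simulate an old checkpoint with a non-tensor step

    opt2 = torch.optim.Adam([p], capturable=True)
    opt2.load_state_dict(sd)
    # With the fix, the reloaded step lives on the same CUDA device as the param,
    # which the capturable/fused implementations require.
    print(opt2.state_dict()["state"][0]["step"].device)
```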
Details for posterity:
4. Running the test_foreach_matches_forloop test and printing the configs shows that capturable is now included, which is correct.
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (5d50138f)]$ python test/test_optim.py -k test_foreach_matches_forloop_AdamW_cuda
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_torch_pytree._register_pytree_node(
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
params=None, kwargs={}, desc=default
params=None, kwargs={'lr': 0.01}, desc=non-default lr
params=None, kwargs={'weight_decay': 0.1}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.1, 'maximize': True}, desc=maximize
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True}, desc=amsgrad
params=None, kwargs={'capturable': True}, desc=capturable
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True}, desc=capturable, amsgrad
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True}, desc=Tensor lr with capturable and amsgrad
.
----------------------------------------------------------------------
Ran 1 test in 19.229s
OK
```
5. Running the test_optimizer_can_be_printed test (which calls `_get_optim_inputs_including_global_cliquey_kwargs`) and printing what gets run also now shows the correct configs.
```
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
params=None, kwargs={'differentiable': False}, desc=default
params=None, kwargs={'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.01, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.01, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'weight_decay': 0.1, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.1, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'differentiable': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'differentiable': True}, desc=maximize & differentiable
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'differentiable': False}, desc=amsgrad
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'differentiable': True}, desc=amsgrad & differentiable
.params=None, kwargs={'foreach': False, 'differentiable': False, 'fused': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False, 'fused': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True, 'fused': False}, desc=default & differentiable
params=None, kwargs={'foreach': False, 'differentiable': False, 'fused': True}, desc=default & fused
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False, 'fused': False}, desc=non-default lr
params=None, kwargs={'lr': 0.01, 'foreach': True, 'differentiable': False, 'fused': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': True, 'fused': False}, desc=non-default lr & differentiable
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False, 'fused': True}, desc=non-default lr & fused
params=None, kwargs={'weight_decay': 0.1, 'foreach': False, 'differentiable': False, 'fused': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.1, 'foreach': True, 'differentiable': False, 'fused': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.1, 'foreach': False, 'differentiable': True, 'fused': False}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.1, 'foreach': False, 'differentiable': False, 'fused': True}, desc=nonzero weight_decay & fused
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=maximize & differentiable
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=maximize & fused
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=amsgrad
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=amsgrad & foreach
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=amsgrad & differentiable
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=amsgrad & fused
params=None, kwargs={'capturable': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=capturable
params=None, kwargs={'capturable': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=capturable & foreach
params=None, kwargs={'capturable': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=capturable & differentiable
params=None, kwargs={'capturable': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=capturable & fused
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=capturable, amsgrad
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=capturable, amsgrad & foreach
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=capturable, amsgrad & differentiable
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=capturable, amsgrad & fused
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=Tensor lr with capturable and amsgrad
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=Tensor lr with capturable and amsgrad & foreach
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=Tensor lr with capturable and amsgrad & differentiable
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=Tensor lr with capturable and amsgrad & fused
.
----------------------------------------------------------------------
Ran 2 tests in 11.112s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118326
Approved by: https://github.com/mlazos
Ref: #86340
Fixes #118148
This fixes LBFGS for complex parameters. Complex parameters are handled as R^2.
I also added a test. Unfortunately, due to the closure required, I could not use the existing `_test_complex_optimizer` used for all other optimizers.
LBFGS is special in that it calls the objective function multiple times internally, so a one-off test for LBFGS seemed justifiable.
We test that each step taken internally by the optimizer is the same for R^2 and complex parameters.
Let me know if the approach is OK, thanks!
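A rough sketch of the comparison the test makes (hedged, not the actual test code): run LBFGS on a complex parameter and on its `view_as_real` counterpart with equivalent closures, then check the parameters stay in sync:
```
import torch

pc = torch.randn(3, dtype=torch.complex64, requires_grad=True)
pr = torch.view_as_real(pc.detach()).clone().requires_grad_(True)

opt_c = torch.optim.LBFGS([pc], max_iter=5)
opt_r = torch.optim.LBFGS([pr], max_iter=5)

def closure_c():
    opt_c.zero_grad()
    loss = (pc * pc.conj()).real.sum()  # |z|^2, a real-valued loss
    loss.backward()
    return loss

def closure_r():
    opt_r.zero_grad()
    loss = (pr * pr).sum()  # the same loss on the R^2 view
    loss.backward()
    return loss

opt_c.step(closure_c)
opt_r.step(closure_r)
# torch.view_as_real(pc) and pr should now match up to numerical noise.
```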
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118184
Approved by: https://github.com/janeyx99
This PR does what it says and more.
1. We increase coverage by a LOT! Previously, complex was not tested for many, many configs, including foreach + maximize at the same time, the fused impls, or just random configs people forgot about.
2. I rearranged the maximize conditional and the _view_as_real call to preserve list-ness, which is needed for _view_as_real to function properly; I added a comment in the Files Changed. This new order also just...makes more aesthetic sense.
3. Note that LBFGS and SparseAdam are skipped--they don't support complex and now we know.
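As an example of a previously uncovered combination (foreach + maximize on a complex param), a minimal hedged sketch:
```
import torch

p = torch.randn(4, dtype=torch.complex64, requires_grad=True)
# foreach + maximize together on a complex parameter: one of the configs
# the expanded coverage now exercises.
opt = torch.optim.Adam([p], lr=1e-2, foreach=True, maximize=True)

opt.zero_grad()
(p * p.conj()).real.sum().backward()
opt.step()  # maximize pushes |p|^2 up; complex params are handled as R^2 internally
```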
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118160
Approved by: https://github.com/mikaylagawarecki
This PR is another step towards modernizing our optimizer tests by tackling the simplest foreach tests. The replaced tests are now removed in `test/optim/test_optim.py`.
**Changes in coverage?** Yes!
- This PR _decreases_ coverage (!!!!) by only checking the direction on the forloop implementations vs both the forloop and foreach. Why? I believe it should be sufficient to check the forloop only, as the foreach parity is already checked in the `foreach_matches_forloop` test.
- This PR also _increases_ coverage for SparseAdam with contiguous params on CUDA, which was previously forbidden due to an old old bug that has since been fixed.
What will it take to fully remove `test_basic_cases`?
- We need to flavor the tests with LRSchedulers
Testing for param groups, which currently all just distinguish between lrs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117410
Approved by: https://github.com/albanD
Today, our param_group testing does the equivalent of pitting weight and bias against different optimizer hyperparams and then checking that the overall result moves in the right direction based on maximize.
This PR introduces two tests to encompass coverage:
1. For every optimizer input (no differentiable), always force bias to have 0 weight_decay, and then check that the direction is as expected. This is basically a replica of today's tests, but is more methodical since the test mirrors a real use case.
2. To ensure that the different groups have distinct behavior, I added another test where lr is basically 0 in the default group, and ensured that the param in the default group doesn't move while the loss does.
Together, these tests do a better job of testing param groups than today's tests, **though we do lose some flavors**. For example, RMSprop also pits centered=True vs False across the param_groups, Adadelta has a variation on rho, and ASGD has a variation for t0. I don't think this is really a loss, as the previous test was only checking direction and the new tests assert stronger guarantees.
The leftover param group configs are used in conjunction with LRSchedulers.
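A minimal sketch of the setup behind test 1 (hypothetical module and hyperparameter choices):
```
import torch

model = torch.nn.Linear(4, 1)
# The bias gets its own group with weight_decay forced to 0; the weight keeps
# the optimizer input's weight_decay. The test then checks the loss moves in
# the expected direction given `maximize`.
opt = torch.optim.SGD(
    [
        {"params": [model.weight], "weight_decay": 0.1},
        {"params": [model.bias], "weight_decay": 0.0},
    ],
    lr=0.1,
)

inputs = torch.randn(8, 4)
for _ in range(5):
    opt.zero_grad()
    loss = model(inputs).pow(2).sum()
    loss.backward()
    opt.step()
```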
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117675
Approved by: https://github.com/albanD
Should allow for use cases mentioned in #110940
This would allow scalars to also be float64s in the foreach implementation. The fused implementation would still create a float32 step on Adam and AdamW. This PR also does NOT worry about performance and is mainly for enablement.
Next steps:
- Relax the constraint on fused adam(w) and allow torch.float64 scalars there
- Allow _performant_ mixed dtypes in foreach (a bigger project in itself).
This PR will conflict with my other PRs; I will figure out a landing order
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115841
Approved by: https://github.com/albanD
Removes a part of the sparse adam test and the following three tests: `test_fused_optimizer_raises`, `test_duplicate_params_across_param_groups`, `test_duplicate_params_in_one_param_group`
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (d2d129de)]$ python test/test_optim.py -k test_fused_optimizer_raises -k test_duplicate_params_across_param_groups -k test_duplicate_params_in_one_param_group
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
...
----------------------------------------------------------------------
Ran 3 tests in 0.023s
OK
```
Increases coverage by testing the duplicate param tests on ALL the optims instead of just one each. Also fixes a SparseAdam bug where we were accidentally unbinding the params via torch.unbind (through list()) instead of putting them in a list. This bug was caught by migrating the weird warning handling to one easy warning context manager, which checks that nothing else gets raised.
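For clarity, a tiny illustration of that bug class (the actual test code differs):
```
import torch

param = torch.randn(3, 2)
# list(param) iterates the tensor, i.e. unbinds it into its 3 rows...
print(len(list(param)))  # 3
# ...whereas [param] wraps the whole tensor in a list, which is what was intended.
print(len([param]))      # 1
```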
The new test_errors does not run slower than before; overhead is still king:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (d2d129de)]$ python test/test_optim.py -k test_errors
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
..........................
----------------------------------------------------------------------
Ran 26 tests in 10.337s
OK
```
Compared to test_errors BEFORE my commit :p
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (b47aa696)]$ python test/test_optim.py -k test_errors
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
.............sssssssssssss
----------------------------------------------------------------------
Ran 26 tests in 11.980s
OK (skipped=13)
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (b47aa696)]$
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116315
Approved by: https://github.com/mikaylagawarecki
As stated. I do notice there is perhaps opportunity to abstract, but the tests as written are also super understandable and more abstraction might not be desirable.
This PR _increases coverage_. The original tests each covered 12 default configs (leaving out Rprop). Now the tests cover ~80 configs, and then foreach + fused on top of that! Test time increases over 10-fold, but this test is tiny so we are not worried:
Old:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (5ca9672c)]$ python test/test_optim.py -k test_step_is_noop_when_params_have_no_grad
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
.
----------------------------------------------------------------------
Ran 1 test in 0.028s
OK
```
New (includes the old test):
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (5ca9672c)]$ python test/test_optim.py -k test_step_is_noop_when_params_have_no_grad
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
...........................
----------------------------------------------------------------------
Ran 27 tests in 0.456s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115299
Approved by: https://github.com/albanD
ghstack dependencies: #114802, #115023, #115025
Removing 4 tests:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (7539011b)]$ python test/test_optim.py -v -k test_fused_optimizers_with_large_tensors -k test_fused_optimizers_with_varying_tensors -k test_multi_tensor_optimizers_with_large_tensors -k test_multi_tensor_optimizers_with_varying_tensors
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
test_fused_optimizers_with_large_tensors (optim.test_optim.TestOptim) ... ok
test_fused_optimizers_with_varying_tensors (optim.test_optim.TestOptim) ... ok
test_multi_tensor_optimizers_with_large_tensors (optim.test_optim.TestOptim) ... ok
test_multi_tensor_optimizers_with_varying_tensors (optim.test_optim.TestOptim) ... ok
----------------------------------------------------------------------
Ran 4 tests in 22.731s
OK
```
For the same 4 but more granular:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (7539011b)]$ python test/test_optim.py -v -k test_fused_large_tensor -k test_fused_mixed_device_dtype -k test_foreach_large_tensor -k test_foreach_mixed_device_dtype
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
test_foreach_large_tensor_ASGD_cpu_float16 (__main__.TestOptimRenewedCPU) ... skipped 'Only runs on cuda'
....
test_fused_mixed_device_dtype_Adam_cpu_float32 (__main__.TestOptimRenewedCPU) ... skipped 'Only runs on cuda'
test_foreach_large_tensor_ASGD_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_Adadelta_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_Adagrad_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_AdamW_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_Adam_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_NAdam_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_RAdam_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_RMSprop_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_Rprop_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_SGD_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_ASGD_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_Adadelta_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_Adagrad_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_AdamW_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_Adam_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_Adamax_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_NAdam_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_RAdam_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_RMSprop_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_Rprop_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_SGD_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_fused_large_tensor_AdamW_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_fused_large_tensor_Adam_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_fused_mixed_device_dtype_AdamW_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_fused_mixed_device_dtype_Adam_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
----------------------------------------------------------------------
Ran 50 tests in 50.785s
OK (skipped=25)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115025
Approved by: https://github.com/albanD
ghstack dependencies: #114802, #115023
Replace the following:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (1bbf1c6f)]$ python test/test_optim.py -k test_peak_mem_multi_tensor_optimizers
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
.
----------------------------------------------------------------------
Ran 1 test in 38.599s
OK
```
with 11 tests (one for each foreach optim :))
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (1bbf1c6f)]$ python test/test_optim.py -k TestOptimRenewedCUDA.test_foreach_memory
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
...........
----------------------------------------------------------------------
Ran 11 tests in 39.293s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115023
Approved by: https://github.com/albanD
ghstack dependencies: #114802
New tests look like:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (af8fca04)]$ python test/test_optim.py -v -k TestOptimRenewedCUDA.test_fused
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
test_fused_AdamW_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_fused_Adam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
----------------------------------------------------------------------
Ran 2 tests in 34.591s
OK
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (af8fca04)]$ python test/test_optim.py -v -k test_set_default_dtype_works_with_foreach
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
test_set_default_dtype_works_with_foreach_ASGD_cpu_float64 (__main__.TestOptimRenewedCPU) ... skipped 'Only runs on cuda'
...
test_set_default_dtype_works_with_foreach_ASGD_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_Adadelta_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_Adagrad_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_AdamW_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_Adam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_Adamax_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_NAdam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_RAdam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_RMSprop_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_Rprop_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_SGD_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
----------------------------------------------------------------------
Ran 22 tests in 32.915s
OK (skipped=11)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114802
Approved by: https://github.com/albanD
This PR aims for parity+ compared to the old testing for the simplest foreach test case.
Test coverage increases: we now test foreach optimizers on CPU as well as on GPU.
Before:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (19136605)]$ python test/test_optim.py -v -k test_multi_tensor_optimizers
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
test_multi_tensor_optimizers (optim.test_optim.TestOptim) ... ok
----------------------------------------------------------------------
Ran 1 test in 7.253s
OK
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (19136605)]$
```
Now, we get granular test cases at the cost of overhead!
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (19136605)]$ python test/test_optim.py -v -k test_foreach
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
test_foreach_ASGD_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_Adadelta_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_Adagrad_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_AdamW_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_Adam_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_Adamax_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_NAdam_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_RAdam_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_RMSprop_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_Rprop_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_SGD_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_ASGD_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_Adadelta_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_Adagrad_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_AdamW_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_Adam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_Adamax_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_NAdam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_RAdam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_RMSprop_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_Rprop_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_SGD_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
----------------------------------------------------------------------
Ran 22 tests in 30.954s
OK
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (19136605)]$
```
Why the increase in time?
Two reasons:
1. Overhead. Any _CUDA_ *Info test (OpInfo, ModuleInfo, OptimizerInfo) will wrap itself with the `CudaNonDefaultStream` policy, and `CudaNonDefaultStream.__enter__`, when called for the first time, will go through all visible CUDA devices and synchronize each of them, thus forcing the CUDA context to be init'd. Doing this for all 8 devices takes ~10-15s. Test parametrization also costs a little overhead, but not at the level of init'ing the CUDA context.
2. We test more! Now, we have 72 configs (in the foreach optimizer world) whereas we only had 59 before.
Next steps for the future:
- consider adding more Tensor LR configs (like a Tensor LR without capturable in the single tensor case)
- this is likely the next PR or 2: migrate all uses of _test_derived_optimizers in test_optim to TestOptimRenewed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114797
Approved by: https://github.com/albanD
Introduce OptimizerInfos + use them to refactor out the error testing.
Why OptimizerInfos?
- cleaner, easier way to test all configs of optimizers
- would plug in well with devicetype to auto-enable tests for devices like MPS, meta
- would allow for more granular testing; currently, lots of functionality is tested in `_test_basic_cases` and some of that should be broken down more.
What did I do for error testing?
- I moved out some error cases from `_test_basic_cases` into a new test_errors parametrized test.
- The new test has to live in TestOptimRenewed (bikeshedding welcome) because the parametrized tests need to take in device and dtype and hook correctly, and not all tests in TestOptim do that.
- TestOptimRenewed is also migrating to the toplevel test/test_optim.py now, because importing TestOptimRenewed does not work (due to test instantiation, TestOptimRenewed gets replaced with TestOptimRenewedCPU, TestOptimRenewedCUDA, and so on for whatever other device).
Is there any change in test coverage?
- INCREASE: The error case where a single Parameter (vs a container of them) is passed in has now expanded to all optims instead of only LBFGS
- DECREASE: Not much. The only thing is we no longer test two error cases for foreach=True AND foreach=False, which I think is redundant. (Highlighted in comments)
Possible but not urgent next step: test ALL possible error cases by going through all the constructors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114178
Approved by: https://github.com/albanD
- Deletes unused kwargs
- Makes test names more descriptive to remove the need for comments; overall it's better to codify than to comment
- Adds a test for duplicate params across groups
- Greatly simplifies test_empty_grad to discover that the crux of the bug was NOT its emptiness, but rather its multi-dim emptiness.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101004
Approved by: https://github.com/albanD
This PR proposes an optimized way to do Exponential Moving Average (EMA), which is faster than the current way using `swa_utils.AveragedModel` described in https://pytorch.org/docs/stable/optim.html#custom-averaging-strategies.
This implementation is asynchronous, and is built as an optimizer wrapper so that the EMA weight update happens without any additional CPU/GPU sync, just after optimizer steps, and with limited code changes.
Example usage:
```
model = Model().to(device)
opt = torch.optim.Adam(model.parameters())
opt = EMAOptimizer(opt, device, 0.9999)

for epoch in range(epochs):
    training_loop(model, opt)

    regular_eval_accuracy = evaluate(model)
    with opt.swap_ema_weights():
        ema_eval_accuracy = evaluate(model)
```
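The core of the asynchronous update is a plain exponential moving average applied to every parameter after each step; a hedged sketch of that update using foreach ops (the real wrapper also handles device placement and overlap with compute):
```
import torch

@torch.no_grad()
def ema_update(ema_params, model_params, decay=0.9999):
    # ema <- decay * ema + (1 - decay) * param, done in two batched foreach calls
    torch._foreach_mul_(ema_params, decay)
    torch._foreach_add_(ema_params, model_params, alpha=1 - decay)
```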
Here are some benchmarks (time per iteration) on various torchvision models:
|model|this PR iteration time|swa_utils.AveragedModel iteration time|iteration speedup|
|-----|-----|-----|-----|
|regnet_x_1_6gf|62.73 |67.998 |1.08 |
|regnet_x_3_2gf|101.75 |109.422 |1.08 |
|regnet_x_400mf|25.13 |32.005 |1.27 |
|regnet_x_800mf|33.01 |37.466 |1.13 |
|regnet_x_8gf|128.13 |134.868 |1.05 |
|regnet_y_16gf|252.91 |261.292 |1.03 |
|regnet_y_1_6gf|72.14 |84.22 |1.17 |
|regnet_y_3_2gf|99.99 |109.296 |1.09 |
|regnet_y_400mf|29.53 |36.506 |1.24 |
|regnet_y_800mf|37.82 |43.634 |1.15 |
|regnet_y_8gf|196.63 |203.317 |1.03 |
|resnet101|128.80 |137.434 |1.07 |
|resnet152|182.85 |196.498 |1.07 |
|resnet18|29.06 |29.975 |1.03 |
|resnet34|50.73 |53.443 |1.05 |
|resnet50|76.88 |80.602 |1.05 |
|resnext101_32x8d|277.29 |280.759 |1.01 |
|resnext101_64x4d|269.56 |281.052 |1.04 |
|resnext50_32x4d|100.73 |101.102 |1.00 |
|shufflenet_v2_x0_5|10.56 |15.419 |1.46 |
|shufflenet_v2_x1_0|13.11 |18.525 |1.41 |
|shufflenet_v2_x1_5|18.05 |23.132 |1.28 |
|shufflenet_v2_x2_0|25.04 |30.008 |1.20 |
|squeezenet1_1|14.26 |14.325 |1.00 |
|swin_b|264.52 |274.613 |1.04 |
|swin_s|180.66 |188.914 |1.05 |
|swin_t|108.62 |112.632 |1.04 |
|swin_v2_s|220.29 |231.153 |1.05 |
|swin_v2_t|127.27 |133.586 |1.05 |
|vgg11|95.52 |103.714 |1.09 |
|vgg11_bn|106.49 |120.711 |1.13 |
|vgg13|132.94 |147.063 |1.11 |
|vgg13_bn|149.73 |165.256 |1.10 |
|vgg16|158.19 |172.865 |1.09 |
|vgg16_bn|177.04 |192.888 |1.09 |
|vgg19|184.76 |194.194 |1.05 |
|vgg19_bn|203.30 |213.334 |1.05 |
|vit_b_16|217.31 |219.748 |1.01 |
|vit_b_32|69.47 |75.692 |1.09 |
|vit_l_32|223.20 |258.487 |1.16 |
|wide_resnet101_2|267.38 |279.836 |1.05 |
|wide_resnet50_2|145.06 |154.918 |1.07 |
You can see that in all cases it is faster than using `AveragedModel`. In fact in many cases, adding EMA does not add any overhead since the computation is hidden behind the usual iteration flow.
This is a similar implementation to the one currently in [NVIDIA NeMo](https://github.com/NVIDIA/NeMo).
If the team is interested in merging this, let me know and I'll add some documentation similar to `swa_utils` and tests.
Credits to @szmigacz for the implementation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94820
Approved by: https://github.com/janeyx99
Fixes #95781.
The cause seems to be that the current implementation doesn't correctly pass `found_inf` when `grad_scale` is `None`. Therefore parameters can get mistakenly updated by gradients some of whose elements are invalid, i.e. nan or inf.
Related #94060
I forgot about this incorrect handling after #94344
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95847
Approved by: https://github.com/janeyx99
Big OOP correction continued. Also added a test this time to verify the defaulting was as expected.
The key here is realizing that the grouping for foreach already assumes that the non-param tensorlists follow suit in dtype and device, so it is too narrow to check that _all_ tensors are on CUDA. The main leeway this allowed was state_steps, which are sometimes CPU tensors. Since foreach _can_ handle CPU tensors, this should not introduce breakage.
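Conceptually, the foreach path buckets tensors by (device, dtype) so each bucket can take the fast path; a simplified sketch of the idea (the real helper in PyTorch groups several tensorlists jointly):
```
from collections import defaultdict

import torch

def group_by_device_and_dtype(tensors):
    # Bucket tensors by (device, dtype) so each bucket can use the foreach fast path.
    buckets = defaultdict(list)
    for t in tensors:
        buckets[(t.device, t.dtype)].append(t)
    return buckets

groups = group_by_device_and_dtype(
    [torch.zeros(2), torch.zeros(2, dtype=torch.float64), torch.zeros(2)]
)
print({key: len(bucket) for key, bucket in groups.items()})
```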
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95820
Approved by: https://github.com/albanD
Applies the remaining flake8-comprehensions fixes and checks. This change replaces all remaining unnecessary generator expressions with list/dict/set comprehensions, which are more succinct, performant, and better supported by our torch.jit compiler. It also removes useless generators such as `set(a for a in b)`, resolving them into just the set call.
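Illustrative examples of the kinds of rewrites applied (not taken from the actual diff):
```
# Unnecessary generators passed to list()/dict()/set() become comprehensions:
squares = [x * x for x in range(10)]     # was: list(x * x for x in range(10))
table = {x: x * x for x in range(10)}    # was: dict((x, x * x) for x in range(10))

# A useless generator collapses into a direct call:
unique = set(range(10))                  # was: set(x for x in range(10))
```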
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94676
Approved by: https://github.com/ezyang
The old behavior would have Adadelta foreach sending tensors to the slow path if they were not all the same dtype or on the same device.
This PR adds grouping for adadelta optimizer so that it would run foreach in batches, allowing more users to benefit from foreach perf.
Of course, we should ensure that the new implementation works, so there are new tests to ensure this behavior is not broken.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92048
Approved by: https://github.com/albanD
I realized test_fused_optimizers used a helper that was written for foreach, so we were not testing fused at all. This PR fixes that test so we actually test fused adam.
Explicitly adding fused=False sets the stage for my later changes (but should be a no-op here).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91228
Approved by: https://github.com/albanD, https://github.com/soulitzer
Fixes#84053
As described in the issue, the AveragedModel will deep copy the model during initialization, which means that the buffers in the averaged model cannot be updated together with the model.
One solution is to make the buffers equal to those of the source model every time `update_parameters` is called.
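A hedged sketch of the intent, with the buffer copy written out manually (the PR wires this into `update_parameters` itself):
```
import torch
from torch.optim.swa_utils import AveragedModel

model = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.BatchNorm1d(4))
averaged = AveragedModel(model)

# ... after some training steps on `model` ...
averaged.update_parameters(model)

# Keep the averaged model's buffers (e.g. BatchNorm running stats) in sync with
# the source model instead of the stale deep-copied values.
with torch.no_grad():
    for avg_buf, buf in zip(averaged.module.buffers(), model.buffers()):
        avg_buf.copy_(buf)
```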
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84054
Approved by: https://github.com/samdow
Hi, we noticed in our team that when using CyclicLR, there is a problem with memory clearance on the GPU (it is probably the case without the GPU as well, but that was our use case). After initializing CyclicLR, GPU memory is not cleared even after the model, optimizer, and scheduler are out of scope (i.e. their reference counts are zero). This is because the `__init__` method of `CyclicLR` creates references to its own methods, and they will not be removed until `gc.collect()` is called manually. This is a problem if people want to test multiple models in one run of a script: after testing the first model, the second one will fail with a `CUDA out of memory` error because the first one is not cleared from memory.
I propose a simple fix using `weakref`, similar to the `_LRScheduler` base class, but if you have any comments I am happy to change it.
Here is the code to reproduce the bug:
```
import torch
import weakref
from transformers import DetrForObjectDetection
class X:
    def __init__(self, optimizer):
        self.optimizer = optimizer
        # Will cause cyclic reference.
        self.func = self.dummy
        # Will work as expected, memory cleared after instance count is zero.
        # self.func = weakref.WeakMethod(self.dummy)

    def dummy(self, x):
        return 1.

def test():
    model = DetrForObjectDetection.from_pretrained('facebook/detr-resnet-50')
    model.to('cuda')
    optimizer = torch.optim.Adam(model.parameters())
    x = X(optimizer)

test()
print(f'{torch.cuda.memory_reserved()}, {torch.cuda.memory_allocated()}')  # Should print (<some memory>, 0), but with cyclic reference, it will print (<some memory>, <some memory>).
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85462
Approved by: https://github.com/albanD
This improves the performance of hybrid sparse COO tensors on the CPU path. This case appeared in the DLRM terabyte test.
With this fix, according to the previous performance test data, we got a ~10x performance improvement on DLRM execution.
Without this fix, DLRM runs as:
Finished training it 100/1000 of epoch 0, 2969.25 ms/it, loss 0.220505, accuracy 0.000 %
With this fix, DLRM runs as:
Finished training it 100/1000 of epoch 0, 270.71 ms/it, loss 0.220505, accuracy 0.000 %
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23057
Approved by: https://github.com/VitalyFedyunin, https://github.com/malfet