This PR aims for parity+ compared to the old testing for the simplest foreach test case.
Test coverage increase: we now test foreach optimizers with CPU as well as on GPU.
Before:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (19136605)]$ python test/test_optim.py -v -k test_multi_tensor_optimizers
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
test_multi_tensor_optimizers (optim.test_optim.TestOptim) ... ok
----------------------------------------------------------------------
Ran 1 test in 7.253s
OK
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (19136605)]$
```
Now, we get granular test cases at the cost of overhead!
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (19136605)]$ python test/test_optim.py -v -k test_foreach
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
test_foreach_ASGD_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_Adadelta_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_Adagrad_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_AdamW_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_Adam_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_Adamax_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_NAdam_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_RAdam_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_RMSprop_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_Rprop_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_SGD_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_ASGD_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_Adadelta_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_Adagrad_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_AdamW_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_Adam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_Adamax_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_NAdam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_RAdam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_RMSprop_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_Rprop_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_SGD_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
----------------------------------------------------------------------
Ran 22 tests in 30.954s
OK
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (19136605)]$
```
Why the increase in time?
Two reasons:
1. overhead. Any _CUDA_ *Info test (OpInfo, ModuleInfo, OptimizerInfo) will wrap itself with the `CudaNonDefaultStream` policy, and `CudaNonDefaultStream.__enter__` when called for the first time will go through all visible CUDA devices and synchronize each of them, thus forcing the CUDAContext to be init'd. Doing this for all 8 devices takes ~10-15s. Also, test parametrization costs a little overhead too, but not to the level init'ing CUDA context does.
2. We test more! Now, we have 72 configs (in the foreach optimizer world) whereas we only had 59 before.
Next steps for the future:
- consider adding more Tensor LR configs (like a Tensor LR without capturable in the single tensor case)
- this is likely the next PR or 2: migrate all uses of _test_derived_optimizers in test_optim to TestOptimRenewed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114797
Approved by: https://github.com/albanD
Introduce OptimizerInfos + use them to refactor out the error testing.
Why OptimizerInfos?
- cleaner, easier way to test all configs of optimizers
- would plug in well with devicetype to auto-enable tests for devices like MPS, meta
- would allow for more granular testing. currently, lots of functionality is tested in `_test_basic_cases` and some of that should be broken down more.
What did I do for error testing?
- I moved out some error cases from `_test_basic_cases` into a new test_errors parametrized test.
- The new test has to live in TestOptimRenewed (bikeshedding welcome) because the parametrized tests need to take in device and dtype and hook correctly, and not all tests in TestOptim do that.
- TestOptimRenewed also is migrating to the toplevel test/test_optim.py now because importing TestOptimRenewed does not work (because of test instantiation, TestOptimRenewed gets replaced with TestOptimRenewedDevice for CPU, CUDA, and whatever other device).
Is there any change in test coverage?
- INCREASE: The error case where a single Parameter (vs a container of them) are passed in has now expanded to all optims instead of only LBFGS
- DECREASE: Not much. The only thing is we no longer test two error cases for foreach=True AND foreach=False, which I think is redundant. (Highlighted in comments)
Possible but not urgent next step: test ALL possible error cases by going through all the constructors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114178
Approved by: https://github.com/albanD
- Deletes unused kwargs
- Make test names more descriptive to remove need of comments. Overall it's better to codify over comment
- Added a test for duplicate params across groups
- Greatly simplified test_empty_grad to discover that the crux of the bug was NOT its emptiness, but rather with multi-dim emptiness.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101004
Approved by: https://github.com/albanD
This PR proposes an optimized way to do Exponential Moving Average (EMA), which is faster than the current way using `swa_utils.AveragedModel` described in https://pytorch.org/docs/stable/optim.html#custom-averaging-strategies.
This implementation is asynchronous, and is built as an optimizer wrapper so that the EMA weight update happens without any additional CPU/GPU sync, just after optimizer steps, and with limited code changes.
Example usage:
```
model = Model().to(device)
opt = torch.optim.Adam(model.parameters())
opt = EMAOptimizer(opt, device, 0.9999)
for epoch in range(epochs):
training_loop(model, opt)
regular_eval_accuracy = evaluate(model)
with opt.swap_ema_weights():
ema_eval_accuracy = evaluate(model)
```
Here are some benchmarks (time per iteration) on various torchvision models:
|model|this PR iteration time |swa_utils.AveragedModel iteration time| iteration speedup |
|-----|-----------------------------|-----------------------|---------------------------------------------|
| | | | |
|regnet_x_1_6gf|62.73 |67.998 |1.08 |
|regnet_x_3_2gf|101.75 |109.422 |1.08 |
|regnet_x_400mf|25.13 |32.005 |1.27 |
|regnet_x_800mf|33.01 |37.466 |1.13 |
|regnet_x_8gf|128.13 |134.868 |1.05 |
|regnet_y_16gf|252.91 |261.292 |1.03 |
|regnet_y_1_6gf|72.14 |84.22 |1.17 |
|regnet_y_3_2gf|99.99 |109.296 |1.09 |
|regnet_y_400mf|29.53 |36.506 |1.24 |
|regnet_y_800mf|37.82 |43.634 |1.15 |
|regnet_y_8gf|196.63 |203.317 |1.03 |
|resnet101|128.80 |137.434 |1.07 |
|resnet152|182.85 |196.498 |1.07 |
|resnet18|29.06 |29.975 |1.03 |
|resnet34|50.73 |53.443 |1.05 |
|resnet50|76.88 |80.602 |1.05 |
|resnext101_32x8d|277.29 |280.759 |1.01 |
|resnext101_64x4d|269.56 |281.052 |1.04 |
|resnext50_32x4d|100.73 |101.102 |1.00 |
|shufflenet_v2_x0_5|10.56 |15.419 |1.46 |
|shufflenet_v2_x1_0|13.11 |18.525 |1.41 |
|shufflenet_v2_x1_5|18.05 |23.132 |1.28 |
|shufflenet_v2_x2_0|25.04 |30.008 |1.20 |
|squeezenet1_1|14.26 |14.325 |1.00 |
|swin_b|264.52 |274.613 |1.04 |
|swin_s|180.66 |188.914 |1.05 |
|swin_t|108.62 |112.632 |1.04 |
|swin_v2_s|220.29 |231.153 |1.05 |
|swin_v2_t|127.27 |133.586 |1.05 |
|vgg11|95.52 |103.714 |1.09 |
|vgg11_bn|106.49 |120.711 |1.13 |
|vgg13|132.94 |147.063 |1.11 |
|vgg13_bn|149.73 |165.256 |1.10 |
|vgg16|158.19 |172.865 |1.09 |
|vgg16_bn|177.04 |192.888 |1.09 |
|vgg19|184.76 |194.194 |1.05 |
|vgg19_bn|203.30 |213.334 |1.05 |
|vit_b_16|217.31 |219.748 |1.01 |
|vit_b_32|69.47 |75.692 |1.09 |
|vit_l_32|223.20 |258.487 |1.16 |
|wide_resnet101_2|267.38 |279.836 |1.05 |
|wide_resnet50_2|145.06 |154.918 |1.07 |
You can see that in all cases it is faster than using `AveragedModel`. In fact in many cases, adding EMA does not add any overhead since the computation is hidden behind the usual iteration flow.
This is a similar implementation to the one currently in [NVIDIA NeMo](https://github.com/NVIDIA/NeMo).
If the team is interested in merging this, let me know and I'll add some documentation similar to `swa_utils` and tests.
Credits to @szmigacz for the implementation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94820
Approved by: https://github.com/janeyx99
Fixes#95781.
The cause seems to be that the current implementation doesn't correctly pass `found_inf` when `grad_scale` is `None`. Therefore parameters can get mistakenly updated by gradients whose some elements are invalid, i.e. nan or inf.
Related #94060
I forgot about this wrong handling after #94344
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95847
Approved by: https://github.com/janeyx99
Big OOP correction continued. Also added a test this time to verify the defaulting was as expected.
The key here is realizing that the grouping for foreach already assumes that the non-param tensorlists follow suit in dtype and device, so it is too narrow to check that _all_ tensors were on CUDA. The main leeway this allowed was state_steps, which are sometimes cpu tensors. Since foreach _can_ handle cpu tensors, this should not introduce breakage.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95820
Approved by: https://github.com/albanD
Applies the remaining flake8-comprehension fixes and checks. This changes replace all remaining unnecessary generator expressions with list/dict/set comprehensions which are more succinct, performant, and better supported by our torch.jit compiler. It also removes useless generators such as 'set(a for a in b)`, resolving it into just the set call.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94676
Approved by: https://github.com/ezyang
Old behavior would have adadelta foreach sending tensors to the slow path if they were not all the same dtype nor on the same device.
This PR adds grouping for adadelta optimizer so that it would run foreach in batches, allowing more users to benefit from foreach perf.
Of course, we should ensure that the new implementation works, so there are new tests to ensure this behavior is not broken.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92048
Approved by: https://github.com/albanD
I realized test_fused_optimizers used a helper that was written for foreach, so we were not testing fused at all. This PR fixes that test so we actually test fused adam.
The explicitly adding fused=False is to set the stage for my later changes (but should be a no-op here).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91228
Approved by: https://github.com/albanD, https://github.com/soulitzer
Fixes#84053
As described in the issue, the AveragedModel will deep copy the model during initialization, which means that the buffers in the averaged model cannot be updated together with the model.
One solution is to make the buffers equal to the source model every time when calling `update_parameters`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84054
Approved by: https://github.com/samdow
Hi, we noticed in our team that by using CyclicLR, there is a problem with memory clearance on GPU (probably it will be the case without the GPU as well, but that was our use case) After initializing CyclicLR, GPU memory is not cleared even after the model, optimizer and scheduler are out of scope (e.g. reference count is zero). This is because `__init__` method inside `CyclicLR` creates reference to its own methods and it will not get removed until `gc.collect()` is called manually. This is a problem if people want to test multiple models in one run of a script, after testing the first model, second one will fail on `CUDA out of memory error` because the first one is not cleared from the memory.
I propose a simple fix by using `weakref`, similarly as in `_LRScheduler` base class, but if you have any comments I am happy to change it.
Here is the code to reproduce the bug:
```
import torch
import weakref
from transformers import DetrForObjectDetection
class X:
def __init__(self, optimizer):
self.optimizer = optimizer
# Will cause cyclic reference.
self.func = self.dummy
# Will work as expected, memory cleared after instance count is zero.
# self.func = weakref.WeakMethod(self.dummy)
def dummy(self, x):
return 1.
def test():
model = DetrForObjectDetection.from_pretrained('facebook/detr-resnet-50')
model.to('cuda')
optimizer = torch.optim.Adam(model.parameters())
x = X(optimizer)
test()
print(f'{torch.cuda.memory_reserved()}, {torch.cuda.memory_allocated()}') # Should print (<some memory>, 0), but with cyclic reference, it will print (<some memory>, <some memory>).
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85462
Approved by: https://github.com/albanD