Starter version of automatic dynamic shapes for export.
Creates enums `DIM.AUTO`, `DIM.STATIC`, allowing user to specify `AUTO` for dims in dynamic_shapes specs, meaning that corresponding dims are treated as dynamic, and relevant guards will do what's necessary (e.g. refine ValueRanges, set replacements based on equality, or even set static) without raising ConstraintViolationErrors. Basically allows the user to say, "a bunch of these dims can be dynamic, let export do model analysis and return the program with maximum possible dynamism, without complaining".
The usage for specifying `dynamic_shapes` is now:
```
AUTO -> dynamic by default, return whatever produce_guards() says, even if it's static
None/int/STATIC -> static
Dim/DerivedDim -> same as before - will complain if the min/max range is invalid, or if dims related to this are unspecified.
```
Caveat 1: specifying `AUTO` for a dim won't guarantee it'll be dynamic:
- specifying `AUTO` for a dim will return the maximum possible dynamism given your program and other specified constraints, but this can still mean you'll get a static program. For example, with the program below, x is specified dynamic, but it's equal to y, which is specified static, and with how we currently do things we won't promote y to dynamic, but will demote(?) x to static. So this can be surprising if you don't fully know your model, and/or missed one of your other inputs when specifying auto-dynamic shapes.
```
class Foo(torch.nn.Module):
def forward(self, x, y):
return x + y
inputs = (torch.randn(6), torch.randn(6))
export(Foo(), inputs, dynamic_shapes={"x": (DIM.AUTO,), "y": None})
```
Caveat 2: specifying `AUTO` and Dims in the same spec is still problematic:
- The way Dims/DerivedDims are currently handled is very strict. A Dim represents a symbol, and we require a user to specify the symbol for all dims governed by the symbol - that's why we've seen errors in the past like `The values of x must always be related to y by ...`, asking the user to specify the exact relation as in the program. We also require the specified min/max range to be a subset of the valid range from model analysis. All this doesn't compose well with specifying `AUTO` just yet - for example in the program below, ideal behavior could be to return a dynamic program, where `dx = x.size(0) = y.size(0)` has range (3,6). Unfortunately this crashes, and correct behavior is to specify `dx` for both inputs. So currently we raise a UserError and crash if both Dims + `AUTO` are present in the spec.
```
class Foo(torch.nn.Module):
def forward(self, x, y):
return x + y
inputs = (torch.randn(6), torch.randn(6))
export(Foo(), inputs, dynamic_shapes={"x": (DIM.AUTO,), "y": {0: Dim("dx", min=3, max=6)}}) # this doesn't work, because x & y and related
```
Implementation details:
This is done by setting `assume_static_by_default=False`, and doing a transform on the `dynamic_shapes` spec to preserve semantics. `assume_static_by_default=False` will treat unspecified dims or Nones as dynamic. This is the opposite of what `export.export()` currently does - unspecified Dims/Nones are treated as static. Historically this static-by-default behavior, where the user deals with fewer guards, has been desirable, and we would like to respect that in this implementation. So this internal spec transformation is added, `_transform_shapes_for_default_dynamic()`, does the spec conversion necessary to be compatbile with dynamic by default. Specifically, AUTOs are converted into Nones, and Nones/unspecified dims are filled in with explicitly static constraints.
For example, this would look like, for a 3-d tensor: `{0: DIM.AUTO, 1: None, 2: Dim("dx")} -> {0: None, 1: 32, 2: Dim("dx")}`
This does seem overly complicated, but it's done to preserve dynamic shapes semantics for `torch._dynamo.export()`, which already uses `assume_static_by_default=False`, and follows the same process for generating shape constraints , via `_process_dynamic_shapes`. There the semantics are:
```
None/unspecified: dynamic by default
Dim/DerivedDim: also a strict assertion
```
If we don't care about BC for `_dynamo.export(dynamic_shapes)`, then we can just modify semantics for `_process_dynamic_shapes()` and change all the relevant tests in `test/dynamo/test_export.py`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133620
Approved by: https://github.com/avikchaudhuri
Zero bubble can be expressed through `ScheduleFlexibleInterleaved1F1B` by setting `enable_zero_bubble=True`. But instead of having to include this flag in schedule initialization we should create a separate ZeroBubbleSchedule and also transition `Interleaved1F1B` to derive from `ScheduleFlexibleInterleaved1F1B`. Then we dont need to expose `ScheduleFlexibleInterleaved1F1B` since the naming is not obvious
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133467
Approved by: https://github.com/wconstab
ghstack dependencies: #132691
This PR switches to cuDSS library and has the same purpose of #127692, which is to add Sparse CSR tensor support to linalg.solve.
Fixes#69538
Minimum example of usage:
```
import torch
if __name__ == '__main__':
spd = torch.rand(4, 3)
A = spd.T @ spd
b = torch.rand(3).to(torch.float64).cuda()
A = A.to_sparse_csr().to(torch.float64).cuda()
x = torch.linalg.solve(A, b)
print((A @ x - b).norm())
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129856
Approved by: https://github.com/amjames, https://github.com/lezcano, https://github.com/huydhn
Co-authored-by: Zihang Fang <zhfang1108@gmail.com>
Co-authored-by: Huy Do <huydhn@gmail.com>
Summary:
This PR adds in cuSPARSELt as a backend to PyTorch.
It is now possible to see if cuSPARSELt is available and the version if
it is with
```
torch.backends.cusparselt.is_available()
torch.backends.cusparselt.version()
```
Test Plan:
```
python test/test_sparse_semi_structured.py -k test_cusparselt_backend
```
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128534
Approved by: https://github.com/cpuhrsch, https://github.com/eqy, https://github.com/syed-ahmed
Add decorator `torch.compiler.substitute_in_graph` to register polyfill for unsupported C++ function to avoid graph break. This API provides an official way to add support for dynamo for third-party C extensions. Also, it can be used to simplify our implementation for `torch._dynamo.polyfill`.
5ee070266f/torch/_dynamo/variables/builtin.py (L97-L107)
Example:
```python
>>> import operator
>>> operator.indexOf([1, 2, 3, 4, 5], 3)
2
>>> torch.compile(operator.indexOf, fullgraph=True)([1, 2, 3, 4, 5], 3)
Unsupported: ...
>>> @torch.compiler.substitute_in_graph(operator.indexOf)
... def indexOf(sequence, x):
... for i, item in enumerate(sequence):
... if item is x or item == x:
... return i
... raise ValueError("sequence.index(x): x not in sequence")
>>> torch.compile(operator.indexOf, fullgraph=True)([1, 2, 3, 4, 5], 3)
2
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133712
Approved by: https://github.com/jansel
Link various classes and functions of the `optim.swa.util` to make doc content accessible from the `torch.optim` doc.
Currently, if you click the link,
https://pytorch.org/docs/stable/optim.html#module-torch.optim.swa_utils it goes to a blank, bottom of the page section of `torch.optim`.
Also,
`torch.optim.swa_utils.AveragedModel` and `torch.optim.swa_utils.SWALR` classes as well as `torch.optim.swa_utils.update_bn()` and `optim.swa_utils.get_ema_multi_avg_fn` are not linked to doc.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133393
Approved by: https://github.com/janeyx99
Add decorator `torch.compiler.substitute_in_graph` to register polyfill for unsupported C++ function to avoid graph break. This API provides an official way to add support for dynamo for third-party C extensions. Also, it can be used to simplify our implementation for `torch._dynamo.polyfill`.
5ee070266f/torch/_dynamo/variables/builtin.py (L97-L107)
Example:
```python
>>> import operator
>>> operator.indexOf([1, 2, 3, 4, 5], 3)
2
>>> torch.compile(operator.indexOf, fullgraph=True)([1, 2, 3, 4, 5], 3)
Unsupported: ...
>>> @torch.compiler.substitute_in_graph(operator.indexOf)
... def indexOf(sequence, x):
... for i, item in enumerate(sequence):
... if item is x or item == x:
... return i
... raise ValueError("sequence.index(x): x not in sequence")
>>> torch.compile(operator.indexOf, fullgraph=True)([1, 2, 3, 4, 5], 3)
2
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133712
Approved by: https://github.com/jansel
Add decorator `torch.compiler.substitute_in_graph` to register polyfill for unsupported C++ function to avoid graph break. This API provides an official way to add support for dynamo for third-party C extensions. Also, it can be used to simplify our implementation for `torch._dynamo.polyfill`.
5ee070266f/torch/_dynamo/variables/builtin.py (L97-L107)
Example:
```python
>>> import operator
>>> operator.indexOf([1, 2, 3, 4, 5], 3)
2
>>> torch.compile(operator.indexOf, fullgraph=True)([1, 2, 3, 4, 5], 3)
Unsupported: ...
>>> @torch.compiler.substitute_in_graph(operator.indexOf)
... def indexOf(sequence, x):
... for i, item in enumerate(sequence):
... if item is x or item == x:
... return i
... raise ValueError("sequence.index(x): x not in sequence")
>>> torch.compile(operator.indexOf, fullgraph=True)([1, 2, 3, 4, 5], 3)
2
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133712
Approved by: https://github.com/jansel
Moving DTensor to be in the public namespace, to formally add the
documentation page that includes all the public APIs. This includes:
* many path renames and path import fixes
* a dedicated doc page without too much content yet (adding in the next
PRs)
* To preserve the BC for users still using the `torch.distributed._tensor`,
I added a shim script to redirect old path calls to the new module
The BC preserving is evidented by the fact that all DTensor tests are still
working without changing the public imports. So it's safe to land the
changes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133113
Approved by: https://github.com/XilunWu
ghstack dependencies: #133305, #133306
The `LRScheduler` class provides methods to adjusts the learning rate during optimization (as updated in this PR). Also, as a note, all the classes of lr_scheduluer are already provided in the `How to adjust learning rate` section.
Fixes#127884
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133243
Approved by: https://github.com/janeyx99
The goal of this PR is to avoid stack overflow when we create extremely long chains of thunks, and then evaluate them (e.g., as occurs if you sum(long list of symint)). The basic idea behind this PR is to only thunkify proxies if they're being created in places where they may or may not be used--crucially, symint operations that occur in user code we are tracing are eagerly placed into the graph, even if they may eventually be dead.
I annotated the PR with explanation of changes.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132421
Approved by: https://github.com/Skylion007, https://github.com/zou3519
ghstack dependencies: #132674, #132675
Instead of having a separate context variable for SymDispatchMode, we
now simply delegate to the current active proxy tensor mode when we
need to trace a SymInt. We maintain a separate `__sym_dispatch__` magic
method as the calling convention is different than `__torch_dispatch__`.
Consolidating the modes in this ways means that we can consistently
disable both of these modes in tandem simply by removing the mode
from the proxy mode infra slot.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132674
Approved by: https://github.com/zou3519, https://github.com/bdhirsh
There are still some differences between CUDA and non-CUDA custom devices when
construct FSDP because CUDA is selected as the default device. For example,
when construct FSDP from CPU model and device_id is not passed, device_handle
will choose CUDA as default device. This PR will autoselect the real device
as the default device.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127609
Approved by: https://github.com/awgu
The goal of this PR is to avoid stack overflow when we create extremely long chains of thunks, and then evaluate them (e.g., as occurs if you sum(long list of symint)). The basic idea behind this PR is to only thunkify proxies if they're being created in places where they may or may not be used--crucially, symint operations that occur in user code we are tracing are eagerly placed into the graph, even if they may eventually be dead.
I annotated the PR with explanation of changes.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132421
Approved by: https://github.com/Skylion007, https://github.com/zou3519
ghstack dependencies: #132674, #132675
Instead of having a separate context variable for SymDispatchMode, we
now simply delegate to the current active proxy tensor mode when we
need to trace a SymInt. We maintain a separate `__sym_dispatch__` magic
method as the calling convention is different than `__torch_dispatch__`.
Consolidating the modes in this ways means that we can consistently
disable both of these modes in tandem simply by removing the mode
from the proxy mode infra slot.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132674
Approved by: https://github.com/zou3519, https://github.com/bdhirsh
Add similar semantics for creating a buffer object similar to creating a parameter. This is done by introducing a new Buffer class that can be used for type disambiguation. The underlying functionality of registering a buffer remains the same as the register_buffer method has not been changed. The persistent parameter in the Buffer type is to indicate whether a buffer object should be persistent or not. Other non-test changes have to do with getting the new Buffer type recognized by inductor and dynamo. Remaining changes are test changes to make sure that the Buffer type can be used as a drop in replacement for register_buffer as it just leads to register_buffer being called. The addition of this new functionality still allows for normal tensors to be used as buffers so these changes are intended to be backwards compatible.
Fixes#35735
Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125971
Approved by: https://github.com/albanD, https://github.com/anijain2305, https://github.com/mlazos
#109581
At this point, the vanilla implementation (the default) is good.
Docs: https://docs-preview.pytorch.org/pytorch/pytorch/129905/generated/torch.optim.Adafactor.html#torch.optim.Adafactor
Specifically, the impl in this PR, which attempts to replicate the paper,
```
optim = torch.optim.Adafactor([weight])
```
is close enough to https://pytorch-optimizers.readthedocs.io/en/latest/optimizer/#pytorch_optimizer.AdaFactor
```
optim_c = AdaFactor([weight], betas=(0, 0.999), scale_parameter=False)
```
is close enough to https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adafactor
```
optim = keras.optimizers.Adafactor(learning_rate=0.01)
```
The three results respectively for the same randomly generated weights:
```
# ours
tensor([[ 0.3807594, -0.3912092],
[ 0.0762539, 0.5377805],
[ 0.2459473, 0.4662207]])
# pytorch-optimizer
tensor([[ 0.3807592, -0.3912172],
[ 0.0762507, 0.5377818],
[ 0.2459457, 0.4662213]])
# keras
array([[ 0.38076326, -0.39121315],
[ 0.0762547 , 0.5377859 ],
[ 0.24594972, 0.46622536]], dtype=float32)
```
This gives me confidence to move forward in speeding up the implementation now that a baseline has been established. If you're curious about differences:
* keras assigns step_size (rho_t in their code) to `min(lr, 1 / sqrt(step)` whereas the OG impl uses a hardcoded 0.01 instead of lr. We do the same thing as keras, but our lr default is 0.01.
* We differ from the pytorch-optimizers default in that our default will not track momentum (thus `beta1=0`) and we do not apply parameter scaling.
<details>
Keras collab: https://colab.research.google.com/drive/1i3xF8ChL7TWKJGV_5v_5nMhXKnYmQQ06?usp=sharing
My script repro:
```
import torch
from pytorch_optimizer import AdaFactor
torch.set_printoptions(precision=7)
weight = torch.tensor([[ 0.37697506, -0.39500135],
[ 0.07246649, 0.53399765],
[ 0.24216151, 0.46243715]], dtype=torch.float32)
# bias = torch.tensor([0, 0], dtype=torch.float32)
weight.grad = torch.tensor([[-0.5940447, -0.7743838],
[-0.5940447, -0.7743838],
[-0.5940447, -0.7743838]], dtype=torch.float32)
# bias.grad = torch.tensor([-2.5027974, 1.5422692], dtype=torch.float32)
weight_c = weight.clone()
weight_c.grad = weight.grad.clone()
optim = torch.optim.Adafactor([weight])
optim.step()
print(weight)
optim_c = AdaFactor([weight_c], betas=(0, 0.999), scale_parameter=False)
optim_c.step()
print(weight_c)
```
<details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129905
Approved by: https://github.com/albanD
Fixes#130284Fixes#130653
- Add `torch.library.register_vmap` to custom ops
- Add `register_vmap` for operators in ops in custom_op_db.
- Make `torch.autograd.Function` support kwarg-only kwargs for vmap
- test operators in op_db with `tests/test_vmap`.
- change `test_vmap` to allow custom `out_dim` and allow "None" in `out_dim` when testing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130589
Approved by: https://github.com/zou3519