Changes:
1. Move `polyfill.py` -> `polyfills/__init__.py`. Usages change from `polyfill.xxx` to `polyfills.xxx`.
2. Move submodule loading from `polyfills/__init__.py` to `polyfills/loader.py`.
Merge `polyfill.py` and the `polyfills/` package. Each polyfill module has its own namespace for better code organization.
The ultimate goal is to make `polyfills/__init__.py` empty and to move every polyfill function into its own namespace.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133977
Approved by: https://github.com/jansel
Add the decorator `torch.compiler.substitute_in_graph` to register a polyfill for an unsupported C++ function and so avoid a graph break. This API provides an official way to add Dynamo support for third-party C extensions. It can also be used to simplify our implementation of `torch._dynamo.polyfill`.
5ee070266f/torch/_dynamo/variables/builtin.py (L97-L107)
Example:
```python
>>> import operator
>>> operator.indexOf([1, 2, 3, 4, 5], 3)
2
>>> torch.compile(operator.indexOf, fullgraph=True)([1, 2, 3, 4, 5], 3)
Unsupported: ...
>>> @torch.compiler.substitute_in_graph(operator.indexOf)
... def indexOf(sequence, x):
...     for i, item in enumerate(sequence):
...         if item is x or item == x:
...             return i
...     raise ValueError("sequence.index(x): x not in sequence")
>>> torch.compile(operator.indexOf, fullgraph=True)([1, 2, 3, 4, 5], 3)
2
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133712
Approved by: https://github.com/jansel
I didn't test this path when creating the orchestrator. This PR fixes
that path to work in the capture_triton path. The problem is that we are
handling a value that is an int (in the capture_triton path) and a
ConstantVariable (in the Dynamo triton path) so we abstract that out in
the orchestrator.
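The abstraction described above can be sketched roughly as follows. This is a minimal illustration, not the orchestrator's actual code: `ConstantVariable` here is a hypothetical stand-in class, and `unwrap_constant` and `grid_size` are names invented for the example.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ConstantVariable:
    """Hypothetical stand-in for Dynamo's wrapper around a Python constant."""

    value: Any


def unwrap_constant(x: Any) -> Any:
    """Return the raw Python value whether or not `x` is wrapped."""
    if isinstance(x, ConstantVariable):
        return x.value
    return x


def grid_size(num_elements: Any, block_size: Any) -> int:
    """Compute a 1-D launch grid size from plain or wrapped ints (illustrative helper)."""
    n = unwrap_constant(num_elements)
    b = unwrap_constant(block_size)
    return (n + b - 1) // b  # ceiling division
```

With a single unwrapping helper, the shared logic does not need to know which of the two paths produced the value.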
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132143
Approved by: https://github.com/oulgen
TritonKernelVariable's logic tells us how to go from a user-defined
triton kernel and a grid to a call to the triton_kernel_wrapper_mutation
HOP. We want to re-use this in a setting without Dynamo; in the next PR
up, we create a new decorator (capture_triton) that, when applied to a
triton kernel, transforms a call to the triton kernel into a call
to the triton_kernel_wrapper_mutation HOP.
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130177
Approved by: https://github.com/oulgen, https://github.com/ydwu4
It is hard to write dedicated tests for this change. This PR makes many tests in the stack pass, such as
`PYTORCH_TEST_WITH_DYNAMO=1 pytest test/test_ao_sparsity.py::TestComposability::test_convert_without_squash_mask`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129858
Approved by: https://github.com/mlazos
ghstack dependencies: #129830
Significant bytecode generation API change!
The new suggested convention for generating bytecode that calls a function is to wrap the instructions that push a callable onto the stack with `add_push_null`, and then call that callable with `create_call_function` with `push_null=False` (see diff for examples).
In Python 3.13, NULL is now expected to be pushed after the callable. In <=3.12, the NULL was pushed before the callable. This change abstracts away the exact placement of the NULL, but the developer must be aware that a NULL may be needed when codegen'ing a callable.
This abstraction also reduces the need for the `push_null=True` option in `create_call_function`, which removes the need to rotate a NULL to the right place on the stack with a sequence of `SWAP` instructions.
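To illustrate the version-dependent NULL placement described above, here is a toy model that represents instructions as plain strings. It is a sketch only: `add_push_null_sketch` is a hypothetical name, and it does not reproduce the signature or behavior of the real `add_push_null` helper beyond the ordering rule stated in this message.

```python
import sys
from typing import Callable, List, Tuple


def add_push_null_sketch(
    push_callable: Callable[[], List[str]],
    py_version: Tuple[int, int] = sys.version_info[:2],
) -> List[str]:
    """Wrap the instructions that push a callable with a correctly placed NULL.

    Toy model: instructions are strings, not real bytecode objects.
    Versions that have no PUSH_NULL instruction are ignored in this sketch.
    """
    insts = push_callable()
    if py_version >= (3, 13):
        # Python 3.13: NULL is expected after the callable.
        return insts + ["PUSH_NULL"]
    # Python <= 3.12: NULL is pushed before the callable.
    return ["PUSH_NULL"] + insts
```

The caller only describes how to push the callable; where the NULL goes is decided in one place.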
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129172
Approved by: https://github.com/jansel
Improve Dynamo to support the FSDP2 `use_training_state()` context manager.
Test command:
```
pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_dynamo_trace_use_training_state
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127854
Approved by: https://github.com/yanboliang
This is a short-term fix (for 2.4). In the longer term we should
fix https://github.com/pytorch/pytorch/issues/128430
The problem is that warnings.warn calls inside Dynamo print every
time. Python warnings are supposed to print once, unless their
cache is reset: Dynamo ends up resetting that cache every time it runs.
As a workaround we provide our own warn_once cache that is keyed on the
warning msg. I am not worried about this increasing memory usage because
that's effectively what Python's warnings.warn cache does.
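A minimal sketch of such a message-keyed cache is shown below. It is an illustration under assumptions, not the code added by this PR: the signature and the module-level set name `_already_warned` are chosen for the example.

```python
import warnings
from typing import Set, Type

# Messages that have already been emitted; this cache is independent of the
# registry that the `warnings` module keeps, so resetting that one has no effect here.
_already_warned: Set[str] = set()


def warn_once(
    msg: str, category: Type[Warning] = UserWarning, stacklevel: int = 1
) -> None:
    """Emit `msg` as a warning at most once per process (illustrative sketch)."""
    if msg in _already_warned:
        return
    _already_warned.add(msg)
    # +1 so the warning is attributed to the caller of warn_once.
    warnings.warn(msg, category, stacklevel=stacklevel + 1)
```

Keying on the message text means two call sites that emit the same string share one entry, which is the trade-off the paragraph above accepts.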
Test Plan:
- fix tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128456
Approved by: https://github.com/anijain2305
Fixes https://github.com/pytorch/pytorch/issues/122404
Previously, when rewriting c10d collectives, if the group argument was
unspecified or None, we created a world pg variable out of thin air and
passed it to the rewrite target. That approach was problematic, as it
assumed the symbol `torch` is available in the scope (see #122404).
After #120560, dynamo can now trace dist.group.WORLD. If the group
argument is unspecified, we can just set it with dist.group.WORLD in the
rewrite target.
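The defaulting step can be sketched in plain Python as follows. This is a simplified illustration, not Dynamo's rewrite code: `WORLD` is a placeholder for `dist.group.WORLD`, `all_reduce` is a stand-in with a simplified signature, and `normalize_group` is a hypothetical helper name.

```python
import inspect
from typing import Any, Callable, Dict, Tuple

WORLD = "WORLD"  # placeholder for dist.group.WORLD in this sketch


def all_reduce(tensor, op="sum", group=None, async_op=False):
    """Stand-in for a legacy collective; the signature is simplified."""


def normalize_group(
    fn: Callable[..., Any], args: Tuple[Any, ...], kwargs: Dict[str, Any]
) -> Dict[str, Any]:
    """Bind the call to keyword arguments and default a missing group to WORLD."""
    bound = inspect.signature(fn).bind(*args, **kwargs)
    bound.apply_defaults()
    arguments = dict(bound.arguments)
    if arguments.get("group") is None:
        arguments["group"] = WORLD
    return arguments
```

Because the default is an ordinary value that can be traced, nothing has to be synthesized from a symbol that may not exist in the user's scope.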
Testing
pytest test/distributed/test_inductor_collectives.py -k test_dynamo_rewrite_dist_allreduce
Also verified with the repro provided in #122404
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122561
Approved by: https://github.com/wconstab
ghstack dependencies: #120560
Summary: Special kwargs like `num_warps`, `num_stages`, and `num_ctas` can be passed to the Triton kernel call as kwargs. These kwargs are handled in a special way, not being passed to the underlying kernel function directly. In this PR, we move those special kwargs from `kwargs` of the `TritonKernelVariable` in dynamo to `Autotuner`'s `Config` instances (either already existing or newly created for this purpose). As a result, the special kwargs can be codegened correctly as a part of `Config`, not as direct arguments to the kernel `.run`.
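The idea of moving the special kwargs into configs can be sketched as below. This is a toy model under assumptions: the `Config` dataclass is a hypothetical stand-in (not Triton's `Config`), and `move_special_kwargs` is an invented helper name.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple

SPECIAL_KWARGS = ("num_warps", "num_stages", "num_ctas")


@dataclass
class Config:
    """Hypothetical stand-in for an autotuner config."""

    kwargs: Dict[str, Any] = field(default_factory=dict)
    special: Dict[str, Any] = field(default_factory=dict)


def move_special_kwargs(
    call_kwargs: Dict[str, Any], configs: List[Config]
) -> Tuple[Dict[str, Any], List[Config]]:
    """Split special kwargs off the call and fold them into every config."""
    special = {k: v for k, v in call_kwargs.items() if k in SPECIAL_KWARGS}
    remaining = {k: v for k, v in call_kwargs.items() if k not in SPECIAL_KWARGS}
    if not special:
        return remaining, configs
    if not configs:
        # No existing config: create one just to carry the special kwargs.
        configs = [Config()]
    updated = [Config(dict(c.kwargs), {**c.special, **special}) for c in configs]
    return remaining, updated
```

After the split, only `remaining` is passed to the kernel itself, while the special values travel with the configs.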
Test Plan:
```
python test/inductor/test_triton_kernels.py -k test_triton_kernel_special_kwargs
...
----------------------------------------------------------------------
Ran 6 tests in 6.783s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122280
Approved by: https://github.com/oulgen
@ezyang mentioned that we should not put constant args on the graph, especially args that would be tricky to represent there. For example, the next PR needs `triton.language.dtype` as an argument.
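One way to picture keeping constants off the graph is a side table that stores them and hands back an index. The sketch below is illustrative only: `Proxy`, `ConstantArgsTable`, and `split_kernel_kwargs` are hypothetical names, not the PR's API.

```python
from typing import Any, Dict, List, Tuple


class Proxy:
    """Hypothetical marker for a value that lives in the traced graph."""

    def __init__(self, name: str) -> None:
        self.name = name


class ConstantArgsTable:
    """Hypothetical side table that stores constants outside the graph."""

    def __init__(self) -> None:
        self._entries: List[Dict[str, Any]] = []

    def add(self, constants: Dict[str, Any]) -> int:
        self._entries.append(dict(constants))
        return len(self._entries) - 1

    def get(self, idx: int) -> Dict[str, Any]:
        return self._entries[idx]


def split_kernel_kwargs(
    kwargs: Dict[str, Any], table: ConstantArgsTable
) -> Tuple[Dict[str, Proxy], int]:
    """Keep graph values as graph args; park everything else in the table."""
    graph_args = {k: v for k, v in kwargs.items() if isinstance(v, Proxy)}
    constants = {k: v for k, v in kwargs.items() if not isinstance(v, Proxy)}
    return graph_args, table.add(constants)
```

Only the integer index needs to appear on the graph, so constants of arbitrary type never have to be represented as graph inputs.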
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122140
Approved by: https://github.com/jansel