This changes cached thread_local tensors to stack-allocated buffers. Since we were incidentally caching outputs in a thread_local, I had to add manual thread_local caching of outputs: I cache a buffer and a Tensor whose storage is that buffer, and memcpy the result into the cached buffer every time. Ideally, memory planning would be able to identify allocations that are the backing storage for outputs, but this should be good enough in the absence of planning.
Differential Revision: [D50416438](https://our.internmc.facebook.com/intern/diff/D50416438/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112116
Approved by: https://github.com/jansel, https://github.com/desertfire
**Summary**
Enable the qlinear weight prepack when the input dimension size exceeds 2. In that case, there are extra reshape nodes before and after the `addmm` or `mm` node.
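For illustration, here is a minimal sketch (hypothetical shapes, not taken from this PR) of how a linear on a >2-D input decomposes into the reshape → `addmm`/`mm` → reshape pattern that the prepack matcher now has to recognize:
```python
import torch

# Minimal sketch: a linear applied to a 3-D input lowers to
# reshape -> addmm -> reshape, the extra pattern the matcher must handle.
x = torch.randn(2, 16, 32)            # input with dimension size > 2
w = torch.randn(64, 32)
b = torch.randn(64)

x_2d = x.reshape(-1, x.shape[-1])      # (32, 32): collapse leading dims
y_2d = torch.addmm(b, x_2d, w.t())     # plain 2-D addmm
y = y_2d.reshape(*x.shape[:-1], 64)    # restore leading dims: (2, 16, 64)
```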
**Test Plan**
```
python -m pytest test_mkldnn_pattern_matcher.py -k input_dim_exceeds_2
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113928
Approved by: https://github.com/jgong5, https://github.com/eellison
ghstack dependencies: #113733, #113912
**Summary**
The previous QLinear implementation assumed 2-dimensional inputs. This update modifies QLinear to accept inputs with more than 2 dimensions, adding input and output reshaping accordingly.
**Test Plan**
```
python -u -m pytest -s -v test_quantized_op.py -k test_qlinear_pt2e
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113733
Approved by: https://github.com/jgong5, https://github.com/eellison
This adds the `ir.Scan` node (currently only supported on CUDA) which re-uses the existing reduction kernel machinery to support different kinds of non-pointwise ops. Just like reductions it supports prologue and epilogue fusions and has both persistent and non-persistent kernel generation.
Currently this doesn't support the equivalent of `Reduction.create_multilayer` and will instead fall back to eager in those cases. This is because splitting into multiple kernel invocations ends up being far slower than cub's single kernel strategy which matches the performance of a copy kernel.
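As a usage sketch (assuming a CUDA device and that the op lowers through the new node), a scan such as `cumsum` with a pointwise prologue and epilogue can now compile into a single Inductor kernel:
```python
import torch

# Hedged sketch: cumsum is the canonical scan; the pointwise prologue (* 2)
# and epilogue (relu) can fuse into the same generated kernel.
@torch.compile
def scan_with_fusion(x):
    return torch.relu(torch.cumsum(x * 2, dim=-1))

out = scan_with_fusion(torch.randn(1024, 1024, device="cuda"))
```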
Fixes https://github.com/pytorch/pytorch/issues/93631
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106581
Approved by: https://github.com/lezcano, https://github.com/atalman
Summary: When the model takes no inputs, AOTInductor relies on checking the weights to figure out which device to compile the model for. Currently, recording the buffer device type happens too late; this PR fixes that.
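A minimal example of the situation (hypothetical module, not from this PR): the model takes no inputs, so the only device information comes from its buffers/weights:
```python
import torch

# Hypothetical no-input model: the device can only be inferred from the buffer.
class NoInputModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.register_buffer("seed", torch.randn(4, 4, device="cuda"))

    def forward(self):
        return self.seed * 2
```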
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114682
Approved by: https://github.com/chenyang78
`ir.Wait` generates the last 2 lines of this code:
```python
buf1_work = dist.all_gather_into_tensor(buf1[0], buf1_inputs[0], async_op=True, group=buf1_pg)
fun_col_impl._register_tensor_work(buf1, buf1_work)
buf2 = buf1[0]
del buf1
buf2 = _wait_tensor(buf2) # <- generated by ir.Wait
buf3 = buf2; # reuse <- generated by ir.Wait
```
`_wait_tensor` technically is a "mutation" op that changes `buf2` in place. So we should mark `ir.Wait` as a mutation op (by overriding its `get_mutation_names()`).
This fixes a very peculiar issue when Inductor comm reordering is used for the llama model: downstream nodes that use the all-gather comm output sometimes take a dependency on `buf2` (the node before `ir.Wait`) instead of on `buf3` (`ir.Wait`); it's still unclear why it behaves like this. To work around the issue, we add the missing annotation that `buf3` is a mutation of `buf2`, so that the scheduler knows to schedule `buf3` before any of `buf2`'s users.
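A hedged sketch of the annotation (a simplified stand-in, not the actual `torch._inductor.ir.Wait` class):
```python
from typing import List

# Simplified stand-in for ir.Wait: the key change is that the node reports
# its input buffer as mutated, so the scheduler orders buf2's users after buf3.
class WaitSketch:
    def __init__(self, input_buffer_name: str):
        self.input_buffer_name = input_buffer_name

    def get_mutation_names(self) -> List[str]:
        # "buf3 is a mutation of buf2"
        return [self.input_buffer_name]
```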
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115104
Approved by: https://github.com/wanchaol
torch.split(x, l) fails when the sizes in l are unbacked symints. E.g. l = y.tolist() makes l unbacked, because l depends on the data of y. The downstream call `SliceView.create()` evaluates the shape even if the input shape is an unbacked symint, which triggers the bug.
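A minimal repro-style sketch of the failing pattern (assuming unbacked symint capture is enabled via `torch._dynamo.config`):
```python
import torch

torch._dynamo.config.capture_scalar_outputs = True
torch._dynamo.config.capture_dynamic_output_shape_ops = True

@torch.compile(fullgraph=True)
def f(x, y):
    l = y.tolist()             # sizes become unbacked symints (data-dependent)
    return torch.split(x, l)   # previously hit the SliceView.create() issue

out = f(torch.randn(10), torch.tensor([3, 3, 4]))
```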
Test Plan:
python test/inductor/test_unbacked_symints.py -k test_split_with_sizes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113406
Approved by: https://github.com/aakhundov, https://github.com/ezyang
This reduces compile time of Adam on 1k parameters from 180s to 140s (28%), the main reason being that thousands of buffers no longer get sent to the scheduler.
The idea behind this is that if a destination buffer (from a copy_) has no users, it shouldn't matter if dst aliases src.
This is implemented by reinplacing copy_ nodes when safe.
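For illustration (a hypothetical toy update, not the PR's pass): a compiled optimizer-style step emits one `copy_` per parameter, and since each destination has no other users in the graph, those copies can be re-inplaced instead of becoming scheduler buffers:
```python
import torch

# Toy compiled update: each p.copy_(...) writes back into a graph input whose
# destination buffer has no other users, so the copy can be re-inplaced.
@torch.compile
def sgd_step(params, grads, lr=0.1):
    for p, g in zip(params, grads):
        p.copy_(p - lr * g)

params = [torch.zeros(16, device="cuda") for _ in range(4)]
grads = [torch.ones(16, device="cuda") for _ in range(4)]
sgd_step(params, grads)
```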
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112440
Approved by: https://github.com/jansel
Summary: `None` arguments are codegened as `*i8` in the `triton_meta` of the generated or user-defined Triton kernels:
85aa372374/torch/_inductor/codegen/triton_utils.py (L33-L36)
Because of this, contrary to conventional Triton, we actually need to pass `nullptr` to the Triton kernels in the C++ wrapper codegen instead of passing nothing (normally `None` doesn't make it into the generated PTX parameters, just like `tl.constexpr` args).
This PR adds two things:
1. Proper C++ wrapper codegen (ABI and non-ABI) of `nullptr` and `c10::nullopt`, as the prior way of codegening `c10::nullopt` as a tensor breaks (and `c10` also breaks in ABI mode).
2. Skipping `tl.constexpr` args when calling the loaded-from-cubin compiled Triton kernel in the C++ wrapper codegen. As a side effect, this also resolves an issue with string arguments: now they are simply omitted in the C++ wrapper codegen.
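A hedged example of the kind of user-defined kernel affected (hypothetical kernel, not the one in the test): an optional pointer argument may be `None` at the call site, and the `None` and `tl.constexpr` args then need the treatment described above in the C++ wrapper:
```python
import torch
import triton
import triton.language as tl

# Hypothetical kernel with an optional pointer arg; when bias_ptr is None it
# is never dereferenced (guarded by the constexpr flag).
@triton.jit
def add_bias(x_ptr, bias_ptr, out_ptr, n, HAS_BIAS: tl.constexpr, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    if HAS_BIAS:
        x = x + tl.load(bias_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x, mask=mask)

def run(x, bias=None):
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    add_bias[grid](x, bias, out, n, HAS_BIAS=bias is not None, BLOCK=1024)
    return out
```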
Test Plan:
```
$ python test/inductor/test_aot_inductor.py -k test_triton_kernel_with_none_input
...
----------------------------------------------------------------------
Ran 4 tests in 40.364s
OK (skipped=2)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114475
Approved by: https://github.com/oulgen
This needs some special handling because we don't actually allocate
boolean symbols in sympy; we allocate an integer indicator variable.
See comment for more details.
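A conceptual sketch of the indicator-variable idea (illustrative only, not the ShapeEnv code):
```python
import sympy

# A boolean unbacked symbol is modeled as a 0/1 integer indicator variable,
# since the surrounding machinery works on integer symbols rather than boolean ones.
u0 = sympy.Symbol("u0", integer=True, nonnegative=True)  # takes values 0 or 1

as_true = sympy.Eq(u0, 1)    # "the boolean is True"
as_false = sympy.Eq(u0, 0)   # "the boolean is False"
```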
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114157
Approved by: https://github.com/ydwu4
Previously it lacked a type hint and so was treated as an Any type. This
resulted in a lot of untyped code downstream as V.graph is referenced in
many places in inductor code. I've typed it properly now as
GraphLowering, and fixed the numerous type errors this surfaced.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114025
Approved by: https://github.com/eellison
ghstack dependencies: #114013
Summary:
As discussed in the meeting, we are inverting the policy on the use of proxy executor for aten fallbacks.
By default, aten fallback ops will use proxy executor, unless a c-shim is available.
Test Plan: CIs
Differential Revision: D51417683
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113918
Approved by: https://github.com/chenyang78
I used a couple of type-ignore comments in ir.py because it constructs
short-lived instances of FixedLayout and GraphModuleSerializer, just to
call a single method on them that doesn't use all their members. Making
those unused members optional would make the rest of the code a lot
messier with sprinkled `assert` statements.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113534
Approved by: https://github.com/albanD
In our first implementation of the sdpa shim API, we didn't consider
the case where the optional scale argument could be None. It went
unnoticed because we always got a default argument for the cuda backend.
The issue was detected with the cpu backend.
This PR implements versioning for shim kernels. Currently, we only
have different versions for the sdpa API. We expect to maintain only
a very small number of ABI-compatible shim APIs with different versions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113487
Approved by: https://github.com/int3, https://github.com/desertfire
This PR adds Inductor support for [native c10d_functional ops](https://github.com/pytorch/pytorch/pull/110570).
The Inductor IRs introduced in this PR will replace the existing `CollectiveKernel` IR hierarchy. Compared to the existing collective IRs, the new IRs:
- Are target language agnostic and support AOTInductor.
- Express the constraints solely with read/write deps. This maximizes the potential for buffer reuse.
- Address an issue where out-of-place collective's input buffers could be mutated while being volatilely read.
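A hedged usage sketch of the entry point these IRs lower (assumes an initialized process group; names follow `torch.distributed._functional_collectives`):
```python
import torch
import torch.distributed as dist
import torch.distributed._functional_collectives as funcol

# Compiled function using a functional collective; Inductor lowers the
# all_reduce/wait pair to the new collective IR nodes.
@torch.compile
def allreduce_then_scale(x):
    y = funcol.all_reduce(x, "sum", dist.group.WORLD)
    return y * 0.5
```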
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112439
Approved by: https://github.com/Chillee
This was originally ipiszy's PR: https://github.com/pytorch/pytorch/pull/112358
It turns out that we need to add support for optional types in order to
support fp8 gemm (i.e. scaled_mm). Since our ABI-stable C interface
can't support optional<> directly, I am passing in optional types via
pointer instead.
`AtenTensorHandle`s are already pointers, so nothing needs to change
there. Only value types need to change.
We decided on this approach instead of adding an extra `bool` param to
the callee because this simplifies things. Having the same number of
arguments regardless of whether we are emitting Python / C++ /
ABI-compatible C++ makes codegen easier.
There are a number of existing ABI-compatible functions that have
optional-typed value parameters. Previously, they just assumed they
would never be passed a `nullopt` / `None` at runtime. Changing them to
use pointer types now would break ABI stability, so I have created an
exclude list for those functions.
Finally, I think the current implementation is kind of messy, and only
works for FallbackKernels, even though technically ExternKernels could
also have the same issue. It also doesn't support optional types nested
in lists. I've left FIXME comments for both issues.
Differential Revision: [D51084289](https://our.internmc.facebook.com/intern/diff/D51084289)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112527
Approved by: https://github.com/chenyang78, https://github.com/desertfire
Summary:
This PR adds epilogue fusion code generation support for the new experimental
[Inductor Cutlass backend](https://github.com/pytorch/pytorch/pull/108015).
Details:
A fusion happens at the GEMM template level by taking a Cutlass 3.x GEMM Universal Matmul Kernel template
and adding a custom template functor based on Cutlass's new "Epilogue Visitor Trees" (EVT) on top, which represents and
performs the computation of the fused pointwise / elementwise computation nodes.
This is the approach dictated by [NVIDIA/cutlass example 49](https://github.com/NVIDIA/cutlass/blob/main/examples/49_hopper_gemm_with_collective_builder/49_collective_builder.cu),
which is currently the only documentation and example of Cutlass Epilogue Visitor Trees.
This EVT functor in turn is a hierarchical template expression which represents an abstract syntax tree of the fused computation to perform.
A second codegen task is to create a hierarchical initializer expression, which provides potentially necessary arguments
to each of the functor subexpressions.
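For illustration, a hedged sketch of the kind of fusion this targets (the config knobs below reflect `torch._inductor.config` and should be treated as assumptions; the backend must be explicitly enabled):
```python
import torch
import torch._inductor.config as inductor_config

# Assumed configuration for enabling the experimental CUTLASS backend.
inductor_config.max_autotune = True
inductor_config.max_autotune_gemm_backends = "CUTLASS"

@torch.compile
def gemm_relu(a, b):
    # The ReLU epilogue is the pointwise chain fused into the GEMM via an EVT functor.
    return torch.relu(a @ b)

a = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
b = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
out = gemm_relu(a, b)
```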
Step 1 functionality:
* End to end code generation is possible using the above approach.
* Supports simple elementwise expression fusion of chains of elementwise operations (with scalar constants)
after a matmul.
* Elementwise operation support includes addition, subtraction, multiplication, division, minimum, maximum, etc.
* Examples / Unit tests include ReLU and ReLU6 fusion.
* Support for fp16 and fp16 with fp32 accumulation data types.
* Generates SM90 (Hopper) based CUDA kernels (Cutlass up to 3.2.0 only supports EVT for SM90)
The following is not yet supported, and is left for future work:
* Full operation support (e.g. the full set of ops usually handled via V.ops handlers)
* Cutlass EVT with SM80 support (possible in Cutlass 3.2.1 according to the release notes, but not yet documented)
* Add support for additional (auxiliary) inputs, which changes the Template Kernels' call signature
* Add support for additional (auxiliary) outputs (requires support for full computation graphs)
* Add support for reduction operations and operations which use different output layouts than the input
* Add support for additional dtypes (as far as Cutlass allows)
This PR updates third_party/cutlass to v3.2.2, which has some important improvements and features
for the inductor backend.
See also Cutlass release notes:
https://github.com/NVIDIA/cutlass/releases/tag/v3.2.1 and https://github.com/NVIDIA/cutlass/releases/tag/v3.2.2
Notable changes in Cutlass 3.2.1 include:
* Cutlass codegen python code has moved into a package with the "cutlass_library" namespace, which makes it possible to
prevent namespace clashes without resorting to monkey-patching (which was done earlier).
* Support for SM80 epilogue visitor trees (according to the release notes, not tried yet)
* Small API changes to the cutlass_library API (requires adapting the inductor backend code)
Notable changes in Cutlass 3.2.2 include:
* Bugfix for an issue that led to a CUDA illegal memory access in some PyTorch unit tests involving flash attention
Test Plan:
* CI
* pytest test/inductor/test_max_autotune.py
Note: So far, the CUTLASS backend is still disabled by default. Benchmarks are planned once more advanced fusions are enabled.
Differential Revision: [D50988161](https://our.internmc.facebook.com/intern/diff/D50988161)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110890
Approved by: https://github.com/jansel
ghstack dependencies: #112762