This adds support for backwards hooks that are *both*:
1) Interior to the graph; and
2) Dynamically generated (e.g. lambdas)
We do this by creating a BackwardState object that is used to register the hooks in the forward, then populated by dynamo *after* the forwards runs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120382
Approved by: https://github.com/xmfan
Summary: `torch.cond` is already supported in Dynamo and Export: the `true_fn` and `false_fn` subgraphs are traced as child fx graphs of the main graph and passed to the `torch.cond` higher-order operator in the fx graph. However, this breaks in Inductor, which has no way to handle child fx subgraphs or to lower and codegen them properly.
In this PR, we add `torch.cond` support in Inductor. This is achieved by adding subgraph lowering and codegen infrastructure, as well as a new `Conditional` IR node type that weaves the parent graph together with the true and false child subgraphs.
Here we only implement `torch.cond` support in JIT Inductor (Python wrapper codegen). The implementation in AOT Inductor (C++ wrapper codegen), including ABI-compatibility mode, will follow.
Test Plan:
```
$ python test/inductor/test_control_flow.py
...
----------------------------------------------------------------------
Ran 24 tests in 86.790s
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119759
Approved by: https://github.com/jansel, https://github.com/eellison
Summary: The codegen of `with torch.cuda._DeviceGuard` context manager in the Python wrapper code is implemented via `device_cm_stack: contextlib.ExitStack()`. As the context managers in the stack are `code.indent()`, this means that the whole stack is unindented at once on `device_cm_stack.close()`. This becomes problematic when attempting to codegen indented code (e.g., for control flow in Python and / or nested subgraph codegen-ing).
In this PR, we refactor the device guard codegen-ing in Python by replacing the `device_cm_stack` by explicit indent and unindent calls for entering and exiting the `with torch.cuda._DeviceGuard` context manager. This allows for nested device guard context managers and better aligns with other indented codegen-ing intertwined with it (e.g., for nested subgraph codegen-ing).
This is necessary for the upcoming support for `torch.cond` (and other control flow operators) in Inductor. Until then, the only change in the Python wrapper codegen is that `return outputs` now happens outside the `with torch.cuda._DeviceGuard` context manager.
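To illustrate the difference between the two approaches, here is a minimal sketch with a toy indented buffer. `IndentedBufferSketch` and both `codegen_*` functions are hypothetical stand-ins, not Inductor's actual classes; the emitted lines are only strings.

```python
import contextlib


class IndentedBufferSketch:
    """Toy codegen buffer (hypothetical, not Inductor's IndentedBuffer)."""

    def __init__(self):
        self.lines = []
        self.level = 0

    def writeline(self, line):
        self.lines.append("    " * self.level + line)

    def do_indent(self):
        self.level += 1

    def do_unindent(self):
        self.level -= 1

    @contextlib.contextmanager
    def indent(self):
        self.do_indent()
        try:
            yield
        finally:
            self.do_unindent()

    def getvalue(self):
        return "\n".join(self.lines)


def codegen_with_exit_stack():
    # Every indent lives in one ExitStack, so close() pops all of them at once.
    buf = IndentedBufferSketch()
    stack = contextlib.ExitStack()
    buf.writeline("with torch.cuda._DeviceGuard(0):")
    stack.enter_context(buf.indent())
    buf.writeline("with torch.cuda._DeviceGuard(1):")
    stack.enter_context(buf.indent())
    buf.writeline("buf0 = kernel(x)")
    stack.close()  # back to level 0; there is no way to leave only the inner guard
    buf.writeline("return (buf0, )")
    return buf.getvalue()


def codegen_with_explicit_indent():
    # Explicit indent/unindent calls let each guard be exited on its own.
    buf = IndentedBufferSketch()
    buf.writeline("with torch.cuda._DeviceGuard(0):")
    buf.do_indent()
    buf.writeline("with torch.cuda._DeviceGuard(1):")
    buf.do_indent()
    buf.writeline("buf0 = kernel(x)")
    buf.do_unindent()  # leave the inner guard only
    buf.writeline("buf1 = other_kernel(buf0)")
    buf.do_unindent()  # leave the outer guard
    buf.writeline("return (buf1, )")
    return buf.getvalue()
```

With the explicit calls, `buf1 = other_kernel(buf0)` lands inside the outer guard but outside the inner one, which the single-stack version cannot express.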
Test Plan: CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119673
Approved by: https://github.com/peterbell10
Summary: Currently, when a custom (user-written) Triton kernel has a ReinterpretView argument in IR, we're always skipping the alignment checking for this argument when preparing the `signature_of` for the AOT compilation of the Triton kernel (via setting `TensorArg.check_alignment` to `False`). This is problematic for user-written kernels where, albeit reinterpreted, the argument of the Triton kernel (the data pointer) can still be aligned to 16. When we skip alignment checking, the performance of the AOT-compiled internal Triton kernels can degrade 2x--3x.
In this PR, we replace `TensorArg.check_alignment` by `TensorArg.offset`, in which we specify the offset of the `ReinterpretView.layout` relative to the underlying `ir.Buffer` (corresponding to the data pointer before reinterpretation). As the size and stride of the layout don't change the alignment properties, those can be skipped. Importantly, for `ReinterpretView` arguments of custom Triton kernels, we use `arg.data.get_name()` as the buffer name. That, together with the offset, is used to check the alignment.
Bonus: the namedtuples in `codegen/common.py` are refactored as `dataclass`es, with nicer type hints and default values (for the newly added `TensorArg.offset`).
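As a rough sketch of the idea (not the actual Inductor code), the alignment of a reinterpreted argument can be decided from the underlying buffer plus the view's offset alone. `TensorArgSketch` and `is_aligned` below are hypothetical names, and the sketch assumes the underlying buffer's data pointer is itself 16-byte aligned and that the offset is counted in elements.

```python
from dataclasses import dataclass

ALIGNMENT = 16  # bytes, matching the "aligned to 16" condition above


@dataclass
class TensorArgSketch:
    """Hypothetical stand-in for a dataclass-style kernel argument."""

    name: str          # argument name in the kernel signature
    buffer: str        # name of the underlying buffer (before reinterpretation)
    itemsize: int      # bytes per element
    offset: int = 0    # view offset in elements, defaulting to no offset


def is_aligned(arg: TensorArgSketch) -> bool:
    # Size and stride of the view do not move the data pointer, so only
    # the byte offset relative to the (assumed aligned) buffer matters.
    return (arg.offset * arg.itemsize) % ALIGNMENT == 0
```

For example, a float32 view at offset 4 starts 16 bytes into the buffer and stays aligned, while the same view at offset 1 does not.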
Test Plan:
```
$ python test/inductor/test_aot_inductor.py -k test_triton_kernel_reinterpret_view
...
----------------------------------------------------------------------
Ran 6 tests in 27.952s
OK (skipped=4)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119649
Approved by: https://github.com/oulgen
This PR adds a new type of Triton kernel in which data is persistent but the
reduction dimension is split over multiple blocks (up to the entire kernel).
Though this is called a reduction dimension, in actuality we only support scans.
Because of this limitation, I have to be able to block fusions of split scan
operations with reductions, so I chose to add a new `ir.SplitScan` node which
is otherwise identical but allows for differentiation in the scheduler.
The split scan kernel is also the first to require an additional workspace buffer,
which is used to communicate between CUDA blocks. This is slightly tricky, as
the exact scratch space requirement isn't known until the grid size is calculated.
Here I work around the issue by setting a minimum rblock size and always allocating
for the maximum possible grid size for a given input tensor.
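The sizing workaround can be sketched as follows. This is only an illustration of the arithmetic under stated assumptions: `max_workspace_nbytes`, `xnumel`, `rnumel`, and `bytes_per_block` are hypothetical names, and the real per-block scratch layout is not described here.

```python
def ceildiv(a: int, b: int) -> int:
    """Integer ceiling division."""
    return -(-a // b)


def max_workspace_nbytes(xnumel: int, rnumel: int, min_rblock: int,
                         bytes_per_block: int) -> int:
    # The smallest allowed rblock yields the largest number of blocks along
    # the scan dimension, so sizing for it is safe for any grid chosen later.
    max_rblocks = ceildiv(rnumel, min_rblock)
    return xnumel * max_rblocks * bytes_per_block
```

For instance, with `rnumel=1000` and `min_rblock=256`, at most 4 blocks cover the scan dimension, so a tensor with `xnumel=8` needs room for 32 per-block slots regardless of the grid picked at launch.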
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117992
Approved by: https://github.com/jansel
ghstack dependencies: #117991
## Problem
A user-defined Triton kernel grid may use a sympy magic method like `Max`. This comes in the form of a `sympy.Expr`, namely `sympy.core.function.FunctionClass`.
Handling this is not trivial since `user_defined_kernel_grid_fn_code` is used in Eager & Inductor. Eager usage below.
## Approach
Pass in wrapper when Inductor codegens grid with ints/sympy.Expr, so we can utilize wrapper functions, such as `codegen_shape_tuple()`.
Differential Revision: D53367012
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119165
Approved by: https://github.com/aakhundov
Differential Revision: D53398312
## Problem
Currently, if a sympy expression that uses a magic method like `Max` is passed as an argument to ProxyExecutor, then C++ compilation will fail. We need to emit `std::max` instead.
```
# What we see
aoti_torch_proxy_executor_call_function(..., std::vector<int64_t>{Max(1025, u1)}.data(), ...);
# What we want
aoti_torch_proxy_executor_call_function(..., std::vector<int64_t>{std::max(1025L, u1)}.data(), ...)
```
## Approach
Use the C++ wrapper's expression printer to handle this conversion.
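A minimal sketch of what such a printer does, using a tiny hypothetical expression tree (`Sym`, `Max`, `cpp_print`) instead of sympy so that it stays self-contained. It is not the actual Inductor printer, only an instance of the same conversion.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Sym:
    """Hypothetical symbol node, e.g. an unbacked symint like u1."""
    name: str


@dataclass(frozen=True)
class Max:
    """Hypothetical Max node with two or more arguments."""
    args: Tuple[object, ...]


def cpp_print(expr) -> str:
    if isinstance(expr, int):
        return f"{expr}L"  # integer literals get the long suffix
    if isinstance(expr, Sym):
        return expr.name
    if isinstance(expr, Max):
        parts = [cpp_print(a) for a in expr.args]
        out = parts[0]
        # std::max is binary here, so fold the arguments left to right.
        for part in parts[1:]:
            out = f"std::max({out}, {part})"
        return out
    raise TypeError(f"unsupported expression: {expr!r}")


print(cpp_print(Max((1025, Sym("u1")))))  # std::max(1025L, u1)
```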
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119166
Approved by: https://github.com/aakhundov
Summary: `generate_index_put_fallback` currently generates something like the following:
```
AtenTensorHandle tensor_handle_array_1[] = {nullptr, nullptr, arg1_1, wrap_with_raii_handle_if_needed(tmp_tensor_handle_0)};
```
The problem is that `wrap_with_raii_handle_if_needed` creates a `RAIIAtenTensorHandle` that lives only for the duration of this temporary array initialization. Once the initialization is done, the `RAIIAtenTensorHandle` is destroyed and releases the underlying Tensor. When `tensor_handle_array_1` is later passed to `aoti_torch_index_put_out`, some of its `AtenTensorHandle` elements are invalid, causing a segfault.
Differential Revision: [D53339348](https://our.internmc.facebook.com/intern/diff/D53339348)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118963
Approved by: https://github.com/aakhundov
Like #110312 but we also run this check when backed symints are in the grid (e.g. s1 / 512)
### Why?
Let's say we lower a model and generate a GPU kernel grid with symbolic shapes, e.g. `s1 / 512`. If we later run the lowered model with inputs such that `s1 = 0`, we'll launch the kernel with a `0`-sized grid. This surfaces as `CUDA driver error: invalid argument`.
To avoid this, we check for a `0`-sized grid whenever there are symbolic shapes, which includes both backed and unbacked symints.
This adds non-zero overhead on the CPU. However, in return, we get better reliability when encountering this scenario, which happened when serving an internal model.
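The guard amounts to something like the sketch below. `run_kernel` and its `launch` callback are hypothetical, and the use of ceiling division for the grid is an assumption made for illustration.

```python
def ceildiv(a: int, b: int) -> int:
    """Integer ceiling division."""
    return -(-a // b)


def run_kernel(s1: int, launch, block: int = 512):
    # The grid depends on a symbolic size, so it is only known at run time.
    grid = (ceildiv(s1, block), 1, 1)
    if any(dim == 0 for dim in grid):
        # Nothing to compute; launching would be an invalid argument.
        return None
    return launch(grid)
```

The extra `any(...)` check is the CPU-side overhead mentioned above; it runs on every call whose grid contains symbolic shapes.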
### Test
```
$ python test/inductor/test_aot_inductor.py -k test_zero_grid_with_unbacked_symbols
OK (skipped=3)
$ python test/inductor/test_aot_inductor.py -k test_zero_grid_with_backed_symbols
# Before
Error: CUDA driver error: invalid argument
FAILED (errors=2, skipped=3)
# Now
OK (skipped=3)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118654
Approved by: https://github.com/chenyang78, https://github.com/desertfire
Summary:
Add runtime constant folding for AOTInductor.
This also includes the invocation of constant folding at load time.
The constant folding lowering is a 2-step process.
First, we split the graph into 2 modules. One of them is the constant module, which doesn't depend on any input, so the whole module can be inferred (constant-folded) once and reused. The constant module is lowered and codegen-ed as usual, and cached (let's call this the constant code). The constant code reuses the whole lowering/profiling/etc. process; the only difference is that we do not generate any headers or initialization for it.
Second, after handling the constant module, we take care of the main module (the part that depends on the user input). Compared with a normal lowering, the main module takes one additional component: the constant code. The additional step here is that we inject the constant code into the codegen-ed main module and create the caller through which the main module consumes the result of the constant module.
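The first step, splitting the graph, can be sketched on a toy dependency graph. `split_graph` is a hypothetical helper, and the graph is assumed to be a dict of node name to dependency names in topological order; it is not the actual splitting pass.

```python
def split_graph(nodes, user_inputs):
    """Partition nodes into (constant part, main part).

    A node belongs to the main part if it depends, directly or
    transitively, on a user input; everything else can be folded once.
    """
    depends_on_input = set(user_inputs)
    const_part, main_part = [], []
    for name, deps in nodes.items():
        if any(dep in depends_on_input for dep in deps):
            depends_on_input.add(name)
            main_part.append(name)
        else:
            const_part.append(name)
    return const_part, main_part


# "w" and "b" are weights, "x" is the only user input.
toy_graph = {
    "w_t": ["w"],
    "w_scaled": ["w_t"],
    "y": ["x", "w_scaled"],
    "out": ["y", "b"],
}
const_part, main_part = split_graph(toy_graph, user_inputs={"x"})
print(const_part)  # ['w_t', 'w_scaled']
print(main_part)   # ['y', 'out']
```

In this toy case the main part consumes `w_scaled` as if it were another constant, which mirrors the main module consuming the result of the constant module.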
Test Plan: Unit tests included in commit.
Differential Revision: D53274382
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118765
Approved by: https://github.com/chenyang78
Fixes https://github.com/pytorch/pytorch/issues/118129
Suppressions automatically added with
```
import re
with open("error_file.txt", "r") as f:
    errors = f.readlines()
error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type
for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f" # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Co-authored-by: Catherine Lee <csl@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118533
Approved by: https://github.com/Skylion007, https://github.com/zou3519
Summary:
### Context
It's possible for the args of a user-defined Triton kernel to be codegen-ed twice, but this only happens if the arg is a `ReinterpretView`.
* First via `arg.codegen_reference()` in `define_user_defined_triton_kernel()`
* Second in `self.codegen_kwargs()`.
When using `abi_compatible=True`, the duplicate codegen will look like the code below. The issue in the code is that one of the Tensors, internal to the graph, isn't properly freed. This scenario was eventually exposed as a memory leak when we re-ran an AOTInductor model many times and observed `memory.used` increase after each iteration.
```
auto tmp_tensor_handle_0 = reinterpret_tensor_wrapper(buf1, 2, int_array_0, int_array_1, 0L);
auto tmp_tensor_handle_1 = reinterpret_tensor_wrapper(buf1, 2, int_array_0, int_array_1, 0L);
...
// There's no wrap_with_raii_handle_if_needed() for tmp_tensor_handle_0.
// And there's no reference to tmp_tensor_handle_0.
// Thus, tmp_tensor_handle_0 is left as an AtenTensorHandle which isn't
// automatically cleaned-up like RAIIAtenTensorHandle
CUdeviceptr var_6;
aoti_torch_get_data_ptr(wrap_with_raii_handle_if_needed(tmp_tensor_handle_1), reinterpret_cast<void**>(&var_6));
void* kernel_args_var_2[] = {..., &var_6, ...};
launchKernel(kernels.add_kernel_0, ..., kernel_args_var_2);
```
### Solution
We just need the arg's buffer name when creating the `TensorArg` in `define_user_defined_triton_kernel()`. Thus, just return the buffer's name and avoid any potential side-effects with `arg.codegen_reference()`.
Test Plan:
### Inspect device memory allocated
```
# Before diff
0 device memory 2048
1 device memory 2560
2 device memory 3072
3 device memory 3584
4 device memory 4096
5 device memory 4608
# With diff (memory usage doesn't grow)
0 device memory 1536
1 device memory 1536
2 device memory 1536
3 device memory 1536
4 device memory 1536
5 device memory 1536
```
Reviewed By: jingsh, tissue3
Differential Revision: D53190934
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118569
Approved by: https://github.com/oulgen
dmypy silently ignores follow_imports = skip, so to get parity between
dmypy and mypy we have to suck it up and type: ignore all of the sympy
typing problems.
The suppressions were added automatically with the following script generated by GPT-4:
```
import re
# Read the error file
with open("error_file.txt", "r") as f:
    errors = f.readlines()
# Parse the lines with errors and error types
error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type
# Insert ignore comments in the source files
for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f" # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118469
Approved by: https://github.com/Skylion007
ghstack dependencies: #118414, #118418, #118432, #118467, #118468
- Add `darwin` to the list of supported platforms
- Add `#include <sstream>` to `aoti_runtime/model.h`
- Refactor Linux specific constant compilation logic to `_compile_consts_linux`
- Add `_compile_consts_darwin` that converts consts to .S file that is linked into a shared library
- Patch file using magic to avoid converting bytes to large hexadecimal string
- Generate integer constants with `LL` suffix on MacOS (corresponds to int64_t definition)
- Enable test_aot_inductor.py tests on MacOS
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118076
Approved by: https://github.com/desertfire
ghstack dependencies: #118077
Summary: Using faster binding following https://github.com/pytorch/pytorch/pull/117500. torch.utils.cpp_extension.load_inline builds a lot of things and is very slow. With this change, later we can further reduce the included header files using the ABI-compatible mode and thus further speed up the compilation.
Result:
```
python test/inductor/test_cuda_cpp_wrapper.py -k test_relu_cuda_cuda_wrapper
Before: Ran 1 test in 32.843s
After: Ran 1 test in 26.229s
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117693
Approved by: https://github.com/jansel
Summary: The added test case ends up emitting Inductor IR as the buffer string; let's properly emit the buffer name instead.
Test Plan: added new test
Differential Revision: D52899373
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117838
Approved by: https://github.com/aakhundov
For a persistent reduction, we generate 2 flavors of 'equivalent' kernels at the same time:
- persistent reduction
- regular reduction
A MultiKernel wraps these 2 kernels and picks the one with better performance at runtime.
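Conceptually, the runtime selection works like the sketch below. `MultiKernelSketch` is a hypothetical plain-Python illustration, with two ordinary functions standing in for the two kernel flavors; it is not the generated Triton code.

```python
import time


class MultiKernelSketch:
    """Wrap equivalent callables and keep whichever benchmarks fastest."""

    def __init__(self, kernels, benchmark=None):
        self.kernels = list(kernels)
        # The benchmark is injectable so the choice can be made deterministic.
        self.benchmark = benchmark or self._time_once
        self.picked = None

    @staticmethod
    def _time_once(fn, args):
        start = time.perf_counter()
        fn(*args)
        return time.perf_counter() - start

    def __call__(self, *args):
        if self.picked is None:
            # First call: time every flavor, then remember the winner.
            timings = [self.benchmark(k, args) for k in self.kernels]
            self.picked = self.kernels[timings.index(min(timings))]
        return self.picked(*args)


def persistent_sum(xs):
    return sum(xs)


def looped_sum(xs):
    total = 0
    for x in xs:
        total += x
    return total
```

Both flavors must accept the same arguments for this to work, which is why the consistency of the argument list matters.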
Here I talk more about implementation details:
- Inductor maintains state for generating kernels, e.g. the wrapper code. After we generate code for one kernel, we need to restore the Inductor state before we can generate the counterpart.
***There is one thing I need some comments from others on***:
There is one tricky thing about kernel arguments. In general, Inductor removes a buffer from the argument list if it's only used inside the kernel. But somehow a buffer removed by the persistent reduction kernel may still be kept by the regular (non-persistent) reduction kernel because of some CSE invalidation rule. My current implementation avoids removing buffers if multi_kernel is enabled. This makes sure both flavors of reduction have a consistent argument list. Another idea I have is to generate the multi-kernel definition with the union of arguments from both sub-kernels and let each sub-kernel pick the subset of arguments it wants. But this would make the codegen for multi-kernel much more complex.
I'm not sure if there is an easy and clean way to resolve this.
Testing command:
```
TORCHINDUCTOR_MULTI_KERNEL=1 TORCH_LOGS=+torch._inductor.graph TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 python benchmarks/dynamo/huggingface.py --backend inductor --amp --performance --only BertForMaskedLM --training
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103469
Approved by: https://github.com/jansel
Summary:
AOTInductor currently infers the CUDA device index via `cudaGetDevice()`. This assumes the outer runtime calls `cudaSetDevice()` somewhere before invoking the AOTInductor run.
This diff adds an explicit argument for specifying the target device, e.g. compiled on "cuda:0", run on "cuda:1".
todo:
- Are the changes in interface.h BC-breaking, given that they change the function signatures in the .so file? We might just need to introduce a new "Create" function.
Test Plan: CI
Differential Revision: D52747132
Privacy Context Container: 368960445142440
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117413
Approved by: https://github.com/chenyang78, https://github.com/desertfire, https://github.com/khabinov
As the [RFC](https://github.com/pytorch/pytorch/issues/114856) mentions, this is step 1 of adding the Intel GPU backend as an alternative Inductor backend.
### Design
Typically, in order to integrate the Intel GPU backend into Inductor, we need to inherit from `WrapperCodegen` and `TritonScheduling` and implement the corresponding subclasses. However, since `WrapperCodegen` and `TritonScheduling` have device-biased code generation **scattered** across their methods, overriding them in subclasses would duplicate a lot of parent class code.
For example:
- 2a44034895/torch/_inductor/codegen/wrapper.py (L487)
- 2a44034895/torch/_inductor/codegen/triton.py (L1996)
So we abstract the device-biased code scattered across `WrapperCodegen` and `TritonScheduling` and provide a unified interface, `DeviceOpOverrides`. This way, when integrating a new backend, we can maximize the reuse of `WrapperCodegen` and `TritonScheduling` code by inheriting from and implementing this interface for device flexibility.
Currently, `DeviceOpOverrides` only covers Python wrapper code generation. We can further extend it to cover C++ wrapper code generation on demand.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116020
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/jansel
Two recent Triton PRs (https://github.com/openai/triton/pull/2701, https://github.com/openai/triton/pull/2756) change the interface of `triton.compile`. This PR adds the necessary changes on the Inductor side to work with both the old and the new compile API.
There is also some simplification between the compilation call in the subprocess and the one in the main process:
- Previously we passed `warm_cache_only=True` if the compilation happened in a subprocess. But Triton never uses that argument in the currently used pin, so I removed it.
- Previously we only passed `compute_capability` if compilation happened in a subprocess. This PR changes that to always pass `compute_capability` to `triton.compile`, whether the compilation happens in the main process or a subprocess.
Updated:
There are more interface changes on the Triton side, e.g.
- `tl.math.{min, max}` now requires a `propagate_nan` argument.
- `JITFunction.run` now requires a `warmup` argument. This affects the benchmarking phase of matmul max-autotune. On the other hand, `JITFunction.run` now forbids the `stream` argument. Simply not passing it when benchmarking the matmul Triton kernel works for both old and new versions of Triton.
- The Triton Autotuner changed attribute names from 'warmup' to 'num_warmup' and from 'rep' to 'num_rep'. This caused Dynamo to fail to handle Triton Autotuner objects, since Dynamo's TritonKernelVariable makes assumptions about attribute names. It's used in some test cases where a model calls the Triton Autotuner directly.
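Two generic patterns for tolerating this kind of interface drift are sketched below. `call_compat` and `get_renamed_attr` are hypothetical helpers written for illustration, and the fake classes and functions stand in for the old and new Triton objects; this is not the code added in the PR.

```python
import inspect


def call_compat(fn, *args, **kwargs):
    """Call fn, dropping keyword arguments its signature does not accept."""
    params = inspect.signature(fn).parameters
    takes_var_kwargs = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    if not takes_var_kwargs:
        kwargs = {k: v for k, v in kwargs.items() if k in params}
    return fn(*args, **kwargs)


def get_renamed_attr(obj, new_name, old_name):
    """Read an attribute that was renamed between versions."""
    if hasattr(obj, new_name):
        return getattr(obj, new_name)
    return getattr(obj, old_name)


# Stand-ins for an old and a new version of the same API.
class OldAutotunerFake:
    warmup = 25


class NewAutotunerFake:
    num_warmup = 25


def old_run_fake(x, stream=None):
    return ("old", x, stream)


def new_run_fake(x, warmup=False):
    return ("new", x, warmup)
```

Calling `call_compat(new_run_fake, 1, stream=0, warmup=False)` silently drops `stream`, while the same call against `old_run_fake` drops `warmup`, so one call site can serve both versions.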
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115878
Approved by: https://github.com/jansel