Commit Graph

280 Commits

Author SHA1 Message Date
Colin Peppler
3829b55416 [inductor] Support ProxyExecutor argument codegen for sympy.Expr (#119166)
Differential Revision: D53398312

## Problem
Currently, if a sympy expression that uses a function like `Max` is passed as an argument to ProxyExecutor, C++ compilation will fail, because `Max` is not valid C++. We need to emit the `std::max` method instead.

```
# What we see
aoti_torch_proxy_executor_call_function(..., std::vector<int64_t>{Max(1025, u1)}.data(), ...);

# What we want
aoti_torch_proxy_executor_call_function(..., std::vector<int64_t>{std::max(1025L, u1)}.data(), ...)
```

## Approach
Use the C++ wrapper's expression printer to handle this conversion.
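
Below is a minimal sketch of the conversion, assuming a small custom sympy printer (the class and method names are illustrative, not the actual C++ wrapper expression printer): sympy printers dispatch on `_print_<ClassName>`, so overriding `_print_Max` is enough to turn `Max(...)` into nested `std::max(...)` calls.

```
import sympy
from sympy.printing.str import StrPrinter

class CppExprPrinterSketch(StrPrinter):
    """Illustrative printer: emit std::max instead of Max in generated C++."""
    def _print_Max(self, expr):
        args = [self._print(a) for a in expr.args]
        out = args[0]
        for a in args[1:]:
            out = f"std::max({out}, {a})"  # std::max is binary; fold longer arg lists pairwise
        return out

u1 = sympy.Symbol("u1", integer=True)
# prints a std::max(...) expression instead of Max(...)
print(CppExprPrinterSketch().doprint(sympy.Max(1025, u1)))
```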

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119166
Approved by: https://github.com/aakhundov
2024-02-06 00:33:25 +00:00
Bin Bao
c7ba5f6c6f [AOTI] Fix a cpp kernel missing arg type issue (#119021)
Summary: The current way of fetching the kernel arg types only works for tensors, not symbols.
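
A hedged sketch of the idea (the helper name and type strings are illustrative, not the actual AOTI codegen): when collecting the C++ argument types for the cpp kernel call, symbolic arguments need a type of their own rather than falling through the tensor path.

```
import sympy

def cpp_arg_type(arg):
    # Illustrative only: symbolic sizes/strides lower to 64-bit integers in the
    # generated C++, while tensor-like IR nodes become ATen tensor handles.
    if isinstance(arg, sympy.Expr):
        return "int64_t"
    return "AtenTensorHandle"
```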

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119021
Approved by: https://github.com/aakhundov, https://github.com/hl475, https://github.com/khabinov
2024-02-02 20:11:58 +00:00
Bin Bao
0e5fe4b3ae [AOTI] Fix a RAIIAtenTensorHandle premature deallocation bug (#118963)
Summary: generate_index_put_fallback currently generates something like the following,

```
AtenTensorHandle tensor_handle_array_1[] = {nullptr, nullptr, arg1_1, wrap_with_raii_handle_if_needed(tmp_tensor_handle_0)};
```

The problem is that wrap_with_raii_handle_if_needed creates a RAIIAtenTensorHandle that only lives through the initialization of this temporary array. Once the initialization is done, the RAIIAtenTensorHandle is destroyed and releases the underlying Tensor, so when tensor_handle_array_1 is later passed to aoti_torch_index_put_out, some of its AtenTensorHandle elements are invalid, causing a segfault.
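
A hedged sketch of the codegen-side fix, written as illustrative Python in the style of the wrapper codegen (the helper `emit_index_put_args` and the generated variable names are hypothetical, not the real `generate_index_put_fallback`): the point is to bind each wrapped handle to a named RAII variable at statement scope so it outlives both the array initializer and the `aoti_torch_index_put_out` call.

```
def emit_index_put_args(writeline, handle_exprs):
    # handle_exprs: C++ expressions for each index handle; None marks a missing index
    names = []
    for i, expr in enumerate(handle_exprs):
        if expr is None:
            names.append("nullptr")
            continue
        raii = f"raii_index_handle_{i}"  # hypothetical generated variable name
        writeline(f"RAIIAtenTensorHandle {raii} = wrap_with_raii_handle_if_needed({expr});")
        names.append(raii)
    # the named RAII handles above stay alive past this array initializer
    writeline("AtenTensorHandle tensor_handle_array_1[] = {" + ", ".join(names) + "};")
```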

Differential Revision: [D53339348](https://our.internmc.facebook.com/intern/diff/D53339348)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118963
Approved by: https://github.com/aakhundov
2024-02-02 16:49:45 +00:00
Colin Peppler
babd6c776d [inductor] skip launching kernels with zero grid in AOTInductor when using backed symints (#118654)
Like #110312, but we also run this check when backed symints are in the grid (e.g. `s1 / 512`).

### Why?

Let's say we lower a model and generate a GPU kernel grid with symbolic shapes, e.g. `s1 / 512`. If we later run the lowered model with inputs such that `s1 = 0`, we'll launch the kernel with a `0`-sized grid. This surfaces as `CUDA driver error: invalid argument`.

To avoid this, we check for a `0`-sized grid whenever there are symbolic shapes in the grid, which includes both backed and unbacked symints.

This adds non-zero overhead on the CPU. However, in return, we get better reliability when encountering this scenario, which came up while serving an internal model.
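
A minimal sketch of the guard, assuming a Triton-style launch (illustrative; the real wrapper codegen emits an equivalent check inline around the launch when the grid contains symints):

```
def maybe_launch(kernel, grid, *args, **kwargs):
    # Grid dims may come from expressions like ceil(s1 / 512); if any of them
    # evaluates to 0 there is nothing to compute, and launching anyway would
    # raise "CUDA driver error: invalid argument".
    if any(dim == 0 for dim in grid):
        return
    kernel[grid](*args, **kwargs)
```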

### Test

```
$ python test/inductor/test_aot_inductor.py -k test_zero_grid_with_unbacked_symbols
OK (skipped=3)

$ python test/inductor/test_aot_inductor.py -k test_zero_grid_with_backed_symbols

# Before
Error: CUDA driver error: invalid argument
FAILED (errors=2, skipped=3)

# Now
OK (skipped=3)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118654
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2024-02-02 03:19:52 +00:00
Mu-Chu Lee
2b48891e62 [AOTInductor] Add Runtime Constant-folding for AOTInductor (#118765)
Summary:
Add Runtime Constant-folding for AOTInductor.
This also includes invoking constant folding at load time.

The constant-folding lowering is a 2-step process.
First, we split the graph into 2 modules; one of them is the constant module, which doesn't depend on any input, so the whole module can be inferred (constant-folded) once and reused. The constant module is lowered and codegen-ed as usual and cached (let's call this the constant code). The constant code reuses the whole lowering/profiling/etc. process; the only difference is that we do not generate any headers or initialization for the constant code.
Second, after handling the constant module, we take care of the main module (the part that depends on the user input). For the main module, compared with a normal lowering, we take in one additional component: the constant code. The additional step here is that we inject the constant code into the codegen-ed main module and create the caller for the main module to consume the result of the constant module.
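
A conceptual sketch of the first step, assuming an FX-level split (names are illustrative, not the actual AOTInductor pass): nodes that do not transitively depend on any placeholder can go into the constant module and be folded once at load time.

```
import torch.fx as fx

def split_const_nodes(gm: fx.GraphModule):
    tainted = set()                 # nodes that (transitively) depend on graph inputs
    const_nodes, main_nodes = [], []
    for node in gm.graph.nodes:
        if node.op == "placeholder":
            tainted.add(node)
        elif any(inp in tainted for inp in node.all_input_nodes):
            tainted.add(node)
            main_nodes.append(node)
        else:
            const_nodes.append(node)  # computable from weights/constants alone
    return const_nodes, main_nodes
```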

Test Plan: Unit tests included in commit.

Differential Revision: D53274382

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118765
Approved by: https://github.com/chenyang78
2024-02-01 04:54:25 +00:00
Catherine Lee
4f5785b6b3 Enable possibly-undefined error code (#118533)
Fixes https://github.com/pytorch/pytorch/issues/118129

Suppressions automatically added with

```
import re

with open("error_file.txt", "r") as f:
    errors = f.readlines()

error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type

for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f"  # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Co-authored-by: Catherine Lee <csl@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118533
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2024-01-30 21:07:01 +00:00
Jason Ansel
e332653eb3 [inductor] Use at::detail::empty_strided_* in cpp_wrapper mode (#118490)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118490
Approved by: https://github.com/desertfire
2024-01-30 21:03:19 +00:00
PyTorch MergeBot
40ece2e579 Revert "Enable possibly-undefined error code (#118533)"
This reverts commit 4f13f69a45.

Reverted https://github.com/pytorch/pytorch/pull/118533 on behalf of https://github.com/clee2000 due to sorry i'm trying to figure out a codev merge conflict, if this works i'll be back to rebase and merge ([comment](https://github.com/pytorch/pytorch/pull/118533#issuecomment-1917695185))
2024-01-30 19:00:34 +00:00
Colin Peppler
8be6dee14b [inductor] Fix codegen bug with Native Triton kernels with ReinterpretView args (#118569)
Summary:
### Context

It's possible for the args of a user-defined Triton kernel to be codegen-ed twice. But this only happens if the arg is a `ReinterpretView`.
* First via `arg.codegen_reference()` in `define_user_defined_triton_kernel()`
* Second in `self.codegen_kwargs()`.

When using `abi_compatible=True`, the duplicate codegen will look like the code below. The issue in the code is that one of the Tensors, internal to the graph, isn't properly freed. This scenario was eventually exposed as a memory leak when we re-ran an AOTInductor model many times and observed `memory.used` increase after each iteration.
```
auto tmp_tensor_handle_0 = reinterpret_tensor_wrapper(buf1, 2, int_array_0, int_array_1, 0L);
auto tmp_tensor_handle_1 = reinterpret_tensor_wrapper(buf1, 2, int_array_0, int_array_1, 0L);
...
// There's no wrap_with_raii_handle_if_needed() for tmp_tensor_handle_0.
// And there's no reference to tmp_tensor_handle_0.
// Thus, tmp_tensor_handle_0 is left as an AtenTensorHandle which isn't
// automatically cleaned-up like RAIIAtenTensorHandle
CUdeviceptr var_6;
aoti_torch_get_data_ptr(wrap_with_raii_handle_if_needed(tmp_tensor_handle_1), reinterpret_cast<void**>(&var_6));
void* kernel_args_var_2[] = {..., &var_6, ...};
launchKernel(kernels.add_kernel_0, ..., kernel_args_var_2);
```

### Solution
We just need the arg's buffer name when creating the `TensorArg` in `define_user_defined_triton_kernel()`. Thus, just return the buffer's name and avoid any potential side effects from `arg.codegen_reference()`.

Test Plan:
### Inspect device memory allocated
```
# Before diff
0 device memory 2048
1 device memory 2560
2 device memory 3072
3 device memory 3584
4 device memory 4096
5 device memory 4608

# With diff (memory usage doesn't grow)
0 device memory 1536
1 device memory 1536
2 device memory 1536
3 device memory 1536
4 device memory 1536
5 device memory 1536
```

Reviewed By: jingsh, tissue3

Differential Revision: D53190934

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118569
Approved by: https://github.com/oulgen
2024-01-30 05:19:32 +00:00
Edward Z. Yang
4f13f69a45 Enable possibly-undefined error code (#118533)
Fixes https://github.com/pytorch/pytorch/issues/118129

Suppressions automatically added with

```
import re

with open("error_file.txt", "r") as f:
    errors = f.readlines()

error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type

for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f"  # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118533
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2024-01-30 05:08:10 +00:00
Edward Z. Yang
2951bbf0f7 Add some type annotations to torch._inductor.codegen.wrapper (#118491)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118491
Approved by: https://github.com/Skylion007
2024-01-29 06:17:27 +00:00
Edward Z. Yang
cad79bd0bb Remove follow_imports = skip from sympy (#118469)
dmypy silently ignores follow_imports = skip, so to get parity between
dmypy and mypy we have to suck it up and type: ignore all of the sympy
typing problems.

The suppressions were added automatically with the following script generated by GPT-4:

```
import re

# Read the error file
with open("error_file.txt", "r") as f:
    errors = f.readlines()

# Parse the lines with errors and error types
error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type

# Insert ignore comments in the source files
for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f"  # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118469
Approved by: https://github.com/Skylion007
ghstack dependencies: #118414, #118418, #118432, #118467, #118468
2024-01-28 13:38:38 +00:00
Bin Bao
4e456fd95b [AOTI] Support scalar to tensor in the ABI-compatible mode (#118024)
Differential Revision: [D53019485](https://our.internmc.facebook.com/intern/diff/D53019485)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118024
Approved by: https://github.com/ezyang
2024-01-26 03:15:05 +00:00
Jason Ansel
2de24c11f6 [inductor] Slightly faster memory allocation on CUDA (#118255)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118255
Approved by: https://github.com/peterbell10
ghstack dependencies: #118065, #118070, #118171
2024-01-25 20:49:14 +00:00
Bin Bao
476b744e23 [AOTI] Forward fix https://github.com/pytorch/pytorch/pull/117989 (#118291)
Summary: https://github.com/pytorch/pytorch/pull/117989 disabled use_thread_local_cached_output_tensor for CUDA, but that is not always correct, because we can still have CPU tensors when running CUDA models.

Differential Revision: D53089956

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118291
Approved by: https://github.com/Skylion007, https://github.com/frank-wei, https://github.com/chenyang78, https://github.com/khabinov
2024-01-25 20:30:17 +00:00
Jason Ansel
817debeb89 [inductor] Slightly faster memory allocation on CPU (#118171)
Based on `python benchmarks/dynamo/microbenchmarks/overheads.py`:
- Before `12.2us`
- After `10.5us`

This is inspired by a2c17a2b00 -- but in Python rather than C++

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118171
Approved by: https://github.com/jgong5, https://github.com/peterbell10
ghstack dependencies: #118065, #118070
2024-01-25 16:54:57 +00:00
Bin Bao
ee1dbb2acf [AOTI] Fix a None as index codegen issue (#118187)
Summary: Fix an ABI-compatible codegen issue when index_put has None in its indices.

Differential Revision: [D53047489](https://our.internmc.facebook.com/intern/diff/D53047489)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118187
Approved by: https://github.com/chenyang78
ghstack dependencies: #118168, #118169
2024-01-25 11:53:44 +00:00
Bin Bao
821b2c543c [AOTI] Support .item() in the ABI-compatible mode (#117989)
Summary:

Differential Revision: [D52965076](https://our.internmc.facebook.com/intern/diff/D52965076)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117989
Approved by: https://github.com/ezyang, https://github.com/chenyang78
2024-01-24 20:17:59 +00:00
Nikita Shulga
bd99115276 [AOTI] Enable for MacOS (#118076)
- Add `darwin` to the list of supported platforms
- Add `#include <sstream>` to `aoti_runtime/model.h`
- Refactor the Linux-specific constant compilation logic into `_compile_consts_linux`
- Add `_compile_consts_darwin`, which converts consts to a .S file that is linked into a shared library
   - Patch the file using magic to avoid converting bytes to a large hexadecimal string
- Generate integer constants with the `LL` suffix on MacOS (corresponding to the int64_t definition)
- Enable test_aot_inductor.py tests on MacOS

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118076
Approved by: https://github.com/desertfire
ghstack dependencies: #118077
2024-01-24 14:24:05 +00:00
Bin Bao
41556324a9 [cpp_wrapper] Change CppWrapperCodeCache to use faster python binding (#117693)
Summary: Use the faster binding following https://github.com/pytorch/pytorch/pull/117500. torch.utils.cpp_extension.load_inline builds a lot of things and is very slow. With this change, we can later further reduce the included header files using the ABI-compatible mode and thus speed up compilation further.

Result:
```
python test/inductor/test_cuda_cpp_wrapper.py -k test_relu_cuda_cuda_wrapper

Before: Ran 1 test in 32.843s
After: Ran 1 test in 26.229s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117693
Approved by: https://github.com/jansel
2024-01-21 16:07:52 +00:00
Adnan Akhundov
fbd1d567ed [inductor] Fix CPP wrapper codegen for ExternKernel args (#117931)
Summary: We see IR nodes `repr`-ed directly in the CPP wrapper codegen. Recently, this issue has been fixed for the Python wrapper codegen in D52899373 (https://github.com/pytorch/pytorch/pull/117838). Here we extend the fix to CPP wrapper codegen / AOTInductor.

Test Plan:
New unit tests. In OSS:

```
python test/inductor/test_aot_inductor.py -k test_triton_kernel_multi_output_arg
```

```
python test/inductor/test_aot_inductor.py -k test_triton_kernel_extern_kernel_arg
```

Differential Revision: D52936248

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117931
Approved by: https://github.com/oulgen, https://github.com/chenyang78, https://github.com/desertfire
2024-01-21 04:58:56 +00:00
Oguz Ulgen
15d568d621 [Inductor] Use codegen reference for buffer to string (#117838)
Summary: The added test case ends up emitting an Inductor IR node as the buffer string; let's properly emit the buffer name instead.

Test Plan: added new test

Differential Revision: D52899373

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117838
Approved by: https://github.com/aakhundov
2024-01-19 20:18:53 +00:00
Shunting Zhang
e432b2e607 [inductor] multi-kernel support (#103469)
For a persistent reduction, we generate two flavors of 'equivalent' kernels at the same time
- persistent reduction
- regular reduction

A MultiKernel wraps these two kernels and picks the one with better performance at runtime.
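
A conceptual sketch of the runtime choice (illustrative only; the real MultiKernel also has to reconcile the two kernels' argument lists and record its decision for reuse):

```
from triton.testing import do_bench

def make_multi_kernel(persistent_kernel, regular_kernel):
    picked = {}

    def call(grid, *args, **kwargs):
        if "kernel" not in picked:
            # Benchmark both flavors once on the first call and remember the winner.
            # do_bench returns a timing; some Triton versions return percentiles instead.
            t_persist = do_bench(lambda: persistent_kernel[grid](*args, **kwargs))
            t_regular = do_bench(lambda: regular_kernel[grid](*args, **kwargs))
            picked["kernel"] = persistent_kernel if t_persist <= t_regular else regular_kernel
        return picked["kernel"][grid](*args, **kwargs)

    return call
```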

Here I talk more about implementation details:
- Inductor maintains state for generating kernels, e.g. the wrapper code. After we generate code for one kernel, we need to restore the Inductor state before we can generate the counterpart.

***There is one thing I need comments from others on***:
There is one tricky thing about kernel arguments. In general, Inductor removes a buffer from the argument list if it's only used inside the kernel. But somehow a buffer removed by the persistent reduction kernel may still be kept by the regular (non-persistent) reduction kernel because of some CSE invalidation rule. My current implementation avoids removing buffers if multi_kernel is enabled. This makes sure both flavors of reduction have consistent argument lists. Another idea I have is to generate the multi-kernel definition with the union of arguments from both sub-kernels and let each sub-kernel pick the subset of arguments it wants, but that would make the codegen for multi-kernel much more complex.

I'm not sure if there is some easy and clean way to resolve this.

Testing command:
```
TORCHINDUCTOR_MULTI_KERNEL=1 TORCH_LOGS=+torch._inductor.graph TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 python benchmarks/dynamo/huggingface.py --backend inductor --amp --performance --only BertForMaskedLM --training
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103469
Approved by: https://github.com/jansel
2024-01-18 23:16:31 +00:00
Sherlock Huang
89cf1ddb5c [AOTInductor] Allow user to explicitly specify Device to run on (#117413)
Summary:
AOTInductor currently infers the CUDA device index via `cudaGetDevice()`. This assumes the outer runtime calls `cudaSetDevice()` somewhere before invoking the AOTInductor run.

This diff adds an explicit argument for specifying the target device, e.g. compiled on "cuda:0", run on "cuda:1".

todo:
- Are the changes in interface.h BC-breaking, since they change the function signatures in the .so file? We might just need to introduce a new "Create" function.

Test Plan: CI

Differential Revision: D52747132

Privacy Context Container: 368960445142440

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117413
Approved by: https://github.com/chenyang78, https://github.com/desertfire, https://github.com/khabinov
2024-01-17 19:28:04 +00:00
Colin Peppler
4712c7dac8 [inductor] add C-shim for index_put (#116667)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116667
Approved by: https://github.com/desertfire, https://github.com/chenyang78
2024-01-16 20:29:14 +00:00
Shunting Zhang
04604eea8a [inductor] check nan/inf for graph inputs (#117189)
This is split out from #103469
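
A hedged sketch of what such a check amounts to (illustrative; the generated wrapper emits an equivalent assertion per graph input when the check is enabled):

```
import torch

def assert_finite(name: str, tensor: torch.Tensor) -> None:
    # Fail fast if a graph input already carries NaN or Inf values.
    if torch.isnan(tensor).any() or torch.isinf(tensor).any():
        raise AssertionError(f"graph input {name!r} contains NaN or Inf")
```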

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117189
Approved by: https://github.com/jansel
2024-01-12 00:59:32 +00:00
Jason Ansel
94363cee41 [inductor] Indexing refactors (#116078)
Perf differences seem to be noise:
![image](https://github.com/pytorch/pytorch/assets/533820/d7a36574-0388-46e4-bd4d-b274d37cab2b)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116078
Approved by: https://github.com/aakhundov
2024-01-09 19:06:51 +00:00
Oleg Khabinov
5377b994da [aot_inductor] Retrieve original FQNs for weights (#116157)
Differential Revision: D52303882

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116157
Approved by: https://github.com/frank-wei
2024-01-05 21:30:36 +00:00
etaf
7a6cb9fdfb [Inductor Intel GPU backend Upstream] Step 1/3: Generalize device-bias code in code generation. (#116020)
As the [RFC](https://github.com/pytorch/pytorch/issues/114856) mentions, this is the step 1 to add Intel GPU backend as an alternative inductor backend.

### Design
Typically, in order to integrate an Intel GPU backend into Inductor, we need to inherit from `WrapperCodegen` and `TritonScheduling` and implement the corresponding subclasses. However, since `WrapperCodegen` and `TritonScheduling` have some device-bias code generation **scattered** in their methods, overriding them in subclasses would introduce a lot of duplicated parent-class code.
For example:
2a44034895/torch/_inductor/codegen/wrapper.py (L487)

2a44034895/torch/_inductor/codegen/triton.py (L1996)

So we abstract the device-bias code scattered in WrapperCodegen and TritonScheduling and provide a unified interface, "DeviceOpOverrides". This way, when integrating a new backend, we can maximize the reuse of `WrapperCodegen` and `TritonScheduling` code by inheriting from and implementing this interface for device flexibility.

Currently, `DeviceOpOverrides` only covers Python wrapper code generation. We can further extend it to cover Cpp wrapper code generation on demand.
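
A hedged sketch of the interface shape (the method names below are illustrative, not the exact `DeviceOpOverrides` API): each backend returns its own device-specific source snippets, and the shared wrapper codegen calls through the interface; an Intel GPU backend would subclass it the same way.

```
class DeviceOpOverridesSketch:
    # Backends override these to return device-specific source snippets.
    def set_device(self, device_idx: int) -> str:
        raise NotImplementedError

    def synchronize(self) -> str:
        raise NotImplementedError

class CUDAOverridesSketch(DeviceOpOverridesSketch):
    def set_device(self, device_idx: int) -> str:
        return f"torch.cuda.set_device({device_idx})"

    def synchronize(self) -> str:
        return "torch.cuda.synchronize()"
```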

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116020
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/jansel
2023-12-22 08:42:51 +00:00
Yifu Wang
7d0ad6e870 Make native c10d_functional ops work with AOTInductor (#113735)
Summary:
- Revised `c10d_functional` ops to conform to https://github.com/pytorch/pytorch/tree/main/aten/src/ATen/native#func
- Modified `get_cpp_op_schema()` to handle mutable args and aliasing returns

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113735
Approved by: https://github.com/desertfire
ghstack dependencies: #113438
2023-12-22 08:12:13 +00:00
Shunting Zhang
99f7e721fe [inductor] make inductor work with new triton compile interface (#115878)
Two recent Triton PRs (https://github.com/openai/triton/pull/2701, https://github.com/openai/triton/pull/2756) change the interface for triton.compile; this PR adds the necessary changes on the Inductor side to work with both the old and new compile API.

Also, there is some simplification between the compilation call in the subprocess and the one in the main process:
- Previously we passed warm_cache_only=True if the compilation happened in a subprocess, but Triton never uses that argument at the currently used pin, so I removed that.
- Previously we only passed compute_capability if compilation happened in a subprocess. This PR changes that to always pass compute_capability to triton.compile, whether the compilation happens in the main process or a subprocess.

Updated:
There are more interface changes on the Triton side, e.g.:
- tl.math.{min, max} now require a propagate_nan argument.
- JITFunction.run now requires a warmup argument. This affects the benchmarking phase of matmul max-autotune; on the other hand, JITFunction.run now forbids the stream argument. Simply not passing it when benchmarking the matmul Triton kernel works for both old and new versions of Triton (see the compatibility sketch after this list).
- The Triton Autotuner changed attribute names from 'warmup' to 'num_warmup' and from 'rep' to 'num_rep'. This caused Dynamo to fail to handle Triton Autotuner objects, since Dynamo's TritonKernelVariable makes assumptions about attribute names. It's used in some test cases where a model calls the Triton Autotuner directly.
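
A hedged compatibility sketch for the benchmarking call (illustrative only, not Inductor's actual code): inspect `JITFunction.run`'s signature so the same call works across old and new Triton versions.

```
import inspect

def run_for_benchmark(jit_fn, grid, *args, stream=None):
    params = inspect.signature(jit_fn.run).parameters
    kwargs = {"grid": grid}
    if "warmup" in params:                 # newer Triton requires an explicit warmup flag
        kwargs["warmup"] = False
    if "stream" in params and stream is not None:
        kwargs["stream"] = stream          # older Triton still accepts a stream
    return jit_fn.run(*args, **kwargs)
```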

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115878
Approved by: https://github.com/jansel
2023-12-22 00:09:29 +00:00
PyTorch MergeBot
db35ccf463 Revert "[inductor] make inductor work with new triton compile interface (#115878)"
This reverts commit bbded928b3.

Reverted https://github.com/pytorch/pytorch/pull/115878 on behalf of https://github.com/kit1980 due to Broke ROCm https://github.com/pytorch/pytorch/actions/runs/7282149837/job/19844618618 ([comment](https://github.com/pytorch/pytorch/pull/115878#issuecomment-1865369349))
2023-12-21 02:00:17 +00:00
Shunting Zhang
bbded928b3 [inductor] make inductor work with new triton compile interface (#115878)
Two recent Triton PRs (https://github.com/openai/triton/pull/2701, https://github.com/openai/triton/pull/2756) change the interface for triton.compile; this PR adds the necessary changes on the Inductor side to work with both the old and new compile API.

Also, there is some simplification between the compilation call in the subprocess and the one in the main process:
- Previously we passed warm_cache_only=True if the compilation happened in a subprocess, but Triton never uses that argument at the currently used pin, so I removed that.
- Previously we only passed compute_capability if compilation happened in a subprocess. This PR changes that to always pass compute_capability to triton.compile, whether the compilation happens in the main process or a subprocess.

Updated:
There are more interface changes on the Triton side, e.g.:
- tl.math.{min, max} now require a propagate_nan argument.
- JITFunction.run now requires a warmup argument. This affects the benchmarking phase of matmul max-autotune; on the other hand, JITFunction.run now forbids the stream argument. Simply not passing it when benchmarking the matmul Triton kernel works for both old and new versions of Triton.
- The Triton Autotuner changed attribute names from 'warmup' to 'num_warmup' and from 'rep' to 'num_rep'. This caused Dynamo to fail to handle Triton Autotuner objects, since Dynamo's TritonKernelVariable makes assumptions about attribute names. It's used in some test cases where a model calls the Triton Autotuner directly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115878
Approved by: https://github.com/jansel
2023-12-21 00:03:38 +00:00
Bin Bao
a597a00c87 [AOTI][refactor][3/n] Declare python_kernel_name and cpp_kernel_name in ExternKernel (#115972)
Summary: Both ExternKernelAlloc and ExternKernelOut need these two fields, so declare them in the base class. Also add cpp codegen for IndexPutFallback and InplaceBernoulliFallback in this PR.

This is a reland of https://github.com/pytorch/pytorch/pull/115831

Differential Revision: [D52290900](https://our.internmc.facebook.com/intern/diff/D52290900)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115972
Approved by: https://github.com/chenyang78
2023-12-20 03:22:03 +00:00
Oguz Ulgen
c55210b4f0 [Inductor] Deduplicate grid wrapper statements for user defined triton kernels (#115849)
Noticed that on many MRS kernels the grid wrapper for autotuning is huge, with a bunch of duplicates, because num_warps and num_stages are not needed for grid calculation. Let's deduplicate these entries.

Previously, we would see a wrapper like
```
    def grid_wrapper_for_add_kernel_2d_autotuned_0(meta):
        if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1)
        if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1)
        if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1)
        if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1)
```
now it looks like
```
    def grid_wrapper_for_add_kernel_2d_autotuned_0(meta):
        if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1)
        if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1)
```
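
A hedged sketch of the deduplication idea (illustrative; the actual codegen emits these branches as source text): key each autotuner config only on the meta fields the grid expression reads, so configs that differ only in num_warps/num_stages collapse into one branch.

```
def dedup_grid_branches(configs, grid_keys=("BLOCK_SIZE_X", "BLOCK_SIZE_Y")):
    branches = {}
    for meta, grid in configs:
        key = tuple(meta[k] for k in grid_keys)
        branches.setdefault(key, grid)      # first config per key wins
    return branches

configs = [
    ({"BLOCK_SIZE_X": 128, "BLOCK_SIZE_Y": 128, "num_warps": 4}, (4, 2, 1)),
    ({"BLOCK_SIZE_X": 128, "BLOCK_SIZE_Y": 128, "num_warps": 8}, (4, 2, 1)),
    ({"BLOCK_SIZE_X": 64, "BLOCK_SIZE_Y": 64, "num_warps": 4}, (8, 4, 1)),
]
print(dedup_grid_branches(configs))         # two unique grid branches remain
```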

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115849
Approved by: https://github.com/jansel
2023-12-20 00:25:32 +00:00
PyTorch MergeBot
c539f7df10 Revert "[Inductor] Deduplicate grid wrapper statements for user defined triton kernels (#115849)"
This reverts commit 21b8127f1c.

Reverted https://github.com/pytorch/pytorch/pull/115849 on behalf of https://github.com/jeanschmidt due to Breaking internal tests, please check internal diff for more details ([comment](https://github.com/pytorch/pytorch/pull/115849#issuecomment-1863012933))
2023-12-19 15:47:55 +00:00
PyTorch MergeBot
d5115bfb06 Revert "[AOTI][refactor][3/n] Declare python_kernel_name and cpp_kernel_name in ExternKernel (#115831)"
This reverts commit 287a865677.

Reverted https://github.com/pytorch/pytorch/pull/115831 on behalf of https://github.com/desertfire due to rocm CI failure ([comment](https://github.com/pytorch/pytorch/pull/115831#issuecomment-1858322270))
2023-12-15 18:34:55 +00:00
Bin Bao
287a865677 [AOTI][refactor][3/n] Declare python_kernel_name and cpp_kernel_name in ExternKernel (#115831)
Summary: Both ExternKernelAlloc and ExternKernelOut need these two fields, so declare them in the base class. Also add cpp codegen for IndexPutFallback and InplaceBernoulliFallback in this PR.

Differential Revision: [D52189999](https://our.internmc.facebook.com/intern/diff/D52189999)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115831
Approved by: https://github.com/chenyang78
2023-12-15 14:40:44 +00:00
Bin Bao
7d4ccd7b9e [AOTI][refactor][2/n] Rename kernel to python_kernel_name (#115766)
Differential Revision: [D52164940](https://our.internmc.facebook.com/intern/diff/D52164940)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115766
Approved by: https://github.com/chenyang78
ghstack dependencies: #115783
2023-12-15 03:08:13 +00:00
Oguz Ulgen
21b8127f1c [Inductor] Deduplicate grid wrapper statements for user defined triton kernels (#115849)
Noticed that on many MRS kernels the grid wrapper for autotuning is huge, with a bunch of duplicates, because num_warps and num_stages are not needed for grid calculation. Let's deduplicate these entries.

Previously, we would see a wrapper like
```
    def grid_wrapper_for_add_kernel_2d_autotuned_0(meta):
        if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1)
        if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1)
        if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1)
        if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1)
```
now it looks like
```
    def grid_wrapper_for_add_kernel_2d_autotuned_0(meta):
        if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1)
        if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115849
Approved by: https://github.com/jansel
2023-12-14 23:26:04 +00:00
Scott Wolchok
81321baf5c [PyTorch] Remove ArrayRefTensor::dtype (#113578)
Knocks off a few nanoseconds from CPU inference due to not having to set this field; paths that would've needed it are expensive anyway.

Differential Revision: [D51182794](https://our.internmc.facebook.com/intern/diff/D51182794/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113578
Approved by: https://github.com/khabinov, https://github.com/Neilblaze
ghstack dependencies: #112800, #113577
2023-12-13 21:32:14 +00:00
Scott Wolchok
b9af126908 [PyTorch] Add input numel assert for minimal arrayref interface (#113577)
We currently have no shape checking on CPU IIUC. Now we at least do numel checking for the minimal arrayref interface.

Differential Revision: [D51165703](https://our.internmc.facebook.com/intern/diff/D51165703/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113577
Approved by: https://github.com/chenyang78, https://github.com/jansel
ghstack dependencies: #112800
2023-12-13 21:31:55 +00:00
Scott Wolchok
f9cf6ae889 [PyTorch] AOTI: add minimal arrayref interface (#112800)
This implements an optional alternate interface to the AOTI
generated DSO, intended to increase efficiency for models running on
CPU and requiring minimal overhead. See comment in config.py for more
explanation.

This took a while to get right (e.g., I initially required 1-D
MiniArrayRef<T> for the inputs, but found that multi-dimensional
ArrayRefTensor<T> ended up simplifying the implementation and allowed
test_aot_inductor.py to run) and is somewhat intricate, so I am
anticipating that review will require some back-and-forth.

Differential Revision: [D50699890](https://our.internmc.facebook.com/intern/diff/D50699890/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D50699890/)!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112800
Approved by: https://github.com/chenyang78
2023-12-13 12:06:35 +00:00
Scott Wolchok
2b323e61ad [PyTorch] AOTI: Use static_cast, not dynamic_cast (#112798)
dynamic_cast is for when we aren't certain about the type. We are certain (and will crash anyway if we're wrong).

Differential Revision: [D50812978](https://our.internmc.facebook.com/intern/diff/D50812978/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112798
Approved by: https://github.com/chenyang78, https://github.com/desertfire, https://github.com/jansel, https://github.com/khabinov
ghstack dependencies: #112116, #112174, #112405
2023-12-12 06:19:45 +00:00
Scott Wolchok
ca52195112 [PyTorch] AOTI: Avoid aoti_torch_data_ptr calls for constants at inference time (#112405)
Cache aoti_torch_get_data_ptr at constants update time.

Differential Revision: [D50708982](https://our.internmc.facebook.com/intern/diff/D50708982/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112405
Approved by: https://github.com/chenyang78, https://github.com/desertfire, https://github.com/khabinov
ghstack dependencies: #112116, #112174
2023-12-12 06:19:45 +00:00
Scott Wolchok
24c67fe8cf [PyTorch] AOTI: Emit static constexpr int array vars when possible (#112174)
No need to populate a stack-based array for a shape/stride array when it's statically known.

Differential Revision: [D50699889](https://our.internmc.facebook.com/intern/diff/D50699889/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112174
Approved by: https://github.com/chenyang78, https://github.com/desertfire, https://github.com/jansel
ghstack dependencies: #112116
2023-12-12 06:19:45 +00:00
Scott Wolchok
ff6f987adc [PyTorch] Replace cached thread_locals with stack allocation in AOTI (#112116)
This changes cached thread_local tensors to stack-allocated buffers. Since we were incidentally caching output in a thread_local, I had to add manual thread_local caching of outputs, which I implemented by caching a buffer and a Tensor whose storage is that buffer and then just memcpying the result into the cached buffer every time. Ideally, memory planning would be able to identify allocations that are the backing storage for outputs, but this should be good enough in the absence of planning.

Differential Revision: [D50416438](https://our.internmc.facebook.com/intern/diff/D50416438/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112116
Approved by: https://github.com/jansel, https://github.com/desertfire
2023-12-12 06:19:45 +00:00
Bin Bao
2e6b809d6b [AOTI] Fix a missing declaration for the result of item() (#115175)
Differential Revision: [D51968539](https://our.internmc.facebook.com/intern/diff/D51968539)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115175
Approved by: https://github.com/chenyang78
2023-12-10 22:49:45 +00:00
Mu-Chu Lee
80527c0cf2 [AOTInductor] Double buffering for Weights (#114446)
Summary:
This adds a function to the model container that does weight swapping with double buffering.

There are two parts to double buffering:
a) Write constants into the inactive buffer
b) Swap the active buffer

For (a), we write the constants into the buffer that's currently not in use and store the information in both the constants map and the corresponding constant array to read from.
For (b), we obtain the lock, activate the constant map/constant array that was inactive, and flag the one that was in use as inactive.
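
A conceptual sketch of the scheme in Python (the real implementation lives in the C++ model container; the names here are illustrative):

```
import threading

class DoubleBufferedConstants:
    def __init__(self):
        self.buffers = [{}, {}]     # two constant maps
        self.active = 0             # index of the buffer readers currently use
        self.lock = threading.Lock()

    def write_inactive(self, new_constants):
        # (a) write into the buffer that is not currently being read
        self.buffers[1 - self.active] = dict(new_constants)

    def swap(self):
        # (b) under the lock, flip which buffer is active; the old one becomes inactive
        with self.lock:
            self.active = 1 - self.active

    def get(self, name):
        return self.buffers[self.active][name]
```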

Test Plan:
test/cpp/aot_inductor/test.cpp

Differential Revision: [D51543732](https://our.internmc.facebook.com/intern/diff/D51543732)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114446
Approved by: https://github.com/chenyang78, https://github.com/eellison
2023-12-05 22:31:56 +00:00
Yang Chen
4d8b9964e1 [aotinductor] support at::convolution for AOTInductor (#114961)
This PR adds support for at::convolution to AOTInductor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114961
Approved by: https://github.com/desertfire
2023-12-03 07:52:28 +00:00