Commit Graph

427 Commits

Author SHA1 Message Date
Edward Z. Yang
cc8f1cddd4 Turn on type-checking in torch.fx.experimental.symbolic_shapes (#136972)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136972
Approved by: https://github.com/Skylion007
ghstack dependencies: #136934, #136935
2024-10-01 13:22:10 +00:00
niklasz
3f457ee1f6 Fix AOT Graph capture not propagating non_blocking copy parameter to … (#136513)
…inductor codegen.

Fixes #136260

**Note**: this is my first code contribution to torch so please let me know if there's anything I need to fix/some other convention I should follow.

Regarding the bug, re-running the issue's reproduction code:
```
import torch

def fn(x):
    return x.to(device="cuda", non_blocking=True)

inp = torch.randn(3, 4)

torch.compile(fn)(inp)
```

We now have the non_blocking being passed on to codegen properly:

```
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code] TRACED GRAPH
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code]  ===== pre insert_deferred_runtime_asserts __compiled_fn_1 =====
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code]  <eval_with_key>.0 class GraphModule(torch.nn.Module):
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code]     def forward(self, L_x_: "f32[3, 4]"):
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code]         l_x_ = L_x_
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code]
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code]          # File: /home/niklasz/Desktop/pytorch/temp/reproduction.py:4 in fn, code: return x.to(device="cuda", non_blocking=True)
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code]         to: "f32[3, 4]" = l_x_.to(device = 'cuda', non_blocking = True);  l_x_ = None
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code]         return (to,)
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code]
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code]
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code] TRACED GRAPH
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code]  ===== __compiled_fn_1 =====
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code]  /home/niklasz/Desktop/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module):
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code]     def forward(self, L_x_: "f32[3, 4][4, 1]cpu"):
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code]         l_x_ = L_x_
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code]
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code]          # File: /home/niklasz/Desktop/pytorch/temp/reproduction.py:4 in fn, code: return x.to(device="cuda", non_blocking=True)
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code]         to: "f32[3, 4][4, 1]cuda:0" = l_x_.to(device = 'cuda', non_blocking = True);  l_x_ = None
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code]         return (to,)
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code]
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code]
V0922 20:33:25.404000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:114] [0/0] [__aot_graphs] aot_config id: 0, fw_metadata=ViewAndMutationMeta(input_info=[InputAliasInfo(is_leaf=True, mutates_data=False, mutates_metadata=False, mutations_hidden_from_autograd=True, mutations_under_no_grad_or_inference_mode=False, mutation_inductor_storage_resize=False, mutates_storage_metadata=False, requires_grad=False, keep_input_mutations=True)], output_info=[OutputAliasInfo(output_type=<OutputType.non_alias: 1>, raw_type=<class 'torch._subclasses.functional_tensor.FunctionalTensor'>, base_idx=None, dynamic_dims=set(), requires_grad=False, functional_tensor=None)], num_intermediate_bases=0, keep_input_mutations=True, traced_tangents=[], subclass_inp_meta=[0], subclass_fw_graph_out_meta=[0], subclass_tangent_meta=[], is_train=False, traced_tangent_metas=None, num_symints_saved_for_bw=None, grad_enabled_mutation=None, deterministic=None, static_input_indices=[], tokens={}, indices_of_inputs_that_requires_grad_with_mutations_in_bw=[], bw_donated_idxs=None, num_backward_tokens=0),subclass_metadata=None
I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs] TRACED GRAPH
I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs]  ===== Forward graph 0 =====
I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs]  /home/niklasz/Desktop/pytorch/torch/fx/_lazy_graph_module.py class <lambda>(torch.nn.Module):
I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs]     def forward(self, arg0_1: "f32[3, 4][4, 1]cpu"):
I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs]          # File: /home/niklasz/Desktop/pytorch/temp/reproduction.py:4 in fn, code: return x.to(device="cuda", non_blocking=True)
I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs]         device_put: "f32[3, 4][4, 1]cuda:0" = torch.ops.prims.device_put.default(arg0_1, device(type='cuda', index=0), True);  arg0_1 = None
I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs]         convert_element_type: "f32[3, 4][4, 1]cuda:0" = torch.ops.prims.convert_element_type.default(device_put, torch.float32);  device_put = None
I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs]         return (convert_element_type,)
I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs]
I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs]
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1134] [0/0] [__output_code] Output code written to: /tmp/torchinductor_niklasz/ha/chaai264g6ribfw3q2qhl6ayjtaqaavku5wivxtzw4nabgd6htsv.py
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] Output code:
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] # AOT ID: ['0_inference']
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from ctypes import c_void_p, c_long, c_int
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] import torch
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] import math
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] import random
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] import os
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] import tempfile
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from math import inf, nan
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from torch._inductor.hooks import run_intermediate_hooks
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from torch._inductor.utils import maybe_profile
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from torch._inductor.codegen.memory_planning import _align as align
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from torch import device, empty_strided
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from torch._inductor.async_compile import AsyncCompile
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from torch._inductor.select_algorithm import extern_kernels
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from torch._inductor.codegen.multi_kernel import MultiKernelCall
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] aten = torch.ops.aten
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] inductor_ops = torch.ops.inductor
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] _quantized = torch.ops._quantized
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] assert_size_stride = torch._C._dynamo.guards.assert_size_stride
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] empty_strided_cpu = torch._C._dynamo.guards._empty_strided_cpu
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] empty_strided_cuda = torch._C._dynamo.guards._empty_strided_cuda
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] empty_strided_xpu = torch._C._dynamo.guards._empty_strided_xpu
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] reinterpret_tensor = torch._C._dynamo.guards._reinterpret_tensor
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] alloc_from_pool = torch.ops.inductor._alloc_from_pool
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] async_compile = AsyncCompile()
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] async_compile.wait(globals())
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] del async_compile
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] def call(args):
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]     arg0_1, = args
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]     args.clear()
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]     assert_size_stride(arg0_1, (3, 4), (4, 1))
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]     with torch.cuda._DeviceGuard(0):
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]         torch.cuda.set_device(0)
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]         buf0 = empty_strided_cuda((3, 4), (4, 1), torch.float32)
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]         buf0.copy_(arg0_1, True)
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]         del arg0_1
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]     return (buf0, )
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] def benchmark_compiled_module(times=10, repeat=10):
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]     from torch._dynamo.testing import rand_strided
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]     from torch._inductor.utils import print_performance
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]     arg0_1 = rand_strided((3, 4), (4, 1), device='cpu', dtype=torch.float32)
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]     fn = lambda: call([arg0_1])
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]     return print_performance(fn, times=times, repeat=repeat)
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] if __name__ == "__main__":
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]     from torch._inductor.wrapper_benchmark import compiled_module_main
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]     compiled_module_main('None', benchmark_compiled_module)
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]
```
See above line `buf0.copy_(arg0_1, True)`. Specific log setting used: `export TORCH_LOGS="graph_code,aot_graphs,output_code"`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136513
Approved by: https://github.com/eellison
2024-10-01 00:32:47 +00:00
eellison
18e707645c Substitute unbacked symints in expressions (#137020)
Differential Revision: [D63647095](https://our.internmc.facebook.com/intern/diff/D63647095)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137020
Approved by: https://github.com/ezyang
2024-09-30 23:07:22 +00:00
PyTorch MergeBot
8982906502 Revert "Turn on type-checking in torch.fx.experimental.symbolic_shapes (#136972)"
This reverts commit 3ff2d93d9f.

Reverted https://github.com/pytorch/pytorch/pull/136972 on behalf of https://github.com/ezyang due to need to back out for merge conflict ([comment](https://github.com/pytorch/pytorch/pull/136972#issuecomment-2384182244))
2024-09-30 21:35:08 +00:00
Jez Ng
71aac59e93 Add Triton CPU as an Inductor backend (#133408)
The goal is to use Inductor-generated kernels to stress test the new Triton CPU backend.

Differential Revision: [D63298968](https://our.internmc.facebook.com/intern/diff/D63298968)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133408
Approved by: https://github.com/jansel, https://github.com/blaine-rister, https://github.com/malfet
2024-09-30 20:24:52 +00:00
Edward Z. Yang
3ff2d93d9f Turn on type-checking in torch.fx.experimental.symbolic_shapes (#136972)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136972
Approved by: https://github.com/Skylion007
ghstack dependencies: #136917, #136934, #136935
2024-09-30 18:04:36 +00:00
PyTorch MergeBot
36428f91e9 Revert "Add Triton CPU as an Inductor backend (#133408)"
This reverts commit 31c0467594.

Reverted https://github.com/pytorch/pytorch/pull/133408 on behalf of https://github.com/int3 due to internal tests failing ([comment](https://github.com/pytorch/pytorch/pull/133408#issuecomment-2379692517))
2024-09-27 16:54:27 +00:00
Jez Ng
31c0467594 Add Triton CPU as an Inductor backend (#133408)
The goal is to use Inductor-generated kernels to stress test the new Triton CPU backend.

Differential Revision: [D63298968](https://our.internmc.facebook.com/intern/diff/D63298968)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133408
Approved by: https://github.com/jansel, https://github.com/blaine-rister, https://github.com/malfet
2024-09-26 15:35:26 +00:00
Bin Bao
20a855bf01 [AOTI] Move stack_allocation logic from PythonWrapperCodegen (#136463)
Summary: Move stack_allocation logic from PythonWrapperCodegen to CppWrapperCpuArrayRef

Differential Revision: [D63319970](https://our.internmc.facebook.com/intern/diff/D63319970)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136463
Approved by: https://github.com/chenyang78
ghstack dependencies: #136062, #136461, #136462
2024-09-25 14:06:33 +00:00
Jokeren
cabfbef6cf [pytorch][PR] [inductor] More fixes on the keys of constants and signature dictionaries (#136514)
Summary: Previous PR forgets to change two other places that also create `constants` and `signature`.

Test Plan:
Imported from GitHub, without a `Test Plan:` line.
 {F1884584338}

Differential Revision: D63027728

Pulled By: Myrthan

Co-authored-by: Jokeren <robinho364@gmail.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136514
Approved by: https://github.com/jansel

Co-authored-by: Jokeren <robinho364@gmail.com>
2024-09-25 09:34:14 +00:00
Bin Bao
95c0f7493f [Inductor] Rename WrapperCodeGen to PythonWrapperCodegen (#136062)
Summary: Rename WrapperCodeGen to PythonWrapperCodegen to make its meaning more explicit.

Differential Revision: [D63300358](https://our.internmc.facebook.com/intern/diff/D63300358)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136062
Approved by: https://github.com/angelayi, https://github.com/chenyang78
2024-09-24 21:02:51 +00:00
Shangdi Yu
3bc073d728 [aoti] Fix workspace generation for triton (#135552)
Fixes #131337

- add `arg_type` for workspace_arg, the type is consistent with the type in `generate_workspace_allocation()`.
- do not generate example tensors for `workspace`, and use `generate_workspace_allocation()` instead.
- add workspace allocation generation code to `kernel_autotune_calls`. e.g.
```python
    workspace = empty_strided_cuda((1280, ), (1, ), torch.uint8)
    workspace.zero_()
    .....
    triton_spl_fused_add_cumprod_0.run(buf2, arg0_1, arg1_1, workspace, 1, 10000, grid=split_scan_grid(1, 10000), stream=stream0)
    del buf2, arg0_1, arg1_1, workspace
```
-  add `empty_strided_cuda = torch._C._dynamo.guards._empty_strided_cuda` to the header of triton autotune code.

The generated cpp has lines like below, so we also implement a `zero_()` for ` AtenTensorHandle `.

```cpp
    static constexpr int64_t int_array_0[] = {1280L, };
    static constexpr int64_t int_array_1[] = {1L, };
    AtenTensorHandle workspace_handle;
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_empty_strided(1, int_array_0, int_array_1, cached_torch_dtype_uint8, cached_torch_device_type_cuda,  0, &workspace_handle));

        RAIIAtenTensorHandle workspace(workspace_handle);
        workspace.zero_();
```

- Fix handle grid_fn  for grid computation. Pass in "RBLOCK" to `split_scan_grid`
-  Fix dynamic shapes:
Without the fix we generate code that looks like this `workspace = empty_strided_cuda((32*((255 + s0) // 256), ), (1, ), torch.uint8)` when doing triton autotune and `s0` is not defined.

The solution approach is to use `V.graph.sizevars.size_hint(nbytes)` to realize the workspace size for triton autotune. Note that we only realize it for triton autotune code, but not for the cpp cuda code.

- We also generate slightly different cpp code depending on if `abi_compatible` is turned on.
```cpp
RAIIAtenTensorHandle workspace(workspace_handle);
AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_zero_(workspace.get()));
```
vs

```cpp
    at::Tensor workspace = at::detail::empty_strided_cuda({8L*(c10::div_floor_integer(static_cast<int64_t>((255L + s0)), static_cast<int64_t>(256L))), }, {1L, }, at::kByte, c10::DeviceType::CUDA);
    workspace.zero_();
```

Test Plan:

```
TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCHINDUCTOR_CPP_WRAPPER=1  python test/inductor/test_torchinductor.py -k GPUTests.test_consecutive_split_cumprod_cuda
python test/inductor/test_cuda_cpp_wrapper.py TestCudaWrapper.test_consecutive_split_cumprod_cuda_cuda_wrapper
python test/inductor/test_cuda_cpp_wrapper.py DynamicShapesCudaWrapperCudaTests.test_consecutive_split_cumprod_cuda_dynamic_shapes_cuda_wrapper
TORCHINDUCTOR_ABI_COMPATIBLE=1 python test/inductor/test_cuda_cpp_wrapper.py TestCudaWrapper.test_consecutive_split_cumprod_cuda_cuda_wrapper
TORCHINDUCTOR_CPP_WRAPPER=1  python test/inductor/test_torchinductor.py -k GPUTests.test_consecutive_split_cumprod_cuda
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135552
Approved by: https://github.com/desertfire
2024-09-22 04:51:37 +00:00
PyTorch MergeBot
d0cebedb31 Revert "Add Triton CPU as an Inductor backend (#133408)"
This reverts commit e498b02b47.

Reverted https://github.com/pytorch/pytorch/pull/133408 on behalf of https://github.com/jeanschmidt due to Broke internal signals, see D62737208 for more details ([comment](https://github.com/pytorch/pytorch/pull/133408#issuecomment-2353623816))
2024-09-16 18:33:33 +00:00
PyTorch MergeBot
0199fd4d7e Revert "[inductor] More fixes on the keys of constants and signature dictionaries (#135406)"
This reverts commit e54b559e88.

Reverted https://github.com/pytorch/pytorch/pull/135406 on behalf of https://github.com/jeanschmidt due to Reverting as it is breaking triton_mtia internal signals @jansel could you have a look and help get those changes merged? ([comment](https://github.com/pytorch/pytorch/pull/135406#issuecomment-2353557481))
2024-09-16 17:58:02 +00:00
Jez Ng
e498b02b47 Add Triton CPU as an Inductor backend (#133408)
The goal is to use Inductor-generated kernels to stress test the new Triton CPU backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133408
Approved by: https://github.com/jansel
2024-09-14 21:45:19 +00:00
PyTorch MergeBot
18f9331e5d Revert "[aoti] Fix workspace generation for triton (#135552)"
This reverts commit d383325392.

Reverted https://github.com/pytorch/pytorch/pull/135552 on behalf of https://github.com/izaitsevfb due to blocks revert of #135313, internal failures, see D62511427 ([comment](https://github.com/pytorch/pytorch/pull/135552#issuecomment-2349641372))
2024-09-13 17:47:36 +00:00
Rachel Guo
7834c0bb2c [AOTI][Tooling] Add stats summary (mean/min/max, etc) for jit inductor tensor value printing (#135887)
Summary:
As title. Follow up to add stats summary (mean/min/max, etc) for jit inductor tensor value printing as well.

The inductor python wrapper code level printing would look something like this:

 {F1859224287}

Test Plan: CI

Reviewed By: chenyang78

Differential Revision: D62415575

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135887
Approved by: https://github.com/chenyang78
2024-09-13 17:19:25 +00:00
Jokeren
e54b559e88 [inductor] More fixes on the keys of constants and signature dictionaries (#135406)
Previous PR forgets to change two other places that also create `constants` and `signature`. https://github.com/pytorch/pytorch/pull/135170

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135406
Approved by: https://github.com/jansel
2024-09-13 04:10:41 +00:00
Shangdi Yu
d383325392 [aoti] Fix workspace generation for triton (#135552)
Fixes #131337

- add `arg_type` for workspace_arg, the type is consistent with the type in `generate_workspace_allocation()`.
- do not generate example tensors for `workspace`, and use `generate_workspace_allocation()` instead.
- add workspace allocation generation code to `kernel_autotune_calls`. e.g.
```python
    workspace = empty_strided_cuda((1280, ), (1, ), torch.uint8)
    workspace.zero_()
    .....
    triton_spl_fused_add_cumprod_0.run(buf2, arg0_1, arg1_1, workspace, 1, 10000, grid=split_scan_grid(1, 10000), stream=stream0)
    del buf2, arg0_1, arg1_1, workspace
```
-  add `empty_strided_cuda = torch._C._dynamo.guards._empty_strided_cuda` to the header of triton autotune code.

The generated cpp has lines like below, so we also implement a `zero_()` for ` AtenTensorHandle `.

```cpp
    static constexpr int64_t int_array_0[] = {1280L, };
    static constexpr int64_t int_array_1[] = {1L, };
    AtenTensorHandle workspace_handle;
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_empty_strided(1, int_array_0, int_array_1, cached_torch_dtype_uint8, cached_torch_device_type_cuda,  0, &workspace_handle));

        RAIIAtenTensorHandle workspace(workspace_handle);
        workspace.zero_();
```

- Fix handle grid_fn  for grid computation. Pass in "RBLOCK" to `split_scan_grid`
-  Fix dynamic shapes:
Without the fix we generate code that looks like this `workspace = empty_strided_cuda((32*((255 + s0) // 256), ), (1, ), torch.uint8)` when doing triton autotune and `s0` is not defined.

The solution approach is to use `V.graph.sizevars.size_hint(nbytes)` to realize the workspace size for triton autotune. Note that we only realize it for triton autotune code, but not for the cpp cuda code.

- We also generate slightly different cpp code depending on if `abi_compatible` is turned on.
```cpp
RAIIAtenTensorHandle workspace(workspace_handle);
AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_zero_(workspace.get()));
```
vs

```cpp
    at::Tensor workspace = at::detail::empty_strided_cuda({8L*(c10::div_floor_integer(static_cast<int64_t>((255L + s0)), static_cast<int64_t>(256L))), }, {1L, }, at::kByte, c10::DeviceType::CUDA);
    workspace.zero_();
```

Test Plan:

```
TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCHINDUCTOR_CPP_WRAPPER=1  python test/inductor/test_torchinductor.py -k GPUTests.test_consecutive_split_cumprod_cuda
python test/inductor/test_cuda_cpp_wrapper.py TestCudaWrapper.test_consecutive_split_cumprod_cuda_cuda_wrapper
python test/inductor/test_cuda_cpp_wrapper.py DynamicShapesCudaWrapperCudaTests.test_consecutive_split_cumprod_cuda_dynamic_shapes_cuda_wrapper
TORCHINDUCTOR_ABI_COMPATIBLE=1 python test/inductor/test_cuda_cpp_wrapper.py TestCudaWrapper.test_consecutive_split_cumprod_cuda_cuda_wrapper
TORCHINDUCTOR_CPP_WRAPPER=1  python test/inductor/test_torchinductor.py -k GPUTests.test_consecutive_split_cumprod_cuda
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135552
Approved by: https://github.com/desertfire
2024-09-12 23:53:09 +00:00
Rachel Guo
c1277945d3 [AOTI][Tooling] Support debug printing for inductor level extern kernel call such as externkernel.addmm, bmm, etc. (#135731)
Summary:
As title.

Effect after merging this diff would look something like this:

```
        print('inductor: before_launch - triton_poi_fused_0 - buf0', buf0)
        triton_poi_fused_0.run(buf0, 6, grid=grid(6), stream=stream0)
        print('inductor: after_launch - triton_poi_fused_0 - buf0', buf0)
        buf1 = empty_strided_cuda((16, 6), (6, 1), torch.float32)
        # Topologically Sorted Source Nodes: [linear], Original ATen: [aten.addmm]
        print('inductor: before_launch - extern_kernels.addmm - buf0', buf0)
        extern_kernels.addmm(buf0, reinterpret_tensor(arg2_1, (16, 16), (16, 1), 0), reinterpret_tensor(L__self___weight, (16, 6), (1, 16), 0), alpha=1, beta=1, out=buf1)
        print('inductor: after_launch - extern_kernels.addmm - buf0', buf0)
```

Context: D62272588 only support major triton kernel jit inductor debug printing codegen

Test Plan: CI & OSS CI

Reviewed By: chenyang78, ColinPeppler

Differential Revision: D62397017

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135731
Approved by: https://github.com/ColinPeppler
2024-09-12 17:31:10 +00:00
xinan.lin
13ee85ca5e [Inductor] Generalize cuda cpp wrapper as common triton based GPU cpp wrapper, will be reused by xpu in next PR. (#135312)
[Inductor] Generalize cuda cpp wrapper as common triton based GPU cpp wrapper, will be reused by xpu in next PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135312
Approved by: https://github.com/jansel, https://github.com/desertfire, https://github.com/eellison
2024-09-11 23:59:54 +00:00
Rachel Guo
1f15973657 [AOTI][Tooling][7/n] Add debug printing support for JIT inductor codegen path as well (#135285)
Summary:
1.  Add the debug printer call to a level lower for triton kernel python wrapper codegen path
2. Add `torch.save()` for jit inductor as well
3. This also fixes the issue introduced in D61949020 (at python wrapper code level for triton kernel not printing)

Test Plan:
```
AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=1  TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+graph, inductor, +schedule, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_addmm_abi_compatible_cuda
```

Differential Revision: D62272588

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135285
Approved by: https://github.com/chenyang78
2024-09-10 19:24:58 +00:00
xinan.lin
ca16956b20 [Inductor] Generalize device guard codegen for cpp_wrapper mode. (#134761)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134761
Approved by: https://github.com/jansel, https://github.com/EikanWang
ghstack dependencies: #134693
2024-09-10 10:11:52 +00:00
Bin Bao
9c6dff4941 [AOTI] Add C shim for aten.mkldnn_rnn_layer in cpp wrapper (#134857)
Summary: Support aten.mkldnn_rnn_layer in the ABI-compatible mode. Because aten.mkldnn_rnn_layer is an aten op, it is easier to add a C shim function for it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134857
Approved by: https://github.com/angelayi
2024-09-09 16:54:12 +00:00
Henry Tsang
3988b3468b [aoti][easy] remove breakpoint() in wrapper.py (#134807)
Differential Revision: D61687146

Remove an unintended breakpoint in code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134807
Approved by: https://github.com/YUNQIUGUO
2024-09-06 17:25:05 +00:00
qchip
f6398eb0fa dynamic shapes for combo_kenel/foreach_kernel (#134477)
This PR add dynamic shapes support to foreach and combo kernels for horizontal fusion.
A flag `combo_kernel_foreach_dynamic_shapes` (default False to avoid disturb production workflows) is added to _inductor/config.py. Setting it to True enables automatic dynamic shapes for foreach kernels. It is always enabled for combo kernels cases. Added unit cases.

This PR also fixes a flaky test case for [T198833257](https://www.internalfb.com/intern/tasks/?t=198833257)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134477
Approved by: https://github.com/mlazos
2024-08-30 19:58:20 +00:00
Yueming Hao
4f9c68454a [inductor]Let output or input_as_strided match exact strides (#130956)
Fixes #130394

TorchInductor doesn't respect original strides of outputs. It opens up optimization opportunities like changing up memory layout. But for some cases, such as the case in https://github.com/pytorch/pytorch/issues/130394, we do need the output match the exact stride as required. The correctness is the first priority goal. So, this PR adds a new API `ir.ExternKernel.require_exact_strides(x, exact_strides, allow_padding=False)` to fix the issue.  This PR enables dense and non-dense outputs' strides follow the strides required by semantics.

The comparison between the original and after this fix for the test is the below.

```python
@triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 128
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex % 8
    x1 = (xindex // 8)
-   x2 = xindex
    tmp0 = tl.load(in_ptr0 + (x0 + (16*x1)), xmask)
    tmp1 = tmp0 + tmp0
-   tl.store(out_ptr0 + (x2), tmp1, xmask)
+   tl.store(out_ptr0 + (x0 + (16*x1)), tmp1, xmask)

def call(args):
    arg0_1, = args
    args.clear()
    assert_size_stride(arg0_1, (16, 8), (16, 1))
    with torch.cuda._DeviceGuard(0):
        torch.cuda.set_device(0)
-       buf1 = empty_strided_cuda((16, 8), (8, 1), torch.float32)
+       buf1 = empty_strided_cuda((16, 8), (16, 1), torch.float32)
        stream0 = get_raw_stream(0)
        triton_poi_fused_add_copy_0.run(arg0_1, buf1, 128, grid=grid(128), stream=stream0)
        del arg0_1
    return (buf1, )
```

The buf1 is created with exact stride required by users, and its values are written in same stride with the input.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130956
Approved by: https://github.com/eellison, https://github.com/blaine-rister, https://github.com/desertfire
2024-08-29 03:06:58 +00:00
Rachel Guo
89929d9abc [AOTI][Tooling][4/n] Add torch.save() for individual intermediate tensor (#133871)
Differential Revision: D61415304

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133871
Approved by: https://github.com/ColinPeppler
2024-08-28 04:48:00 +00:00
Bin Bao
a4b44dd2ef [AOTI] Introduce DeferredCudaGridLine for cuda cpp wrapper (#129268)
Summary: Similar to https://github.com/pytorch/pytorch/pull/129135, use DeferredCudaGridLine to create a deferred grid computation line when generating cpp wrapper.

Differential Revision: [D61800622](https://our.internmc.facebook.com/intern/diff/D61800622)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129268
Approved by: https://github.com/angelayi
2024-08-27 19:23:25 +00:00
Jason Ansel
2c8fc3f4ce [inductor] Move imports to top of file in generated code (#134195)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134195
Approved by: https://github.com/eellison
ghstack dependencies: #134194
2024-08-24 00:35:57 +00:00
Mwiza Kunda
be207af6e1 Disable unwrapping scalar tensors when used as outputs (#132859)
If the scalar tensor is an output tensor, it shouldn't be unwrapped (i.e. `.item()` called) since `tl.store` requires a pointer type for outputs. This issue only occurs for mutated buffers: the input tensor is also used as an output tensor.

Fixes #ISSUE_NUMBER

@yanboliang @jansel @ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132859
Approved by: https://github.com/jansel
2024-08-16 21:40:45 +00:00
Max Podkorytov
3d45717219 [ROCm][CK][Inductor] enable dynamic shapes for CK backend to gemm max autotune (#133285)
This PR enables dynamic shapes for the CK backend for gemm max autotune (see #125453).

This is achieved via unhardcoding the problem sizes from the template body and passing them as parameters instead.

We handle passing the problem sizes for the kernel call as well as for the benchmark call.

# Testing

`pytest test/inductor/test_ck_backend.py [-k dynamic]`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133285
Approved by: https://github.com/ColinPeppler
2024-08-16 06:05:23 +00:00
Xuehai Pan
758a0a88a2 [BE][Easy] enable ruff rule PIE790: unnecessary pass statement (#133200)
This PR removes unnecessary `pass` statement. This is semanticly safe because the bytecode for the Python code does not change.

Note that if there is a docstring in the function, a empty function does not need a `pass` statement as placeholder.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133200
Approved by: https://github.com/malfet, https://github.com/eqy, https://github.com/kit1980
2024-08-15 15:50:19 +00:00
Rachel Guo
c17d26c3c1 [AOTI][Tooling] A couple fixes / minor updates for initial debug printer (#133016)
Summary:
Follow up small diff to fix a couple issues:
-  add condition for cuda/gpu case to only print kernel name list in the second pass i.e. when we do the cpp wrapper codegen

- other minor fixes around `AOT_INDUCTOR_FILTERED_KERNELS_TO_PRINT` option

Test Plan:
```
AOT_INDUCTOR_FILTERED_KERNELS_TO_PRINT="triton_poi_fused_0" AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=1 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1  TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+graph, inductor, +schedule, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_addmm_abi_compatible_cuda
```

Differential Revision: D60954888

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133016
Approved by: https://github.com/ColinPeppler
2024-08-13 23:00:29 +00:00
xinan.lin
92f650c5b3 [Inductor][Intel GPU] Support codegen empty_strided_xpu, align with #118255. (#126678)
[Inductor][Intel GPU] Support codegen empty_strided_xpu, align with #118255.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126678
Approved by: https://github.com/EikanWang, https://github.com/jansel, https://github.com/eellison
2024-08-10 14:33:39 +00:00
Feng Shi
55b0c39d82 Reland "[1/2] PT2 Inductor ComboKernels - Foreach cases (#124969)" (#132182)
Summary:
Reland #124969 by backing out D60397377 "Back out "[1/2] PT2 Inductor ComboKernels - Foreach cases  (#124969)""

The original diff D54134695 was reverted because of failure of ads nightly cogwheel tests.

The root cause: the logic for generating mask in Triton kernel needed update after a recent refactoring on triton.py. This diff includes the fix of the root cause.

See D54134695 or #124969 for more details.

Test Plan:
Originally failed tests
f585704630
f585733786

Diff patched:
f586664028
f586663820

Differential Revision: D60458597

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132182
Approved by: https://github.com/Yuzhen11
2024-08-05 06:57:30 +00:00
Edward Z. Yang
290f09f829 Ban decorator usage of dynamo_timed (#132328)
This is a more manual version of https://github.com/pytorch/pytorch/pull/132073 that just manually creates the new function at each call site instead of magicking it with clone. Review with whitespace diffs off.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132328
Approved by: https://github.com/albanD
2024-08-02 12:00:46 +00:00
PyTorch MergeBot
c8958f8f84 Revert "Ban decorator usage of dynamo_timed (#132328)"
This reverts commit 9853c048eb.

Reverted https://github.com/pytorch/pytorch/pull/132328 on behalf of https://github.com/clee2000 due to seems to have broken functorch/test_aotdispatch.py::TestAOTAutograd::test_input_data_and_metadata_mutation_aliases_other_input [GH job link](https://github.com/pytorch/pytorch/actions/runs/10204547165/job/28233976446) [HUD commit link](9853c048eb).  Test passed on PR, probably a landrace, base is only 10 hours old ([comment](https://github.com/pytorch/pytorch/pull/132328#issuecomment-2263909337))
2024-08-01 20:20:28 +00:00
Edward Z. Yang
9853c048eb Ban decorator usage of dynamo_timed (#132328)
This is a more manual version of https://github.com/pytorch/pytorch/pull/132073 that just manually creates the new function at each call site instead of magicking it with clone. Review with whitespace diffs off.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132328
Approved by: https://github.com/albanD
2024-08-01 19:27:58 +00:00
Yuzhen Huang
5298acb5c7 Back out "[1/2] PT2 Inductor ComboKernels - Foreach cases (#124969)" (#132065)
Summary:
Original commit changeset: 1d8cfdcef69d

Original Phabricator Diff: D54134695

back out: D54134695

Test Plan: more details see: https://docs.google.com/document/d/1noPTmTdNYHVDFyk7AJSSO7jQoNw6fTo4o6k9eTNeZh8/edit#heading=h.xeo30usu77nc

Reviewed By: zw2326

Differential Revision: D60397377

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132065
Approved by: https://github.com/zw2326, https://github.com/qchip
2024-07-29 22:48:29 +00:00
Colin Peppler
2ff98bc57f [inductor][autotune_at_compile_time] fix some codegen-ing for standalone autotuning file (#131726)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131726
Approved by: https://github.com/desertfire
ghstack dependencies: #131253
2024-07-26 00:58:04 +00:00
Colin Peppler
f885a70fab [inductor][autotune_at_compile_time] support Triton kernel with sympy fn str arg (#131253)
## What is sympy fn str arg?
It's  a string such as `sqrt` which also happens to be a real sympy function (e.g. `sympy.sqrt`)

## Crash

```
torch/_inductor/sizevars.py", line 468, in symbolic_hint
    expr = self.simplify(expr)        # where expr is 'sqrt'
torch/_inductor/sizevars.py", line 66, in simplify
    return sympy.expand(expr).xreplace(self.replacements)
sympy/core/function.py", line 2816, in expand
    return sympify(e).expand(deep=deep, modulus=modulus, **hints)
AttributeError: 'function' object has no attribute 'expand'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131253
Approved by: https://github.com/desertfire
2024-07-25 23:31:20 +00:00
Feng Shi
404d640c39 [1/2] PT2 Inductor ComboKernels - Foreach cases (#124969)
Summary:
A ComboKernel combines independent Inductor Triton kernels into a single one.
Consolidation with Foreach kernel:
1) For the scheduler node, the logic is consolidated into ForeachKernelSchedulerNode
2) The backend kernel is consolidated into ComboKernel.

(Note: this is part 1 which only deals with the 1st case above.)

Details:

1. ComboKernel can be viewed as the extension of Foreach kernel (see the examples below). The main differences are: 1) the block size is tunable (but currently shared by the sub-kernels).  2) it supports multiple kernel typs, like pointwise, reduce, and may extend to matmm as well (it doesn't support mixed 1d and 2d kernels yet, but it can be extended for such case) 3) the blocks are interleaved among the sub kernels (can be extended to other arrangement), 4) it is designed to be general enough to combine kernels without dependency and doesn't rely on certain patterns. 5) it doesn't support dynamic sizes yet but can be easily extended for it.

2. ComboKernel is used in two cases: 1) for existing foreach kernels, combo kernels are used as the backend kernel. the front-end kernel generation logic remains the same. 2) Added an extra optimization phase to the end of the scheduler to generate extra combo kernels if combo_kernels is True in config.py

3. The combo kernel generation in the added optimization phase is done in two steps: 1) in the front end inside the scheduler, it topologically sort the schedule nodes to find all the nodes with no data dependency and create a frond end schedule node for them. We currently limit the maximal number of sub-nodes for each combo kernel to 8 (but we still need to find what is the optimal number). 2) then, these sub-nodes are combined in the codegen phase to generate the combo kernel code for them based on a few rules. For example, 1d and 2d kernels are separated into different combo kernels, as mixing them is not supported yet. Note these algorithms we provide are very basic, and the users can register their customized combo kernel generation algorithms for both steps.

4. Performance wise, combining small kernels is about always to see performance gain. however, combining very large kernels may not see any perf gain, sometimes even regression possibly due to improper block sizes. Thus, a benchmark function is implemented to avoid such perf regression, and it is recommended to turn it on by setting benchmark_combo_kernels to True whenever combo_kernels is True.

Example:
- element wise kernels
original Pytorch function:
```
 def test_activations(a, b, c):
     a1 = torch.nn.functional.relu(a)
     b1 = torch.nn.functional.sigmoid(b)
     c1 = torch.nn.functional.tanh(c)
     return a1, b1, c1
```
combokernel
```
triton_heuristics.pointwise(
    size_hints=[512], tile_hint=TileHint.DEFAULT,
    filename=__file__,
    triton_meta={'signature': {0: '*fp32', 1: '*fp32', 2: '*fp32', 3: '*fp32', 4: '*fp32', 5: '*fp32'}, 'device': 0, 'device_type': 'cuda', 'constants': {}, 'configs': [AttrsDescriptor(divisible_by_16=(0, 1, 2, 3, 4, 5), equal_to_1=())]},
    inductor_meta={'kernel_name': 'triton_poi_fused_0', 'mutated_arg_names': []}
)
triton.jit
def triton_(in_ptr0, in_ptr1, in_ptr2, out_ptr0, out_ptr1, out_ptr2, XBLOCK : tl.constexpr):
    pid = tl.program_id(0)
    if pid % 3 == 0:
        pid_offset = pid // 3
        xnumel = 100
        rnumel = 1
        xoffset = pid_offset * XBLOCK
        xindex = xoffset + tl.arange(0, XBLOCK)[:]
        xmask = xindex < xnumel
        x0 = xindex
        tmp0 = tl.load(in_ptr0 + (x0), xmask)
        tmp1 = triton_helpers.maximum(0, tmp0)
        tl.store(out_ptr0 + (x0), tmp1, xmask)
    elif pid % 3 == 1:
        pid_offset = pid // 3
        xnumel = 400
        rnumel = 1
        xoffset = pid_offset * XBLOCK
        xindex = xoffset + tl.arange(0, XBLOCK)[:]
        xmask = xindex < xnumel
        x1 = xindex
        tmp2 = tl.load(in_ptr1 + (x1), xmask)
        tmp3 = tl.sigmoid(tmp2)
        tl.store(out_ptr1 + (x1), tmp3, xmask)
    elif pid % 3 == 2:
        pid_offset = pid // 3
        xnumel = 100
        rnumel = 1
        xoffset = pid_offset * XBLOCK
        xindex = xoffset + tl.arange(0, XBLOCK)[:]
        xmask = xindex < xnumel
        x2 = xindex
        tmp4 = tl.load(in_ptr2 + (x2), xmask)
        tmp5 = libdevice.tanh(tmp4)
        tl.store(out_ptr2 + (x2), tmp5, xmask)
    else:
        pass
```
- reduction kernels
Original Pytorch function:
```
def test_reduce(a, b, c):
     a1 = torch.sum(a, dim=0)
     b1 = torch.max(b, dim=0)
     c1 = torch.min(c, dim=0)
     return a1, b1, c1
```
Generated combokernal:
```
 triton_heuristics.persistent_reduction(
     size_hints=[32, 32],
     reduction_hint=ReductionHint.DEFAULT,
     filename=__file__,
     triton_meta={'signature': {0: '*fp32', 1: '*fp32', 2: '*fp32', 3: '*fp32', 4: '*i64', 5: '*fp32', 6: '*i64', 7: '*fp32'}, 'device': 0, 'device_type': 'cuda', 'constants': {}, 'configs': [AttrsDescriptor(divisible_by_16=(0, 1, 2, 3, 4, 5, 6, 7), equal_to_1=())]},
     inductor_meta={'kernel_name': 'triton_per_fused_0', 'mutated_arg_names': []}
 )
 triton.jit
 def triton_(in_ptr0, in_ptr1, in_ptr2, out_ptr0, out_ptr1, out_ptr2, out_ptr3, out_ptr4, XBLOCK : tl.constexpr):
     pid = tl.program_id(0)
     if pid % 3 == 0:
         pid_offset = pid // 3
         xnumel = 20
         rnumel = 20
         RBLOCK_0: tl.constexpr = 32
         xoffset = pid_offset * XBLOCK
         xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
         xmask = xindex < xnumel
         rindex = tl.arange(0, RBLOCK_0)[None, :]
         roffset = 0
         rmask = rindex < rnumel
         r1 = rindex
         x0 = xindex
         tmp0 = tl.load(in_ptr0 + (x0 + (20*r1)), rmask & xmask, other=0.0)
         tmp1 = tl.broadcast_to(tmp0, [XBLOCK, RBLOCK_0])
         tmp3 = tl.where(rmask & xmask, tmp1, float("-inf"))
         tmp4 = triton_helpers.max2(tmp3, 1)[:, None]
         tmp6 = tl.broadcast_to(rindex, tmp3.shape)
         _, tmp5_tmp = triton_helpers.max_with_index(tmp3, tmp6, 1)
         tmp5 = tmp5_tmp[:, None]
         tl.store(out_ptr0 + (x0), tmp4, xmask)
         tl.store(out_ptr1 + (x0), tmp5, xmask)
     elif pid % 3 == 1:
         pid_offset = pid // 3
         xnumel = 10
         rnumel = 10
         RBLOCK_1: tl.constexpr = 16
         xoffset = pid_offset * XBLOCK
         xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
         xmask = xindex < xnumel
         rindex = tl.arange(0, RBLOCK_1)[None, :]
         roffset = 0
         rmask = rindex < rnumel
         r3 = rindex
         x2 = xindex
         tmp7 = tl.load(in_ptr1 + (x2 + (10*r3)), rmask & xmask, other=0.0)
         tmp8 = tl.broadcast_to(tmp7, [XBLOCK, RBLOCK_1])
         tmp10 = tl.where(rmask & xmask, tmp8, float("inf"))
         tmp11 = triton_helpers.min2(tmp10, 1)[:, None]
         tmp13 = tl.broadcast_to(rindex, tmp10.shape)
         _, tmp12_tmp = triton_helpers.min_with_index(tmp10, tmp13, 1)
         tmp12 = tmp12_tmp[:, None]
         tl.store(out_ptr2 + (x2), tmp11, xmask)
         tl.store(out_ptr3 + (x2), tmp12, xmask)
     elif pid % 3 == 2:
         pid_offset = pid // 3
         xnumel = 10
         rnumel = 10
         RBLOCK_2: tl.constexpr = 16
         xoffset = pid_offset * XBLOCK
         xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
         xmask = xindex < xnumel
         rindex = tl.arange(0, RBLOCK_2)[None, :]
         roffset = 0
         rmask = rindex < rnumel
         r5 = rindex
         x4 = xindex
         tmp14 = tl.load(in_ptr2 + (x4 + (10*r5)), rmask & xmask, other=0.0)
         tmp15 = tl.broadcast_to(tmp14, [XBLOCK, RBLOCK_2])
         tmp17 = tl.where(rmask & xmask, tmp15, 0)
         tmp18 = tl.sum(tmp17, 1)[:, None]
         tl.store(out_ptr4 + (x4), tmp18, xmask)
     else:
         pass
```

Note: ComboKernels uses masks to allow combination of kernels working with tensors of different sizes.

Test Plan:
```
buck2 test mode/dev-nosan caffe2/test/inductor:foreach
```
```
buck2 test mode/dev-nosan caffe2/test/inductor:combo_kernels
```

Differential Revision: D54134695

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124969
Approved by: https://github.com/mlazos
2024-07-23 17:34:28 +00:00
Yueming Hao
979429ca89 [inductor]Add DtypeView to avoid memory leak and unnecessary kernel generations (#128883)
Fixes #126338
## Issue Summary

When torchinductor compiles the combination `functional_collective -> view.dtype -> wait`, a memory leak occurs. This happens because `view.dtype` is compiled into an out-of-place Triton kernel that copies the input data to a new tensor, even if the data hasn't completed collection via the wait operation. The tensor used by `collective` is only freed when the `wait` operation triggers the garbage collector, see [~WorkRegistry](https://github.com/pytorch/pytorch/blob/main/torch/csrc/distributed/c10d/Functional.cpp#L41). However, since `wait` now waits for a new tensor, the previous one is never freed. The `view.dtype` should only check the metadata instead of creating a new tensor. The current lowering is against its semantics and causes memory leaks.

See more great discussions in the #126338

This kind of lowering also generates unnecessary triton kernels for `view.dtype` when it can't be fused with other operations.

## Fix
The function `aten.view.dtype` is a CPU operation that changes the metadata of its input. After discussions with @eellison and @bdhirsh, we decided to change the lowering of `aten.view.dtype` to ensure it fallback properly to the correct `aten.view.dtype` instead of generating a Triton kernel in some cases. This approach also preserves the same semantics of the view operation.
When the model calls `aten.view.dtype` with a data type whose bit width matches the input's original data type, we lower it to the newly added `DtypeView` in IR, acting like a `ReinterpretView`. When the operation can be fused, its `make_loader` is called to maintain the correct type conversion for each load instruction. When the operation can't be fused, it falls back to `aten.view.dtype` to avoid Triton kernel generation.

## Example

```python
@torch.compile
def fn(x, y):
    x = x.view(torch.float16)
    y = y.view(torch.float16) + 1
    return x @ y

x = torch.randn((2, 2), device=self.device, dtype=torch.bfloat16)
y = torch.randn((2, 2), device=self.device, dtype=torch.bfloat16)
fn(x, y)
```
The output code generated before this fix is like the following.
```python
triton_poi_fused_add_view_0...
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 4
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (x0), xmask).to(tl.float32)
    tmp1 = tmp0.to(tl.bfloat16).to(tl.float32, bitcast=True).to(tl.float32)
    tl.store(out_ptr0 + (x0), tmp1, xmask)

triton_poi_fused_add_view_1...
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 4
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (x0), xmask).to(tl.float32)
    tmp1 = tmp0.to(tl.bfloat16).to(tl.float32, bitcast=True).to(tl.float32)
    tmp2 = 1.0
    tmp3 = tmp1 + tmp2
    tl.store(out_ptr0 + (x0), tmp3, xmask)

def call(args):
...
        triton_poi_fused_view_0.run(arg0_1, buf0, 4, grid=grid(4), stream=stream0)
        del arg0_1
        buf1 = empty_strided_cuda((2, 2), (2, 1), torch.float16)
        # Source Nodes: [view_1, y], Original ATen: [aten.add, aten.view]
        triton_poi_fused_add_view_1.run(arg1_1, buf1, 4, grid=grid(4), stream=stream0)
        del arg1_1
        buf2 = empty_strided_cuda((2, 2), (2, 1), torch.float16)
        # Source Nodes: [matmul, view_1, x, y], Original ATen: [aten.add, aten.mm, aten.view]
        extern_kernels.mm(buf0, buf1, out=buf2)
```
As you can see, the two `view` operations are compiled to two kernels `triton_poi_fused_view_0` nad `triton_poi_fused_add_view_1`. Both of them has a line `tmp1 = tmp0.to(tl.bfloat16).to(tl.float32, bitcast=True).to(tl.float32)` which does the type conversion.

The main issue is that the first `view` operation didn't do anything to the actual data. But it generates a triton kernel with a new output tensor. Another small issue is that this triton kernel can't be compiled because `bitcast=True` only support type converstion with same bidwidth.

The following are output code generated after this PR.

```python
triton_poi_fused_add_0...
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 4
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (x0), xmask).to(tl.float32)
    tmp1 = tmp0.to(tl.bfloat16).to(tl.float32)
    tmp2 = 1.0
    tmp3 = tmp1 + tmp2
    tl.store(out_ptr0 + (x0), tmp3, xmask)
def call(args):
...
        triton_poi_fused_add_0.run(arg1_1, buf0, 4, grid=grid(4), stream=stream0)
        del arg1_1
        buf1 = empty_strided_cuda((2, 2), (2, 1), torch.float16)
        # Source Nodes: [matmul, y], Original ATen: [aten.add, aten.mm]
        extern_kernels.mm(aten.view.dtype(arg0_1, torch.float16), buf0, out=buf1)
```
The first `view` operation has been replaced with the `aten.view.dtype` and it is directly passed as an argument. The second one is still there because it is fused with the following add operation. The invalid bitcast operation is removed too.

The following two code snippets is for the upcasts and downcasts. For dtype in `torch.float16, torch.bfloat16`, each load will be upcasted to float32, then downcast to its original dtype to ensure use values with the right precision.

7bda23ef84/torch/_inductor/codegen/triton.py (L1725-L1726)
7bda23ef84/torch/_inductor/codegen/triton.py (L629-L642)

Huge thanks to @eellison, @bdhirsh, @shunting314, and @desertfire .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128883
Approved by: https://github.com/eellison
2024-07-23 17:31:39 +00:00
eellison
16a2a1aad3 Annotate graph.py (#131400)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131400
Approved by: https://github.com/shunting314
2024-07-23 07:04:12 +00:00
xinan.lin
8da19fec60 [Inductor] Support store SPIR-V binary file output from Intel Triton. (#130849)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130849
Approved by: https://github.com/peterbell10, https://github.com/EikanWang
2024-07-22 05:59:03 +00:00
Xuehai Pan
b6d477fd56 [BE][Easy][16/19] enforce style for empty lines in import segments in torch/_i*/ (#129768)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129768
Approved by: https://github.com/jansel
2024-07-20 16:20:58 +00:00
Peter Bell
27c2a0d63b [inductor] Separate Buffer and Operation into two concepts (#130831)
Resubmit of #128893

Currently a buffer represents both a tensor with physical storage and a
computation that produces the tensor as a result.

This PR attempts to split these into two different concepts in the scheduler.
This should allow us to have multiple outputs from a single operation.

Differential Revision: [D59876059](https://our.internmc.facebook.com/intern/diff/D59876059)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130831
Approved by: https://github.com/lezcano
2024-07-20 02:05:07 +00:00
Bin Bao
752c817898 [AOTI][refactor] Unify UserDefinedTritonKernel.codegen (#130796)
Summary: Unify the argment codegen logic between python wrapper and cpp wrapper.

Differential Revision: [D59809273](https://our.internmc.facebook.com/intern/diff/D59809273)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130796
Approved by: https://github.com/oulgen
2024-07-17 18:37:23 +00:00
Xuehai Pan
4d7bf72d93 [BE][Easy] fix ruff rule needless-bool (SIM103) (#130206)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130206
Approved by: https://github.com/malfet
2024-07-14 08:17:52 +00:00