Commit Graph

55 Commits

Author SHA1 Message Date
Jason Ansel
fed37dbfbc [inductor] Cooperative reductions (#137756)
Example generated code for `(x+y).sum()`:
```py
@triton.jit
def triton_unk_fused_add_sum_0(in_ptr0, in_ptr1, out_ptr0, ws_ptr, semaphores_ptr, xnumel, rnumel, XBLOCK : tl.constexpr, RBLOCK : tl.constexpr, RSPLIT : tl.constexpr):
    xnumel = 1
    rnumel = 1048576
    rsplit_id = tl.program_id(0)
    num_rblocks = (rnumel + RBLOCK - 1) // RBLOCK
    rsplit_chunk = (num_rblocks + RSPLIT - 1) // RSPLIT * RBLOCK
    rsplit_start = rsplit_chunk * rsplit_id
    rsplit_end = rsplit_chunk * (rsplit_id + 1)
    xoffset = tl.program_id(1) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
    xmask = tl.full([XBLOCK, RBLOCK], True, tl.int1)
    rbase = tl.arange(0, RBLOCK)[None, :]
    _tmp4 = tl.full([XBLOCK, RBLOCK], 0, tl.float32)
    for roffset in range(rsplit_start, rsplit_end, RBLOCK):
        rindex = roffset + rbase
        rmask = rindex < rnumel
        r0 = rindex
        tmp0 = tl.load(in_ptr0 + (r0), rmask, eviction_policy='evict_first', other=0.0)
        tmp1 = tl.load(in_ptr1 + (r0), rmask, eviction_policy='evict_first', other=0.0)
        tmp2 = tmp0 + tmp1
        tmp3 = tl.broadcast_to(tmp2, [XBLOCK, RBLOCK])
        tmp5 = _tmp4 + tmp3
        _tmp4 = tl.where(rmask, tmp5, _tmp4)
    tmp4 = tl.sum(_tmp4, 1)[:, None]
    if RSPLIT > 1:
        tmp4_ws = (ws_ptr + 0).to(tl.pointer_type(tl.float32))
        tl.store(tmp4_ws + (xindex * RSPLIT + rsplit_id), tmp4, None)
    if RSPLIT > 1:
        triton_helpers.gpu_barrier(semaphores_ptr + (2 * tl.program_id(1) + 0), RSPLIT, True)
    if RSPLIT > 1:
        tmp4_peers = tl.load(tmp4_ws + (xindex * RSPLIT + tl.arange(0, RSPLIT)[None,:]), None, eviction_policy='evict_first')
        tmp4 = tl.sum(tmp4_peers, 1)[:, None]
    if rsplit_id == (0 % RSPLIT):
        tl.store(out_ptr0 + (tl.full([XBLOCK, 1], 0, tl.int32)), tmp4, None)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137756
Approved by: https://github.com/eellison
ghstack dependencies: #138970
2024-10-27 16:31:38 +00:00
Tom Ritchford
932ae131fb Remove an unused variable in _inductor/codegen/simd.py (#138000)
* Extracted from https://github.com/pytorch/pytorch/pull/133492

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138000
Approved by: https://github.com/Skylion007
2024-10-16 13:54:21 +00:00
Jason Ansel
7480e6938d [inductor] Add LoopBody.op_counts (#137945)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137945
Approved by: https://github.com/eellison
ghstack dependencies: #137946
2024-10-16 06:35:10 +00:00
David Berard
54094c0c26 [inductor][user triton] Check size hints to determine indexing dtype (#137234)
Previously, all integer inputs to user-defined triton kernels were assumed to be int32. This would result in errors if your input was actually an int64.

This PR checks the value to determine which dtype to use for indexing: if it is known to be < int_max, then use int32 (and add guards if relevant); if we can't check (e.g. unbacked symint), then use int64.

Differential Revision: [D63797975](https://our.internmc.facebook.com/intern/diff/D63797975)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137234
Approved by: https://github.com/eellison
2024-10-03 22:07:26 +00:00
Blaine Burton Rister
86631eccda [Inductor] Remove stride-0 dimensions from more complex block pointers (#135557)
Related issue: #125077

### Feature
Inductor tries to remove dimensions with stride 0 from block pointers. Rather than loading with stride 0, it's more efficient to load a smaller block pointer, then use `tl.broadcast_to` to broadcast it up to the desired size. This already worked for simpler block pointers, but it was disabled for more complex block pointers which used `tl.reshape` to change the dimensionality after loading.

This PR generalizes the approach to work for all block pointers. The idea is to first reshape, adding singleton dimensions, then broadcast those singletons up to something larger, then reshape again to the final output shape. For readability, we emit this code only if it actually does something. Simpler loads will just have `tl.load`.

Here's an example of a complicated kernel that uses `reshape` -> `load` -> `reshape`. (The first reshape is actually the slice `[None,None,:]`).
```
@triton.jit
def triton_(in_ptr0, in_ptr1, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 64
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x2 = xindex
    x1 = (xindex // 8)
    tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[64], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), boundary_check=[0])
    tmp1 = tl.reshape(tl.broadcast_to(tl.load(tl.make_block_ptr(in_ptr1, shape=[8], strides=[8], block_shape=[((7 + XBLOCK) // 8)], order=[0], offsets=[(xoffset // 8)]), boundary_check=[0], eviction_policy='evict_last')[:, None, None], [((7 + XBLOCK) // 8), ((1) * ((1) <= (((7 + XBLOCK) // 8))) + (((7 + XBLOCK) // 8)) * ((((7 + XBLOCK) // 8)) < (1))), ((8) * ((8) <= (XBLOCK)) + (XBLOCK) * ((XBLOCK) < (8)))]), [XBLOCK])
    tmp2 = tmp0 + tmp1
    tl.store(tl.make_block_ptr(out_ptr0, shape=[64], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), tmp2.to(tl.float32), boundary_check=[0])
''', device_str='cuda')
```

Before this PR, we would have stride-0 dimensions:
```
@triton.jit
def triton_(in_ptr0, in_ptr1, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 64
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x2 = xindex
    x1 = (xindex // 8)
    tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[64], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), boundary_check=[0])
    tmp1 = tl.reshape(tl.load(tl.make_block_ptr(in_ptr1, shape=[8, 1, 8], strides=[8, 0, 0], block_shape=[((7 + XBLOCK) // 8), ((1) * ((1) <= (((7 + XBLOCK) // 8))) + (((7 + XBLOCK) // 8)) * ((((7 + XBLOCK) // 8)) < (1))), ((8) * ((8) <= (XBLOCK)) + (XBLOCK) * ((XBLOCK) < (8)))], order=[2, 1, 0], offsets=[(xoffset // 8), 0, xoffset % 8]), boundary_check=[0], eviction_policy='evict_last'), [XBLOCK])
    tmp2 = tmp0 + tmp1
    tl.store(tl.make_block_ptr(out_ptr0, shape=[64], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), tl.broadcast_to(tmp2, [XBLOCK]).to(tl.float32), boundary_check=[0])
''', device_str='cuda')
```

Here's a simpler example where we use 2D tiling. In this case we don't actually need the broadcast. The broadcast is implied via a slice adding a new singleton dimension. This code is not changed by this PR, but it's important to know that we don't accidentally insert unnecessary broadcasts.
```
@triton.jit
def triton_(in_ptr0, in_ptr1, out_ptr0, ynumel, xnumel, YBLOCK : tl.constexpr, XBLOCK : tl.constexpr):
    ynumel = 8
    xnumel = 8
    yoffset = tl.program_id(1) * YBLOCK
    yindex = yoffset + tl.arange(0, YBLOCK)[None, :]
    ymask = yindex < ynumel
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
    xmask = xindex < xnumel
    x1 = xindex
    y0 = yindex
    tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[8, 8], strides=[1, 8], block_shape=[XBLOCK, YBLOCK], order=[1, 0], offsets=[xoffset, yoffset]), boundary_check=[0, 1])
    tmp1 = tl.load(tl.make_block_ptr(in_ptr1, shape=[8], strides=[8], block_shape=[YBLOCK], order=[0], offsets=[yoffset]), boundary_check=[0], eviction_policy='evict_last')[None, :]
    tmp2 = tmp0 + tmp1
    tl.store(tl.make_block_ptr(out_ptr0, shape=[8, 8], strides=[1, 8], block_shape=[XBLOCK, YBLOCK], order=[1, 0], offsets=[xoffset, yoffset]), tmp2.to(tl.float32), boundary_check=[0, 1])
''', device_str='cuda')
```
### Test Plan
Added a new expecttest to check the emitted code for broadcast addition. Looking at the test, we can see that stride 0 dimensions are removed. (This test generated the example kernels in the previous section.)

This change also removed a stride-0 dimension in an existing block pointer test. I updated the expected code accordingly.

Bonus: I noticed that the test parametrization for `config.prefer_nd_tiling` wasn't working as intended. It ended up always setting this option to `True`. Fixed it so we get the intended test coverage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135557
Approved by: https://github.com/shunting314, https://github.com/jansel

Co-authored-by: Yueming Hao <yhao@meta.com>
2024-09-27 04:01:40 +00:00
Jason Ansel
02169364e1 [inductor] Split reduction loops when there is no shared reads (#134307)
Fixes #129102

![image](https://github.com/user-attachments/assets/0d00f75b-2bb9-4ce6-a0d9-2daceaff539c)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134307
Approved by: https://github.com/shunting314
2024-09-12 09:45:08 +00:00
Rachel Guo
ae3aa8ff73 [AOTI][Tooling][5/n] Refactor the debug printer call to a level lower (#134789)
Summary:
1. Move the debug printer call a level lower -> at here
:https://www.internalfb.com/code/fbsource/[931d7bbb9e7cf2dcb926f42718f56fc940903eec]/fbcode/caffe2/torch/_inductor/codegen/cpp_wrapper_cuda.py?lines=335
2. Add UT for validating debug printer for user defined triton kernel codegen

The benefit of having the debug printer call happens at a more centralized place is 1) reduce the duplicate debug printer related logic code scattered everywhere in the codebase 2) it can handle more triton kernel codegen path as long as it invokes this `generate_kernel_call()` for example,  it can automatically handle/support user_defined_kernel 's debug printing which is a pretty common use case we encounter in debugging

Test Plan:
```AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=2 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1  TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+graph, inductor, +schedule, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_aoti_debug_printer_user_defined_triton_kernel_abi_compatible_cuda```

Also verified that templateKernel codegen path still works

Differential Revision: D61949020

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134789
Approved by: https://github.com/ColinPeppler
2024-09-04 02:41:30 +00:00
Jason Ansel
76f975948e [inductor] Cleanup generate_node_schedule (#134306)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134306
Approved by: https://github.com/shunting314
2024-08-29 02:45:14 +00:00
Rachel Guo
89929d9abc [AOTI][Tooling][4/n] Add torch.save() for individual intermediate tensor (#133871)
Differential Revision: D61415304

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133871
Approved by: https://github.com/ColinPeppler
2024-08-28 04:48:00 +00:00
Aaron Orenstein
187d55018a [BE] Fix MYPY issues (#133872)
Fix some mypy issues that have crept in to the trunk.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133872
Approved by: https://github.com/oulgen, https://github.com/Skylion007
2024-08-20 16:12:04 +00:00
Xuehai Pan
758a0a88a2 [BE][Easy] enable ruff rule PIE790: unnecessary pass statement (#133200)
This PR removes unnecessary `pass` statement. This is semanticly safe because the bytecode for the Python code does not change.

Note that if there is a docstring in the function, a empty function does not need a `pass` statement as placeholder.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133200
Approved by: https://github.com/malfet, https://github.com/eqy, https://github.com/kit1980
2024-08-15 15:50:19 +00:00
Rachel Guo
3965f11837 Minor type annotation updates following up D60954888 (#133382)
Summary: As title.

Test Plan:
CI

Ran lintrunner locally but might have to continue to keep an eye on more oss linting issue if comes up.

Differential Revision: D61240900

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133382
Approved by: https://github.com/ColinPeppler
2024-08-14 21:36:42 +00:00
Rachel Guo
c17d26c3c1 [AOTI][Tooling] A couple fixes / minor updates for initial debug printer (#133016)
Summary:
Follow up small diff to fix a couple issues:
-  add condition for cuda/gpu case to only print kernel name list in the second pass i.e. when we do the cpp wrapper codegen

- other minor fixes around `AOT_INDUCTOR_FILTERED_KERNELS_TO_PRINT` option

Test Plan:
```
AOT_INDUCTOR_FILTERED_KERNELS_TO_PRINT="triton_poi_fused_0" AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=1 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1  TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+graph, inductor, +schedule, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_addmm_abi_compatible_cuda
```

Differential Revision: D60954888

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133016
Approved by: https://github.com/ColinPeppler
2024-08-13 23:00:29 +00:00
Blaine Burton Rister
3d0de6e1cd [Inductor] Add config option to force higher-dimensional tiling (#132937)
Fixes #125077

**Feature**

This PR creates a new Inductor config, `config.triton.prefer_nd_tiling`, which is disabled by default. When enabled, this encourages the Triton code to use as many tiling dimensions as possible. This simplifies indexing expressions for discontiguous tensors, resulting in expressions like `5 * x + 8 * y` as opposed to `5 * (x // 7) + 8 * (y % 9)`. This allows us to find more block pointers than we normally would. We should now see simplified indexing expressions as long as:
 1. All discontiguous reads/writes have the same shape.
 2. The number of discontiguous dimensions is less than `config.triton.max_tiles`.

 Here's an example kernel (elementwise add of views) with ND tiling disabled:
 ```
 @triton.jit
def triton_(in_ptr0, in_ptr1, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 21
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex % 7
    x1 = (xindex // 7)
    x2 = xindex
    tmp0 = tl.load(in_ptr0 + (x0 + (9*x1)), xmask)
    tmp1 = tl.load(in_ptr1 + (x0 + (9*x1)), xmask)
    tmp2 = tmp0 + tmp1
    tl.store(tl.make_block_ptr(out_ptr0, shape=[21], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), tl.broadcast_to(tmp2, [XBLOCK]).to(tl.float32), boundary_check=[0])
''', device_str='cuda')
 ```

 And here's the version with it enabled:
 ```
 @triton.jit
def triton_(in_ptr0, in_ptr1, out_ptr0, ynumel, xnumel, YBLOCK : tl.constexpr, XBLOCK : tl.constexpr):
    ynumel = 3
    xnumel = 7
    yoffset = tl.program_id(1) * YBLOCK
    yindex = yoffset + tl.arange(0, YBLOCK)[None, :]
    ymask = yindex < ynumel
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
    xmask = xindex < xnumel
    x1 = xindex
    y0 = yindex
    tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[7, 3], strides=[1, 9], block_shape=[XBLOCK, YBLOCK], order=[1, 0], offsets=[xoffset, yoffset]), boundary_check=[0, 1], eviction_policy='evict_last')
    tmp1 = tl.load(tl.make_block_ptr(in_ptr1, shape=[7, 3], strides=[1, 9], block_shape=[XBLOCK, YBLOCK], order=[1, 0], offsets=[xoffset, yoffset]), boundary_check=[0, 1], eviction_policy='evict_last')
    tmp2 = tmp0 + tmp1
    tl.store(tl.make_block_ptr(out_ptr0, shape=[7, 3], strides=[1, 7], block_shape=[XBLOCK, YBLOCK], order=[1, 0], offsets=[xoffset, yoffset]), tl.broadcast_to(tmp2, [XBLOCK, YBLOCK]).to(tl.float32), boundary_check=[0, 1])
''', device_str='cuda')
 ```

 With this feature enabled, we get a discontiguous strided block pointer. Previously, this would only have worked for specific shapes, like powers of 2 or multiples of the maximum block size. With this PR, we can support arbitrary shapes so long as we have enough tiles to cover all discontiguous dimensions.

**Test plan**

This PR adds some tests for pointwise ops with discontiguous tensors.
 - Test that we can generate block pointers for views with odd shapes like `(5,7)`, `(9,3,5)`, etc.
 - Test that we can generate block pointers for a single discontiguous dim in 3D and 4D tensors.
 - Test that we generate a 2D tiling for a 5D tensor with two discontiguous dims. This case doesn't generate a block pointer, but it checks that the output code is at least correct.

This PR also parametrizes some existing tests to run with and without `triton.prefer_nd_tiling`. That way, we ensure this feature doesn't break existing usage.

Since this setting isn't enabled on most tests, I also created https://github.com/pytorch/pytorch/pull/132935 to test what happens when `triton.prefer_nd_tiling=True` by default. None of the failures seem related to invalid tiling, so I think this feature is safe to merge.

**Limitations and follow-ups**

I can see two main improvements which would expand the usefulness of this feature:

1. This feature currently only works for pointwise kernels, since reductions are never tiled. As a follow-up, we could enable tiled reductions to extend these benefits to reduction kernels.

2. The usefulness of this feature depends on `triton.config.max_tiles`. This is currently restricted to 2 by default, although it can be increased to 3 in certain cases. To support more discontiguous dims, we might consider expanding support for 3D tiling, or even supporting ND tiling, by mapping an ND "virtual" launch grid onto Triton's 3D launch grid.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132937
Approved by: https://github.com/jansel, https://github.com/eellison
2024-08-08 22:11:56 +00:00
Rachel Guo
5709375d56 [AOTI][tooling][1/n] Add intermediate value debug printer (#132323)
Summary:
**Context:**

Currently we have a helper to print out AtenTensor in [shim_common.cpp](https://github.com/pytorch/pytorch/blob/v2.4.0-rc4/torch/csrc/inductor/aoti_torch/shim_common.cpp#L866)

The way we were using this function was a “manual” process. We inject this function into the generated output.cpp file, and recompile and reload the file. This diff automates the printing value process.

**Changes:**

1. Added a simple initial debug printer helper to print out tensor values

2. Added a filter option to selectively dump tensor values.

**Usage:**

Sample cmd :

```
AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=1 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1  TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+inductor, +schedule, output_code"  python test/inductor/test_aot_inductor.py -k test_addmm_abi_compatible_cuda
```

Sample outputs :
```
[  before_launch - triton_poi_fused_0 - buf0  ]:
 0.6331
 1.6358
-0.3459
 1.0196
-0.4122
 1.4279
[ CUDAFloatType{6} ]
Min value: -0.412198
Max value: 1.63582
Device: cuda:0
Size: [6]
Stride: [1]
Dtype: float
Layout: Strided
Number of elements: 6
Is contiguous: 1
Requires grad: 0

[  after_launch - triton_poi_fused_0 - buf0  ]:
 0.6331
 1.6358
-0.3459
 1.0196
-0.4122
 1.4279
[ CUDAFloatType{6} ]
Min value: -0.412198
Max value: 1.63582
Device: cuda:0
Size: [6]
Stride: [1]
Dtype: float
Layout: Strided
Number of elements: 6
Is contiguous: 1
Requires grad: 0

[ before_launch - aoti_torch_cuda_addmm_out - buf1  ]:
Min value: -2.25655
Max value: 2.32996
Device: cuda:0
Size: [16, 6]
Stride: [6, 1]
Dtype: float
Layout: Strided
Number of elements: 96
Is contiguous: 1
Requires grad: 0

[  before_launch - aoti_torch_cuda_addmm_out - buf0  ]:
 0.6331
 1.6358
-0.3459
 1.0196
-0.4122
 1.4279
[ CUDAFloatType{6} ]
Min value: -0.412198
Max value: 1.63582
Device: cuda:0
Size: [6]
Stride: [1]
Dtype: float
Layout: Strided
Number of elements: 6
Is contiguous: 1
Requires grad: 0

[  after_launch - aoti_torch_cuda_addmm_out - buf1  ]:
Min value: -12.0839
Max value: 11.6878
Device: cuda:0
Size: [16, 6]
Stride: [6, 1]
Dtype: float
Layout: Strided
Number of elements: 96
Is contiguous: 1
Requires grad: 0

[  after_launch - aoti_torch_cuda_addmm_out - buf0  ]:
 0.6331
 1.6358
-0.3459
 1.0196
-0.4122
 1.4279
[ CUDAFloatType{6} ]
Min value: -0.412198
Max value: 1.63582
Device: cuda:0
Size: [6]
Stride: [1]
Dtype: float
Layout: Strided
Number of elements: 6
Is contiguous: 1
Requires grad: 0

stats [('calls_captured', 1), ('unique_graphs', 1)]
inductor [('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('extern_calls', 2)]
.
----------------------------------------------------------------------
Ran 1 test in 10.867s

OK

```

The user is able to filter kernel names to print out values by specifying env var `AOT_INDUCTOR_FILTERED_KERNELS_TO_PRINT` and see choices of kernel names in a log message like below:
```
torch/_inductor/graph.py:1642] Finished codegen for all nodes. The list of kernel names available: ['triton_poi_fused_0', 'aoti_torch_cuda_addmm_out']

```

In the follow-up diff, will add `torch.save()` to dump/save the intermediate tensors into individual `.pt` files that can be further  `torch.load()`.

Test Plan:
Run Unit Tests in OSS: (similar cmd as mentioned above in the usage part)

 `AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=1 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1  TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+inductor, output_code"  python test/inductor/test_aot_inductor.py -k test_addmm_abi_compatible_cuda`

Differential Revision: D60538496

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132323
Approved by: https://github.com/ColinPeppler
2024-08-08 01:39:59 +00:00
Feng Shi
55b0c39d82 Reland "[1/2] PT2 Inductor ComboKernels - Foreach cases (#124969)" (#132182)
Summary:
Reland #124969 by backing out D60397377 "Back out "[1/2] PT2 Inductor ComboKernels - Foreach cases  (#124969)""

The original diff D54134695 was reverted because of failure of ads nightly cogwheel tests.

The root cause: the logic for generating mask in Triton kernel needed update after a recent refactoring on triton.py. This diff includes the fix of the root cause.

See D54134695 or #124969 for more details.

Test Plan:
Originally failed tests
f585704630
f585733786

Diff patched:
f586664028
f586663820

Differential Revision: D60458597

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132182
Approved by: https://github.com/Yuzhen11
2024-08-05 06:57:30 +00:00
Oguz Ulgen
09f9c256ad Add basic mypy annotations to inductor (#132416)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132416
Approved by: https://github.com/XuehaiPan, https://github.com/jamesjwu
ghstack dependencies: #132415
2024-08-04 18:43:37 +00:00
PyTorch MergeBot
f2ddd5e9e0 Revert "Add basic mypy annotations to inductor (#132416)"
This reverts commit 78927d37f6.

Reverted https://github.com/pytorch/pytorch/pull/132416 on behalf of https://github.com/ZainRizvi due to Sorry, this PR has entered a weird state in the diff train. Trying to revert it to skip it, and then we can try relanding it ([comment](https://github.com/pytorch/pytorch/pull/132415#issuecomment-2267631785))
2024-08-04 18:39:29 +00:00
Oguz Ulgen
78927d37f6 Add basic mypy annotations to inductor (#132416)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132416
Approved by: https://github.com/XuehaiPan, https://github.com/jamesjwu
ghstack dependencies: #132415
2024-08-01 20:14:25 +00:00
eellison
f32ab3b9e3 Migrate Inductor scheduler, dependencies, ir, and codegen/common to use OrderedSet (#130004)
Python's set is non deterministic. There is an internal failure which we recently ran into which did not consistently fail.

See, repro here: P1453035092.

Now, with these changes, it does consistently fail. In follow ups we could also consider adding a lintrule for uses of either set() or set literals.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130004
Approved by: https://github.com/oulgen
2024-08-01 04:37:15 +00:00
PyTorch MergeBot
784a6ec5a3 Revert "Migrate Inductor scheduler, dependencies, ir, and codegen/common to use OrderedSet (#130004)"
This reverts commit 13d744464f.

Reverted https://github.com/pytorch/pytorch/pull/130004 on behalf of https://github.com/clee2000 due to broke lint [GH job link](https://github.com/pytorch/pytorch/actions/runs/10183945999/job/28170099930) [HUD commit link](13d744464f) probably a landrace, the base is 21 hours old ([comment](https://github.com/pytorch/pytorch/pull/130004#issuecomment-2260946562))
2024-07-31 16:49:21 +00:00
eellison
13d744464f Migrate Inductor scheduler, dependencies, ir, and codegen/common to use OrderedSet (#130004)
Python's set is non deterministic. There is an internal failure which we recently ran into which did not consistently fail.

See, repro here: P1453035092.

Now, with these changes, it does consistently fail. In follow ups we could also consider adding a lintrule for uses of either set() or set literals.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130004
Approved by: https://github.com/oulgen
2024-07-31 16:22:11 +00:00
Yuzhen Huang
5298acb5c7 Back out "[1/2] PT2 Inductor ComboKernels - Foreach cases (#124969)" (#132065)
Summary:
Original commit changeset: 1d8cfdcef69d

Original Phabricator Diff: D54134695

back out: D54134695

Test Plan: more details see: https://docs.google.com/document/d/1noPTmTdNYHVDFyk7AJSSO7jQoNw6fTo4o6k9eTNeZh8/edit#heading=h.xeo30usu77nc

Reviewed By: zw2326

Differential Revision: D60397377

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132065
Approved by: https://github.com/zw2326, https://github.com/qchip
2024-07-29 22:48:29 +00:00
Peter Bell
9ae288f4be [inductor] Simplify multi-kernel codegen by unifying kernel args (#127724)
Persistent kernels are sometimes able to remove intermediate buffers that would
otherwise be needed for the non-persistent reduction kernel. This makes
multi kernel's codegen more complicated as it needs to drop these extra
arguments at runtime after selecting the correct kernel to run.

Instead, this PR updates the persistent kernel's `must_keep_buffers` so these
aren't dropped during codegen so both kernels have the same signature.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127724
Approved by: https://github.com/shunting314
ghstack dependencies: #131044
2024-07-26 00:12:43 +00:00
Yunqiu Guo
059f9fb30b [BE][inductor] Type annotate codecache.py and config.py (#131427)
As title.

Checked/ Referred to the raw json file for runtime types . (and tried to cover all the missing annotations listed in the .json) this time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131427
Approved by: https://github.com/eellison, https://github.com/oulgen
2024-07-25 05:54:38 +00:00
Feng Shi
404d640c39 [1/2] PT2 Inductor ComboKernels - Foreach cases (#124969)
Summary:
A ComboKernel combines independent Inductor Triton kernels into a single one.
Consolidation with Foreach kernel:
1) For the scheduler node, the logic is consolidated into ForeachKernelSchedulerNode
2) The backend kernel is consolidated into ComboKernel.

(Note: this is part 1 which only deals with the 1st case above.)

Details:

1. ComboKernel can be viewed as the extension of Foreach kernel (see the examples below). The main differences are: 1) the block size is tunable (but currently shared by the sub-kernels).  2) it supports multiple kernel typs, like pointwise, reduce, and may extend to matmm as well (it doesn't support mixed 1d and 2d kernels yet, but it can be extended for such case) 3) the blocks are interleaved among the sub kernels (can be extended to other arrangement), 4) it is designed to be general enough to combine kernels without dependency and doesn't rely on certain patterns. 5) it doesn't support dynamic sizes yet but can be easily extended for it.

2. ComboKernel is used in two cases: 1) for existing foreach kernels, combo kernels are used as the backend kernel. the front-end kernel generation logic remains the same. 2) Added an extra optimization phase to the end of the scheduler to generate extra combo kernels if combo_kernels is True in config.py

3. The combo kernel generation in the added optimization phase is done in two steps: 1) in the front end inside the scheduler, it topologically sort the schedule nodes to find all the nodes with no data dependency and create a frond end schedule node for them. We currently limit the maximal number of sub-nodes for each combo kernel to 8 (but we still need to find what is the optimal number). 2) then, these sub-nodes are combined in the codegen phase to generate the combo kernel code for them based on a few rules. For example, 1d and 2d kernels are separated into different combo kernels, as mixing them is not supported yet. Note these algorithms we provide are very basic, and the users can register their customized combo kernel generation algorithms for both steps.

4. Performance wise, combining small kernels is about always to see performance gain. however, combining very large kernels may not see any perf gain, sometimes even regression possibly due to improper block sizes. Thus, a benchmark function is implemented to avoid such perf regression, and it is recommended to turn it on by setting benchmark_combo_kernels to True whenever combo_kernels is True.

Example:
- element wise kernels
original Pytorch function:
```
 def test_activations(a, b, c):
     a1 = torch.nn.functional.relu(a)
     b1 = torch.nn.functional.sigmoid(b)
     c1 = torch.nn.functional.tanh(c)
     return a1, b1, c1
```
combokernel
```
triton_heuristics.pointwise(
    size_hints=[512], tile_hint=TileHint.DEFAULT,
    filename=__file__,
    triton_meta={'signature': {0: '*fp32', 1: '*fp32', 2: '*fp32', 3: '*fp32', 4: '*fp32', 5: '*fp32'}, 'device': 0, 'device_type': 'cuda', 'constants': {}, 'configs': [AttrsDescriptor(divisible_by_16=(0, 1, 2, 3, 4, 5), equal_to_1=())]},
    inductor_meta={'kernel_name': 'triton_poi_fused_0', 'mutated_arg_names': []}
)
triton.jit
def triton_(in_ptr0, in_ptr1, in_ptr2, out_ptr0, out_ptr1, out_ptr2, XBLOCK : tl.constexpr):
    pid = tl.program_id(0)
    if pid % 3 == 0:
        pid_offset = pid // 3
        xnumel = 100
        rnumel = 1
        xoffset = pid_offset * XBLOCK
        xindex = xoffset + tl.arange(0, XBLOCK)[:]
        xmask = xindex < xnumel
        x0 = xindex
        tmp0 = tl.load(in_ptr0 + (x0), xmask)
        tmp1 = triton_helpers.maximum(0, tmp0)
        tl.store(out_ptr0 + (x0), tmp1, xmask)
    elif pid % 3 == 1:
        pid_offset = pid // 3
        xnumel = 400
        rnumel = 1
        xoffset = pid_offset * XBLOCK
        xindex = xoffset + tl.arange(0, XBLOCK)[:]
        xmask = xindex < xnumel
        x1 = xindex
        tmp2 = tl.load(in_ptr1 + (x1), xmask)
        tmp3 = tl.sigmoid(tmp2)
        tl.store(out_ptr1 + (x1), tmp3, xmask)
    elif pid % 3 == 2:
        pid_offset = pid // 3
        xnumel = 100
        rnumel = 1
        xoffset = pid_offset * XBLOCK
        xindex = xoffset + tl.arange(0, XBLOCK)[:]
        xmask = xindex < xnumel
        x2 = xindex
        tmp4 = tl.load(in_ptr2 + (x2), xmask)
        tmp5 = libdevice.tanh(tmp4)
        tl.store(out_ptr2 + (x2), tmp5, xmask)
    else:
        pass
```
- reduction kernels
Original Pytorch function:
```
def test_reduce(a, b, c):
     a1 = torch.sum(a, dim=0)
     b1 = torch.max(b, dim=0)
     c1 = torch.min(c, dim=0)
     return a1, b1, c1
```
Generated combokernal:
```
 triton_heuristics.persistent_reduction(
     size_hints=[32, 32],
     reduction_hint=ReductionHint.DEFAULT,
     filename=__file__,
     triton_meta={'signature': {0: '*fp32', 1: '*fp32', 2: '*fp32', 3: '*fp32', 4: '*i64', 5: '*fp32', 6: '*i64', 7: '*fp32'}, 'device': 0, 'device_type': 'cuda', 'constants': {}, 'configs': [AttrsDescriptor(divisible_by_16=(0, 1, 2, 3, 4, 5, 6, 7), equal_to_1=())]},
     inductor_meta={'kernel_name': 'triton_per_fused_0', 'mutated_arg_names': []}
 )
 triton.jit
 def triton_(in_ptr0, in_ptr1, in_ptr2, out_ptr0, out_ptr1, out_ptr2, out_ptr3, out_ptr4, XBLOCK : tl.constexpr):
     pid = tl.program_id(0)
     if pid % 3 == 0:
         pid_offset = pid // 3
         xnumel = 20
         rnumel = 20
         RBLOCK_0: tl.constexpr = 32
         xoffset = pid_offset * XBLOCK
         xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
         xmask = xindex < xnumel
         rindex = tl.arange(0, RBLOCK_0)[None, :]
         roffset = 0
         rmask = rindex < rnumel
         r1 = rindex
         x0 = xindex
         tmp0 = tl.load(in_ptr0 + (x0 + (20*r1)), rmask & xmask, other=0.0)
         tmp1 = tl.broadcast_to(tmp0, [XBLOCK, RBLOCK_0])
         tmp3 = tl.where(rmask & xmask, tmp1, float("-inf"))
         tmp4 = triton_helpers.max2(tmp3, 1)[:, None]
         tmp6 = tl.broadcast_to(rindex, tmp3.shape)
         _, tmp5_tmp = triton_helpers.max_with_index(tmp3, tmp6, 1)
         tmp5 = tmp5_tmp[:, None]
         tl.store(out_ptr0 + (x0), tmp4, xmask)
         tl.store(out_ptr1 + (x0), tmp5, xmask)
     elif pid % 3 == 1:
         pid_offset = pid // 3
         xnumel = 10
         rnumel = 10
         RBLOCK_1: tl.constexpr = 16
         xoffset = pid_offset * XBLOCK
         xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
         xmask = xindex < xnumel
         rindex = tl.arange(0, RBLOCK_1)[None, :]
         roffset = 0
         rmask = rindex < rnumel
         r3 = rindex
         x2 = xindex
         tmp7 = tl.load(in_ptr1 + (x2 + (10*r3)), rmask & xmask, other=0.0)
         tmp8 = tl.broadcast_to(tmp7, [XBLOCK, RBLOCK_1])
         tmp10 = tl.where(rmask & xmask, tmp8, float("inf"))
         tmp11 = triton_helpers.min2(tmp10, 1)[:, None]
         tmp13 = tl.broadcast_to(rindex, tmp10.shape)
         _, tmp12_tmp = triton_helpers.min_with_index(tmp10, tmp13, 1)
         tmp12 = tmp12_tmp[:, None]
         tl.store(out_ptr2 + (x2), tmp11, xmask)
         tl.store(out_ptr3 + (x2), tmp12, xmask)
     elif pid % 3 == 2:
         pid_offset = pid // 3
         xnumel = 10
         rnumel = 10
         RBLOCK_2: tl.constexpr = 16
         xoffset = pid_offset * XBLOCK
         xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
         xmask = xindex < xnumel
         rindex = tl.arange(0, RBLOCK_2)[None, :]
         roffset = 0
         rmask = rindex < rnumel
         r5 = rindex
         x4 = xindex
         tmp14 = tl.load(in_ptr2 + (x4 + (10*r5)), rmask & xmask, other=0.0)
         tmp15 = tl.broadcast_to(tmp14, [XBLOCK, RBLOCK_2])
         tmp17 = tl.where(rmask & xmask, tmp15, 0)
         tmp18 = tl.sum(tmp17, 1)[:, None]
         tl.store(out_ptr4 + (x4), tmp18, xmask)
     else:
         pass
```

Note: ComboKernels uses masks to allow combination of kernels working with tensors of different sizes.

Test Plan:
```
buck2 test mode/dev-nosan caffe2/test/inductor:foreach
```
```
buck2 test mode/dev-nosan caffe2/test/inductor:combo_kernels
```

Differential Revision: D54134695

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124969
Approved by: https://github.com/mlazos
2024-07-23 17:34:28 +00:00
eellison
16a2a1aad3 Annotate graph.py (#131400)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131400
Approved by: https://github.com/shunting314
2024-07-23 07:04:12 +00:00
Xuehai Pan
b6d477fd56 [BE][Easy][16/19] enforce style for empty lines in import segments in torch/_i*/ (#129768)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129768
Approved by: https://github.com/jansel
2024-07-20 16:20:58 +00:00
Peter Bell
27c2a0d63b [inductor] Separate Buffer and Operation into two concepts (#130831)
Resubmit of #128893

Currently a buffer represents both a tensor with physical storage and a
computation that produces the tensor as a result.

This PR attempts to split these into two different concepts in the scheduler.
This should allow us to have multiple outputs from a single operation.

Differential Revision: [D59876059](https://our.internmc.facebook.com/intern/diff/D59876059)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130831
Approved by: https://github.com/lezcano
2024-07-20 02:05:07 +00:00
eellison
e14d1d10ef Unwrap Identity in prepare indexing (#130967)
We wrap indexing calculation in the concat kernel in `Identity` so that we do not expand int32 intermediates to int64. This was causing an issue where the index simplified to an integer and would not hit an intended [path](752c817898/torch/_inductor/codegen/triton.py (L1554)) which would do wrapping with tl.full.

I couldn't generate a minimal repro to add as test but I have a repro you can check here: P1483831261 There is already a test that we dont expand the int32 intermediates to int64.

Differential Revision: [D59871850](https://our.internmc.facebook.com/intern/diff/D59871850)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130967
Approved by: https://github.com/Chillee, https://github.com/jansel
2024-07-18 00:43:53 +00:00
Colin Peppler
f272e0ab4a [inductor] support unbacked symint divisors in vars_and_sizes (#130595)
Scenario:
```
>>> nodes
IterationRangesEntry(
    x2,
    divisor=192*u0 + 192576,
    length=s1,
    (xindex//(192*u0 + 192576)),
    {x0: 192, x1: u0 + 1003, x2: s1, x3: 192*s1*u0 + 192576*s1, x4: 192*u0 + 192576})
IterationRangesEntry(
    x1,
    divisor=192,
    length=u0 + 1003,
    ModularIndexing(xindex, 192, u0 + 1003),
    {x0: 192, x1: u0 + 1003, x2: s1, x3: 192*s1*u0 + 192576*s1, x4: 192*u0 + 192576})
IterationRangesEntry(
    x0,
    divisor=1,
    length=192,
    ModularIndexing(xindex, 1, 192),
    {x0: 192, x1: u0 + 1003, x2: s1, x3: 192*s1*u0 + 192576*s1, x4: 192*u0 + 192576})
```

Think about whether using fallback is safe here. I think it's safe because the divisor of one IterationRangesEntry should be the product of the lengths of the preceding IterationRangesEntry? Unless, one of the lengths divides by an unbacked symint?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130595
Approved by: https://github.com/aakhundov, https://github.com/ezyang
2024-07-16 16:21:38 +00:00
chilli
f9f85bfc0b [Inductor] FlexAttention supports partial masking (#130415) (#130626)
This is the new version of https://github.com/pytorch/pytorch/pull/130415

Updated test script: https://gist.github.com/yanboliang/7c34a82df611d4ea8869cb9e041bfbfc
Updated perf numbers:
```
(pt) [ybliang@devgpu002.ash8 ~/local/debug]$ CUDA_VISIBLE_DEVICES=4 python debug7.py
fwd speedup: 0.7166695598192317
bwd speedup: 0.7142133867805904
(pt) [ybliang@devgpu002.ash8 ~/local/debug]$ CUDA_VISIBLE_DEVICES=4 python debug7.py --partial-mask
fwd speedup: 0.8428246087169973
bwd speedup: 0.8486261278030254
```
Approved by: https://github.com/Chillee

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130626
Approved by: https://github.com/drisspg, https://github.com/yanboliang
2024-07-14 00:37:26 +00:00
PyTorch MergeBot
da030e7add Revert "[Inductor] FlexAttention supports partial masking (#130415)"
This reverts commit 207564bab1.

Reverted https://github.com/pytorch/pytorch/pull/130415 on behalf of https://github.com/janeyx99 due to Windows trunk test_proxy_tensor test failures look relevant  ([comment](https://github.com/pytorch/pytorch/pull/130415#issuecomment-2225575622))
2024-07-12 13:20:18 +00:00
Yanbo Liang
207564bab1 [Inductor] FlexAttention supports partial masking (#130415)
This is the new version of #130235

Updated test script: https://gist.github.com/yanboliang/7c34a82df611d4ea8869cb9e041bfbfc
Updated perf numbers:
```
(pt) [ybliang@devgpu002.ash8 ~/local/debug]$ CUDA_VISIBLE_DEVICES=4 python debug7.py
fwd speedup: 0.7166695598192317
bwd speedup: 0.7142133867805904
(pt) [ybliang@devgpu002.ash8 ~/local/debug]$ CUDA_VISIBLE_DEVICES=4 python debug7.py --partial-mask
fwd speedup: 0.8428246087169973
bwd speedup: 0.8486261278030254
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130415
Approved by: https://github.com/Chillee
2024-07-12 07:19:28 +00:00
Richard Zou
edf273edf4 Revert some PRs (#130303)
Summary:
Revert https://github.com/pytorch/pytorch/pull/129346 thru
https://github.com/pytorch/pytorch/pull/128893

For S430832

Test Plan: Tests

Differential Revision: D59503843

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130303
Approved by: https://github.com/bdhirsh
2024-07-09 14:46:00 +00:00
Peter Bell
fb078c20c1 [inductor] Separate Buffer and Operation into two concepts (#128893)
Currently a buffer represents both a tensor with physical storage and a
computation that produces the tensor as a result.

This PR attempts to split these into two different concepts in the scheduler.
This should allow us to have multiple outputs from a single operation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128893
Approved by: https://github.com/lezcano
2024-07-02 23:49:57 +00:00
Peter Bell
90d5a6f001 [inductor] Add lowering and codegen for aten.sort (#128458)
Closes #125633

Benchmarks:
| Shape       | dim | stable | compiled | eager   | speedup |
|-------------|-----|--------|----------|---------|---------|
| (256, 4096) | 0   | False  | 0.73 ms  | 1.26 ms | 1.7     |
| (256, 4096) | 0   | True   | 0.75 ms  | 1.27 ms | 1.7     |
| (4096, 256) | 1   | False  | 0.20 ms  | 0.73 ms | 3.7     |
| (4096, 256) | 1   | True   | 0.21 ms  | 0.73 ms | 3.5     |
| (255, 4096) | 0   | False  | 1.05 ms  | 1.48 ms | 1.4     |
| (255, 4096) | 0   | True   | 1.03 ms  | 1.47 ms | 1.4     |
| (4096, 255) | 1   | False  | 0.52 ms  | 0.98 ms | 1.9     |
| (4096, 255) | 1   | True   | 0.54 ms  | 1.00 ms | 1.9     |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128458
Approved by: https://github.com/lezcano, https://github.com/eellison
2024-06-26 01:36:39 +00:00
Jason Ansel
feb3f3ad77 [inductor] Refactors for Halide backend (#129024)
Pulling these inductor-related refactors out of the larger Halide
backend PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129024
Approved by: https://github.com/shunting314, https://github.com/eellison
2024-06-21 16:53:35 +00:00
Peter Bell
d7fc871175 [inductor] Improve superfluous mask handling in triton codegen (#128518)
This takes the logic from `filter_masks` and factors it out into
`_has_constant_mask`. I also improve support for `persistent_reduction` kernels
by making use of the static RBLOCK value and potentially XBLOCK too in the
`no_x_dim` case.

I then use this helper when generating the `xmask` and `rmask`, so we can
generate them as constants meaning triton can optimize them even if they are
included.

e.g. `compiled_sum(torch.randn(1024, 512, device="cuda"), dim=-1)`
before:
```python
@triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, rnumel):
    xnumel = 1024
    XBLOCK: tl.constexpr = 1
    rnumel = 512
    RBLOCK: tl.constexpr = 512
    xoffset = tl.program_id(0) * XBLOCK
    xindex = tl.full([1], xoffset, tl.int32)
    xmask = xindex < xnumel
    rindex = tl.arange(0, RBLOCK)[:]
    roffset = 0
    rmask = rindex < rnumel
    r1 = rindex
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (r1 + (512*x0)), rmask & xmask, other=0.0)
    tmp1 = tl.broadcast_to(tmp0, [RBLOCK])
    tmp3 = tl.where(rmask & xmask, tmp1, 0)
    tmp4 = triton_helpers.promote_to_tensor(tl.sum(tmp3, 0))
    tl.store(out_ptr0 + (x0), tmp4, xmask)
```

after:
```python
@triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, rnumel):
    xnumel = 1024
    XBLOCK: tl.constexpr = 1
    rnumel = 512
    RBLOCK: tl.constexpr = 512
    xoffset = tl.program_id(0) * XBLOCK
    xindex = tl.full([1], xoffset, tl.int32)
    xmask = tl.full([RBLOCK], True, tl.int1)
    rindex = tl.arange(0, RBLOCK)[:]
    roffset = 0
    rmask = tl.full([RBLOCK], True, tl.int1)
    r1 = rindex
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (r1 + (512*x0)), None)
    tmp1 = tl.broadcast_to(tmp0, [RBLOCK])
    tmp3 = triton_helpers.promote_to_tensor(tl.sum(tmp1, 0))
    tl.store(out_ptr0 + (x0), tmp3, None)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128518
Approved by: https://github.com/lezcano
2024-06-14 17:52:55 +00:00
Isuru Fernando
e397ad6883 Improve codegen for ops.masked in triton (#128054)
Fixes https://github.com/pytorch/pytorch/issues/127930
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128054
Approved by: https://github.com/peterbell10, https://github.com/lezcano
2024-06-14 11:52:56 +00:00
Jason Ansel
c897651392 [inductor] Add BackendFeature gating (#128266)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128266
Approved by: https://github.com/shunting314
2024-06-13 07:31:51 +00:00
Aaron Orenstein
ea614fb2b1 Flip default value for mypy disallow_untyped_defs [2/11] (#127839)
See #127836 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127839
Approved by: https://github.com/oulgen
2024-06-08 18:23:08 +00:00
Shunting Zhang
0c7f4353e5 [inductor] simplify indexing (#127661)
This is a short term fix for: https://github.com/pytorch/pytorch/issues/124002

We found the cause of bad perf for the int8_unpack kernel is due to sub-optimal indexing. In this PR we introduce 2 indexing optimizations:
1. expand FloorDiv to the entire expression when feasible. E.g. `x1 * 1024 + x2 // 2`  will be transformed to `(x1 * 2048 + x2) // 2`. The motivation is that we have more chance to simplify loops for `x1 * 2048 + x2`.
2. merge ModularIndexing pairs: `ModularIndexing(ModularIndex(x, 1, a), 1, b)`, can be simplified to `ModularIndexing(x, 1, b)` if a is a multiple of b.

With both indexing optimizations, we improve int8_unpack perf by 1.54x (183us -> 119us).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127661
Approved by: https://github.com/jansel
2024-06-07 17:51:30 +00:00
PyTorch MergeBot
23c156cd2d Revert "[inductor] simplify indexing (#127661)"
This reverts commit 901226ae83.

Reverted https://github.com/pytorch/pytorch/pull/127661 on behalf of https://github.com/atalman due to Sorry reverting because in conflict with https://github.com/pytorch/pytorch/pull/126905 which needs to be reverted, will be relanding it ([comment](https://github.com/pytorch/pytorch/pull/127661#issuecomment-2155115388))
2024-06-07 15:58:36 +00:00
Shunting Zhang
901226ae83 [inductor] simplify indexing (#127661)
This is a short term fix for: https://github.com/pytorch/pytorch/issues/124002

We found the cause of bad perf for the int8_unpack kernel is due to sub-optimal indexing. In this PR we introduce 2 indexing optimizations:
1. expand FloorDiv to the entire expression when feasible. E.g. `x1 * 1024 + x2 // 2`  will be transformed to `(x1 * 2048 + x2) // 2`. The motivation is that we have more chance to simplify loops for `x1 * 2048 + x2`.
2. merge ModularIndexing pairs: `ModularIndexing(ModularIndex(x, 1, a), 1, b)`, can be simplified to `ModularIndexing(x, 1, b)` if a is a multiple of b.

With both indexing optimizations, we improve int8_unpack perf by 1.54x (183us -> 119us).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127661
Approved by: https://github.com/jansel
2024-06-06 23:57:45 +00:00
Aaron Gokaslan
2d47385f0f [BE]: Enable ruff TCH rules and autofixes for better imports (#127688)
Automated fixes to put imports that are only used in type hints into TYPE_CHECKING imports. This also enables the RUFF TCH rules which will automatically apply autofixes to move imports in and out of TYPE_CHECKING blocks as needed in the future, this will make the initial PyTorch import faster and will reduce cyclic dependencies.

Co-authored-by: Xuehai Pan <XuehaiPan@pku.edu.cn>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127688
Approved by: https://github.com/XuehaiPan, https://github.com/ezyang, https://github.com/malfet
2024-06-06 16:55:58 +00:00
chilli
e0fc1ab625 Forward fix for templates + views (#127446)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127446
Approved by: https://github.com/eellison
2024-05-30 02:34:35 +00:00
lezcano
8a21532e53 Fix constant propagation pass (#114471)
This pass was broken in a number of ways, as we were not generating
asserts whenever we took it, even though we need to. While doing so,
we found that the analysis we were using for choosing
whether to generate asserts or not for dynamic shapes was completely
broken.

Eliminating indirect indexing in this way allows for a number of optimisations.
In particular, we can now fuse against these kernels (indirect indexing disallows fusions).

The new strategy is as follows:

- We always propagate sympy expressions if we can.
- If an expression was an indirect_indexing, we call `check_bounds`
- We also call `check_bounds` within `CSEProxy.indirect_indexing`
- The checks are issued in the buffer where they would go if the were used in a load
   - This makes them always be codegen'd before the load and stores
   - In the case of stores, they will be generated potentially much earlier than the stores themselves, which is fine.

We add quite a few asserts to preexisting tests to strengthen them. In particular, we make sure
that issuing an assert plays well with all kinds of C++ vectorisation.

For now, we rely on the logic within `_maybe_evaluate_static` to prove
these bounds. This logic is rather limited though. In the future, we might want
to rely on Z3 here to be able to prove bounds in a more general way.

Supersedes https://github.com/pytorch/pytorch/pull/113068
Fixes https://github.com/pytorch/pytorch/issues/121251

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114471
Approved by: https://github.com/peterbell10
2024-05-29 09:10:25 +00:00
chilli
ec8b254ef4 Refactored template codegen to explicitly set current body when generating code (#127144)
The main motivation for this refactor is that today, when generating templates, this is what happens.

```
def_kernel() # registers hook for fully generating function definition
store_output() # registers hook for generating the output store. *also* keeps a number of things generated on `self.body`.
```

Later on, when we codegen the template: f8c4c268da/torch/_inductor/codegen/simd.py (L1402)

```
epilogue_node.codegen() # Also writes to body!
template.finalize() # Calls the above two hooks for def_kernel and store_output, which then reads from the accumulated `self.body`
```

Today, this is fine, as long as `store_output` is the last function called in the template. However, there's a couple things we probably want to do with kernels that makes this annoying.

1. In FlexAttention backwards, we might want a `modification` to be positioned *after* the `store_output` (just logically from a code organization POV). This doesn't work today because `modification` also needs to codegen a subgraph, but writing to `body` here conflicts with `store_output`'s implicit saved state on `self.body`.
2. If we want to support prologue fusion, we need to go through a bunch of contortions today to call the template hook finalization a couple times (https://github.com/pytorch/pytorch/pull/121211/files#diff-73b89475038a5b4705da805f1217783883fb90398ee1164995db392fc4a342c1R322)
3. The current code also makes it quite difficult to support fusion into multiple output nodes.

To resolve this, I do two things:
1. I *remove* the default `self.body` on `TritonTemplateKernel`. Instead, I have a dict of `self.subgraph_bodies`, which can be enabled in a context with `TritonTemplateKernel.set_subgraph_body`. This allows multiple different template functions to write to their own isolated bodies.
2. I add functions that allow you to finalize specific hooks on `PartialRender`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127144
Approved by: https://github.com/jansel
2024-05-28 09:49:13 +00:00
Jason Ansel
d5bf3a98db [inductor] Refactor indexing() into triton.py (#127047)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127047
Approved by: https://github.com/shunting314
ghstack dependencies: #126944, #126945
2024-05-24 22:46:20 +00:00