Commit Graph

418 Commits

Author SHA1 Message Date
Oguz Ulgen
72d2dba992 Add None return type to init (#132335)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132335
Approved by: https://github.com/albanD
2024-08-01 15:26:45 +00:00
eellison
f32ab3b9e3 Migrate Inductor scheduler, dependencies, ir, and codegen/common to use OrderedSet (#130004)
Python's set is non deterministic. There is an internal failure which we recently ran into which did not consistently fail.

See, repro here: P1453035092.

Now, with these changes, it does consistently fail. In follow ups we could also consider adding a lintrule for uses of either set() or set literals.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130004
Approved by: https://github.com/oulgen
2024-08-01 04:37:15 +00:00
Peter Bell
260c991e20 [inductor] Fix unsoundness with negative-valued indexing expressions (#131761)
This fixes a few instances where we assumed indexing expressions were
non-negative. This is not valid when we have more complicated
expressions involving masking e.g. pointwise cat.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131761
Approved by: https://github.com/ezyang
2024-07-31 21:32:20 +00:00
PyTorch MergeBot
784a6ec5a3 Revert "Migrate Inductor scheduler, dependencies, ir, and codegen/common to use OrderedSet (#130004)"
This reverts commit 13d744464f.

Reverted https://github.com/pytorch/pytorch/pull/130004 on behalf of https://github.com/clee2000 due to broke lint [GH job link](https://github.com/pytorch/pytorch/actions/runs/10183945999/job/28170099930) [HUD commit link](13d744464f) probably a landrace, the base is 21 hours old ([comment](https://github.com/pytorch/pytorch/pull/130004#issuecomment-2260946562))
2024-07-31 16:49:21 +00:00
eellison
13d744464f Migrate Inductor scheduler, dependencies, ir, and codegen/common to use OrderedSet (#130004)
Python's set is non deterministic. There is an internal failure which we recently ran into which did not consistently fail.

See, repro here: P1453035092.

Now, with these changes, it does consistently fail. In follow ups we could also consider adding a lintrule for uses of either set() or set literals.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130004
Approved by: https://github.com/oulgen
2024-07-31 16:22:11 +00:00
Yuzhen Huang
5298acb5c7 Back out "[1/2] PT2 Inductor ComboKernels - Foreach cases (#124969)" (#132065)
Summary:
Original commit changeset: 1d8cfdcef69d

Original Phabricator Diff: D54134695

back out: D54134695

Test Plan: more details see: https://docs.google.com/document/d/1noPTmTdNYHVDFyk7AJSSO7jQoNw6fTo4o6k9eTNeZh8/edit#heading=h.xeo30usu77nc

Reviewed By: zw2326

Differential Revision: D60397377

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132065
Approved by: https://github.com/zw2326, https://github.com/qchip
2024-07-29 22:48:29 +00:00
eellison
8b507a922a Mode to emulate amp numerics (#131595)
```
# Mode to emulate pytorch eager numerics for lower precision (fp16, bf16)
# Pytorch eager computes bf16/fp16 by upcasting inputs to fp32 and downcasting after
# For multiple, fused pointwise nodes, inductor will elide the intermediary upcasts and downcasts
# Typically this should be closer to fp64 ref numerics. However, it can be useful for debugging
# to emulate the eager numerics.
```

We add extra upcasts and downcasts for pointwise nodes that correspond to casts that existed in the original user program (excluding pointwise nodes that are emitted during decomposition). Since this is mostly for debugging, I added this information in the `meta` so that this mode does not have unintended side effects like changing pattern matching.

in theory there could also be some other casts with fused reduction -> reduction, although i havent seen this in practice as much. could be done as follow up. note: only works with cuda backend right now.

This mode was sufficient to eliminate compile differences from https://fb.workplace.com/groups/385893200869952/posts/464263173032954/?comment_id=465199259606012&reply_comment_id=465676792891592.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131595
Approved by: https://github.com/shunting314, https://github.com/bdhirsh, https://github.com/jansel
2024-07-29 22:42:23 +00:00
PyTorch MergeBot
957a89f56c Revert "[inductor] Fix unsoundness with negative-valued indexing expressions (#131761)"
This reverts commit 03760be271.

Reverted https://github.com/pytorch/pytorch/pull/131761 on behalf of https://github.com/atalman due to Broke CI: inductor/test_cpu_cpp_wrapper.py::DynamicShapesCppWrapperCpuTests::test_linear_binary_dynamic_shapes_cpp_wrapper [GH job link](https://github.com/pytorch/pytorch/actions/runs/10145214748/job/28051168920) [HUD commit link](03760be271) ([comment](https://github.com/pytorch/pytorch/pull/131761#issuecomment-2256287736))
2024-07-29 15:52:08 +00:00
Peter Bell
03760be271 [inductor] Fix unsoundness with negative-valued indexing expressions (#131761)
This fixes a few instances where we assumed indexing expressions were
non-negative. This is not valid when we have more complicated
expressions involving masking e.g. pointwise cat.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131761
Approved by: https://github.com/ezyang
2024-07-29 03:14:13 +00:00
Peter Bell
16cd1aaa1d [inductor] Improve sort kernel perf (#131719)
Closes #129507

This makes two changes to the sort kernel:
1. Use int16 for the indices since we only operate on small dims anyway
2. Instead of passing an explicit mask, we pass the rnumel and imply the
   mask from that which saves an additional reduction in the sort
   kernel's inner loop.

In my benchmarks, this gives enough of a perf improvement to bump up the
max rblock to 512.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131719
Approved by: https://github.com/eellison
2024-07-26 21:56:47 +00:00
Peter Bell
2784b3f1b7 [inductor] Fix split-scan interaction with multi-kernel (#131044)
This fixes a couple errors that come up when multi-kernel is used with
split-scan.
1. The split-scan was being marked as a persistent kernel, which allowed
   a multi-kernel to be created but this isn't supported. Fix is to
   never mark split-scan as persistent.
2. Benchmark codegen was not handling WorkspaceArg, and would raise a
   KeyError during codegen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131044
Approved by: https://github.com/shunting314
2024-07-25 11:36:36 +00:00
Feng Shi
404d640c39 [1/2] PT2 Inductor ComboKernels - Foreach cases (#124969)
Summary:
A ComboKernel combines independent Inductor Triton kernels into a single one.
Consolidation with Foreach kernel:
1) For the scheduler node, the logic is consolidated into ForeachKernelSchedulerNode
2) The backend kernel is consolidated into ComboKernel.

(Note: this is part 1 which only deals with the 1st case above.)

Details:

1. ComboKernel can be viewed as the extension of Foreach kernel (see the examples below). The main differences are: 1) the block size is tunable (but currently shared by the sub-kernels).  2) it supports multiple kernel typs, like pointwise, reduce, and may extend to matmm as well (it doesn't support mixed 1d and 2d kernels yet, but it can be extended for such case) 3) the blocks are interleaved among the sub kernels (can be extended to other arrangement), 4) it is designed to be general enough to combine kernels without dependency and doesn't rely on certain patterns. 5) it doesn't support dynamic sizes yet but can be easily extended for it.

2. ComboKernel is used in two cases: 1) for existing foreach kernels, combo kernels are used as the backend kernel. the front-end kernel generation logic remains the same. 2) Added an extra optimization phase to the end of the scheduler to generate extra combo kernels if combo_kernels is True in config.py

3. The combo kernel generation in the added optimization phase is done in two steps: 1) in the front end inside the scheduler, it topologically sort the schedule nodes to find all the nodes with no data dependency and create a frond end schedule node for them. We currently limit the maximal number of sub-nodes for each combo kernel to 8 (but we still need to find what is the optimal number). 2) then, these sub-nodes are combined in the codegen phase to generate the combo kernel code for them based on a few rules. For example, 1d and 2d kernels are separated into different combo kernels, as mixing them is not supported yet. Note these algorithms we provide are very basic, and the users can register their customized combo kernel generation algorithms for both steps.

4. Performance wise, combining small kernels is about always to see performance gain. however, combining very large kernels may not see any perf gain, sometimes even regression possibly due to improper block sizes. Thus, a benchmark function is implemented to avoid such perf regression, and it is recommended to turn it on by setting benchmark_combo_kernels to True whenever combo_kernels is True.

Example:
- element wise kernels
original Pytorch function:
```
 def test_activations(a, b, c):
     a1 = torch.nn.functional.relu(a)
     b1 = torch.nn.functional.sigmoid(b)
     c1 = torch.nn.functional.tanh(c)
     return a1, b1, c1
```
combokernel
```
triton_heuristics.pointwise(
    size_hints=[512], tile_hint=TileHint.DEFAULT,
    filename=__file__,
    triton_meta={'signature': {0: '*fp32', 1: '*fp32', 2: '*fp32', 3: '*fp32', 4: '*fp32', 5: '*fp32'}, 'device': 0, 'device_type': 'cuda', 'constants': {}, 'configs': [AttrsDescriptor(divisible_by_16=(0, 1, 2, 3, 4, 5), equal_to_1=())]},
    inductor_meta={'kernel_name': 'triton_poi_fused_0', 'mutated_arg_names': []}
)
triton.jit
def triton_(in_ptr0, in_ptr1, in_ptr2, out_ptr0, out_ptr1, out_ptr2, XBLOCK : tl.constexpr):
    pid = tl.program_id(0)
    if pid % 3 == 0:
        pid_offset = pid // 3
        xnumel = 100
        rnumel = 1
        xoffset = pid_offset * XBLOCK
        xindex = xoffset + tl.arange(0, XBLOCK)[:]
        xmask = xindex < xnumel
        x0 = xindex
        tmp0 = tl.load(in_ptr0 + (x0), xmask)
        tmp1 = triton_helpers.maximum(0, tmp0)
        tl.store(out_ptr0 + (x0), tmp1, xmask)
    elif pid % 3 == 1:
        pid_offset = pid // 3
        xnumel = 400
        rnumel = 1
        xoffset = pid_offset * XBLOCK
        xindex = xoffset + tl.arange(0, XBLOCK)[:]
        xmask = xindex < xnumel
        x1 = xindex
        tmp2 = tl.load(in_ptr1 + (x1), xmask)
        tmp3 = tl.sigmoid(tmp2)
        tl.store(out_ptr1 + (x1), tmp3, xmask)
    elif pid % 3 == 2:
        pid_offset = pid // 3
        xnumel = 100
        rnumel = 1
        xoffset = pid_offset * XBLOCK
        xindex = xoffset + tl.arange(0, XBLOCK)[:]
        xmask = xindex < xnumel
        x2 = xindex
        tmp4 = tl.load(in_ptr2 + (x2), xmask)
        tmp5 = libdevice.tanh(tmp4)
        tl.store(out_ptr2 + (x2), tmp5, xmask)
    else:
        pass
```
- reduction kernels
Original Pytorch function:
```
def test_reduce(a, b, c):
     a1 = torch.sum(a, dim=0)
     b1 = torch.max(b, dim=0)
     c1 = torch.min(c, dim=0)
     return a1, b1, c1
```
Generated combokernal:
```
 triton_heuristics.persistent_reduction(
     size_hints=[32, 32],
     reduction_hint=ReductionHint.DEFAULT,
     filename=__file__,
     triton_meta={'signature': {0: '*fp32', 1: '*fp32', 2: '*fp32', 3: '*fp32', 4: '*i64', 5: '*fp32', 6: '*i64', 7: '*fp32'}, 'device': 0, 'device_type': 'cuda', 'constants': {}, 'configs': [AttrsDescriptor(divisible_by_16=(0, 1, 2, 3, 4, 5, 6, 7), equal_to_1=())]},
     inductor_meta={'kernel_name': 'triton_per_fused_0', 'mutated_arg_names': []}
 )
 triton.jit
 def triton_(in_ptr0, in_ptr1, in_ptr2, out_ptr0, out_ptr1, out_ptr2, out_ptr3, out_ptr4, XBLOCK : tl.constexpr):
     pid = tl.program_id(0)
     if pid % 3 == 0:
         pid_offset = pid // 3
         xnumel = 20
         rnumel = 20
         RBLOCK_0: tl.constexpr = 32
         xoffset = pid_offset * XBLOCK
         xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
         xmask = xindex < xnumel
         rindex = tl.arange(0, RBLOCK_0)[None, :]
         roffset = 0
         rmask = rindex < rnumel
         r1 = rindex
         x0 = xindex
         tmp0 = tl.load(in_ptr0 + (x0 + (20*r1)), rmask & xmask, other=0.0)
         tmp1 = tl.broadcast_to(tmp0, [XBLOCK, RBLOCK_0])
         tmp3 = tl.where(rmask & xmask, tmp1, float("-inf"))
         tmp4 = triton_helpers.max2(tmp3, 1)[:, None]
         tmp6 = tl.broadcast_to(rindex, tmp3.shape)
         _, tmp5_tmp = triton_helpers.max_with_index(tmp3, tmp6, 1)
         tmp5 = tmp5_tmp[:, None]
         tl.store(out_ptr0 + (x0), tmp4, xmask)
         tl.store(out_ptr1 + (x0), tmp5, xmask)
     elif pid % 3 == 1:
         pid_offset = pid // 3
         xnumel = 10
         rnumel = 10
         RBLOCK_1: tl.constexpr = 16
         xoffset = pid_offset * XBLOCK
         xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
         xmask = xindex < xnumel
         rindex = tl.arange(0, RBLOCK_1)[None, :]
         roffset = 0
         rmask = rindex < rnumel
         r3 = rindex
         x2 = xindex
         tmp7 = tl.load(in_ptr1 + (x2 + (10*r3)), rmask & xmask, other=0.0)
         tmp8 = tl.broadcast_to(tmp7, [XBLOCK, RBLOCK_1])
         tmp10 = tl.where(rmask & xmask, tmp8, float("inf"))
         tmp11 = triton_helpers.min2(tmp10, 1)[:, None]
         tmp13 = tl.broadcast_to(rindex, tmp10.shape)
         _, tmp12_tmp = triton_helpers.min_with_index(tmp10, tmp13, 1)
         tmp12 = tmp12_tmp[:, None]
         tl.store(out_ptr2 + (x2), tmp11, xmask)
         tl.store(out_ptr3 + (x2), tmp12, xmask)
     elif pid % 3 == 2:
         pid_offset = pid // 3
         xnumel = 10
         rnumel = 10
         RBLOCK_2: tl.constexpr = 16
         xoffset = pid_offset * XBLOCK
         xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
         xmask = xindex < xnumel
         rindex = tl.arange(0, RBLOCK_2)[None, :]
         roffset = 0
         rmask = rindex < rnumel
         r5 = rindex
         x4 = xindex
         tmp14 = tl.load(in_ptr2 + (x4 + (10*r5)), rmask & xmask, other=0.0)
         tmp15 = tl.broadcast_to(tmp14, [XBLOCK, RBLOCK_2])
         tmp17 = tl.where(rmask & xmask, tmp15, 0)
         tmp18 = tl.sum(tmp17, 1)[:, None]
         tl.store(out_ptr4 + (x4), tmp18, xmask)
     else:
         pass
```

Note: ComboKernels uses masks to allow combination of kernels working with tensors of different sizes.

Test Plan:
```
buck2 test mode/dev-nosan caffe2/test/inductor:foreach
```
```
buck2 test mode/dev-nosan caffe2/test/inductor:combo_kernels
```

Differential Revision: D54134695

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124969
Approved by: https://github.com/mlazos
2024-07-23 17:34:28 +00:00
Yueming Hao
979429ca89 [inductor]Add DtypeView to avoid memory leak and unnecessary kernel generations (#128883)
Fixes #126338
## Issue Summary

When torchinductor compiles the combination `functional_collective -> view.dtype -> wait`, a memory leak occurs. This happens because `view.dtype` is compiled into an out-of-place Triton kernel that copies the input data to a new tensor, even if the data hasn't completed collection via the wait operation. The tensor used by `collective` is only freed when the `wait` operation triggers the garbage collector, see [~WorkRegistry](https://github.com/pytorch/pytorch/blob/main/torch/csrc/distributed/c10d/Functional.cpp#L41). However, since `wait` now waits for a new tensor, the previous one is never freed. The `view.dtype` should only check the metadata instead of creating a new tensor. The current lowering is against its semantics and causes memory leaks.

See more great discussions in the #126338

This kind of lowering also generates unnecessary triton kernels for `view.dtype` when it can't be fused with other operations.

## Fix
The function `aten.view.dtype` is a CPU operation that changes the metadata of its input. After discussions with @eellison and @bdhirsh, we decided to change the lowering of `aten.view.dtype` to ensure it fallback properly to the correct `aten.view.dtype` instead of generating a Triton kernel in some cases. This approach also preserves the same semantics of the view operation.
When the model calls `aten.view.dtype` with a data type whose bit width matches the input's original data type, we lower it to the newly added `DtypeView` in IR, acting like a `ReinterpretView`. When the operation can be fused, its `make_loader` is called to maintain the correct type conversion for each load instruction. When the operation can't be fused, it falls back to `aten.view.dtype` to avoid Triton kernel generation.

## Example

```python
@torch.compile
def fn(x, y):
    x = x.view(torch.float16)
    y = y.view(torch.float16) + 1
    return x @ y

x = torch.randn((2, 2), device=self.device, dtype=torch.bfloat16)
y = torch.randn((2, 2), device=self.device, dtype=torch.bfloat16)
fn(x, y)
```
The output code generated before this fix is like the following.
```python
triton_poi_fused_add_view_0...
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 4
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (x0), xmask).to(tl.float32)
    tmp1 = tmp0.to(tl.bfloat16).to(tl.float32, bitcast=True).to(tl.float32)
    tl.store(out_ptr0 + (x0), tmp1, xmask)

triton_poi_fused_add_view_1...
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 4
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (x0), xmask).to(tl.float32)
    tmp1 = tmp0.to(tl.bfloat16).to(tl.float32, bitcast=True).to(tl.float32)
    tmp2 = 1.0
    tmp3 = tmp1 + tmp2
    tl.store(out_ptr0 + (x0), tmp3, xmask)

def call(args):
...
        triton_poi_fused_view_0.run(arg0_1, buf0, 4, grid=grid(4), stream=stream0)
        del arg0_1
        buf1 = empty_strided_cuda((2, 2), (2, 1), torch.float16)
        # Source Nodes: [view_1, y], Original ATen: [aten.add, aten.view]
        triton_poi_fused_add_view_1.run(arg1_1, buf1, 4, grid=grid(4), stream=stream0)
        del arg1_1
        buf2 = empty_strided_cuda((2, 2), (2, 1), torch.float16)
        # Source Nodes: [matmul, view_1, x, y], Original ATen: [aten.add, aten.mm, aten.view]
        extern_kernels.mm(buf0, buf1, out=buf2)
```
As you can see, the two `view` operations are compiled to two kernels `triton_poi_fused_view_0` nad `triton_poi_fused_add_view_1`. Both of them has a line `tmp1 = tmp0.to(tl.bfloat16).to(tl.float32, bitcast=True).to(tl.float32)` which does the type conversion.

The main issue is that the first `view` operation didn't do anything to the actual data. But it generates a triton kernel with a new output tensor. Another small issue is that this triton kernel can't be compiled because `bitcast=True` only support type converstion with same bidwidth.

The following are output code generated after this PR.

```python
triton_poi_fused_add_0...
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 4
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (x0), xmask).to(tl.float32)
    tmp1 = tmp0.to(tl.bfloat16).to(tl.float32)
    tmp2 = 1.0
    tmp3 = tmp1 + tmp2
    tl.store(out_ptr0 + (x0), tmp3, xmask)
def call(args):
...
        triton_poi_fused_add_0.run(arg1_1, buf0, 4, grid=grid(4), stream=stream0)
        del arg1_1
        buf1 = empty_strided_cuda((2, 2), (2, 1), torch.float16)
        # Source Nodes: [matmul, y], Original ATen: [aten.add, aten.mm]
        extern_kernels.mm(aten.view.dtype(arg0_1, torch.float16), buf0, out=buf1)
```
The first `view` operation has been replaced with the `aten.view.dtype` and it is directly passed as an argument. The second one is still there because it is fused with the following add operation. The invalid bitcast operation is removed too.

The following two code snippets is for the upcasts and downcasts. For dtype in `torch.float16, torch.bfloat16`, each load will be upcasted to float32, then downcast to its original dtype to ensure use values with the right precision.

7bda23ef84/torch/_inductor/codegen/triton.py (L1725-L1726)
7bda23ef84/torch/_inductor/codegen/triton.py (L629-L642)

Huge thanks to @eellison, @bdhirsh, @shunting314, and @desertfire .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128883
Approved by: https://github.com/eellison
2024-07-23 17:31:39 +00:00
eellison
16a2a1aad3 Annotate graph.py (#131400)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131400
Approved by: https://github.com/shunting314
2024-07-23 07:04:12 +00:00
Xuehai Pan
b6d477fd56 [BE][Easy][16/19] enforce style for empty lines in import segments in torch/_i*/ (#129768)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129768
Approved by: https://github.com/jansel
2024-07-20 16:20:58 +00:00
peaceorwell
6657b14a64 [inductor] Fix the method for checking the variable type of entry.numel (#131026)
The data type of numel in the IterationRangesEntry class is sympy.Expr. To determine if it's an integer, we need to use sympy.Integer.

Co-authored-by: peterbell10 <peterbell10@live.co.uk>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131026
Approved by: https://github.com/peterbell10
2024-07-19 22:51:11 +00:00
chilli
31fc5b8966 Add support for inline_asm_elementwise in Inductor lowerings (#129846)
This doesn't actually expose `inline_asm_elementwise` from any public API, but makes it pretty easy to register a lowering for a custom op that uses it.

<img width="667" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/f125f4bb-4f8c-46e7-8e06-925f37ed2930">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129846
Approved by: https://github.com/shunting314
2024-07-03 02:34:03 +00:00
PyTorch MergeBot
03440a1c13 Revert "Add support for inline_asm_elementwise in Inductor lowerings (#129846)"
This reverts commit badc638eb6.

Reverted https://github.com/pytorch/pytorch/pull/129846 on behalf of https://github.com/jeffdaily due to introduced ROCm breakages in trunk ([comment](https://github.com/pytorch/pytorch/pull/129846#issuecomment-2203519554))
2024-07-02 15:25:34 +00:00
chilli
badc638eb6 Add support for inline_asm_elementwise in Inductor lowerings (#129846)
This doesn't actually expose `inline_asm_elementwise` from any public API, but makes it pretty easy to register a lowering for a custom op that uses it.

<img width="667" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/f125f4bb-4f8c-46e7-8e06-925f37ed2930">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129846
Approved by: https://github.com/shunting314
2024-07-02 09:31:38 +00:00
Jason Ansel
b93bf55b6a [halide-backend] Add GPU support (#127506)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127506
Approved by: https://github.com/shunting314, https://github.com/eellison
ghstack dependencies: #126417, #129025, #129026
2024-06-29 14:06:21 +00:00
Andres Lugo
b9a1c2c991 [ROCm] Enable F8 Inductor Unit tests (#128353)
First batch of inductor unit test enablement on ROCm for the fnuz f8 variant on MI300

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128353
Approved by: https://github.com/jansel, https://github.com/eellison
2024-06-26 18:30:43 +00:00
Peter Bell
90d5a6f001 [inductor] Add lowering and codegen for aten.sort (#128458)
Closes #125633

Benchmarks:
| Shape       | dim | stable | compiled | eager   | speedup |
|-------------|-----|--------|----------|---------|---------|
| (256, 4096) | 0   | False  | 0.73 ms  | 1.26 ms | 1.7     |
| (256, 4096) | 0   | True   | 0.75 ms  | 1.27 ms | 1.7     |
| (4096, 256) | 1   | False  | 0.20 ms  | 0.73 ms | 3.7     |
| (4096, 256) | 1   | True   | 0.21 ms  | 0.73 ms | 3.5     |
| (255, 4096) | 0   | False  | 1.05 ms  | 1.48 ms | 1.4     |
| (255, 4096) | 0   | True   | 1.03 ms  | 1.47 ms | 1.4     |
| (4096, 255) | 1   | False  | 0.52 ms  | 0.98 ms | 1.9     |
| (4096, 255) | 1   | True   | 0.54 ms  | 1.00 ms | 1.9     |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128458
Approved by: https://github.com/lezcano, https://github.com/eellison
2024-06-26 01:36:39 +00:00
Isuru Fernando
e6bfa2958b Add aten._unsafe_masked_index (#116491)
To generate masked indexing operations that would generate
masked loads in triton code

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116491
Approved by: https://github.com/lezcano, https://github.com/peterbell10
2024-06-25 02:45:02 +00:00
lezcano
53be7ff0e4 Make tl.atomic_add relaxed (#129133)
We don't use any fancy synchronization within out atomic ops, we just
want them to be atomic, so better to have them be relaxed than the
default aquire/release.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129133
Approved by: https://github.com/peterbell10
2024-06-21 14:49:58 +00:00
Bin Bao
ba92f5277f [inductor][refactor] Unify the use of generate_kernel_call (#128467)
Summary: Refactor TritonTemplateKernel.call_kernel and ForeachKernel.call_kernel to use wrapper.generate_kernel_call to generate kernel calls instead of explicitly composing the kernel call string. This consolidates the entry point of generate_kernel_call and similifies later changes in this PR stack.

Differential Revision: [D58733631](https://our.internmc.facebook.com/intern/diff/D58733631)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128467
Approved by: https://github.com/shunting314
2024-06-19 07:47:25 +00:00
Blaine Burton Rister
f1ee3589a1 [Inductor] Emit strided block pointer from ModularIndexing and FloorDiv (#127342)
**Summary**

Inductor currently uses modulo and division to compute indices into certain multi-dimensional tensors, such as those arising from row padding. This PR matches on that indexing pattern, replacing it with an N-D block pointer. This should be more efficient than computing indices with division and modulo, and it can easily map to DMAs on non-GPU hardware targets.

Because the 1D block size needs to map to an integer block shape in ND, we need to know that the ND block size evenly divides the size of the iteration range. This PR only generates ND block pointers when it can guarantee that the iteration order and number of elements loaded are unchanged. This means that the number of elements in a slice of the iteration range must either be:
  - Powers of 2. Since Triton block sizes are powers of 2, any integer power of 2 either divides the block size, or is greater than the block size. In the latter case, `CielDiv(x, y)` rounds up to 1.
  - Multiples of the maximum block size. Since block sizes are powers of 2, the maximum block size is a multiple of every possible block size.

Note that a *slice* of the iteration range does not include the leading dimension. Thus we can support arbitrary leading dimensions like `(5,8)`.

Feature proposal and discussion: https://github.com/pytorch/pytorch/issues/125077

Example kernel:
```
triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 4096
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    tmp0 = tl.reshape(tl.load(tl.make_block_ptr(in_ptr0, shape=[32, 16, 8], strides=[1024, 32, 1], block_shape=[32 * (32 <= ((127 + XBLOCK) // 128)) + ((127 + XBLOCK) // 128) * (((127 + XBLOCK) // 128) < 32), 16 * (16 <= ((7 + XBLOCK) // 8)) + ((7 + XBLOCK) // 8) * (((7 + XBLOCK) // 8) < 16), 8 * (8 <= XBLOCK) + XBLOCK * (XBLOCK < 8)], order=[0, 1, 2], offsets=[(xoffset // 128), (xoffset // 8) % 16, xoffset % 8]), boundary_check=[0, 1, 2]), [XBLOCK])
    tmp1 = tmp0 + tmp0
    tl.store(tl.make_block_ptr(out_ptr0, shape=[4096], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), tl.broadcast_to(tmp1, [XBLOCK]).to(tl.float32))
''', device_str='cuda')
```

**Test Plan**

This PR adds a new CI test script to cover this feature. The tests can be grouped into a few main categories:
  - Can we generate strided block pointers for the appropriate shapes?
     - Powers of 2
     - Non-power of 2, but multiple of the maximum block size
     - Arbitrary leading dimensions, with power of 2 inner dimensions
     - Weird strides and offsets
     - Reductions
     - Symbolic shapes that are multiples of the maximum block size (wasn't able to trace this through dynamo)
     - Broadcasts (some variables are missing from the indexing expression)
  - Do we still compile other cases correctly, even if we don't expect to be able to generate block pointers?
     - Unsupported static shapes
     - Unsupported symbolic shapes
  - Mixing and matching these cases:
     - Pointwise and reduction in the same kernel
  - Sanity check the test harness
     - Do we raise an exception if the expected number of block pointers and the actual number are different?

**Follow-ups**

There are a few important cases which this PR can't handle. I'm hoping these can be deferred to follow-up PRs:
  - Handle non-divisible shapes
      - Change the tiling algorithm to generate a 2D (X,Y) blocking, if doing so enables block pointers to be emitted.
      - Pad unsupported loads up to the nearest divisible size, then mask/slice out the extra elements? This is probably the best solution, but I'm not yet sure how to go about it in triton.
 - Take advantage of this analysis when `triton.use_block_ptr=False`. I'm guessing we can still avoid `%` and `/` without requiring block pointers. Maybe we could compute block indices with arange and broadcast instead?

Differential Revision: D56739375

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127342
Approved by: https://github.com/jansel, https://github.com/shunting314
2024-06-16 07:35:57 +00:00
Peter Bell
d7fc871175 [inductor] Improve superfluous mask handling in triton codegen (#128518)
This takes the logic from `filter_masks` and factors it out into
`_has_constant_mask`. I also improve support for `persistent_reduction` kernels
by making use of the static RBLOCK value and potentially XBLOCK too in the
`no_x_dim` case.

I then use this helper when generating the `xmask` and `rmask`, so we can
generate them as constants meaning triton can optimize them even if they are
included.

e.g. `compiled_sum(torch.randn(1024, 512, device="cuda"), dim=-1)`
before:
```python
@triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, rnumel):
    xnumel = 1024
    XBLOCK: tl.constexpr = 1
    rnumel = 512
    RBLOCK: tl.constexpr = 512
    xoffset = tl.program_id(0) * XBLOCK
    xindex = tl.full([1], xoffset, tl.int32)
    xmask = xindex < xnumel
    rindex = tl.arange(0, RBLOCK)[:]
    roffset = 0
    rmask = rindex < rnumel
    r1 = rindex
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (r1 + (512*x0)), rmask & xmask, other=0.0)
    tmp1 = tl.broadcast_to(tmp0, [RBLOCK])
    tmp3 = tl.where(rmask & xmask, tmp1, 0)
    tmp4 = triton_helpers.promote_to_tensor(tl.sum(tmp3, 0))
    tl.store(out_ptr0 + (x0), tmp4, xmask)
```

after:
```python
@triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, rnumel):
    xnumel = 1024
    XBLOCK: tl.constexpr = 1
    rnumel = 512
    RBLOCK: tl.constexpr = 512
    xoffset = tl.program_id(0) * XBLOCK
    xindex = tl.full([1], xoffset, tl.int32)
    xmask = tl.full([RBLOCK], True, tl.int1)
    rindex = tl.arange(0, RBLOCK)[:]
    roffset = 0
    rmask = tl.full([RBLOCK], True, tl.int1)
    r1 = rindex
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (r1 + (512*x0)), None)
    tmp1 = tl.broadcast_to(tmp0, [RBLOCK])
    tmp3 = triton_helpers.promote_to_tensor(tl.sum(tmp1, 0))
    tl.store(out_ptr0 + (x0), tmp3, None)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128518
Approved by: https://github.com/lezcano
2024-06-14 17:52:55 +00:00
Isuru Fernando
e397ad6883 Improve codegen for ops.masked in triton (#128054)
Fixes https://github.com/pytorch/pytorch/issues/127930
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128054
Approved by: https://github.com/peterbell10, https://github.com/lezcano
2024-06-14 11:52:56 +00:00
Jason Ansel
c897651392 [inductor] Add BackendFeature gating (#128266)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128266
Approved by: https://github.com/shunting314
2024-06-13 07:31:51 +00:00
Edward Z. Yang
3964a3ec73 Complete revamp of float/promotion sympy handling (#126905)
At a high level, the idea behind this PR is:

* Make it clearer what the promotion and int/float rules for various Sympy operations are. Operators that previously were polymorphic over int/float are now split into separate operators for clarity. We never do mixed int/float addition/multiplication etc in sympy, instead, we always promote to the appropriate operator. (However, equality is currently not done correctly.)
* Enforce strict typing on ValueRanges: if you have a ValueRange for a float, the lower and upper MUST be floats, and so forth for integers.

The story begins in **torch/utils/_sympy/functions.py**. Here, I make some changes to how we represent certain operations in sympy expressions:

* FloorDiv now only supports integer inputs; to do float floor division, do a truediv and then a trunc. Additionally, we remove the divide out addition by gcd optimization, because sympy gcd is over fields and is willing to generate rationals (but rationals are bad for ValueRange strict typing).
* ModularIndexing, LShift, RShift now assert they are given integer inputs.
* Mod only supports integer inputs; eventually we will support FloatMod (left for later work, when we build out Sympy support for floating operations). Unfortunately, I couldn't assert integer inputs here, because of a bad interaction with sympy's inequality solver that is used by the offline solver
* TrueDiv is split into FloatTrueDiv and IntTrueDiv. This allows for us to eventually generate accurate code for Python semantics IntTrueDiv, which is written in a special way to preserve precision when the inputs are >= 2**53 beyond what first coercing the integer to floats and then doing true division.
* Trunc is split to TruncToFloat and TruncToInt.
* Round is updated to return a float, not an int, making it consistent with the round op handler in Inductor. To get Python-style conversion to int, we call TruncToInt on the result.
* RoundDecimal updated to consistently only ever return a float
* Add ToFloat for explicit coercion to float (required so we can enforce strict ValueRanges typing)

In **torch/__init__.py**, we modify SymInt and SymFloat to appropriately call into new bindings that route to these refined sympy operations.  Also, we modify `torch.sym_min` and `torch.sym_max` to have promotion semantics (if one argument is a float, the return result is always a float), making them inconsistent with builtins.min/max, but possible to do type analysis without runtime information.

We also need to introduce some new op handlers in **torch/_inductor/ops_handler.py**:

* `to_int` for truncation to int64, directly corresponding to TruncToInt; this can be implemented by trunc and dtype, but with a dedicated handler it is more convenient for roundtripping in Sympy
* `int_truediv` for Python-style integer true division, which has higher precision than casting to floats and then running `truediv`

These changes have consequences. First, we need to make some administrative changes:

* Actually wire up these Sympy functions from SymInt/SymFloat in **torch/fx/experimental/sym_node.py**, including the new promotion rules (promote2)
* Add support for new Sympy functions in **torch/utils/_sympy/interp.py**, **torch/utils/_sympy/reference.py**
  * In particular, in torch.utils._sympy.reference, we have a strong preference to NOT do nontrivial compute, instead, everything in ops handler should map to a singular sympy function
  * TODO: I chose to roundtrip mod back to our Mod function, but I think I'm going to have to deal with the C/Python inconsistency this to fix tests here
* Add printer support for the Sympy functions in **torch/_inductor/codegen/common.py**, **torch/_inductor/codegen/cpp_utils.py**, **torch/_inductor/codegen/triton.py**. `int_truediv` and mixed precision equality is currently not implemented soundly, so we will lose precision in codegen for large values. TODO: The additions here are not exhaustive yet
* Update ValueRanges logic to use new sympy functions in **torch/utils/_sympy/value_ranges.py**. In general, we prefer to use the new Sympy function rather than try to roll things by hand, which is what was done previously for many VR analysis functions.

In **torch/fx/experimental/symbolic_shapes.py** we need to make some symbolic reasoning adjustments:

* Avoid generation of rational subexpressions by removing simplification of `x // y` into `floor(x / y)`. This simplification then triggers an addition simplification rule `(x + y) / c --> x / c + y / c` which is bad because x / c is a rational number now
* `_assert_bound_is_rational` is no more, we no longer generate rational bounds
* Don't intersect non-int value ranges with the `int_range`
* Support more sympy Functions for guard SYMPY_INTERP
* Assert the type of value range is consistent with the variable type

The new asserts uncovered necessary bug fixes:

* **torch/_inductor/codegen/cpp.py**, **torch/_inductor/select_algorithm.py**, **torch/_inductor/sizevars.py** - Ensure Wild/Symbol manually allocated in Inductor is marked `is_integer` so it's accepted to build expressions
* **torch/_inductor/utils.py** - make sure you actually pass in sympy.Expr to these functions
* **torch/_inductor/ir.py** - make_contiguous_strides_for takes int/SymInt, not sympy.Expr!
* **torch/export/dynamic_shapes.py** - don't use infinity to represent int ranges, instead use sys.maxsize - 1

Because of the removal of some symbolic reasoning that produced rationals, some of our symbolic reasoning has gotten worse and we are unable to simplify some guards. Check the TODO at **test/test_proxy_tensor.py**

**Reland notes.** This requires this internal fbcode diff https://www.internalfb.com/phabricator/paste/view/P1403322587 but I cannot prepare the diff codev due to https://fb.workplace.com/groups/osssupport/posts/26343544518600814/

It also requires this Executorch PR https://github.com/pytorch/executorch/pull/3911 but the ET PR can be landed prior to this landing.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126905
Approved by: https://github.com/xadupre, https://github.com/lezcano
2024-06-09 06:20:25 +00:00
Aaron Orenstein
ea614fb2b1 Flip default value for mypy disallow_untyped_defs [2/11] (#127839)
See #127836 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127839
Approved by: https://github.com/oulgen
2024-06-08 18:23:08 +00:00
Mu-Chu Lee
d9696ea624 [AOTInductor] [Tooling] Update NaN and INF Checker for AOTInductor (#127574)
Summary:
1. Integrate NaN and INF checker with existing config, controllable by env var.
2. Move inject point of NaN & INF checker earlier, this could prevent buffer freeing before check.
3. Inject debugging code in Kernel level, which prevents us trying to read buffers that are fused inplace and into a single kernel.

Test Plan:
Debugging utility.
Test and check by existing tests with env var:
```
TORCHINDUCTOR_NAN_ASSERTS=1 TORCHINDUCTOR_MAX_AUTOTUNE=0 python test/inductor/test_aot_inductor.py -k AOTInductorTestNonABICompatibleCuda.test_seq_non_abi_compatible_cuda
```

Reviewed By: ColinPeppler

Differential Revision: D57989176

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127574
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2024-06-07 16:46:26 +00:00
PyTorch MergeBot
ac51f782fe Revert "Complete revamp of float/promotion sympy handling (#126905)"
This reverts commit 2f7cfecd86.

Reverted https://github.com/pytorch/pytorch/pull/126905 on behalf of https://github.com/atalman due to Sorry need to revert - failing internally ([comment](https://github.com/pytorch/pytorch/pull/126905#issuecomment-2155118778))
2024-06-07 16:01:46 +00:00
Aaron Gokaslan
2d47385f0f [BE]: Enable ruff TCH rules and autofixes for better imports (#127688)
Automated fixes to put imports that are only used in type hints into TYPE_CHECKING imports. This also enables the RUFF TCH rules which will automatically apply autofixes to move imports in and out of TYPE_CHECKING blocks as needed in the future, this will make the initial PyTorch import faster and will reduce cyclic dependencies.

Co-authored-by: Xuehai Pan <XuehaiPan@pku.edu.cn>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127688
Approved by: https://github.com/XuehaiPan, https://github.com/ezyang, https://github.com/malfet
2024-06-06 16:55:58 +00:00
Edward Z. Yang
2f7cfecd86 Complete revamp of float/promotion sympy handling (#126905)
At a high level, the idea behind this PR is:

* Make it clearer what the promotion and int/float rules for various Sympy operations are. Operators that previously were polymorphic over int/float are now split into separate operators for clarity. We never do mixed int/float addition/multiplication etc in sympy, instead, we always promote to the appropriate operator. (However, equality is currently not done correctly.)
* Enforce strict typing on ValueRanges: if you have a ValueRange for a float, the lower and upper MUST be floats, and so forth for integers.

The story begins in **torch/utils/_sympy/functions.py**. Here, I make some changes to how we represent certain operations in sympy expressions:

* FloorDiv now only supports integer inputs; to do float floor division, do a truediv and then a trunc. Additionally, we remove the divide out addition by gcd optimization, because sympy gcd is over fields and is willing to generate rationals (but rationals are bad for ValueRange strict typing).
* ModularIndexing, LShift, RShift now assert they are given integer inputs.
* Mod only supports integer inputs; eventually we will support FloatMod (left for later work, when we build out Sympy support for floating operations). Unfortunately, I couldn't assert integer inputs here, because of a bad interaction with sympy's inequality solver that is used by the offline solver
* TrueDiv is split into FloatTrueDiv and IntTrueDiv. This allows for us to eventually generate accurate code for Python semantics IntTrueDiv, which is written in a special way to preserve precision when the inputs are >= 2**53 beyond what first coercing the integer to floats and then doing true division.
* Trunc is split to TruncToFloat and TruncToInt.
* Round is updated to return a float, not an int, making it consistent with the round op handler in Inductor. To get Python-style conversion to int, we call TruncToInt on the result.
* RoundDecimal updated to consistently only ever return a float
* Add ToFloat for explicit coercion to float (required so we can enforce strict ValueRanges typing)

In **torch/__init__.py**, we modify SymInt and SymFloat to appropriately call into new bindings that route to these refined sympy operations.  Also, we modify `torch.sym_min` and `torch.sym_max` to have promotion semantics (if one argument is a float, the return result is always a float), making them inconsistent with builtins.min/max, but possible to do type analysis without runtime information.

We also need to introduce some new op handlers in **torch/_inductor/ops_handler.py**:

* `to_int` for truncation to int64, directly corresponding to TruncToInt; this can be implemented by trunc and dtype, but with a dedicated handler it is more convenient for roundtripping in Sympy
* `int_truediv` for Python-style integer true division, which has higher precision than casting to floats and then running `truediv`

These changes have consequences. First, we need to make some administrative changes:

* Actually wire up these Sympy functions from SymInt/SymFloat in **torch/fx/experimental/sym_node.py**, including the new promotion rules (promote2)
* Add support for new Sympy functions in **torch/utils/_sympy/interp.py**, **torch/utils/_sympy/reference.py**
  * In particular, in torch.utils._sympy.reference, we have a strong preference to NOT do nontrivial compute, instead, everything in ops handler should map to a singular sympy function
  * TODO: I chose to roundtrip mod back to our Mod function, but I think I'm going to have to deal with the C/Python inconsistency this to fix tests here
* Add printer support for the Sympy functions in **torch/_inductor/codegen/common.py**, **torch/_inductor/codegen/cpp_utils.py**, **torch/_inductor/codegen/triton.py**. `int_truediv` and mixed precision equality is currently not implemented soundly, so we will lose precision in codegen for large values. TODO: The additions here are not exhaustive yet
* Update ValueRanges logic to use new sympy functions in **torch/utils/_sympy/value_ranges.py**. In general, we prefer to use the new Sympy function rather than try to roll things by hand, which is what was done previously for many VR analysis functions.

In **torch/fx/experimental/symbolic_shapes.py** we need to make some symbolic reasoning adjustments:

* Avoid generation of rational subexpressions by removing simplification of `x // y` into `floor(x / y)`. This simplification then triggers an addition simplification rule `(x + y) / c --> x / c + y / c` which is bad because x / c is a rational number now
* `_assert_bound_is_rational` is no more, we no longer generate rational bounds
* Don't intersect non-int value ranges with the `int_range`
* Support more sympy Functions for guard SYMPY_INTERP
* Assert the type of value range is consistent with the variable type

The new asserts uncovered necessary bug fixes:

* **torch/_inductor/codegen/cpp.py**, **torch/_inductor/select_algorithm.py**, **torch/_inductor/sizevars.py** - Ensure Wild/Symbol manually allocated in Inductor is marked `is_integer` so it's accepted to build expressions
* **torch/_inductor/utils.py** - make sure you actually pass in sympy.Expr to these functions
* **torch/_inductor/ir.py** - make_contiguous_strides_for takes int/SymInt, not sympy.Expr!
* **torch/export/dynamic_shapes.py** - don't use infinity to represent int ranges, instead use sys.maxsize - 1

Because of the removal of some symbolic reasoning that produced rationals, some of our symbolic reasoning has gotten worse and we are unable to simplify some guards. Check the TODO at **test/test_proxy_tensor.py**

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126905
Approved by: https://github.com/xadupre, https://github.com/lezcano
2024-06-06 02:29:45 +00:00
PyTorch MergeBot
d5cb5d623a Revert "Complete revamp of float/promotion sympy handling (#126905)"
This reverts commit fb696ef3aa.

Reverted https://github.com/pytorch/pytorch/pull/126905 on behalf of https://github.com/ezyang due to internal user reported ceiling equality simplification problem, I have a plan ([comment](https://github.com/pytorch/pytorch/pull/126905#issuecomment-2148805840))
2024-06-05 03:57:58 +00:00
Edward Z. Yang
fb696ef3aa Complete revamp of float/promotion sympy handling (#126905)
At a high level, the idea behind this PR is:

* Make it clearer what the promotion and int/float rules for various Sympy operations are. Operators that previously were polymorphic over int/float are now split into separate operators for clarity. We never do mixed int/float addition/multiplication etc in sympy, instead, we always promote to the appropriate operator. (However, equality is currently not done correctly.)
* Enforce strict typing on ValueRanges: if you have a ValueRange for a float, the lower and upper MUST be floats, and so forth for integers.

The story begins in **torch/utils/_sympy/functions.py**. Here, I make some changes to how we represent certain operations in sympy expressions:

* FloorDiv now only supports integer inputs; to do float floor division, do a truediv and then a trunc. Additionally, we remove the divide out addition by gcd optimization, because sympy gcd is over fields and is willing to generate rationals (but rationals are bad for ValueRange strict typing).
* ModularIndexing, LShift, RShift now assert they are given integer inputs.
* Mod only supports integer inputs; eventually we will support FloatMod (left for later work, when we build out Sympy support for floating operations). Unfortunately, I couldn't assert integer inputs here, because of a bad interaction with sympy's inequality solver that is used by the offline solver
* TrueDiv is split into FloatTrueDiv and IntTrueDiv. This allows for us to eventually generate accurate code for Python semantics IntTrueDiv, which is written in a special way to preserve precision when the inputs are >= 2**53 beyond what first coercing the integer to floats and then doing true division.
* Trunc is split to TruncToFloat and TruncToInt.
* Round is updated to return a float, not an int, making it consistent with the round op handler in Inductor. To get Python-style conversion to int, we call TruncToInt on the result.
* RoundDecimal updated to consistently only ever return a float
* Add ToFloat for explicit coercion to float (required so we can enforce strict ValueRanges typing)

In **torch/__init__.py**, we modify SymInt and SymFloat to appropriately call into new bindings that route to these refined sympy operations.  Also, we modify `torch.sym_min` and `torch.sym_max` to have promotion semantics (if one argument is a float, the return result is always a float), making them inconsistent with builtins.min/max, but possible to do type analysis without runtime information.

We also need to introduce some new op handlers in **torch/_inductor/ops_handler.py**:

* `to_int` for truncation to int64, directly corresponding to TruncToInt; this can be implemented by trunc and dtype, but with a dedicated handler it is more convenient for roundtripping in Sympy
* `int_truediv` for Python-style integer true division, which has higher precision than casting to floats and then running `truediv`

These changes have consequences. First, we need to make some administrative changes:

* Actually wire up these Sympy functions from SymInt/SymFloat in **torch/fx/experimental/sym_node.py**, including the new promotion rules (promote2)
* Add support for new Sympy functions in **torch/utils/_sympy/interp.py**, **torch/utils/_sympy/reference.py**
  * In particular, in torch.utils._sympy.reference, we have a strong preference to NOT do nontrivial compute, instead, everything in ops handler should map to a singular sympy function
  * TODO: I chose to roundtrip mod back to our Mod function, but I think I'm going to have to deal with the C/Python inconsistency this to fix tests here
* Add printer support for the Sympy functions in **torch/_inductor/codegen/common.py**, **torch/_inductor/codegen/cpp_utils.py**, **torch/_inductor/codegen/triton.py**. `int_truediv` and mixed precision equality is currently not implemented soundly, so we will lose precision in codegen for large values. TODO: The additions here are not exhaustive yet
* Update ValueRanges logic to use new sympy functions in **torch/utils/_sympy/value_ranges.py**. In general, we prefer to use the new Sympy function rather than try to roll things by hand, which is what was done previously for many VR analysis functions.

In **torch/fx/experimental/symbolic_shapes.py** we need to make some symbolic reasoning adjustments:

* Avoid generation of rational subexpressions by removing simplification of `x // y` into `floor(x / y)`. This simplification then triggers an addition simplification rule `(x + y) / c --> x / c + y / c` which is bad because x / c is a rational number now
* `_assert_bound_is_rational` is no more, we no longer generate rational bounds
* Don't intersect non-int value ranges with the `int_range`
* Support more sympy Functions for guard SYMPY_INTERP
* Assert the type of value range is consistent with the variable type

The new asserts uncovered necessary bug fixes:

* **torch/_inductor/codegen/cpp.py**, **torch/_inductor/select_algorithm.py**, **torch/_inductor/sizevars.py** - Ensure Wild/Symbol manually allocated in Inductor is marked `is_integer` so it's accepted to build expressions
* **torch/_inductor/utils.py** - make sure you actually pass in sympy.Expr to these functions
* **torch/_inductor/ir.py** - make_contiguous_strides_for takes int/SymInt, not sympy.Expr!
* **torch/export/dynamic_shapes.py** - don't use infinity to represent int ranges, instead use sys.maxsize - 1

Because of the removal of some symbolic reasoning that produced rationals, some of our symbolic reasoning has gotten worse and we are unable to simplify some guards. Check the TODO at **test/test_proxy_tensor.py**

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126905
Approved by: https://github.com/xadupre, https://github.com/lezcano
2024-06-04 11:47:32 +00:00
PyTorch MergeBot
d1fad416a8 Revert "Add aten._unsafe_masked_index (#116491)"
This reverts commit f03f8bc901.

Reverted https://github.com/pytorch/pytorch/pull/116491 on behalf of https://github.com/PaliC due to breaking onnx tests ([comment](https://github.com/pytorch/pytorch/pull/116491#issuecomment-2145557724))
2024-06-03 15:51:50 +00:00
Isuru Fernando
f03f8bc901 Add aten._unsafe_masked_index (#116491)
To generate masked indexing operations that would generate
masked loads in triton code

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116491
Approved by: https://github.com/lezcano, https://github.com/peterbell10
2024-06-03 14:44:03 +00:00
Yueming Hao
2a03bf5a14 [inductor] fix grid z bug for large grid (#127448)
Fixes #123210

2f3d3ddd70/torch/_inductor/runtime/triton_heuristics.py (L1733-L1753)

If a kernel's y_grid  is larger than 65535, it will be split into multiple z grids. The above grad_fn does this split before the kernel launch; however, the computations for yoffset and the y_grid are incorrect. For example, if we have xy numel of `(1*XBLOCK, 65537*YBLOCK)`, this function will return an [xyz]_grid with (1, 32768, 2). XBLOCK and YBLOCK here are used for the following `get_grid_dim`. Let's use their default values (4, 1024).

2f3d3ddd70/torch/_inductor/runtime/triton_heuristics.py (L1734)

[xyz]_grid = (1, 32768, 2) means the workload are divided to two z grids. Because the triton kernel generation still follows xy dimension, one of the exampled generated kernel is shown below.

```python
@triton.jit
def triton_(in_ptr0, out_ptr0, ynumel, xnumel, YBLOCK : tl.constexpr, XBLOCK : tl.constexpr):
    ynumel = 65537*1024
    xnumel = 1*4
    yoffset = tl.program_id(1) * (tl.program_id(2) + 1) * YBLOCK
    yindex = yoffset + tl.arange(0, YBLOCK)[None, :]
    ymask = yindex < ynumel
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
    xmask = xindex < xnumel
    x2 = xindex
    y0 = yindex % 128
    y1 = (yindex // 128)
    y3 = yindex
    tmp0 = tl.load(in_ptr0 + (y0 + (128*x2) + (512*y1)), xmask, eviction_policy='evict_last')
    tl.store(out_ptr0 + (x2 + (4*y3)), tmp0, xmask)
```

For a trition block with xyz index (0, 0, 1), its yoffset and xoffset are both 0s based on the compuation `yoffset = tl.program_id(1) * (tl.program_id(2) + 1) * YBLOCK` and `xoffset = tl.program_id(0) * XBLOCK`. So, this triton block will access the very first elements of the input.  However, the correct yoffset should be `(y_index + z_index * y_grid ) * YBLOCK` which is the starting position of the 2nd z grid.

At the same time, because we used `y_grid = y_grid // div` to compute the maximum number of element in y dimension, the y_grid is 32768. The total y grids is 32768*2 = 65536, which is less than the actual y grids 65537. So, we should use `y_grid = ceildiv(y_grid, div)` to compute the y grid to save the remaining grids.

#123210 is not about AOTInductor, the root cause is the triton kernel generated by torchinductor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127448
Approved by: https://github.com/eellison
2024-05-31 08:01:34 +00:00
Alex Baden
5d316c81be [Inductor] Add 0 initialization to Triton masked loads (#127311)
For a masked `tl.load` operation, the Triton language specifies that values masked out (i.e. where the mask evaluates to false) are undefined in the output of the load. Triton provides an optional `other` parameter which, when included, provides an explicit value to use for masked out values from the load. If the output from a masked load without the `other` parameter is used in a conditional, unexpected behavior can occur.

Despite the language specification, all Triton backends currently in use by PyTorch Inductor (NVIDIA, AMD, and Intel) 0-initialize masked loads if `other` is not present (we recently changed the Intel backend behavior to match NVIDIA and AMD because that's what our users expect, even if we are not following the Triton spec to the tee). This PR attempts to "future-proof" Inductor for new backends (or perhaps changes in the current backends? - we did not see any performance change from 0-initializing in the Intel XPU backend but one could imagine compiler optimizations to remove paths that depend on undefined) to add an explicit `other` in instances where later conditionals depend on the `tl.load` output. I also removed an exception to `other` behavior for boolean loads, which was put in place for a Triton bug that should be fixed. I added `other` to the getting started documentation as a clue that masked load behavior requires explicit initialization if, even though I don't expect `undef` values to cause the example code to fail if the underlying output is not 0-initialized.  Finally, I added other to the `make_load` function in `select_algorithm.py`, though I wasn't able to determine if that function was actually being called.

Fixes #126535

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127311
Approved by: https://github.com/jansel
2024-05-30 04:50:54 +00:00
lezcano
0fa2c5b049 Fix mask propagation in the presence of where (#125574)
Before, when calling ops.where, masks were not properly propagated. We
now restrict the optimisation to `ops.masked`, which I think it was what
the original code intended to do.

I'm not 100% sure that even in the masked case this code is not
introducing some bugs, but this is a strict improvement over the
previous state.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125574
Approved by: https://github.com/peterbell10
ghstack dependencies: #114471, #126783
2024-05-29 23:17:41 +00:00
lezcano
8a21532e53 Fix constant propagation pass (#114471)
This pass was broken in a number of ways, as we were not generating
asserts whenever we took it, even though we need to. While doing so,
we found that the analysis we were using for choosing
whether to generate asserts or not for dynamic shapes was completely
broken.

Eliminating indirect indexing in this way allows for a number of optimisations.
In particular, we can now fuse against these kernels (indirect indexing disallows fusions).

The new strategy is as follows:

- We always propagate sympy expressions if we can.
- If an expression was an indirect_indexing, we call `check_bounds`
- We also call `check_bounds` within `CSEProxy.indirect_indexing`
- The checks are issued in the buffer where they would go if the were used in a load
   - This makes them always be codegen'd before the load and stores
   - In the case of stores, they will be generated potentially much earlier than the stores themselves, which is fine.

We add quite a few asserts to preexisting tests to strengthen them. In particular, we make sure
that issuing an assert plays well with all kinds of C++ vectorisation.

For now, we rely on the logic within `_maybe_evaluate_static` to prove
these bounds. This logic is rather limited though. In the future, we might want
to rely on Z3 here to be able to prove bounds in a more general way.

Supersedes https://github.com/pytorch/pytorch/pull/113068
Fixes https://github.com/pytorch/pytorch/issues/121251

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114471
Approved by: https://github.com/peterbell10
2024-05-29 09:10:25 +00:00
Peter Bell
6aa5bb1a76 [inductor] Support persistent reductions for dynamic shapes (#126684)
Currently persistent reductions are only supported when the reduction dimension
is static, however we only really need to know that the rnumel is bounded.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126684
Approved by: https://github.com/lezcano
2024-05-27 02:30:20 +00:00
Jason Ansel
d5bf3a98db [inductor] Refactor indexing() into triton.py (#127047)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127047
Approved by: https://github.com/shunting314
ghstack dependencies: #126944, #126945
2024-05-24 22:46:20 +00:00
Jason Ansel
92433217cb [inductor] Misc refactors (#126945)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126945
Approved by: https://github.com/shunting314
ghstack dependencies: #126944
2024-05-24 22:46:20 +00:00
Jason Ansel
1b6e3e3bcb [inductor] Refactor part of IterationRangesEntry into triton.py (#126944)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126944
Approved by: https://github.com/shunting314
2024-05-24 22:46:20 +00:00
Shunting Zhang
14c5c753de [inductor] use smaller RBLOCK for expensive reduction kernels (#126477)
Triton sometimes uses less registers for more expensive kernel which results in worse perf ( https://github.com/pytorch/pytorch/issues/126463 ). This may make inductor end up with a sub-optimal config. Use a smaller max RBLOCK if the reduction potentially need many registers.

Will run perf test..

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126477
Approved by: https://github.com/jansel
2024-05-22 22:47:10 +00:00
Aaron Orenstein
e4623de4cf typing scheduler.py [2/2]: Apply types (#126656)
Add `# mypy: disallow-untyped-defs` to scheduler.py and then fix the resulting fallout.

We probably should eventually add a new node between BaseSchedulerNode and all the non-FusedSchedulerNode types to indicate the split between nodes that have a valid `self.node` and ones that don't. That would cause a lot of the `assert self.node is not None` churn to go away - but was a bigger change because a lot of code makes assumptions about types that aren't reflected in the types themselves.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126656
Approved by: https://github.com/eellison
2024-05-22 20:33:31 +00:00
Bin Bao
b40fb2de59 [AOTI] Fix a codegen issue when .item() is used for kernel arg (#126575)
Summary: fixes https://github.com/pytorch/pytorch/issues/126574 . Pass kernel argument type information into generate_args_decl, so it can generate the argument declaration instead of relying on string matching.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126575
Approved by: https://github.com/chenyang78
ghstack dependencies: #126369
2024-05-21 18:20:20 +00:00