Commit Graph

195 Commits

Wang, Eikan
9921b48558 Extend Inductor to support the third-party backend (#106874)
## Summary

This is a re-land of PR https://github.com/pytorch/pytorch/pull/100706 that addresses the compilation latency performance regression.

## Root Cause

For the C++/OpenMP backend, `codecache.pick_vec_isa()`, which checks the vectorization ISA, is a time-consuming one-shot operation. It slows down importing the `codegen.cpp` package because that package's `LoopLevel` class is decorated with `@dataclasses.dataclass`, and defining the decorated class at import time invokes `codecache.pick_vec_isa()` to initialize the default value of `LoopLevel.simd_nelements`.
c14cf312c9/torch/_inductor/codegen/cpp.py (L2883C53-L2883C53)

The Triton backend does not need this check at all, but we prefer to keep the code uniform. Therefore, the new design registers `CppScheduling` for CPU and `TritonScheduling` for Triton regardless of which backend is currently in use, which brings additional overhead to the Triton backend.

```python
def init_backend_registration(self):
    if get_scheduling_for_device("cpu") is None:
        from .codegen.cpp import CppScheduling

        register_backend_for_device("cpu", CppScheduling, WrapperCodeGen)

    if get_scheduling_for_device("cuda") is None:
        from .codegen.triton import TritonScheduling

        register_backend_for_device("cuda", TritonScheduling, WrapperCodeGen)
```

## Solution

To resolve the compilation latency regression for the Triton backend, we changed `LoopLevel` slightly ([new code changes](https://github.com/pytorch/pytorch/pull/106874/files#diff-5ab7b0235e2076a5fc6629ba0b109208940f5b94f5c13babc3e0f87cf4fcec82R2893-R2904)) by moving the `simd_nelements` initialization into `__post_init__`, which brings the compilation performance back.
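
A minimal sketch of the kind of change, assuming the original default was evaluated while the class body runs (i.e. at import time); `pick_vec_isa_nelements` below is a hypothetical stand-in for the real `codecache.pick_vec_isa()` call, not Inductor's actual code:

```python
import dataclasses
import time
from typing import Optional


def pick_vec_isa_nelements() -> int:
    """Hypothetical stand-in for the expensive, one-shot ISA check."""
    time.sleep(0.5)  # simulate the costly probe
    return 16


# Before (sketch): the default is evaluated when the class body executes,
# i.e. at import time, so merely importing the module pays for the ISA check:
#
#   @dataclasses.dataclass
#   class LoopLevel:
#       simd_nelements: int = pick_vec_isa_nelements()

# After (sketch): defer the expensive call to __post_init__, so it only runs
# when a LoopLevel is actually instantiated.
@dataclasses.dataclass
class LoopLevel:
    simd_nelements: Optional[int] = None

    def __post_init__(self) -> None:
        if self.simd_nelements is None:
            self.simd_nelements = pick_vec_isa_nelements()
```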

## Compilation Latency Performance Result
We ran a single-model benchmark and reproduced the compilation regression:

- Run `python benchmarks/dynamo/torchbench.py -dcuda --training --performance --inductor --only hf_Bart`

- W/ PR #100706, the compilation latency is about **57~58**
```
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cuda,hf_Bart,4,1.556712,109.676554,57.055242,0.936330,5.760698,6.152422,642,1,8,7
cuda,hf_Bart,4,1.646658,109.621747,57.909817,0.936330,5.760698,6.152422,642,1,8,7
```

- W/O PR #100706, the compilation latency is about **46~47**
```
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cuda,hf_Bart,4,1.599065,108.702480,47.490346,0.936330,5.760698,6.152422,642,1,8,7
cuda,hf_Bart,4,1.588419,108.431411,46.983041,0.936330,5.760698,6.152422,642,1,8,7
```

This PR fixes the compilation performance regression:

- W/ this PR #106874, the compilation latency is about **47~48**
```
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cuda,hf_Bart,4,1.586261,108.149467,47.481058,0.936330,5.760698,6.152422,642,1,8,7
cuda,hf_Bart,4,1.758915,108.613899,47.925633,0.936330,5.760698,6.152422,642,1,8,7
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106874
Approved by: https://github.com/jansel
2023-08-16 04:11:36 +00:00
lezcano
6d899571d6 Simplify sign lowering in triton (#107051)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107051
Approved by: https://github.com/peterbell10
ghstack dependencies: #107038, #107039
2023-08-14 21:01:50 +00:00
Shunting Zhang
6696a75ea8 [inductor] make thread order consistent with loop order (#106827)
I found that for a tiled kernel over a tensor with shape [a, b], we map 'a' to XBLOCK and 'b' to YBLOCK. However, 'a' should actually be the outer loop while 'b' corresponds to the inner loop; this order is picked by our loop ordering algorithm. Mapping 'a' to XBLOCK effectively assigns 'a' to the inner loop instead.
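
A toy, plain-Python sketch of the intended ordering (standing in for the tiled kernel, not the actual Triton codegen): the loop-ordering algorithm wants 'a' as the outer loop and 'b' as the inner loop, so the fast-varying block axis should correspond to 'b':

```python
import numpy as np

a, b = 4, 3
A = np.arange(a * b).reshape(a, b)   # row-major: consecutive elements vary in 'b'
B = np.arange(b * a).reshape(b, a)   # will be read transposed

out = np.empty((a, b))
for i in range(a):        # 'a' as the outer loop (what the loop ordering picked)
    for j in range(b):    # 'b' as the inner, fast-varying loop -> contiguous reads of A
        out[i, j] = A[i, j] + B[j, i]

assert np.array_equal(out, A + B.T)
```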

For a simple 'A + B.t()' kernel, making the loop order consistent brings a 1.027x speedup (1.938ms -> 1.887ms). Here are the kernel dumps:

- before fix: https://gist.github.com/shunting314/4dacf73cf495cdd7e84dede7c3e0872d
- after fix (this one is done manually): https://gist.github.com/shunting314/441e8839d24e1878c313e539b1ebd551

I tried this on DistillGPT2 and found perf is neutral, but that is because DistillGPT2 has only a single tiled pointwise kernel in its backward graph. Will check the dashboard.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106827
Approved by: https://github.com/jansel
2023-08-11 17:05:21 +00:00
Peter Bell
fa65df3745 [inductor] Type triton size arguments in the kernel index_dtype (#106870)
`JITFunction._key_of` uses the value of the argument to distinguish between
i32 and i64, but this fails if the value is used in indexing calculations where
the value exceeds `INT_MAX`.

Instead, we should use `index_dtype` which means all indexing calculations are
performed in the same dtype.
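
As a hedged illustration of the overflow risk (using NumPy as a stand-in; the real issue is in Triton's 32-bit index arithmetic): two values that each fit in i32 can produce an indexing product that exceeds `INT_MAX` and wraps:

```python
import numpy as np

# Both operands fit comfortably in int32, but their product (2.8e9) exceeds
# INT_MAX (2**31 - 1), so 32-bit arithmetic wraps around (NumPy also warns).
n_rows = np.int32(70_000)
row_stride = np.int32(40_000)

offset_i32 = n_rows * row_stride                      # wrapped, negative value
offset_i64 = np.int64(n_rows) * np.int64(row_stride)  # 2_800_000_000, correct

print(offset_i32, offset_i64)
```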

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106870
Approved by: https://github.com/lezcano
ghstack dependencies: #106626
2023-08-10 21:07:25 +00:00
Yanbo Liang
1819fe1324 Revert "Extend Inductor to support the third-party backend (#100706)" (#106652)
This reverts commit 05bd24bb35.

It caused a compilation time regression on torchbench, huggingface and dynamic models.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106652
Approved by: https://github.com/davidberard98, https://github.com/voznesenskym
2023-08-05 06:41:08 +00:00
Wang, Eikan
05bd24bb35 Extend Inductor to support the third-party backend (#100706)
This PR intends to extend Inductor to support third-party backends that focus only on code generation, just as the C++/OpenMP and Triton backends do.

Currently, the code generated by Inductor contains two major parts: the kernels, and the Python wrapper that glues the kernels together. Therefore, a third-party backend needs to customize both parts to generate its specific code.

- Python wrapper code generation

  Inductor provides a `WrapperCodeGen` class to generate the Python wrapper code that glues the kernels. Therefore, it is straightforward for a third-party backend to generate backend-specific Python wrapper code: it just needs to inherit from `WrapperCodeGen` and override the relevant member functions.

- Kernel code generation

  Kernel code generation is driven by the different `Scheduling` classes. Hence, a third-party backend needs to provide a custom `Scheduling` for its specific kernel code generation. Currently, `CppScheduling` and `TritonScheduling` serve the C++/OpenMP and Triton backends, respectively, but there is no common `Scheduling` class. Based on how scheduling is invoked, this PR abstracts a common `Scheduling` class containing the following member functions.

  -   [group_fn](71c4becda7/torch/_inductor/scheduler.py (LL649C64-L649C64))
  - [flush](71c4becda7/torch/_inductor/scheduler.py (L1150))
  - [can_fuse_vertical](71c4becda7/torch/_inductor/scheduler.py (L1006))
  - [can_fuse_horizontal](71c4becda7/torch/_inductor/scheduler.py (LL1008C45-L1008C64))
  - [codegen_template](71c4becda7/torch/_inductor/scheduler.py (L1234)) _This function is only available for Triton. If the third-party backend behaves as a subclass of `TritonScheduling`, it can override or reuse it._
  - [codegen_nodes](71c4becda7/torch/_inductor/scheduler.py (L1234))
  - [codegen_sync](71c4becda7/torch/_inductor/scheduler.py (LL1251C1-L1251C1)). _This function is only used for Triton debugging purposes, but it might also be useful for other computation devices. Therefore, we'd prefer to keep this function._

  The third-party backend needs to inherit from the `Scheduling` class and implement these functions.

Some other code generation classes, like `CppKernel` and `TritonKernel`, are used by, or are part of the logic of, either `Scheduling` or `WrapperCodeGen`. Hence, this PR does not define an interface for them and leaves the flexibility to the third-party backend, which can implement these classes from scratch or reuse them by inheriting and overriding.
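
A rough, hedged skeleton of what a third-party backend's side of this interface might look like, based only on the member functions listed above; the class and device names are illustrative, and in a real backend the class would inherit from Inductor's common `Scheduling` base and be registered alongside a `WrapperCodeGen` subclass:

```python
class MyBackendScheduling:
    """Illustrative skeleton; a real backend would subclass Inductor's Scheduling."""

    def __init__(self, scheduler):
        self.scheduler = scheduler

    def group_fn(self, sizes):
        # Decide how loop dimensions are grouped for this backend's kernels.
        return tuple(sizes)

    def can_fuse_vertical(self, node1, node2):
        # Producer/consumer fusion policy for this backend.
        return False

    def can_fuse_horizontal(self, node1, node2):
        # Sibling fusion policy for this backend.
        return False

    def codegen_nodes(self, nodes):
        # Emit backend-specific kernel code for a group of scheduler nodes.
        raise NotImplementedError

    def codegen_sync(self):
        # Optional device synchronization, mainly useful for debugging.
        pass

    def flush(self):
        # Flush any pending generated kernels into the wrapper output.
        pass


# Registration would then look roughly like the helper shown in the re-land
# above (names/import paths assumed):
#   register_backend_for_device("my_device", MyBackendScheduling, MyWrapperCodeGen)
```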

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100706
Approved by: https://github.com/jansel
2023-08-02 05:13:51 +00:00
Michael Lazos
5cbd3fc412 [Inductor] Fuse non-foreach ops with foreach ops without iterating over all subnodes (#106008)
Previously, when fusing a single node into a foreach op, the scheduler would iterate over every subnode and check whether it could be fused. This PR adds a mapping so that the node to fuse with can be found more quickly by checking dependencies.
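
A minimal sketch of the lookup idea, using toy dict-based nodes rather than Inductor's actual scheduler node types:

```python
# Instead of scanning every subnode of a foreach op, index subnodes by the
# buffer names they write, so a candidate node can be matched via its reads.

def build_write_index(foreach_subnodes):
    """Map each buffer name written by a subnode to that subnode."""
    name_to_subnode = {}
    for subnode in foreach_subnodes:
        for buf_name in subnode["writes"]:
            name_to_subnode[buf_name] = subnode
    return name_to_subnode


def find_fusion_target(node, name_to_subnode):
    """Find the subnode producing one of `node`'s reads in O(#reads)."""
    for buf_name in node["reads"]:
        if buf_name in name_to_subnode:
            return name_to_subnode[buf_name]
    return None


# Toy usage:
subnodes = [{"writes": ["buf0"]}, {"writes": ["buf1"]}]
index = build_write_index(subnodes)
print(find_fusion_target({"reads": ["buf1"]}, index))  # -> {'writes': ['buf1']}
```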

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106008
Approved by: https://github.com/jansel
2023-07-27 21:40:24 +00:00
Jason Ansel
977df45a0f [inductor] Call render() once for templates (#105987)
This is more code, but perhaps easier to understand?  Both @Chillee and @ipiszy expressed confusion that we rendered templates twice to reach a fixed point.  This removes that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105987
Approved by: https://github.com/Chillee
2023-07-27 16:34:38 +00:00
Edward Z. Yang
716f37cef8 If we can't statically prove 32-bit indexing OK, only add guard if hint exists (#106004)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106004
Approved by: https://github.com/lezcano, https://github.com/albanD
2023-07-26 16:36:29 +00:00
SherlockNoMad
a44f8894fa [Inductor] Provenance tracking for wrapper code (#105717)
Summary:
Add comments in wrapper code for better provenance tracking

Sample inductor wrapper output:
```
# Source Nodes: [mm_1], Original ATen: [aten.mm]
extern_kernels.mm(as_strided(tangents_1, (500, 20), (1, 500)), view, out=buf1)

# Source Nodes: [l__self___linear], Original ATen: [aten.addmm]
extern_kernels.addmm(primals_2, as_strided(primals_3, (20, 500), (500, 1)), as_strided(primals_1, (500, 500), (1, 500)), alpha=1, beta=1, out=buf0)
```

in cpp wrapper
```
        // Source Nodes: [bmm_1], Original ATen: bmm
        at::bmm_out(buf0, arg0_1, arg1_1);
```

Test Plan: OSS CI

Differential Revision: D47657260

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105717
Approved by: https://github.com/desertfire, https://github.com/jansel
2023-07-21 23:06:43 +00:00
Shunting Zhang
1e87778552 [inductor] refactor wrapper benchmark code out of utils.py (#105584)
Refactor wrapper benchmark out of utils.py since
1. utils.py gets too large
2. I plan to add more code to wrapper benchmark for multi-kernel.

This is split out from https://github.com/pytorch/pytorch/pull/103469

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105584
Approved by: https://github.com/jansel
2023-07-21 00:01:35 +00:00
David Berard
28d018dafd [inductor] Implement bucketize() for dependencies.py (#105102)
dependencies.py tracks reads and writes, which are used to identify dependencies between buffers: i.e., if buffer X reads buffer Y, then X depends on Y. ops.bucketize() reads from an offsets tensor, so we should track it in dependencies.py to identify dependencies correctly. Since bucketize performs a binary search over the offsets tensor, the dependency is marked as a StarDep to indicate that the entire tensor is needed.

Use case: we find that jagged tensor dense_to_jagged ops - which use bucketize() to map jagged indices to dense indices - perform better if the bucketize() kernel is separated from the gather kernel. Previously, because bucketize() wasn't marked as reading anything, it would just get inlined.

Differential Revision: [D47422704](https://our.internmc.facebook.com/intern/diff/D47422704)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105102
Approved by: https://github.com/eellison
2023-07-17 19:15:00 +00:00
Shunting Zhang
8c479d32da [inductor][easy] avoid duplicate kernel definitions (#105099)
When running BertForMaskedLM, I found that if I enable the kernel benchmark, essentially identical kernels are defined once per call site. The reason is that the benchmark harness of those kernels uses a different seed_offset for each invocation. It should be safe to force seed_offset to 0 so we can deduplicate identical kernel definitions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105099
Approved by: https://github.com/jansel
2023-07-17 05:34:09 +00:00
PyTorch MergeBot
e68cf02420 Revert "[inductor] Implement bucketize() for dependencies.py (#105102)"
This reverts commit cff5d6a22c.

Reverted https://github.com/pytorch/pytorch/pull/105102 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/105102#issuecomment-1637261924))
2023-07-17 01:22:19 +00:00
David Berard
cff5d6a22c [inductor] Implement bucketize() for dependencies.py (#105102)
dependencies.py tracks reads and writes, which are used to identify dependencies between buffers: i.e., if buffer X reads buffer Y, then X depends on Y. ops.bucketize() reads from an offsets tensor, so we should track it in dependencies.py to identify dependencies correctly. Since bucketize performs a binary search over the offsets tensor, the dependency is marked as a StarDep to indicate that the entire tensor is needed.

Use case: we find that jagged tensor dense_to_jagged ops - which use bucketize() to map jagged indices to dense indices - perform better if the bucketize() kernel is separated from the gather kernel. Previously, because bucketize() wasn't marked as reading anything, it would just get inlined.

Differential Revision: [D47422704](https://our.internmc.facebook.com/intern/diff/D47422704)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105102
Approved by: https://github.com/eellison
2023-07-14 19:54:06 +00:00
lezcano
c099b7e07a ValueRange analysis for indirect indexing (#102611)
We do so by forwarding ValueRange analysis from IR buffers to CSEvars

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102611
Approved by: https://github.com/eellison, https://github.com/peterbell10
2023-07-14 13:43:05 +00:00
lezcano
88dcecdf54 Remove unnecessary casting in triton (#104975)
This used to be necessary before we advanced the pin past https://github.com/openai/triton/pull/1641

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104975
Approved by: https://github.com/peterbell10, https://github.com/Chillee
2023-07-14 13:43:05 +00:00
Peter Bell
66fb83293e [inductor] Add min/max to index propagation pass (#105020)
This allows `ops.minimum` and `ops.maximum` used in indirect indexing to be hoisted
into direct indexing expressions. I also add support for Min/Max to the cpp printer
and fix the triton printer to support multi-argument Min/Max.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105020
Approved by: https://github.com/lezcano
2023-07-12 19:03:01 +00:00
Nikita Karetnikov
49a2b72927 [inductor] handle Min and Max in TritonPrinter (#104944)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104944
Approved by: https://github.com/ezyang
2023-07-11 17:11:31 +00:00
Edward Z. Yang
6059fea760 Make perf_hint_log report at info level (#104873)
If you do it at warning, these log messages will get displayed by
default, which is not the intended behavior.
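
A small illustration with plain Python `logging` (the `torch._logging` machinery is more involved, but the default-visibility behavior it builds on is the same):

```python
import logging

logging.basicConfig()  # root logger defaults to WARNING
log = logging.getLogger("inductor.perf_hints")

log.warning("shown by default")                    # printed
log.info("hidden unless the user opts into INFO")  # suppressed by default
```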

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104873
Approved by: https://github.com/mlazos
2023-07-10 23:46:34 +00:00
Edward Z. Yang
0300be5b7b Fix AttributeError("'constexpr' object has no attribute 'type'") (#104831)
Fixes #104759

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104831
Approved by: https://github.com/Skylion007, https://github.com/voznesenskym
2023-07-10 23:26:42 +00:00
Peter Bell
bcdd4130b4 [inductor] Fix float64 constants in triton codegen (#104830)
Fixes #101684

Before this change, we get a float constant in triton
```
tmp0 = 0.2
```
which in triton IR becomes a float32 value
```
%cst_0 = arith.constant dense<2.000000e-01> : tensor<2xf32>
```

After, we get a tensor with explicit type
```
tmp0 = tl.full([1], 0.2, tl.float64)
```
which does generate a float64 in the triton IR
```
%cst_0 = arith.constant dense<2.000000e-01> : tensor<2xf64>
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104830
Approved by: https://github.com/lezcano
2023-07-10 19:40:50 +00:00
Peter Bell
e80787c8e1 [inductor] Split ops.reduction into reduction and store_reduction (#102737)
This is intended as a first step towards reductions with multiple outputs. This
also incidentally improves CSE of reductions under C++ codegen. For example,
```python
def fn(x):
    return torch.argmin(x, dim=-1), torch.argmin(x, dim=-1)
```

Currently this generates two reductions, where the common load is CSEd
```cpp
for(long i1=static_cast<long>(0L); i1<static_cast<long>(10); i1+=static_cast<long>(1L))
{
    auto tmp0 = in_ptr0[static_cast<long>(i1 + (10L*i0))];
    if (tmp_acc0.value > tmp0) {
        tmp_acc0.index = i1; tmp_acc0.value = tmp0;
    }
    if (tmp_acc1.value > tmp0) {
        tmp_acc1.index = i1; tmp_acc1.value = tmp0;
    }
}
auto tmp1 = tmp_acc0.index;
out_ptr0[static_cast<long>(i0)] = tmp1;
auto tmp2 = tmp_acc1.index;
out_ptr1[static_cast<long>(i0)] = tmp2;
```

but with this change it gets CSEd to a single accumulator

```cpp
for(long i1=static_cast<long>(0L); i1<static_cast<long>(10L); i1+=static_cast<long>(1L))
{
    auto tmp0 = in_ptr0[static_cast<long>(i1 + (10L*i0))];
    if (tmp_acc0.value > tmp0) {
        tmp_acc0.index = i1; tmp_acc0.value = tmp0;
    }
}
auto tmp1 = tmp_acc0.index;
out_ptr0[static_cast<long>(i0)] = tmp1;
out_ptr1[static_cast<long>(i0)] = tmp1;
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102737
Approved by: https://github.com/jgong5, https://github.com/lezcano
2023-07-08 20:48:29 +00:00
Peter Bell
0ceca92f80 [inductor] Add single pass "var_unnormalized" reduction_type (#102486)
This is a bit inefficient because it computes the mean and throws it
away since ir.Reduction nodes only have 1 output. However, the mean
can at least be scheduled into the same loop as the variance now since
there is no data dependency. Thus we can take fewer passes over the
data.
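
A rough plain-Python sketch of a single-pass unnormalized variance, assuming a Welford-style accumulator is the flavor of reduction meant here; the running mean is maintained as a by-product and then discarded, matching the description above:

```python
def var_unnormalized_single_pass(xs):
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for i, x in enumerate(xs, start=1):
        delta = x - mean
        mean += delta / i
        m2 += delta * (x - mean)
    return m2  # the mean was computed along the way but is thrown away


# Two-pass reference: sum((x - mean)^2)
xs = [1.0, 2.0, 4.0, 8.0]
mean = sum(xs) / len(xs)
assert abs(var_unnormalized_single_pass(xs) - sum((x - mean) ** 2 for x in xs)) < 1e-9
```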

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102486
Approved by: https://github.com/lezcano, https://github.com/jansel
2023-07-08 20:48:29 +00:00
David Berard
2df939aaca [inductor] Update ops.bucketize to take offsets_size as a sympy.Expr (#104756)
Background/problem: ops.bucketize needs to take a value `offsets_size`, which is the length of the `offsets` tensor. It is used, e.g., for the bounds of the binary search over the `offsets` tensor. The previous implementation of `ops.bucketize` expected `offsets_size` to be a CSEVariable; i.e. we'd pass `offsets_size = ops.index_expr(offsets.get_size()[0])` into `ops.bucketize()`.  However, `ops.index_expr` will sometimes broadcast, turning the scalar `offsets_size` into a tensor. That caused errors, because [triton_helpers.bucketize_binary_search](a2fe6953bc/torch/_inductor/triton_helpers.py (L153-L155)) expects `offsets_size` to be a scalar. [Link - where the broadcasting happens](a2fe6953bc/torch/_inductor/codegen/triton.py (L1056))

Solution (this PR): Instead of passing `offsets_size` into `ops.bucketize` as a CSEVariable, pass in a sympy.Expr. Then, inside ops.bucketize, convert the sympy.Expr into a string that can be used in the generated triton code.

Differential Revision: [D47282413](https://our.internmc.facebook.com/intern/diff/D47282413)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104756
Approved by: https://github.com/jansel
2023-07-08 01:08:55 +00:00
David Berard
d8cb80e382 [inductor] If a kernel contains bucketize, try using config with num_elements_per_warp=32 (#104456)
In binary-search triton implementations (#104007), num_elements_per_warp=32 performs a lot better than larger values.

This PR adds an autotuning config option for this purpose. But since autotuning can affect compile times and this config isn't generally useful, we only try it if bucketize is present. This is done by adding an extra field to triton_meta, which is used by pointwise autotuning.

Performance: reused https://gist.github.com/davidberard98/066fd2115f59f5889ef61e4527d1eba5.

Before:
```
Eager 0.30088499188423157 ms
PT2   0.9296960234642029 ms
```

After:
```
Eager 0.3011910021305084 ms
PT2   0.22977299988269806 ms
```

Differential Revision: [D47237103](https://our.internmc.facebook.com/intern/diff/D47237103)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104456
Approved by: https://github.com/eellison
2023-07-07 20:32:41 +00:00
PyTorch MergeBot
8ca63ff9a8 Revert "[inductor] Add single pass "var_unnormalized" reduction_type (#102486)"
This reverts commit 7e098f9559.

Reverted https://github.com/pytorch/pytorch/pull/102486 on behalf of https://github.com/clee2000 due to sorry but this seems to have broken inductor/test_torchinductor.py::CpuTests::test_std_cpu on mac x86 64 machines 7e098f9559 https://github.com/pytorch/pytorch/actions/runs/5479008241/jobs/9981443710 ([comment](https://github.com/pytorch/pytorch/pull/102486#issuecomment-1624739465))
2023-07-07 04:57:20 +00:00
PyTorch MergeBot
1280b19827 Revert "[inductor] Split ops.reduction into reduction and store_reduction (#102737)"
This reverts commit 59b8d5be74.

Reverted https://github.com/pytorch/pytorch/pull/102737 on behalf of https://github.com/clee2000 due to sorry but i need to revert this to revert the other one in the stack ([comment](https://github.com/pytorch/pytorch/pull/102737#issuecomment-1624735108))
2023-07-07 04:53:14 +00:00
Shunting Zhang
a358a9262e [inductor] coordesc tuner bug fix with no_x_dim kernel (#104692)
We recently added an optimization to squash the x dimension for persistent reduction kernels when we are confident that XBLOCK will always be 1. We need to update the code so that the coordinate descent tuner does not tune XBLOCK in this case.

Test command. Fail before the fix and pass after.
```
TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 python benchmarks/dynamo/huggingface.py --backend inductor --amp --accuracy --only BertForMaskedLM --inference
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104692
Approved by: https://github.com/jansel
2023-07-06 17:47:02 +00:00
Peter Bell
59b8d5be74 [inductor] Split ops.reduction into reduction and store_reduction (#102737)
This is intended as a first step towards reductions with multiple outputs. This
also incidentally improves CSE of reductions under C++ codegen. For example,
```python
def fn(x):
    return torch.argmin(x, dim=-1), torch.argmin(x, dim=-1)
```

Currently this generates two reductions, where the common load is CSEd
```cpp
for(long i1=static_cast<long>(0L); i1<static_cast<long>(10); i1+=static_cast<long>(1L))
{
    auto tmp0 = in_ptr0[static_cast<long>(i1 + (10L*i0))];
    if (tmp_acc0.value > tmp0) {
        tmp_acc0.index = i1; tmp_acc0.value = tmp0;
    }
    if (tmp_acc1.value > tmp0) {
        tmp_acc1.index = i1; tmp_acc1.value = tmp0;
    }
}
auto tmp1 = tmp_acc0.index;
out_ptr0[static_cast<long>(i0)] = tmp1;
auto tmp2 = tmp_acc1.index;
out_ptr1[static_cast<long>(i0)] = tmp2;
```

but with this change it gets CSEd to a single accumulator

```cpp
for(long i1=static_cast<long>(0L); i1<static_cast<long>(10L); i1+=static_cast<long>(1L))
{
    auto tmp0 = in_ptr0[static_cast<long>(i1 + (10L*i0))];
    if (tmp_acc0.value > tmp0) {
        tmp_acc0.index = i1; tmp_acc0.value = tmp0;
    }
}
auto tmp1 = tmp_acc0.index;
out_ptr0[static_cast<long>(i0)] = tmp1;
out_ptr1[static_cast<long>(i0)] = tmp1;
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102737
Approved by: https://github.com/jgong5, https://github.com/lezcano
2023-07-06 16:22:19 +00:00
Peter Bell
7e098f9559 [inductor] Add single pass "var_unnormalized" reduction_type (#102486)
This is a bit inefficient because it computes the mean and throws it
away since ir.Reduction nodes only have 1 output. However, the mean
can at least be scheduled into the same loop as the variance now since
there is no data dependency. Thus we can take fewer passes over the
data.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102486
Approved by: https://github.com/lezcano, https://github.com/jansel
2023-07-06 00:00:59 +00:00
lezcano
7ae100628e Move most SymPy functions to their own file (#104556)
All these are standalone implementations of some functions and they
don't depend on anything else, so we better have them under the
`_sympy/` folder on their own

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104556
Approved by: https://github.com/ezyang
2023-07-04 03:53:48 +00:00
David Berard
e9d2d74f0a [inductor] Add prims._inductor_bucketize and add lowerings (#104007)
**TL;DR**: This PR is a first step in adding lowerings for torch.bucketize. It adds an initial lowering for this op - but because this implementation is not currently efficient, it registers the lowering for prims._inductor_bucketize. After we make the implementation more efficient, we'll remove prims._inductor_bucketize and add the lowering directly to torch.bucketize.

**Background - torch.bucketize**: torch.bucketize(values, boundaries, right=False): for an arbitrary tensor of values and a non-decreasing 1D tensor of boundaries that define buckets, it returns the index of the bucket that each of the values will fall in. e.g. for values [0, 1, 2, 3, 4] and boundaries [1, 3], it will return [0, 0, 1, 1, 2].
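
The example above can be checked directly against the existing op:

```python
import torch

values = torch.tensor([0, 1, 2, 3, 4])
boundaries = torch.tensor([1, 3])

print(torch.bucketize(values, boundaries))              # tensor([0, 0, 1, 1, 2])
print(torch.bucketize(values, boundaries, right=True))  # tensor([0, 1, 1, 2, 2])
```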

**Implementation**: This PR adds a new inductor op called "bucketize". In this PR it only has a triton implementation - for CPU it is a fallback. The triton implementation uses a binary search in `triton_helpers.py`. This PR also adds a new prim `_inductor_bucketize()` for testing purposes and adds lowering for this op.

~~**"right"**: The current behavior of the "right" kwarg in the inductor op is the opposite of the behavior of the torch op. "right" controls how the op treats a value that is equal to one of the boundary values. In the torch op, "right=True" means "if a value is equal to a boundary value, then put it in the bucket to the right". In the inductor op, "right=True" means "the right boundary of a bucket is closed". These are opposite. **I'm open to switching the behavior of the inductor op** - but I chose to implement this way because I think it makes more sense, and I think the torch.bucketize behavior may have been a mistake (it's the opposite of numpy.digitize).~~ Switched the behavior of the inductor bucketize op to match the torch op

* places where "right" means "if a value is equal to a boundary value, then put it in the bucket to the right" (i.e. current torch.bucketize behavior)
  + current torch.bucketize behavior
  + table in [torch.bucketize docs](https://pytorch.org/docs/stable/generated/torch.bucketize.html)
* places where "right" means "the right boundary of a bucket is closed":
  + the text description of [torch.bucketize docs](https://pytorch.org/docs/stable/generated/torch.bucketize.html) (observed in #91580)
  + [numpy.digitize](https://numpy.org/doc/stable/reference/generated/numpy.digitize.html) (which is basically the same op)

**Performance**: Benchmark script: "values" is a [16, 1024, 1024] float32 tensor and "boundaries" is a [1025] tensor (i.e. defining 1024 buckets).

As is:
```
Eager 0.30117499828338623 ms
PT2   0.9298200011253357 ms
```

But performance improves significantly if we add an additional pointwise autotuning config (WIP in #104456):
```
Eager 0.3015420138835907 ms
PT2   0.23028500378131866 ms
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104007
Approved by: https://github.com/jansel
2023-07-03 16:52:38 +00:00
Jack Taylor
80ea3422f0 [ROCm] Enable tl.reduce usage on ROCm (#104099)
Reverts the explicit aten.prod fallback on ROCm and enables the use of tl.reduce in triton codegen. This PR also enables an optimisation that was previously conditionalised out for ROCm: https://github.com/pytorch/pytorch/pull/102444

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104099
Approved by: https://github.com/peterbell10, https://github.com/malfet
2023-06-27 16:21:32 +00:00
Michael Lazos
3e674b75b1 Allow fusion of epilogue copies with upstream foreach ops (#104018)
Allow fusion of epilogue copies with foreach kernel scheduler nodes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104018
Approved by: https://github.com/jansel
2023-06-23 21:39:59 +00:00
Peter Bell
d7994dfd07 [inductor] Add triton_helpers.any instead of reusing max (#103974)
I doubt there's much difference in performance, but this improves readability of
the generated code, e.g.

```python
tmp8 = triton_helpers.max2(tmp7, 1)[:, None]
```
becomes
```python
tmp8 = triton_helpers.any(tmp7, 1)[:, None]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103974
Approved by: https://github.com/lezcano
2023-06-22 20:06:21 +00:00
Antoni Viros i Martin
0d653730ce Refactor bits for the codegen cache (#103452)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103452
Approved by: https://github.com/ezyang
2023-06-22 13:04:22 +00:00
Peter Bell
b1adaa8777 [inductor] Fix no-xdim reductions (#103527)
Fixes #103481

Normally triton tensors have shape `[XBLOCK, RBLOCK]`, or some variation where
the lengths are 1 but the number of dimensions is the same. The `no_x_dim`
change, in addition to removing the x dimension, also removed the r dimension
from certain values, such as the results of reductions and the `xindex` variable.

This fixes those two cases to correctly produce tensors of shape `[1]`,
equivalent to the old shape `[XBLOCK, 1]` with the x-dimension dropped.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103527
Approved by: https://github.com/ngimel
2023-06-14 16:32:17 +00:00
Peter Bell
ccf56eca84 [inductor] Fix is_broadcasted (#103514)
Fixes #103491

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103514
Approved by: https://github.com/ngimel
2023-06-14 13:30:48 +00:00
Edward Z. Yang
597e2a11a3 indexing_dtype_strength_reduction more aggressive free_symbols tests (#103470)
ValueRanges can't handle symbolic bounds. Be a bit more careful about detecting when expressions with free symbols are passed in, and fall back to a "don't know" range if this occurs.
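
A tiny sympy illustration of the `free_symbols` check (illustrative only, not the Inductor code):

```python
import sympy

s0 = sympy.Symbol("s0")

print((s0 + 1).free_symbols)          # {s0} -> symbolic bound, fall back to "don't know"
print(sympy.Integer(7).free_symbols)  # set() -> constant, safe to compute a tight range
```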

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103470
Approved by: https://github.com/eellison
2023-06-13 16:00:41 +00:00
Edward Z. Yang
c3fdfca5da Always create ShapeEnv, always apply unspec logic (#103302)
Originally, my goal for this PR was to remove the `dynamic_shapes` tests in torch/_dynamo/variables/builder.py. However, one thing led to another, and it turned out to be easiest to do all of the following in one go:

* Unconditionally allocate a ShapeEnv, no matter if dynamic_shapes is enabled or not (torch/_dynamo/output_graph.py). There is a small adjustment to export torch/_dynamo/eval_frame.py to account for the fact that a ShapeEnv always exists, even if you're not doing symbolic export.
* Remove dynamic_shapes test from unspec logic (torch/_dynamo/variables/builder.py), the original goal
* Specialize strides and storage offset if all sizes are dynamic (torch/fx/experimental/symbolic_shapes.py). This is required to deal with the unconditional ShapeEnv: if a ShapeEnv exists, fake tensor-ification may choose to allocate symbols. The idea is that with `automatic_dynamic_shapes == False`, Dynamo should never request dynamic sizes, but this invariant was not upheld for nontrivial strides/offsets.

The rest are just auxiliary fixups from the above:

* Workaround bug in FakeTensorProp where sometimes it doesn't return a FakeTensor (torch/fx/passes/fake_tensor_prop.py), see https://github.com/pytorch/pytorch/pull/103395 for follow up
* Make ShapeProp correctly handle int inputs (torch/fx/passes/shape_prop.py)
* Disable indexing strength reduction if `assume_static_by_default` is False (torch/_inductor/codegen/triton.py)
* Fix hf_T5_generate to NOT toggle `assume_static_by_default` if dynamic shapes is not enabled (benchmarks/dynamo/common.py); technically this is not necessary anymore but it's in for safety.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103302
Approved by: https://github.com/voznesenskym
2023-06-12 12:48:28 +00:00
Yanbo Liang
686d7e4c48 [Inductor] Fix x.view(dtype) decomp and make inductor support it (#102920)
Fixes #99804

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102920
Approved by: https://github.com/jansel, https://github.com/ngimel
2023-06-07 17:10:54 +00:00
Edward Z. Yang
f760899864 Teach Triton codegen to generate sqrt (#103084)
Fixes https://github.com/pytorch/pytorch/issues/100972

I know ngimel doesn't like this sort of fix because we shouldn't
actually be computing sqrt at runtime; I'm open to some sort of
perf warning saying that we're spending FLOPs weirdly.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103084
Approved by: https://github.com/albanD, https://github.com/Skylion007, https://github.com/ngimel
2023-06-07 03:03:56 +00:00
Bin Bao
fbbde8df69 [inductor] fix a numel expr codegen issue (#103005)
Summary: Correctly use pexpr or cexpr for generating symbolic expressions
during wrapper codegen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103005
Approved by: https://github.com/jansel
2023-06-06 14:08:05 +00:00
Bin Bao
44fdfd3222 [inductor] Support select_algorithm with cpp_wrapper (#103003)
Summary: This is one step towards getting cpp_wrapper to work with max_autotune.
Switch to use unique kernel name to cache generated cubin file.

This is a copy of https://github.com/pytorch/pytorch/pull/102738 to solve a ghstack issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103003
Approved by: https://github.com/jansel
2023-06-06 14:08:05 +00:00
lezcano
2c2e4d5228 Populate the eviction_policy field for load/store properly (#91316)
This helps with kernels that make use of caching, like mid-range softmax,
which reads the data three times.

Selecting `eviction_policy=evict_first` in the last loop of the softmax
operation seems to give a 7-10% speed-up vs. selecting `evict_last` which
was the previous option. I'll put up some benchmarks soon™.
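
For reference, a minimal hand-written Triton kernel (illustrative, not Inductor's generated code) showing where such an eviction hint is attached:

```python
import triton
import triton.language as tl


@triton.jit
def scale_kernel(x_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    offsets = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements
    # The final read of data that will not be reused is marked evict_first so it
    # does not displace cache lines that later loads still need.
    x = tl.load(x_ptr + offsets, mask=mask, eviction_policy="evict_first")
    tl.store(out_ptr + offsets, x * 2.0, mask=mask)
```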

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91316
Approved by: https://github.com/ngimel, https://github.com/jansel
2023-06-05 13:54:36 +00:00
Shunting Zhang
86c7652503 [inductor] layout optimization for conv (#99773)
A convolution kernel with channels-last inputs runs much faster than one with contiguous inputs. This PR leverages that to optimize tensor layouts so that we provide channels-last inputs to convolutions. Some care needs to be taken not to convert tensor layouts back and forth between contiguous and channels-last, since those extra copies hurt performance quite a bit.
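
A small eager-mode illustration of the layout idea (illustrative module and shapes, not the PR's actual graph pass): keep the convolution's weights and inputs in channels-last and avoid bouncing between layouts:

```python
import torch

conv = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1)
conv = conv.to(memory_format=torch.channels_last)   # weights stored channels-last

x = torch.randn(4, 3, 32, 32).to(memory_format=torch.channels_last)
y = conv(x)  # output propagates the channels-last layout, avoiding extra copies

print(y.is_contiguous(memory_format=torch.channels_last))  # expected: True
```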

Latest perf number [here](https://hud.pytorch.org/benchmark/compilers?startTime=Wed%2C%2024%20May%202023%2023%3A40%3A37%20GMT&stopTime=Wed%2C%2031%20May%202023%2023%3A40%3A37%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&lBranch=shunting-layout-opt-19&lCommit=baa797fc100688dfb044fbcbdebcfd2591710f78&rBranch=main&rCommit=999bae0f54108ffc5b7cf2524a02a83901554b16)
- TB: 1.64x -> 1.69x
- HF: 1.79x -> 1.78x (random noise)
- TIMM: 1.51x -> 1.65x

Right now we disable layout optimization for dynamic shape since there is perf loss in that combination. Here is a GH issue to followup: https://github.com/pytorch/pytorch/issues/102670

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99773
Approved by: https://github.com/jansel
2023-06-02 21:08:18 +00:00
Aleksandar Samardžić
51e0f9e858 Add missing decompositons/lowerings for logical/bitwise operators (#102566)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102566
Approved by: https://github.com/lezcano, https://github.com/alexsio27444, https://github.com/jgong5
2023-06-02 14:27:17 +00:00
Peter Bell
2f96981e5a [inductor] Reduce duplication of reduction combine functions (#99661)
Currently reduction bodies are duplicated in several different places.
This reduces duplication by reusing the `combine_fn` definition from
`_unroll_reduction_fn` in the triton codegen. For cpp this also makes
better use of `reduction_combine{,_vec}` by using them to generate the
`omp declare reduction` line and the `vec_reduce_all` call.

For triton the only change is that the combine step gets spread
over two lines, e.g. instead of:
```python
_tmp1 = tl.where(rmask & xmask, triton_helpers.maximum(_tmp1, tmp0), _tmp1)
```
we get
```python
tmp2 = triton_helpers.maximum(_tmp1, tmp0)
_tmp1 = tl.where(rmask & xmask, tmp2, _tmp1)
```

For cpp the only change is that in-place reduction operations are now written as
an out-of-place operation and an assignment, e.g. instead of
```cpp
omp_out += omp_in
```
we generate
```cpp
omp_out = omp_out + omp_in
```

This is a purely cosmetic change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99661
Approved by: https://github.com/lezcano, https://github.com/ngimel
2023-06-01 18:02:17 +00:00
Bin Bao
c58264c3e9 [inductor] Support multiple symbolic numel expr in CudaWrapperCodeGen (#102093)
Summary: Add a set to avoid generating extra `auto` when seeing the
symbolic numel expression for the second time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102093
Approved by: https://github.com/jansel
2023-05-30 16:08:00 +00:00