Commit Graph

481 Commits

Author SHA1 Message Date
leslie-fang-intel
73a6a40346 [Inductor][CPP] Fix outer loop fusion buffer removed (#144243)
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/144186. For the test case reported in the issue, we have saw some nodes with `LoopNest`

-  `LoopNest(loops=[LoopLevel(var=x0, size=8, offset=0, tiled_size=0, steps=1, parallel=0, simd_omp=False, simd_vec=False, collapsed=False, is_reduction=False), LoopLevel(var=x1, size=8, offset=0, tiled_size=0, steps=1, parallel=0, simd_omp=False, simd_vec=False, collapsed=False, is_reduction=True)], kernel=<torch._inductor.codegen.cpp.CppKernelProxy object at 0x7fc724426680>)`

- `LoopNest(loops=[LoopLevel(var=x0, size=8, offset=0, tiled_size=0, steps=16, parallel=0, simd_omp=False, simd_vec=True, collapsed=False, is_reduction=False), LoopLevel(var=x1, size=8, offset=0, tiled_size=0, steps=16, parallel=0, simd_omp=False, simd_vec=True, collapsed=False, is_reduction=True)], kernel=<torch._inductor.codegen.cpp.CppKernelProxy object at 0x7fc75c2cae60>)`

Although, these 2 `LoopNest` have same `range` and `var`, but different `steps` 1 and 16. So, they will fail to be merged with outer loops. And since when we localize the buffer, we have removed the global buffers. We need to restore the status of `V.graph.removed_buffers` before fallback to codegen without outer loop fusion.

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_outer_loop_fusion_buffer_remove
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144243
Approved by: https://github.com/jgong5
2025-01-07 01:17:46 +00:00
blzheng
c09bf71bd6 [Inductor][CPU] Fix C++ compile error of torch.max on bool type (#143848)
Fix https://github.com/pytorch/pytorch/issues/143568
Before:
![image](https://github.com/user-attachments/assets/3e1e869e-7ae7-45c0-a334-8a663028e003)
After:
![image](https://github.com/user-attachments/assets/91f72920-64bd-449a-a6c6-6048409c1450)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143848
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel
2025-01-03 09:00:43 +00:00
xinan.lin
01034e963c [AOTI] Not use AOTI_TORCH_CHECK in non AOTI mode. (#143970)
Fix #143967

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143970
Approved by: https://github.com/EikanWang, https://github.com/jansel
2024-12-31 06:28:32 +00:00
leslie-fang-intel
74028cfd0c [Inductor][CPP] Fix Data Type issue of frexp (#143746)
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/143729. `frexp` has 1 input but 2 output tensor with different data type, current `deduce_dtype_for_cpp_cse_variable` can't deduce the data type for each output correctly due to missing of output index. In this PR, we set the data type of cse var in the codegen of `frexp` and avoid it being overridden in the following flow.

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_frexp
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143746
Approved by: https://github.com/jgong5
2024-12-28 06:00:13 +00:00
leslie-fang-intel
607884c9af [Inductor][CPP] Fix bitwise shift with corner inputs (#143635)
**Summary**
Fix issue https://github.com/pytorch/pytorch/issues/143555 and https://github.com/pytorch/pytorch/issues/143566, we can align the implementation with Eager: 29b586bbad/aten/src/ATen/native/cpu/BinaryOpsKernel.cpp (L501) at these corner inputs.

**Test Plan**
```
python test/inductor/test_cpu_repro.py -k test_bitwise_shift_corner_inputs
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143635
Approved by: https://github.com/jgong5
2024-12-20 13:47:40 +00:00
leslie-fang-intel
00b0210139 [Inductor] Use sleef implementation for CPP backend asinh codegen (#142360)
**Summary**
Fix https://github.com/pytorch/pytorch/issues/142345. Previously, we use `asinh(x) = log(x + sqrt(1 + x**2))` to calculate the result of `asinh`, the issue happens when input with `-10000.1`, which makes `x + sqrt(1 + x**2)` close to 0 and log(0) is invalid. We use the `sleef` implementation in this PR to fix this issue.

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_asinh_with_corner_inputs
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142360
Approved by: https://github.com/jgong5
2024-12-14 00:27:55 +00:00
Tom Ritchford
da67a6a7bb [inductor] Replace set by OrderedSet (#138466)
Uses the set_linter from https://github.com/pytorch/pytorch/pull/138454
and considerable manual editing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138466
Approved by: https://github.com/eellison
2024-12-13 16:08:45 +00:00
eellison
b731ced91f Prologue Fusion (#134532)
This PR extends our ability to fuse pointwise nodes onto triton templates with the ability to fuse pointwise nodes into triton templates - prologue fusion.

Similar to the store_output api:
`{{store_output(("idx_m", "idx_n"), "acc", "mask")}}`

And the modification api:

```
{{ modification(
    subgraph_number=0,
    output_name="post_mod_scores",
    score="qk",
    out="qk"
) | indent_except_first(1) }}
```

We have:

```{{load_input("B", "b", ("idx_m", "idx_n"), mask=None if EVEN_K else "b_mask", indent_width=8)}}```

Because we are now loading the input with explicit indices and mask, I needed to rewrite the mm kernel to no longer update the [pointers by BLOCK_K](bb03ef7aca/torch/_inductor/kernel/mm.py (L110-L111)) on every iteration and instead on each iteration compute indices from the the k_idx of each loop. This did not have any perf difference.

There are a couple main use cases for prologue fusion:

- Fusing dequants into a matmul. particularly for more bandwidth bound scenarios.
- Fusing gather into a matmul. This is useful particularly in MOE. See https://github.com/pytorch/pytorch/issues/134535 for more details.

Prologue fusion is generally much less profitable than epilogue fusion, because it must be applied to an element of an input on each loop of the matmul, compared to only once in the epilogue (gather into matmul is a potential exception). Accordingly, we are much less aggressive in attempting to fuse prologue fusion. We only attempt fusion if it does not increase the number of memory bytes read instead the triton template, multipled by a small factor to allow gathers. This restricts reliably unprofitable fusions like fp32->fp16 inside kernel. In future pr we could potentially have api of being more aggressive if we know we are in a bandwidth bound regime. See: https://github.com/pytorch/pytorch/pull/134532/files#diff-d2539c9c8dc6a3d7e457767a880612e96d3c85752a77ead49a9e4e00a3e4c3c7R3060-R3066

Other notes:

By default we will upcast to fp32 inside every kernel. This matches eager numerics. This is fine enough for epilogue because it is only done once (although it is probably unnecessary for say a relu) but tanks perf for prologue. I am currently using the `codegen_upcast_to_fp32` option to avoid it, but that will not work for libdevice calls that require fp32. We will need https://github.com/pytorch/pytorch/pull/136778/ and dtype-aware codegen to upcast fp16 ops into libdevice calls.

With prologue fusion, we now have essentially separate kernels for each input, and for the output. I had to increase the number of fields that are swapped out in `set_subgraph_body` by a large number :/ I also update the fusion logic because the inputs will have a different group than the outputs. Maybe as part of enabling multiple outputs, this could get cleaned up a bit so..

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134532
Approved by: https://github.com/jansel
2024-12-13 04:18:25 +00:00
Tom Ritchford
dc23f1944a Remove unused Python variables in torch/[_-a]* (#133492)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133492
Approved by: https://github.com/albanD
2024-12-12 17:39:14 +00:00
PyTorch MergeBot
cd1b5924d5 Revert "[Inductor] Use sleef implementation for CPP backend asinh codegen (#142360)"
This reverts commit 79cf8fa751.

Reverted https://github.com/pytorch/pytorch/pull/142360 on behalf of https://github.com/jeanschmidt due to seems to have broken macos tests ([comment](https://github.com/pytorch/pytorch/pull/142360#issuecomment-2539143039))
2024-12-12 14:42:55 +00:00
leslie-fang-intel
79cf8fa751 [Inductor] Use sleef implementation for CPP backend asinh codegen (#142360)
**Summary**
Fix https://github.com/pytorch/pytorch/issues/142345. Previously, we use `asinh(x) = log(x + sqrt(1 + x**2))` to calculate the result of `asinh`, the issue happens when input with `-10000.1`, which makes `x + sqrt(1 + x**2)` close to 0 and log(0) is invalid. We use the `sleef` implementation in this PR to fix this issue.

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_asinh_with_corner_inputs
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142360
Approved by: https://github.com/jgong5
2024-12-12 05:40:48 +00:00
leslie-fang-intel
06075d3d18 [Inductor][CPP] Fix Mask Dtype mismatch (#142103)
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/141559. The `vec_mask` store data type doesn't aligned when doing `bitwise_and`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142103
Approved by: https://github.com/jgong5
2024-12-12 01:21:32 +00:00
PyTorch MergeBot
233853a66f Revert "Prologue Fusion (#134532)"
This reverts commit 59ab3825e7.

Reverted https://github.com/pytorch/pytorch/pull/134532 on behalf of https://github.com/clee2000 due to A couple of PRs in this stack are breaking internally on different tests ([comment](https://github.com/pytorch/pytorch/pull/134532#issuecomment-2536643675))
2024-12-11 17:32:26 +00:00
PyTorch MergeBot
5c97ac9721 Revert "Remove unused Python variables in torch/[_-a]* (#133492)"
This reverts commit fda975a7b3.

Reverted https://github.com/pytorch/pytorch/pull/133492 on behalf of https://github.com/clee2000 due to Sorry, I need to revert this in order to revert something else.  The only thing you need to do is rebase and remerge ([comment](https://github.com/pytorch/pytorch/pull/133492#issuecomment-2536635516))
2024-12-11 17:29:12 +00:00
Tom Ritchford
fda975a7b3 Remove unused Python variables in torch/[_-a]* (#133492)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133492
Approved by: https://github.com/albanD
2024-12-10 21:48:44 +00:00
eellison
59ab3825e7 Prologue Fusion (#134532)
This PR extends our ability to fuse pointwise nodes onto triton templates with the ability to fuse pointwise nodes into triton templates - prologue fusion.

Similar to the store_output api:
`{{store_output(("idx_m", "idx_n"), "acc", "mask")}}`

And the modification api:

```
{{ modification(
    subgraph_number=0,
    output_name="post_mod_scores",
    score="qk",
    out="qk"
) | indent_except_first(1) }}
```

We have:

```{{load_input("B", "b", ("idx_m", "idx_n"), mask=None if EVEN_K else "b_mask", indent_width=8)}}```

Because we are now loading the input with explicit indices and mask, I needed to rewrite the mm kernel to no longer update the [pointers by BLOCK_K](bb03ef7aca/torch/_inductor/kernel/mm.py (L110-L111)) on every iteration and instead on each iteration compute indices from the the k_idx of each loop. This did not have any perf difference.

There are a couple main use cases for prologue fusion:

- Fusing dequants into a matmul. particularly for more bandwidth bound scenarios.
- Fusing gather into a matmul. This is useful particularly in MOE. See https://github.com/pytorch/pytorch/issues/134535 for more details.

Prologue fusion is generally much less profitable than epilogue fusion, because it must be applied to an element of an input on each loop of the matmul, compared to only once in the epilogue (gather into matmul is a potential exception). Accordingly, we are much less aggressive in attempting to fuse prologue fusion. We only attempt fusion if it does not increase the number of memory bytes read instead the triton template, multipled by a small factor to allow gathers. This restricts reliably unprofitable fusions like fp32->fp16 inside kernel. In future pr we could potentially have api of being more aggressive if we know we are in a bandwidth bound regime. See: https://github.com/pytorch/pytorch/pull/134532/files#diff-d2539c9c8dc6a3d7e457767a880612e96d3c85752a77ead49a9e4e00a3e4c3c7R3060-R3066

Other notes:

By default we will upcast to fp32 inside every kernel. This matches eager numerics. This is fine enough for epilogue because it is only done once (although it is probably unnecessary for say a relu) but tanks perf for prologue. I am currently using the `codegen_upcast_to_fp32` option to avoid it, but that will not work for libdevice calls that require fp32. We will need https://github.com/pytorch/pytorch/pull/136778/ and dtype-aware codegen to upcast fp16 ops into libdevice calls.

With prologue fusion, we now have essentially separate kernels for each input, and for the output. I had to increase the number of fields that are swapped out in `set_subgraph_body` by a large number :/ I also update the fusion logic because the inputs will have a different group than the outputs. Maybe as part of enabling multiple outputs, this could get cleaned up a bit so..

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134532
Approved by: https://github.com/jansel
2024-12-10 16:25:57 +00:00
Jason Ansel
0367a31401 [inductor] Minor typing changes (#142219)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142219
Approved by: https://github.com/Skylion007, https://github.com/yanboliang
2024-12-07 17:48:37 +00:00
Mwiza Kunda
f8a64c324e Broadcast constants on vectorised stores in CppTile2DKernel (#140262)
Currently constants are not broadcasted on vectorised stores in `CppTile2DKernel`. This leads to errors like the following:
```shell
error:: request for member 'store' in 'tmp1', which is of non-class type 'signed char'
   61 |                                 tmp1.store(tmp2 + static_cast<int64_t>(8L*x0_inner), static_cast<int64_t>(8));
      |                                           ^~~~~
```
This PR adds the required broadcasting.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140262
Approved by: https://github.com/jgong5
2024-12-03 09:15:17 +00:00
Aaron Gokaslan
08db735629 [BE]: Update mypy to 1.13.0 (#140808)
Update mypy to 1.13.0 . Should hopefully reduce linting time. Has support for orjson cache serialization which should improve mypy cache perf if orjson is installed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140808
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-12-03 02:50:10 +00:00
PyTorch MergeBot
daa77f3d9f Revert "[BE]: Update mypy to 1.13.0 (#140808)"
This reverts commit 00134d68af.

Reverted https://github.com/pytorch/pytorch/pull/140808 on behalf of https://github.com/huydhn due to This is failing a distributed test in trunk, target determination missed this test and did not run it on PR ([comment](https://github.com/pytorch/pytorch/pull/140808#issuecomment-2512788426))
2024-12-02 20:47:43 +00:00
Aaron Gokaslan
00134d68af [BE]: Update mypy to 1.13.0 (#140808)
Update mypy to 1.13.0 . Should hopefully reduce linting time. Has support for orjson cache serialization which should improve mypy cache perf if orjson is installed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140808
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-12-02 18:47:54 +00:00
Sun, Jiayi
a964f31d7b [inductor] modify the heuristic for loop split optimization (#137550)
### Summary

1. Improve the heuristic for loop split optimization: The divisor needs to be an integer and cannot be too small (needs to be greater than 8, this threshold has been tuned).
2. Improve the heuristic for disabling vectorization: add quantity_threshold and relax ratio_threshold for the number of non-contiguous load/store/index_expr in the loop body.

This PR will bring performance improvements for two torchbench models(functorch_dp_cifar10, opacus_cifar10) and one timm model(sebotnet33ts_256).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137550
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jansel
2024-11-25 09:16:30 +00:00
haozhe.zhu
d0fd42eb3a [inductor] refine loop split logic (#128812)
This PR aims to improves parallelization by collapsing vectorized loop. https://github.com/pytorch/pytorch/issues/122281

For such case, the parallel level is only `2`.
And the vectorized loop cannot be collapsed.
```
#pragma omp for
for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L))
{
    for(long x1=static_cast<long>(0L); x1<static_cast<long>(199984L); x1+=static_cast<long>(16L))
    {
        auto tmp0 = at::vec::VectorizedN<int64_t,2>::loadu(in_ptr0 + static_cast<long>(x1 + (199985L*x0)), 16);
        tmp0.store(out_ptr0 + static_cast<long>(x1 + (209985L*x0)), 16);
    }
    #pragma omp simd simdlen(8)
    for(long x1=static_cast<long>(199984L); x1<static_cast<long>(199985L); x1+=static_cast<long>(1L))
    {
        auto tmp0 = in_ptr0[static_cast<long>(x1 + (199985L*x0))];
        out_ptr0[static_cast<long>(x1 + (209985L*x0))] = tmp0;
    }
}
```
After this PR, we will gen code
```
#pragma omp for collapse(2)
for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L))
{
    for(long x1=static_cast<long>(0L); x1<static_cast<long>(199985L); x1+=static_cast<long>(16L))
    {
        if (x1 >= 0 && x1 <199984) {
            auto tmp0 = at::vec::VectorizedN<int64_t,2>::loadu(in_ptr0 + static_cast<long>(x1 + (199985L*x0)), 16);
            tmp0.store(out_ptr0 + static_cast<long>(x1 + (209985L*x0)), 16);
        }
        if (x1 >= 199984 && x1 <199985) {
            auto tmp0 = in_ptr0[static_cast<long>(x1 + (199985L*x0))];
            out_ptr0[static_cast<long>(x1 + (209985L*x0))] = tmp0;
        }
    }
}
```

### Highlight
For reduction case, we have some side-effect here.
For below case, we vectorized `x1` dim and reduction at `x2` dim.
```
#pragma omp for
for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(39L); x0+=static_cast<int64_t>(1L))
{
    for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(16L); x1+=static_cast<int64_t>(8L))
    {
        {
            float tmp_acc0 = -std::numeric_limits<float>::infinity();
            at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity());
            for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(18L); x2+=static_cast<int64_t>(1L))
            {
                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<int64_t>(x1 + (17L*x2) + (306L*x0)), 8);
                tmp_acc0_vec = at::vec::maximum(tmp_acc0_vec, tmp0);
            }
            [&]
            {
                __at_align__ std::array<float, 8> tmpbuf;
                tmp_acc0_vec.store(tmpbuf.data(), 8);
                #pragma GCC unroll 8
                for (long x1_inner = 0; x1_inner < 8; x1_inner++)
                {
                    out_ptr1[static_cast<int64_t>(x0 + (39L*x1) + (39L*x1_inner))] = tmpbuf[x1_inner];
                }
            }
            ()
            ;
        }
    }
    #pragma omp simd simdlen(4)
    for(int64_t x1=static_cast<int64_t>(16L); x1<static_cast<int64_t>(17L); x1+=static_cast<int64_t>(1L))
    {
        {
            float tmp_acc0 = -std::numeric_limits<float>::infinity();
            for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(18L); x2+=static_cast<int64_t>(1L))
            {
                auto tmp0 = in_ptr1[static_cast<int64_t>(x1 + (17L*x2) + (306L*x0))];
                tmp_acc0 = max_propagate_nan(tmp_acc0, tmp0);
            }
            out_ptr1[static_cast<int64_t>(x0 + (39L*x1))] = tmp_acc0;
        }
    }
}

```
After collapse, the loop order will be `x1 -> x2 -> x1_tail_part`, thus we will need a `tmp_acc_arr` to store the reduction result for `x1_tail_part`. And for `reduction_stores`, we also need to check `x1`'s value like what we do in the `loopbody` since the `reduction_stores` happened between `x1` and `x2` loops.
```
#pragma omp for collapse(2)
for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(39L); x0+=static_cast<int64_t>(1L))
{
    for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(17L); x1+=static_cast<int64_t>(8L))
    {
        {
            float tmp_acc0_arr[8];           ######### need an array to hold acc result for tail part
            for (int i = 0; i < 8; i++)
            {
                tmp_acc0_arr[i] = -std::numeric_limits<float>::infinity();
            }
            float tmp_acc0 = -std::numeric_limits<float>::infinity();
            at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity());
            for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(18L); x2+=static_cast<int64_t>(1L))
            {
                {
                    if(C10_LIKELY(x1 >= static_cast<int64_t>(0) && x1 < static_cast<int64_t>(16L)))
                    {
                        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<int64_t>(x1 + (17L*x2) + (306L*x0)), 8);
                        tmp_acc0_vec = at::vec::maximum(tmp_acc0_vec, tmp0);
                    }
                    if(C10_UNLIKELY(x1 >= static_cast<int64_t>(16L) && x1 < static_cast<int64_t>(17L)))
                    {
                        for (long x1_tail = static_cast<int64_t>(16L); x1_tail < static_cast<int64_t>(17L); x1_tail++)
                        {
                            auto tmp0 = in_ptr1[static_cast<int64_t>(x1_tail + (17L*x2) + (306L*x0))];
                            tmp_acc0_arr[x1_tail - static_cast<int64_t>(16L)] = max_propagate_nan(tmp_acc0_arr[x1_tail - static_cast<int64_t>(16L)], tmp0);
                        }
                    }
                }
            }

            ############### reduction stores
            if(C10_LIKELY(x1 >= static_cast<int64_t>(0) && x1 < static_cast<int64_t>(16L)))
            {
                [&]
                {
                    __at_align__ std::array<float, 8> tmpbuf;
                    tmp_acc0_vec.store(tmpbuf.data(), 8);
                    #pragma GCC unroll 8
                    for (long x1_inner = 0; x1_inner < 8; x1_inner++)
                    {
                        out_ptr1[static_cast<int64_t>(x0 + (39L*x1) + (39L*x1_inner))] = tmpbuf[x1_inner];
                    }
                }
                ()
                ;
            }
            if(C10_UNLIKELY(x1 >= static_cast<int64_t>(16L) && x1 < static_cast<int64_t>(17L)))
            {
                for (long x1_tail = static_cast<int64_t>(16L); x1_tail < static_cast<int64_t>(17L); x1_tail++)
                {
                    out_ptr1[static_cast<int64_t>(x0 + (39L*x1_tail))] = tmp_acc0_arr[x1_tail - static_cast<int64_t>(16L)];
                }
            }
        }
    }
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128812
Approved by: https://github.com/jgong5
2024-11-25 04:46:07 +00:00
Aaron Gokaslan
12e95aa4ee [BE]: Apply PERF401 autofixes from ruff (#140980)
* Automatically applies ruff rule 401. Turns loops into equivalent list comprehensions which are faster and do not leak the scope of the loop variables.
* list comprehensions not only often have better typing, but are 50+% faster than for loops on overhead. They also preserve length information etc and are better for the interpreter to optimize.
* Manually went back and made mypy happy after the change.
* Also fixed style lints in files covered by flake8 but not by pyfmt

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140980
Approved by: https://github.com/justinchuby, https://github.com/malfet
2024-11-20 17:52:07 +00:00
eellison
eff22171d2 Add Current Mask Var To CSE Cache Key (#140838)
This torch.cat kernel has multiple subblocks which load from the same input. We were incorrectly reusing the mask vars from the first load for the second load.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140838
Approved by: https://github.com/jansel
ghstack dependencies: #140841
2024-11-20 00:55:56 +00:00
Valentine233
6fcef86cfa [inductor] fix the unligned variable ranges issue in fuse node (#138568)
Fixes #138550.

### Description
In the fusion of two nodes, one node with less variables (`node_to_recomp`) would make its variable ranges aligned with the other node (`ref_node`). In detail, `node_to_recomp` would change its variable ranges to the original ranges of `ref_node`. However, if both of the nodes have changed its ranges, i.e., the simplified variable ranges are different from its original ones, the issue comes up.

### Solution
For the case where the `ref_node` also changes its variable ranges, we recompute the size and body for it, to ensure the nodes are simplified to the same size.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138568
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2024-11-07 01:17:58 +00:00
Aaron Orenstein
06f619d999 typing ir.py - part 2 (#131846)
See #131852

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131846
Approved by: https://github.com/eellison
ghstack dependencies: #139238
2024-11-06 00:01:15 +00:00
Sun, Jiayi
3337439dc0 [inductor] modify the heuristic for disabling vectorization (#136422)
Summary
Since we have already implemented tail loop mask vectorization (https://github.com/pytorch/pytorch/pull/126526), I re-tuned the heuristics for disabling vectorization from performance perspective. I changed the heuristic to: when the total number of elements along the vec dim is less than `tiling_factor/4` and the number of operations is less than 10, we disable the vectorization.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136422
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jansel
2024-11-04 07:33:32 +00:00
Jason Ansel
ed30fa74ab [inductor] sympy.Integer([01]) -> sympy.S.(Zero|One) (#139523)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139523
Approved by: https://github.com/ezyang
ghstack dependencies: #139364, #139365, #139370, #139452
2024-11-04 04:28:40 +00:00
PyTorch MergeBot
98e11b0021 Revert "[inductor] sympy.Integer([01]) -> sympy.S.(Zero|One) (#139523)"
This reverts commit c53beab377.

Reverted https://github.com/pytorch/pytorch/pull/139523 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing lots of internal tests in D65345157 ([comment](https://github.com/pytorch/pytorch/pull/139364#issuecomment-2452897337))
2024-11-02 06:49:10 +00:00
Jason Ansel
c53beab377 [inductor] sympy.Integer([01]) -> sympy.S.(Zero|One) (#139523)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139523
Approved by: https://github.com/ezyang
ghstack dependencies: #139364, #139365, #139370, #139452
2024-11-02 03:04:22 +00:00
Scott Wolchok
a3de067975 [PyTorch] Use 128-bit vectors for ARM64 (#137426)
The correct vector length for ARM64 is 128 bits (16
bytes). We were previously using double this, apparently just because
that would be the same length as AVX2.

Differential Revision: [D63984039](https://our.internmc.facebook.com/intern/diff/D63984039/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137426
Approved by: https://github.com/jgong5, https://github.com/malfet
ghstack dependencies: #138486, #138542, #138655, #138716, #138744
2024-10-26 00:20:35 +00:00
Scott Wolchok
6aa673377b [PyTorch] Fix inductor CPU masked() body codegen when result dtype is bool and operator is where (#138486)
In this case, it looks like we expect the body to be a VecMask (unify_mask_base_type is called by where()), but we didn't make it a VecMask. Now we do.

Differential Revision: [D64702918](https://our.internmc.facebook.com/intern/diff/D64702918/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138486
Approved by: https://github.com/leslie-fang-intel, https://github.com/malfet
2024-10-24 19:36:41 +00:00
chilli
0e4d42634e Port Inductor dataclasses to be kw_only (#137768)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137768
Approved by: https://github.com/ezyang
2024-10-14 10:33:43 +00:00
PyTorch MergeBot
41977a0531 Revert "Port Inductor dataclasses to be kw_only (#137768)"
This reverts commit 65d665bae5.

Reverted https://github.com/pytorch/pytorch/pull/137768 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it seem to fail test_loop_ordering in trunk ([comment](https://github.com/pytorch/pytorch/pull/137768#issuecomment-2409203115))
2024-10-13 22:25:19 +00:00
chilli
65d665bae5 Port Inductor dataclasses to be kw_only (#137768)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137768
Approved by: https://github.com/ezyang
2024-10-13 14:55:45 +00:00
Bin Bao
c04b35a5ae [AOTI] Add standalone version of TORCH_CHECK (#136873)
Summary: In the standalone mode, TORCH_CHECK throws std::runtime_error, instead of c10::Error. The goal is to cut dependency on libtorch. Specifically, AOTI generates CPU code which may call ATen vectorization ops and we need to make sure those ops are self-contained.

Differential Revision: [D63911928](https://our.internmc.facebook.com/intern/diff/D63911928)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136873
Approved by: https://github.com/albanD, https://github.com/chenyang78
2024-10-08 15:30:01 +00:00
Jez Ng
71aac59e93 Add Triton CPU as an Inductor backend (#133408)
The goal is to use Inductor-generated kernels to stress test the new Triton CPU backend.

Differential Revision: [D63298968](https://our.internmc.facebook.com/intern/diff/D63298968)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133408
Approved by: https://github.com/jansel, https://github.com/blaine-rister, https://github.com/malfet
2024-09-30 20:24:52 +00:00
PyTorch MergeBot
36428f91e9 Revert "Add Triton CPU as an Inductor backend (#133408)"
This reverts commit 31c0467594.

Reverted https://github.com/pytorch/pytorch/pull/133408 on behalf of https://github.com/int3 due to internal tests failing ([comment](https://github.com/pytorch/pytorch/pull/133408#issuecomment-2379692517))
2024-09-27 16:54:27 +00:00
Jez Ng
31c0467594 Add Triton CPU as an Inductor backend (#133408)
The goal is to use Inductor-generated kernels to stress test the new Triton CPU backend.

Differential Revision: [D63298968](https://our.internmc.facebook.com/intern/diff/D63298968)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133408
Approved by: https://github.com/jansel, https://github.com/blaine-rister, https://github.com/malfet
2024-09-26 15:35:26 +00:00
Wu, Chunyuan
c3fdf587b5 [inductor] [cpp] fix the check of template_buffer_has_other_users if no epilogue_nodes (#136518)
The `template_buffer_has_other_users` function checks the case where there're epilogue nodes and the template output has users other than these epilogue nodes.  When there's no epilogue nodes, the function could return `False` directly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136518
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
ghstack dependencies: #136418
2024-09-25 10:25:07 +00:00
Wu, Chunyuan
44c871c34b [inductor] [cpp] add index check when fusing epilogue with GEMM template (#135661)
## Description
Fixes the accuracy failure of FP32 `jx_nest_base` of max-autotune.

The current epilogue fusion implementation in GEMM template assumes that the read of template buffer and the write of epilogue output in the epilogue node have the same index (the layout could be different but the index should be the same).

If the condition is not satisfied, the computation is wrong, leading to correctness issue for FP32 `jx_nest_base`.

This PR disabled the epilogue fusion with GEMM template when the above condition is not satisfied.

### Unsupported epilogue:
`buf1` is the template buffer and `buf2` is the epilogue output buffer.
The store of `buf2`:
401408 * d0 + 100352 * d1 + **7168 * d2** + **1792 * d3** + 128 * d4 + d5

The load of `buf1` in the epilogue node:
401408 * d0 + 100352 * d1 + **1792 * d2** + **25088 * d3** + 128 * d4 + d5

The above two indexes are different.

```
CppTemplateBuffer(name='buf1', layout=FixedLayout('cpu', torch.float32, size=[25088, 128], stride=[128, 1]))
ComputedBuffer(name='buf2', layout=FixedLayout('cpu', torch.float32, size=[8, 4, 14, 4, 14, 128], stride=[401408, 100352, 7168, 1792, 128, 1]), data=Pointwise(
  'cpu',
  torch.float32,
  def inner_fn(index):
      i0, i1, i2, i3, i4, i5 = index
      tmp0 = ops.load(arg5_1, i5 + 128 * i4 + 1792 * i2 + 25088 * i3 + 100352 * i1 + 401408 * i0)
      tmp1 = ops.load(buf0, i5 + 128 * i4 + 1792 * i2 + 25088 * i3 + 100352 * i1 + 401408 * i0)
      tmp2 = tmp0 + tmp1
      tmp3 = ops.load(buf1, i5 + 128 * i4 + 1792 * i2 + 25088 * i3 + 100352 * i1 + 401408 * i0)
      tmp4 = tmp2 + tmp3
      return tmp4
  ,
  ranges=[8, 4, 14, 4, 14, 128],
  origin_node=clone,
  origins=OrderedSet([clone])
))
```

### Supported epilogue:
`buf1` is the template buffer and `buf2` is the epilogue output buffer.
The store of `buf2`:
d0 + 576 * d1 + 32 * d2

The load of `buf1` in the epilogue node:
d0 + 576 * d1 + 32 * d2

The above two indexes are the same.

The layout of `buf2` and `buf1` are different though which is handled by the reindexer:
`buf1`: `size=[324, 32], stride=[32, 1]`
`buf2`: `size=[1, 32, 18, 18], stride=[10368, 1, 576, 32]`

```
CppTemplateBuffer(name='buf1', layout=FixedLayout('cpu', torch.bfloat16, size=[324, 32], stride=[32, 1]))
ComputedBuffer(name='buf2', layout=FixedLayout('cpu', torch.bfloat16, size=[1, 32, 18, 18], stride=[10368, 1, 576, 32]), data=Pointwise(
  'cpu',
  torch.bfloat16,
  def inner_fn(index):
      _, i1, i2, i3 = index
      tmp0 = ops.load(buf1, i1 + 32 * i3 + 576 * i2)
      tmp1 = ops.to_dtype(tmp0, torch.float32, src_dtype=torch.bfloat16)
      tmp2 = ops.load(_frozen_param4, i1)
      tmp3 = tmp1 * tmp2
      tmp4 = ops.load(arg7_1, i1 + 32 * i3 + 576 * i2)
      tmp5 = tmp3 + tmp4
      tmp6 = ops.to_dtype(tmp5, torch.bfloat16, src_dtype=torch.float32)
      return tmp6
  ,
  ranges=[1, 32, 18, 18],
  origin_node=convert_element_type_4,
  origins=OrderedSet([add, mul, convert_element_type_4])
))
```

## TODO
Add the support for fusions when the indexes are different in a follow-up PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135661
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
2024-09-24 05:25:28 +00:00
Aaron Orenstein
06909803cc Existing mypy issues (#136236)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136236
Approved by: https://github.com/bobrenjc93, https://github.com/Skylion007
2024-09-24 01:02:07 +00:00
Sun, Jiayi
687e5cf8c5 [inductor] Relax the conditions for loop split (#135335)
Summary
This PR Relaxes the conditions for loop split to support dynamic shape cases.
Now the conditions that need to be met to apply loop split optimization are as follows:

1. No reduction and no mudular index for all nodes.
2. The indexing_exprs of all nodes contain only one (or more, but all the same) division, where the divisor is an integer, the dividend is one of the iter_vars, and this var, i.e. the dimension that needs to be split, is contiguous in all other indexing_exprs.

Example:
```
import torch
import torch.nn as nn

class GN(torch.nn.Module):
    def __init__(self, num_groups, num_channels):
        super(GN, self).__init__()
        self.gn = nn.GroupNorm(num_groups, num_channels)

    def forward(self, x):
        return self.gn(x)

input = torch.randn(2, 960, 96, 96).to(memory_format=torch.channels_last)
m = GN(32, 960).eval()
compiled_m = torch.compile(m, dynamic=True)

with torch.no_grad():
    compiled_m(input)
```

Before loop split, the node's var_ranges: `{z0: s0, z1: s2, z2: s2, z3: 960}` and indexing_exprs: `{'index0': 960*s2**2*z0 + 960*s2*z1 + 960*z2 + z3, 'index1': 32*z0 + (z3//30), 'index2': 30*s2**2, 'index3': z3, 'index4': 960*s2*z0*((s2**2//s2)) + 960*z1*((s2**2//s2)) + 960*z2 + z3}`. After loop split `z3` will split to `30*z3 + z4`, then the node's var_ranges will be changed to `{z0: s0, z1: s2, z2: s2, z3: 32, z4: 30}` and indexing_exprs will be changed to `{'index0': 960*s2**2*z0 + 960*s2*z1 + 960*z2 + 30*z3 + z4, 'index1': 32*z0 + z3, 'index2': 30*s2**2, 'index3': 30*z3 + z4, 'index4': 960*s2*z0*((s2**2//s2)) + 960*z1*((s2**2//s2)) + 960*z2 + 30*z3 + z4}`

Generated code:

- Before:
```
cpp_fused_native_group_norm_0 = async_compile.cpp_pybinding(['const float*', 'const float*', 'const float*', 'float*', 'float*', 'float*', 'const int64_t', 'const int64_t'], '''
#include "/tmp/torchinductor_jiayisun/32/c32dcqa3qidvmunis4lucp3dhoicleq5qjfjfgvpiadbbzfp6ofy.h"
extern "C"  void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0,
                       float* out_ptr1,
                       float* out_ptr2,
                       const int64_t ks0,
                       const int64_t ks1)
{
    #pragma omp parallel num_threads(112)
    {
        int tid = omp_get_thread_num();
        {
            #pragma omp for collapse(2)
            for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(ks0); x0+=static_cast<int64_t>(1L))
            {
                for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(32L); x1+=static_cast<int64_t>(1L))
                {
                    {
                        Welford<float> tmp_acc0 = Welford<float>();
                        Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        static WeightRecp<at::vec::Vectorized<float>> wrecps0(static_cast<int64_t>(c10::div_floor_integer(static_cast<int64_t>((15L*(static_cast<int64_t>(ks1*ks1)))), static_cast<int64_t>(8L))));
                        for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(static_cast<int64_t>(ks1*ks1)); x2+=static_cast<int64_t>(1L))
                        {
                            for(int64_t x3=static_cast<int64_t>(0L); x3<static_cast<int64_t>(16L); x3+=static_cast<int64_t>(16L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x3 + (30L*x1) + (960L*x2) + (960L*x0*(static_cast<int64_t>(ks1*ks1)))), static_cast<int64_t>(16));
                                tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &wrecps0);
                            }
                            for(int64_t x3=static_cast<int64_t>(16L); x3<static_cast<int64_t>(30L); x3+=static_cast<int64_t>(14L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x3 + (30L*x1) + (960L*x2) + (960L*x0*(static_cast<int64_t>(ks1*ks1)))), static_cast<int64_t>(14L));
                                masked_tmp_acc0_vec = welford_combine(masked_tmp_acc0_vec, tmp0, static_cast<int64_t>(14L), &wrecps0);
                            }
                        }
                        tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(masked_tmp_acc0_vec));
                        tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec));
                        out_ptr0[static_cast<int64_t>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.mean);
                        out_ptr1[static_cast<int64_t>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.m2);
                    }
                }
            }
        }
        {
            #pragma omp for collapse(2)
            for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(ks0); x0+=static_cast<int64_t>(1L))
            {
                for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(ks1); x1+=static_cast<int64_t>(1L))
                {
                    #pragma GCC ivdep
                    for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(ks1); x2+=static_cast<int64_t>(1L))
                    {
                        #pragma GCC ivdep
                        for(int64_t x3=static_cast<int64_t>(0L); x3<static_cast<int64_t>(960L); x3+=static_cast<int64_t>(1L))
                        {
                            auto tmp0 = in_ptr0[static_cast<int64_t>(x3 + (960L*x2) + (960L*ks1*x1) + (960L*x0*(static_cast<int64_t>(ks1*ks1))))];
                            auto tmp1 = out_ptr0[static_cast<int64_t>((32L*x0) + (c10::div_floor_integer(static_cast<int64_t>(x3), static_cast<int64_t>(30L))))];
                            auto tmp3 = out_ptr1[static_cast<int64_t>((32L*x0) + (c10::div_floor_integer(static_cast<int64_t>(x3), static_cast<int64_t>(30L))))];
                            auto tmp11 = in_ptr1[static_cast<int64_t>(x3)];
                            auto tmp13 = in_ptr2[static_cast<int64_t>(x3)];
                            auto tmp2 = decltype(tmp0)(tmp0 - tmp1);
                            auto tmp4 = 30L*(static_cast<int64_t>(ks1*ks1));
                            auto tmp5 = c10::convert<float>(tmp4);
                            auto tmp6 = tmp3 / tmp5;
                            auto tmp7 = static_cast<float>(1e-05);
                            auto tmp8 = decltype(tmp6)(tmp6 + tmp7);
                            auto tmp9 = 1 / std::sqrt(tmp8);
                            auto tmp10 = decltype(tmp2)(tmp2 * tmp9);
                            auto tmp12 = decltype(tmp10)(tmp10 * tmp11);
                            auto tmp14 = decltype(tmp12)(tmp12 + tmp13);
                            out_ptr2[static_cast<int64_t>(x3 + (960L*x2) + (960L*x1*(c10::div_floor_integer(static_cast<int64_t>((static_cast<int64_t>(ks1*ks1))), static_cast<int64_t>(ks1)))) + (960L*ks1*x0*(c10::div_floor_integer(static_cast<int64_t>((static_cast<int64_t>(ks1*ks1))), static_cast<int64_t>(ks1)))))] = tmp14;
                        }
                    }
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, arg1_1, arg2_1, arg3_1, arg4_1 = args
    args.clear()
    s0 = arg2_1
    s2 = arg3_1
    assert_size_stride(arg0_1, (960, ), (1, ))
    assert_size_stride(arg1_1, (960, ), (1, ))
    assert_size_stride(arg4_1, (s0, 960, s2, s2), (960*(s2*s2), 1, 960*s2, 960))
    buf0 = empty_strided_cpu((s0, 32, 1, 1), (32, 1, 32*s0, 32*s0), torch.float32)
    buf1 = empty_strided_cpu((s0, 32, 1, 1), (32, 1, 32*s0, 32*s0), torch.float32)
    buf3 = empty_strided_cpu((s0, 960, s2, s2), (960*s2*((s2*s2) // s2), 1, 960*((s2*s2) // s2), 960), torch.float32)
    cpp_fused_native_group_norm_0(arg4_1, arg0_1, arg1_1, buf0, buf1, buf3, s0, s2)
    del arg0_1
    del arg1_1
    del arg4_1
    return (buf3, )
```

After:
```
cpp_fused_native_group_norm_0 = async_compile.cpp_pybinding(['const float*', 'const float*', 'const float*', 'float*', 'float*', 'float*', 'const int64_t', 'const int64_t'], '''
#include "/tmp/torchinductor_jiayisun/32/c32dcqa3qidvmunis4lucp3dhoicleq5qjfjfgvpiadbbzfp6ofy.h"
extern "C"  void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0,
                       float* out_ptr1,
                       float* out_ptr2,
                       const int64_t ks0,
                       const int64_t ks1)
{
    #pragma omp parallel num_threads(112)
    {
        int tid = omp_get_thread_num();
        {
            #pragma omp for collapse(2)
            for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(ks0); x0+=static_cast<int64_t>(1L))
            {
                for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(32L); x1+=static_cast<int64_t>(1L))
                {
                    {
                        Welford<float> tmp_acc0 = Welford<float>();
                        Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        static WeightRecp<at::vec::Vectorized<float>> wrecps0(static_cast<int64_t>(c10::div_floor_integer(static_cast<int64_t>((15L*(static_cast<int64_t>(ks1*ks1)))), static_cast<int64_t>(8L))));
                        for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(static_cast<int64_t>(ks1*ks1)); x2+=static_cast<int64_t>(1L))
                        {
                            for(int64_t x3=static_cast<int64_t>(0L); x3<static_cast<int64_t>(16L); x3+=static_cast<int64_t>(16L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x3 + (30L*x1) + (960L*x2) + (960L*x0*(static_cast<int64_t>(ks1*ks1)))), static_cast<int64_t>(16));
                                tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &wrecps0);
                            }
                            for(int64_t x3=static_cast<int64_t>(16L); x3<static_cast<int64_t>(30L); x3+=static_cast<int64_t>(14L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x3 + (30L*x1) + (960L*x2) + (960L*x0*(static_cast<int64_t>(ks1*ks1)))), static_cast<int64_t>(14L));
                                masked_tmp_acc0_vec = welford_combine(masked_tmp_acc0_vec, tmp0, static_cast<int64_t>(14L), &wrecps0);
                            }
                        }
                        tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(masked_tmp_acc0_vec));
                        tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec));
                        out_ptr0[static_cast<int64_t>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.mean);
                        out_ptr1[static_cast<int64_t>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.m2);
                    }
                }
            }
        }
        {
            #pragma omp for collapse(2)
            for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(ks0); x0+=static_cast<int64_t>(1L))
            {
                for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(ks1); x1+=static_cast<int64_t>(1L))
                {
                    #pragma GCC ivdep
                    for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(ks1); x2+=static_cast<int64_t>(1L))
                    {
                        #pragma GCC ivdep
                        for(int64_t x3=static_cast<int64_t>(0L); x3<static_cast<int64_t>(32L); x3+=static_cast<int64_t>(1L))
                        {
                            for(int64_t x4=static_cast<int64_t>(0L); x4<static_cast<int64_t>(16L); x4+=static_cast<int64_t>(16L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x4 + (30L*x3) + (960L*x2) + (960L*ks1*x1) + (960L*x0*(static_cast<int64_t>(ks1*ks1)))), static_cast<int64_t>(16));
                                auto tmp1 = out_ptr0[static_cast<int64_t>(x3 + (32L*x0))];
                                auto tmp4 = out_ptr1[static_cast<int64_t>(x3 + (32L*x0))];
                                auto tmp13 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<int64_t>(x4 + (30L*x3)), static_cast<int64_t>(16));
                                auto tmp15 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<int64_t>(x4 + (30L*x3)), static_cast<int64_t>(16));
                                auto tmp2 = at::vec::Vectorized<float>(tmp1);
                                auto tmp3 = tmp0 - tmp2;
                                auto tmp5 = 30L*(static_cast<int64_t>(ks1*ks1));
                                auto tmp6 = c10::convert<float>(tmp5);
                                auto tmp7 = tmp4 / tmp6;
                                auto tmp8 = static_cast<float>(1e-05);
                                auto tmp9 = decltype(tmp7)(tmp7 + tmp8);
                                auto tmp10 = 1 / std::sqrt(tmp9);
                                auto tmp11 = at::vec::Vectorized<float>(tmp10);
                                auto tmp12 = tmp3 * tmp11;
                                auto tmp14 = tmp12 * tmp13;
                                auto tmp16 = tmp14 + tmp15;
                                tmp16.store(out_ptr2 + static_cast<int64_t>(x4 + (30L*x3) + (960L*x2) + (960L*x1*(c10::div_floor_integer(static_cast<int64_t>((static_cast<int64_t>(ks1*ks1))), static_cast<int64_t>(ks1)))) + (960L*ks1*x0*(c10::div_floor_integer(static_cast<int64_t>((static_cast<int64_t>(ks1*ks1))), static_cast<int64_t>(ks1))))));
                            }
                            for(int64_t x4=static_cast<int64_t>(16L); x4<static_cast<int64_t>(30L); x4+=static_cast<int64_t>(14L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x4 + (30L*x3) + (960L*x2) + (960L*ks1*x1) + (960L*x0*(static_cast<int64_t>(ks1*ks1)))), static_cast<int64_t>(14L));
                                auto tmp1 = out_ptr0[static_cast<int64_t>(x3 + (32L*x0))];
                                auto tmp4 = out_ptr1[static_cast<int64_t>(x3 + (32L*x0))];
                                auto tmp13 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<int64_t>(x4 + (30L*x3)), static_cast<int64_t>(14L));
                                auto tmp15 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<int64_t>(x4 + (30L*x3)), static_cast<int64_t>(14L));
                                auto tmp2 = at::vec::Vectorized<float>(tmp1);
                                auto tmp3 = tmp0 - tmp2;
                                auto tmp5 = 30L*(static_cast<int64_t>(ks1*ks1));
                                auto tmp6 = c10::convert<float>(tmp5);
                                auto tmp7 = tmp4 / tmp6;
                                auto tmp8 = static_cast<float>(1e-05);
                                auto tmp9 = decltype(tmp7)(tmp7 + tmp8);
                                auto tmp10 = 1 / std::sqrt(tmp9);
                                auto tmp11 = at::vec::Vectorized<float>(tmp10);
                                auto tmp12 = tmp3 * tmp11;
                                auto tmp14 = tmp12 * tmp13;
                                auto tmp16 = tmp14 + tmp15;
                                tmp16.store(out_ptr2 + static_cast<int64_t>(x4 + (30L*x3) + (960L*x2) + (960L*x1*(c10::div_floor_integer(static_cast<int64_t>((static_cast<int64_t>(ks1*ks1))), static_cast<int64_t>(ks1)))) + (960L*ks1*x0*(c10::div_floor_integer(static_cast<int64_t>((static_cast<int64_t>(ks1*ks1))), static_cast<int64_t>(ks1))))), static_cast<int64_t>(14L));
                            }
                        }
                    }
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, arg1_1, arg2_1, arg3_1, arg4_1 = args
    args.clear()
    s0 = arg2_1
    s2 = arg3_1
    assert_size_stride(arg0_1, (960, ), (1, ))
    assert_size_stride(arg1_1, (960, ), (1, ))
    assert_size_stride(arg4_1, (s0, 960, s2, s2), (960*(s2*s2), 1, 960*s2, 960))
    buf0 = empty_strided_cpu((s0, 32, 1, 1), (32, 1, 32*s0, 32*s0), torch.float32)
    buf1 = empty_strided_cpu((s0, 32, 1, 1), (32, 1, 32*s0, 32*s0), torch.float32)
    buf3 = empty_strided_cpu((s0, 960, s2, s2), (960*s2*((s2*s2) // s2), 1, 960*((s2*s2) // s2), 960), torch.float32)
    cpp_fused_native_group_norm_0(arg4_1, arg0_1, arg1_1, buf0, buf1, buf3, s0, s2)
    del arg0_1
    del arg1_1
    del arg4_1
    return (buf3, )
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135335
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jansel
2024-09-20 05:42:52 +00:00
PyTorch MergeBot
d0cebedb31 Revert "Add Triton CPU as an Inductor backend (#133408)"
This reverts commit e498b02b47.

Reverted https://github.com/pytorch/pytorch/pull/133408 on behalf of https://github.com/jeanschmidt due to Broke internal signals, see D62737208 for more details ([comment](https://github.com/pytorch/pytorch/pull/133408#issuecomment-2353623816))
2024-09-16 18:33:33 +00:00
Jez Ng
e498b02b47 Add Triton CPU as an Inductor backend (#133408)
The goal is to use Inductor-generated kernels to stress test the new Triton CPU backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133408
Approved by: https://github.com/jansel
2024-09-14 21:45:19 +00:00
Bin Bao
ea2ecab15b [AOTI][reland] Fix assert_function call in cpu autotune template (#135920)
Summary: Reland https://github.com/pytorch/pytorch/pull/135086. In the ABI-compatible mode, assert_function should be AOTI_TORCH_CHECK.

Test Plan: CI

Differential Revision: D62500592

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135920
Approved by: https://github.com/chenyang78
2024-09-13 12:21:57 +00:00
xinan.lin
13ee85ca5e [Inductor] Generalize cuda cpp wrapper as common triton based GPU cpp wrapper, will be reused by xpu in next PR. (#135312)
[Inductor] Generalize cuda cpp wrapper as common triton based GPU cpp wrapper, will be reused by xpu in next PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135312
Approved by: https://github.com/jansel, https://github.com/desertfire, https://github.com/eellison
2024-09-11 23:59:54 +00:00
PyTorch MergeBot
0a9d55d2ee Revert "[AOTI] Fix assert_function call in cpu autotune template (#135086)"
This reverts commit 16c3b8f87c.

Reverted https://github.com/pytorch/pytorch/pull/135086 on behalf of https://github.com/izaitsevfb due to breaks internal tests, see D62405818 ([comment](https://github.com/pytorch/pytorch/pull/135086#issuecomment-2341889428))
2024-09-10 19:51:16 +00:00
Bin Bao
16c3b8f87c [AOTI] Fix assert_function call in cpu autotune template (#135086)
Summary: In the ABI-compatible mode, assert_function should be AOTI_TORCH_CHECK.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135086
Approved by: https://github.com/chenyang78, https://github.com/angelayi
ghstack dependencies: #134857
2024-09-09 16:54:12 +00:00