Commit Graph

1609 Commits

Author SHA1 Message Date
eellison
fd35be2fd3 TritonTemplate dtype fixes (#141991)
- Set the dtype of "acc" appropriately so that epilogue fusion will have args with a dtype
- Update dtype propagation to use `type_to_dtype` instead of instantiating a tensor (see the sketch below)
- Throw if we have a string arg where we should have a proper CSEVariable, unless we're doing the Modification Subgraph handling, which is not yet implemented; everything else is appropriately typed (cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov @drisspg ).
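
As a rough illustration of the second bullet: mapping a Python type straight to a torch dtype avoids materializing a tensor just to read its dtype (a minimal sketch; `type_to_dtype` is named in the commit, but this implementation is an assumption):

```python
import torch

# assumed mapping; the real helper may cover more types
_TYPE_TO_DTYPE = {
    bool: torch.bool,
    int: torch.int64,
    float: torch.get_default_dtype(),
}

def type_to_dtype(py_type: type) -> torch.dtype:
    # cheap dict lookup, versus torch.tensor(py_type()).dtype, which
    # allocates a tensor only to read one attribute
    return _TYPE_TO_DTYPE[py_type]

assert type_to_dtype(int) == torch.tensor(0).dtype
```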

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141991
Approved by: https://github.com/drisspg
ghstack dependencies: #139945, #140057, #141495, #141882
2024-12-04 17:24:23 +00:00
PyTorch MergeBot
38d10a1b17 Revert "[Inductor] Represent tiling as a dict (#141751)"
This reverts commit 5deca07c0d.

Reverted https://github.com/pytorch/pytorch/pull/141751 on behalf of https://github.com/atalman due to Failing internal builds ([comment](https://github.com/pytorch/pytorch/pull/141751#issuecomment-2517815899))
2024-12-04 15:43:16 +00:00
Shangdi Yu
7dfb439a2a Only write predicate once when there are multiple torch.cond (#141528)
Fixes #140606

TEST PLAN:

```
python test/inductor/test_aot_inductor.py -k cond_share
python test/inductor/test_aot_inductor_arrayref.py -k cond_share
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141528
Approved by: https://github.com/desertfire
2024-12-04 01:56:10 +00:00
Bin Bao
a51a048027 [AOTI][refactor] Move stack allocation related configs (#139093)
Summary: Move allow_stack_allocation and use_minimal_arrayref_interface configs into the aot_inductor subclass.
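
Assuming the post-move names follow the summary, usage would look like this sketch (not verified against a specific release):

```python
from torch._inductor import config

# previously top-level options, now namespaced under aot_inductor
config.aot_inductor.allow_stack_allocation = True
config.aot_inductor.use_minimal_arrayref_interface = True
```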

Test Plan: CI

Differential Revision: D65064301

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139093
Approved by: https://github.com/chenyang78
2024-12-04 00:15:19 +00:00
Mwiza Kunda
f0b33658f8 Dont use constant mask if ynumel potentially overflows ygrids (#139751)
If (ynumel / YBLOCK) > get_max_ygrids(), the z dimension will be used when znumel is None. However, if (ynumel / YBLOCK) % get_max_ygrids() != 0, there will be program launches with inputs that require masking, so this needs to be considered when determining whether the y dimension has a constant mask.
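
A minimal sketch of the resulting condition (names modeled on the description above; the actual Inductor logic is more involved):

```python
def y_mask_is_constant(ynumel: int, YBLOCK: int, max_ygrids: int) -> bool:
    if ynumel % YBLOCK != 0:
        return False  # a partial block always needs masking
    ygrids = (ynumel + YBLOCK - 1) // YBLOCK
    if ygrids <= max_ygrids:
        return True   # fits in the y dimension; every launch is full
    # y spills into z: unless the grid count divides evenly, some launches
    # cover a partial y range and masking is required
    return ygrids % max_ygrids == 0
```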

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139751
Approved by: https://github.com/eellison

Co-authored-by: George White <georgew@graphcore.ai>
2024-12-03 22:56:18 +00:00
Mwiza Kunda
f8a64c324e Broadcast constants on vectorised stores in CppTile2DKernel (#140262)
Currently, constants are not broadcast on vectorised stores in `CppTile2DKernel`, which leads to errors like the following:
```shell
error: request for member 'store' in 'tmp1', which is of non-class type 'signed char'
   61 |                                 tmp1.store(tmp2 + static_cast<int64_t>(8L*x0_inner), static_cast<int64_t>(8));
      |                                           ^~~~~
```
This PR adds the required broadcasting.
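
A minimal sketch of the codegen-side idea, with hypothetical helper names (the actual `CppTile2DKernel` change differs in detail): wrap a scalar constant in `at::vec::Vectorized` before emitting a vectorised `.store()`.

```python
def emit_vectorized_store(value_expr: str, is_scalar_constant: bool,
                          cpp_type: str, dst_expr: str, vec_len: int) -> str:
    # a bare scalar like `signed char tmp1` has no .store(); broadcast first
    if is_scalar_constant:
        value_expr = f"at::vec::Vectorized<{cpp_type}>({value_expr})"
    return f"{value_expr}.store({dst_expr}, static_cast<int64_t>({vec_len}));"

print(emit_vectorized_store("tmp1", True, "signed char",
                            "tmp2 + static_cast<int64_t>(8L*x0_inner)", 8))
```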

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140262
Approved by: https://github.com/jgong5
2024-12-03 09:15:17 +00:00
Aaron Gokaslan
08db735629 [BE]: Update mypy to 1.13.0 (#140808)
Update mypy to 1.13.0. This should hopefully reduce linting time. 1.13.0 has support for orjson cache serialization, which should improve mypy cache perf if orjson is installed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140808
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-12-03 02:50:10 +00:00
PyTorch MergeBot
daa77f3d9f Revert "[BE]: Update mypy to 1.13.0 (#140808)"
This reverts commit 00134d68af.

Reverted https://github.com/pytorch/pytorch/pull/140808 on behalf of https://github.com/huydhn due to This is failing a distributed test in trunk, target determination missed this test and did not run it on PR ([comment](https://github.com/pytorch/pytorch/pull/140808#issuecomment-2512788426))
2024-12-02 20:47:43 +00:00
Benjamin Glass
54adbbf6b8 cpp_wrapper: Add support for MemoryFormat arguments (#141367)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141367
Approved by: https://github.com/desertfire
2024-12-02 20:40:24 +00:00
Aaron Gokaslan
00134d68af [BE]: Update mypy to 1.13.0 (#140808)
Update mypy to 1.13.0. This should hopefully reduce linting time. 1.13.0 has support for orjson cache serialization, which should improve mypy cache perf if orjson is installed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140808
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-12-02 18:47:54 +00:00
leslie-fang-intel
96d2a511ce [Inductor][CPP] Fix issue in CPP GEMM Template Prune Tensor (#141798)
**Summary**
When addressing [issue #134998](https://github.com/pytorch/pytorch/issues/134998), we verify whether any node in the current graph shares the same storage as the node we intend to prune. In the implementation, we assumed that when creating the `GraphLowering` in the post-grad phase there would be no `submodules`, and that all `get_attr` nodes would correspond to a `torch.Tensor`. However, this assumption proves incorrect when enabling `FlexAttention`: in this scenario, `submodules` are present as `get_attr` nodes in the post-grad phase. For example:

```
V1128 23:23:47.071000 1965794 torch/_inductor/compile_fx.py:875] [0/1] [__post_grad_graphs]     class sdpa_score30(torch.nn.Module):
V1128 23:23:47.071000 1965794 torch/_inductor/compile_fx.py:875] [0/1] [__post_grad_graphs]         def forward(self, arg0_1: "bf16[][]cpu", arg1_1: "i32[][]cpu", arg2_1: "i32[][]cpu", arg3_1: "i32[][]cpu", arg4_1: "i32[][]cpu"):
V1128 23:23:47.071000 1965794 torch/_inductor/compile_fx.py:875] [0/1] [__post_grad_graphs]             return arg0_1

V1128 23:23:45.482000 1965794 torch/_inductor/freezing.py:118] [0/1]         sdpa_score30 = self.sdpa_score30
V1128 23:23:45.482000 1965794 torch/_inductor/freezing.py:118] [0/1]         sdpa_mask30 = self.sdpa_mask30
V1128 23:23:45.482000 1965794 torch/_inductor/freezing.py:118] [0/1]         flex_attention_30 = torch.ops.higher_order.flex_attention(add_276, index_put_60, index_put_61, sdpa_score30, (_frozen_param293, _frozen_param295, _frozen_param296, _frozen_param297, _frozen_param298, _frozen_param299, _frozen_param300, _frozen_param301, 64, 64, sdpa_mask30), 0.08838834764831843, {'SKIP_MASK_SCORE': True, 'PRESCALE_QK': False, 'ROWS_GUARANTEED_SAFE': False, 'BLOCKS_ARE_CONTIGUOUS': False, 'OUTPUT_LOGSUMEXP': False}, (), (_frozen_param294,));  add_276 = sdpa_score30 = sdpa_mask30 = None
V1128 23:23:45.482000 1965794 torch/_inductor/freezing.py:118] [0/1]         getitem_60: "bf16[1, 32, 1, 128]" = flex_attention_30[0];  flex_attention_30 = None
```
We added an extra check to the implementation to ensure we only compare `get_attr` nodes that correspond to a `torch.Tensor`. It is difficult to reproduce this issue using pure higher-order operators; adding a unit test after https://github.com/pytorch/pytorch/pull/141453 lands would be more straightforward.
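
A minimal sketch of the extra guard (assuming simple, non-dotted `get_attr` targets; not the exact Inductor code):

```python
import torch
import torch.fx

def tensor_get_attrs(gm: torch.fx.GraphModule):
    # yield only get_attr targets that are real tensors; FlexAttention's
    # score/mask subgraphs appear as get_attr submodules and must be skipped
    for node in gm.graph.nodes:
        if node.op != "get_attr":
            continue
        attr = getattr(gm, node.target, None)
        if isinstance(attr, torch.Tensor):
            yield node, attr
```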

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141798
Approved by: https://github.com/jgong5
2024-12-02 07:38:57 +00:00
Adnan Akhundov
f16e08042c [user triton] Fix grid codegen for configs with empty kwargs (#141824)
Fixes #141823 by adding special handling of the codegen `if <config kwargs>: return <grid>` for the case where a config has no kwargs.
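
Roughly, the generated grid dispatch guards one branch per autotune config; a config with no kwargs has nothing to guard on and needs an unconditional branch instead (hypothetical shape of the generated code, not the exact output):

```python
def make_grid(n: int):
    cdiv = lambda a, b: -(-a // b)  # ceiling division

    def grid(meta):
        if meta.get("BLOCK_SIZE") == 64:  # config with kwargs: guarded branch
            return (cdiv(n, 64),)
        # empty-kwargs config: emit an unconditional return rather than
        # a syntactically invalid empty `if :` guard
        return (cdiv(n, 1),)
    return grid
```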

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141824
Approved by: https://github.com/Chillee
2024-12-02 04:17:21 +00:00
Jason Ansel
b2fe1b9409 [inductor] Fix 3d tiling (#141709)
Fixes #141121

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141709
Approved by: https://github.com/eellison
2024-12-01 19:47:41 +00:00
Blaine Burton Rister
5deca07c0d [Inductor] Represent tiling as a dict (#141751)
# Summary

Preparatory refactor for https://github.com/pytorch/pytorch/pull/137243. This makes it easier to generalize to multi-dimensional reductions.

This diff refactors `self.numels` from a tuple like `(8,16)` to a dict like `{"x": 8, "r": 16}`.
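
An illustration of the refactor (hypothetical values):

```python
numels_tuple = (8, 16)            # before: position encodes the meaning
numels = {"x": 8, "r": 16}        # after: the prefix names the dimension
# a later multi-dimensional reduction generalizes naturally:
numels_2d = {"x": 8, "r0_": 4, "r1_": 4}
```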

Note: this is based off of https://github.com/pytorch/pytorch/pull/141738, which enables `tree.is_reduction`. That PR should land first.

# Test plan
The existing CI provides good coverage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141751
Approved by: https://github.com/jansel
2024-12-01 09:54:34 +00:00
Blaine Burton Rister
c2fa544472 [Inductor] move block pointer analysis to a new module (#141733)
# Summary

Preparatory refactor for https://github.com/pytorch/pytorch/pull/137243. This refactors the ModularIndexing block pointer analysis into its own module. That way, we can call it from other places besides Triton codegen. In the parent PR, we will use this to find tiling splits that simplify the indexing.

# Test plan

Tested by the existing CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141733
Approved by: https://github.com/jansel
2024-11-30 23:21:24 +00:00
Blaine Burton Rister
49fde426ba [Inductor] Use a helper function to tell if a tree or prefix is a reduction (#141738)
Preparatory refactor for https://github.com/pytorch/pytorch/pull/137243. Previously, we would typically check for reductions by `tree.prefix == "r"`. This PR moves the check into a helper function. This makes it easier to generalize the code to multi-dimensional reductions, which could have multiple prefixes like `("r0_", "r1_")`.
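
A minimal sketch of such a helper (the PR's actual API is `tree.is_reduction`; this free-function form is an assumption):

```python
def prefix_is_reduction(prefix: str) -> bool:
    # covers "r" as well as multi-dimensional prefixes like "r0_", "r1_"
    return prefix.startswith("r")

assert prefix_is_reduction("r0_") and not prefix_is_reduction("x")
```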

Tested by the existing CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141738
Approved by: https://github.com/jansel
2024-11-30 22:38:13 +00:00
Nan Zhang
5aacfa037b [Inductor] fix broadcast logic for Triton (#141027) (#141693)
Summary:

Fix the logic for inserting a broadcast in kernels where a load goes directly to a store. In that case we insert a tl.broadcast on the store, regardless of the block size of the load; when the broadcast is not required, the downstream Triton compiler is expected to remove the no-op broadcast instruction.
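
Illustrative shape of the emitted pattern (not the exact Inductor output): the load flows straight to the store, and the store carries an explicit broadcast that the compiler can drop when it is a no-op.

```python
import triton
import triton.language as tl

@triton.jit
def copy_kernel(in_ptr, out_ptr, n, XBLOCK: tl.constexpr):
    offs = tl.program_id(0) * XBLOCK + tl.arange(0, XBLOCK)
    mask = offs < n
    val = tl.load(in_ptr + offs, mask=mask)
    # explicit broadcast on the store; a no-op here, and expected to be
    # removed downstream when no broadcast is actually required
    tl.store(out_ptr + offs, tl.broadcast_to(val, [XBLOCK]), mask=mask)
```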

Test Plan: Added tests under test_torchinductor_strided_blocks.py:test_expand_broadcast in OSS and internal test cases.

Reviewed By: blaine-rister

Differential Revision: D65518033

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141693
Approved by: https://github.com/blaine-rister
2024-11-28 16:38:25 +00:00
eellison
f83361b274 inductor dtype propagation fixes (#141495)
- Add upcast_compute_type on creation of new tensors (loads, constants); see the sketch after this list.
- Fixes index_expr: right now we are somewhat inconsistent in dtype and don't always respect the dtype specified. It would be nice to fix this, but that is not done in this PR.
- Bug fix in view dtype, where we were always upcasting back to fp32 when the input was in bf16/fp16; we should only do that when the output is also in bf16/fp16.
- For masked, avoid calling dtype propagation and just use the output dtype.
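
A minimal sketch of what the upcast means (assumed helper semantics, tied to the `codegen_upcast_to_fp32` behavior mentioned below):

```python
import torch

def upcast_compute_type(dtype: torch.dtype) -> torch.dtype:
    # low-precision float loads/constants are computed in fp32;
    # everything else keeps its dtype
    return torch.float32 if dtype in (torch.float16, torch.bfloat16) else dtype

assert upcast_compute_type(torch.bfloat16) == torch.float32
assert upcast_compute_type(torch.int32) == torch.int32
```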

Turns on the runtime dtype verification for opinfo tests. The separate test file is still useful because we can use it for testing turning off codegen_upcast_to_fp32.

Follow ups:

- We could consider requiring fewer explicit upcast_compute_type calls and doing it automatically. That would potentially make things easier, but less flexible in the future; maybe I should have done it in this PR.
- Be more consistent in our index expr dtype printing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141495
Approved by: https://github.com/blaine-rister, https://github.com/arui-meta, https://github.com/ezyang
ghstack dependencies: #139945, #140057
2024-11-28 11:39:38 +00:00
PyTorch MergeBot
b33f770574 Revert "[inductor] Fix 3d tiling (#141709)"
This reverts commit ca9bfa1a38.

Reverted https://github.com/pytorch/pytorch/pull/141709 on behalf of https://github.com/huydhn due to Sorry for reverting your change but there is one failed test showing up in trunk.  It was missed by target determination ([comment](https://github.com/pytorch/pytorch/pull/141709#issuecomment-2505213481))
2024-11-28 03:55:31 +00:00
Jason Ansel
ca9bfa1a38 [inductor] Fix 3d tiling (#141709)
Fixes #141121

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141709
Approved by: https://github.com/eellison
2024-11-28 01:34:28 +00:00
Boyuan Feng
17fd53d8e5 [Inductor] Inplacing with Donated Buffer (#140113)
Currently, Inductor does not in-place update a buffer if it is an input buffer, because we don't know whether an input will be used by other functions.

A donated buffer carries the additional information that an input buffer will not be used by other functions, so we can in-place update donated buffers when possible.
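
The gist of the new eligibility rule, as a sketch (names are assumptions):

```python
def can_update_inplace(is_graph_input: bool, is_donated: bool) -> bool:
    # intermediate buffers were already eligible for in-place reuse;
    # input buffers now qualify too when marked donated, since no caller
    # will observe them afterwards
    return (not is_graph_input) or is_donated
```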

[Dashboard](https://hud.pytorch.org/benchmark/torchbench/inductor_dynamic?dashboard=torchinductor&startTime=Mon,%2011%20Nov%202024%2018:14:36%20GMT&stopTime=Mon,%2018%20Nov%202024%2018:14:36%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(a100)&lBranch=bf/donated-buffer-inplace&lCommit=5df0769c00e6f9000caeb10fd5cbf0b165f69c2a&rBranch=main&rCommit=2b39a8db7741b816b03677a9c6fec1af05640dee)

![image](https://github.com/user-attachments/assets/f19d961f-7973-418e-9de8-5c2a97950478)
![image](https://github.com/user-attachments/assets/df3bd6a9-58b8-4e8a-8397-9e3b1de9adfe)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140113
Approved by: https://github.com/eellison
2024-11-27 18:51:52 +00:00
eellison
fd553b9817 Add remaining method and tests for dtype propagation (#140057)
Adds the remaining unimplemented ops as well as an assertion failure if someone adds a new op without a dtype rule.

We test all unique pointwise operators registered as lowerings that have an OpInfo. There will be some follow-ups to make this work well with `codegen_upcast_to_fp32` as both True and False.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140057
Approved by: https://github.com/arui-meta, https://github.com/blaine-rister, https://github.com/ezyang
ghstack dependencies: #139945
2024-11-27 17:06:44 +00:00
eellison
566ceb3e7e Refactor dtype propagation (#139945)
A couple of changes:

- Tries to reuse dtype propagation rules that were already registered in Inductor. These were present both in `pointwise_overrides_data` and in the `boolean_ops` list. Additionally, the registration of pointwise ops already specified dtype propagation rules; this saves those registrations and reuses them later.

- Factors out `get_promoted_dtype`, which uses functools.lru_cache and takes non-CSEVariable args, because CSEVariable args will not work with the functools cache (see the sketch after this list).
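
A minimal sketch of the caching split (hashable dtypes are safe lru_cache keys; the real helper also handles promotion details this sketch omits):

```python
import functools

import torch

@functools.lru_cache(None)
def get_promoted_dtype(*dtypes: torch.dtype) -> torch.dtype:
    # dtypes hash stably, unlike CSEVariable objects, so callers reduce
    # their CSEVariable args to dtypes before calling this
    return functools.reduce(torch.promote_types, dtypes)

assert get_promoted_dtype(torch.bfloat16, torch.float32) == torch.float32
```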

Tests get added later in the stack when everything is implemented.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139945
Approved by: https://github.com/blaine-rister, https://github.com/arui-meta, https://github.com/ezyang
2024-11-27 16:57:02 +00:00
leslie-fang-intel
aa827e319e [Inductor][CPP] Extract common functions to be reused in other CPP Template (#141554)
**Summary**
Extract common internal functions from the GEMM template into public functions, so they can be reused by the subsequent group GEMM template.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141554
Approved by: https://github.com/jgong5
2024-11-27 09:52:18 +00:00
PyTorch MergeBot
65dbd5cc2d Revert "[Inductor] Inplacing with Donated Buffer (#140113)"
This reverts commit eecc8e362c.

Reverted https://github.com/pytorch/pytorch/pull/140113 on behalf of https://github.com/BoyuanFeng due to break test_donated_buffer_inplace internally since donated_buffer = False if is_fbcode() else True ([comment](https://github.com/pytorch/pytorch/pull/140113#issuecomment-2501954300))
2024-11-26 21:20:59 +00:00
Isuru Fernando
44186a0a4e Move Sympy printers to torch/utils/_sympy/printers.py (#140597)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140597
Approved by: https://github.com/ezyang, https://github.com/anijain2305
2024-11-26 18:11:00 +00:00
Yidi Wu
000d4e9d43 [hop][inductor] remove codegen_subgraph_suffix and directly assign call function result to outer outputs (#141181)
Before the PR: P1683356646
After the PR: P1683356585

Relevant changes:
```
@@ -231,7 +421,8 @@
             true_graph_0_args = [true_graph_0_arg0_1, true_graph_0_arg1_1]
             del true_graph_0_arg0_1
             del true_graph_0_arg1_1
+            (buf5[0],) = true_graph_0(true_graph_0_args)
-             (true_graph_0_buf0,) = true_graph_0(true_graph_0_args)
-             buf5[0] = true_graph_0_buf0
         else:
             # subgraph: false_graph_0
             false_graph_0_arg0_1 = buf4
@@ -239,7 +430,8 @@
             false_graph_0_args = [false_graph_0_arg0_1, false_graph_0_arg1_1]
             del false_graph_0_arg0_1
             del false_graph_0_arg1_1
+            (buf5[0],) = false_graph_0(false_graph_0_args)
-             (false_graph_0_buf0,) = false_graph_0(false_graph_0_args)
-             buf5[0] = false_graph_0_buf0
         del arg2_1
         del buf4
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141181
Approved by: https://github.com/anijain2305
ghstack dependencies: #140334, #141172
2024-11-26 17:32:51 +00:00
Yidi Wu
aae581d921 [hop free symbols][inductor] remove un-used add_symbol_graph_inputs (#141172)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141172
Approved by: https://github.com/Chillee
ghstack dependencies: #140334
2024-11-26 17:32:50 +00:00
Boyuan Feng
eecc8e362c [Inductor] Inplacing with Donated Buffer (#140113)
Currently, Inductor does not in-place update a buffer if it is an input buffer, because we don't know whether an input will be used by other functions.

A donated buffer carries the additional information that an input buffer will not be used by other functions, so we can in-place update donated buffers when possible.

[Dashboard](https://hud.pytorch.org/benchmark/torchbench/inductor_dynamic?dashboard=torchinductor&startTime=Mon,%2011%20Nov%202024%2018:14:36%20GMT&stopTime=Mon,%2018%20Nov%202024%2018:14:36%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(a100)&lBranch=bf/donated-buffer-inplace&lCommit=5df0769c00e6f9000caeb10fd5cbf0b165f69c2a&rBranch=main&rCommit=2b39a8db7741b816b03677a9c6fec1af05640dee)

![image](https://github.com/user-attachments/assets/f19d961f-7973-418e-9de8-5c2a97950478)
![image](https://github.com/user-attachments/assets/df3bd6a9-58b8-4e8a-8397-9e3b1de9adfe)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140113
Approved by: https://github.com/eellison
2024-11-26 17:19:50 +00:00
leslie-fang-intel
9d4c0527b3 [Inductor][CPP] Modularize the CPP GEMM Template (#141006)
**Summary**
Move the common template code, which may be reused in subsequent group GEMM templates, into the standalone sub-templates.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141006
Approved by: https://github.com/jgong5
2024-11-26 14:32:40 +00:00
xinan.lin
4742080ed9 [AOTI XPU] Enable Cpp wraper for Intel GPU. (#135318)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135318
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/guangyey, https://github.com/desertfire
2024-11-26 11:51:32 +00:00
Colin Peppler
8f5edcb75c [CUTLASS] Lift shape & stride information as kernel args (#138611)
Differential Revision: [D64773324](https://our.internmc.facebook.com/intern/diff/D64773324)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138611
Approved by: https://github.com/chenyang78
2024-11-25 17:52:33 +00:00
Sun, Jiayi
a964f31d7b [inductor] modify the heuristic for loop split optimization (#137550)
### Summary

1. Improve the heuristic for loop split optimization: the divisor needs to be an integer and cannot be too small (it needs to be greater than 8; this threshold has been tuned).
2. Improve the heuristic for disabling vectorization: add a quantity_threshold and relax the ratio_threshold for the number of non-contiguous load/store/index_expr ops in the loop body. A sketch of both heuristics follows.
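
A sketch of both heuristics (the `> 8` threshold comes from this description; the `quantity_threshold` and `ratio_threshold` values here are placeholders, not the tuned ones):

```python
def divisor_ok(numel: int, divisor: int) -> bool:
    # heuristic 1: the split divisor must divide evenly and exceed 8
    return numel % divisor == 0 and divisor > 8

def disable_vectorization(non_contig_ops: int, total_ops: int,
                          quantity_threshold: int = 16,
                          ratio_threshold: float = 0.5) -> bool:
    # heuristic 2: require both many non-contiguous load/store/index_expr
    # ops and a high ratio of them before giving up on vectorization
    return (non_contig_ops >= quantity_threshold
            and non_contig_ops / total_ops >= ratio_threshold)
```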

This PR brings performance improvements for two torchbench models (functorch_dp_cifar10, opacus_cifar10) and one timm model (sebotnet33ts_256).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137550
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jansel
2024-11-25 09:16:30 +00:00
haozhe.zhu
d0fd42eb3a [inductor] refine loop split logic (#128812)
This PR aims to improve parallelization by collapsing the vectorized loop. https://github.com/pytorch/pytorch/issues/122281

For such a case, the parallelism level is only `2`, and the vectorized loop cannot be collapsed:
```
#pragma omp for
for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L))
{
    for(long x1=static_cast<long>(0L); x1<static_cast<long>(199984L); x1+=static_cast<long>(16L))
    {
        auto tmp0 = at::vec::VectorizedN<int64_t,2>::loadu(in_ptr0 + static_cast<long>(x1 + (199985L*x0)), 16);
        tmp0.store(out_ptr0 + static_cast<long>(x1 + (209985L*x0)), 16);
    }
    #pragma omp simd simdlen(8)
    for(long x1=static_cast<long>(199984L); x1<static_cast<long>(199985L); x1+=static_cast<long>(1L))
    {
        auto tmp0 = in_ptr0[static_cast<long>(x1 + (199985L*x0))];
        out_ptr0[static_cast<long>(x1 + (209985L*x0))] = tmp0;
    }
}
```
After this PR, we generate code like the following:
```
#pragma omp for collapse(2)
for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L))
{
    for(long x1=static_cast<long>(0L); x1<static_cast<long>(199985L); x1+=static_cast<long>(16L))
    {
        if (x1 >= 0 && x1 <199984) {
            auto tmp0 = at::vec::VectorizedN<int64_t,2>::loadu(in_ptr0 + static_cast<long>(x1 + (199985L*x0)), 16);
            tmp0.store(out_ptr0 + static_cast<long>(x1 + (209985L*x0)), 16);
        }
        if (x1 >= 199984 && x1 <199985) {
            auto tmp0 = in_ptr0[static_cast<long>(x1 + (199985L*x0))];
            out_ptr0[static_cast<long>(x1 + (209985L*x0))] = tmp0;
        }
    }
}
```

### Highlight
For the reduction case, there is a side effect. In the case below, we vectorize the `x1` dim and reduce along the `x2` dim:
```
#pragma omp for
for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(39L); x0+=static_cast<int64_t>(1L))
{
    for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(16L); x1+=static_cast<int64_t>(8L))
    {
        {
            float tmp_acc0 = -std::numeric_limits<float>::infinity();
            at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity());
            for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(18L); x2+=static_cast<int64_t>(1L))
            {
                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<int64_t>(x1 + (17L*x2) + (306L*x0)), 8);
                tmp_acc0_vec = at::vec::maximum(tmp_acc0_vec, tmp0);
            }
            [&]
            {
                __at_align__ std::array<float, 8> tmpbuf;
                tmp_acc0_vec.store(tmpbuf.data(), 8);
                #pragma GCC unroll 8
                for (long x1_inner = 0; x1_inner < 8; x1_inner++)
                {
                    out_ptr1[static_cast<int64_t>(x0 + (39L*x1) + (39L*x1_inner))] = tmpbuf[x1_inner];
                }
            }
            ()
            ;
        }
    }
    #pragma omp simd simdlen(4)
    for(int64_t x1=static_cast<int64_t>(16L); x1<static_cast<int64_t>(17L); x1+=static_cast<int64_t>(1L))
    {
        {
            float tmp_acc0 = -std::numeric_limits<float>::infinity();
            for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(18L); x2+=static_cast<int64_t>(1L))
            {
                auto tmp0 = in_ptr1[static_cast<int64_t>(x1 + (17L*x2) + (306L*x0))];
                tmp_acc0 = max_propagate_nan(tmp_acc0, tmp0);
            }
            out_ptr1[static_cast<int64_t>(x0 + (39L*x1))] = tmp_acc0;
        }
    }
}

```
After collapsing, the loop order will be `x1 -> x2 -> x1_tail_part`, so we need a `tmp_acc_arr` to store the reduction result for `x1_tail_part`. For the `reduction_stores`, we also need to check `x1`'s value, as we do in the `loopbody`, since the `reduction_stores` happen between the `x1` and `x2` loops.
```
#pragma omp for collapse(2)
for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(39L); x0+=static_cast<int64_t>(1L))
{
    for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(17L); x1+=static_cast<int64_t>(8L))
    {
        {
            float tmp_acc0_arr[8];           ######### need an array to hold acc result for tail part
            for (int i = 0; i < 8; i++)
            {
                tmp_acc0_arr[i] = -std::numeric_limits<float>::infinity();
            }
            float tmp_acc0 = -std::numeric_limits<float>::infinity();
            at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity());
            for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(18L); x2+=static_cast<int64_t>(1L))
            {
                {
                    if(C10_LIKELY(x1 >= static_cast<int64_t>(0) && x1 < static_cast<int64_t>(16L)))
                    {
                        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<int64_t>(x1 + (17L*x2) + (306L*x0)), 8);
                        tmp_acc0_vec = at::vec::maximum(tmp_acc0_vec, tmp0);
                    }
                    if(C10_UNLIKELY(x1 >= static_cast<int64_t>(16L) && x1 < static_cast<int64_t>(17L)))
                    {
                        for (long x1_tail = static_cast<int64_t>(16L); x1_tail < static_cast<int64_t>(17L); x1_tail++)
                        {
                            auto tmp0 = in_ptr1[static_cast<int64_t>(x1_tail + (17L*x2) + (306L*x0))];
                            tmp_acc0_arr[x1_tail - static_cast<int64_t>(16L)] = max_propagate_nan(tmp_acc0_arr[x1_tail - static_cast<int64_t>(16L)], tmp0);
                        }
                    }
                }
            }

            ############### reduction stores
            if(C10_LIKELY(x1 >= static_cast<int64_t>(0) && x1 < static_cast<int64_t>(16L)))
            {
                [&]
                {
                    __at_align__ std::array<float, 8> tmpbuf;
                    tmp_acc0_vec.store(tmpbuf.data(), 8);
                    #pragma GCC unroll 8
                    for (long x1_inner = 0; x1_inner < 8; x1_inner++)
                    {
                        out_ptr1[static_cast<int64_t>(x0 + (39L*x1) + (39L*x1_inner))] = tmpbuf[x1_inner];
                    }
                }
                ()
                ;
            }
            if(C10_UNLIKELY(x1 >= static_cast<int64_t>(16L) && x1 < static_cast<int64_t>(17L)))
            {
                for (long x1_tail = static_cast<int64_t>(16L); x1_tail < static_cast<int64_t>(17L); x1_tail++)
                {
                    out_ptr1[static_cast<int64_t>(x0 + (39L*x1_tail))] = tmp_acc0_arr[x1_tail - static_cast<int64_t>(16L)];
                }
            }
        }
    }
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128812
Approved by: https://github.com/jgong5
2024-11-25 04:46:07 +00:00
Jason Ansel
995e3079c9 [inductor] Fix for "Failed to find static RBLOCK" (#141434)
Summary: I expect this to fix https://fb.workplace.com/groups/1075192433118967/permalink/1547962839175255/

Test Plan: Ask poster to confirm fix

Differential Revision: D66413828

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141434
Approved by: https://github.com/ezyang
2024-11-23 22:08:56 +00:00
PyTorch MergeBot
f23621ec56 Revert "Move Sympy printers to torch/utils/_sympy/printers.py (#140597)"
This reverts commit c25b201583.

Reverted https://github.com/pytorch/pytorch/pull/140597 on behalf of https://github.com/huydhn due to Trunk is sad again after this lands, this looks like a landrace this time, so please do a rebase ([comment](https://github.com/pytorch/pytorch/pull/140597#issuecomment-2494052978))
2024-11-22 15:43:39 +00:00
Jason Ansel
3acc6eac49 [inductor] Add typing to ir.py 2 (#140915)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140915
Approved by: https://github.com/aorenste
2024-11-22 04:56:54 +00:00
Isuru Fernando
c25b201583 Move Sympy printers to torch/utils/_sympy/printers.py (#140597)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140597
Approved by: https://github.com/ezyang, https://github.com/anijain2305
2024-11-22 02:04:36 +00:00
sanchitintel
ca9813ea14 Simplify & rectify dequantized B buffer loading for AMX GEMM micro-kernel for WoQ int8 case (#140258)
As suggested by @leslie-fang-intel in 4c83e4e751 (diff-139642bd981df977f70f4c18c1c34bd1a85c1d6b9ffa06aaa98426ed83942a31R537), all elements of `B` tiles (not AMX tiles, but tiles at the granularity of the micro-kernel) are contiguous because the `B` matrix is pre-packed, so the dequantized-buffer loading logic can be simplified. While the previous approach kept the elements to be loaded into a B AMX tile contiguous, the new approach does not entail a performance penalty either, because that data is already in L1D; loading AMX tiles from non-contiguous dequantized B elements does not adversely affect performance.

Also rectified the size of the dequantized B buffer.

Fixes #140208.

A subsequent PR will factor out caching of dequantized int8 weights into a separate codegen function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140258
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel
2024-11-22 01:34:06 +00:00
Jason Ansel
6eca0aee76 [inductor] Refactor ir.Layout into ir.OutputSpec (#140910)
This separates the concepts of a Layout (size/stride/etc.) and an OutputSpec (which can include multiple outputs), which should make typing easier.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140910
Approved by: https://github.com/ezyang
ghstack dependencies: #140895
2024-11-21 20:01:57 +00:00
Colin Peppler
827f2f749e [CUTLASS] Raise NotImplementedError if X & W aren't FixedLayout (#140985)
Summary: title

Differential Revision: D66131402

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140985
Approved by: https://github.com/Skylion007
2024-11-21 19:59:19 +00:00
Sam Ginzburg
a847790400 [inductor] reset to zero support for user defined Triton kernels (#140982)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140982
Approved by: https://github.com/aakhundov
2024-11-21 18:53:23 +00:00
PyTorch MergeBot
701e06b643 Revert "Move Sympy printers to torch/utils/_sympy/printers.py (#140597)"
This reverts commit aefcdb3c9f.

Reverted https://github.com/pytorch/pytorch/pull/140597 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it fails inductor/test_padding in trunk. This is a target determination miss and that failed test was not run in your PR ([comment](https://github.com/pytorch/pytorch/pull/140597#issuecomment-2489641453))
2024-11-20 22:13:57 +00:00
Isuru Fernando
aefcdb3c9f Move Sympy printers to torch/utils/_sympy/printers.py (#140597)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140597
Approved by: https://github.com/ezyang, https://github.com/anijain2305
2024-11-20 20:26:49 +00:00
Jason Ansel
808f0f656d [inductor] Refactor MutableBox to make IRNode typing easier (#140895)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140895
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2024-11-20 19:50:46 +00:00
Benjamin Glass
4ffce45100 AOTInductor: properly generate cpp_wrapper runtime assertions (#141050)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141050
Approved by: https://github.com/desertfire
ghstack dependencies: #141058
2024-11-20 19:17:47 +00:00
Benjamin Glass
5c684503a6 cpp_wrapper: Fix dtype_view wrapping reinterpret_view (#141058)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141058
Approved by: https://github.com/desertfire
2024-11-20 19:17:47 +00:00
Aaron Gokaslan
12e95aa4ee [BE]: Apply PERF401 autofixes from ruff (#140980)
* Automatically applies ruff rule PERF401, turning loops into equivalent list comprehensions, which are faster and do not leak the loop variables into the enclosing scope (see the example below).
* List comprehensions not only often type better, they also have 50+% less loop overhead, preserve length information, and are easier for the interpreter to optimize.
* Manually went back and made mypy happy after the change.
* Also fixed style lints in files covered by flake8 but not by pyfmt.
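
For illustration, this is the kind of rewrite PERF401 performs (a hypothetical snippet, not taken from this PR):

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    is_buf: bool

graph_nodes = [Node("buf0", True), Node("arg0", False), Node("buf1", True)]

# Before: appending in a loop, the pattern PERF401 flags
names = []
for node in graph_nodes:
    if node.is_buf:
        names.append(node.name)

# After: the equivalent list comprehension ruff suggests
names = [node.name for node in graph_nodes if node.is_buf]
assert names == ["buf0", "buf1"]
```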

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140980
Approved by: https://github.com/justinchuby, https://github.com/malfet
2024-11-20 17:52:07 +00:00
eellison
eff22171d2 Add Current Mask Var To CSE Cache Key (#140838)
This torch.cat kernel has multiple subblocks which load from the same input. We were incorrectly reusing the mask vars from the first load for the second load.
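
A minimal sketch of the idea (names and structure are assumptions; Inductor's actual CSE is richer):

```python
import itertools

_names = map("tmp{}".format, itertools.count())
_cache: dict = {}

def cse(expr: str, mask_vars: frozenset) -> str:
    # key on the active mask variables too, so the same load expression
    # under a different mask gets its own variable instead of a stale one
    key = (expr, mask_vars)
    if key not in _cache:
        _cache[key] = next(_names)
    return _cache[key]

a = cse("load(in_ptr0 + x0)", frozenset({"xmask"}))
b = cse("load(in_ptr0 + x0)", frozenset({"tmp4_mask"}))
assert a != b  # previously these incorrectly collided
```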

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140838
Approved by: https://github.com/jansel
ghstack dependencies: #140841
2024-11-20 00:55:56 +00:00
Henry Tsang
4f2543c31d [logs] Add dynamo_timed to get better compilation time breakdown for AOTI (#140198)
Adding some dynamo_timed annotations for the purpose of better understanding AOTI compilation time.

This will probably require a few more passes: a lot of time is spent in Scheduler.__init__, and there are not enough annotations there yet.

run_command_and_check takes a lot of time as well, but there is probably not much we can do; maybe we can add a config to tune the C++ optimization level?
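
As a minimal sketch (assuming the `dynamo_timed` context manager in `torch._dynamo.utils`; its exact signature varies across versions), adding a phase annotation looks roughly like:

```python
from torch._dynamo.utils import dynamo_timed

def scheduler_phase():
    # wrap a compilation phase so it shows up in the time breakdown;
    # the key string here is illustrative, not the PR's actual key
    with dynamo_timed("Scheduler.__init__"):
        ...
```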

traces:
<img width="1205" alt="Screenshot 2024-11-08 at 4 41 10 PM" src="https://github.com/user-attachments/assets/61645264-b3af-4d4a-804d-700b0f831c7c">

Differential Revision: D65554141

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140198
Approved by: https://github.com/desertfire
2024-11-19 18:54:17 +00:00