Commit Graph

83 Commits

Nikita Karetnikov
938c5da61e [inductor] do not generate loops when the condition doesn't hold (#98185)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98185
Approved by: https://github.com/jgong5, https://github.com/ezyang
2023-04-06 17:22:16 +00:00
Bin Bao
348dcf51e5 [inductor] Combine CppWrapperCodeGen and CppAotWrapperCodeGen (#98088)
Summary: Make CppAotWrapperCodeGen generate the kernels and the wrapper in one
file, which unifies the codegen for the AOT and non-AOT modes. There will be
more refactoring for the AOT part.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98088
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-04-06 15:59:55 +00:00
Liao, Xuan
177994eb54 [inductor] [cpp] fix bitwise codegen (#98056)
Fixes #97968

Fix to maintain the data type after performing bitwise operations.
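
A minimal pattern of the kind affected might look like this (a hypothetical repro sketch, not taken from the linked issue):

```python
import torch
import torch._dynamo

def fn(x, y):
    # the result dtype of the bitwise ops should stay uint8 through codegen
    return torch.bitwise_or(torch.bitwise_and(x, y), y)

x = torch.randint(0, 255, (64,), dtype=torch.uint8)
y = torch.randint(0, 255, (64,), dtype=torch.uint8)
opt_fn = torch._dynamo.optimize("inductor")(fn)
assert opt_fn(x, y).dtype == torch.uint8
```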

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98056
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-04-04 01:33:31 +00:00
Bin Bao
96f548a1ac [inductor] Add an AOT mode for the Triton backend (#98214)
Summary:
This is a copy of https://github.com/pytorch/pytorch/pull/97152 to make
the landing easier.

This PR implements a two-pass wrapper codegen for the Triton
backend to achieve ahead-of-time compilation. In the first pass, the
regular python wrapper code will be generated, and then the generated
code will be executed to perform Triton compilation and autotuning.
After that, the second pass wrapper codegen will generate C++ wrapper
with proper CUDA API to load and launch Triton-generated CUDA kernels.

As with the AOT mode for the cpp backend, the next step would be to provide
a more complete API for AOT.
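
A rough sketch of the two-pass flow described above (all helper names here are illustrative placeholders, not the actual Inductor API):

```python
# illustrative pseudocode of the two-pass AOT wrapper codegen for the Triton
# backend; every function name below is a hypothetical placeholder
def aot_compile_triton(graph):
    py_wrapper = codegen_python_wrapper(graph)     # pass 1: regular python wrapper
    run(py_wrapper)                                # running it compiles + autotunes the Triton kernels
    cpp_wrapper = codegen_cpp_wrapper(graph)       # pass 2: C++ wrapper using CUDA APIs
    return compile_to_shared_library(cpp_wrapper)  # loads and launches the generated kernels
```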

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98214
Approved by: https://github.com/eellison
2023-04-03 22:19:18 +00:00
Jiong Gong
5d62d12557 [Inductor] support transpose vertical reduction in cpp (#97781)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97781
Approved by: https://github.com/jansel
2023-04-03 02:02:15 +00:00
Jiong Gong
bf22ecba2a [Inductor] support vertical reduction in cpp (#97644)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97644
Approved by: https://github.com/jansel
2023-04-03 01:29:12 +00:00
Jiong Gong
8e5f491623 [Inductor] simplify CPP backend Tile2D code and support non-contiguous load/store (#97626)
Remove `CppTile2DTailKernel` and `CppTile2DKernelChecker` and reuse `CppVecKernel` and `CppVecKernelChecker` in their place. Add vectorization with a scalar fallback for load/store in `CppVecKernel` to cover the non-contiguous load/store needed by `CppTile2DTailKernel`.

This PR also adds functional support for transposed copies of the bfloat16 data type. Better performance would require vectorized intrinsics implemented for `at::vec::transpose_mxn`. cc @soumith @voznesenskym @penguinwu @anijain2305 @EikanWang @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @desertfire

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97626
Approved by: https://github.com/jansel
2023-04-03 01:11:20 +00:00
chunyuan
004bb34f42 inductor: fix vision_maskrcnn dynamic_shapes error on CPU (#97312)
Fix several C++ compilation errors in `vision_maskrcnn` in dynamic_shapes cases:
1. convert `ceiling` to `std::ceil` in `CppPrinter`:
```bash
error: ‘ceiling’ was not declared in this scope
   17 |                 for(long i1=0; i1<ceiling(1.8735363483429*ks1); i1+=1)
```

2. convert index in `store` to `INDEX_TYPE`:
```bash
error: invalid types ‘float*[double]’ for array subscript
   52 |                         out_ptr0[i2 + (i1*(floor(1.8735363483429*ks2))) + (i0*(std::ceil((1.87353634834290*ks1)))*(floor(1.8735363483429*ks2)))] = tmp30;
```

3. convert offset, size, steps in loop to  `INDEX_TYPE`:
```bash
error: invalid controlling predicate
   16 |                 for(long i1=0; i1<std::ceil((1.87353634834290*ks1)); i1+=1)
```

4. convert index in `load` to  `INDEX_TYPE`:
```bash
error: invalid types ‘float*[double]’ for array subscript
   64 |                     auto tmp0 = out_ptr0[i1 + (i0*(floor(1.8735363483429*ks2)))];
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97312
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/jansel
2023-03-29 10:24:57 +00:00
Jason Ansel
bc86af0d37 Remove DeferredIndentedBuffer (#97616)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97616
Approved by: https://github.com/desertfire
2023-03-28 23:13:41 +00:00
haozhe.zhu
a1ada050f8 do not insert to_dtype for memory copy only buffers (#97147)
Remove redundant to_dtype conversions such as
`load_bf16 + to_fp32 + to_bf16 + store_bf16` => `load_bf16 + store_bf16`.
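
A copy-only bf16 buffer of the kind this targets could come from something like the following (a hedged sketch; the actual tests may differ):

```python
import torch
import torch._dynamo

def copy_only(x):
    # a pure memory copy: the bf16 value is loaded and stored unchanged, so the
    # intermediate to_fp32/to_bf16 pair is redundant and can be dropped
    return x.clone()

x = torch.randn(1024, dtype=torch.bfloat16)
opt_fn = torch._dynamo.optimize("inductor")(copy_only)
opt_fn(x)
```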

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97147
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/jansel
2023-03-27 14:55:41 +00:00
Liao, Xuan
a331cd4314 [inductor] fix cpp legalize bf16 reduction (#97228)
When legalizing bf16 for reductions, operators with a result dtype of torch.int64, like argmax, currently hit an assertion error. This PR fixes the int64 case, enabling several bf16 models (hf_Reformer, doctr_reco_predictor) to run successfully.
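
A sketch of the pattern this enables (a hypothetical repro, not taken from the listed models):

```python
import torch
import torch._dynamo

def fn(x):
    # a reduction over a bf16 input whose result dtype is torch.int64
    return torch.argmax(x, dim=-1)

x = torch.randn(8, 16, dtype=torch.bfloat16)
opt_fn = torch._dynamo.optimize("inductor")(fn)
assert opt_fn(x).dtype == torch.int64
```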

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97228
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/desertfire
2023-03-23 08:52:25 +00:00
Jiong Gong
35439e8610 [Inductor] add guards to guarantee vector int32 only used by comparison ops (for masked load) (#97144)
Fix https://github.com/pytorch/pytorch/issues/97124 and https://github.com/pytorch/pytorch/issues/97127

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97144
Approved by: https://github.com/EikanWang, https://github.com/jansel
2023-03-23 03:12:50 +00:00
Wang, Eikan
c5d7ed9423 [Inductor] Fix the issue that cannot pass lint check for debug mode (#97249)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97249
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/jansel
2023-03-22 04:25:44 +00:00
Jiong Gong
4733de18fd [Inductor] Add debug logging to explain reasons of disabling vectorization (#97108)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97108
Approved by: https://github.com/EikanWang, https://github.com/jansel
2023-03-22 02:38:34 +00:00
Wang, Eikan
915cbf8208 [Inductor] Eliminate redundant to_dtype node (#96650)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96650
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-03-18 01:51:38 +00:00
Bin Bao
f03db8d6cb [reland2][inductor] Add an AOT compilation mode for Inductor CPP backend (#96520)
Summary: This is a reland of https://github.com/pytorch/pytorch/pull/94822.
Solved the long compilation issue for inductor cpp tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96520
Approved by: https://github.com/huydhn, https://github.com/malfet
2023-03-14 16:10:54 +00:00
Wang, Eikan
bdd09e68e4 [Inductor] Legalize BF16 (#96183)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96183
Approved by: https://github.com/jansel, https://github.com/jgong5
2023-03-14 10:16:15 +00:00
Jiong Gong
190e284bd3 [Inductor] apply vec float mask on logical comparison ops in cpp (#96502)
Fix https://github.com/pytorch/pytorch/issues/96446
The root cause is that the logical comparison op works on an integer vector which is later consumed by the `where` op, which expects a float vector.
1. Make sure the float vec mask is applied on logical comparison ops.
2. Fix the vec int specialization of `to_float_mask`: it assumes an int mask as input and returns the float mask via a reinterpret cast.
3. Add a no-op specialization of the `to_float_mask` function for float vec input.
4. Pass value instead of ref to `to_float_mask`. Passing by value should be efficient enough.
5. Remove a conditional check `!=0` in `masked()` since `to_float_mask` is guaranteed to return a float mask.
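
A minimal pattern exercising this path might look like the following (a hedged sketch, not the repro from the linked issues):

```python
import torch
import torch._dynamo

def fn(x, y):
    # the comparison produces the mask consumed by where; on the CPP vec path
    # the mask must be a float vector before blending
    return torch.where(x > y, x, y)

x = torch.randn(64)
y = torch.randn(64)
opt_fn = torch._dynamo.optimize("inductor")(fn)
opt_fn(x, y)
```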

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96502
Approved by: https://github.com/EikanWang, https://github.com/XiaobingSuper, https://github.com/jansel
2023-03-14 08:47:14 +00:00
XiaobingSuper
279ada515a inductor(cpu): make variable number used of masked vectorization path align with scalar path (#96510)
Fix https://github.com/pytorch/pytorch/issues/96484. For the CPP reduction vectorization path, there is an assumption that the variable numbering of the vectorized path is aligned with the scalar path; currently, `masked` does not meet this requirement and triggers an undefined-variable error.

before:
```
{
    {
        {
            #pragma omp declare reduction(min:at::vec::Vectorized<float>:omp_out = at::vec::minimum(omp_out, omp_in)) initializer(omp_priv={{std::numeric_limits<float>::infinity()}})
            float tmp7 = std::numeric_limits<float>::infinity();
            auto tmp7_vec = at::vec::Vectorized<float>(tmp7);
            for(long i0=0; i0<0; i0+=1)
            {
                auto tmp5 = at::vec::Vectorized<float>::loadu(in_ptr1 + 16*i0);
                auto tmp0 = at::vec::Vectorized<int>(static_cast<int>(0));
                auto tmp1 = at::vec::Vectorized<int>(static_cast<int>(2));
                auto tmp2 = tmp0 < tmp1;
                auto tmp3 = at::vec::Vectorized<float>(0.0);
                {
                    auto tmp4 = at::vec::Vectorized<float>(in_ptr0[0]);
                    tmp3 = decltype(tmp4)::blendv(tmp3, tmp4, to_float_mask(tmp2) != at::vec::Vectorized<float>(0));
                }
                auto tmp6 = tmp3 + tmp5;
                tmp7_vec = at::vec::minimum(tmp7_vec, tmp6);
            }
            #pragma omp simd simdlen(8)  reduction(min:tmp8)
            for(long i0=0; i0<2; i0+=1)
            {
                auto tmp6 = in_ptr1[i0];
                auto tmp0 = static_cast<long>(0);
                auto tmp1 = static_cast<long>(2);
                auto tmp2 = tmp0 < tmp1;
                auto tmp3 = [&]
                {
                    auto tmp4 = in_ptr0[0];
                    return tmp4;
                }
                ;
                auto tmp5 = tmp2 ? tmp3() : static_cast<decltype(tmp3())>(0.0);
                auto tmp7 = tmp5 + tmp6;
                tmp8 = std::min(tmp8, tmp7);
            }
            tmp7 = std::min(tmp7, at::vec::vec_reduce_all<float>([](at::vec::Vectorized<float>& x, at::vec::Vectorized<float>&y) {return at::vec::minimum(x, y);}, tmp7_vec));
            out_ptr0[0] = tmp8;
        }
    }
}
```
after:

```
{
    {
        {
            #pragma omp declare reduction(min:at::vec::Vectorized<float>:omp_out = at::vec::minimum(omp_out, omp_in)) initializer(omp_priv={{std::numeric_limits<float>::infinity()}})
            float tmp8 = std::numeric_limits<float>::infinity();
            auto tmp8_vec = at::vec::Vectorized<float>(tmp8);
            for(long i0=0; i0<0; i0+=1)
            {
                auto tmp6 = at::vec::Vectorized<float>::loadu(in_ptr1 + 16*i0);
                auto tmp0 = at::vec::Vectorized<int>(static_cast<int>(0));
                auto tmp1 = at::vec::Vectorized<int>(static_cast<int>(2));
                auto tmp2 = tmp0 < tmp1;
                auto tmp3 = [&]
                {
                    auto tmp4 = at::vec::Vectorized<float>(in_ptr0[0]);
                    return tmp4;
                }
                ;
                auto tmp5 = decltype(tmp3())::blendv(at::vec::Vectorized<float>(0.0), tmp3(), to_float_mask(tmp2) != at::vec::Vectorized<float>(0));
                auto tmp7 = tmp5 + tmp6;
                tmp8_vec = at::vec::minimum(tmp8_vec, tmp7);
            }
            #pragma omp simd simdlen(8)  reduction(min:tmp8)
            for(long i0=0; i0<2; i0+=1)
            {
                auto tmp6 = in_ptr1[i0];
                auto tmp0 = static_cast<long>(0);
                auto tmp1 = static_cast<long>(2);
                auto tmp2 = tmp0 < tmp1;
                auto tmp3 = [&]
                {
                    auto tmp4 = in_ptr0[0];
                    return tmp4;
                }
                ;
                auto tmp5 = tmp2 ? tmp3() : static_cast<decltype(tmp3())>(0.0);
                auto tmp7 = tmp5 + tmp6;
                tmp8 = std::min(tmp8, tmp7);
            }
            tmp8 = std::min(tmp8, at::vec::vec_reduce_all<float>([](at::vec::Vectorized<float>& x, at::vec::Vectorized<float>&y) {return at::vec::minimum(x, y);}, tmp8_vec));
            out_ptr0[0] = tmp8;
        }
    }
}

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96510
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/jansel
2023-03-13 09:41:23 +00:00
PyTorch MergeBot
fe05266fda Revert "[reland][inductor] Add an AOT compilation mode for Inductor CPP backend (#95985)"
This reverts commit deaf9e5e65.

Reverted https://github.com/pytorch/pytorch/pull/95985 on behalf of https://github.com/huydhn due to Sorry for reverting this. It increased the test time significantly for ASAN (and may be other test shards). ASAN tests on PR passed but it was barely not timing out. I have updated my initial findings in https://github.com/pytorch/pytorch/issues/96378
2023-03-09 01:45:24 +00:00
Bin Bao
deaf9e5e65 [reland][inductor] Add an AOT compilation mode for Inductor CPP backend (#95985)
Summary: This is a reland of https://github.com/pytorch/pytorch/pull/94822

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95985
Approved by: https://github.com/jansel
2023-03-08 20:02:32 +00:00
Jason Ansel
fe4fec37a4 [inductor] Refactor IR printing (#96024)
Reland #95567 part 2.  The previous version of this had a bug which the
added test triggers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96024
Approved by: https://github.com/ngimel
2023-03-07 02:23:06 +00:00
Liao, Xuan
e168dbb90a [inductor] improve cpp vec implementation of square (#96072)
For cpp vectorization of `square`, the current implementation is not efficient. This also affects the performance of `batch normalization`, which uses `square` when calculating variance. This PR replaces the `power` call with a `multiplication` to gain more performance.
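
A batch-norm-style variance computation of the kind affected (illustrative only; the replacement of `power` by `multiplication` happens inside the generated cpp code):

```python
import torch
import torch._dynamo

def variance_like(x):
    # variance uses square; the vectorized cpp codegen now emits it as x * x
    # rather than a pow call (conceptual illustration of this PR's change)
    mean = x.mean(dim=0, keepdim=True)
    return ((x - mean) ** 2).mean(dim=0)

x = torch.randn(128, 64)
opt_fn = torch._dynamo.optimize("inductor")(variance_like)
opt_fn(x)
```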

Micro-benchmark performance for eager v.s. inductor:
op=`aten.native_batch_norm.default`

suite | improvement_0.2 | improvement_0.5 | improvement_0.8 | current_speedup_0.2 | new_speedup_0.2 | current_speedup_0.5 | new_speedup_0.5 | current_speedup_0.8 | new_speedup_0.8
-- | -- | -- | -- | -- | -- | -- | -- | -- | --
torchbench | 8.82% | 5.53% | 32.19% | 0.608006834 | 0.661613139 | 0.691743711 | 0.729987622 | 0.76176223 | 1.00694842
timm | 59.30% | 63.01% | 94.77% | 0.650648524 | 1.036498047 | 0.676425152 | 1.102667387 | 0.695693384 | 1.354992423


Model training performance for eager v.s. inductor:

model | improvement | current_speedup | new_speedup
-- | -- | -- | --
lcnet_050 multi-thread | 5.16% | 1.046 | 1.1
lcnet_050 single-thread | 21.81% | 0.94 | 1.145
mobilenet_v2 multi-thread | 3.88% | 1.135 | 1.179
mobilenet_v2 single-thread | 37.46% | 0.929 | 1.277


Pull Request resolved: https://github.com/pytorch/pytorch/pull/96072
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/desertfire
2023-03-07 01:13:39 +00:00
zhuhaozhe
ebaf9af76e use float to init reduction value (#95452)
Fixes https://github.com/pytorch/pytorch/issues/95195, https://github.com/pytorch/pytorch/issues/95185

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95452
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-03-06 08:49:36 +00:00
Jason Ansel
43dd043ea7 Revert "[inductor] Improve error messages (#95567)" (#96014)
This reverts commit 62b775583f.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96014
Approved by: https://github.com/Chillee
2023-03-04 04:03:31 +00:00
Nikita Karetnikov
feffcafe09 [inductor] use FP div in CPP expr printer (#95698)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95698
Approved by: https://github.com/ezyang, https://github.com/jgong5
2023-03-03 20:38:18 +00:00
PyTorch MergeBot
879400e4e8 Revert "[inductor] Add an AOT compilation mode for Inductor CPP backend (#94822)"
This reverts commit 73b66098b2.

Reverted https://github.com/pytorch/pytorch/pull/94822 on behalf of https://github.com/clee2000 due to broke inductor_tmm_cpu_accuracy, 73b66098b2 (11745396725)
2023-03-03 17:33:27 +00:00
Bin Bao
73b66098b2 [inductor] Add an AOT compilation mode for Inductor CPP backend (#94822)
Summary: The AOT mode currently works for the CPP backend. When turned on, Inductor compiles the model code into a .so file with aot_inductor_entry as the entry function. If the AOT compilation fails, Inductor will explicitly fail.
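
Conceptually, the produced artifact could be consumed like this (an illustrative sketch only; the output path and the actual calling convention of the entry point are not specified here):

```python
# illustrative only: the .so exposes aot_inductor_entry as its entry function,
# as described in the summary above; the real ABI is defined by the generated
# C++ wrapper, and "aot_model.so" is a hypothetical output name
import ctypes

lib = ctypes.CDLL("./aot_model.so")
entry = lib.aot_inductor_entry
```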

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94822
Approved by: https://github.com/jansel
2023-03-03 14:18:09 +00:00
Jason Ansel
62b775583f [inductor] Improve error messages (#95567)
Example error message before/after (710 to 131 lines):
https://gist.github.com/jansel/6fecad057738089fa95bf08c3de9fc8a

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95567
Approved by: https://github.com/mlazos
2023-03-02 02:20:55 +00:00
Wang, Eikan
9da903f180 [Inductor] Fix the logical_and/logical_or vectorization issue (#95609)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95609
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-03-01 08:21:57 +00:00
Wang, Eikan
c1f5e50fd1 [Inductor] Vectorize channels-last adaptive_avg_pool2d (#95608)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95608
Approved by: https://github.com/jansel
2023-03-01 08:21:57 +00:00
Kazuaki Ishizaki
9a4cb9bcaf Fix typos under torch/_inductor directory (#95601)
This PR fixes typos in comments and messages of `.py` files under the `torch/_inductor` directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95601
Approved by: https://github.com/ezyang
2023-02-27 19:00:17 +00:00
XiaobingSuper
4846d52212 inductor: fix compiler error when trying to vectorize logical_and and logical_or (#95361)
Currently, `operator&&` and `operator||` don't have a vectorized implementation, so disable them for now as a quick fix for the 2.0 release.
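
The kind of graph that previously hit the missing vectorized `operator&&`/`operator||` might look like this (a hedged sketch):

```python
import torch
import torch._dynamo

def fn(x, y):
    # logical_and / logical_or on boolean masks; vectorization of these ops is
    # disabled by this PR until operator&& / operator|| get vector implementations
    return torch.logical_and(x > 0, y > 0) | torch.logical_or(x < 0, y < 0)

x = torch.randn(64)
y = torch.randn(64)
opt_fn = torch._dynamo.optimize("inductor")(fn)
opt_fn(x, y)
```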

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95361
Approved by: https://github.com/ngimel, https://github.com/EikanWang
2023-02-24 02:30:13 +00:00
XiaobingSuper
1d7133c542 inductor(cpu): fix C++ compile error when sigmoid's post ops is a reduction op (#94890)
For the timm **nfnet_l0** model, the CPU path has the following error: `torch._dynamo.exc.BackendCompilerFailed: inductor raised CppCompileError: C++ compile error`.

Here is a simple test case:

```
def fn(x):
    x = torch.ops.aten.sigmoid.default(x)
    return torch.ops.aten.mean.dim(x, [-1, -2], True)

x = torch.randn((1, 8, 8, 8))
opt_fn = torch._dynamo.optimize("inductor")(fn)
opt_fn(x)

real_out = fn(x)
compiled_out = opt_fn(x)
tol = 0.0001
print(torch.allclose(real_out, compiled_out, atol=tol, rtol=tol))

```

before:

```
extern "C" void kernel(float* __restrict__ in_out_ptr0,
                       const float* __restrict__ in_ptr0)
{
    auto out_ptr0 = in_out_ptr0;
    {
        #pragma GCC ivdep
        for(long i0=0; i0<8; i0+=1)
        {
            {
                #pragma omp declare reduction(+:at::vec::Vectorized<float>:omp_out += omp_in) initializer(omp_priv={{0}})
                float tmp2 = 0;
                auto tmp2_vec = at::vec::Vectorized<float>(tmp2);
                for(long i1=0; i1<4; i1+=1)
                {
                    auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + (16*i1) + (64*i0));
                    auto tmp1 = decltype(tmp0)(1)/(decltype(tmp0)(1) + tmp0.neg().exp());
                    tmp2_vec += tmp1;
                }
                #pragma omp simd simdlen(8)  reduction(+:tmp3)
                for(long i1=64; i1<64; i1+=1)
                {
                    auto tmp0 = in_ptr0[i1 + (64*i0)];
                    auto tmp1 = std::exp(-tmp0);
                    auto tmp2 = 1 / (1 + tmp1);
                    tmp3 += tmp2;
                }
                tmp2 += at::vec::vec_reduce_all<float>([](at::vec::Vectorized<float>& x, at::vec::Vectorized<float>&y) {return x + y;}, tmp2_vec);
                out_ptr0[i0] = tmp3;
            }
        }
    }
    {
        for(long i0=0; i0<0; i0+=1)
        {
            auto tmp0 = at::vec::Vectorized<float>::loadu(out_ptr0 + 16*i0);
            auto tmp1 = at::vec::Vectorized<float>(static_cast<float>(64));
            auto tmp2 = tmp0 / tmp1;
            tmp2.store(in_out_ptr0 + 16*i0);
        }
        #pragma omp simd simdlen(8)
        for(long i0=0; i0<8; i0+=1)
        {
            auto tmp0 = out_ptr0[i0];
            auto tmp1 = static_cast<float>(64);
            auto tmp2 = tmp0 / tmp1;
            in_out_ptr0[i0] = tmp2;
        }
    }
}
```

after:
```
extern "C" void kernel(float* __restrict__ in_out_ptr0,
                       const float* __restrict__ in_ptr0)
{
    auto out_ptr0 = in_out_ptr0;
    #pragma omp parallel num_threads(40)
    {
        {
            #pragma omp for
            for(long i0=0; i0<8; i0+=1)
            {
                {
                    #pragma omp declare reduction(+:at::vec::Vectorized<float>:omp_out += omp_in) initializer(omp_priv={{0}})
                    float tmp2 = 0;
                    auto tmp2_vec = at::vec::Vectorized<float>(tmp2);
                    for(long i1=0; i1<4; i1+=1)
                    {
                        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + (16*i1) + (64*i0));
                        auto tmp1 = decltype(tmp0)(1)/(decltype(tmp0)(1) + tmp0.neg().exp());
                        tmp2_vec += tmp1;
                    }
                    #pragma omp simd simdlen(8)  reduction(+:tmp2)
                    for(long i1=64; i1<64; i1+=1)
                    {
                        auto tmp0 = in_ptr0[i1 + (64*i0)];
                        auto tmp1 = decltype(tmp0)(1) / (decltype(tmp0)(1) + std::exp(-tmp0));
                        tmp2 += tmp1;
                    }
                    tmp2 += at::vec::vec_reduce_all<float>([](at::vec::Vectorized<float>& x, at::vec::Vectorized<float>&y) {return x + y;}, tmp2_vec);
                    out_ptr0[i0] = tmp2;
                }
            }
        }
        #pragma omp single
        {
            {
                for(long i0=0; i0<0; i0+=1)
                {
                    auto tmp0 = at::vec::Vectorized<float>::loadu(out_ptr0 + 16*i0);
                    auto tmp1 = at::vec::Vectorized<float>(static_cast<float>(64));
                    auto tmp2 = tmp0 / tmp1;
                    tmp2.store(in_out_ptr0 + 16*i0);
                }
                #pragma omp simd simdlen(8)
                for(long i0=0; i0<8; i0+=1)
                {
                    auto tmp0 = out_ptr0[i0];
                    auto tmp1 = static_cast<float>(64);
                    auto tmp2 = tmp0 / tmp1;
                    in_out_ptr0[i0] = tmp2;
                }
            }
        }
    }
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94890
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/lezcano
2023-02-15 17:13:45 +00:00
Fabio Rocha
1dbaa5c290 Use decompositions for some fallbacks introduced in #94039 (#94206)
In some cases, this implements the required inductor primitives.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94206
Approved by: https://github.com/jansel, https://github.com/ngimel
2023-02-14 09:31:30 +00:00
Xuehai Pan
5b1cedacde [BE] [2/3] Rewrite super() calls in functorch and torch (#94588)
Rewrite Python built-in class `super()` calls. Only non-semantic changes should be applied.

- #94587
- #94588
- #94592

Also, methods with only a `super()` call are removed:

```diff
class MyModule(nn.Module):
-   def __init__(self):
-       super().__init__()
-
    def forward(self, ...):
        ...
```

Cases where the rewrite would change the semantics are kept unchanged. E.g.:

f152a79be9/caffe2/python/net_printer.py (L184-L190)

f152a79be9/test/test_jit_fuser_te.py (L2628-L2635)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94588
Approved by: https://github.com/ezyang, https://github.com/albanD
2023-02-10 21:16:33 +00:00
XiaobingSuper
02b8a7f473 inductor: don't do transpose vectorization if the input ld depends on the innermost var (#94493)
Fixed https://github.com/pytorch/pytorch/issues/94269.

For the following case:

```
import torch
import torchvision
#import intel_extension_for_pytorch

import torch._dynamo
from torch._inductor import config

class Model(torch.nn.Module):
    def __init__(self):
        super(Model, self).__init__()

    def forward(self, x):
        constant_pad_nd = x
        # File: /home/xiaobing/miniconda3/envs/pytorch_te_binary/lib/python3.8/site-packages/timm/models/layers/halo_attn.py:195, code: kv = kv.unfold(2, self.win_size, self.block_size).unfold(3, self.win_size, self.block_size)
        as_strided: f32[1, 384, 2, 20, 12] = torch.ops.aten.as_strided.default(constant_pad_nd, [1, 384, 2, 20, 12], [153600, 1, 61440, 384, 7680]);  constant_pad_nd = None
        as_strided_1: f32[1, 384, 2, 2, 12, 12] = torch.ops.aten.as_strided.default(as_strided, [1, 384, 2, 2, 12, 12], [153600, 1, 61440, 3072, 7680, 384]);  as_strided = None

        # File: /home/xiaobing/miniconda3/envs/pytorch_te_binary/lib/python3.8/site-packages/timm/models/layers/halo_attn.py:197, code: kv = kv.reshape(
        clone_1: f32[1, 384, 2, 2, 12, 12] = torch.ops.aten.clone.default(as_strided_1, memory_format = torch.contiguous_format);  as_strided_1 = None
        _unsafe_view_1: f32[8, 48, 4, 144] = torch.ops.aten._unsafe_view.default(clone_1, [8, 48, 4, 144]);  clone_1 = None
        permute_2: f32[8, 4, 144, 48] = torch.ops.aten.permute.default(_unsafe_view_1, [0, 2, 3, 1]);  _unsafe_view_1 = None
        # File: /home/xiaobing/miniconda3/envs/pytorch_te_binary/lib/python3.8/site-packages/timm/models/layers/halo_attn.py:202, code: k, v = torch.split(kv, [self.dim_head_qk, self.dim_head_v], dim=-1)
        split_with_sizes = torch.ops.aten.split_with_sizes.default(permute_2, [16, 32], -1);  permute_2 = None
        getitem: f32[8, 4, 144, 16] = split_with_sizes[0]
        getitem_1: f32[8, 4, 144, 32] = split_with_sizes[1];  split_with_sizes = None
        permute_3: f32[8, 4, 16, 144] = torch.ops.aten.permute.default(getitem, [0, 1, 3, 2]);  getitem = None
        expand_1: f32[8, 4, 16, 144] = torch.ops.aten.expand.default(permute_3, [8, 4, 16, 144]);  permute_3 = None
        clone_3: f32[8, 4, 16, 144] = torch.ops.aten.clone.default(expand_1, memory_format = torch.contiguous_format);  expand_1 = None
        return clone_3

model = Model().eval()
opt_model = torch._dynamo.optimize('inductor')(model)
x = torch.randn(1, 384, 20, 20).to(memory_format=torch.channels_last)

ref = model(x)

with torch.no_grad():
    for i in range(3):
        out = opt_model(x)

print(torch.equal(ref, out))
```

The generated code before this PR is:

```
from ctypes import c_void_p, c_long
import torch
import random
from torch import empty_strided, as_strided, device
from torch._inductor.codecache import AsyncCompile
from torch._inductor.select_algorithm import extern_kernels

aten = torch.ops.aten
assert_size_stride = torch._C._dynamo.guards.assert_size_stride
async_compile = AsyncCompile()

kernel_cpp_0 = async_compile.cpp('''
#include "/tmp/torchinductor_xiaobing/ni/cniims6nap7c5wars7cmtbjr3mw6b5cxyoyxmsu7ro2l5fkrwatl.h"
extern "C" void kernel(const float* __restrict__ in_ptr0,
                       float* __restrict__ out_ptr0)
{
    {
        #pragma GCC ivdep
        for(long i0=0; i0<8; i0+=1)
        {
            #pragma GCC ivdep
            for(long i1=0; i1<4; i1+=1)
            {
                #pragma GCC ivdep
                for(long i2=0; i2<1; i2+=1)
                {
                    #pragma GCC ivdep
                    for(long i3=0; i3<9; i3+=1)
                    {
                        float tmp0[16*16] __attribute__ ((aligned (16)));
                        at::vec::transpose_mxn<float,16,16>(in_ptr0 + (16*i2) + (48*i0) + (384*((16*i3) % 12)) + (3072*(i1 % 2)) + (7680*(((4*i3) / 3))) + (61440*(i1 / 2)), ((-7680)*(i3 / 12)) + ((-384)*(i3 % 12)) + (384*((1 + i3) % 12)) + (7680*(((1 + i3) / 12))), tmp0, 16);
                        for (long i2_inner = 0; i2_inner < 16; i2_inner++)
                        {
                            auto tmp1 = at::vec::Vectorized<float>::loadu(tmp0 + 16*i2_inner);
                            tmp1.store(out_ptr0 + (16*i3) + (144*i2_inner) + (2304*i1) + (2304*i2) + (9216*i0));
                        }
                    }
                    #pragma GCC ivdep
                    for(long i3=144; i3<144; i3+=1)
                    {
                        for (long i2_inner = 0; i2_inner < 16; i2_inner++)
                        {
                            auto tmp0 = in_ptr0[i2_inner + (16*i2) + (48*i0) + (384*(i3 % 12)) + (3072*(i1 % 2)) + (7680*(i3 / 12)) + (61440*(i1 / 2))];
                            out_ptr0[i3 + (144*i2_inner) + (2304*i1) + (2304*i2) + (9216*i0)] = tmp0;
                        }
                    }
                }
                #pragma GCC ivdep
                for(long i2=16; i2<16; i2+=1)
                {
                    #pragma GCC ivdep
                    for(long i3=0; i3<144; i3+=1)
                    {
                        auto tmp0 = in_ptr0[i2 + (48*i0) + (384*(i3 % 12)) + (3072*(i1 % 2)) + (7680*(i3 / 12)) + (61440*(i1 / 2))];
                        out_ptr0[i3 + (144*i2) + (2304*i1) + (9216*i0)] = tmp0;
                    }
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, = args
    args.clear()
    buf0 = empty_strided((8, 4, 16, 144), (9216, 2304, 144, 1), device='cpu', dtype=torch.float32)
    kernel_cpp_0(c_void_p(arg0_1.data_ptr()), c_void_p(buf0.data_ptr()))
    del arg0_1
    return (buf0, )
```

After:

```
from ctypes import c_void_p, c_long
import torch
import random
from torch import empty_strided, as_strided, device
from torch._inductor.codecache import AsyncCompile
from torch._inductor.select_algorithm import extern_kernels

aten = torch.ops.aten
assert_size_stride = torch._C._dynamo.guards.assert_size_stride
async_compile = AsyncCompile()

kernel_cpp_0 = async_compile.cpp('''
#include "/tmp/torchinductor_xiaobing/dm/cdmaihqxwe73zkb3he2zizktpq5uujetg2db26c3r4lgsmlx3b4c.h"
extern "C" void kernel(const float* __restrict__ in_ptr0,
                       float* __restrict__ out_ptr0)
{
    {
        #pragma GCC ivdep
        for(long i0=0; i0<8; i0+=1)
        {
            #pragma GCC ivdep
            for(long i1=0; i1<4; i1+=1)
            {
                #pragma GCC ivdep
                for(long i2=0; i2<16; i2+=1)
                {
                    #pragma GCC ivdep
                    for(long i3=0; i3<144; i3+=1)
                    {
                        auto tmp0 = in_ptr0[i2 + (48*i0) + (384*(i3 % 12)) + (3072*(i1 % 2)) + (7680*(i3 / 12)) + (61440*(i1 / 2))];
                        out_ptr0[i3 + (144*i2) + (2304*i1) + (9216*i0)] = tmp0;
                    }
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, = args
    args.clear()
    buf0 = empty_strided((8, 4, 16, 144), (9216, 2304, 144, 1), device='cpu', dtype=torch.float32)
    kernel_cpp_0(c_void_p(arg0_1.data_ptr()), c_void_p(buf0.data_ptr()))
    del arg0_1
    return (buf0, )

if __name__ == "__main__":
    from torch._dynamo.testing import rand_strided
    from torch._inductor.utils import print_performance
    arg0_1 = rand_strided((1, 384, 20, 20), (153600, 1, 7680, 384), device='cpu', dtype=torch.float32)
    print_performance(lambda: call([arg0_1]))

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94493
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/EikanWang
2023-02-10 09:04:45 +00:00
Wang, Eikan
1767026d1e Abstract the optimization context information as a dedicated class to better organize the code (#92057)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92057
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/jansel
2023-02-08 08:25:22 +00:00
Wang, Eikan
88ef4739b2 Check the semantic of loading the mask value (#91755)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91755
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-02-08 02:34:22 +00:00
Jiong Gong
a28a062938 [Inductor] Fix CPU vectorized implementation of mask calculation that breaks torch.where (#93922)
Fix https://github.com/pytorch/pytorch/issues/93374

The cause of the issue is that the original vectorized float mask calculation doesn't consider the broadcast case. This PR adds the support.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93922
Approved by: https://github.com/XiaobingSuper, https://github.com/desertfire, https://github.com/jansel
2023-02-07 11:30:21 +00:00
Wang, Eikan
9895c19a7a To vectorize long datatype as mask index (#91076)
In this PR, we record the fx node currently being executed to cache additional information and simplify the vectorization checker. In addition, we support `masked` in this PR by simplifying it to a `mask_load`, which enables `max_pool2d`.
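
A sketch of the enabled pattern (a hypothetical example; the pooling parameters are illustrative):

```python
import torch
import torch.nn.functional as F
import torch._dynamo

def fn(x):
    # max_pool2d with padding lowers to masked loads over the pooling window,
    # exercising the long-dtype mask index path this PR vectorizes
    return F.max_pool2d(x, kernel_size=3, stride=2, padding=1)

x = torch.randn(1, 16, 56, 56)
opt_fn = torch._dynamo.optimize("inductor")(fn)
opt_fn(x)
```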

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91076
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/jansel
2023-02-05 03:36:22 +00:00
Liao, Xuan
11de399447 [inductor] fix cpu implement of torch.neg (#94035)
Fixes #93380

Fix to maintain the data type after applying neg.
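
A hypothetical repro of the dtype issue (not taken from the linked issue; the dtype is chosen only for illustration):

```python
import torch
import torch._dynamo

def fn(x):
    # the result of neg should keep the input dtype
    return torch.neg(x)

x = torch.randint(-10, 10, (64,), dtype=torch.int32)
opt_fn = torch._dynamo.optimize("inductor")(fn)
assert opt_fn(x).dtype == x.dtype
```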

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94035
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/jansel
2023-02-04 03:13:11 +00:00
Peter Bell
b7a5c79399 [inductor] Fix type inference in CPU masked operations (#93842)
Fixes #93351

The existing code guesses that `tmp3` is probably a `float`, and so truncates
any `double` values

```cpp
float tmp3 = 0.0;
if(tmp2)
{
    auto tmp4 = in_ptr0[i0];
    tmp3 = tmp4;
}
```

The proposed change is to generate a lambda expression that represents the body
of the masked operation, and infer the type from the return value:
```cpp
auto tmp3 = [&]
{
    auto tmp4 = in_ptr0[i0];
    return tmp4;
}
;
auto tmp5 = tmp2 ? tmp3() : static_cast<decltype(tmp3())>(0.0);
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93842
Approved by: https://github.com/jgong5, https://github.com/Valentine233, https://github.com/jansel
2023-02-02 22:42:19 +00:00
Jason Ansel
45eadc2c4d ConfigModule for _{dynamo,inductor}.config (#93252)
This refactors the way dynamo/inductor configs are handled to check for invalid configs and add options like patching and serialization.
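
For example, temporarily patching an inductor config option might look like this (a hedged sketch; the specific flag is only illustrative):

```python
import torch._inductor.config as inductor_config

# ConfigModule-style patching: the override applies inside the context and is
# restored afterwards; "debug" is just an illustrative flag
with inductor_config.patch(debug=True):
    pass  # compile / run something here
```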

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93252
Approved by: https://github.com/voznesenskym
2023-02-01 19:38:05 +00:00
min-jean-cho
68a40a47a0 [Inductor] Lower aten.tan (#92837)
Related #92047

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92837
Approved by: https://github.com/jgong5, https://github.com/lezcano
2023-01-24 16:35:40 +00:00
Horace He
19c9b09449 Replace IndexingDiv with FloorDiv in Inductor (#92878)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92878
Approved by: https://github.com/ezyang
2023-01-24 15:06:22 +00:00
Peter Bell
5644059489 [inductor] Lower torch.exp2 and use it for torch.pow(2, x) (#92632)
Before
```python
    tmp0 = 2.0
    tmp2 = tl.libdevice.pow(tmp0, tmp1)
```

After
```python
    tmp1 = tl.libdevice.exp2(tmp0)
```

I've benchmarked on CPU and CUDA with the following examples
```
@torch._dynamo.optimize()
def exp2(x):
    return torch.pow(2, x)

@torch._dynamo.optimize()
def logaddexp2(a, b):
    m = torch.maximum(a, b)
    return m + torch.log2(1 + torch.pow(2, -torch.abs(a-b)))
```

On CUDA, Triton is able to specialize `pow(2, x)` so this makes
no difference, but on CPU I see a surprisingly large speedup.

| device | Function  | Master (us) | This PR (us) | Speedup |
|--------|-----------|-------------|--------------|---------|
| CUDA   | exp2      | 64          | 63           | 1.0     |
|        | logaddexp | 109         | 107          | 1.0     |
| CPU    | exp2      | 220         | 40           | 5.5     |
|        | logaddexp | 282         | 140          | 2.0     |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92632
Approved by: https://github.com/lezcano, https://github.com/ngimel
2023-01-20 22:06:23 +00:00
Liao, Xuan
119d5e425c [Inductor] decompose expm1 for CPP vec (#92289)
For the micro-benchmark op `aten.elu.default` in TIMM, performance is not good even with vectorization. `Elu` uses `expm1` as a sub-op. It turns out that Inductor invokes the sleef `expm1` function while aten decomposes it as `exp - 1`, and the former performs worse than the latter. This PR decomposes `expm1` for cpp vectorization to bring the performance back.
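
A sketch of an op that benefits (illustrative; the decomposition itself happens in the cpp vec codegen):

```python
import torch
import torch.nn.functional as F
import torch._dynamo

def fn(x):
    # elu uses expm1 internally; on the cpp vectorized path expm1 is now
    # decomposed as exp(x) - 1 instead of calling the sleef expm1 routine
    return F.elu(x)

x = torch.randn(1024)
opt_fn = torch._dynamo.optimize("inductor")(fn)
opt_fn(x)
```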

Performance data for eager v.s. inductor:

suite | improved_ratio_speedup | speedup_old | RSD(3) | speedup_new | RSD(3)
-- | -- | -- | -- | -- | --
timm | 114.38% | 0.803447768 | 8.39% | 1.722458 | 27.74%


Pull Request resolved: https://github.com/pytorch/pytorch/pull/92289
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-01-20 05:29:32 +00:00
Peter Bell
30f2026863 [inductor] Promote half-precision CPU constants to float (#91224)
Currently `aten.where` can fail with the following C++ compiler error:
```
error: operands to '?:' have different types 'c10::Half' and 'float'
```

This happens because `ops.load` is overridden to cast Half inputs to float, but
`ops.constant` will load a Half without promoting to float.
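
A pattern of the kind that hit this (a hypothetical sketch, not the repro from the PR):

```python
import torch
import torch._dynamo

def fn(x):
    # mixes a loaded half value with a scalar constant inside where; without
    # promoting the constant to float the generated ?: has mismatched types
    return torch.where(x > 0, x, 0.5)

x = torch.randn(64, dtype=torch.half)
opt_fn = torch._dynamo.optimize("inductor")(fn)
opt_fn(x)
```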

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91224
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/ngimel
2023-01-18 01:04:36 +00:00
Jiong Gong
2eaa7a25d0 Fix model accuracy issue caused by vectorized transpose (#92299)
Fix accuracy issues from models: jx_nest_base, cait_m36_384, XLNetLMHeadModel, Super_SloMo
https://github.com/pytorch/torchdynamo/issues/2038
https://github.com/pytorch/torchdynamo/issues/2037
https://github.com/pytorch/torchdynamo/issues/2036
https://github.com/pytorch/torchdynamo/issues/2035

The inner loop list should be newly created in loop.clone().
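
The bug class is plain Python list aliasing; a hedged illustration follows (the names are hypothetical, not the actual Inductor classes):

```python
# hypothetical illustration of the aliasing bug class: clone() must create a
# new inner-loop list rather than sharing the original object's list
class LoopNest:
    def __init__(self, inner=None):
        self.inner = inner if inner is not None else []

    def clone(self):
        # fresh list, not a reference to self.inner
        return LoopNest(inner=list(self.inner))
```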

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92299
Approved by: https://github.com/desertfire
2023-01-17 17:53:45 +00:00