Commit Graph

79 Commits

Author SHA1 Message Date
Jiong Gong
5d62d12557 [Inductor] support transpose vertical reduction in cpp (#97781)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97781
Approved by: https://github.com/jansel
2023-04-03 02:02:15 +00:00
Jiong Gong
bf22ecba2a [Inductor] support vertical reduction in cpp (#97644)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97644
Approved by: https://github.com/jansel
2023-04-03 01:29:12 +00:00
Jiong Gong
8e5f491623 [Inductor] simplify CPP backend Tile2D code and support non-contiguous load/store (#97626)
Remove `CppTile2DTailKernel` and `CppTile2DKernelChecker`, reusing `CppVecKernel` and `CppVecKernelChecker` instead. Add vectorization with fallback for load/store in `CppVecKernel` to cover the non-contiguous load/store needed by `CppTile2DTailKernel`.

This PR also adds functional support for transposed copy of bfloat16 data. Better performance requires vectorized intrinsics implemented for `at::vec::transpose_mxn`. cc @soumith @voznesenskym @penguinwu @anijain2305 @EikanWang @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @desertfire

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97626
Approved by: https://github.com/jansel
2023-04-03 01:11:20 +00:00
chunyuan
004bb34f42 inductor: fix vision_maskrcnn dynamic_shapes error on CPU (#97312)
Fix several C++ compilation errors for `vision_maskrcnn` in dynamic_shapes cases:
1. convert `ceiling` to `std::ceil` in `CppPrinter` (see the sketch after this list):
```bash
error: ‘ceiling’ was not declared in this scope
   17 |                 for(long i1=0; i1<ceiling(1.8735363483429*ks1); i1+=1)
```

2. convert index in `store` to `INDEX_TYPE`:
```bash
error: invalid types ‘float*[double]’ for array subscript
   52 |                         out_ptr0[i2 + (i1*(floor(1.8735363483429*ks2))) + (i0*(std::ceil((1.87353634834290*ks1)))*(floor(1.8735363483429*ks2)))] = tmp30;
```

3. convert loop offset, size, and steps to `INDEX_TYPE`:
```bash
error: invalid controlling predicate
   16 |                 for(long i1=0; i1<std::ceil((1.87353634834290*ks1)); i1+=1)
```

4. convert index in `load` to `INDEX_TYPE`:
```bash
error: invalid types ‘float*[double]’ for array subscript
   64 |                     auto tmp0 = out_ptr0[i1 + (i0*(floor(1.8735363483429*ks2)))];
```
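
A rough sketch of the first and fourth ideas combined (hypothetical helper names, not the actual `CppPrinter` code): map sympy's `ceiling` to `std::ceil` when printing, and cast the printed index expression to an integral `INDEX_TYPE` before it is used as a subscript or loop bound.

```python
# Illustrative only: print a sympy index expression as C++, mapping ceiling()
# to std::ceil and casting the result to an integral index type so it is valid
# as an array subscript or loop bound. INDEX_TYPE and as_index are assumed
# names for this sketch, not the actual torch._inductor code.
import sympy
from sympy import ccode

INDEX_TYPE = "long"

def as_index(expr: sympy.Expr) -> str:
    printed = ccode(expr, user_functions={"ceiling": "std::ceil"})
    # std::ceil returns a floating-point value, so an explicit cast is still
    # needed before the expression can index a float* buffer.
    return f"static_cast<{INDEX_TYPE}>({printed})"

ks1 = sympy.Symbol("ks1", positive=True)
print(as_index(sympy.ceiling(2 * ks1)))
# -> something like: static_cast<long>(std::ceil(2*ks1))
```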

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97312
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/jansel
2023-03-29 10:24:57 +00:00
Jason Ansel
bc86af0d37 Remove DeferredIndentedBuffer (#97616)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97616
Approved by: https://github.com/desertfire
2023-03-28 23:13:41 +00:00
haozhe.zhu
a1ada050f8 do not insert to_dtype for memory copy only buffers (#97147)
Remove redundant `to_dtype` nodes for buffers that are only memory copies, e.g.
`load_bf16 + to_fp32 + to_bf16 + store_bf16` => `load_bf16 + store_bf16`
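
A toy model of the pattern being removed (the helper below is hypothetical, not the actual Inductor pass): when a buffer is only a memory copy, the `to_fp32`/`to_bf16` round trip is a no-op and both conversions can be dropped.

```python
# Illustrative sketch: drop a to(float32) immediately followed by to(bfloat16)
# in a buffer that is otherwise just copied. Not part of torch._inductor.
def elide_dtype_roundtrip(ops):
    """ops: list of (op, dtype) pairs describing one buffer's access chain."""
    out, i = [], 0
    while i < len(ops):
        if (i + 1 < len(ops)
                and ops[i] == ("to_dtype", "float32")
                and ops[i + 1] == ("to_dtype", "bfloat16")):
            i += 2  # skip the redundant round trip
        else:
            out.append(ops[i])
            i += 1
    return out

chain = [("load", "bfloat16"), ("to_dtype", "float32"),
         ("to_dtype", "bfloat16"), ("store", "bfloat16")]
print(elide_dtype_roundtrip(chain))
# [('load', 'bfloat16'), ('store', 'bfloat16')]
```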

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97147
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/jansel
2023-03-27 14:55:41 +00:00
Liao, Xuan
a331cd4314 [inductor] fix cpp legalize bf16 reduction (#97228)
When legalizing bf16 for reductions, operators whose result dtype is torch.int64, such as argmax, currently hit an assertion error. This PR handles the int64 case, enabling several bf16 models (hf_Reformer, doctr_reco_predictor) to run successfully.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97228
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/desertfire
2023-03-23 08:52:25 +00:00
Jiong Gong
35439e8610 [Inductor] add guards to guarantee vector int32 only used by comparison ops (for masked load) (#97144)
Fix https://github.com/pytorch/pytorch/issues/97124 and https://github.com/pytorch/pytorch/issues/97127

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97144
Approved by: https://github.com/EikanWang, https://github.com/jansel
2023-03-23 03:12:50 +00:00
Wang, Eikan
c5d7ed9423 [Inductor] Fix the issue that cannot pass lint check for debug mode (#97249)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97249
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/jansel
2023-03-22 04:25:44 +00:00
Jiong Gong
4733de18fd [Inductor] Add debug logging to explain reasons of disabling vectorization (#97108)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97108
Approved by: https://github.com/EikanWang, https://github.com/jansel
2023-03-22 02:38:34 +00:00
Wang, Eikan
915cbf8208 [Inductor] Eliminate redundant to_dtype node (#96650)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96650
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-03-18 01:51:38 +00:00
Bin Bao
f03db8d6cb [reland2][inductor] Add an AOT compilation mode for Inductor CPP backend (#96520)
Summary: This is a reland of https://github.com/pytorch/pytorch/pull/94822.
Solved the long compilation issue for inductor cpp tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96520
Approved by: https://github.com/huydhn, https://github.com/malfet
2023-03-14 16:10:54 +00:00
Wang, Eikan
bdd09e68e4 [Inductor] Legalize BF16 (#96183)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96183
Approved by: https://github.com/jansel, https://github.com/jgong5
2023-03-14 10:16:15 +00:00
Jiong Gong
190e284bd3 [Inductor] apply vec float mask on logical comparison ops in cpp (#96502)
Fix https://github.com/pytorch/pytorch/issues/96446
The root cause is that the logical comparison op works on the integer vector which is later used in the `where` op that expects a float vector.
1. Make sure a float vec mask is applied to logical comparison ops.
2. Fix the vec int specialization for `to_float_mask`: assume an int mask as input and return the float mask via reinterpret cast.
3. Add a no-op specialization of `to_float_mask` for float vec input.
4. Pass `to_float_mask` arguments by value instead of by reference; passing by value should be efficient enough.
5. Remove a conditional check `!=0` in `masked()` since `to_float_mask` is guaranteed to return a float mask.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96502
Approved by: https://github.com/EikanWang, https://github.com/XiaobingSuper, https://github.com/jansel
2023-03-14 08:47:14 +00:00
XiaobingSuper
279ada515a inductor(cpu): make variable numbering used by the masked vectorization path align with the scalar path (#96510)
Fix https://github.com/pytorch/pytorch/issues/96484. The CPP reduction vectorization path assumes that the variable numbering used in the vectorized path stays aligned with the scalar path, but `masked` currently breaks that assumption and leads to a "variable not defined" error.

before:
```
{
    {
        {
            #pragma omp declare reduction(min:at::vec::Vectorized<float>:omp_out = at::vec::minimum(omp_out, omp_in)) initializer(omp_priv={{std::numeric_limits<float>::infinity()}})
            float tmp7 = std::numeric_limits<float>::infinity();
            auto tmp7_vec = at::vec::Vectorized<float>(tmp7);
            for(long i0=0; i0<0; i0+=1)
            {
                auto tmp5 = at::vec::Vectorized<float>::loadu(in_ptr1 + 16*i0);
                auto tmp0 = at::vec::Vectorized<int>(static_cast<int>(0));
                auto tmp1 = at::vec::Vectorized<int>(static_cast<int>(2));
                auto tmp2 = tmp0 < tmp1;
                auto tmp3 = at::vec::Vectorized<float>(0.0);
                {
                    auto tmp4 = at::vec::Vectorized<float>(in_ptr0[0]);
                    tmp3 = decltype(tmp4)::blendv(tmp3, tmp4, to_float_mask(tmp2) != at::vec::Vectorized<float>(0));
                }
                auto tmp6 = tmp3 + tmp5;
                tmp7_vec = at::vec::minimum(tmp7_vec, tmp6);
            }
            #pragma omp simd simdlen(8)  reduction(min:tmp8)
            for(long i0=0; i0<2; i0+=1)
            {
                auto tmp6 = in_ptr1[i0];
                auto tmp0 = static_cast<long>(0);
                auto tmp1 = static_cast<long>(2);
                auto tmp2 = tmp0 < tmp1;
                auto tmp3 = [&]
                {
                    auto tmp4 = in_ptr0[0];
                    return tmp4;
                }
                ;
                auto tmp5 = tmp2 ? tmp3() : static_cast<decltype(tmp3())>(0.0);
                auto tmp7 = tmp5 + tmp6;
                tmp8 = std::min(tmp8, tmp7);
            }
            tmp7 = std::min(tmp7, at::vec::vec_reduce_all<float>([](at::vec::Vectorized<float>& x, at::vec::Vectorized<float>&y) {return at::vec::minimum(x, y);}, tmp7_vec));
            out_ptr0[0] = tmp8;
        }
    }
}
```
after:

```
{
    {
        {
            #pragma omp declare reduction(min:at::vec::Vectorized<float>:omp_out = at::vec::minimum(omp_out, omp_in)) initializer(omp_priv={{std::numeric_limits<float>::infinity()}})
            float tmp8 = std::numeric_limits<float>::infinity();
            auto tmp8_vec = at::vec::Vectorized<float>(tmp8);
            for(long i0=0; i0<0; i0+=1)
            {
                auto tmp6 = at::vec::Vectorized<float>::loadu(in_ptr1 + 16*i0);
                auto tmp0 = at::vec::Vectorized<int>(static_cast<int>(0));
                auto tmp1 = at::vec::Vectorized<int>(static_cast<int>(2));
                auto tmp2 = tmp0 < tmp1;
                auto tmp3 = [&]
                {
                    auto tmp4 = at::vec::Vectorized<float>(in_ptr0[0]);
                    return tmp4;
                }
                ;
                auto tmp5 = decltype(tmp3())::blendv(at::vec::Vectorized<float>(0.0), tmp3(), to_float_mask(tmp2) != at::vec::Vectorized<float>(0));
                auto tmp7 = tmp5 + tmp6;
                tmp8_vec = at::vec::minimum(tmp8_vec, tmp7);
            }
            #pragma omp simd simdlen(8)  reduction(min:tmp8)
            for(long i0=0; i0<2; i0+=1)
            {
                auto tmp6 = in_ptr1[i0];
                auto tmp0 = static_cast<long>(0);
                auto tmp1 = static_cast<long>(2);
                auto tmp2 = tmp0 < tmp1;
                auto tmp3 = [&]
                {
                    auto tmp4 = in_ptr0[0];
                    return tmp4;
                }
                ;
                auto tmp5 = tmp2 ? tmp3() : static_cast<decltype(tmp3())>(0.0);
                auto tmp7 = tmp5 + tmp6;
                tmp8 = std::min(tmp8, tmp7);
            }
            tmp8 = std::min(tmp8, at::vec::vec_reduce_all<float>([](at::vec::Vectorized<float>& x, at::vec::Vectorized<float>&y) {return at::vec::minimum(x, y);}, tmp8_vec));
            out_ptr0[0] = tmp8;
        }
    }
}

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96510
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/jansel
2023-03-13 09:41:23 +00:00
PyTorch MergeBot
fe05266fda Revert "[reland][inductor] Add an AOT compilation mode for Inductor CPP backend (#95985)"
This reverts commit deaf9e5e65.

Reverted https://github.com/pytorch/pytorch/pull/95985 on behalf of https://github.com/huydhn due to: Sorry for reverting this. It increased the test time significantly for ASAN (and maybe other test shards). ASAN tests on the PR passed but were barely not timing out. I have updated my initial findings in https://github.com/pytorch/pytorch/issues/96378
2023-03-09 01:45:24 +00:00
Bin Bao
deaf9e5e65 [reland][inductor] Add an AOT compilation mode for Inductor CPP backend (#95985)
Summary: This is a reland of https://github.com/pytorch/pytorch/pull/94822

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95985
Approved by: https://github.com/jansel
2023-03-08 20:02:32 +00:00
Jason Ansel
fe4fec37a4 [inductor] Refactor IR printing (#96024)
Reland #95567 part 2. The previous version of this had a bug that the added test triggers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96024
Approved by: https://github.com/ngimel
2023-03-07 02:23:06 +00:00
Liao, Xuan
e168dbb90a [inductor] improve cpp vec implementation of square (#96072)
The current cpp vectorized implementation of `square` is not efficient. It also affects the performance of `batch normalization`, which uses `square` when calculating variance. This PR replaces the power call with a multiplication to gain more performance.
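
A small sketch of the idea (the helper names are made up for illustration; the real change is in the CPP vec codegen): emit a multiply instead of a power call when the exponent is 2.

```python
# Illustrative only: specialize pow(x, 2) to x * x in the emitted C++ text.
def square_expr(x: str) -> str:
    return f"{x} * {x}"          # avoids calling a pow routine

def pow_expr(x: str, exp: int) -> str:
    if exp == 2:
        return square_expr(x)    # the case batch norm's variance hits
    return f"{x}.pow({exp})"

print(pow_expr("tmp0", 2))  # tmp0 * tmp0
```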

Micro-benchmark performance for eager vs. inductor:
op=`aten.native_batch_norm.default`

suite | improvement_0.2 | improvement_0.5 | improvement_0.8 | current_speedup_0.2 | new_speedup_0.2 | current_speedup_0.5 | new_speedup_0.5 | current_speedup_0.8 | new_speedup_0.8
-- | -- | -- | -- | -- | -- | -- | -- | -- | --
torchbench | 8.82% | 5.53% | 32.19% | 0.608006834 | 0.661613139 | 0.691743711 | 0.729987622 | 0.76176223 | 1.00694842
timm | 59.30% | 63.01% | 94.77% | 0.650648524 | 1.036498047 | 0.676425152 | 1.102667387 | 0.695693384 | 1.354992423

Model training performance for eager vs. inductor:

model | improvement | current_speedup | new_speedup
-- | -- | -- | --
lcnet_050 multi-thread | 5.16% | 1.046 | 1.1
lcnet_050 single-thread | 21.81% | 0.94 | 1.145
mobilenet_v2 multi-thread | 3.88% | 1.135 | 1.179
mobilenet_v2 single-thread | 37.46% | 0.929 | 1.277

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96072
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/desertfire
2023-03-07 01:13:39 +00:00
zhuhaozhe
ebaf9af76e use float to init reduction value (#95452)
Fixes https://github.com/pytorch/pytorch/issues/95195, https://github.com/pytorch/pytorch/issues/95185

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95452
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-03-06 08:49:36 +00:00
Jason Ansel
43dd043ea7 Revert "[inductor] Improve error messages (#95567)" (#96014)
This reverts commit 62b775583f.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96014
Approved by: https://github.com/Chillee
2023-03-04 04:03:31 +00:00
Nikita Karetnikov
feffcafe09 [inductor] use FP div in CPP expr printer (#95698)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95698
Approved by: https://github.com/ezyang, https://github.com/jgong5
2023-03-03 20:38:18 +00:00
PyTorch MergeBot
879400e4e8 Revert "[inductor] Add an AOT compilation mode for Inductor CPP backend (#94822)"
This reverts commit 73b66098b2.

Reverted https://github.com/pytorch/pytorch/pull/94822 on behalf of https://github.com/clee2000 due to broke inductor_tmm_cpu_accuracy, 73b66098b2 (11745396725)
2023-03-03 17:33:27 +00:00
Bin Bao
73b66098b2 [inductor] Add an AOT compilation mode for Inductor CPP backend (#94822)
Summary: The AOT mode currently works for the CPP backend. When turned on, Inductor compiles the model code into a .so file with aot_inductor_entry as the entry function. If the AOT compilation fails, Inductor will explicitly fail.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94822
Approved by: https://github.com/jansel
2023-03-03 14:18:09 +00:00
Jason Ansel
62b775583f [inductor] Improve error messages (#95567)
Example error message before/after (710 to 131 lines):
https://gist.github.com/jansel/6fecad057738089fa95bf08c3de9fc8a

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95567
Approved by: https://github.com/mlazos
2023-03-02 02:20:55 +00:00
Wang, Eikan
9da903f180 [Inductor] Fix the logical_and/logical_or vectorization issue (#95609)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95609
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-03-01 08:21:57 +00:00
Wang, Eikan
c1f5e50fd1 [Inductor] Vectorize channels-last adaptive_avg_pool2d (#95608)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95608
Approved by: https://github.com/jansel
2023-03-01 08:21:57 +00:00
Kazuaki Ishizaki
9a4cb9bcaf Fix typos under torch/_inductor directory (#95601)
This PR fixes typos in comments and messages of `.py` files under the `torch/_inductor` directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95601
Approved by: https://github.com/ezyang
2023-02-27 19:00:17 +00:00
XiaobingSuper
4846d52212 inductor: fix compiler error when trying to vectorize logical_and and logical_or (#95361)
Currently, `operator&&` and `operator||` don't have vectorized implementations, so disable them for now as a quick fix for the 2.0 release.
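
A tiny sketch of the shape of such a quick fix (hypothetical names, not the actual `CppVecKernelChecker` code): the checker declines to vectorize kernels that use these ops, so they fall back to the scalar path.

```python
# Hypothetical illustration: refuse to vectorize kernels containing ops that
# have no at::vec implementation yet; they use the scalar C++ kernel instead.
UNVECTORIZABLE_OPS = {"logical_and", "logical_or"}

def can_vectorize(ops_used):
    return not (set(ops_used) & UNVECTORIZABLE_OPS)

print(can_vectorize(["load", "add", "store"]))          # True
print(can_vectorize(["load", "logical_and", "store"]))  # False -> scalar path
```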

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95361
Approved by: https://github.com/ngimel, https://github.com/EikanWang
2023-02-24 02:30:13 +00:00
XiaobingSuper
1d7133c542 inductor(cpu): fix C++ compile error when sigmoid's post ops is a reduction op (#94890)
For the timm **nfnet_l0** model, the CPU path fails with the following error: `torch._dynamo.exc.BackendCompilerFailed: inductor raised CppCompileError: C++ compile error`.

Here is a simple test case:

```
def fn(x):
    x = torch.ops.aten.sigmoid.default(x)
    return torch.ops.aten.mean.dim(x, [-1, -2], True)

x = torch.randn((1, 8, 8, 8))
opt_fn = torch._dynamo.optimize("inductor")(fn)
opt_fn(x)

real_out = fn(x)
compiled_out = opt_fn(x)
tol = 0.0001
print(torch.allclose(real_out, compiled_out, atol=tol, rtol=tol))

```

before:

```
extern "C" void kernel(float* __restrict__ in_out_ptr0,
                       const float* __restrict__ in_ptr0)
{
    auto out_ptr0 = in_out_ptr0;
    {
        #pragma GCC ivdep
        for(long i0=0; i0<8; i0+=1)
        {
            {
                #pragma omp declare reduction(+:at::vec::Vectorized<float>:omp_out += omp_in) initializer(omp_priv={{0}})
                float tmp2 = 0;
                auto tmp2_vec = at::vec::Vectorized<float>(tmp2);
                for(long i1=0; i1<4; i1+=1)
                {
                    auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + (16*i1) + (64*i0));
                    auto tmp1 = decltype(tmp0)(1)/(decltype(tmp0)(1) + tmp0.neg().exp());
                    tmp2_vec += tmp1;
                }
                #pragma omp simd simdlen(8)  reduction(+:tmp3)
                for(long i1=64; i1<64; i1+=1)
                {
                    auto tmp0 = in_ptr0[i1 + (64*i0)];
                    auto tmp1 = std::exp(-tmp0);
                    auto tmp2 = 1 / (1 + tmp1);
                    tmp3 += tmp2;
                }
                tmp2 += at::vec::vec_reduce_all<float>([](at::vec::Vectorized<float>& x, at::vec::Vectorized<float>&y) {return x + y;}, tmp2_vec);
                out_ptr0[i0] = tmp3;
            }
        }
    }
    {
        for(long i0=0; i0<0; i0+=1)
        {
            auto tmp0 = at::vec::Vectorized<float>::loadu(out_ptr0 + 16*i0);
            auto tmp1 = at::vec::Vectorized<float>(static_cast<float>(64));
            auto tmp2 = tmp0 / tmp1;
            tmp2.store(in_out_ptr0 + 16*i0);
        }
        #pragma omp simd simdlen(8)
        for(long i0=0; i0<8; i0+=1)
        {
            auto tmp0 = out_ptr0[i0];
            auto tmp1 = static_cast<float>(64);
            auto tmp2 = tmp0 / tmp1;
            in_out_ptr0[i0] = tmp2;
        }
    }
}
```

after:
```
extern "C" void kernel(float* __restrict__ in_out_ptr0,
                       const float* __restrict__ in_ptr0)
{
    auto out_ptr0 = in_out_ptr0;
    #pragma omp parallel num_threads(40)
    {
        {
            #pragma omp for
            for(long i0=0; i0<8; i0+=1)
            {
                {
                    #pragma omp declare reduction(+:at::vec::Vectorized<float>:omp_out += omp_in) initializer(omp_priv={{0}})
                    float tmp2 = 0;
                    auto tmp2_vec = at::vec::Vectorized<float>(tmp2);
                    for(long i1=0; i1<4; i1+=1)
                    {
                        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + (16*i1) + (64*i0));
                        auto tmp1 = decltype(tmp0)(1)/(decltype(tmp0)(1) + tmp0.neg().exp());
                        tmp2_vec += tmp1;
                    }
                    #pragma omp simd simdlen(8)  reduction(+:tmp2)
                    for(long i1=64; i1<64; i1+=1)
                    {
                        auto tmp0 = in_ptr0[i1 + (64*i0)];
                        auto tmp1 = decltype(tmp0)(1) / (decltype(tmp0)(1) + std::exp(-tmp0));
                        tmp2 += tmp1;
                    }
                    tmp2 += at::vec::vec_reduce_all<float>([](at::vec::Vectorized<float>& x, at::vec::Vectorized<float>&y) {return x + y;}, tmp2_vec);
                    out_ptr0[i0] = tmp2;
                }
            }
        }
        #pragma omp single
        {
            {
                for(long i0=0; i0<0; i0+=1)
                {
                    auto tmp0 = at::vec::Vectorized<float>::loadu(out_ptr0 + 16*i0);
                    auto tmp1 = at::vec::Vectorized<float>(static_cast<float>(64));
                    auto tmp2 = tmp0 / tmp1;
                    tmp2.store(in_out_ptr0 + 16*i0);
                }
                #pragma omp simd simdlen(8)
                for(long i0=0; i0<8; i0+=1)
                {
                    auto tmp0 = out_ptr0[i0];
                    auto tmp1 = static_cast<float>(64);
                    auto tmp2 = tmp0 / tmp1;
                    in_out_ptr0[i0] = tmp2;
                }
            }
        }
    }
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94890
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/lezcano
2023-02-15 17:13:45 +00:00
Fabio Rocha
1dbaa5c290 Use decompositions for some fallbacks introduced in #94039 (#94206)
In some cases, this implements the required inductor primitives.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94206
Approved by: https://github.com/jansel, https://github.com/ngimel
2023-02-14 09:31:30 +00:00
Xuehai Pan
5b1cedacde [BE] [2/3] Rewrite super() calls in functorch and torch (#94588)
Rewrite Python built-in class `super()` calls. Only non-semantic changes should be applied.

- #94587
- #94588
- #94592

Also, methods with only a `super()` call are removed:

```diff
class MyModule(nn.Module):
-   def __init__(self):
-       super().__init__()
-
    def forward(self, ...):
        ...
```

Cases where the rewrite would change semantics are left unchanged, e.g.:

f152a79be9/caffe2/python/net_printer.py (L184-L190)

f152a79be9/test/test_jit_fuser_te.py (L2628-L2635)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94588
Approved by: https://github.com/ezyang, https://github.com/albanD
2023-02-10 21:16:33 +00:00
XiaobingSuper
02b8a7f473 inductor: don't do transpose vectorization if input ld depends on most inner var (#94493)
Fixed https://github.com/pytorch/pytorch/issues/94269.

For the following case:

```
import torch
import torchvision
#import intel_extension_for_pytorch

import torch._dynamo
from torch._inductor import config

class Model(torch.nn.Module):
    def __init__(self):
        super(Model, self).__init__()

    def forward(self, x):
        constant_pad_nd = x
        # File: /home/xiaobing/miniconda3/envs/pytorch_te_binary/lib/python3.8/site-packages/timm/models/layers/halo_attn.py:195, code: kv = kv.unfold(2, self.win_size, self.block_size).unfold(3, self.win_size, self.block_size)
        as_strided: f32[1, 384, 2, 20, 12] = torch.ops.aten.as_strided.default(constant_pad_nd, [1, 384, 2, 20, 12], [153600, 1, 61440, 384, 7680]);  constant_pad_nd = None
        as_strided_1: f32[1, 384, 2, 2, 12, 12] = torch.ops.aten.as_strided.default(as_strided, [1, 384, 2, 2, 12, 12], [153600, 1, 61440, 3072, 7680, 384]);  as_strided = None

        # File: /home/xiaobing/miniconda3/envs/pytorch_te_binary/lib/python3.8/site-packages/timm/models/layers/halo_attn.py:197, code: kv = kv.reshape(
        clone_1: f32[1, 384, 2, 2, 12, 12] = torch.ops.aten.clone.default(as_strided_1, memory_format = torch.contiguous_format);  as_strided_1 = None
        _unsafe_view_1: f32[8, 48, 4, 144] = torch.ops.aten._unsafe_view.default(clone_1, [8, 48, 4, 144]);  clone_1 = None
        permute_2: f32[8, 4, 144, 48] = torch.ops.aten.permute.default(_unsafe_view_1, [0, 2, 3, 1]);  _unsafe_view_1 = None
        # File: /home/xiaobing/miniconda3/envs/pytorch_te_binary/lib/python3.8/site-packages/timm/models/layers/halo_attn.py:202, code: k, v = torch.split(kv, [self.dim_head_qk, self.dim_head_v], dim=-1)
        split_with_sizes = torch.ops.aten.split_with_sizes.default(permute_2, [16, 32], -1);  permute_2 = None
        getitem: f32[8, 4, 144, 16] = split_with_sizes[0]
        getitem_1: f32[8, 4, 144, 32] = split_with_sizes[1];  split_with_sizes = None
        permute_3: f32[8, 4, 16, 144] = torch.ops.aten.permute.default(getitem, [0, 1, 3, 2]);  getitem = None
        expand_1: f32[8, 4, 16, 144] = torch.ops.aten.expand.default(permute_3, [8, 4, 16, 144]);  permute_3 = None
        clone_3: f32[8, 4, 16, 144] = torch.ops.aten.clone.default(expand_1, memory_format = torch.contiguous_format);  expand_1 = None
        return clone_3

model = Model().eval()
opt_model = torch._dynamo.optimize('inductor')(model)
x = torch.randn(1, 384, 20, 20).to(memory_format=torch.channels_last)

ref = model(x)

with torch.no_grad():
    for i in range(3):
        out = opt_model(x)

print(torch.equal(ref, out))
```

The generated code before this PR is:

```
from ctypes import c_void_p, c_long
import torch
import random
from torch import empty_strided, as_strided, device
from torch._inductor.codecache import AsyncCompile
from torch._inductor.select_algorithm import extern_kernels

aten = torch.ops.aten
assert_size_stride = torch._C._dynamo.guards.assert_size_stride
async_compile = AsyncCompile()

kernel_cpp_0 = async_compile.cpp('''
#include "/tmp/torchinductor_xiaobing/ni/cniims6nap7c5wars7cmtbjr3mw6b5cxyoyxmsu7ro2l5fkrwatl.h"
extern "C" void kernel(const float* __restrict__ in_ptr0,
                       float* __restrict__ out_ptr0)
{
    {
        #pragma GCC ivdep
        for(long i0=0; i0<8; i0+=1)
        {
            #pragma GCC ivdep
            for(long i1=0; i1<4; i1+=1)
            {
                #pragma GCC ivdep
                for(long i2=0; i2<1; i2+=1)
                {
                    #pragma GCC ivdep
                    for(long i3=0; i3<9; i3+=1)
                    {
                        float tmp0[16*16] __attribute__ ((aligned (16)));
                        at::vec::transpose_mxn<float,16,16>(in_ptr0 + (16*i2) + (48*i0) + (384*((16*i3) % 12)) + (3072*(i1 % 2)) + (7680*(((4*i3) / 3))) + (61440*(i1 / 2)), ((-7680)*(i3 / 12)) + ((-384)*(i3 % 12)) + (384*((1 + i3) % 12)) + (7680*(((1 + i3) / 12))), tmp0, 16);
                        for (long i2_inner = 0; i2_inner < 16; i2_inner++)
                        {
                            auto tmp1 = at::vec::Vectorized<float>::loadu(tmp0 + 16*i2_inner);
                            tmp1.store(out_ptr0 + (16*i3) + (144*i2_inner) + (2304*i1) + (2304*i2) + (9216*i0));
                        }
                    }
                    #pragma GCC ivdep
                    for(long i3=144; i3<144; i3+=1)
                    {
                        for (long i2_inner = 0; i2_inner < 16; i2_inner++)
                        {
                            auto tmp0 = in_ptr0[i2_inner + (16*i2) + (48*i0) + (384*(i3 % 12)) + (3072*(i1 % 2)) + (7680*(i3 / 12)) + (61440*(i1 / 2))];
                            out_ptr0[i3 + (144*i2_inner) + (2304*i1) + (2304*i2) + (9216*i0)] = tmp0;
                        }
                    }
                }
                #pragma GCC ivdep
                for(long i2=16; i2<16; i2+=1)
                {
                    #pragma GCC ivdep
                    for(long i3=0; i3<144; i3+=1)
                    {
                        auto tmp0 = in_ptr0[i2 + (48*i0) + (384*(i3 % 12)) + (3072*(i1 % 2)) + (7680*(i3 / 12)) + (61440*(i1 / 2))];
                        out_ptr0[i3 + (144*i2) + (2304*i1) + (9216*i0)] = tmp0;
                    }
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, = args
    args.clear()
    buf0 = empty_strided((8, 4, 16, 144), (9216, 2304, 144, 1), device='cpu', dtype=torch.float32)
    kernel_cpp_0(c_void_p(arg0_1.data_ptr()), c_void_p(buf0.data_ptr()))
    del arg0_1
    return (buf0, )
```

After:

```
from ctypes import c_void_p, c_long
import torch
import random
from torch import empty_strided, as_strided, device
from torch._inductor.codecache import AsyncCompile
from torch._inductor.select_algorithm import extern_kernels

aten = torch.ops.aten
assert_size_stride = torch._C._dynamo.guards.assert_size_stride
async_compile = AsyncCompile()

kernel_cpp_0 = async_compile.cpp('''
#include "/tmp/torchinductor_xiaobing/dm/cdmaihqxwe73zkb3he2zizktpq5uujetg2db26c3r4lgsmlx3b4c.h"
extern "C" void kernel(const float* __restrict__ in_ptr0,
                       float* __restrict__ out_ptr0)
{
    {
        #pragma GCC ivdep
        for(long i0=0; i0<8; i0+=1)
        {
            #pragma GCC ivdep
            for(long i1=0; i1<4; i1+=1)
            {
                #pragma GCC ivdep
                for(long i2=0; i2<16; i2+=1)
                {
                    #pragma GCC ivdep
                    for(long i3=0; i3<144; i3+=1)
                    {
                        auto tmp0 = in_ptr0[i2 + (48*i0) + (384*(i3 % 12)) + (3072*(i1 % 2)) + (7680*(i3 / 12)) + (61440*(i1 / 2))];
                        out_ptr0[i3 + (144*i2) + (2304*i1) + (9216*i0)] = tmp0;
                    }
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, = args
    args.clear()
    buf0 = empty_strided((8, 4, 16, 144), (9216, 2304, 144, 1), device='cpu', dtype=torch.float32)
    kernel_cpp_0(c_void_p(arg0_1.data_ptr()), c_void_p(buf0.data_ptr()))
    del arg0_1
    return (buf0, )

if __name__ == "__main__":
    from torch._dynamo.testing import rand_strided
    from torch._inductor.utils import print_performance
    arg0_1 = rand_strided((1, 384, 20, 20), (153600, 1, 7680, 384), device='cpu', dtype=torch.float32)
    print_performance(lambda: call([arg0_1]))

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94493
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/EikanWang
2023-02-10 09:04:45 +00:00
Wang, Eikan
1767026d1e Abstract the optimization context information as a dedicated class to better organize the code (#92057)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92057
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/jansel
2023-02-08 08:25:22 +00:00
Wang, Eikan
88ef4739b2 Check the semantic of loading the mask value (#91755)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91755
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-02-08 02:34:22 +00:00
Jiong Gong
a28a062938 [Inductor] Fix CPU vectorized implementation of mask calculation that breaks torch.where (#93922)
Fix https://github.com/pytorch/pytorch/issues/93374

The cause of the issue is that the original vectorized float mask calculation doesn't consider the broadcast case. This PR adds support for it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93922
Approved by: https://github.com/XiaobingSuper, https://github.com/desertfire, https://github.com/jansel
2023-02-07 11:30:21 +00:00
Wang, Eikan
9895c19a7a To vectorize long datatype as mask index (#91076)
In this PR, we record the current fx node being executed to cache additional information and simplify the vectorization checker. In addition, we support `masked` by simplifying it as `mask_load`, which enables `max_pool2d`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91076
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/jansel
2023-02-05 03:36:22 +00:00
Liao, Xuan
11de399447 [inductor] fix cpu implementation of torch.neg (#94035)
Fixes #93380

The fix maintains the data type after applying `neg`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94035
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/jansel
2023-02-04 03:13:11 +00:00
Peter Bell
b7a5c79399 [inductor] Fix type inference in CPU masked operations (#93842)
Fixes #93351

The existing code guesses that `tmp3` is probably a `float`, and so truncates
any `double` values

```cpp
float tmp3 = 0.0;
if(tmp2)
{
    auto tmp4 = in_ptr0[i0];
    tmp3 = tmp4;
}
```

The proposed change is to generate a lambda expression that represents the body
of the masked operation, and infer the type from the return value:
```cpp
auto tmp3 = [&]
{
    auto tmp4 = in_ptr0[i0];
    return tmp4;
}
;
auto tmp5 = tmp2 ? tmp3() : static_cast<decltype(tmp3())>(0.0);
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93842
Approved by: https://github.com/jgong5, https://github.com/Valentine233, https://github.com/jansel
2023-02-02 22:42:19 +00:00
Jason Ansel
45eadc2c4d ConfigModule for _{dynamo,inductor}.config (#93252)
This refactors the way dynamo/inductor configs are handled to check for invalid configs and add options like patching and serialization.
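
A compact sketch of what such a config wrapper can enable (names and fields are made up, not the real `torch._dynamo.config`/`torch._inductor.config` surface): unknown options are rejected, and values can be patched temporarily or serialized.

```python
# Illustrative ConfigModule-like wrapper: validates option names, supports a
# patch() context manager, and can serialize its state. Not the real API.
import contextlib
import json

class ConfigSketch:
    _defaults = {"debug": False, "cpp_threads": 8}  # hypothetical options

    def __init__(self):
        object.__setattr__(self, "_values", dict(self._defaults))

    def __setattr__(self, name, value):
        if name not in self._defaults:
            raise AttributeError(f"unknown config option: {name}")
        self._values[name] = value

    def __getattr__(self, name):
        if name in self._values:
            return self._values[name]
        raise AttributeError(name)

    @contextlib.contextmanager
    def patch(self, **overrides):
        saved = dict(self._values)
        try:
            for key, value in overrides.items():
                setattr(self, key, value)
            yield
        finally:
            object.__setattr__(self, "_values", saved)

    def to_json(self) -> str:
        return json.dumps(self._values)

config = ConfigSketch()
with config.patch(debug=True):
    print(config.debug)    # True inside the patch
print(config.debug)        # False again afterwards
print(config.to_json())    # {"debug": false, "cpp_threads": 8}
```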

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93252
Approved by: https://github.com/voznesenskym
2023-02-01 19:38:05 +00:00
min-jean-cho
68a40a47a0 [Inductor] Lower aten.tan (#92837)
Related #92047

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92837
Approved by: https://github.com/jgong5, https://github.com/lezcano
2023-01-24 16:35:40 +00:00
Horace He
19c9b09449 Replace IndexingDiv with FloorDiv in Inductor (#92878)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92878
Approved by: https://github.com/ezyang
2023-01-24 15:06:22 +00:00
Peter Bell
5644059489 [inductor] Lower torch.exp2 and use it for torch.pow(2, x) (#92632)
Before
```python
    tmp0 = 2.0
    tmp2 = tl.libdevice.pow(tmp0, tmp1)
```

After
```python
    tmp1 = tl.libdevice.exp2(tmp0)
```

I've benchmarked on CPU and CUDA with the following examples
```
@torch._dynamo.optimize()
def exp2(x):
    return torch.pow(2, x)

@torch._dynamo.optimize()
def logaddexp2(a, b):
    m = torch.maximum(a, b)
    return m + torch.log2(1 + torch.pow(2, -torch.abs(a-b)))
```

triton is able to specialize `pow(2, x)` such that this makes
no difference, but on CPU I see a surprisingly large speedup.

| device | Function  | Master (us) | This PR (us) | Speedup |
|--------|-----------|-------------|--------------|---------|
| CUDA   | exp2      | 64          | 63           | 1.0     |
|        | logaddexp | 109         | 107          | 1.0     |
| CPU    | exp2      | 220         | 40           | 5.5     |
|        | logaddexp | 282         | 140          | 2.0     |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92632
Approved by: https://github.com/lezcano, https://github.com/ngimel
2023-01-20 22:06:23 +00:00
Liao, Xuan
119d5e425c [Inductor] decompose expm1 for CPP vec (#92289)
For the micro-bench op `aten.elu.default` in TIMM, performance is not good even with vectorization. `Elu` uses `expm1` as a sub-op. It turns out that inductor invokes the Sleef `expm1` function, while aten decomposes it as `exp - 1`; the former performs worse than the latter. This PR decomposes `expm1` for cpp vectorization to bring the performance back.
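
The direction of the decomposition, shown with public torch ops (the actual PR changes the CPP vectorized codegen rather than adding a Python function like this):

```python
# Sketch of the decomposition described above: expm1(x) -> exp(x) - 1.
import torch

def expm1_decomposed(x: torch.Tensor) -> torch.Tensor:
    return torch.exp(x) - 1.0

x = torch.randn(1024)
print(torch.allclose(expm1_decomposed(x), torch.expm1(x), atol=1e-6))  # True
```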

Performance data for eager vs. inductor:

suite | improved_ratio_speedup | speedup_old | RSD(3) | speedup_new | RSD(3)
-- | -- | -- | -- | -- | --
timm | 114.38% | 0.803447768 | 8.39% | 1.722458 | 27.74%

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92289
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-01-20 05:29:32 +00:00
Peter Bell
30f2026863 [inductor] Promote half-precision CPU constants to float (#91224)
Currently `aten.where` can fail with the following C++ compiler error:
```
error: operands to '?:' have different types 'c10::Half' and 'float'
```

This happens because `ops.load` is overridden to cast Half inputs to float, but
`ops.constant` will load a Half without promoting to float.
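
A minimal sketch of the intended behavior (hypothetical helper, not the actual `ops.constant` override): promote half-precision constant dtypes to float so they match what `ops.load` produces.

```python
# Hypothetical illustration: float16 constants are emitted as float so the
# generated '?:' expression sees matching operand types.
import torch

def promoted_constant_dtype(dtype: torch.dtype) -> torch.dtype:
    # assumption for this sketch: only float16 needs the promotion here
    if dtype == torch.float16:
        return torch.float32
    return dtype

print(promoted_constant_dtype(torch.float16))  # torch.float32
print(promoted_constant_dtype(torch.float64))  # torch.float64
```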

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91224
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/ngimel
2023-01-18 01:04:36 +00:00
Jiong Gong
2eaa7a25d0 Fix model accuracy issue caused by vectorized transpose (#92299)
Fix accuracy issues from models: jx_nest_base, cait_m36_384, XLNetLMHeadModel, Super_SloMo
https://github.com/pytorch/torchdynamo/issues/2038
https://github.com/pytorch/torchdynamo/issues/2037
https://github.com/pytorch/torchdynamo/issues/2036
https://github.com/pytorch/torchdynamo/issues/2035

The inner loop list should be newly created in loop.clone().

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92299
Approved by: https://github.com/desertfire
2023-01-17 17:53:45 +00:00
Jiong Gong
7c641eaaf0 [Inductor] Support vectorized transpose in CPP backend (#91532)
Fix https://github.com/pytorch/torchdynamo/issues/1915
This PR adds the vectorization support for transposed operations in TorchInductor CPP backend. It contains the following changes:
1. `CppTile2DKernelChecker` is added to check the eligibility of applying the optimization. We only address a narrow set of situations. All of the following conditions should be met: 1) There exists one and only one fp32 load/store with the outer loop var having contiguous buffer accesses. 2) When a load/store doesn't have contiguous access in an outer loop var, the access should be vectorizable from the inner-most dim. 3) No reduction. More scenarios/operations will be supported in future PRs.
2. If `CppTile2DKernelChecker` reports the optimization is doable, `CppKernelProxy` would split/tile the loops from both the outer loop var having contiguous buffer access and the inner-most loop var.
3. The main loop split from the outer loop var is further split at the inner-most level and then handled by `CppTile2DKernel` and `CppTile2DTailKernel`, which generate the transposed load/store. The former kernel does the vectorized transposed load/store on tiles and then does vectorized load/store/compute along the inner-most loop axis; the vectorized transpose micro-kernel implementation borrows/refers to that from FBGEMM. The latter kernel simply does scalar operations. (See the rough tiling sketch after this list.)
4. The tail loop split from the outer loop var directly calls `CppKernel` with scalar operations.
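
A rough NumPy sketch of the 2D tiling scheme from item 3 (tile size and helper name are assumptions for illustration; the real kernels emit C++ and use `at::vec::transpose_mxn` for the tile transpose):

```python
# Illustrative only: main loops handle full TILE x TILE blocks (the Tile2D
# kernel), tail loops handle the remainder with plain scalar-style copies.
import numpy as np

def tiled_transpose(src: np.ndarray, tile: int = 16) -> np.ndarray:
    m, n = src.shape
    dst = np.empty((n, m), dtype=src.dtype)
    m_main, n_main = m - m % tile, n - n % tile
    # main loops over full tiles
    for i in range(0, m_main, tile):
        for j in range(0, n_main, tile):
            dst[j:j + tile, i:i + tile] = src[i:i + tile, j:j + tile].T
    # tail rows: remaining i, all columns
    for i in range(m_main, m):
        dst[:, i] = src[i, :]
    # tail columns: remaining j, only the rows covered by the main tiles
    for j in range(n_main, n):
        dst[j, :m_main] = src[:m_main, j]
    return dst

a = np.random.rand(37, 50).astype(np.float32)
print(np.allclose(tiled_transpose(a), a.T))  # True
```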

Next steps:
1. Support vectorized transpose with smaller tile size at one dim but bigger tile size at the other, e.g., 3x784.
2. Support reduction vectorized on the outer loop var (contiguous from outer loop var, not with inner-most loop var)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91532
Approved by: https://github.com/EikanWang, https://github.com/jansel
2023-01-12 17:20:39 +00:00
Natalia Gimelshein
44413f2525 properly convert fill value to x dtype in constant_pad (#92045)
Fixes #92038

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92045
Approved by: https://github.com/desertfire
2023-01-12 05:41:10 +00:00
Jiong Gong
859ac58c54 [Inductor] Support loop split at given depth in CPP codegen (#91397)
This PR refactors the loop-related data structures to support loop splits at a given depth. Before this PR, loop splits were only supported at the inner-most level. With this PR, tiling can be supported at outer levels and at more than one level. The `LoopNest` data structure is extended to support loop splits at various levels and renamed to `LoopNestWithSplit`. The `codegen_loops` function is also rewritten to be general enough to support arbitrary kernels set at the leaves of the loop structure.

This PR also improves the handling of reduction loops with splits. The main loop and tail loop now work on their own reduction variables in parallel, without the data dependency they previously had. With this, two workarounds can be removed from `CppVecKernel`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91397
Approved by: https://github.com/EikanWang, https://github.com/jansel
2023-01-06 12:53:46 +00:00
Wu, Chunyuan
c99a2a43ad [inductor] decompose tanh in CPP backend (#91687)
## Description
The decomposition of `tanh` has been removed in https://github.com/pytorch/pytorch/pull/90889.
```python
@register_decomposition([aten.tanh])
def tanh(x):
    return 2.0 / (1.0 + torch.exp(-2.0 * x)) - 1.0
```
We've observed a performance regression on CPU for `lennard_jones` in the TorchBench suite.
This PR decomposes `tanh` in the CPP backend to fix the regression.

### Performance

- Model: lennard_jones
- Machine: IceLake (32 cores per socket)
- Configuration: single instance, 32 cores per instance
- jemalloc and iomp enabled

```bash
python benchmarks/dynamo/torchbench.py  --inductor-settings --inductor --performance --float32 -dcpu -n500  --no-skip --dashboard --only=lennard_jones --quiet
```

Time before regression | Time after regression | Time with this PR
-- | -- | --
0.000262036 | 0.0003618 | 0.000267888

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91687
Approved by: https://github.com/jgong5, https://github.com/desertfire
2023-01-06 10:05:36 +00:00