Commit Graph

119 Commits

Author SHA1 Message Date
haozhe.zhu
d01ba4e94e enable fp8 cast for inductor CPU (#117737)
Enable FP8 cast for inductor CPU to address issue https://github.com/pytorch/pytorch/issues/117119.
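A minimal sketch of the kind of cast this enables, assuming a build where `torch.float8_e4m3fn` is available:

```python
import torch

def fp8_roundtrip(x):
    # cast to float8 and back; this is the kind of FP8 cast the PR enables on the CPU backend
    return x.to(torch.float8_e4m3fn).to(torch.float32)

compiled = torch.compile(fp8_roundtrip)
x = torch.randn(8, 8)
print(compiled(x))
```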

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117737
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-01-23 01:16:15 +00:00
leslie-fang-intel
af831415a8 fix cpp backend relu codegen with inf input (#117622)
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/117544.
For the CPP backend, `ReLU` currently code-gens to `f"{x} * ({x}>0)"` in `CppOverrides`. The result mismatches eager when the input contains `inf`, since `inf * 0` yields `nan` per [IEEE_754](https://en.wikipedia.org/wiki/IEEE_754). Change the codegen to `f"std::max({x}, decltype({x})(0))"` to align with the eager implementation in 1deb75b584/aten/src/ATen/native/cpu/TensorCompareKernel.cpp (L392)
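A minimal eager-level illustration of the mismatch, with the old and new codegen patterns written out in Python:

```python
import torch

x = torch.tensor([float("-inf"), float("inf"), -1.0, 2.0])
print(torch.relu(x))                           # eager: [0., inf, 0., 2.]
print(x * (x > 0))                             # old pattern: -inf * 0 -> nan
print(torch.maximum(x, torch.zeros_like(x)))   # new pattern: matches eager
```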

**TestPlan**
```
python -u -m pytest test_cpu_repro.py -k test_relu_with_inf_value
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117622
Approved by: https://github.com/jgong5, https://github.com/lezcano
2024-01-20 13:28:03 +00:00
Jiong Gong
3b00dd5843 [inductor][cpp] apply simplify_index_in_vec_range in select_tiling_indices to enable more contiguous vec load (#117260)
For one of the kernels in the UT `test_vec_contiguous_ModularIndexing`:
Before:
```c++
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(28L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(16L); x1+=static_cast<long>(16L))
                {
                    {
                        #pragma omp declare reduction(welford:Welford<float>:omp_out = welford_combine(omp_out, omp_in)) initializer(omp_priv={Welford<float>()})
                        #pragma omp declare reduction(welford:Welford<at::vec::Vectorized<float>>:omp_out = welford_combine(omp_out, omp_in)) initializer(omp_priv={Welford<at::vec::Vectorized<float>>()})
                        Welford<float> tmp_acc0 = Welford<float>();
                        Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        for(long x2=static_cast<long>(0L); x2<static_cast<long>(512L); x2+=static_cast<long>(1L))
                        {
                            auto tmp0 =
                            [&]
                            {
                                __at_align__ std::array<float, 16> tmpbuf;
                                #pragma GCC unroll 16
                                for (long x1_inner = 0; x1_inner < 16; x1_inner++)
                                {
                                    tmpbuf[x1_inner] = in_ptr0[static_cast<long>((128L*(c10::div_floor_integer(x2, 256L))) + (256L*x1) + (256L*x1_inner) + (7168L*(static_cast<long>(c10::div_floor_integer(x2, 128L)) % static_cast<long>(2L))) + (14336L*x0) + (static_cast<long>(x2) % static_cast<long>(128L)))];
                                }
                                return at::vec::Vectorized<float>::loadu(tmpbuf.data());
                            }
                            ()
                            ;
                            tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0);
                        }
                        tmp_acc0_vec.mean.store(out_ptr0 + static_cast<long>(x1 + (28L*x0)));
                        tmp_acc0_vec.m2.store(out_ptr1 + static_cast<long>(x1 + (28L*x0)));
                    }
                }
                #pragma omp simd simdlen(8)
                for(long x1=static_cast<long>(16L); x1<static_cast<long>(28L); x1+=static_cast<long>(1L))
                {
                    {
                        #pragma omp declare reduction(    welford:Welford<float>:    omp_out = welford_combine(omp_out, omp_in))     initializer(omp_priv={Welford<float>()})
                        Welford<float> tmp_acc0 = Welford<float>();
                        for(long x2=static_cast<long>(0L); x2<static_cast<long>(512L); x2+=static_cast<long>(1L))
                        {
                            auto tmp0 = in_ptr0[static_cast<long>((128L*(c10::div_floor_integer(x2, 256L))) + (256L*x1) + (7168L*(static_cast<long>(c10::div_floor_integer(x2, 128L)) % static_cast<long>(2L))) + (14336L*x0) + (static_cast<long>(x2) % static_cast<long>(128L)))];
                            tmp_acc0 = welford_combine(tmp_acc0, tmp0);
                        }
                        out_ptr0[static_cast<long>(x1 + (28L*x0))] = tmp_acc0.mean;
                        out_ptr1[static_cast<long>(x1 + (28L*x0))] = tmp_acc0.m2;
                    }
                }
```

After:
```c++
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(28L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(28L); x1+=static_cast<long>(1L))
                {
                    {
                        #pragma omp declare reduction(welford:Welford<float>:omp_out = welford_combine(omp_out, omp_in)) initializer(omp_priv={Welford<float>()})
                        #pragma omp declare reduction(welford:Welford<at::vec::Vectorized<float>>:omp_out = welford_combine(omp_out, omp_in)) initializer(omp_priv={Welford<at::vec::Vectorized<float>>()})
                        Welford<float> tmp_acc0 = Welford<float>();
                        Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        for(long x2=static_cast<long>(0L); x2<static_cast<long>(512L); x2+=static_cast<long>(16L))
                        {
                            auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>((128L*(c10::div_floor_integer(x2, 256L))) + (256L*x1) + (7168L*(static_cast<long>(c10::div_floor_integer(x2, 128L)) % static_cast<long>(2L))) + (14336L*x0) + (static_cast<long>(x2) % static_cast<long>(128L))));
                            tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0);
                        }
                        tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec));
                        out_ptr0[static_cast<long>(x1 + (28L*x0))] = static_cast<float>(tmp_acc0.mean);
                        out_ptr1[static_cast<long>(x1 + (28L*x0))] = static_cast<float>(tmp_acc0.m2);
                    }
                }
            }
```

This PR also further speeds up the model `swin_base_patch4_window7_224` from 1.25x to 1.28x.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117260
Approved by: https://github.com/jansel
ghstack dependencies: #117221
2024-01-15 06:57:25 +00:00
Jiong Gong
172dd13ecf [inductor][cpp] improve vector contiguous checks for FloorDiv and ModularIndexing (#117221)
Fix https://github.com/pytorch/pytorch/issues/114488

The PR tries to enable contiguous vector loads for cases where we can reduce `FloorDiv` and `ModularIndexing` in the vectorized loop.

Take the index expression in the test case `test_vec_contiguous_ModularIndexing` as an example.
`14336*x0 + 256*x1 + 128*((x2//256)) + ModularIndexing(x2, 1, 128) + 7168*ModularIndexing(x2, 128, 2)` can be reduced to `14336*x0 + 256*x1 + x2 + 128*x2_div_c0 + 7168*x2_mod_c0 + x2_mod_c1`, where `x2` is a vectorized loop variable and the vector length is 16. This means we can do a vectorized load for this index. Check the code comment for more details:
https://github.com/pytorch/pytorch/pull/117221/files#diff-5ab7b0235e2076a5fc6629ba0b109208940f5b94f5c13babc3e0f87cf4fcec82R317-R329
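As a quick numerical check of that reduction (the concrete outer-loop values and the 16-aligned base below are arbitrary choices, not from the PR), one can evaluate the original index over a single 16-lane vector range with sympy:

```python
import sympy

x2 = sympy.Symbol("x2", integer=True, nonnegative=True)
x0v, x1v = 3, 5  # arbitrary concrete values for the outer loop variables

# Index from the commit message: FloorDiv -> //, ModularIndexing(a, b, c) -> Mod(a // b, c)
idx = (14336 * x0v + 256 * x1v + 128 * (x2 // 256)
       + sympy.Mod(x2, 128) + 7168 * sympy.Mod(x2 // 128, 2))

# Within one 16-lane vector starting at a 16-aligned base, the FloorDiv/Mod terms
# are invariant, so consecutive lanes differ by exactly 1 (a contiguous load).
base = 3632  # any 16-aligned starting value of x2
vals = [idx.subs(x2, base + lane) for lane in range(16)]
print([int(v - vals[0]) for v in vals])  # [0, 1, 2, ..., 15]
```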

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117221
Approved by: https://github.com/jansel
2024-01-12 15:20:36 +00:00
haozhe.zhu
ec443089c7 enable fp16 mkldnn fusion/prepack in inductor (#117206)
- Extend `linear/conv/rnn` prepacking to support `float16`.
- Extend `Unary fusion` to support `float16`.

Test Case:
    Extend the bfloat16-related tests in `test_cpu_repro.py` and `test_mkldnn_pattern_matcher.py` to cover both `fp16` and `bf16`. A minimal sketch of the covered pattern follows.
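A minimal sketch of the covered pattern (a hypothetical module: a linear followed by a unary op that can be fused), run in float16 on CPU:

```python
import torch
import torch.nn as nn

class M(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(16, 16)

    def forward(self, x):
        return torch.relu(self.linear(x))  # linear + unary fusion candidate

m = M().eval().to(torch.float16)
x = torch.randn(4, 16, dtype=torch.float16)
with torch.no_grad():
    out = torch.compile(m)(x)
```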

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117206
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-01-12 06:08:42 +00:00
Sun, Jiayi
9f57cf502f [inductor][cpu]disable pointwise_cat on CPU (#116313)
We observed a negative performance impact of the pointwise_cat optimization on CPU, so we disabled it. We will revisit this after enabling vectorization on index_expr.

This PR fixes the following three regression issues:
https://github.com/pytorch/pytorch/issues/115827
https://github.com/pytorch/pytorch/issues/112139
https://github.com/pytorch/pytorch/issues/114495

It causes the performance regression of pytorch_unet again. Related issue: https://github.com/pytorch/pytorch/issues/115343

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116313
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/eellison
2024-01-11 08:00:00 +00:00
chunyuan
99ef47098d Use smaller shapes in lstm test to fix the CI timeout (#116453)
Fixes https://github.com/pytorch/pytorch/issues/108824 by using smaller shapes while keeping the same test scope

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116453
Approved by: https://github.com/huydhn, https://github.com/jgong5
2024-01-05 21:19:56 +00:00
Bin Bao
f4230ec9fd [inductor] Remove the float16 restriction for cpu cpp_wrapper (#116205)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116205
Approved by: https://github.com/jgong5, https://github.com/chunyuan-w, https://github.com/jansel
2023-12-26 16:01:20 +00:00
Jiong Gong
ffe6f9ac91 [inductor cpp] support vectorization for index_expr that depends on tiling itervar or with indirect indexing (#114545)
As the title says, this PR enables vectorization for the case where the index_expr depends on the vectorized itervar. There are two cases (a rough Python-level repro follows the list):
1. The vectorized itervar has a constant stride in the index_expr. We vectorize the index_expr with `Vectorized<int32>::arange` for this case.
2. Otherwise, we load the index_expr vector in a non-contiguous way with a loop.
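A rough Python-level repro in the spirit of `test_concat_inner_vec` (shapes are inferred from the generated code below and the trailing relu is an assumption):

```python
import torch

def fn(a, b):
    # inner-dim concat: the index_expr over the concat dim has stride 1
    return torch.relu(torch.cat((a, b), dim=1))

a = torch.randn(32, 35)
b = torch.randn(32, 120)
compiled = torch.compile(fn)
torch.testing.assert_close(compiled(a, b), fn(a, b))
```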

Below is the generated code for the first case, from the test `test_concat_inner_vec`. Here the index_expr is `x1`, which depends on the vectorized itervar `x1` with constant stride 1, so we vectorize it with arange. We use `all_zero` as a short-cut on masks to avoid unnecessary execution of nested masked regions that would be invalid.
Before:
```c++
            #pragma omp for  collapse(2)
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(32L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(155L); x1+=static_cast<long>(1L))
                {
                    auto tmp0 = c10::convert<long>(x1);
                    auto tmp1 = static_cast<long>(0);
                    auto tmp2 = tmp0 >= tmp1;
                    auto tmp3 = static_cast<long>(35);
                    auto tmp4 = tmp0 < tmp3;
                    auto tmp5 = [&]
                    {
                        auto tmp6 = in_ptr0[static_cast<long>(x1 + (35L*x0))];
                        return tmp6;
                    }
                    ;
                    auto tmp7 = tmp4 ? tmp5() : static_cast<decltype(tmp5())>(0.0);
                    auto tmp8 = tmp0 >= tmp3;
                    auto tmp9 = static_cast<long>(155);
                    auto tmp10 = tmp0 < tmp9;
                    auto tmp11 = [&]
                    {
                        auto tmp12 = in_ptr1[static_cast<long>((-35L) + x1 + (120L*x0))];
                        return tmp12;
                    }
                    ;
...
```
After:
```c++
            #pragma omp for
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(32L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(144L); x1+=static_cast<long>(16L))
                {
                    auto tmp0 = c10::convert<int>(x1);
                    auto tmp1 = at::vec::Vectorized<int32_t>::arange(tmp0, 1);
                    auto tmp2 = static_cast<int>(0);
                    auto tmp3 = at::vec::Vectorized<int>(tmp2);
                    auto tmp4 = to_float_mask(tmp1 >= tmp3);
                    auto tmp5 = static_cast<int>(35);
                    auto tmp6 = at::vec::Vectorized<int>(tmp5);
                    auto tmp7 = to_float_mask(tmp1 < tmp6);
                    auto tmp8 = [&]
                    {
                        auto tmp9 = masked_load(in_ptr0 + static_cast<long>(x1 + (35L*x0)), to_float_mask(tmp7));
                        return tmp9;
                    }
                    ;
                    auto tmp10 =
                    [&]
                    {
                        if (all_zero(to_float_mask(tmp7)))
                        {
                            return at::vec::Vectorized<float>(static_cast<float>(0.0));
                        }
                        else
                        {
                            return decltype(tmp8())::blendv(at::vec::Vectorized<float>(static_cast<float>(0.0)), tmp8(), to_float_mask(tmp7));
                        }
                    }
                    ()
                    ;
...
```

Below is the generated code for the second case, from the test case `test_expr_vec_non_contiguous`. Here the index_expr is `31L + (63L*(c10::div_floor_integer(x1, 32L))) + (c10::div_floor_integer(x2, 32L))`, which depends on the vectorized itervar `x2` and does not have a constant stride, so we load the index_expr vector with a loop. (In fact, this can be further optimized since the index_expr is invariant across the data points in the range [x2, x2+16), so it could be treated as a scalar; that will be optimized in a follow-up PR.) The code uses `vector_lane_mask_check` to implement the masked version of the non-contiguous load.
Before:
```c++
            #pragma omp for  collapse(2)
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(4L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(1024L); x1+=static_cast<long>(1L))
                {
                    {
                        float tmp_acc0 = -std::numeric_limits<float>::infinity();
                        for(long x2=static_cast<long>(0L); x2<static_cast<long>(1024L); x2+=static_cast<long>(1L))
                        {
                            auto tmp0 = c10::convert<long>(31L + (63L*(c10::div_floor_integer(x1, 32L))) + (c10::div_floor_integer(x2, 32L)));
                            auto tmp1 = static_cast<long>(2048);
                            auto tmp2 = tmp0 < tmp1;
                            auto tmp3 = [&]
                            {
                                auto tmp4 = in_ptr0[static_cast<long>(31L + (63L*(c10::div_floor_integer(x1, 32L))) + (2048L*(static_cast<long>(x1) % static_cast<long>(32L))) + (65536L*x0) + (c10::div_floor_integer(x2, 32L)))];
                                return tmp4;
                            }
                            ;
                            auto tmp5 = tmp2 ? tmp3() : static_cast<decltype(tmp3())>(0.0);
                            tmp_acc0 = max_propagate_nan(tmp_acc0, tmp5);
                        }
                        out_ptr0[static_cast<long>(x1 + (1024L*x0))] = tmp_acc0;
                    }
                }
            }
```
After:
```c++
            #pragma omp for
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(4L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(1024L); x1+=static_cast<long>(16L))
                {
                    {
                        #pragma omp declare reduction(max:at::vec::Vectorized<float>:omp_out = at::vec::maximum(omp_out, omp_in)) initializer(omp_priv={at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity())})
                        float tmp_acc0 = -std::numeric_limits<float>::infinity();
                        at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity());
                        for(long x2=static_cast<long>(0L); x2<static_cast<long>(1024L); x2+=static_cast<long>(1L))
                        {
                            auto tmp0 =
                            [&]
                            {
                                __at_align__ std::array<int, 16> tmpbuf;
                                #pragma GCC unroll 16
                                for (long x1_inner = 0; x1_inner < 16; x1_inner++)
                                {
                                    tmpbuf[x1_inner] = static_cast<long>(31L + (63L*(c10::div_floor_integer((x1 + x1_inner), 32L))) + (c10::div_floor_integer(x2, 32L)));
                                }
                                return at::vec::Vectorized<int>::loadu(tmpbuf.data());
                            }
                            ()
                            ;
                            auto tmp1 = static_cast<int>(2048);
                            auto tmp2 = at::vec::Vectorized<int>(tmp1);
                            auto tmp3 = to_float_mask(tmp0 < tmp2);
                            auto tmp4 = [&]
                            {
                                auto tmp5 =
                                [&]
                                {
                                    __at_align__ std::array<float, 16> tmpbuf;
                                    #pragma GCC unroll 16
                                    for (long x1_inner = 0; x1_inner < 16; x1_inner++)
                                    {
                                        if (vector_lane_mask_check(tmp3, x1_inner))
                                        {
                                            tmpbuf[x1_inner] = in_ptr0[static_cast<long>(31L + (63L*(c10::div_floor_integer((x1 + x1_inner), 32L))) + (2048L*(static_cast<long>((x1 + x1_inner)) % static_cast<long>(32L))) + (65536L*x0) + (c10::div_floor_integer(x2, 32L)))];
                                        }
                                    }
                                    return at::vec::Vectorized<float>::loadu(tmpbuf.data());
                                }
                                ()
                                ;
                                return tmp5;
                            }
                            ;
                            auto tmp6 =
                            [&]
                            {
                                if (all_zero(to_float_mask(tmp3)))
                                {
                                    return at::vec::Vectorized<float>(static_cast<float>(0.0));
                                }
                                else
                                {
                                    return decltype(tmp4())::blendv(at::vec::Vectorized<float>(static_cast<float>(0.0)), tmp4(), to_float_mask(tmp3));
                                }
                            }
                            ()
                            ;
                            tmp_acc0_vec = at::vec::maximum(tmp_acc0_vec, tmp6);
                        }
                        tmp_acc0_vec.store(out_ptr0 + static_cast<long>(x1 + (1024L*x0)));
                    }
                }
            }
        }
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114545
Approved by: https://github.com/lezcano
2023-12-26 05:36:39 +00:00
vfdev-5
6c2103bdf7 Fixed some failing inductor tests with exact_dtype=True (#115828)
Addresses point 1 from #115742: fixes CPUReproTest.test_embedding_vec_bf16.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115828
Approved by: https://github.com/peterbell10
2023-12-15 20:02:19 +00:00
Jiong Gong
b618869208 [inductor] label cpp test files with oncall: cpu inductor (#115167)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115167
Approved by: https://github.com/atalman
2023-12-14 17:39:27 +00:00
Peter Bell
ad76a4e1e7 [inductor] Allow sympy expressions to participate in type promotion (#115676)
In the test example we have `add(i64[10], sympy.Expr)`, where the `sympy.Expr` is not considered a promoting arg and so isn't factored into the type promotion. However, in eager it would promote to float32.
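A rough illustration of the kind of promotion in question (not the literal test from the PR):

```python
import torch

def fn(x):
    # under dynamic shapes, x.size(0) * 0.5 reaches the lowering as a sympy
    # expression; eager promotes the int64 tensor to float32, and with this
    # fix the inductor lowering agrees
    return x + x.size(0) * 0.5

x = torch.arange(10)                       # i64[10], as in the example above
out = torch.compile(fn, dynamic=True)(x)
print(out.dtype)                           # torch.float32
```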

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115676
Approved by: https://github.com/lezcano
ghstack dependencies: #115677, #115699, #115700
2023-12-13 22:22:37 +00:00
Bin Bao
0fc04e274d [inductor] Fix an aliased output bug (#115373)
Summary: for https://github.com/pytorch/pytorch/issues/97083, when

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115373
Approved by: https://github.com/jansel
2023-12-12 01:18:59 +00:00
PyTorch MergeBot
5fe2b138e3 Revert "[inductor] Fix an aliased output bug (#115373)"
This reverts commit 1310f0bf38.

Reverted https://github.com/pytorch/pytorch/pull/115373 on behalf of https://github.com/atalman due to Sorry for reverting your change it broke inductor tests ([comment](https://github.com/pytorch/pytorch/pull/115373#issuecomment-1850792869))
2023-12-11 20:02:15 +00:00
Bin Bao
1310f0bf38 [inductor] Fix an aliased output bug (#115373)
Summary: for https://github.com/pytorch/pytorch/issues/97083, when

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115373
Approved by: https://github.com/jansel
2023-12-10 23:52:39 +00:00
Jiong Gong
bfa2c844a8 [inductor][cpp] avoid redundant lowp type cast for direct load/store (#115006)
Fix https://github.com/pytorch/pytorch/issues/114879. See https://github.com/pytorch/pytorch/issues/114879#issuecomment-1836977610 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115006
Approved by: https://github.com/jansel
2023-12-04 06:39:27 +00:00
Jason Ansel
7979ba7b43 [inductor] Add dropout type check to match eager (#115040)
Fixes #98970

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115040
Approved by: https://github.com/oulgen
2023-12-03 23:05:02 +00:00
Jason Ansel
69a8f9b07e [inductor] Fix shape mismatch in sdpa pattern matcher (#115038)
Fixes #100316

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115038
Approved by: https://github.com/oulgen
2023-12-03 22:32:12 +00:00
Jiong Gong
a0e3321f0c [inductor cpp] vectorize embedding lookup (#114062)
For embedding lookup, there is indirect indexing with indices that are invariant to the vectorized itervar. To vectorize it, we need to keep the related indexing variables as scalars and allow vectorization when the related index_exprs are invariant to the vectorized itervar.

This PR adds the support by lazily broadcasting scalar values (index_expr and constant) to vectors, so that vector operations are only generated by `CppVecKernel` when any of the inputs are vectors; otherwise, scalar ops are generated. The cse variable in cpp is now represented with `CppCSEVariable`, which bookkeeps the itervars relevant to the variable and carries a flag marking whether it is a scalar or a vector. `CppVecOverrides` is improved to propagate these states when the ops are executed.
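A rough Python-level sketch of the pattern covered by `test_embedding_vec` (shapes are inferred from the generated kernels below and should be treated as assumptions):

```python
import torch
import torch.nn as nn

class M(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(64, 128)

    def forward(self, idx, x):
        # the gather index depends only on the outer loop var, i.e. it is
        # invariant to the vectorized (inner) itervar
        return self.emb(idx) + x

m = M().eval()
idx = torch.randint(0, 64, (128,))
x = torch.randn(128, 128)
with torch.no_grad():
    out = torch.compile(m)(idx, x)
```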

For the added UT `test_embedding_vec`, the generated code before this PR is:
```c++
extern "C" void kernel(const long* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0)
{
    #pragma omp parallel num_threads(64)
    {
        {
            #pragma omp for
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(128L); x0+=static_cast<long>(1L))
            {
                #pragma GCC ivdep
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(128L); x1+=static_cast<long>(1L))
                {
                    auto tmp0 = in_ptr0[static_cast<long>(x0)];
                    auto tmp5 = in_ptr2[static_cast<long>(x1 + (128L*x0))];
                    auto tmp1 = decltype(tmp0)(tmp0 + 64);
                    auto tmp2 = tmp0 < 0;
                    auto tmp3 = tmp2 ? tmp1 : tmp0;
                    TORCH_CHECK((0 <= tmp3) & (tmp3 < 64L), "index out of bounds: 0 <= tmp3 < 64L")
                    auto tmp4 = in_ptr1[static_cast<long>(x1 + (128L*tmp3))];
                    auto tmp6 = decltype(tmp4)(tmp4 + tmp5);
                    out_ptr0[static_cast<long>(x1 + (128L*x0))] = tmp6;
                }
            }
        }
    }
}
```

After this PR, we have:
```c++
extern "C" void kernel(const long* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0)
{
    #pragma omp parallel num_threads(64)
    {
        {
            #pragma omp for
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(128L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(128L); x1+=static_cast<long>(16L))
                {
                    auto tmp0 = in_ptr0[static_cast<long>(x0)];
                    auto tmp5 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<long>(x1 + (128L*x0)));
                    auto tmp1 = decltype(tmp0)(tmp0 + 64);
                    auto tmp2 = tmp0 < 0;
                    auto tmp3 = tmp2 ? tmp1 : tmp0;
                    TORCH_CHECK((0 <= tmp3) & (tmp3 < 64L), "index out of bounds: 0 <= tmp3 < 64L")
                    auto tmp4 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x1 + (128L*tmp3)));
                    auto tmp6 = tmp4 + tmp5;
                    tmp6.store(out_ptr0 + static_cast<long>(x1 + (128L*x0)));
                }
            }
        }
    }
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114062
Approved by: https://github.com/jansel
2023-11-22 11:19:42 +00:00
PyTorch MergeBot
dd6ef0877e Revert "[inductor cpp] vectorize embedding lookup (#114062)"
This reverts commit 2c0474c02d.

Reverted https://github.com/pytorch/pytorch/pull/114062 on behalf of https://github.com/huydhn due to Sorry for reverting your change, please help fix lint and reland it 2c0474c02d ([comment](https://github.com/pytorch/pytorch/pull/114062#issuecomment-1820526515))
2023-11-21 09:21:20 +00:00
Jiong Gong
2c0474c02d [inductor cpp] vectorize embedding lookup (#114062)
For embedding lookup, there is indirect indexing with indices that are invariant to the vectorized itervar. To vectorize it, we need to keep the related indexing variables as scalars and allow vectorization when the related index_exprs are invariant to the vectorized itervar.

This PR adds the support by lazily broadcasting scalar values (index_expr and constant) to vectors, so that vector operations are only generated by `CppVecKernel` when any of the inputs are vectors; otherwise, scalar ops are generated. The cse variable in cpp is now represented with `CppCSEVariable`, which bookkeeps the itervars relevant to the variable and carries a flag marking whether it is a scalar or a vector. `CppVecOverrides` is improved to propagate these states when the ops are executed.

For the added UT `test_embedding_vec`, the generated code before this PR is:
```c++
extern "C" void kernel(const long* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0)
{
    #pragma omp parallel num_threads(64)
    {
        {
            #pragma omp for
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(128L); x0+=static_cast<long>(1L))
            {
                #pragma GCC ivdep
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(128L); x1+=static_cast<long>(1L))
                {
                    auto tmp0 = in_ptr0[static_cast<long>(x0)];
                    auto tmp5 = in_ptr2[static_cast<long>(x1 + (128L*x0))];
                    auto tmp1 = decltype(tmp0)(tmp0 + 64);
                    auto tmp2 = tmp0 < 0;
                    auto tmp3 = tmp2 ? tmp1 : tmp0;
                    TORCH_CHECK((0 <= tmp3) & (tmp3 < 64L), "index out of bounds: 0 <= tmp3 < 64L")
                    auto tmp4 = in_ptr1[static_cast<long>(x1 + (128L*tmp3))];
                    auto tmp6 = decltype(tmp4)(tmp4 + tmp5);
                    out_ptr0[static_cast<long>(x1 + (128L*x0))] = tmp6;
                }
            }
        }
    }
}
```

After this PR, we have:
```c++
extern "C" void kernel(const long* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0)
{
    #pragma omp parallel num_threads(64)
    {
        {
            #pragma omp for
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(128L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(128L); x1+=static_cast<long>(16L))
                {
                    auto tmp0 = in_ptr0[static_cast<long>(x0)];
                    auto tmp5 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<long>(x1 + (128L*x0)));
                    auto tmp1 = decltype(tmp0)(tmp0 + 64);
                    auto tmp2 = tmp0 < 0;
                    auto tmp3 = tmp2 ? tmp1 : tmp0;
                    TORCH_CHECK((0 <= tmp3) & (tmp3 < 64L), "index out of bounds: 0 <= tmp3 < 64L")
                    auto tmp4 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x1 + (128L*tmp3)));
                    auto tmp6 = tmp4 + tmp5;
                    tmp6.store(out_ptr0 + static_cast<long>(x1 + (128L*x0)));
                }
            }
        }
    }
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114062
Approved by: https://github.com/jansel
ghstack dependencies: #113950
2023-11-21 07:37:15 +00:00
Jiong Gong
1a8d076e0c [inductor cpp] simplify test for uint8 add/sub (#113407)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113407
Approved by: https://github.com/lezcano
ghstack dependencies: #113261
2023-11-15 06:17:25 +00:00
Jiong Gong
fcdfcdeef9 [inductor cpp] fix non-contiguous reduction store (#113261)
Fix https://github.com/pytorch/pytorch/issues/113018

The reduction store in this case works on a non-contiguous buffer. Previously, we only did the scalar fallback for normal stores but not for reduction stores. This PR fixes that.
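A rough Python-level reconstruction of the pattern, inferred from the generated code below (the shapes and the exact ops are assumptions, not the literal UT):

```python
import torch

def fn(x):
    # max-reduce over dim 1, then materialize the transposed result so the
    # store along the vectorized dim is non-contiguous
    return x.amax(dim=1).t().contiguous()

x = torch.randn(39, 18, 17)
compiled = torch.compile(fn)
torch.testing.assert_close(compiled(x), fn(x))
```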

Before fix
```c++
            #pragma omp for
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(39L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(16L); x1+=static_cast<long>(16L))
                {
                    {
                        #pragma omp declare reduction(max:at::vec::Vectorized<float>:omp_out = at::vec::maximum(omp_out, omp_in)) initializer(omp_priv={at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity())})
                        float tmp_acc0 = -std::numeric_limits<float>::infinity();
                        at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity());
                        for(long x2=static_cast<long>(0L); x2<static_cast<long>(18L); x2+=static_cast<long>(1L))
                        {
                            auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x1 + (17L*x2) + (306L*x0)));
                            tmp_acc0_vec = at::vec::maximum(tmp_acc0_vec, tmp0);
                        }
                        tmp_acc0_vec.store(out_ptr1 + static_cast<long>(x0 + (39L*x1))); // this is wrong since x0 is not vector dim
                    }
                }
                #pragma omp simd simdlen(8)
                for(long x1=static_cast<long>(16L); x1<static_cast<long>(17L); x1+=static_cast<long>(1L))
                {
                    {
                        float tmp_acc0 = -std::numeric_limits<float>::infinity();
                        for(long x2=static_cast<long>(0L); x2<static_cast<long>(18L); x2+=static_cast<long>(1L))
                        {
                            auto tmp0 = in_ptr1[static_cast<long>(x1 + (17L*x2) + (306L*x0))];
                            tmp_acc0 = max_propagate_nan(tmp_acc0, tmp0);
                        }
                        out_ptr1[static_cast<long>(x0 + (39L*x1))] = tmp_acc0;
                    }
                }
            }
```

After fix
```c++
            #pragma omp for
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(39L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(16L); x1+=static_cast<long>(16L))
                {
                    {
                        #pragma omp declare reduction(max:at::vec::Vectorized<float>:omp_out = at::vec::maximum(omp_out, omp_in)) initializer(omp_priv={at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity())})
                        float tmp_acc0 = -std::numeric_limits<float>::infinity();
                        at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity());
                        for(long x2=static_cast<long>(0L); x2<static_cast<long>(18L); x2+=static_cast<long>(1L))
                        {
                            auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x1 + (17L*x2) + (306L*x0)));
                            tmp_acc0_vec = at::vec::maximum(tmp_acc0_vec, tmp0);
                        }
                        { __at_align__ float tmpbuf[16*sizeof(float)/sizeof(float)]; tmp_acc0_vec.store(tmpbuf); for (long x1_inner = 0; x1_inner < 16; x1_inner++) out_ptr1[static_cast<long>(x0 + (39L*x1) + (39L*x1_inner))] = tmpbuf[x1_inner]; }
                    }
                }
                #pragma omp simd simdlen(8)
                for(long x1=static_cast<long>(16L); x1<static_cast<long>(17L); x1+=static_cast<long>(1L))
                {
                    {
                        float tmp_acc0 = -std::numeric_limits<float>::infinity();
                        for(long x2=static_cast<long>(0L); x2<static_cast<long>(18L); x2+=static_cast<long>(1L))
                        {
                            auto tmp0 = in_ptr1[static_cast<long>(x1 + (17L*x2) + (306L*x0))];
                            tmp_acc0 = max_propagate_nan(tmp_acc0, tmp0);
                        }
                        out_ptr1[static_cast<long>(x0 + (39L*x1))] = tmp_acc0;
                    }
                }
            }
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113261
Approved by: https://github.com/lezcano
2023-11-15 03:27:17 +00:00
PyTorch MergeBot
6bffde99b0 Revert "[inductor] Move things into torch/testing/_internal/inductor_utils.py (#113275)"
This reverts commit 66d09f8217.

Reverted https://github.com/pytorch/pytorch/pull/113275 on behalf of https://github.com/huydhn due to Sorry for reverting your stack, but it is failing to list test internally with buck2 ([comment](https://github.com/pytorch/pytorch/pull/113275#issuecomment-1811666004))
2023-11-15 01:44:26 +00:00
PyTorch MergeBot
1e60174891 Revert "[dynamo] Add run_inductor_tests entrypoint (#113278)"
This reverts commit b00311ce9e.

Reverted https://github.com/pytorch/pytorch/pull/113278 on behalf of https://github.com/huydhn due to Sorry for reverting your stack, but it is failing to list test internally with buck2 ([comment](https://github.com/pytorch/pytorch/pull/113278#issuecomment-1811646325))
2023-11-15 01:19:48 +00:00
Jason Ansel
b00311ce9e [dynamo] Add run_inductor_tests entrypoint (#113278)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113278
Approved by: https://github.com/yanboliang
2023-11-11 08:54:43 +00:00
Jason Ansel
66d09f8217 [inductor] Move things into torch/testing/_internal/inductor_utils.py (#113275)
This PR just moves things around, so that code shared by multiple test files lives in torch/testing/_internal/inductor_utils.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113275
Approved by: https://github.com/yanboliang
ghstack dependencies: #113242
2023-11-11 03:17:35 +00:00
PyTorch MergeBot
68bf0f1e7d Revert "[inductor] Move things into torch/testing/_internal/inductor_utils.py (#113275)"
This reverts commit c967dc526a.

Reverted https://github.com/pytorch/pytorch/pull/113275 on behalf of https://github.com/PaliC due to the diff this is stacked on top of appears to be causing inductor failures internally ([comment](https://github.com/pytorch/pytorch/pull/113275#issuecomment-1805131017))
2023-11-10 05:40:55 +00:00
Jiong Gong
cb48f7855a [inductor cpu] fix uint8 add and sub (#113253)
Fix https://github.com/pytorch/pytorch/issues/113016 and https://github.com/pytorch/pytorch/issues/113020 and https://github.com/pytorch/pytorch/issues/113141 and https://github.com/pytorch/pytorch/issues/113143 and https://github.com/pytorch/pytorch/issues/113144
Explicitly typecast the result of add/sub to uint8 (similar to how we fixed mul previously) to avoid C++'s implicit integer promotion.
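A minimal illustration of the wrap-around semantics the fix preserves (hypothetical values):

```python
import torch

def fn(a, b):
    return a + b, a - b   # uint8 math must wrap around mod 256, as in eager

a = torch.tensor([250, 5], dtype=torch.uint8)
b = torch.tensor([10, 7], dtype=torch.uint8)
compiled = torch.compile(fn)
print(fn(a, b))           # (tensor([ 4, 12], ...), tensor([240, 254], ...))
print(compiled(a, b))     # should match eager after the fix
```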

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113253
Approved by: https://github.com/lezcano, https://github.com/jansel
2023-11-10 04:06:42 +00:00
Jason Ansel
c967dc526a [inductor] Move things into torch/testing/_internal/inductor_utils.py (#113275)
This PR just moves things around, so that code shared by multiple test files lives in torch/testing/_internal/inductor_utils.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113275
Approved by: https://github.com/yanboliang
2023-11-10 00:11:09 +00:00
Jez Ng
ae85ba820f [inductor] Memory planning (#112178)
This was originally @jansel's PR:
https://github.com/pytorch/pytorch/pull/102625, which I've built upon.

This diff implements static memory planning. It's disabled by default
while we examine its performance.

We use a greedy-by-size approach. For dynamic shapes, the sizes of the
example inputs are used as estimates when making planning decisions. We
generate expressions to calculate the actual memory offsets and sizes at
runtime when the values of the dynamic shapes are known. In order to
simplify these calculations, we have organized the allocations into a
tree that branches on space (address offsets) and time (live ranges).
Finally, we need to align these offsets, so we have added an `align`
sympy Expr to express these calculations.
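A toy sketch of the greedy-by-size idea described above (not the Inductor implementation; the helper name, data layout, and alignment value are made up):

```python
def plan(buffers, align=64):
    """buffers: list of (name, size_bytes, (start, end)) live ranges, end exclusive."""
    placed = []                       # (offset, size, (start, end))
    offsets = {}
    for name, size, live in sorted(buffers, key=lambda b: -b[1]):  # largest first
        offset = 0
        moved = True
        while moved:                  # re-check every placement after each bump
            moved = False
            for o, s, (ls, le) in placed:
                time_overlap = live[0] < le and ls < live[1]
                space_overlap = offset < o + s and o < offset + size
                if time_overlap and space_overlap:
                    offset = -(-(o + s) // align) * align   # bump past it, align up
                    moved = True
        placed.append((offset, size, live))
        offsets[name] = offset
    return offsets

# "a" and "c" do not overlap in time, so "c" reuses a's offset
print(plan([("a", 1024, (0, 2)), ("b", 512, (1, 3)), ("c", 256, (2, 4))]))
# {'a': 0, 'b': 1024, 'c': 0}
```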

Some limitations:

1. It is only enabled during inference for now. Enabling it for training
   increases peak memory usage as we allocate all the memory needed for
   training upfront, before freeing the memory allocated during
   inference. We can probably address this by doing planning for both
   the inference and training passes together.
2. It doesn't work with PyTorch Distributed, because kernels like
   AllGatherIntoTensor codegen strings which do memory operations. We
   can fix this down the line by having them emit MemoryPlanningLines
   instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112178
Approved by: https://github.com/desertfire, https://github.com/jansel
2023-11-02 07:39:13 +00:00
PyTorch MergeBot
74e6c877e9 Revert "[inductor] Memory planning (#112178)"
This reverts commit f64a97c6f8.

Reverted https://github.com/pytorch/pytorch/pull/112178 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it seems that ROCm will need to be fixed for the new test too f64a97c6f8 ([comment](https://github.com/pytorch/pytorch/pull/112178#issuecomment-1788195311))
2023-11-01 00:03:56 +00:00
Jon Chuang
53acdb66f7 [primtorch] aten.normal decomp has wrong return type due to elementwise_type_promotion_wrapper (#112467)
Fixes https://github.com/pytorch/pytorch/issues/112449

elementwise_type_promotion_wrapper promotes `aten.normal` to the dtype of its `mean` and `std` args.

But this is incorrect if a dtype param is provided, so we allow overriding the result_dtype when an explicit dtype arg is available.

This problem is unique to `aten.normal`; none of the other decorated ops have a dtype param.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112467
Approved by: https://github.com/lezcano
2023-10-31 20:57:09 +00:00
Jez Ng
f64a97c6f8 [inductor] Memory planning (#112178)
This was originally @jansel's PR:
https://github.com/pytorch/pytorch/pull/102625, which I've built upon.

This diff implements static memory planning. It's disabled by default
while we examine its performance.

We use a greedy-by-size approach. For dynamic shapes, the sizes of the
example inputs are used as estimates when making planning decisions. We
generate expressions to calculate the actual memory offsets and sizes at
runtime when the values of the dynamic shapes are known. In order to
simplify these calculations, we have organized the allocations into a
tree that branches on space (address offsets) and time (live ranges).
Finally, we need to align these offsets, so we have added an `align`
sympy Expr to express these calculations.

Some limitations:

1. It is only enabled during inference for now. Enabling it for training
   increases peak memory usage as we allocate all the memory needed for
   training upfront, before freeing the memory allocated during
   inference. We can probably address this by doing planning for both
   the inference and training passes together.
2. It doesn't work with PyTorch Distributed, because kernels like
   AllGatherIntoTensor codegen strings which do memory operations. We
   can fix this down the line by having them emit MemoryPlanningLines
   instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112178
Approved by: https://github.com/desertfire, https://github.com/jansel
2023-10-31 20:02:30 +00:00
Jiong Gong
a1c56df1f0 [inductor cpp] vectorize support for truediv (#112234)
Ops like group_norm use `ops.truediv`, which doesn't have vectorization support yet. This PR adds that support.
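A rough Python-level sketch of `test_group_norm_vec` (shapes inferred from the generated kernels below; treat them as assumptions):

```python
import torch
import torch.nn as nn

m = nn.GroupNorm(32, 32).eval()   # the normalization divides by the group size -> ops.truediv
x = torch.randn(2, 32, 32, 32)
with torch.no_grad():
    ref = m(x)
    out = torch.compile(m)(x)
torch.testing.assert_close(out, ref, rtol=1e-4, atol=1e-4)
```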

`test_group_norm_vec`
Before:
```c++
extern "C" void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0,
                       float* out_ptr1,
                       float* out_ptr2)
{
    #pragma omp parallel num_threads(64)
    {
        {
            #pragma omp for
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(64L); x0+=static_cast<long>(1L))
            {
                {
                    #pragma omp declare reduction(welford:Welford<float>:omp_out = welford_combine(omp_out, omp_in)) initializer(omp_priv={Welford<float>()})
                    #pragma omp declare reduction(welford:Welford<at::vec::Vectorized<float>>:omp_out = welford_combine(omp_out, omp_in)) initializer(omp_priv={Welford<at::vec::Vectorized<float>>()})
                    Welford<float> tmp_acc0 = Welford<float>();
                    Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                    for(long x1=static_cast<long>(0L); x1<static_cast<long>(1024L); x1+=static_cast<long>(16L))
                    {
                        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x1 + (1024L*x0)));
                        tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0);
                    }
                    tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec));
                    out_ptr0[static_cast<long>(x0)] = static_cast<float>(tmp_acc0.mean);
                    out_ptr1[static_cast<long>(x0)] = static_cast<float>(tmp_acc0.m2);
                }
            }
        }
        {
            #pragma omp for  collapse(2)
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(32L); x1+=static_cast<long>(1L))
                {
                    #pragma GCC ivdep
                    for(long x2=static_cast<long>(0L); x2<static_cast<long>(1024L); x2+=static_cast<long>(1L))
                    {
                        auto tmp0 = in_ptr0[static_cast<long>(x2 + (1024L*x1) + (32768L*x0))];
                        auto tmp1 = out_ptr0[static_cast<long>(x1 + (32L*x0))];
                        auto tmp3 = out_ptr1[static_cast<long>(x1 + (32L*x0))];
                        auto tmp10 = in_ptr1[static_cast<long>(x1)];
                        auto tmp12 = in_ptr2[static_cast<long>(x1)];
                        auto tmp2 = tmp0 - tmp1;
                        auto tmp4 = c10::convert<float>(1024.0);
                        auto tmp5 = tmp3 / tmp4;
                        auto tmp6 = c10::convert<float>(1e-05);
                        auto tmp7 = tmp5 + tmp6;
                        auto tmp8 = 1 / std::sqrt(tmp7);
                        auto tmp9 = decltype(tmp2)(tmp2 * tmp8);
                        auto tmp11 = decltype(tmp9)(tmp9 * tmp10);
                        auto tmp13 = tmp11 + tmp12;
                        out_ptr2[static_cast<long>(x2 + (1024L*x1) + (32768L*x0))] = tmp13;
                    }
                }
            }
        }
    }
}
```

After:
```c++
extern "C" void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0,
                       float* out_ptr1,
                       float* out_ptr2)
{
    #pragma omp parallel num_threads(64)
    {
        {
            #pragma omp for
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(64L); x0+=static_cast<long>(1L))
            {
                {
                    #pragma omp declare reduction(welford:Welford<float>:omp_out = welford_combine(omp_out, omp_in)) initializer(omp_priv={Welford<float>()})
                    #pragma omp declare reduction(welford:Welford<at::vec::Vectorized<float>>:omp_out = welford_combine(omp_out, omp_in)) initializer(omp_priv={Welford<at::vec::Vectorized<float>>()})
                    Welford<float> tmp_acc0 = Welford<float>();
                    Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                    for(long x1=static_cast<long>(0L); x1<static_cast<long>(1024L); x1+=static_cast<long>(16L))
                    {
                        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x1 + (1024L*x0)));
                        tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0);
                    }
                    tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec));
                    out_ptr0[static_cast<long>(x0)] = static_cast<float>(tmp_acc0.mean);
                    out_ptr1[static_cast<long>(x0)] = static_cast<float>(tmp_acc0.m2);
                }
            }
        }
        {
            #pragma omp for  collapse(2)
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(32L); x1+=static_cast<long>(1L))
                {
                    for(long x2=static_cast<long>(0L); x2<static_cast<long>(1024L); x2+=static_cast<long>(16L))
                    {
                        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x2 + (1024L*x1) + (32768L*x0)));
                        auto tmp1 = at::vec::Vectorized<float>(static_cast<float>(out_ptr0[static_cast<long>(x1 + (32L*x0))]));
                        auto tmp3 = at::vec::Vectorized<float>(static_cast<float>(out_ptr1[static_cast<long>(x1 + (32L*x0))]));
                        auto tmp10 = at::vec::Vectorized<float>(static_cast<float>(in_ptr1[static_cast<long>(x1)]));
                        auto tmp12 = at::vec::Vectorized<float>(static_cast<float>(in_ptr2[static_cast<long>(x1)]));
                        auto tmp2 = tmp0 - tmp1;
                        auto tmp4 = at::vec::Vectorized<float>(static_cast<float>(1024.0));
                        auto tmp5 = tmp3 / tmp4;
                        auto tmp6 = at::vec::Vectorized<float>(static_cast<float>(1e-05));
                        auto tmp7 = tmp5 + tmp6;
                        auto tmp8 = tmp7.rsqrt();
                        auto tmp9 = tmp2 * tmp8;
                        auto tmp11 = tmp9 * tmp10;
                        auto tmp13 = tmp11 + tmp12;
                        tmp13.store(out_ptr2 + static_cast<long>(x2 + (1024L*x1) + (32768L*x0)));
                    }
                }
            }
        }
    }
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112234
Approved by: https://github.com/lezcano, https://github.com/jansel
2023-10-31 17:15:21 +00:00
PyTorch MergeBot
d641450180 Revert "[cpu][inductor] improve cpu vec implementations of log (#111898)"
This reverts commit b570320364.

Reverted https://github.com/pytorch/pytorch/pull/111898 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/111898#issuecomment-1780263780))
2023-10-26 01:12:19 +00:00
Liao, Xuan
b570320364 [cpu][inductor] improve cpu vec implementations of log (#111898)
Fixes #110611.

Torchinductor's current `log` implementation calls `sleef` functions via `aten::Vec`, which show worse performance than ATen's `log` implementation that invokes `MKL` functions. The reason is that those `sleef` algorithms sacrifice performance for higher precision. This PR switches Torchinductor's `log` implementation from the `sleef` functions with a `1.0` ULP error bound to the ones with a `3.5` ULP error bound.

**Performance**
Machine: ICX

The original perf number, perf with `Sleef_logf16_u10`:
```bash
numactl -C0 python test.py
log
eager:    368.8463559374213
compiled: 616.8672097846866
logit
eager:    565.499295014888
compiled: 1010.4096410796046
```

Perf with `Sleef_logf16_u35`:
```bash
numactl -C0 python test.py
log
eager:    364.8629770614207
compiled: 360.2141812443733
logit
eager:    562.3160391114652
compiled: 545.2622110024095
```
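The `test.py` script is not included in the message; a plausible micro-benchmark along these lines (hypothetical shapes and iteration counts) is:

```python
import time
import torch

def bench(fn, x, iters=200):
    for _ in range(10):                # warm-up (also triggers compilation)
        fn(x)
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(x)
    return (time.perf_counter() - t0) / iters * 1e6   # us per call

x = torch.rand(1024, 1024) * 0.8 + 0.1   # keep values in (0, 1) so logit is well-defined
for name, op in (("log", torch.log), ("logit", torch.logit)):
    print(name)
    print("  eager:   ", bench(op, x))
    print("  compiled:", bench(torch.compile(op), x))
```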

**Accuracy**
error_bound | tol=1e-6 | tol=1e-7
-- | -- | --
1.0 ULP | PASS | FAIL
3.5 ULP | PASS | FAIL

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111898
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/jansel
2023-10-25 01:26:39 +00:00
Elias Ellison
0a147fd112 Pointwise fuse cat with pointwise inputs or outputs and <= 4 inputs (#111233)
Improves perf of llama_v2 locally from 1.55x -> 1.57x.

The initial heuristic is to lower cat to pointwise if the number of inputs is <= 4 and all the inputs are pointwise or cannot be memory-planned away, or if all the outputs are pointwise.

The perf run was +3% on inference. There are definitely instances where we should lower to foreach kernels instead, but they are less flexible for fusion. The motivating example was:

```
def rotate_half(x):
    """Rotates half the hidden dims of the input."""
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb(q, k, cos, sin):
    iota =  torch.ops.prims.iota.default(512, start = 0, step = 1, dtype = torch.int64, device = device(type='cuda', index=0), requires_grad = False)

    # File: /scratch/eellison/work/torchdynamo/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py:657, code: position_ids = position_ids.unsqueeze(0).view(-1, seq_length)
    unsqueeze = torch.ops.aten.unsqueeze.default(iota, 0)
    position_ids = torch.ops.aten.reshape.default(unsqueeze, [-1, 512]);  unsqueeze = None

    # The first two dimensions of cos and sin are always 1, so we can `squeeze` them.
    cos = cos.squeeze(1).squeeze(0)  # [seq_len, dim]
    sin = sin.squeeze(1).squeeze(0)  # [seq_len, dim]
    cos = cos[position_ids].unsqueeze(1)  # [bs, 1, seq_len, dim]
    sin = sin[position_ids].unsqueeze(1)  # [bs, 1, seq_len, dim]
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed
```

I'm also not sure whether I should be more worried about concatenating reduction->pointwise inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111233
Approved by: https://github.com/Chillee
2023-10-21 02:34:05 +00:00
Aleksei Nikiforov
ba04d84089 S390x inductor support (#111367)
Use arch compile flags. They are needed for vectorization support on s390x.
Implement new helper functions for inductor.

This change fixes multiple tests in test_cpu_repro.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111367
Approved by: https://github.com/ezyang
2023-10-20 19:38:46 +00:00
Oleg Khabinov
8209bbbd06 [AOTInductor] Improve validation for C++ wrapper codegen (#111102)
It's a reimplementation of #111089

1. When using fake inputs, make sure they are on the same device as the original inputs.
2. Don't change the value of self.cpp_wrapper from True to False when a C++ wrapper can't be generated; instead, check and fail early to avoid feeding Python code to the C++ compiler.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111102
Approved by: https://github.com/desertfire, https://github.com/jgong5, https://github.com/chunyuan-w
2023-10-13 08:46:17 +00:00
chunyuan
20dabea35d Inductor cpp wrapper: support MkldnnRnnLayer (#107858)
1. Directly use the `codegen` function of the parent class, which already supports both the Python and cpp wrappers.
2. The output of the `at::mkldnn_rnn_layer` OP is actually a `std::tuple` (see 1491bae277/aten/src/ATen/native/mkldnn/RNN.cpp (L218)); fix the type when calling `MultiOutput`. A minimal usage sketch follows.
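A minimal sketch of the kind of model this covers (hypothetical shapes; the `cpp_wrapper` config flag is assumed to be the relevant switch of that era):

```python
import torch
import torch.nn as nn
import torch._inductor.config as inductor_config

inductor_config.cpp_wrapper = True          # exercise the cpp wrapper codegen

m = nn.LSTM(16, 32, num_layers=1, batch_first=True).eval()
x = torch.randn(4, 8, 16)
with torch.no_grad():
    out, (h, c) = torch.compile(m)(x)       # lowers to at::mkldnn_rnn_layer on CPU
```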

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107858
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-09-29 00:22:42 +00:00
XiaobingSuper
a6b153b311 inductor: remove redundant memory copy when view a ExternKernelAlloc buffer (#108635)
When viewing an ExternKernelAlloc buffer, there is always a redundant memory copy:
```
buf0: ExternKernelSchedulerNode(MKLPackedLinear)
buf0.writes = [StarDep(name='buf0')]
buf0.unmet_dependencies = []
buf0.met_dependencies = [StarDep(name='arg1_1'), StarDep(name='constant0'), StarDep(name='constant1')]
buf0.users = [NodeUser(node=SchedulerNode(name='buf1'), can_inplace=True, is_weak=False)]
buf0.node.kernel = torch.ops.mkl._mkl_linear

buf1: SchedulerNode(ComputedBuffer)
buf1.writes = [MemoryDep('buf1', c0, {c0: 64})]
buf1.unmet_dependencies = [MemoryDep('buf0', c0, {c0: 64})]
buf1.met_dependencies = []
buf1.users = [NodeUser(node=OUTPUT, can_inplace=False, is_weak=False)]
buf1.group.device = cpu
buf1.group.iteration = ((64,), ())
buf1.sizes = ([64], [])
class buf1_loop_body:
    var_ranges = {z0: 64}
    index0 = z0
    def body(self, ops):
        get_index = self.get_index('index0')
        load = ops.load('buf0', get_index)
        get_index_1 = self.get_index('index0')
        store = ops.store('buf1', get_index_1, load, None)
        return store
```

and the cpp backend-generated code is:
```
cpp_fused_view_0 = async_compile.cpp('''
#include "/tmp/torchinductor_xiaobing/ib/cibrnuq56cxamjj4krp4zpjvsirbmlolpbnmomodzyd46huzhdw7.h"
extern "C" void kernel(float* in_out_ptr0)
{
    #pragma omp parallel num_threads(40)
    {
        {
            #pragma omp for
            for(long i0=static_cast<long>(0L); i0<static_cast<long>(64L); i0+=static_cast<long>(16L))
            {
                auto tmp0 = at::vec::Vectorized<float>::loadu(in_out_ptr0 + static_cast<long>(i0));
                tmp0.store(in_out_ptr0 + static_cast<long>(i0));
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg1_1, = args
    args.clear()
    assert_size_stride(arg1_1, (4, 16), (16, 1))
    buf0 = torch.ops.mkl._mkl_linear(arg1_1, constant1, constant0, None, 4)
    del arg1_1
    buf1 = reinterpret_tensor(buf0, (4, 4, 4), (16, 4, 1)); del buf0  # reuse
    cpp_fused_view_0(c_void_p(buf1.data_ptr()))
    return (buf1, )
```

For an ExternKernelAlloc buffer, we can do a real view rather than a memory copy.
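A rough Python-level sketch of the pattern above (shapes taken from the generated code; the module itself is an assumption):

```python
import torch
import torch.nn as nn

class M(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(16, 16)

    def forward(self, x):
        # view of the mkl-packed linear output (an ExternKernelAlloc buffer)
        return self.linear(x).view(4, 4, 4)

m = M().eval()
x = torch.randn(4, 16)
with torch.no_grad():
    out = torch.compile(m)(x)
```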

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108635
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/jansel
ghstack dependencies: #108560
2023-09-11 01:19:37 +00:00
XiaobingSuper
a6ada463ec inductor: make onednn linear inputs are always real contiguous (#108560)
For OneDNN linear, if the packed linear inputs are not default-contiguous tensors, it always falls into the reference path and gets worse performance. This PR forces its inputs to be actual default-contiguous tensors.
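A minimal sketch of the kind of input layout in question (a hypothetical module, not the actual test):

```python
import torch
import torch.nn as nn

class M(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(32, 32)

    def forward(self, x):
        # the transposed input is not default-contiguous when it reaches the packed linear
        return self.linear(x.transpose(1, 2))

m = M().eval()
x = torch.randn(2, 32, 8)
with torch.no_grad():
    out = torch.compile(m)(x)
```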

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108560
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/jansel
2023-09-11 01:11:36 +00:00
XiaobingSuper
cbf7c91883 inductor: make fallback for cpu scatter_add (#108220)
For the Inductor CPU backend, `scatter_add` uses `atomic_add`, which gets worse performance; for now, we make it a fallback to avoid a performance regression compared with eager mode (single socket of SKX). A sketch of why the atomic add hurts follows the numbers:
```
basic_gnn_gin 1.16x (after) vs 0.509x (before)

basic_gnn_sage 1.064x (after) vs 0.496x (before)

basic_gnn_gcn 1.373x (after) vs 0.720x (before)
```
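
A minimal sketch in plain C++/OpenMP (not the Inductor-generated kernel; `scatter_add_cpu` is just an illustrative name): concurrent iterations may target the same output slot, so every addition has to be atomic, which serializes the hot loop and blocks vectorization.

```c++
// Minimal sketch, not the generated kernel: a parallel scatter-add must
// guard against two iterations writing the same slot, hence the atomic
// add that the fallback to eager avoids.
#include <cstdint>

void scatter_add_cpu(float* out, const int64_t* index,
                     const float* src, int64_t n) {
    #pragma omp parallel for
    for (int64_t i = 0; i < n; ++i) {
        // index[i] may repeat across iterations, so the update is atomic.
        #pragma omp atomic
        out[index[i]] += src[i];
    }
}
```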

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108220
Approved by: https://github.com/jgong5, https://github.com/desertfire
2023-08-31 16:11:07 +00:00
leslie-fang-intel
fdbc2ec5cb [Quant][Inductor] Fix the non contiguous load with uint8 data type (#106958)
**Summary**
Currently, the vectorized-load code generation with `non_contiguous` and the `uint8` data type has an issue in determining the data type. It caused wrong results in the `shufflenet_v2_x1_0` model after we enabled the `cat` quantization recipe.

- Previously, the code gen for the example in this PR was:

```
cpp_fused_clone_view_0 = async_compile.cpp('''
#include "/tmp/torchinductor_root/i5/ci5uspp363v3ky6jkccllm3bxudy2fkdpqinkqhmpehfihejs7ko.h"
extern "C" void kernel(const unsigned char* in_ptr0,
                       float* out_ptr0)
{
    #pragma omp parallel num_threads(56)
    {
        {
            #pragma omp for
            for(long i0=static_cast<long>(0L); i0<static_cast<long>(232L); i0+=static_cast<long>(1L))
            {
                for(long i1=static_cast<long>(0L); i1<static_cast<long>(784L); i1+=static_cast<long>(16L))
                {
                    auto tmp0 = ([&]() { __at_align__ float tmpbuf[16]; for (long i1_inner = 0; i1_inner < 16; i1_inner++) tmpbuf[i1_inner] = flag_to_float_scalar(in_ptr0[static_cast<long>((116L*(static_cast<long>(i0) % static_cast<long>(2L))) + (232L*i1) + (232L*i1_inner) + (at::native::div_floor_integer(i0, 2L)))]); return at::vec::Vectorized<uint8_t>::loadu_one_fourth(tmpbuf); })();
                    auto tmp1 = at::vec::convert_uint8_to_float(tmp0);
                    auto tmp2 = at::vec::Vectorized<float>(static_cast<float>(0.0));
                    auto tmp3 = tmp1 - tmp2;
                    auto tmp4 = at::vec::Vectorized<float>(static_cast<float>(1.0));
                    auto tmp5 = tmp3 * tmp4;
                    auto tmp6 = tmp5 * tmp4;
                    auto tmp7 = tmp6.round();
                    auto tmp8 = tmp7 + tmp2;
                    auto tmp9 = at::vec::maximum(tmp8, tmp2);
                    auto tmp10 = at::vec::Vectorized<float>(static_cast<float>(255.0));
                    auto tmp11 = at::vec::minimum(tmp9, tmp10);
                    auto tmp12 = at::vec::convert_float_to_uint8(tmp11);
                    auto tmp13 = at::vec::convert_uint8_to_float(tmp12);
                    auto tmp14 = tmp13 - tmp2;
                    auto tmp15 = tmp14 * tmp4;
                    tmp15.store(out_ptr0 + static_cast<long>(i1 + (784L*i0)));
                }
            }
        }
    }
}
''')
```

- After this PR, the code gen is:

```
cpp_fused_clone_view_0 = async_compile.cpp('''
#include "/tmp/torchinductor_root/i5/ci5uspp363v3ky6jkccllm3bxudy2fkdpqinkqhmpehfihejs7ko.h"
extern "C" void kernel(const unsigned char* in_ptr0,
                       float* out_ptr0)
{
    #pragma omp parallel num_threads(56)
    {
        {
            #pragma omp for
            for(long i0=static_cast<long>(0L); i0<static_cast<long>(232L); i0+=static_cast<long>(1L))
            {
                for(long i1=static_cast<long>(0L); i1<static_cast<long>(784L); i1+=static_cast<long>(16L))
                {
                    auto tmp0 = ([&]() { __at_align__ unsigned char tmpbuf[16]; for (long i1_inner = 0; i1_inner < 16; i1_inner++) tmpbuf[i1_inner] = in_ptr0[static_cast<long>((116L*(static_cast<long>(i0) % static_cast<long>(2L))) + (232L*i1) + (232L*i1_inner) + (at::native::div_floor_integer(i0, 2L)))]; return at::vec::Vectorized<uint8_t>::loadu_one_fourth(tmpbuf); })();
                    auto tmp1 = at::vec::convert_uint8_to_float(tmp0);
                    auto tmp2 = at::vec::Vectorized<float>(static_cast<float>(0.0));
                    auto tmp3 = tmp1 - tmp2;
                    auto tmp4 = at::vec::Vectorized<float>(static_cast<float>(1.0));
                    auto tmp5 = tmp3 * tmp4;
                    auto tmp6 = tmp5 * tmp4;
                    auto tmp7 = tmp6.round();
                    auto tmp8 = tmp7 + tmp2;
                    auto tmp9 = at::vec::maximum(tmp8, tmp2);
                    auto tmp10 = at::vec::Vectorized<float>(static_cast<float>(255.0));
                    auto tmp11 = at::vec::minimum(tmp9, tmp10);
                    auto tmp12 = at::vec::convert_float_to_uint8(tmp11);
                    auto tmp13 = at::vec::convert_uint8_to_float(tmp12);
                    auto tmp14 = tmp13 - tmp2;
                    auto tmp15 = tmp14 * tmp4;
                    tmp15.store(out_ptr0 + static_cast<long>(i1 + (784L*i0)));
                }
            }
        }
    }
}
''')
```
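
The key change is the staging buffer's type (`unsigned char` instead of `float`, and no `flag_to_float_scalar` conversion), so the byte-wise `loadu_one_fourth` reads the intended `uint8` bytes. A tiny standalone illustration of why that matters (plain C++, not the Inductor code):

```c++
// Standalone illustration, not the generated kernel: staging uint8 data in
// a float buffer changes the raw bytes that a byte-wise load later sees.
#include <cstdint>
#include <cstring>
#include <iostream>

int main() {
    uint8_t src[4] = {1, 2, 3, 4};

    float wrong_stage[4];    // the buggy staging type
    uint8_t right_stage[4];  // the fixed staging type
    for (int i = 0; i < 4; ++i) {
        wrong_stage[i] = src[i];
        right_stage[i] = src[i];
    }

    // What a byte-wise loadu would read from each buffer.
    uint8_t from_wrong[4];
    std::memcpy(from_wrong, wrong_stage, 4);
    std::cout << int(from_wrong[0]) << " vs " << int(right_stage[0]) << "\n";
    // prints "0 vs 1": the float staging no longer holds the uint8 bytes.
    return 0;
}
```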

**Test Plan**
```
clear && python -m pytest test_cpu_repro.py -k test_non_contiguous_load_buf_quant
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106958
Approved by: https://github.com/jgong5, https://github.com/eellison
ghstack dependencies: #106836, #106838
2023-08-26 16:58:45 +00:00
XiaobingSuper
d2105a8688 inductor: support masked load for cpu path (#107670)
For the following generated max-pooling code:

```

#pragma GCC ivdep
                    for(long i2=static_cast<long>(0L); i2<static_cast<long>(56L); i2+=static_cast<long>(1L))
                    {
                        for(long i3=static_cast<long>(0L); i3<static_cast<long>(64L); i3+=static_cast<long>(16L))
                        {
                            auto tmp0 = at::vec::Vectorized<int>(static_cast<int>((-1L) + (2L*i1)));
                            auto tmp1 = at::vec::Vectorized<int>(static_cast<int>(0));
                            auto tmp2 = to_float_mask(tmp0 >= tmp1);
                            auto tmp3 = at::vec::Vectorized<int>(static_cast<int>(112));
                            auto tmp4 = to_float_mask(tmp0 < tmp3);
                            auto tmp5 = tmp2 & tmp4;
                            auto tmp6 = at::vec::Vectorized<int>(static_cast<int>((-1L) + (2L*i2)));
                            auto tmp7 = to_float_mask(tmp6 >= tmp1);
                            auto tmp8 = to_float_mask(tmp6 < tmp3);
                            auto tmp9 = tmp7 & tmp8;
                            auto tmp10 = tmp5 & tmp9;
                            auto tmp11 = [&]
                            {
                                auto tmp12 = at::vec::Vectorized<bfloat16>::loadu(in_ptr0 + static_cast<long>((-7232L) + i3 + (128L*i2) + (14336L*i1) + (802816L*i0)), 16);
                                auto tmp13 = cvt_lowp_fp_to_fp32<bfloat16>(tmp12);

                                return tmp13;
                            }
                            ;
                            auto tmp14 = decltype(tmp11())::blendv(at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity()), tmp11(), to_float_mask(tmp10));
```

The index of `tmp12` may not be a valid index: for example, with `i1=0, i2=0, i3=0` the index is `-7232L`, which is out of bounds. We may hit a segmentation fault when we call `tmp11()`. The intended behavior is that the value is only read when `tmp10` (the index-check variable) is true; this PR adds masked-load support to fix the issue, as sketched below.
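
A simplified scalar sketch of the masked-load idea (not the actual `at::vec` helpers): only touch the source pointer for lanes whose mask is set, so an out-of-range offset under a false mask can never fault.

```c++
// Simplified scalar sketch of a masked load, not the at::vec implementation:
// lanes with a false mask never dereference ptr, so an out-of-bounds offset
// that is masked off cannot cause a segmentation fault.
#include <array>
#include <cstddef>

template <std::size_t N>
std::array<float, N> masked_load(const float* ptr,
                                 const std::array<bool, N>& mask,
                                 float other) {
    std::array<float, N> out{};
    for (std::size_t i = 0; i < N; ++i) {
        out[i] = mask[i] ? ptr[i] : other;  // load only when the lane is valid
    }
    return out;
}
```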

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107670
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-08-25 21:11:09 +00:00
XiaobingSuper
610f64d72a inductor: also check index_exp when select tiling var (#106765)
When selecting the tiling var, we currently only consider loads and stores and do not take index expressions into account, which leads to accuracy issues:

before (the index exp `i1-1` cannot be vectorized):
```
cpp_fused_constant_pad_nd_mul_0 = async_compile.cpp('''
#include "/tmp/torchinductor_xiaobing/i5/ci5uspp363v3ky6jkccllm3bxudy2fkdpqinkqhmpehfihejs7ko.h"
extern "C" void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       float* out_ptr0)
{
    #pragma omp parallel num_threads(40)
    {
        {
            #pragma omp for
            for(long i0=static_cast<long>(0L); i0<static_cast<long>(64L); i0+=static_cast<long>(1L))
            {
                #pragma GCC ivdep
                for(long i1=static_cast<long>(0L); i1<static_cast<long>(3136L); i1+=static_cast<long>(16L))
                {
                    #pragma GCC ivdep
                    for(long i2=static_cast<long>(0L); i2<static_cast<long>(8L); i2+=static_cast<long>(1L))
                    {
                        auto tmp0 = at::vec::Vectorized<int>(static_cast<int>((-1L) + i1));
                        auto tmp1 = at::vec::Vectorized<int>(static_cast<int>(0));
                        auto tmp2 = to_float_mask(tmp0 >= tmp1);
                        auto tmp3 = [&]
                        {
                            auto tmp4 = ([&]() { __at_align__ float tmpbuf[16]; for (long i1_inner = 0; i1_inner < 16; i1_inner++) tmpbuf[i1_inner] = in_ptr0[static_cast<long>((-8L) + i2 + (8L*i1) + (8L*i1_inner) + (25088L*i0))]; return at::vec::Vectorized<float>::loadu(tmpbuf); })();
                            auto tmp5 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>((-1L) + i1 + (3136L*i2) + (25088L*i0)));
                            auto tmp6 = tmp4 * tmp5;
                            return tmp6;
                        }
                        ;
                        auto tmp7 = decltype(tmp3())::blendv(at::vec::Vectorized<float>(0.0), tmp3(), to_float_mask(tmp2));
                        { __at_align__ float tmpbuf[16*sizeof(float)/sizeof(float)]; tmp7.store(tmpbuf); for (long i1_inner = 0; i1_inner < 16; i1_inner++) out_ptr0[static_cast<long>(i2 + (8L*i1) + (8L*i1_inner) + (25096L*i0))] = tmpbuf[i1_inner]; }
                    }
                }
                #pragma GCC ivdep
                for(long i1=static_cast<long>(3136L); i1<static_cast<long>(3137L); i1+=static_cast<long>(1L))
                {
                    #pragma GCC ivdep
                    for(long i2=static_cast<long>(0L); i2<static_cast<long>(8L); i2+=static_cast<long>(1L))
                    {
                        auto tmp0 = static_cast<long>((-1L) + i1);
                        auto tmp1 = static_cast<long>(0);
                        auto tmp2 = tmp0 >= tmp1;
                        auto tmp3 = [&]
                        {
                            auto tmp4 = in_ptr0[static_cast<long>((-8L) + i2 + (8L*i1) + (25088L*i0))];
                            auto tmp5 = in_ptr1[static_cast<long>((-1L) + i1 + (3136L*i2) + (25088L*i0))];
                            auto tmp6 = decltype(tmp4)(tmp4 * tmp5);
                            return tmp6;
                        }
                        ;
                        auto tmp7 = tmp2 ? tmp3() : static_cast<decltype(tmp3())>(0.0);
                        out_ptr0[static_cast<long>(i2 + (8L*i1) + (25096L*i0))] = tmp7;
                    }
                }
            }
        }
    }
}
```

after:
```
cpp_fused_constant_pad_nd_mul_0 = async_compile.cpp('''
#include "/tmp/torchinductor_xiaobing/i5/ci5uspp363v3ky6jkccllm3bxudy2fkdpqinkqhmpehfihejs7ko.h"
extern "C" void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       float* out_ptr0)
{
    #pragma omp parallel num_threads(40)
    {
        {
            #pragma omp for
            for(long i0=static_cast<long>(0L); i0<static_cast<long>(64L); i0+=static_cast<long>(1L))
            {
                #pragma GCC ivdep
                for(long i1=static_cast<long>(0L); i1<static_cast<long>(3137L); i1+=static_cast<long>(1L))
                {
                    #pragma omp simd simdlen(8)
                    for(long i2=static_cast<long>(0L); i2<static_cast<long>(8L); i2+=static_cast<long>(1L))
                    {
                        auto tmp0 = static_cast<long>((-1L) + i1);
                        auto tmp1 = static_cast<long>(0);
                        auto tmp2 = tmp0 >= tmp1;
                        auto tmp3 = [&]
                        {
                            auto tmp4 = in_ptr0[static_cast<long>((-8L) + i2 + (8L*i1) + (25088L*i0))];
                            auto tmp5 = in_ptr1[static_cast<long>((-1L) + i1 + (3136L*i2) + (25088L*i0))];
                            auto tmp6 = decltype(tmp4)(tmp4 * tmp5);
                            return tmp6;
                        }
                        ;
                        auto tmp7 = tmp2 ? tmp3() : static_cast<decltype(tmp3())>(0.0);
                        out_ptr0[static_cast<long>(i2 + (8L*i1) + (25096L*i0))] = tmp7;
                    }
                }
            }
        }
    }
}
''')

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106765
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-08-23 07:16:14 +00:00
chunyuan
c21e9de25d Inductor cpp wrapper: fix optional tensor input (#106847)
Fix cpp wrapper failure on `clip` in Torchbench:
```
RuntimeError: tensor does not have a device
```

An `optional<at::Tensor>` variable holding a value equal to `at::Tensor()` is considered to _contain a value_: when it is converted to `bool`, it returns `true`. For `None` in Python, converting it to `bool` returns `false`.
Fix it to be an optional variable that _does not contain a value_ (see the sketch below).
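
A small standalone sketch of that semantic gap (`std::optional` used here for illustration; assumes libtorch): an optional holding a default-constructed `at::Tensor()` still reports that it contains a value, unlike Python's `None`.

```c++
// Standalone sketch (assumes libtorch): an optional holding at::Tensor()
// converts to true even though the tensor itself is undefined, whereas an
// empty optional (the fix, mirroring Python's None) converts to false.
#include <ATen/ATen.h>
#include <iostream>
#include <optional>

int main() {
    std::optional<at::Tensor> wraps_undefined = at::Tensor();  // old behavior
    std::optional<at::Tensor> no_value = std::nullopt;         // the fix

    std::cout << std::boolalpha
              << wraps_undefined.has_value() << "\n"   // true
              << wraps_undefined->defined() << "\n"    // false
              << no_value.has_value() << "\n";         // false
    return 0;
}
```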

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106847
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-08-18 13:20:19 +00:00
Catherine Lee
1cfe292061 Mark test_lstm_packed as slow (#107048)
The test takes >30 minutes to run on some configurations and keeps getting unmarked as slow by the automatic slow test detection.
Examples:
https://ossci-raw-job-status.s3.amazonaws.com/log/15824750763
https://ossci-raw-job-status.s3.amazonaws.com/log/15802766247

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107048
Approved by: https://github.com/huydhn
2023-08-11 20:35:14 +00:00
Yanbo Liang
df8abaaf5f [Dynamo] Revert 'Enable torch._dynamo.config.suppress_errors by default' (#106562)
D47969512 was the original diff to revert this, but the diff train doesn't work well, so I have to split it into two parts: this OSS PR and another separate diff to revert the fbcode change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106562
Approved by: https://github.com/angelayi
2023-08-04 16:46:21 +00:00