Commit Graph

216 Commits

Yang Chen
1392843e7b [inductor] make sure bitcast input and target type have the same bitwidth (#115619)
This PR fixes #104791.

bitcast requires that the source and target types have the same bitwidth.
Because the input tensor's dtype could be promoted, e.g. from float16 to
float, we have to cast the tensor back to its original source dtype before
invoking bitcast in such cases. After that, we also need to convert
the bit-casted tensor back to float to make sure we keep using higher-precision
values for the rest of the computation.
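To illustrate the bitwidth rule, here is a minimal standalone C++ sketch using standard types (double standing in for the promoted dtype and float for the 32-bit source dtype); it only illustrates the rule, it is not the code Inductor actually emits:

```c++
#include <cstdint>
#include <cstdio>
#include <cstring>

// Narrow back to the original 32-bit source type before the 32-bit bitcast,
// then the caller widens the result again so later computation stays in the
// higher precision.
int32_t bitcast_after_demotion(double promoted) {
    float src = static_cast<float>(promoted);   // cast back to the original source dtype
    static_assert(sizeof(src) == sizeof(int32_t), "bitcast requires matching bitwidth");
    int32_t bits;
    std::memcpy(&bits, &src, sizeof(bits));     // the bit-level reinterpretation itself
    return bits;
}

int main() {
    std::printf("%d\n", bitcast_after_demotion(1.5));
    return 0;
}
```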

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115619
Approved by: https://github.com/jansel, https://github.com/eellison
2023-12-13 00:53:04 +00:00
Jiong Gong
bfa2c844a8 [inductor][cpp] avoid redundant lowp type cast for direct load/store (#115006)
Fix https://github.com/pytorch/pytorch/issues/114879. See https://github.com/pytorch/pytorch/issues/114879#issuecomment-1836977610 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115006
Approved by: https://github.com/jansel
2023-12-04 06:39:27 +00:00
Jez Ng
f1fd02503b Reland #113487 and #112527 (sdpa shim & fp8 AOTInductor support) (#114974)
This is a backout of #113747 which reverted the above two commits. Now that
#113997 has landed, this diff can be landed safely without breaking ABI compatibility.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114974
Approved by: https://github.com/chenyang78
2023-12-02 03:25:51 +00:00
vfdev
c867fddab5 [inductor] Fix in CppPrinter._print_Pow (#114872)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114872
Approved by: https://github.com/lezcano
2023-11-30 20:21:44 +00:00
Jez Ng
71b742b42c [inductor] Remove more type: ignore comments (#114162)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114162
Approved by: https://github.com/Skylion007, https://github.com/eellison
2023-11-28 06:45:55 +00:00
Xu Han
0f887a6d1a limit fused kernel num args. (#113131)
Fixes #97361

When a fused kernel has more than 1024 parameters, ctypes throws an error.
Limiting the number of args is a mechanism to protect stack memory: as we know, C++ passes args via the stack, and stack memory has a size limitation.

Code change:

1. The cpp backend checks the fused nodes' arg count; once it reaches the limit, it flips its status to ready-to-flush.
2. The scheduler checks the `ready_to_flush` API and helps the backend flush codegen.
3. Add a `ready_to_flush` API to `BaseScheduling`; the Triton backend returns False since it does not support this yet.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113131
Approved by: https://github.com/jgong5, https://github.com/mlazos
2023-11-22 18:05:33 +00:00
Jiong Gong
a0e3321f0c [inductor cpp] vectorize embedding lookup (#114062)
For embedding lookup, there is indirect indexing with indices that are invariant to the vectorized itervar. To vectorize it, we need to keep the related indexing variables as scalars and allow vectorization when the related index_exprs are invariant to the vectorized itervar.

This PR adds the support by lazily broadcasting scalar values (index_expr and constant) to vectors so that vector operations are generated by `CppVecKernel` only when any of the inputs are vectors; otherwise, scalar ops are generated. The cse variable in cpp is now represented with `CppCSEVariable`, which bookkeeps the itervars relevant to the variable and has a flag marking whether it is a scalar or a vector. `CppVecOverrides` is improved to propagate these states when the ops are executed.

For the added UT `test_embedding_vec`, the generated code before this PR is:
```c++
extern "C" void kernel(const long* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0)
{
    #pragma omp parallel num_threads(64)
    {
        {
            #pragma omp for
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(128L); x0+=static_cast<long>(1L))
            {
                #pragma GCC ivdep
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(128L); x1+=static_cast<long>(1L))
                {
                    auto tmp0 = in_ptr0[static_cast<long>(x0)];
                    auto tmp5 = in_ptr2[static_cast<long>(x1 + (128L*x0))];
                    auto tmp1 = decltype(tmp0)(tmp0 + 64);
                    auto tmp2 = tmp0 < 0;
                    auto tmp3 = tmp2 ? tmp1 : tmp0;
                    TORCH_CHECK((0 <= tmp3) & (tmp3 < 64L), "index out of bounds: 0 <= tmp3 < 64L")
                    auto tmp4 = in_ptr1[static_cast<long>(x1 + (128L*tmp3))];
                    auto tmp6 = decltype(tmp4)(tmp4 + tmp5);
                    out_ptr0[static_cast<long>(x1 + (128L*x0))] = tmp6;
                }
            }
        }
    }
}
```

After this PR, we have:
```c++
extern "C" void kernel(const long* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0)
{
    #pragma omp parallel num_threads(64)
    {
        {
            #pragma omp for
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(128L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(128L); x1+=static_cast<long>(16L))
                {
                    auto tmp0 = in_ptr0[static_cast<long>(x0)];
                    auto tmp5 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<long>(x1 + (128L*x0)));
                    auto tmp1 = decltype(tmp0)(tmp0 + 64);
                    auto tmp2 = tmp0 < 0;
                    auto tmp3 = tmp2 ? tmp1 : tmp0;
                    TORCH_CHECK((0 <= tmp3) & (tmp3 < 64L), "index out of bounds: 0 <= tmp3 < 64L")
                    auto tmp4 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x1 + (128L*tmp3)));
                    auto tmp6 = tmp4 + tmp5;
                    tmp6.store(out_ptr0 + static_cast<long>(x1 + (128L*x0)));
                }
            }
        }
    }
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114062
Approved by: https://github.com/jansel
2023-11-22 11:19:42 +00:00
PyTorch MergeBot
dd6ef0877e Revert "[inductor cpp] vectorize embedding lookup (#114062)"
This reverts commit 2c0474c02d.

Reverted https://github.com/pytorch/pytorch/pull/114062 on behalf of https://github.com/huydhn due to Sorry for reverting your change, please help fix lint and reland it 2c0474c02d ([comment](https://github.com/pytorch/pytorch/pull/114062#issuecomment-1820526515))
2023-11-21 09:21:20 +00:00
Jiong Gong
2c0474c02d [inductor cpp] vectorize embedding lookup (#114062)
For embedding lookup, there is indirect indexing with indices that are invariant to the vectorized itervar. To vectorize it, we need to keep the related indexing variables as scalars and allow vectorization when the related index_exprs are invariant to the vectorized itervar.

This PR adds the support by lazily broadcasting scalar values (index_expr and constant) to vectors so that vector operations are generated by `CppVecKernel` only when any of the inputs are vectors; otherwise, scalar ops are generated. The cse variable in cpp is now represented with `CppCSEVariable`, which bookkeeps the itervars relevant to the variable and has a flag marking whether it is a scalar or a vector. `CppVecOverrides` is improved to propagate these states when the ops are executed.

For the added UT `test_embedding_vec`, the generated code before this PR is:
```c++
extern "C" void kernel(const long* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0)
{
    #pragma omp parallel num_threads(64)
    {
        {
            #pragma omp for
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(128L); x0+=static_cast<long>(1L))
            {
                #pragma GCC ivdep
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(128L); x1+=static_cast<long>(1L))
                {
                    auto tmp0 = in_ptr0[static_cast<long>(x0)];
                    auto tmp5 = in_ptr2[static_cast<long>(x1 + (128L*x0))];
                    auto tmp1 = decltype(tmp0)(tmp0 + 64);
                    auto tmp2 = tmp0 < 0;
                    auto tmp3 = tmp2 ? tmp1 : tmp0;
                    TORCH_CHECK((0 <= tmp3) & (tmp3 < 64L), "index out of bounds: 0 <= tmp3 < 64L")
                    auto tmp4 = in_ptr1[static_cast<long>(x1 + (128L*tmp3))];
                    auto tmp6 = decltype(tmp4)(tmp4 + tmp5);
                    out_ptr0[static_cast<long>(x1 + (128L*x0))] = tmp6;
                }
            }
        }
    }
}
```

After this PR, we have:
```c++
extern "C" void kernel(const long* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0)
{
    #pragma omp parallel num_threads(64)
    {
        {
            #pragma omp for
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(128L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(128L); x1+=static_cast<long>(16L))
                {
                    auto tmp0 = in_ptr0[static_cast<long>(x0)];
                    auto tmp5 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<long>(x1 + (128L*x0)));
                    auto tmp1 = decltype(tmp0)(tmp0 + 64);
                    auto tmp2 = tmp0 < 0;
                    auto tmp3 = tmp2 ? tmp1 : tmp0;
                    TORCH_CHECK((0 <= tmp3) & (tmp3 < 64L), "index out of bounds: 0 <= tmp3 < 64L")
                    auto tmp4 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x1 + (128L*tmp3)));
                    auto tmp6 = tmp4 + tmp5;
                    tmp6.store(out_ptr0 + static_cast<long>(x1 + (128L*x0)));
                }
            }
        }
    }
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114062
Approved by: https://github.com/jansel
ghstack dependencies: #113950
2023-11-21 07:37:15 +00:00
PyTorch MergeBot
ff7c06a01b Revert "limit fused kernel num args. (#113131)"
This reverts commit 7b442c2b0a.

Reverted https://github.com/pytorch/pytorch/pull/113131 on behalf of https://github.com/albanD due to Breaks lint on trunk ([comment](https://github.com/pytorch/pytorch/pull/113131#issuecomment-1817548349))
2023-11-18 16:14:08 +00:00
Jiong Gong
b53d47a719 [inductor cpp] refactor: CppVecOverrides inherits CppOverrides (#113950)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113950
Approved by: https://github.com/Skylion007
2023-11-18 15:33:30 +00:00
Han, Xu
7b442c2b0a limit fused kernel num args. (#113131)
Fixes #97361

When a fused kernel has more than 1024 parameters, ctypes throws an error.
Limiting the number of args is a mechanism to protect stack memory: as we know, C++ passes args via the stack, and stack memory has a size limitation.

Code change:

1. The cpp backend checks the fused nodes' arg count; once it reaches the limit, it flips its status to ready-to-flush.
2. The scheduler checks the `ready_to_flush` API and helps the backend flush codegen.
3. Add a `ready_to_flush` API to `BaseScheduling`; the Triton backend returns False since it does not support this yet.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113131
Approved by: https://github.com/jgong5, https://github.com/mlazos
2023-11-18 03:55:52 +00:00
Wei Wei
b19cf868e8 Back out "Support fp8 in AOTInductor + support optional<> in C ABI (#112527)" (#113747)
Test Plan: sandcastle

Differential Revision: D51330618

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113747
Approved by: https://github.com/chenyang78, https://github.com/khabinov
2023-11-15 22:42:22 +00:00
Jiong Gong
fcdfcdeef9 [inductor cpp] fix non-contiguous reduction store (#113261)
Fix https://github.com/pytorch/pytorch/issues/113018

The reduction store in this case works on a non-contiguous buffer. Previously, we only did the scalar fallback for normal stores but not for reduction stores. This PR fixes that.

Before fix
```c++
            #pragma omp for
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(39L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(16L); x1+=static_cast<long>(16L))
                {
                    {
                        #pragma omp declare reduction(max:at::vec::Vectorized<float>:omp_out = at::vec::maximum(omp_out, omp_in)) initializer(omp_priv={at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity())})
                        float tmp_acc0 = -std::numeric_limits<float>::infinity();
                        at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity());
                        for(long x2=static_cast<long>(0L); x2<static_cast<long>(18L); x2+=static_cast<long>(1L))
                        {
                            auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x1 + (17L*x2) + (306L*x0)));
                            tmp_acc0_vec = at::vec::maximum(tmp_acc0_vec, tmp0);
                        }
                        tmp_acc0_vec.store(out_ptr1 + static_cast<long>(x0 + (39L*x1))); // this is wrong since x0 is not vector dim
                    }
                }
                #pragma omp simd simdlen(8)
                for(long x1=static_cast<long>(16L); x1<static_cast<long>(17L); x1+=static_cast<long>(1L))
                {
                    {
                        float tmp_acc0 = -std::numeric_limits<float>::infinity();
                        for(long x2=static_cast<long>(0L); x2<static_cast<long>(18L); x2+=static_cast<long>(1L))
                        {
                            auto tmp0 = in_ptr1[static_cast<long>(x1 + (17L*x2) + (306L*x0))];
                            tmp_acc0 = max_propagate_nan(tmp_acc0, tmp0);
                        }
                        out_ptr1[static_cast<long>(x0 + (39L*x1))] = tmp_acc0;
                    }
                }
            }
```

After fix
```c++
            #pragma omp for
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(39L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(16L); x1+=static_cast<long>(16L))
                {
                    {
                        #pragma omp declare reduction(max:at::vec::Vectorized<float>:omp_out = at::vec::maximum(omp_out, omp_in)) initializer(omp_priv={at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity())})
                        float tmp_acc0 = -std::numeric_limits<float>::infinity();
                        at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity());
                        for(long x2=static_cast<long>(0L); x2<static_cast<long>(18L); x2+=static_cast<long>(1L))
                        {
                            auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x1 + (17L*x2) + (306L*x0)));
                            tmp_acc0_vec = at::vec::maximum(tmp_acc0_vec, tmp0);
                        }
                        { __at_align__ float tmpbuf[16*sizeof(float)/sizeof(float)]; tmp_acc0_vec.store(tmpbuf); for (long x1_inner = 0; x1_inner < 16; x1_inner++) out_ptr1[static_cast<long>(x0 + (39L*x1) + (39L*x1_inner))] = tmpbuf[x1_inner]; }
                    }
                }
                #pragma omp simd simdlen(8)
                for(long x1=static_cast<long>(16L); x1<static_cast<long>(17L); x1+=static_cast<long>(1L))
                {
                    {
                        float tmp_acc0 = -std::numeric_limits<float>::infinity();
                        for(long x2=static_cast<long>(0L); x2<static_cast<long>(18L); x2+=static_cast<long>(1L))
                        {
                            auto tmp0 = in_ptr1[static_cast<long>(x1 + (17L*x2) + (306L*x0))];
                            tmp_acc0 = max_propagate_nan(tmp_acc0, tmp0);
                        }
                        out_ptr1[static_cast<long>(x0 + (39L*x1))] = tmp_acc0;
                    }
                }
            }
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113261
Approved by: https://github.com/lezcano
2023-11-15 03:27:17 +00:00
Aaron Gokaslan
b7b2178204 [BE]: Remove useless lambdas (#113602)
Applies PLW0108, which removes useless lambdas in Python. The rule is in preview, so it is not ready to be enabled by default just yet; these are the autofixes from the rule.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113602
Approved by: https://github.com/albanD
2023-11-14 20:06:48 +00:00
Jez Ng
5e03af8295 [inductor] Enable floor_div indexing to work under ABI-compat mode (#113276)
Previously, floor_div operations were defined in
ATen/native/BinaryOps.h. Since this header was not included under
ABI-compat mode, trying to use those indexing operations would result in
compilation errors.

Technically, it is safe to use aten::native::floor_div_* functions in
ABI-compat mode as they are header-only; we could simply include
BinaryOps.h. However, there are other declarations in BinaryOps.h that
are not binary-compatible, so this is not ideal. Thus, I have moved those
functions into a separate file, and put them under c10/util, since they
don't really have tensor-specific logic.

c10 functions are not all header-only, so this still isn't ideal, but
this still seems like an improvement. Moreover, cpp_prefix.h -- used
when compiling cpp kernels -- already includes c10 header files, so
ABI-compatibility already depends on maintaining some c10 functions as
header-only.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113276
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2023-11-11 02:51:29 +00:00
Jiong Gong
cb48f7855a [inductor cpu] fix uint8 add and sub (#113253)
Fix https://github.com/pytorch/pytorch/issues/113016 and https://github.com/pytorch/pytorch/issues/113020 and https://github.com/pytorch/pytorch/issues/113141 and https://github.com/pytorch/pytorch/issues/113143 and https://github.com/pytorch/pytorch/issues/113144
Explicitly typecast the result of add/sub to uint8 (similar to how we fixed mul previously) to avoid implicit type promotion in C.
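As a standalone illustration of the integer-promotion pitfall (not the generated kernel code itself): uint8 arithmetic in C++ promotes the operands to int, so the result must be cast back to uint8 to get the expected modulo-256 behavior.

```c++
#include <cstdint>
#include <cstdio>

int main() {
    uint8_t a = 200, b = 100;
    auto promoted = a + b;                        // operands promoted to int: value is 300
    auto wrapped  = static_cast<uint8_t>(a + b);  // explicit cast back: value is 44 (300 mod 256)
    std::printf("%d %d\n", promoted, static_cast<int>(wrapped));
    return 0;
}
```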

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113253
Approved by: https://github.com/lezcano, https://github.com/jansel
2023-11-10 04:06:42 +00:00
Jiong Gong
8c704f7a0e [inductor cpp] fix argmax with >1 reduction dims (#113168)
Fix #113013.

The argmax (and argmin) implementation doesn't handle the index computation properly when the number of reduction dims is larger than 1; it wrongly assumes there is only one reduction dim.

With the given reproducer, the generated code before the change:
```c++
#include "/tmp/torchinductor_jgong5/tb/ctbgktuhgnnlel6ipqkfk76lfztr5pledachdkcq3asdqtlxpzt6.h"
extern "C" void kernel(const double* in_ptr0,
                       long* out_ptr0)
{
    {
        {
            struct IndexValue_1 {size_t index; double value;};
            IndexValue_1 tmp_acc0{0, -std::numeric_limits<double>::infinity()};
            #if !defined(__clang_major__) || __clang_major__ > 9
            #pragma omp declare reduction(argmax : IndexValue_1 :\
                omp_out.value = omp_in.value < omp_out.value ? omp_out.value : omp_in.value,\
                omp_out.index = omp_in.value < omp_out.value ? omp_out.index : omp_in.index)\
            	initializer(omp_priv = {0, -std::numeric_limits<double>::infinity()})
            #endif
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(9L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(2L); x1+=static_cast<long>(1L))
                {
                    auto tmp0 = c10::convert<long>(0);
                    auto tmp1 = c10::convert<long>(1);
                    auto tmp2 = tmp0 < tmp1;
                    auto tmp3 = c10::convert<long>(at::native::div_floor_integer((3L*x1), 2L));
                    auto tmp4 = c10::convert<long>(2L + (at::native::div_floor_integer((3L*x1), 2L)));
                    auto tmp5 = tmp3 < tmp4;
                    auto tmp6 = tmp2 & tmp5;
                    auto tmp7 = [&]
                    {
                        auto tmp8 = in_ptr0[static_cast<long>((3L*x0) + (at::native::div_floor_integer((3L*x1), 2L)))];
                        return tmp8;
                    }
                    ;
                    auto tmp9 = tmp6 ? tmp7() : static_cast<decltype(tmp7())>(0.0);
                    auto tmp10 = c10::convert<long>(1L + (at::native::div_floor_integer((3L*x1), 2L)));
                    auto tmp11 = tmp10 < tmp4;
                    auto tmp12 = tmp2 & tmp11;
                    auto tmp13 = [&]
                    {
                        auto tmp14 = in_ptr0[static_cast<long>(1L + (3L*x0) + (at::native::div_floor_integer((3L*x1), 2L)))];
                        return tmp14;
                    }
                    ;
                    auto tmp15 = tmp12 ? tmp13() : static_cast<decltype(tmp13())>(0.0);
                    auto tmp16 = tmp15 + tmp9;
                    auto tmp17 = [&]
                    {
                        auto tmp18 = c10::convert<double>(1.0);
                        return tmp18;
                    }
                    ;
                    auto tmp19 = tmp6 ? tmp17() : static_cast<decltype(tmp17())>(0.0);
                    auto tmp20 = [&]
                    {
                        auto tmp21 = c10::convert<double>(1.0);
                        return tmp21;
                    }
                    ;
                    auto tmp22 = tmp12 ? tmp20() : static_cast<decltype(tmp20())>(0.0);
                    auto tmp23 = tmp22 + tmp19;
                    auto tmp24 = tmp16 / tmp23;
                    if (tmp_acc0.value < tmp24) {
                        tmp_acc0.index = x1; tmp_acc0.value = tmp24; // both x0 and x1 are reduction vars while only x1 is assigned to tmp_acc0.index
                    }
                }
            }
            out_ptr0[static_cast<long>(0L)] = tmp_acc0.index;
        }
    }
}
```
After fix:
```c++
#include "/tmp/torchinductor_jgong5/tb/ctbgktuhgnnlel6ipqkfk76lfztr5pledachdkcq3asdqtlxpzt6.h"
extern "C" void kernel(const double* in_ptr0,
                       long* out_ptr0)
{
    {
        {
            struct IndexValue_1 {size_t index; double value;};
            IndexValue_1 tmp_acc0{0, -std::numeric_limits<double>::infinity()};
            #if !defined(__clang_major__) || __clang_major__ > 9
            #pragma omp declare reduction(argmax : IndexValue_1 :\
                omp_out.value = omp_in.value < omp_out.value ? omp_out.value : omp_in.value,\
                omp_out.index = omp_in.value < omp_out.value ? omp_out.index : omp_in.index)\
            	initializer(omp_priv = {0, -std::numeric_limits<double>::infinity()})
            #endif
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(9L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(2L); x1+=static_cast<long>(1L))
                {
                    auto tmp0 = c10::convert<long>(0);
                    auto tmp1 = c10::convert<long>(1);
                    auto tmp2 = tmp0 < tmp1;
                    auto tmp3 = c10::convert<long>(at::native::div_floor_integer((3L*x1), 2L));
                    auto tmp4 = c10::convert<long>(2L + (at::native::div_floor_integer((3L*x1), 2L)));
                    auto tmp5 = tmp3 < tmp4;
                    auto tmp6 = tmp2 & tmp5;
                    auto tmp7 = [&]
                    {
                        auto tmp8 = in_ptr0[static_cast<long>((3L*x0) + (at::native::div_floor_integer((3L*x1), 2L)))];
                        return tmp8;
                    }
                    ;
                    auto tmp9 = tmp6 ? tmp7() : static_cast<decltype(tmp7())>(0.0);
                    auto tmp10 = c10::convert<long>(1L + (at::native::div_floor_integer((3L*x1), 2L)));
                    auto tmp11 = tmp10 < tmp4;
                    auto tmp12 = tmp2 & tmp11;
                    auto tmp13 = [&]
                    {
                        auto tmp14 = in_ptr0[static_cast<long>(1L + (3L*x0) + (at::native::div_floor_integer((3L*x1), 2L)))];
                        return tmp14;
                    }
                    ;
                    auto tmp15 = tmp12 ? tmp13() : static_cast<decltype(tmp13())>(0.0);
                    auto tmp16 = tmp15 + tmp9;
                    auto tmp17 = [&]
                    {
                        auto tmp18 = c10::convert<double>(1.0);
                        return tmp18;
                    }
                    ;
                    auto tmp19 = tmp6 ? tmp17() : static_cast<decltype(tmp17())>(0.0);
                    auto tmp20 = [&]
                    {
                        auto tmp21 = c10::convert<double>(1.0);
                        return tmp21;
                    }
                    ;
                    auto tmp22 = tmp12 ? tmp20() : static_cast<decltype(tmp20())>(0.0);
                    auto tmp23 = tmp22 + tmp19;
                    auto tmp24 = tmp16 / tmp23;
                    if (tmp_acc0.value < tmp24) {
                        tmp_acc0.index = static_cast<long>(x1 + (2L*x0)); tmp_acc0.value = tmp24;
                    }
                }
            }
            out_ptr0[static_cast<long>(0L)] = tmp_acc0.index;
        }
    }
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113168
Approved by: https://github.com/lezcano, https://github.com/jansel
2023-11-09 11:47:51 +00:00
Jez Ng
297c26bb8e Support fp8 in AOTInductor + support optional<> in C ABI (#112527)
This was originally ipiszy's PR: https://github.com/pytorch/pytorch/pull/112358

It turns out that we need to add support for optional types in order to
support fp8 gemm (i.e. scaled_mm). Since our ABI-stable C interface
can't support optional<> directly, I am passing in optional types via
pointer instead.

`AtenTensorHandle`s are already pointers, so nothing needs to change
there. Only value types need to change.
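A minimal sketch of the pointer-for-optional convention described above (the shim name and signature here are hypothetical, purely for illustration): an optional value-typed argument crosses the ABI boundary as a pointer, with nullptr standing in for nullopt.

```c++
#include <cstdio>
#include <optional>

// Hypothetical ABI-stable shim entry point: a missing optional arrives as nullptr.
extern "C" void shim_scaled_mm(const double* scale) {
    double s = scale ? *scale : 1.0;  // interpret nullptr as "no scale provided"
    std::printf("scale = %f\n", s);
}

// Caller-side codegen would lower std::optional to the pointer form at the call site.
void call_shim(std::optional<double> scale) {
    const double* scale_ptr = scale.has_value() ? &*scale : nullptr;
    shim_scaled_mm(scale_ptr);
}

int main() {
    call_shim(2.0);
    call_shim(std::nullopt);
    return 0;
}
```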

We decided on this approach instead of adding an extra `bool` param to
the callee because this simplifies things. Having the same number of
arguments regardless of whether we are emitting Python / C++ /
ABI-compatible C++ makes codegen easier.

There are a number of existing ABI-compatible functions that have
optional-typed value parameters. Previously, they just assumed they
would never be passed a `nullopt` / `None` at runtime. Changing them to
use pointer types now would break ABI stability, so I have created an
exclude list for those functions.

Finally, I think the current implementation is kind of messy, and only
works for FallbackKernels, even though technically ExternKernels could
also have the same issue. It also doesn't support optional types nested
in lists. I've left FIXME comments for both issues.

Differential Revision: [D51084289](https://our.internmc.facebook.com/intern/diff/D51084289)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112527
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2023-11-08 22:56:48 +00:00
Jez Ng
dc63248b76 Make dynamo configs more amenable to static type checking (#112130)
`install_config_module` makes a regular module into a ConfigModule with
extra methods defined on it. mypy thinks those extra methods (or module
functions) are undefined since it cannot analyze something so
dynamic. As a workaround, I've created a fake module that defines these
extra functions, which I import into the config modules during type
checking.

As part of this change, I've also added more types to config_utils.py
and enabled typechecking for torch/_dynamo/config.py.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112130
Approved by: https://github.com/jansel
2023-11-08 21:17:45 +00:00
Aaron Gokaslan
8219bf051b [BE]: Apply RUF015 to torch folder (#113025)
Removes unnecessary allocations of iterators. There is a small chance this may have side effects as the entire iterator is no longer consumed, but this is a way more efficient method for retrieving the first element.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113025
Approved by: https://github.com/ezyang, https://github.com/malfet
2023-11-07 00:48:15 +00:00
Peter Bell
718035791d Prefer e.is_number over not e.free_symbols in SymPy (#112688)
We spend somewhere on the order of 1% of time in `sympy.Expr.free_symbols`, as it is called millions of times.
Most of the time we actually just want to know "is this a constant?"; however, `e.is_constant()` is
horribly slow. It turns out, though, that there is another property, `is_number`, that does what we want.

> property is_number:
>
> Returns True if self has no free symbols and no undefined functions (AppliedUndef, to be precise). It will be faster
> than if not self.free_symbols, however, since is_number will fail as soon as it hits a free symbol or undefined
> function.

Even further, we also avoid the overhead of building the unnecessary set object.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112688
Approved by: https://github.com/lezcano
2023-11-06 20:05:13 +00:00
Jiong Gong
e061144aaf [inductor] replace ops.div with ops.truediv (#112243)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112243
Approved by: https://github.com/lezcano
ghstack dependencies: #112234
2023-11-01 05:50:51 +00:00
Shunting Zhang
fbafff3668 [reland][inductor] benchmark fusion (#112450)
reland https://github.com/pytorch/pytorch/pull/108193

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112450
Approved by: https://github.com/jansel
2023-10-31 18:17:06 +00:00
Jiong Gong
a1c56df1f0 [inductor cpp] vectorize support for truediv (#112234)
Ops like group_norm use `ops.truediv`, which doesn't have vectorization support yet. This PR adds the support.

`test_group_norm_vec`
Before:
```c++
extern "C" void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0,
                       float* out_ptr1,
                       float* out_ptr2)
{
    #pragma omp parallel num_threads(64)
    {
        {
            #pragma omp for
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(64L); x0+=static_cast<long>(1L))
            {
                {
                    #pragma omp declare reduction(welford:Welford<float>:omp_out = welford_combine(omp_out, omp_in)) initializer(omp_priv={Welford<float>()})
                    #pragma omp declare reduction(welford:Welford<at::vec::Vectorized<float>>:omp_out = welford_combine(omp_out, omp_in)) initializer(omp_priv={Welford<at::vec::Vectorized<float>>()})
                    Welford<float> tmp_acc0 = Welford<float>();
                    Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                    for(long x1=static_cast<long>(0L); x1<static_cast<long>(1024L); x1+=static_cast<long>(16L))
                    {
                        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x1 + (1024L*x0)));
                        tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0);
                    }
                    tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec));
                    out_ptr0[static_cast<long>(x0)] = static_cast<float>(tmp_acc0.mean);
                    out_ptr1[static_cast<long>(x0)] = static_cast<float>(tmp_acc0.m2);
                }
            }
        }
        {
            #pragma omp for  collapse(2)
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(32L); x1+=static_cast<long>(1L))
                {
                    #pragma GCC ivdep
                    for(long x2=static_cast<long>(0L); x2<static_cast<long>(1024L); x2+=static_cast<long>(1L))
                    {
                        auto tmp0 = in_ptr0[static_cast<long>(x2 + (1024L*x1) + (32768L*x0))];
                        auto tmp1 = out_ptr0[static_cast<long>(x1 + (32L*x0))];
                        auto tmp3 = out_ptr1[static_cast<long>(x1 + (32L*x0))];
                        auto tmp10 = in_ptr1[static_cast<long>(x1)];
                        auto tmp12 = in_ptr2[static_cast<long>(x1)];
                        auto tmp2 = tmp0 - tmp1;
                        auto tmp4 = c10::convert<float>(1024.0);
                        auto tmp5 = tmp3 / tmp4;
                        auto tmp6 = c10::convert<float>(1e-05);
                        auto tmp7 = tmp5 + tmp6;
                        auto tmp8 = 1 / std::sqrt(tmp7);
                        auto tmp9 = decltype(tmp2)(tmp2 * tmp8);
                        auto tmp11 = decltype(tmp9)(tmp9 * tmp10);
                        auto tmp13 = tmp11 + tmp12;
                        out_ptr2[static_cast<long>(x2 + (1024L*x1) + (32768L*x0))] = tmp13;
                    }
                }
            }
        }
    }
}
```

After:
```c++
extern "C" void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0,
                       float* out_ptr1,
                       float* out_ptr2)
{
    #pragma omp parallel num_threads(64)
    {
        {
            #pragma omp for
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(64L); x0+=static_cast<long>(1L))
            {
                {
                    #pragma omp declare reduction(welford:Welford<float>:omp_out = welford_combine(omp_out, omp_in)) initializer(omp_priv={Welford<float>()})
                    #pragma omp declare reduction(welford:Welford<at::vec::Vectorized<float>>:omp_out = welford_combine(omp_out, omp_in)) initializer(omp_priv={Welford<at::vec::Vectorized<float>>()})
                    Welford<float> tmp_acc0 = Welford<float>();
                    Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                    for(long x1=static_cast<long>(0L); x1<static_cast<long>(1024L); x1+=static_cast<long>(16L))
                    {
                        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x1 + (1024L*x0)));
                        tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0);
                    }
                    tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec));
                    out_ptr0[static_cast<long>(x0)] = static_cast<float>(tmp_acc0.mean);
                    out_ptr1[static_cast<long>(x0)] = static_cast<float>(tmp_acc0.m2);
                }
            }
        }
        {
            #pragma omp for  collapse(2)
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(32L); x1+=static_cast<long>(1L))
                {
                    for(long x2=static_cast<long>(0L); x2<static_cast<long>(1024L); x2+=static_cast<long>(16L))
                    {
                        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x2 + (1024L*x1) + (32768L*x0)));
                        auto tmp1 = at::vec::Vectorized<float>(static_cast<float>(out_ptr0[static_cast<long>(x1 + (32L*x0))]));
                        auto tmp3 = at::vec::Vectorized<float>(static_cast<float>(out_ptr1[static_cast<long>(x1 + (32L*x0))]));
                        auto tmp10 = at::vec::Vectorized<float>(static_cast<float>(in_ptr1[static_cast<long>(x1)]));
                        auto tmp12 = at::vec::Vectorized<float>(static_cast<float>(in_ptr2[static_cast<long>(x1)]));
                        auto tmp2 = tmp0 - tmp1;
                        auto tmp4 = at::vec::Vectorized<float>(static_cast<float>(1024.0));
                        auto tmp5 = tmp3 / tmp4;
                        auto tmp6 = at::vec::Vectorized<float>(static_cast<float>(1e-05));
                        auto tmp7 = tmp5 + tmp6;
                        auto tmp8 = tmp7.rsqrt();
                        auto tmp9 = tmp2 * tmp8;
                        auto tmp11 = tmp9 * tmp10;
                        auto tmp13 = tmp11 + tmp12;
                        tmp13.store(out_ptr2 + static_cast<long>(x2 + (1024L*x1) + (32768L*x0)));
                    }
                }
            }
        }
    }
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112234
Approved by: https://github.com/lezcano, https://github.com/jansel
2023-10-31 17:15:21 +00:00
PyTorch MergeBot
fc0b0820fc Revert "Readded device_assert skipping in index and index_put (and also added (#112093)"
This reverts commit b110d87ac2.

Reverted https://github.com/pytorch/pytorch/pull/112093 on behalf of https://github.com/ZainRizvi due to Stack breaks internal builds ([comment](https://github.com/pytorch/pytorch/pull/112093#issuecomment-1785922905))
2023-10-30 19:45:41 +00:00
chilli
b110d87ac2 Readded device_assert skipping in index and index_put (and also added (#112093)
copy to noop pass)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112093
Approved by: https://github.com/oulgen, https://github.com/lezcano
2023-10-27 18:23:49 +00:00
PyTorch MergeBot
64fd027f2e Revert "[inductor] benchmark fusion (#108193)"
This reverts commit 73cc5d1cdd.

Reverted https://github.com/pytorch/pytorch/pull/108193 on behalf of https://github.com/izaitsevfb due to Trying to unblock the revert of #108690, please rebase and reland. ([comment](https://github.com/pytorch/pytorch/pull/108193#issuecomment-1782157638))
2023-10-27 01:40:06 +00:00
PyTorch MergeBot
0a3199dd7e Revert "Readded device_assert skipping in index and index_put (and also added (#112093)"
This reverts commit e38347f490.

Reverted https://github.com/pytorch/pytorch/pull/112093 on behalf of https://github.com/izaitsevfb due to Sorry, trying to resolve a conflict with intern, and unblock the revert of #108690 ([comment](https://github.com/pytorch/pytorch/pull/112093#issuecomment-1782154814))
2023-10-27 01:37:33 +00:00
Shunting Zhang
73cc5d1cdd [inductor] benchmark fusion (#108193)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108193
Approved by: https://github.com/jansel
2023-10-26 22:18:37 +00:00
PyTorch MergeBot
485cc0faae Revert "[inductor] benchmark fusion (#108193)"
This reverts commit ec0cdcdf6a.

Reverted https://github.com/pytorch/pytorch/pull/108193 on behalf of https://github.com/ZainRizvi due to This test is breaking trunk. In the future please make sure to add the ciflow/trunk label before force merging any PR to ensure your code doesn't break those tests ([comment](https://github.com/pytorch/pytorch/pull/108193#issuecomment-1781473282))
2023-10-26 16:41:20 +00:00
chilli
e38347f490 Readded device_assert skipping in index and index_put (and also added (#112093)
copy to noop pass)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112093
Approved by: https://github.com/oulgen, https://github.com/lezcano
ghstack dependencies: #111990
2023-10-26 07:54:44 +00:00
Shunting Zhang
ec0cdcdf6a [inductor] benchmark fusion (#108193)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108193
Approved by: https://github.com/jansel
2023-10-26 04:14:22 +00:00
Guilherme Leobas
f97c2dabd9 Move negative index checking to common.py - Fix issue 97365 (#108690)
Fixes https://github.com/pytorch/pytorch/issues/97365

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108690
Approved by: https://github.com/lezcano
2023-10-24 17:27:54 +00:00
Jiong Gong
8bc04f46fe [inductor cpp] use c10::bit_cast to avoid violating strict-aliasing (#110809)
Fix https://github.com/pytorch/pytorch/issues/110807
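As a standalone sketch of the strict-aliasing issue being fixed (plain std::memcpy here stands in for the memcpy-based bit cast that `c10::bit_cast` provides): reading a float's bits through an int pointer violates strict aliasing, whereas the memcpy-based bit cast is well defined.

```c++
#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    float f = 1.5f;
    // uint32_t bad = *reinterpret_cast<uint32_t*>(&f);  // violates strict aliasing (UB)
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));                // well-defined bit cast
    std::printf("0x%08x\n", static_cast<unsigned>(bits));
    return 0;
}
```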

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110809
Approved by: https://github.com/jansel
2023-10-10 11:16:31 +00:00
Peter Bell
dc794ec32c [dynamo] Trace through builtin abs (#110398)
In Python, `abs(x)` does nothing but delegate to `x.__abs__()`, so we should do
the same in dynamo. This also adds `SymNode.__abs__` so we can trace through
indexing expressions involving `abs`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110398
Approved by: https://github.com/jansel, https://github.com/lezcano
2023-10-03 19:25:37 +00:00
Alexander Grund
e0348ceceb Avoid undefined behavior in JIT-generated conversion code (#110212)
The inductor/dynamo JIT generator creates C++ code using `static_cast` for type conversions.
This can be undefined behavior, e.g. `static_cast<uint8_t>(floatVal)` where `floatVal` is a negative value.

To avoid this in the "regular" C++ code, `c10::convert` is used. So use it in the JIT-generated code too.
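A minimal sketch of the undefined behavior in question; the two-step route below is just one well-defined alternative shown for illustration, not necessarily what `c10::convert` does:

```c++
#include <cstdint>
#include <cstdio>

int main() {
    float v = -3.0f;
    // Undefined behavior: the truncated value -3 is not representable in uint8_t.
    // uint8_t bad = static_cast<uint8_t>(v);

    // Going through a signed integer type first is well defined; the
    // signed-to-unsigned conversion then wraps modulo 256, yielding 253.
    uint8_t ok = static_cast<uint8_t>(static_cast<int64_t>(v));
    std::printf("%d\n", static_cast<int>(ok));
    return 0;
}
```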

Fixes #110077

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110212
Approved by: https://github.com/ezyang, https://github.com/jgong5, https://github.com/desertfire
2023-10-02 12:56:41 +00:00
leslie-fang-intel
7eeb392eb3 [Inductor] Enable the item() and nonzero() codegen test on CPU (#110262)
**Summary**
Follow-up to https://github.com/pytorch/pytorch/pull/109893, which has an issue in its support of CPU as reported in https://github.com/pytorch/pytorch/issues/109897. This fix mainly includes 2 changes:

- The current implementation of `rename_indexing` (10c646295d, torch/_inductor/codegen/common.py, L1023) only adds symbol names that start with `s` or `ps` into `kernel.args.sizevars`. However, an `Unbacked symint` starts with `i`, so we extend `rename_indexing` to also support symbols starting with `i`.
- Currently, the internal loop index names also start with `i`. Since `i` is already used for `Unbacked symint`, change the loop index prefix to `x`, which aligns with Triton.

**Test Plan**
```
python -u -m pytest -s -v test_torchinductor_dynamic_shapes.py -k test_bool_mask_nobreak
python -u -m pytest -s -v test_torchinductor_dynamic_shapes.py -k test_nonzero_size_factory_nobreak
python -u -m pytest -s -v test_torchinductor_dynamic_shapes.py -k test_item_zeros_nobreak
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110262
Approved by: https://github.com/ezyang, https://github.com/jgong5
2023-09-30 00:13:20 +00:00
Edward Z. Yang
d1a13129bb Add support for item() and nonzero() codegen in Inductor (#109893)
This is another version of
https://github.com/pytorch/pytorch/pull/109262 that I think is more
harmonious with inductor design.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109893
Approved by: https://github.com/jansel
2023-09-28 23:37:31 +00:00
Sam Larsen
7ed06e8317 [inductor] enable mypy checking in torch/_inductor/codegen/cpp.py (#109729)
Summary: Add enough typehints / ignores to enable mypy checking in torch/_inductor/codegen/cpp.py

Test Plan: lintrunner

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109729
Approved by: https://github.com/Skylion007
2023-09-25 22:53:05 +00:00
Ying Zhang
bbdce93571 Basic fp8 support in Inductor (#109168)
Add basic fp8 support in Inductor, including:
* Fix fp8 Triton codegen issues;
* Add a min_elements_per_thread requirement for fp8-related dtype conversions. More details on the Triton implementation can be found in 10f59d8ce0/lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp (L10).

Note that the current implementation only works for Pointwise. Will create follow-up PRs for Reduction.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109168
Approved by: https://github.com/drisspg
2023-09-23 04:41:41 +00:00
Nikita Shulga
a9bf1031d4 [BE] Do not use numpy in torch._inductor.codegen.cpp (#109324)
`s/numpy.iinfo(numpy.int32)/torch.iinfo(torch.int32)/` as those two are interchangeable

Partially addresses https://github.com/pytorch/pytorch/issues/109387

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109324
Approved by: https://github.com/albanD
2023-09-15 17:29:10 +00:00
Ying Zhang
097fd43f8c [Inductor CUTLASS backend] Step 4: CUDA (template) kernels (#107931)
This is the step 4 to add cutlass as an alternative inductor backend.
Full tests can be found from the last PR in the stack.

Feature request: https://github.com/pytorch/pytorch/issues/106991.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107931
Approved by: https://github.com/aakhundov, https://github.com/jansel, https://github.com/kadeng
ghstack dependencies: #107802, #107847, #107901
2023-09-12 17:44:38 +00:00
Sherlock Huang
b9dfdc091b [AOTInductor][Reland] Proxy Executor for Extern Fallback kernels (#107279) (#108350)
Summary:

This is a prototype for running extern fallback kernels with a host side proxy executor.

Sample of generated cpp wrapper call:
```
        at::Tensor buf0;  // output buffer
        void* tensor_args_var_0[] = {&arg0_1, &arg0_1, &arg1_1, &arg0_1, &arg1_1, &buf0};
        int64_t int_args_var_1[] = {81, 81, 7, 7, 7, 81};
        proxy_executor->call_function("buf0", int_args_var_1, tensor_args_var_0);
```

- In my current implementation, the proxy executor interprets the raw pointers according to the op's schema.
This assumes that the custom op MUST have a valid schema registered with the Dispatcher. (I would like to validate this assumption.)
- I am using the callboxed() API of the custom kernels. This is inevitable, as we wish to have a single call_function API for all possible custom kernels.

- These are all the input argument types I support so far:
       union Argument {
         # Bool value does not matter
         1: bool asNone;
         2: TensorArgument asTensor;
         3: list<TensorArgument> asTensors;
         5: i64 asInt;
         7: list<i64> asInts;
         8: double asFloat;
         9: list<double> asFloats;
         10: string asString;
         10.5: list<string> asStrings;
         11: SymIntArgument asSymInt;
         12: list<SymIntArgument> asSymInts;
         13: ScalarType asScalarType;
         14: MemoryFormat asMemoryFormat;
         15: Layout asLayout;
         16: Device asDevice;
         17: bool asBool;
         18: list<bool> asBools;
       }

- Need a policy for handling unpopulated arguments with default values. Here are the options; the choice has BC implications.
1. Require the exported fx graph to explicitly populate default values if the user doesn't specify them.
2. Require the cpp wrapper to explicitly populate default values if the fx graph doesn't specify them.
3. The proxy executor looks up default values from the op schema.

For fixing T162112344

Test Plan:
frontend:
buck2 run mode/dev-sand mode/inplace -c fbcode.enable_gpu_sections=True sigmoid/frontend:export_main

test:
 buck2 run mode/dev-sand //deeplearning/aot_inductor/test:test_custom_ops

backend:
buck2 run mode/dev-nosan //deeplearning/aot_inductor/fb:main

buck2 test 'fbcode//mode/opt' fbcode//caffe2/torch/fb/model_transform/experimental/benchmark/test:test_aot_inductor_benchmark -- --exact 'caffe2/torch/fb/model_transform/experimental/benchmark/test:test_aot_inductor_benchmark - test_aot_inductor_benchmark_cmf30x (caffe2.torch.fb.model_transform.experimental.benchmark.test.test_aot_inductor_benchmark.AOTInductorBenchmark)'

Reviewed By: suo

Differential Revision: D48747417

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108350
Approved by: https://github.com/izaitsevfb
2023-09-02 17:14:10 +00:00
Shunting Zhang
7cb4bf675b [inductor] no-side-effect codegen (#107617)
Inductor kernel codegen previously had the following side effects:
- in `Kernel.__exit__`, we add locally used buffers to graph.removed_buffers
- during codegen, we do memory allocation/free.

These make doing multiple versions of codegen for the same kernel hard. This PR refactors the code so that kernel codegen does not change graph-level state. After codegening a kernel, the graph-level state is unchanged, so we can go on to codegen another version of the kernel if we want.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107617
Approved by: https://github.com/jansel
2023-08-31 00:25:17 +00:00
leslie-fang-intel
fdbc2ec5cb [Quant][Inductor] Fix the non contiguous load with uint8 data type (#106958)
**Summary**
Currently, the load vectorization code generation with `non_contiguous` and `uint8` data type has an issue in determining the data type. It caused wrong results in the `shufflenet_v2_x1_0` model after we enabled the `cat` quantization recipe.

- Code generated previously for the example in this PR:

```
cpp_fused_clone_view_0 = async_compile.cpp('''
#include "/tmp/torchinductor_root/i5/ci5uspp363v3ky6jkccllm3bxudy2fkdpqinkqhmpehfihejs7ko.h"
extern "C" void kernel(const unsigned char* in_ptr0,
                       float* out_ptr0)
{
    #pragma omp parallel num_threads(56)
    {
        {
            #pragma omp for
            for(long i0=static_cast<long>(0L); i0<static_cast<long>(232L); i0+=static_cast<long>(1L))
            {
                for(long i1=static_cast<long>(0L); i1<static_cast<long>(784L); i1+=static_cast<long>(16L))
                {
                    auto tmp0 = ([&]() { __at_align__ float tmpbuf[16]; for (long i1_inner = 0; i1_inner < 16; i1_inner++) tmpbuf[i1_inner] = flag_to_float_scalar(in_ptr0[static_cast<long>((116L*(static_cast<long>(i0) % static_cast<long>(2L))) + (232L*i1) + (232L*i1_inner) + (at::native::div_floor_integer(i0, 2L)))]); return at::vec::Vectorized<uint8_t>::loadu_one_fourth(tmpbuf); })();
                    auto tmp1 = at::vec::convert_uint8_to_float(tmp0);
                    auto tmp2 = at::vec::Vectorized<float>(static_cast<float>(0.0));
                    auto tmp3 = tmp1 - tmp2;
                    auto tmp4 = at::vec::Vectorized<float>(static_cast<float>(1.0));
                    auto tmp5 = tmp3 * tmp4;
                    auto tmp6 = tmp5 * tmp4;
                    auto tmp7 = tmp6.round();
                    auto tmp8 = tmp7 + tmp2;
                    auto tmp9 = at::vec::maximum(tmp8, tmp2);
                    auto tmp10 = at::vec::Vectorized<float>(static_cast<float>(255.0));
                    auto tmp11 = at::vec::minimum(tmp9, tmp10);
                    auto tmp12 = at::vec::convert_float_to_uint8(tmp11);
                    auto tmp13 = at::vec::convert_uint8_to_float(tmp12);
                    auto tmp14 = tmp13 - tmp2;
                    auto tmp15 = tmp14 * tmp4;
                    tmp15.store(out_ptr0 + static_cast<long>(i1 + (784L*i0)));
                }
            }
        }
    }
}
''')
```

- After this PR, the code gen is:

```
cpp_fused_clone_view_0 = async_compile.cpp('''
#include "/tmp/torchinductor_root/i5/ci5uspp363v3ky6jkccllm3bxudy2fkdpqinkqhmpehfihejs7ko.h"
extern "C" void kernel(const unsigned char* in_ptr0,
                       float* out_ptr0)
{
    #pragma omp parallel num_threads(56)
    {
        {
            #pragma omp for
            for(long i0=static_cast<long>(0L); i0<static_cast<long>(232L); i0+=static_cast<long>(1L))
            {
                for(long i1=static_cast<long>(0L); i1<static_cast<long>(784L); i1+=static_cast<long>(16L))
                {
                    auto tmp0 = ([&]() { __at_align__ unsigned char tmpbuf[16]; for (long i1_inner = 0; i1_inner < 16; i1_inner++) tmpbuf[i1_inner] = in_ptr0[static_cast<long>((116L*(static_cast<long>(i0) % static_cast<long>(2L))) + (232L*i1) + (232L*i1_inner) + (at::native::div_floor_integer(i0, 2L)))]; return at::vec::Vectorized<uint8_t>::loadu_one_fourth(tmpbuf); })();
                    auto tmp1 = at::vec::convert_uint8_to_float(tmp0);
                    auto tmp2 = at::vec::Vectorized<float>(static_cast<float>(0.0));
                    auto tmp3 = tmp1 - tmp2;
                    auto tmp4 = at::vec::Vectorized<float>(static_cast<float>(1.0));
                    auto tmp5 = tmp3 * tmp4;
                    auto tmp6 = tmp5 * tmp4;
                    auto tmp7 = tmp6.round();
                    auto tmp8 = tmp7 + tmp2;
                    auto tmp9 = at::vec::maximum(tmp8, tmp2);
                    auto tmp10 = at::vec::Vectorized<float>(static_cast<float>(255.0));
                    auto tmp11 = at::vec::minimum(tmp9, tmp10);
                    auto tmp12 = at::vec::convert_float_to_uint8(tmp11);
                    auto tmp13 = at::vec::convert_uint8_to_float(tmp12);
                    auto tmp14 = tmp13 - tmp2;
                    auto tmp15 = tmp14 * tmp4;
                    tmp15.store(out_ptr0 + static_cast<long>(i1 + (784L*i0)));
                }
            }
        }
    }
}
''')
```

**Test Plan**
```
clear && python -m pytest test_cpu_repro.py -k test_non_contiguous_load_buf_quant
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106958
Approved by: https://github.com/jgong5, https://github.com/eellison
ghstack dependencies: #106836, #106838
2023-08-26 16:58:45 +00:00
XiaobingSuper
d2105a8688 inductor: support masked load for cpu path (#107670)
For max_pooling code:

```

#pragma GCC ivdep
                    for(long i2=static_cast<long>(0L); i2<static_cast<long>(56L); i2+=static_cast<long>(1L))
                    {
                        for(long i3=static_cast<long>(0L); i3<static_cast<long>(64L); i3+=static_cast<long>(16L))
                        {
                            auto tmp0 = at::vec::Vectorized<int>(static_cast<int>((-1L) + (2L*i1)));
                            auto tmp1 = at::vec::Vectorized<int>(static_cast<int>(0));
                            auto tmp2 = to_float_mask(tmp0 >= tmp1);
                            auto tmp3 = at::vec::Vectorized<int>(static_cast<int>(112));
                            auto tmp4 = to_float_mask(tmp0 < tmp3);
                            auto tmp5 = tmp2 & tmp4;
                            auto tmp6 = at::vec::Vectorized<int>(static_cast<int>((-1L) + (2L*i2)));
                            auto tmp7 = to_float_mask(tmp6 >= tmp1);
                            auto tmp8 = to_float_mask(tmp6 < tmp3);
                            auto tmp9 = tmp7 & tmp8;
                            auto tmp10 = tmp5 & tmp9;
                            auto tmp11 = [&]
                            {
                                auto tmp12 = at::vec::Vectorized<bfloat16>::loadu(in_ptr0 + static_cast<long>((-7232L) + i3 + (128L*i2) + (14336L*i1) + (802816L*i0)), 16); // load
                                auto tmp13 = cvt_lowp_fp_to_fp32<bfloat16>(tmp12);

                                return tmp13;
                            }
                            ;
                            auto tmp14 = decltype(tmp11())::blendv(at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity()), tmp11(), to_float_mask(tmp10));
```

The index of `tmp12` may not be valid: for example, with `i1=0, i2=0, i3=0`, the index is `-7232L`, which is out of bounds. We may hit a segmentation fault when we call `tmp11()`. The original behavior is that the value is only safely loaded when `tmp10` (the index-check variable) is true; this PR adds masked-load support to fix the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107670
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-08-25 21:11:09 +00:00
lezcano
2b6249e209 Wrap indirect indexing on CUDA (#105055)
Lifting this to CPU should be rather easy. @jgong5
Partially fixes https://github.com/pytorch/pytorch/issues/97365. I'd wait to close that issue once this works on CPU as well.

This fix works with dynamic shapes as well.
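The CPU kernels earlier in this log already show the pattern; as a scalar sketch (illustrative only, not the emitted code), the wrapping plus bounds check looks like:

```c++
#include <cassert>
#include <cstdint>

// Wrap a possibly-negative indirect index into [0, size), then bounds-check it
// (on CUDA this check would be a device-side assert).
int64_t wrap_index(int64_t idx, int64_t size) {
    int64_t wrapped = idx < 0 ? idx + size : idx;
    assert(0 <= wrapped && wrapped < size);
    return wrapped;
}

int main() {
    return static_cast<int>(wrap_index(-1, 64));  // yields 63
}
```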

@voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @ngimel @yf225 @chenyang78 @kadeng @muchulee8

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105055
Approved by: https://github.com/peterbell10, https://github.com/jansel
2023-08-23 11:59:20 +00:00
XiaobingSuper
610f64d72a inductor: also check index_exp when select tiling var (#106765)
When selecting the tiling var, we currently only consider loads and stores, not index expressions, which leads to accuracy issues.

Before (the index exp `i1-1` cannot be vectorized):
```
cpp_fused_constant_pad_nd_mul_0 = async_compile.cpp('''
#include "/tmp/torchinductor_xiaobing/i5/ci5uspp363v3ky6jkccllm3bxudy2fkdpqinkqhmpehfihejs7ko.h"
extern "C" void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       float* out_ptr0)
{
    #pragma omp parallel num_threads(40)
    {
        {
            #pragma omp for
            for(long i0=static_cast<long>(0L); i0<static_cast<long>(64L); i0+=static_cast<long>(1L))
            {
                #pragma GCC ivdep
                for(long i1=static_cast<long>(0L); i1<static_cast<long>(3136L); i1+=static_cast<long>(16L))
                {
                    #pragma GCC ivdep
                    for(long i2=static_cast<long>(0L); i2<static_cast<long>(8L); i2+=static_cast<long>(1L))
                    {
                        auto tmp0 = at::vec::Vectorized<int>(static_cast<int>((-1L) + i1));
                        auto tmp1 = at::vec::Vectorized<int>(static_cast<int>(0));
                        auto tmp2 = to_float_mask(tmp0 >= tmp1);
                        auto tmp3 = [&]
                        {
                            auto tmp4 = ([&]() { __at_align__ float tmpbuf[16]; for (long i1_inner = 0; i1_inner < 16; i1_inner++) tmpbuf[i1_inner] = in_ptr0[static_cast<long>((-8L) + i2 + (8L*i1) + (8L*i1_inner) + (25088L*i0))]; return at::vec::Vectorized<float>::loadu(tmpbuf); })();
                            auto tmp5 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>((-1L) + i1 + (3136L*i2) + (25088L*i0)));
                            auto tmp6 = tmp4 * tmp5;
                            return tmp6;
                        }
                        ;
                        auto tmp7 = decltype(tmp3())::blendv(at::vec::Vectorized<float>(0.0), tmp3(), to_float_mask(tmp2));
                        { __at_align__ float tmpbuf[16*sizeof(float)/sizeof(float)]; tmp7.store(tmpbuf); for (long i1_inner = 0; i1_inner < 16; i1_inner++) out_ptr0[static_cast<long>(i2 + (8L*i1) + (8L*i1_inner) + (25096L*i0))] = tmpbuf[i1_inner]; }
                    }
                }
                #pragma GCC ivdep
                for(long i1=static_cast<long>(3136L); i1<static_cast<long>(3137L); i1+=static_cast<long>(1L))
                {
                    #pragma GCC ivdep
                    for(long i2=static_cast<long>(0L); i2<static_cast<long>(8L); i2+=static_cast<long>(1L))
                    {
                        auto tmp0 = static_cast<long>((-1L) + i1);
                        auto tmp1 = static_cast<long>(0);
                        auto tmp2 = tmp0 >= tmp1;
                        auto tmp3 = [&]
                        {
                            auto tmp4 = in_ptr0[static_cast<long>((-8L) + i2 + (8L*i1) + (25088L*i0))];
                            auto tmp5 = in_ptr1[static_cast<long>((-1L) + i1 + (3136L*i2) + (25088L*i0))];
                            auto tmp6 = decltype(tmp4)(tmp4 * tmp5);
                            return tmp6;
                        }
                        ;
                        auto tmp7 = tmp2 ? tmp3() : static_cast<decltype(tmp3())>(0.0);
                        out_ptr0[static_cast<long>(i2 + (8L*i1) + (25096L*i0))] = tmp7;
                    }
                }
            }
        }
    }
}
```

After:
```
cpp_fused_constant_pad_nd_mul_0 = async_compile.cpp('''
#include "/tmp/torchinductor_xiaobing/i5/ci5uspp363v3ky6jkccllm3bxudy2fkdpqinkqhmpehfihejs7ko.h"
extern "C" void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       float* out_ptr0)
{
    #pragma omp parallel num_threads(40)
    {
        {
            #pragma omp for
            for(long i0=static_cast<long>(0L); i0<static_cast<long>(64L); i0+=static_cast<long>(1L))
            {
                #pragma GCC ivdep
                for(long i1=static_cast<long>(0L); i1<static_cast<long>(3137L); i1+=static_cast<long>(1L))
                {
                    #pragma omp simd simdlen(8)
                    for(long i2=static_cast<long>(0L); i2<static_cast<long>(8L); i2+=static_cast<long>(1L))
                    {
                        auto tmp0 = static_cast<long>((-1L) + i1);
                        auto tmp1 = static_cast<long>(0);
                        auto tmp2 = tmp0 >= tmp1;
                        auto tmp3 = [&]
                        {
                            auto tmp4 = in_ptr0[static_cast<long>((-8L) + i2 + (8L*i1) + (25088L*i0))];
                            auto tmp5 = in_ptr1[static_cast<long>((-1L) + i1 + (3136L*i2) + (25088L*i0))];
                            auto tmp6 = decltype(tmp4)(tmp4 * tmp5);
                            return tmp6;
                        }
                        ;
                        auto tmp7 = tmp2 ? tmp3() : static_cast<decltype(tmp3())>(0.0);
                        out_ptr0[static_cast<long>(i2 + (8L*i1) + (25096L*i0))] = tmp7;
                    }
                }
            }
        }
    }
}
''')

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106765
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-08-23 07:16:14 +00:00
PyTorch MergeBot
b282787409 Revert "Wrap indirect indexing on CUDA (#105055)"
This reverts commit 85c673e6b2.

Reverted https://github.com/pytorch/pytorch/pull/105055 on behalf of https://github.com/peterbell10 due to Causes failure in inductor_torchbench ([comment](https://github.com/pytorch/pytorch/pull/105055#issuecomment-1688871947))
2023-08-22 20:24:41 +00:00