Commit Graph

187 Commits

Author SHA1 Message Date
Shunting Zhang
73cc5d1cdd [inductor] benchmark fusion (#108193)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108193
Approved by: https://github.com/jansel
2023-10-26 22:18:37 +00:00
PyTorch MergeBot
485cc0faae Revert "[inductor] benchmark fusion (#108193)"
This reverts commit ec0cdcdf6a.

Reverted https://github.com/pytorch/pytorch/pull/108193 on behalf of https://github.com/ZainRizvi due to This test is breaking trunk. In the future please make sure to add the ciflow/trunk label before force merging any PR to ensure your code doesn't break those tests ([comment](https://github.com/pytorch/pytorch/pull/108193#issuecomment-1781473282))
2023-10-26 16:41:20 +00:00
chilli
e38347f490 Readded device_assert skipping in index and index_put (and also added copy to noop pass) (#112093)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112093
Approved by: https://github.com/oulgen, https://github.com/lezcano
ghstack dependencies: #111990
2023-10-26 07:54:44 +00:00
Shunting Zhang
ec0cdcdf6a [inductor] benchmark fusion (#108193)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108193
Approved by: https://github.com/jansel
2023-10-26 04:14:22 +00:00
Guilherme Leobas
f97c2dabd9 Move negative index checking to common.py - Fix issue 97365 (#108690)
Fixes https://github.com/pytorch/pytorch/issues/97365

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108690
Approved by: https://github.com/lezcano
2023-10-24 17:27:54 +00:00
Jiong Gong
8bc04f46fe [inductor cpp] use c10::bit_cast to avoid violating strict-aliasing (#110809)
Fix https://github.com/pytorch/pytorch/issues/110807

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110809
Approved by: https://github.com/jansel
2023-10-10 11:16:31 +00:00
Peter Bell
dc794ec32c [dynamo] Trace through builtin abs (#110398)
In Python, `abs(x)` does nothing but delegate to `x.__abs__()`, so we should do
the same in dynamo. This also adds `SymNode.__abs__` so we can trace through
indexing expressions involving `abs`.
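
A minimal, hedged sketch of the kind of code this unblocks (the repro below is an assumption, not taken from the PR):

```python
# Illustrative only: with dynamic shapes, x.size(0) is a SymInt, so abs() on
# the derived expression goes through SymNode.__abs__ and can be traced.
import torch

@torch.compile(dynamic=True)
def fn(x):
    return x[: abs(x.size(0) - 12)]

fn(torch.randn(16))
```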

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110398
Approved by: https://github.com/jansel, https://github.com/lezcano
2023-10-03 19:25:37 +00:00
Alexander Grund
e0348ceceb Avoid undefined behavior in JIT-generated conversion code (#110212)
The inductor/dynamo JIT generator creates C++ code using `static_cast` for type conversions.
This can be undefined behavior, e.g. for `static_cast<uint8_t>(floatVal)` where `floatVal` is a negative value.

To avoid this, the "regular" C++ code uses `c10::convert`, so use it in the JIT-generated code too.
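
A hedged sketch of the class of check this enables (not the exact reproducer from #110077): compiled and eager results for a negative-float-to-uint8 conversion should agree once the generated C++ uses `c10::convert`.

```python
# Illustrative sketch: eager already goes through c10::convert, so the
# inductor C++ backend should produce the same (well-defined) result.
import torch

def fn(x):
    return x.to(torch.uint8)

x = torch.tensor([-3.5, -1.0, 2.7])
assert torch.equal(fn(x), torch.compile(fn)(x))
```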

Fixes #110077

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110212
Approved by: https://github.com/ezyang, https://github.com/jgong5, https://github.com/desertfire
2023-10-02 12:56:41 +00:00
leslie-fang-intel
7eeb392eb3 [Inductor] Enable the item() and nonzero() codegen test on CPU (#110262)
**Summary**
Follow-up to https://github.com/pytorch/pytorch/pull/109893, which has an issue supporting CPU as reported in https://github.com/pytorch/pytorch/issues/109897. This fix mainly includes two changes:

-  The current implementation of `rename_indexing`
10c646295d/torch/_inductor/codegen/common.py (L1023) only adds symbols whose names start with `s` or `ps` into `kernel.args.sizevars`. However, an `Unbacked symint` starts with `i`, so we extend the implementation of `rename_indexing` to also support symbols starting with `i` (see the sketch after this list).
- Currently, the internal loop index names also start with `i`. Since `i` is now used for `Unbacked symint`, change the loop index names to start with `x`, which aligns with Triton.
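
A hypothetical sketch of the extended prefix check described in the first bullet (names and structure are illustrative; the real logic lives in `rename_indexing` in `torch/_inductor/codegen/common.py`):

```python
# Hypothetical sketch, not the actual implementation.
def is_size_symbol(name: str) -> bool:
    # "s0"/"ps0" were already treated as size vars; unbacked symints
    # ("i0", "i1", ...) are now accepted as well, which is why the internal
    # loop indices are renamed to "x0", "x1", ... to avoid the clash.
    return name.startswith(("s", "ps", "i"))
```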

**Test Plan**
```
python -u -m pytest -s -v test_torchinductor_dynamic_shapes.py -k test_bool_mask_nobreak
python -u -m pytest -s -v test_torchinductor_dynamic_shapes.py -k test_nonzero_size_factory_nobreak
python -u -m pytest -s -v test_torchinductor_dynamic_shapes.py -k test_item_zeros_nobreak
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110262
Approved by: https://github.com/ezyang, https://github.com/jgong5
2023-09-30 00:13:20 +00:00
Edward Z. Yang
d1a13129bb Add support for item() and nonzero() codegen in Inductor (#109893)
This is another version of
https://github.com/pytorch/pytorch/pull/109262 that I think is more
harmonious with inductor design.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109893
Approved by: https://github.com/jansel
2023-09-28 23:37:31 +00:00
Sam Larsen
7ed06e8317 [inductor] enable mypy checking in torch/_inductor/codegen/cpp.py (#109729)
Summary: Add enough typehints / ignores to enable mypy checking in torch/_inductor/codegen/cpp.py

Test Plan: lintrunner

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109729
Approved by: https://github.com/Skylion007
2023-09-25 22:53:05 +00:00
Ying Zhang
bbdce93571 Basic fp8 support in Inductor (#109168)
Add basic fp8 support in Inductor, including:
* Fix fp8 Triton codegen issues;
* Add a min_elements_per_thread requirement for fp8-related dtype conversions. More details on the Triton implementation can be found in 10f59d8ce0/lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp (L10).

Note that the current implementation only works for Pointwise. Will create follow-up PRs for Reduction.
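
A minimal pointwise example of the kind of conversion this covers (assumes an fp8-capable CUDA device and that `torch.float8_e4m3fn` is available in the build; not taken from the PR):

```python
import torch

@torch.compile
def roundtrip(x):
    # Pointwise fp8 cast and cast back; reductions are left to follow-up PRs.
    return x.to(torch.float8_e4m3fn).to(torch.float32)

y = roundtrip(torch.randn(1024, device="cuda"))
```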

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109168
Approved by: https://github.com/drisspg
2023-09-23 04:41:41 +00:00
Nikita Shulga
a9bf1031d4 [BE] Do not use numpy in torch._inductor.codegen.cpp (#109324)
`s/numpy.iinfo(numpy.int32)/torch.iinfo(torch.int32)/` as those two are interchangeable
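
A quick sanity check of the interchangeability claim (assumes numpy is installed alongside torch):

```python
import numpy
import torch

assert torch.iinfo(torch.int32).max == numpy.iinfo(numpy.int32).max
assert torch.iinfo(torch.int32).min == numpy.iinfo(numpy.int32).min
```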

Partially addresses https://github.com/pytorch/pytorch/issues/109387

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109324
Approved by: https://github.com/albanD
2023-09-15 17:29:10 +00:00
Ying Zhang
097fd43f8c [Inductor CUTLASS backend] Step 4: CUDA (template) kernels (#107931)
This is step 4 of adding cutlass as an alternative inductor backend.
Full tests can be found in the last PR in the stack.

Feature request: https://github.com/pytorch/pytorch/issues/106991.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107931
Approved by: https://github.com/aakhundov, https://github.com/jansel, https://github.com/kadeng
ghstack dependencies: #107802, #107847, #107901
2023-09-12 17:44:38 +00:00
Sherlock Huang
b9dfdc091b [AOTInductor][Reland] Proxy Executor for Extern Fallback kernels (#107279) (#108350)
Summary:

This is a prototype for running extern fallback kernels with a host side proxy executor.

Sample of generated cpp wrapper call:
```
        at::Tensor buf0;  // output buffer
        void* tensor_args_var_0[] = {&arg0_1, &arg0_1, &arg1_1, &arg0_1, &arg1_1, &buf0};
        int64_t int_args_var_1[] = {81, 81, 7, 7, 7, 81};
        proxy_executor->call_function("buf0", int_args_var_1, tensor_args_var_0);
```

- In my current implementation, the proxy executor interprets the raw pointers according to the op's schema.
This assumes that a custom op MUST have a valid schema registered with the Dispatcher. (I would like to validate this assumption.)
- I am using the callBoxed() API of the custom kernels. This is inevitable, as we wish to have a single call_function API for all possible custom kernels.

- These are all the input argument types I have supported so far:
       union Argument {
         # Bool value does not matter
         1: bool asNone;
         2: TensorArgument asTensor;
         3: list<TensorArgument> asTensors;
         5: i64 asInt;
         7: list<i64> asInts;
         8: double asFloat;
         9: list<double> asFloats;
         10: string asString;
         10.5: list<string> asStrings;
         11: SymIntArgument asSymInt;
         12: list<SymIntArgument> asSymInts;
         13: ScalarType asScalarType;
         14: MemoryFormat asMemoryFormat;
         15: Layout asLayout;
         16: Device asDevice;
         17: bool asBool;
         18: list<bool> asBools;
       }

- Need a policy for handling unpopulated arguments with default values. Here are the options, and the choice has BC implications.
1. Require the exported fx graph to explicitly populate default values if the user doesn't specify them.
2. Require the cpp wrapper to explicitly populate default values if the fx graph doesn't specify them.
3. Have the proxy executor look up default values from the op schema.

For fixing T162112344

Test Plan:
frontend:
buck2 run mode/dev-sand mode/inplace -c fbcode.enable_gpu_sections=True sigmoid/frontend:export_main

test:
 buck2 run mode/dev-sand //deeplearning/aot_inductor/test:test_custom_ops

backend:
buck2 run mode/dev-nosan //deeplearning/aot_inductor/fb:main

buck2 test 'fbcode//mode/opt' fbcode//caffe2/torch/fb/model_transform/experimental/benchmark/test:test_aot_inductor_benchmark -- --exact 'caffe2/torch/fb/model_transform/experimental/benchmark/test:test_aot_inductor_benchmark - test_aot_inductor_benchmark_cmf30x (caffe2.torch.fb.model_transform.experimental.benchmark.test.test_aot_inductor_benchmark.AOTInductorBenchmark)'

Reviewed By: suo

Differential Revision: D48747417

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108350
Approved by: https://github.com/izaitsevfb
2023-09-02 17:14:10 +00:00
Shunting Zhang
7cb4bf675b [inductor] no-side-effect codegen (#107617)
Inductor kernel codegen previously had the following side effects:
- in `Kernel.__exit__`, we add locally used buffers to graph.removed_buffers
- during codegen, we do memory allocation/free.

These make it hard to do multiple versions of codegen for the same kernel. This PR refactors the code so that kernel codegen does not change graph-level state. After codegening a kernel, the graph-level state is unchanged, so we can go on to codegen another version of the kernel if we want (a sketch of the idea follows).
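
A hypothetical sketch of the pattern, purely illustrative of the idea (the real refactor is in the PR):

```python
# Hypothetical sketch: record would-be graph mutations locally and only
# commit them once a particular codegen attempt is chosen.
class PendingGraphUpdates:
    def __init__(self):
        self.removed_buffers = set()

def codegen_candidate(kernel_nodes):
    pending = PendingGraphUpdates()
    # ... generate code for kernel_nodes, recording mutations on `pending`
    # instead of on the global graph state ...
    return pending

def commit(graph, pending):
    graph.removed_buffers |= pending.removed_buffers
```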

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107617
Approved by: https://github.com/jansel
2023-08-31 00:25:17 +00:00
leslie-fang-intel
fdbc2ec5cb [Quant][Inductor] Fix the non contiguous load with uint8 data type (#106958)
**Summary**
Currently, the load vectorization code generation with `non_contiguous` and `uint8` data type has an issue in determining the data type. It caused wrong results in the `shufflenet_v2_x1_0` model after we enabled the `cat` quantization recipe.

- Code previously generated for the example in this PR:

```
cpp_fused_clone_view_0 = async_compile.cpp('''
#include "/tmp/torchinductor_root/i5/ci5uspp363v3ky6jkccllm3bxudy2fkdpqinkqhmpehfihejs7ko.h"
extern "C" void kernel(const unsigned char* in_ptr0,
                       float* out_ptr0)
{
    #pragma omp parallel num_threads(56)
    {
        {
            #pragma omp for
            for(long i0=static_cast<long>(0L); i0<static_cast<long>(232L); i0+=static_cast<long>(1L))
            {
                for(long i1=static_cast<long>(0L); i1<static_cast<long>(784L); i1+=static_cast<long>(16L))
                {
                    auto tmp0 = ([&]() { __at_align__ float tmpbuf[16]; for (long i1_inner = 0; i1_inner < 16; i1_inner++) tmpbuf[i1_inner] = flag_to_float_scalar(in_ptr0[static_cast<long>((116L*(static_cast<long>(i0) % static_cast<long>(2L))) + (232L*i1) + (232L*i1_inner) + (at::native::div_floor_integer(i0, 2L)))]); return at::vec::Vectorized<uint8_t>::loadu_one_fourth(tmpbuf); })();
                    auto tmp1 = at::vec::convert_uint8_to_float(tmp0);
                    auto tmp2 = at::vec::Vectorized<float>(static_cast<float>(0.0));
                    auto tmp3 = tmp1 - tmp2;
                    auto tmp4 = at::vec::Vectorized<float>(static_cast<float>(1.0));
                    auto tmp5 = tmp3 * tmp4;
                    auto tmp6 = tmp5 * tmp4;
                    auto tmp7 = tmp6.round();
                    auto tmp8 = tmp7 + tmp2;
                    auto tmp9 = at::vec::maximum(tmp8, tmp2);
                    auto tmp10 = at::vec::Vectorized<float>(static_cast<float>(255.0));
                    auto tmp11 = at::vec::minimum(tmp9, tmp10);
                    auto tmp12 = at::vec::convert_float_to_uint8(tmp11);
                    auto tmp13 = at::vec::convert_uint8_to_float(tmp12);
                    auto tmp14 = tmp13 - tmp2;
                    auto tmp15 = tmp14 * tmp4;
                    tmp15.store(out_ptr0 + static_cast<long>(i1 + (784L*i0)));
                }
            }
        }
    }
}
''')
```

- After this PR, the code gen is:

```
cpp_fused_clone_view_0 = async_compile.cpp('''
#include "/tmp/torchinductor_root/i5/ci5uspp363v3ky6jkccllm3bxudy2fkdpqinkqhmpehfihejs7ko.h"
extern "C" void kernel(const unsigned char* in_ptr0,
                       float* out_ptr0)
{
    #pragma omp parallel num_threads(56)
    {
        {
            #pragma omp for
            for(long i0=static_cast<long>(0L); i0<static_cast<long>(232L); i0+=static_cast<long>(1L))
            {
                for(long i1=static_cast<long>(0L); i1<static_cast<long>(784L); i1+=static_cast<long>(16L))
                {
                    auto tmp0 = ([&]() { __at_align__ unsigned char tmpbuf[16]; for (long i1_inner = 0; i1_inner < 16; i1_inner++) tmpbuf[i1_inner] = in_ptr0[static_cast<long>((116L*(static_cast<long>(i0) % static_cast<long>(2L))) + (232L*i1) + (232L*i1_inner) + (at::native::div_floor_integer(i0, 2L)))]; return at::vec::Vectorized<uint8_t>::loadu_one_fourth(tmpbuf); })();
                    auto tmp1 = at::vec::convert_uint8_to_float(tmp0);
                    auto tmp2 = at::vec::Vectorized<float>(static_cast<float>(0.0));
                    auto tmp3 = tmp1 - tmp2;
                    auto tmp4 = at::vec::Vectorized<float>(static_cast<float>(1.0));
                    auto tmp5 = tmp3 * tmp4;
                    auto tmp6 = tmp5 * tmp4;
                    auto tmp7 = tmp6.round();
                    auto tmp8 = tmp7 + tmp2;
                    auto tmp9 = at::vec::maximum(tmp8, tmp2);
                    auto tmp10 = at::vec::Vectorized<float>(static_cast<float>(255.0));
                    auto tmp11 = at::vec::minimum(tmp9, tmp10);
                    auto tmp12 = at::vec::convert_float_to_uint8(tmp11);
                    auto tmp13 = at::vec::convert_uint8_to_float(tmp12);
                    auto tmp14 = tmp13 - tmp2;
                    auto tmp15 = tmp14 * tmp4;
                    tmp15.store(out_ptr0 + static_cast<long>(i1 + (784L*i0)));
                }
            }
        }
    }
}
''')
```

**Test Plan**
```
clear && python -m pytest test_cpu_repro.py -k test_non_contiguous_load_buf_quant
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106958
Approved by: https://github.com/jgong5, https://github.com/eellison
ghstack dependencies: #106836, #106838
2023-08-26 16:58:45 +00:00
XiaobingSuper
d2105a8688 inductor: support masked load for cpu path (#107670)
For the following max pooling code:

```

#pragma GCC ivdep
                    for(long i2=static_cast<long>(0L); i2<static_cast<long>(56L); i2+=static_cast<long>(1L))
                    {
                        for(long i3=static_cast<long>(0L); i3<static_cast<long>(64L); i3+=static_cast<long>(16L))
                        {
                            auto tmp0 = at::vec::Vectorized<int>(static_cast<int>((-1L) + (2L*i1)));
                            auto tmp1 = at::vec::Vectorized<int>(static_cast<int>(0));
                            auto tmp2 = to_float_mask(tmp0 >= tmp1);
                            auto tmp3 = at::vec::Vectorized<int>(static_cast<int>(112));
                            auto tmp4 = to_float_mask(tmp0 < tmp3);
                            auto tmp5 = tmp2 & tmp4;
                            auto tmp6 = at::vec::Vectorized<int>(static_cast<int>((-1L) + (2L*i2)));
                            auto tmp7 = to_float_mask(tmp6 >= tmp1);
                            auto tmp8 = to_float_mask(tmp6 < tmp3);
                            auto tmp9 = tmp7 & tmp8;
                            auto tmp10 = tmp5 & tmp9;
                            auto tmp11 = [&]
                            {
                                auto tmp12 = at::vec::Vectorized<bfloat16>::loadu(in_ptr0 + static_cast<long>((-7232L) + i3 + (128L*i2) + (14336L*i1) + (802816L*i0)), 16);
                                auto tmp13 = cvt_lowp_fp_to_fp32<bfloat16>(tmp12);

                                return tmp13;
                            }
                            ;
                            auto tmp14 = decltype(tmp11())::blendv(at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity()), tmp11(), to_float_mask(tmp10));
```

The index of `tmp12` may not be a valid index: for example, with `i1=0, i2=0, i3=0` the index is `-7232L`, which is out of bounds. We may hit a segmentation fault when we call `tmp11()`. The intended behavior is that the value is only read when `tmp10` (the index-check variable) is true; this PR adds masked load support to fix this issue.
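
A hedged repro sketch for this scenario (shapes and dtype chosen to match the snippet above; not the original test case):

```python
import torch
import torch.nn.functional as F

def fn(x):
    # Padding produces out-of-range neighbor indices at the borders; they are
    # masked off, so the masked load must not actually dereference them.
    return F.max_pool2d(x, kernel_size=3, stride=2, padding=1)

x = torch.randn(1, 64, 112, 112, dtype=torch.bfloat16)
out = torch.compile(fn)(x)
```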

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107670
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-08-25 21:11:09 +00:00
lezcano
2b6249e209 Wrap indirect indexing on CUDA (#105055)
Lifting this to CPU should be rather easy. @jgong5
Partially fixes https://github.com/pytorch/pytorch/issues/97365. I'd wait to close that issue once this works on CPU as well.

This fix works with dynamic shapes as well.
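
An illustrative example of the behavior being fixed (my assumption, not the original repro from #97365; assumes a CUDA device): negative indirect indices should be wrapped so compiled and eager results agree.

```python
import torch

def gather(x, idx):
    return x[idx]

x = torch.randn(8, device="cuda")
idx = torch.tensor([-1, -8, 3], device="cuda")
assert torch.equal(gather(x, idx), torch.compile(gather)(x, idx))
```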

@voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @ngimel @yf225 @chenyang78 @kadeng @muchulee8

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105055
Approved by: https://github.com/peterbell10, https://github.com/jansel
2023-08-23 11:59:20 +00:00
XiaobingSuper
610f64d72a inductor: also check index_exp when select tiling var (#106765)
When selecting the tiling var, we currently only consider loads and stores and do not consider the index expression, which leads to accuracy issues:

Before (the index expression `i1-1` cannot be vectorized):
```
cpp_fused_constant_pad_nd_mul_0 = async_compile.cpp('''
#include "/tmp/torchinductor_xiaobing/i5/ci5uspp363v3ky6jkccllm3bxudy2fkdpqinkqhmpehfihejs7ko.h"
extern "C" void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       float* out_ptr0)
{
    #pragma omp parallel num_threads(40)
    {
        {
            #pragma omp for
            for(long i0=static_cast<long>(0L); i0<static_cast<long>(64L); i0+=static_cast<long>(1L))
            {
                #pragma GCC ivdep
                for(long i1=static_cast<long>(0L); i1<static_cast<long>(3136L); i1+=static_cast<long>(16L))
                {
                    #pragma GCC ivdep
                    for(long i2=static_cast<long>(0L); i2<static_cast<long>(8L); i2+=static_cast<long>(1L))
                    {
                        auto tmp0 = at::vec::Vectorized<int>(static_cast<int>((-1L) + i1));
                        auto tmp1 = at::vec::Vectorized<int>(static_cast<int>(0));
                        auto tmp2 = to_float_mask(tmp0 >= tmp1);
                        auto tmp3 = [&]
                        {
                            auto tmp4 = ([&]() { __at_align__ float tmpbuf[16]; for (long i1_inner = 0; i1_inner < 16; i1_inner++) tmpbuf[i1_inner] = in_ptr0[static_cast<long>((-8L) + i2 + (8L*i1) + (8L*i1_inner) + (25088L*i0))]; return at::vec::Vectorized<float>::loadu(tmpbuf); })();
                            auto tmp5 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>((-1L) + i1 + (3136L*i2) + (25088L*i0)));
                            auto tmp6 = tmp4 * tmp5;
                            return tmp6;
                        }
                        ;
                        auto tmp7 = decltype(tmp3())::blendv(at::vec::Vectorized<float>(0.0), tmp3(), to_float_mask(tmp2));
                        { __at_align__ float tmpbuf[16*sizeof(float)/sizeof(float)]; tmp7.store(tmpbuf); for (long i1_inner = 0; i1_inner < 16; i1_inner++) out_ptr0[static_cast<long>(i2 + (8L*i1) + (8L*i1_inner) + (25096L*i0))] = tmpbuf[i1_inner]; }
                    }
                }
                #pragma GCC ivdep
                for(long i1=static_cast<long>(3136L); i1<static_cast<long>(3137L); i1+=static_cast<long>(1L))
                {
                    #pragma GCC ivdep
                    for(long i2=static_cast<long>(0L); i2<static_cast<long>(8L); i2+=static_cast<long>(1L))
                    {
                        auto tmp0 = static_cast<long>((-1L) + i1);
                        auto tmp1 = static_cast<long>(0);
                        auto tmp2 = tmp0 >= tmp1;
                        auto tmp3 = [&]
                        {
                            auto tmp4 = in_ptr0[static_cast<long>((-8L) + i2 + (8L*i1) + (25088L*i0))];
                            auto tmp5 = in_ptr1[static_cast<long>((-1L) + i1 + (3136L*i2) + (25088L*i0))];
                            auto tmp6 = decltype(tmp4)(tmp4 * tmp5);
                            return tmp6;
                        }
                        ;
                        auto tmp7 = tmp2 ? tmp3() : static_cast<decltype(tmp3())>(0.0);
                        out_ptr0[static_cast<long>(i2 + (8L*i1) + (25096L*i0))] = tmp7;
                    }
                }
            }
        }
    }
}
```

After:
```
cpp_fused_constant_pad_nd_mul_0 = async_compile.cpp('''
#include "/tmp/torchinductor_xiaobing/i5/ci5uspp363v3ky6jkccllm3bxudy2fkdpqinkqhmpehfihejs7ko.h"
extern "C" void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       float* out_ptr0)
{
    #pragma omp parallel num_threads(40)
    {
        {
            #pragma omp for
            for(long i0=static_cast<long>(0L); i0<static_cast<long>(64L); i0+=static_cast<long>(1L))
            {
                #pragma GCC ivdep
                for(long i1=static_cast<long>(0L); i1<static_cast<long>(3137L); i1+=static_cast<long>(1L))
                {
                    #pragma omp simd simdlen(8)
                    for(long i2=static_cast<long>(0L); i2<static_cast<long>(8L); i2+=static_cast<long>(1L))
                    {
                        auto tmp0 = static_cast<long>((-1L) + i1);
                        auto tmp1 = static_cast<long>(0);
                        auto tmp2 = tmp0 >= tmp1;
                        auto tmp3 = [&]
                        {
                            auto tmp4 = in_ptr0[static_cast<long>((-8L) + i2 + (8L*i1) + (25088L*i0))];
                            auto tmp5 = in_ptr1[static_cast<long>((-1L) + i1 + (3136L*i2) + (25088L*i0))];
                            auto tmp6 = decltype(tmp4)(tmp4 * tmp5);
                            return tmp6;
                        }
                        ;
                        auto tmp7 = tmp2 ? tmp3() : static_cast<decltype(tmp3())>(0.0);
                        out_ptr0[static_cast<long>(i2 + (8L*i1) + (25096L*i0))] = tmp7;
                    }
                }
            }
        }
    }
}
''')

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106765
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-08-23 07:16:14 +00:00
PyTorch MergeBot
b282787409 Revert "Wrap indirect indexing on CUDA (#105055)"
This reverts commit 85c673e6b2.

Reverted https://github.com/pytorch/pytorch/pull/105055 on behalf of https://github.com/peterbell10 due to Causes failure in inductor_torchbench ([comment](https://github.com/pytorch/pytorch/pull/105055#issuecomment-1688871947))
2023-08-22 20:24:41 +00:00
lezcano
85c673e6b2 Wrap indirect indexing on CUDA (#105055)
Lifting this to CPU should be rather easy. @jgong5
Partially fixes https://github.com/pytorch/pytorch/issues/97365. I'd wait to close that issue once this works on CPU as well.

This fix works with dynamic shapes as well.

@voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @ngimel @yf225 @chenyang78 @kadeng @muchulee8

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105055
Approved by: https://github.com/peterbell10, https://github.com/jansel
2023-08-22 01:06:35 +00:00
Peter Bell
18b1c2907d [inductor] Add ir.WelfordReduction with multiple outputs (#104725)
This replaces the `var_unnormalized` reduction type with `welford_reduce`, which takes the input data and outputs not just the variance but also the mean and weights that make up the full Welford accumulator state. Thus we can avoid re-computing the mean, and we now have enough information to create a multilayer reduction, which I implement here by adding a second reduction type called `welford_combine` that reduces over all three inputs simultaneously.

Multi-layer support is particularly important as normalization operators like BatchNorm are being split in many timm models, which meant `var_unnormalized` had to fall back to two-pass variance calculation.
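
For reference, a standalone Python sketch of the two accumulator operations (standard Welford/Chan formulas; the names and layout are mine, not the inductor IR):

```python
def welford_reduce(state, x):
    # Fold one element into a (mean, m2, weight) accumulator.
    mean, m2, weight = state
    weight += 1
    delta = x - mean
    mean += delta / weight
    m2 += delta * (x - mean)
    return mean, m2, weight

def welford_combine(a, b):
    # Merge two accumulators (assumes at least one element on each side);
    # this is what makes multilayer/split reductions possible.
    mean_a, m2_a, n_a = a
    mean_b, m2_b, n_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    return (mean_a + delta * n_b / n,
            m2_a + m2_b + delta * delta * n_a * n_b / n,
            n)
```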

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104725
Approved by: https://github.com/lezcano
2023-08-18 08:18:01 +00:00
Wang, Eikan
9921b48558 Extend Inductor to support the third-party backend (#106874)
## Summary

This is a re-land PR for https://github.com/pytorch/pytorch/pull/100706 that addresses the compilation latency performance regression.

## Root Cause

Regarding the C++/OpenMP backend, `codecache.pick_vec_isa()`, which checks the vectorization ISA, is a time-consuming one-shot operation. It makes importing the `codegen.cpp` package take longer because the package's `LoopLevel` is decorated with `@dataclasses.dataclass`, and the decorator invokes `codecache.pick_vec_isa()` to initialize the `simd_nelements` of the `LoopLevel`.
c14cf312c9/torch/_inductor/codegen/cpp.py (L2883C53-L2883C53)

The Triton backend does not need to touch it, but we'd prefer to keep the code uniform. Therefore, the new design simultaneously registers `CppScheduling` for CPU and `TritonScheduling` for Triton regardless of whether the current backend is Triton. This brings additional overhead to the Triton backend.

```python
def init_backend_registration(self):
    if get_scheduling_for_device("cpu") is None:
        from .codegen.cpp import CppScheduling

        register_backend_for_device("cpu", CppScheduling, WrapperCodeGen)

    if get_scheduling_for_device("cuda") is None:
        from .codegen.triton import TritonScheduling

        register_backend_for_device("cuda", TritonScheduling, WrapperCodeGen)
```

## Solution

To resolve the compilation latency regression for the Triton backend, we changed `LoopLevel` a little bit ([new code changes](https://github.com/pytorch/pytorch/pull/106874/files#diff-5ab7b0235e2076a5fc6629ba0b109208940f5b94f5c13babc3e0f87cf4fcec82R2893-R2904)) by moving `simd_nelements` to `__post_init__`, and the compilation performance comes back.

## Compilation Latency Performance Result
We ran a single model benchmark and reproduced the compilation regression:

- Run `python benchmarks/dynamo/torchbench.py -dcuda --training --performance --inductor --only hf_Bart`

- W/ PR #100706, the compilation latency is about **57~58**
```
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cuda,hf_Bart,4,1.556712,109.676554,57.055242,0.936330,5.760698,6.152422,642,1,8,7
cuda,hf_Bart,4,1.646658,109.621747,57.909817,0.936330,5.760698,6.152422,642,1,8,7
```

- W/O PR #100706, the compilation latency is about **46~47**
```
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cuda,hf_Bart,4,1.599065,108.702480,47.490346,0.936330,5.760698,6.152422,642,1,8,7
cuda,hf_Bart,4,1.588419,108.431411,46.983041,0.936330,5.760698,6.152422,642,1,8,7
```

This PR fixed the compilation performance regression.

- W/ this PR #106874, the compilation latency is about **47~48**
```
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cuda,hf_Bart,4,1.586261,108.149467,47.481058,0.936330,5.760698,6.152422,642,1,8,7
cuda,hf_Bart,4,1.758915,108.613899,47.925633,0.936330,5.760698,6.152422,642,1,8,7
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106874
Approved by: https://github.com/jansel
2023-08-16 04:11:36 +00:00
Yanbo Liang
1819fe1324 Revert "Extend Inductor to support the third-party backend (#100706)" (#106652)
This reverts commit 05bd24bb35.

It caused compilation time regression on torchbench, huggingface and dynamic models.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106652
Approved by: https://github.com/davidberard98, https://github.com/voznesenskym
2023-08-05 06:41:08 +00:00
haozhe.zhu
60237ccbdf fix bf16 constant accuracy (#105827)
This PR aims to sort out the data type for `constant`.

The constant should be promoted to float (https://github.com/pytorch/pytorch/pull/105440), so there are several changes to make:
 - Data type propagation should propagate a constant node to `float` dtype if the original dtype is `bfloat16`
 - We do not need to insert `to_dtype` after the `constant` node; directly initializing an `fp32` constant is faster.
```
    vectorized<bfloat16> tmp(value);
    vectorized <float> tmp1 = cvt_bf16_fp32(tmp);
->
    vectorized<float> tmp(value);
```
 - move `constant` out of the list of operations that can support bf16 without converting to fp32

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105827
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-08-03 01:17:50 +00:00
Wang, Eikan
05bd24bb35 Extend Inductor to support the third-party backend (#100706)
This PR intends to extend Inductor to support third-party backends that only focus on code generation, just like the existing C++/OpenMP and Triton backends.

Currently, the generated code by Inductor contains two major parts. One is the kernel, and the other is the Python wrapper to glue the kernel. Therefore, the third-party backend needs to customize the two parts to generate its specific code.

- Python wrapper code generation

  Inductor provides a `WrapperCodeGen` class to generate the Python wrapper code to glue the kernel. Therefore, it is straightforward for the third-party backend to generate the backend-specific Python wrapper code. It just needs to inherit the `WrapperCodeGen` class and purposely override the particular member functions.

- Kernel code generation

  It is driven by different `Scheduling`. Hence, the third-party backend needs to provide a custom `Scheduling` for its specific kernel code generation. Currently, `CppScheduling` and `TritonScheduling` are for C++/OpenMP and Triton backend, respectively. But there is no common `Scheduling` class. Based on the scheduling invocation, this PR abstracts a common `Scheduling` class containing the following member functions.

  -   [group_fn](71c4becda7/torch/_inductor/scheduler.py (LL649C64-L649C64))
  - [flush](71c4becda7/torch/_inductor/scheduler.py (L1150))
  - [can_fuse_vertical](71c4becda7/torch/_inductor/scheduler.py (L1006))
  - [can_fuse_horizontal](71c4becda7/torch/_inductor/scheduler.py (LL1008C45-L1008C64))
  - [codegen_template](71c4becda7/torch/_inductor/scheduler.py (L1234)) _This function is only available for triton. If the third-party backend behaves as a sub-class of `TritonScheduling`, it can override it or reuse it._
  - [codegen_nodes](71c4becda7/torch/_inductor/scheduler.py (L1234))
  - [codegen_sync](71c4becda7/torch/_inductor/scheduler.py (LL1251C1-L1251C1)). _This function is only available for triton debug purpose. But it might also be useful for other computation devices. Therefore, we'd prefer to keep this function._

  The third-party backend needs to inherit from the `Scheduling` class and implement these functions.
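
A hedged sketch of what such a registration might look like (the device name, class names, and method bodies are placeholders; the registration helpers are assumed to live where this PR adds them, in `torch/_inductor/codegen/common.py`):

```python
from torch._inductor.codegen.common import register_backend_for_device
from torch._inductor.codegen.wrapper import WrapperCodeGen

class MyDeviceScheduling:  # would implement the common Scheduling interface
    def __init__(self, scheduler):
        self.scheduler = scheduler
    def group_fn(self, sizes): ...
    def can_fuse_vertical(self, node1, node2): return False
    def can_fuse_horizontal(self, node1, node2): return False
    def codegen_nodes(self, nodes): ...
    def codegen_sync(self): ...
    def flush(self): ...

class MyDeviceWrapperCodeGen(WrapperCodeGen):
    pass  # override particular member functions as needed

register_backend_for_device("mydevice", MyDeviceScheduling, MyDeviceWrapperCodeGen)
```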

Regarding some other classes like `CppKernel` and `TritonKernel` for code generation, they are used by or part of the logic of either `Scheduling` or `WrapperCodeGen`. Hence, this PR does not define the interface and leaves the flexibility to the third-party backend. The third-party backend can decide to implement these classes from scratch or reuse them by inheriting and overriding them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100706
Approved by: https://github.com/jansel
2023-08-02 05:13:51 +00:00
haozhe.zhu
952021934f inductor: legalize fp16 (#100857)
This PR aims to vectorize FP16 for CPU, following what BF16 has done.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100857
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-07-27 02:31:40 +00:00
PyTorch MergeBot
dfc9874740 Revert "inductor: promote half/bfloat16 constant to float for cpu vectorization path (#105440)"
This reverts commit 18bcf62bbc.

Reverted https://github.com/pytorch/pytorch/pull/105440 on behalf of https://github.com/XiaobingSuper due to introduce core dumped when init bfloat16 zero tensor ([comment](https://github.com/pytorch/pytorch/pull/105440#issuecomment-1643079005))
2023-07-20 03:56:44 +00:00
Justin Chu
cb7a30f656 [BE] Enable ruff's UP rules and autoformat inductor/ (#105431)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105431
Approved by: https://github.com/albanD
2023-07-19 13:45:00 +00:00
XiaobingSuper
18bcf62bbc inductor: promote half/bfloat16 constant to float for cpu vectorization path (#105440)
As in the scalar path, we should also promote half/bfloat16 constants to float for better accuracy. After this PR, the TIMM `dm_nfnet` model amp path passes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105440
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-07-19 06:53:23 +00:00
XiaobingSuper
4b3c261a2e inductor: fix issue of vectorization when the store's index is constant value (#105314)
Fix #104515

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105314
Approved by: https://github.com/jgong5, https://github.com/desertfire
2023-07-18 04:54:25 +00:00
lezcano
87a3ed58cb Fix ranges for range vars (#104987)
Ranges are inclusive on both ends...

We take this chance to delete a stale comment

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104987
Approved by: https://github.com/jgong5, https://github.com/eellison
2023-07-14 13:43:05 +00:00
Peter Bell
66fb83293e [inductor] Add min/max to index propagation pass (#105020)
This allows `ops.minimum` and `ops.maximum` used in indirect indexing to be hoisted
into direct indexing expressions. I also add support to the cpp printer for
Min/Max and fix the triton printer to support multi-argument Min/Max.
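
An illustrative example (my assumption, not from the PR) where the index is built from a pointwise minimum and can now be folded into a direct indexing expression:

```python
import torch

def shift_clamp(x):
    n = x.size(0)
    idx = torch.clamp(torch.arange(n, device=x.device) + 1, max=n - 1)
    # With min/max handled by index propagation, x[idx] can become a direct
    # load at index Min(i + 1, n - 1) instead of an indirect gather.
    return x[idx]

y = torch.compile(shift_clamp)(torch.randn(128))
```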

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105020
Approved by: https://github.com/lezcano
2023-07-12 19:03:01 +00:00
Peter Bell
e80787c8e1 [inductor] Split ops.reduction into reduction and store_reduction (#102737)
This is intended as a first step towards reductions with multiple outputs. This
also incidentally improves CSE of reductions under C++ codegen. For example,
```python
def fn(x):
    return torch.argmin(x, dim=-1), torch.argmin(x, dim=-1)
```

Currently this generates two reductions, where the common load is CSEd
```cpp
for(long i1=static_cast<long>(0L); i1<static_cast<long>(10); i1+=static_cast<long>(1L))
{
    auto tmp0 = in_ptr0[static_cast<long>(i1 + (10L*i0))];
    if (tmp_acc0.value > tmp0) {
        tmp_acc0.index = i1; tmp_acc0.value = tmp0;
    }
    if (tmp_acc1.value > tmp0) {
        tmp_acc1.index = i1; tmp_acc1.value = tmp0;
    }
}
auto tmp1 = tmp_acc0.index;
out_ptr0[static_cast<long>(i0)] = tmp1;
auto tmp2 = tmp_acc1.index;
out_ptr1[static_cast<long>(i0)] = tmp2;
```

but with this change it gets CSEd to a single accumulator

```cpp
for(long i1=static_cast<long>(0L); i1<static_cast<long>(10L); i1+=static_cast<long>(1L))
{
    auto tmp0 = in_ptr0[static_cast<long>(i1 + (10L*i0))];
    if (tmp_acc0.value > tmp0) {
        tmp_acc0.index = i1; tmp_acc0.value = tmp0;
    }
}
auto tmp1 = tmp_acc0.index;
out_ptr0[static_cast<long>(i0)] = tmp1;
out_ptr1[static_cast<long>(i0)] = tmp1;
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102737
Approved by: https://github.com/jgong5, https://github.com/lezcano
2023-07-08 20:48:29 +00:00
Peter Bell
0ceca92f80 [inductor] Add single pass "var_unnormalized" reduction_type (#102486)
This is a bit inefficient because it computes the mean and throws it
away since ir.Reduction nodes only have 1 output. However, the mean
can at least be scheduled into the same loop as the variance now since
there is no data dependency. Thus we can take fewer passes over the
data.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102486
Approved by: https://github.com/lezcano, https://github.com/jansel
2023-07-08 20:48:29 +00:00
lezcano
710abc41cc Implement bound_sympy (#104559)
The analysis for SymPy expressions was incorrect: even though it said
that the assumption was "smoothness", the assumption was, in fact, that the
formula was monotone in every variable. In other words, it was
assuming that the derivative does not change sign in any variable (!!).

We implement a function that, given bounds on the values of the free
symbols of a sympy expression, gives a bound on the expression
itself.
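
A small example (mine, not from the PR) of why evaluating only at the endpoints is unsound without monotonicity:

```python
import sympy

x = sympy.Symbol("x", real=True)
f = x * (10 - x)
# Both endpoints of [0, 10] evaluate to 0 ...
print(f.subs(x, 0), f.subs(x, 10))                  # 0 0
# ... but the true maximum on the interval is 25, at x = 5,
# so a sound bounding function must account for interior extrema.
print(sympy.maximum(f, x, sympy.Interval(0, 10)))   # 25
```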

We reshuffle a few things in value_ranges.py to create a
`SymPyValueRangeAnalysis` class, but we do not really change any code.
The only relevant change in that file is the addition of the
`bound_sympy` function. We do this because we don't want to inadvertently
use any fallbacks in this case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104559
Approved by: https://github.com/eellison
2023-07-07 23:52:14 +00:00
PyTorch MergeBot
8ca63ff9a8 Revert "[inductor] Add single pass "var_unnormalized" reduction_type (#102486)"
This reverts commit 7e098f9559.

Reverted https://github.com/pytorch/pytorch/pull/102486 on behalf of https://github.com/clee2000 due to sorry but this seems to have broken inductor/test_torchinductor.py::CpuTests::test_std_cpu on mac x86 64 machines 7e098f9559 https://github.com/pytorch/pytorch/actions/runs/5479008241/jobs/9981443710 ([comment](https://github.com/pytorch/pytorch/pull/102486#issuecomment-1624739465))
2023-07-07 04:57:20 +00:00
PyTorch MergeBot
1280b19827 Revert "[inductor] Split ops.reduction into reduction and store_reduction (#102737)"
This reverts commit 59b8d5be74.

Reverted https://github.com/pytorch/pytorch/pull/102737 on behalf of https://github.com/clee2000 due to sorry but i need to revert this to revert the other one in the stack ([comment](https://github.com/pytorch/pytorch/pull/102737#issuecomment-1624735108))
2023-07-07 04:53:14 +00:00
Brian Hirsh
2efe4d809f [hotfix inductor test] disable cpp vectorization codegen in fbcode for inductor (#104560)
Summary:
After D46364355 landed, a few inductor internal tests started failing. When I ran this locally:
```
buck2 test fbcode//mode/dev-nosan fbcode//caffe2/test/inductor:config
```

The test appeared to hang with this output, until it would fail with a timeout after 10 minutes passed:
```
Test caffe2/test/inductor:config -- discovering tests [local_execute]
```

Eventually, I realized that inductor has a value `HAS_CPU` (https://www.internalfb.com/code/fbsource/[6cc47fa5eb77a93d91a519d3eb3df67ceddb8faa]/fbcode/caffe2/torch/testing/_internal/inductor_utils.py?lines=23) that is implemented lazily. Part of that implementation involves inspecting `/proc/cpuinfo` to figure out what vectorized instructions are available, and that call appeared to hang (https://www.internalfb.com/code/fbsource/[6cc47fa5eb77a93d91a519d3eb3df67ceddb8faa]/fbcode/caffe2/torch/_inductor/codecache.py?lines=568).

Since vectorized codegen for inductor cpu internally already isn't working, I hardcoded that test to fail for now in fbcode.

Test Plan:
Confirmed that this passes:
`buck2 test fbcode//mode/dev-nosan fbcode//caffe2/test/inductor:config`

Differential Revision: D47199912

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104560
Approved by: https://github.com/desertfire, https://github.com/bertmaher
2023-07-06 19:00:13 +00:00
XiaobingSuper
c4cf90aad1 inductor: fix assert error when load a bfloat16 inf constant (#104614)
Fix ```nanogpt_generate``` bfloat16 path error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104614
Approved by: https://github.com/jgong5, https://github.com/desertfire
2023-07-06 17:01:04 +00:00
Peter Bell
59b8d5be74 [inductor] Split ops.reduction into reduction and store_reduction (#102737)
This is intended as a first step towards reductions with multiple outputs. This
also incidentally improves CSE of reductions under C++ codegen. For example,
```python
def fn(x):
    return torch.argmin(x, dim=-1), torch.argmin(x, dim=-1)
```

Currently this generates two reductions, where the common load is CSEd
```cpp
for(long i1=static_cast<long>(0L); i1<static_cast<long>(10); i1+=static_cast<long>(1L))
{
    auto tmp0 = in_ptr0[static_cast<long>(i1 + (10L*i0))];
    if (tmp_acc0.value > tmp0) {
        tmp_acc0.index = i1; tmp_acc0.value = tmp0;
    }
    if (tmp_acc1.value > tmp0) {
        tmp_acc1.index = i1; tmp_acc1.value = tmp0;
    }
}
auto tmp1 = tmp_acc0.index;
out_ptr0[static_cast<long>(i0)] = tmp1;
auto tmp2 = tmp_acc1.index;
out_ptr1[static_cast<long>(i0)] = tmp2;
```

but with this change it gets CSEd to a single accumulator

```cpp
for(long i1=static_cast<long>(0L); i1<static_cast<long>(10L); i1+=static_cast<long>(1L))
{
    auto tmp0 = in_ptr0[static_cast<long>(i1 + (10L*i0))];
    if (tmp_acc0.value > tmp0) {
        tmp_acc0.index = i1; tmp_acc0.value = tmp0;
    }
}
auto tmp1 = tmp_acc0.index;
out_ptr0[static_cast<long>(i0)] = tmp1;
out_ptr1[static_cast<long>(i0)] = tmp1;
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102737
Approved by: https://github.com/jgong5, https://github.com/lezcano
2023-07-06 16:22:19 +00:00
Peter Bell
7e098f9559 [inductor] Add single pass "var_unnormalized" reduction_type (#102486)
This is a bit inefficient because it computes the mean and throws it
away since ir.Reduction nodes only have 1 output. However, the mean
can at least be scheduled into the same loop as the variance now since
there is no data dependency. Thus we can take fewer passes over the
data.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102486
Approved by: https://github.com/lezcano, https://github.com/jansel
2023-07-06 00:00:59 +00:00
leslie-fang-intel
ea4d5c4538 [Quant][PT2E] Enable vec code gen for pair of quant/dequant (#104503)
**Summary**
We already support vectorized code gen for the `dequant-relu-quant` pattern, in which `to_uint8` is the last node of the quant pattern before storing to memory. However, there is another case, `dequant1-relu-quant2-dequant2-relu-quant3`, where `quant2` is in the middle of the fusion pattern; this PR enables vectorized code gen for `quant2-dequant2`.

**Test Plan**
```
python -u -m pytest -s -v test_cpu_repro.py  -k test_dequant_relu_quant_dequant_relu_quant_lowering
```

**Next Step**
* For better performance, we can add another pass to eliminate pairs of `float_to_uint8` and `uint8_to_float` nodes.
* For better performance, we should annotate `dequant1` and `quant2` as sharing an observer in the quantization recipe. Then we can lower `dequant1-relu-quant2` into a QReLU node to fully eliminate the calculation of `dequant1` and `quant2`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104503
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-07-05 01:59:00 +00:00
lezcano
7ae100628e Move most SymPy functions to their own file (#104556)
All these are standalone implementations of some functions and they
don't depend on anything else, so we better have them under the
`_sympy/` folder on their own

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104556
Approved by: https://github.com/ezyang
2023-07-04 03:53:48 +00:00
leslie-fang-intel
707d265db2 [Inductor][Quant]Refactor load and store vectorization code generation with uint8 data type (#104075)
**Summary**
Refactor the vectorization code generation for the uint8 input data type. Previously, we combined the uint8 data load and the uint8-to-float conversion into a single step via `load_uint8_as_float` and `store_float_as_uint8`. After the refactor, we split them into two steps, load/store and data type conversion, to make the behavior the same as for the BFloat16 data type.

The previous generated code is:
```
#pragma omp for
for(long i0=static_cast<long>(0L); i0<static_cast<long>(432L); i0+=static_cast<long>(16L))
{
    auto tmp0 = at::vec::load_uint8_as_float(in_ptr0 + static_cast<long>(i0));
    auto tmp1 = (tmp0);
    auto tmp2 = at::vec::Vectorized<float>(static_cast<float>(100.0));
    auto tmp3 = tmp1 - tmp2;
    auto tmp4 = at::vec::Vectorized<float>(static_cast<float>(0.01));
    auto tmp5 = tmp3 * tmp4;
    auto tmp6 = at::vec::clamp_min(tmp5, decltype(tmp5)(0));
    auto tmp7 = tmp6 * tmp2;
    auto tmp8 = tmp7.round();
    auto tmp9 = tmp8 + tmp2;
    auto tmp10 = at::vec::Vectorized<float>(static_cast<float>(0.0));
    auto tmp11 = at::vec::maximum(tmp9, tmp10);
    auto tmp12 = at::vec::Vectorized<float>(static_cast<float>(255.0));
    auto tmp13 = at::vec::minimum(tmp11, tmp12);
    auto tmp14 = (tmp13);
    at::vec::store_float_as_uint8(tmp14, out_ptr0 + static_cast<long>(i0));
}
```

After this PR, the generated code is:
```
#pragma omp for
for(long i0=static_cast<long>(0L); i0<static_cast<long>(432L); i0+=static_cast<long>(16L))
{
    auto tmp0 = at::vec::Vectorized<uint8_t>::loadu(in_ptr0 + static_cast<long>(i0), 16);
    auto tmp1 = cvt_uint8_to_fp32_with_same_elem_num(tmp0);
    auto tmp2 = at::vec::Vectorized<float>(static_cast<float>(100.0));
    auto tmp3 = tmp1 - tmp2;
    auto tmp4 = at::vec::Vectorized<float>(static_cast<float>(0.01));
    auto tmp5 = tmp3 * tmp4;
    auto tmp6 = at::vec::clamp_min(tmp5, decltype(tmp5)(0));
    auto tmp7 = tmp6 * tmp2;
    auto tmp8 = tmp7.round();
    auto tmp9 = tmp8 + tmp2;
    auto tmp10 = at::vec::Vectorized<float>(static_cast<float>(0.0));
    auto tmp11 = at::vec::maximum(tmp9, tmp10);
    auto tmp12 = at::vec::Vectorized<float>(static_cast<float>(255.0));
    auto tmp13 = at::vec::minimum(tmp11, tmp12);
    auto tmp14 = cvt_fp32_to_uint8(tmp13);
    tmp14.store(out_ptr0 + static_cast<long>(i0), 16);
}
```

**Test Plan**
```
python -m pytest test_cpu_repro.py -k test_decomposed_dequant_relu_quant
python -m pytest test_cpu_repro.py -k test_tile2d_load_decomposed_dequant_add_relu_quant
python -m pytest test_cpu_repro.py -k test_tile2d_store_channel_shuffle_cl_quant_output
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104075
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-07-01 23:12:43 +00:00
Brian Hirsh
624d20c3de kill inductor.config.disable_cpp_codegen in internal (#104351)
Summary:
This diff adds a path in inductor to invoke gcc through Remote Execution, when run from within fbcode.

This should (hopefully) let us kill the `inductor.disable_cpp_codegen` flag, since we should now be able to invoke clang at runtime from within fbcode to compile c++ code. This was preventing https://github.com/pytorch/pytorch/pull/100115 from landing, which fixed one of the last remaining models in torchbench that was failing with `torch.compile` (hf_Longformer).

Enumeration of changes:

- updated inductor to invoke `_run_build_command()` when in fbcode, which hooks into Remote Execution
- When inductor invokes g++ normally, it includes a bunch of absolute paths, to stuff like the pytorch header paths, and the input and output path. I changed these all to relative paths when in fbcode, and copied everything we needed into a temp dir that we send to Remote Execution.
- updated `triton/fb/make_build_paths.py` to let us grab paths to openmp, sleef, and ld from within the Remote Execution environment. I'm not sure if there's a better way to do this (but this way appeared to work, thanks to Bert's suggestion from https://www.internalfb.com/diff/D46482550?dst_version_fbid=231706286239076&transaction_fbid=229345569847706)
- refactored `triton/fb/build.py` (it had a function to create a triton build command and run it all in one go; I separated out the bit that takes in an arbitrary command (our clang command) and runs it with RE)
- a few tweaks to the include paths that inductor uses: it adds those two extra paths (sleef and openmp), and it also does not manually include the `-ltorch`,`-lc10`,`-ltorch_python`,`-ltorch_cpu` libs - the linker was complaining that it couldn't find those libs, and not including those flags ends up working
- I added a few more missing headers. Maybe with D46527002 this won't be necessary?
- I had a basic manual test in `scripts/hirsheybar/tmp2.py`. We probably want to try running an actual job in MAST to make sure this works.

Test Plan: `scripts/hirsheybar/pt2/tmp2.py` has a basic test, but I'm also planning on testing by kicking off a MAST job with cmf_10x (thanks to a bunch of help from Bert)

Reviewed By: bertmaher

Differential Revision: D46364355

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104351
Approved by: https://github.com/bertmaher
2023-06-30 13:32:16 +00:00
XiaobingSuper
a704251628 inductor: fix compile error of bfloat16 broadcast operation (#104319)
For bfloat16 broadcast, there is always a compile error:
```
error: could not convert ‘tmp2’ from ‘Vectorized<float>’ to ‘Vectorized<c10::BFloat16>
```

This PR will fix this issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104319
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-06-30 04:14:38 +00:00
leslie-fang-intel
f8ac569365 [Inductor][Quant]Fix tile2d code generation issue with uint8 data type (#104074)
**Summary**
The previous vectorized code generation for tile2d doesn't support the uint8 input data type; it still treats it as float and generates wrong results. This PR fixes this issue. Take the UT `test_tile2d_load_decomposed_dequant_add_relu_quant` in this PR as an example:
The previous generated code is:
```
#pragma GCC ivdep
for(long i1=static_cast<long>(0L); i1<static_cast<long>(192L); i1+=static_cast<long>(16L))
{
    unsigned char tmp0[16*16] __attribute__ ((aligned (16)));
    at::vec::transpose_mxn<unsigned char,16,16>(in_ptr0 + static_cast<long>(i0 + (1024L*i1)), static_cast<long>(1024L), tmp0, 16);
    unsigned char tmp7[16*16] __attribute__ ((aligned (16)));
    at::vec::transpose_mxn<unsigned char,16,16>(in_ptr1 + static_cast<long>(i0 + (1024L*i1)), static_cast<long>(1024L), tmp7, 16);
    for (long i0_inner = 0; i0_inner < 16; i0_inner++)
    {
        auto tmp1 = at::vec::Vectorized<float>::loadu(tmp0 + static_cast<long>(16L*i0_inner));
        auto tmp8 = at::vec::Vectorized<float>::loadu(tmp7 + static_cast<long>(16L*i0_inner));
        auto tmp2 = (tmp1);
        auto tmp3 = at::vec::Vectorized<float>(static_cast<float>(1.0));
        auto tmp4 = tmp2 - tmp3;
        auto tmp5 = at::vec::Vectorized<float>(static_cast<float>(0.01));
        auto tmp6 = tmp4 * tmp5;
        auto tmp9 = (tmp8);
        auto tmp10 = at::vec::Vectorized<float>(static_cast<float>(2.0));
        auto tmp11 = tmp9 - tmp10;
        auto tmp12 = at::vec::Vectorized<float>(static_cast<float>(0.02));
        auto tmp13 = tmp11 * tmp12;
        auto tmp14 = tmp6 + tmp13;
        auto tmp15 = at::vec::clamp_min(tmp14, decltype(tmp14)(0));
        auto tmp16 = at::vec::Vectorized<float>(static_cast<float>(33.333333333333336));
        auto tmp17 = tmp15 * tmp16;
        auto tmp18 = tmp17.round();
        auto tmp19 = at::vec::Vectorized<float>(static_cast<float>(3.0));
        auto tmp20 = tmp18 + tmp19;
        auto tmp21 = at::vec::Vectorized<float>(static_cast<float>(0.0));
        auto tmp22 = at::vec::maximum(tmp20, tmp21);
        auto tmp23 = at::vec::Vectorized<float>(static_cast<float>(255.0));
        auto tmp24 = at::vec::minimum(tmp22, tmp23);
        auto tmp25 = (tmp24);
        at::vec::store_float_as_uint8(tmp25, out_ptr0 + static_cast<long>(i1 + (196L*i0) + (196L*i0_inner)));
    }
}
```

After this PR, the generated code is:
```
#pragma GCC ivdep
for(long i1=static_cast<long>(0L); i1<static_cast<long>(192L); i1+=static_cast<long>(16L))
{
    unsigned char tmp0[16*16] __attribute__ ((aligned (16)));
    at::vec::transpose_mxn<unsigned char,16,16>(in_ptr0 + static_cast<long>(i0 + (1024L*i1)), static_cast<long>(1024L), tmp0, 16);
    unsigned char tmp7[16*16] __attribute__ ((aligned (16)));
    at::vec::transpose_mxn<unsigned char,16,16>(in_ptr1 + static_cast<long>(i0 + (1024L*i1)), static_cast<long>(1024L), tmp7, 16);
    for (long i0_inner = 0; i0_inner < 16; i0_inner++)
    {
        auto tmp1 = at::vec::load_uint8_as_float(tmp0 + static_cast<long>(16L*i0_inner));
        auto tmp8 = at::vec::load_uint8_as_float(tmp7 + static_cast<long>(16L*i0_inner));
        auto tmp2 = (tmp1);
        auto tmp3 = at::vec::Vectorized<float>(static_cast<float>(1.0));
        auto tmp4 = tmp2 - tmp3;
        auto tmp5 = at::vec::Vectorized<float>(static_cast<float>(0.01));
        auto tmp6 = tmp4 * tmp5;
        auto tmp9 = (tmp8);
        auto tmp10 = at::vec::Vectorized<float>(static_cast<float>(2.0));
        auto tmp11 = tmp9 - tmp10;
        auto tmp12 = at::vec::Vectorized<float>(static_cast<float>(0.02));
        auto tmp13 = tmp11 * tmp12;
        auto tmp14 = tmp6 + tmp13;
        auto tmp15 = at::vec::clamp_min(tmp14, decltype(tmp14)(0));
        auto tmp16 = at::vec::Vectorized<float>(static_cast<float>(33.333333333333336));
        auto tmp17 = tmp15 * tmp16;
        auto tmp18 = tmp17.round();
        auto tmp19 = at::vec::Vectorized<float>(static_cast<float>(3.0));
        auto tmp20 = tmp18 + tmp19;
        auto tmp21 = at::vec::Vectorized<float>(static_cast<float>(0.0));
        auto tmp22 = at::vec::maximum(tmp20, tmp21);
        auto tmp23 = at::vec::Vectorized<float>(static_cast<float>(255.0));
        auto tmp24 = at::vec::minimum(tmp22, tmp23);
        auto tmp25 = (tmp24);
        at::vec::store_float_as_uint8(tmp25, out_ptr0 + static_cast<long>(i1 + (196L*i0) + (196L*i0_inner)));
    }
}
```

**Test Plan**
```
python -m pytest test_cpu_repro.py -k test_tile2d_load_decomposed_dequant_add_relu_quant
python -m pytest test_cpu_repro.py -k test_tile2d_store_channel_shuffle_cl_quant_output
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104074
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-06-27 00:59:05 +00:00
Antoni Viros i Martin
0d653730ce Refactory bits for the codegen cache (#103452)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103452
Approved by: https://github.com/ezyang
2023-06-22 13:04:22 +00:00