pytorch/torch/_inductor/codegen
Wang, Eikan 6541e51ffd Explicit vectorization support for TorchInductor (#87068)
In this PR, we replace OMP SIMD with `aten::vec` to optimize TorchInductor vectorization performance. Take `res=torch.exp(torch.add(x, y))` as the example. The generated code is as follows if `config.cpp.simdlen` is 8.

```C++
extern "C" void kernel(const float* __restrict__ in_ptr0,
                       const float* __restrict__ in_ptr1,
                       float* __restrict__ out_ptr0,
                       const long ks0,
                       const long ks1)
{
    #pragma omp parallel num_threads(48)
    {
        #pragma omp for
        for(long i0=0; i0<((ks0*ks1) / 8); ++i0)
        {
            auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + 8*i0);
            auto tmp1 = at::vec::Vectorized<float>::loadu(in_ptr1 + 8*i0);
            auto tmp2 = tmp0 + tmp1;
            auto tmp3 = tmp2.exp();
            tmp3.store(out_ptr0 + 8*i0);
        }
        #pragma omp for simd simdlen(4)
        for(long i0=8*(((ks0*ks1) / 8)); i0<ks0*ks1; ++i0)
        {
            auto tmp0 = in_ptr0[i0];
            auto tmp1 = in_ptr1[i0];
            auto tmp2 = tmp0 + tmp1;
            auto tmp3 = std::exp(tmp2);
            out_ptr0[i0] = tmp3;
        }
    }
}

```

The major pipeline is as follows.
- Check whether the loop body could be vectorized by `aten::vec`. The checker consists of two parts. [One ](bf66991fc4/torch/_inductor/codegen/cpp.py (L702))is to check whether all the `ops` have been supported. The [other one](355326faa3/torch/_inductor/codegen/cpp.py (L672)) is to check whether the data access could be vectorized.
  - [`CppSimdVecKernelChecker`](355326faa3/torch/_inductor/codegen/cpp.py (L655))
- Create the `aten::vec` kernel and original omp simd kernel. Regarding the original omp simd kernel, it serves for the tail loop when the loop is vectorized.
  - [`CppSimdVecKernel`](355326faa3/torch/_inductor/codegen/cpp.py (L601))
  - [`CppSimdVecOverrides`](355326faa3/torch/_inductor/codegen/cpp.py (L159)): The ops that we have supported on the top of `aten::vec`
  - Create kernel
    - [`aten::vec` kernel](355326faa3/torch/_inductor/codegen/cpp.py (L924))
    - [`Original CPP kernel - OMP SIMD`](355326faa3/torch/_inductor/codegen/cpp.py (L929))
- Generate code
  - [`CppKernelProxy`](355326faa3/torch/_inductor/codegen/cpp.py (L753)) is used to combine the `aten::vec` kernel and original cpp kernel
    - [Vectorize the most inner loop](355326faa3/torch/_inductor/codegen/cpp.py (L753))
    - [Generate code](355326faa3/torch/_inductor/codegen/cpp.py (L821))

Next steps:
- [x] Support reduction
- [x] Vectorize the tail loop with `aten::vec`
- [ ] Support BF16
- [ ] Optimize the loop condition and loop index calculation by replacing `div` with `add`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87068
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-11-07 06:24:14 +00:00
..
__init__.py
autotuner.py
common.py Unified symbolic shape variables between Inductor and AOTDispatcher (#87161) 2022-10-19 04:50:34 +00:00
cpp_prefix.h Explicit vectorization support for TorchInductor (#87068) 2022-11-07 06:24:14 +00:00
cpp.py Explicit vectorization support for TorchInductor (#87068) 2022-11-07 06:24:14 +00:00
triton_conv_delta_x_hwc.j2
triton_conv_delta_x.j2
triton_mm.j2
triton_template.py
triton.py reduce the number of autotuning iterations, don't autotune simple til… (#88386) 2022-11-03 15:58:18 +00:00
wrapper.py [inductor] Move size asserts to C++, fix bug (#87028) 2022-10-16 20:17:22 +00:00