pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Wang, Eikan 6541e51ffd Explicit vectorization support for TorchInductor (#87068)

In this PR, we replace OMP SIMD with `aten::vec` to optimize TorchInductor vectorization performance. Take `res=torch.exp(torch.add(x, y))` as the example. The generated code is as follows if `config.cpp.simdlen` is 8.

```C++
extern "C" void kernel(const float* __restrict__ in_ptr0,
                       const float* __restrict__ in_ptr1,
                       float* __restrict__ out_ptr0,
                       const long ks0,
                       const long ks1)
{
    #pragma omp parallel num_threads(48)
    {
        #pragma omp for
        for(long i0=0; i0<((ks0*ks1) / 8); ++i0)
        {
            auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + 8*i0);
            auto tmp1 = at::vec::Vectorized<float>::loadu(in_ptr1 + 8*i0);
            auto tmp2 = tmp0 + tmp1;
            auto tmp3 = tmp2.exp();
            tmp3.store(out_ptr0 + 8*i0);
        }
        #pragma omp for simd simdlen(4)
        for(long i0=8*(((ks0*ks1) / 8)); i0<ks0*ks1; ++i0)
        {
            auto tmp0 = in_ptr0[i0];
            auto tmp1 = in_ptr1[i0];
            auto tmp2 = tmp0 + tmp1;
            auto tmp3 = std::exp(tmp2);
            out_ptr0[i0] = tmp3;
        }
    }
}
```

The major pipeline is as follows.

- Check whether the loop body can be vectorized by `aten::vec`. The checker consists of two parts. [One](`bf66991fc4/torch/_inductor/codegen/cpp.py (L702)`) checks whether all the `ops` are supported. The [other one](`355326faa3/torch/_inductor/codegen/cpp.py (L672)`) checks whether the data access can be vectorized.
  - [`CppSimdVecKernelChecker`](`355326faa3/torch/_inductor/codegen/cpp.py (L655)`)
- Create the `aten::vec` kernel and the original OMP SIMD kernel. The original OMP SIMD kernel serves the tail loop when the loop is vectorized.
  - [`CppSimdVecKernel`](`355326faa3/torch/_inductor/codegen/cpp.py (L601)`)
  - [`CppSimdVecOverrides`](`355326faa3/torch/_inductor/codegen/cpp.py (L159)`): the ops that we support on top of `aten::vec`
  - Create kernel
    - [`aten::vec` kernel](`355326faa3/torch/_inductor/codegen/cpp.py (L924)`)
    - [`Original CPP kernel - OMP SIMD`](`355326faa3/torch/_inductor/codegen/cpp.py (L929)`)
- Generate code
  - [`CppKernelProxy`](`355326faa3/torch/_inductor/codegen/cpp.py (L753)`) is used to combine the `aten::vec` kernel and the original CPP kernel
    - [Vectorize the innermost loop](`355326faa3/torch/_inductor/codegen/cpp.py (L753)`)
    - [Generate code](`355326faa3/torch/_inductor/codegen/cpp.py (L821)`)

Next steps:

- [x] Support reduction
- [x] Vectorize the tail loop with `aten::vec`
- [ ] Support BF16
- [ ] Optimize the loop condition and loop index calculation by replacing `div` with `add`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87068
Approved by: https://github.com/jgong5, https://github.com/jansel

2022-11-07 06:24:14 +00:00
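The split between a main loop over full 8-element blocks and a scalar tail loop, as in the generated kernel in the commit message, can be sketched without ATen or OpenMP. This is only an illustration under stated assumptions: `kLanes`, `add_exp_blocked`, and `tail_count` are hypothetical names that do not appear in the PR, and a plain inner loop stands in for the `at::vec::Vectorized<float>` load, add, `exp`, and store so that the snippet compiles on its own.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical stand-in for config.cpp.simdlen == 8 (not a name from the PR).
constexpr long kLanes = 8;

// Computes out[i] = exp(a[i] + b[i]) for i in [0, n).
// The main loop walks full blocks of kLanes elements; the tail loop
// handles the remaining n % kLanes elements one at a time.
void add_exp_blocked(const float* a, const float* b, float* out, long n) {
    const long blocks = n / kLanes;

    // Main loop: one iteration per full block. In the generated kernel this
    // body is a vector load, add, exp, and store; here an inner scalar loop
    // over the lanes plays that role.
    for (long i0 = 0; i0 < blocks; ++i0) {
        for (long lane = 0; lane < kLanes; ++lane) {
            const long idx = kLanes * i0 + lane;
            out[idx] = std::exp(a[idx] + b[idx]);
        }
    }

    // Tail loop: starts at the first element not covered by a full block.
    for (long i0 = kLanes * blocks; i0 < n; ++i0) {
        out[i0] = std::exp(a[i0] + b[i0]);
    }
}

// Number of elements left for the tail loop.
long tail_count(long n) {
    return n - kLanes * (n / kLanes);
}
```

With `n = 19`, for instance, the main loop covers indices 0 to 15 in two blocks and the tail loop covers the last three. The start index `kLanes * (n / kLanes)` mirrors the `8*(((ks0*ks1) / 8))` expression in the generated code.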
..
__init__.py
autotuner.py
common.py	Unified symbolic shape variables between Inductor and AOTDispatcher (#87161)	2022-10-19 04:50:34 +00:00
cpp_prefix.h	Explicit vectorization support for TorchInductor (#87068)	2022-11-07 06:24:14 +00:00
cpp.py	Explicit vectorization support for TorchInductor (#87068)	2022-11-07 06:24:14 +00:00
triton_conv_delta_x_hwc.j2
triton_conv_delta_x.j2
triton_mm.j2
triton_template.py
triton.py	reduce the number of autotuning iterations, don't autotune simple til… (#88386)	2022-11-03 15:58:18 +00:00
wrapper.py	[inductor] Move size asserts to C++, fix bug (#87028)	2022-10-16 20:17:22 +00:00