pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
lezcano	2b6249e209	Wrap indirect indexing on CUDA (#105055 ) Lifting this to CPU should be rather easy. @jgong5 Partially fixes https://github.com/pytorch/pytorch/issues/97365. I'd wait to close that issue once this works on CPU as well. This fix works with dynamic shapes as well. @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @ngimel @yf225 @chenyang78 @kadeng @muchulee8 Pull Request resolved: https://github.com/pytorch/pytorch/pull/105055 Approved by: https://github.com/peterbell10, https://github.com/jansel	2023-08-23 11:59:20 +00:00
XiaobingSuper	610f64d72a	inductor: also check index_exp when select tiling var (#106765 ) When selecting the tiling var, we currently consider only loads and stores and ignore index expressions, which leads to accuracy issues. Before (the index expression ```i1-1``` cannot be vectorized): ``` cpp_fused_constant_pad_nd_mul_0 = async_compile.cpp(''' #include "/tmp/torchinductor_xiaobing/i5/ci5uspp363v3ky6jkccllm3bxudy2fkdpqinkqhmpehfihejs7ko.h" extern "C" void kernel(const float* in_ptr0, const float* in_ptr1, float* out_ptr0) { #pragma omp parallel num_threads(40) { { #pragma omp for for(long i0=static_cast<long>(0L); i0<static_cast<long>(64L); i0+=static_cast<long>(1L)) { #pragma GCC ivdep for(long i1=static_cast<long>(0L); i1<static_cast<long>(3136L); i1+=static_cast<long>(16L)) { #pragma GCC ivdep for(long i2=static_cast<long>(0L); i2<static_cast<long>(8L); i2+=static_cast<long>(1L)) { auto tmp0 = at::vec::Vectorized<int>(static_cast<int>((-1L) + i1)); auto tmp1 = at::vec::Vectorized<int>(static_cast<int>(0)); auto tmp2 = to_float_mask(tmp0 >= tmp1); auto tmp3 = [&] { auto tmp4 = ([&]() { __at_align__ float tmpbuf[16]; for (long i1_inner = 0; i1_inner < 16; i1_inner++) tmpbuf[i1_inner] = in_ptr0[static_cast<long>((-8L) + i2 + (8L*i1) + (8L*i1_inner) + (25088L*i0))]; return at::vec::Vectorized<float>::loadu(tmpbuf); })(); auto tmp5 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>((-1L) + i1 + (3136L*i2) + (25088L*i0))); auto tmp6 = tmp4 * tmp5; return tmp6; } ; auto tmp7 = decltype(tmp3())::blendv(at::vec::Vectorized<float>(0.0), tmp3(), to_float_mask(tmp2)); { __at_align__ float tmpbuf[16*sizeof(float)/sizeof(float)]; tmp7.store(tmpbuf); for (long i1_inner = 0; i1_inner < 16; i1_inner++) out_ptr0[static_cast<long>(i2 + (8L*i1) + (8L*i1_inner) + (25096L*i0))] = tmpbuf[i1_inner]; } } } #pragma GCC ivdep for(long i1=static_cast<long>(3136L); i1<static_cast<long>(3137L); i1+=static_cast<long>(1L)) { #pragma GCC ivdep for(long i2=static_cast<long>(0L); i2<static_cast<long>(8L); 
i2+=static_cast<long>(1L)) { auto tmp0 = static_cast<long>((-1L) + i1); auto tmp1 = static_cast<long>(0); auto tmp2 = tmp0 >= tmp1; auto tmp3 = [&] { auto tmp4 = in_ptr0[static_cast<long>((-8L) + i2 + (8L*i1) + (25088L*i0))]; auto tmp5 = in_ptr1[static_cast<long>((-1L) + i1 + (3136L*i2) + (25088L*i0))]; auto tmp6 = decltype(tmp4)(tmp4 * tmp5); return tmp6; } ; auto tmp7 = tmp2 ? tmp3() : static_cast<decltype(tmp3())>(0.0); out_ptr0[static_cast<long>(i2 + (8L*i1) + (25096L*i0))] = tmp7; } } } } } } ``` After: ``` cpp_fused_constant_pad_nd_mul_0 = async_compile.cpp(''' #include "/tmp/torchinductor_xiaobing/i5/ci5uspp363v3ky6jkccllm3bxudy2fkdpqinkqhmpehfihejs7ko.h" extern "C" void kernel(const float* in_ptr0, const float* in_ptr1, float* out_ptr0) { #pragma omp parallel num_threads(40) { { #pragma omp for for(long i0=static_cast<long>(0L); i0<static_cast<long>(64L); i0+=static_cast<long>(1L)) { #pragma GCC ivdep for(long i1=static_cast<long>(0L); i1<static_cast<long>(3137L); i1+=static_cast<long>(1L)) { #pragma omp simd simdlen(8) for(long i2=static_cast<long>(0L); i2<static_cast<long>(8L); i2+=static_cast<long>(1L)) { auto tmp0 = static_cast<long>((-1L) + i1); auto tmp1 = static_cast<long>(0); auto tmp2 = tmp0 >= tmp1; auto tmp3 = [&] { auto tmp4 = in_ptr0[static_cast<long>((-8L) + i2 + (8L*i1) + (25088L*i0))]; auto tmp5 = in_ptr1[static_cast<long>((-1L) + i1 + (3136L*i2) + (25088L*i0))]; auto tmp6 = decltype(tmp4)(tmp4 * tmp5); return tmp6; } ; auto tmp7 = tmp2 ? tmp3() : static_cast<decltype(tmp3())>(0.0); out_ptr0[static_cast<long>(i2 + (8L*i1) + (25096L*i0))] = tmp7; } } } } } } ''') ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/106765 Approved by: https://github.com/jgong5, https://github.com/jansel	2023-08-23 07:16:14 +00:00
PyTorch MergeBot	b282787409	Revert "Wrap indirect indexing on CUDA (#105055 )" This reverts commit `85c673e6b2`. Reverted https://github.com/pytorch/pytorch/pull/105055 on behalf of https://github.com/peterbell10 due to Causes failure in inductor_torchbench ([comment](https://github.com/pytorch/pytorch/pull/105055#issuecomment-1688871947))	2023-08-22 20:24:41 +00:00
lezcano	85c673e6b2	Wrap indirect indexing on CUDA (#105055 ) Lifting this to CPU should be rather easy. @jgong5 Partially fixes https://github.com/pytorch/pytorch/issues/97365. I'd wait to close that issue once this works on CPU as well. This fix works with dynamic shapes as well. @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @ngimel @yf225 @chenyang78 @kadeng @muchulee8 Pull Request resolved: https://github.com/pytorch/pytorch/pull/105055 Approved by: https://github.com/peterbell10, https://github.com/jansel	2023-08-22 01:06:35 +00:00
Peter Bell	18b1c2907d	[inductor] Add ir.WelfordReduction with multiple outputs (#104725 ) This replaces `var_unnormalized` reduction type with `welford_reduce` which takes the input data and outputs not just the variance, but also the mean and weights which account for the full welford accumulator state. Thus we can avoid re-computing the mean, and we now have enough information to create a multilayer reduction which I implement here by adding a second reduction type called `welford_combine` which reduces over all three inputs simultaneously. Multi-layer support is particularly important as normalization operators like BatchNorm are being split in many timm models, which meant `var_unnormalized` had to fall back to two-pass variance calculation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/104725 Approved by: https://github.com/lezcano	2023-08-18 08:18:01 +00:00
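Note on #104725: the idea of carrying the full Welford accumulator state (mean, sum of squared deviations, weight) so that partial results can be merged can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Inductor's implementation: the function names mirror the reduction types named in the commit message, but the signatures, the tuple layout, and the name `m2` are hypothetical.

```python
from typing import Iterable, Tuple

# Hypothetical accumulator layout: (mean, m2, weight), where m2 is the sum of
# squared deviations from the mean and weight is the number of elements seen.
Welford = Tuple[float, float, float]


def welford_reduce(values: Iterable[float]) -> Welford:
    """Single pass over the data, updating the running state per element."""
    mean, m2, weight = 0.0, 0.0, 0.0
    for x in values:
        weight += 1.0
        delta = x - mean
        mean += delta / weight
        m2 += delta * (x - mean)
    return mean, m2, weight


def welford_combine(a: Welford, b: Welford) -> Welford:
    """Merge two partial states using the pairwise (parallel) update."""
    mean_a, m2_a, w_a = a
    mean_b, m2_b, w_b = b
    weight = w_a + w_b
    if weight == 0.0:
        return 0.0, 0.0, 0.0
    delta = mean_b - mean_a
    mean = mean_a + delta * (w_b / weight)
    m2 = m2_a + m2_b + delta * delta * w_a * w_b / weight
    return mean, m2, weight


data = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
whole = welford_reduce(data)
# Reduce two chunks independently, then merge them (a two-layer reduction).
merged = welford_combine(welford_reduce(data[:2]), welford_reduce(data[2:])
                         )
```

Because the merged state matches the single-pass state, a split reduction does not need a second pass over the data to recompute the mean; the unnormalized variance is the `m2` entry.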
Wang, Eikan	9921b48558	Extend Inductor to support the third-party backend (#106874 ) ## Summary This is re-land PR for https://github.com/pytorch/pytorch/pull/100706 to address the compilation latency performance regression. ## Root Cause Regarding the C++/OpenMP backend, `codecache.pick_vec_isa()` to check vectorization ISA is a time-consuming and one-shot operation. It leads to taking a longer time to import `codegen.cpp` package because the `LoopLevel` of the package is decorated by `@dataclasses.dataclass` while the decorator will invoke `codecache.pick_vec_isa()` to initialize the `simd_nelements` of the `LoopLevel`. `c14cf312c9/torch/_inductor/codegen/cpp.py (L2883C53-L2883C53)` In terms of the Triton backend, it does not need to touch it. But we'd prefer to uniform the code. Therefore, the new design simultaneously registers `CpuScheduling` for CPU and `TritonScheduling` for Triton regardless of whether the current backend is Triton. It will bring additional overhead to the Triton backend. ```python def init_backend_registration(self): if get_scheduling_for_device("cpu") is None: from .codegen.cpp import CppScheduling register_backend_for_device("cpu", CppScheduling, WrapperCodeGen) if get_scheduling_for_device("cuda") is None: from .codegen.triton import TritonScheduling register_backend_for_device("cuda", TritonScheduling, WrapperCodeGen) ``` ## Solution To resolve the compilation latency regression for the Triton backend, we changed the `LoopLevel` a little bit([new code changes](https://github.com/pytorch/pytorch/pull/106874/files#diff-5ab7b0235e2076a5fc6629ba0b109208940f5b94f5c13babc3e0f87cf4fcec82R2893-R2904)) by moving the `simd_nelements` to `__post_init__` and the compilation performance would be back. 
## Compilation Latency Performance Result We ran a single model benchmark and reproduced the compilation regression: - Run `python benchmarks/dynamo/torchbench.py -dcuda --training --performance --inductor --only hf_Bart` - W/ PR #100706, the compilation latency is about 57~58 ``` dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks cuda,hf_Bart,4,1.556712,109.676554,57.055242,0.936330,5.760698,6.152422,642,1,8,7 cuda,hf_Bart,4,1.646658,109.621747,57.909817,0.936330,5.760698,6.152422,642,1,8,7 ``` - W/O PR #100706, the compilation latency is about 46~47 ``` dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks cuda,hf_Bart,4,1.599065,108.702480,47.490346,0.936330,5.760698,6.152422,642,1,8,7 cuda,hf_Bart,4,1.588419,108.431411,46.983041,0.936330,5.760698,6.152422,642,1,8,7 ``` This PR fixed the compilation performance regression. - W/ this PR #106874, the compilation latency is about 47~48 ``` dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks cuda,hf_Bart,4,1.586261,108.149467,47.481058,0.936330,5.760698,6.152422,642,1,8,7 cuda,hf_Bart,4,1.758915,108.613899,47.925633,0.936330,5.760698,6.152422,642,1,8,7 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/106874 Approved by: https://github.com/jansel	2023-08-16 04:11:36 +00:00
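Note on #106874: the difference between computing a dataclass field default when the class is defined and computing it in `__post_init__` can be sketched as follows. This is an illustrative stand-alone example, not the Inductor code: `pick_vec_width`, `EagerLoopLevel`, `LazyLoopLevel`, and the `CALLS` list are hypothetical names, and the probe is a stand-in for an expensive one-shot check such as the ISA detection described above.

```python
import dataclasses

CALLS = []  # records each time the (stand-in) expensive probe runs


def pick_vec_width() -> int:
    """Hypothetical stand-in for an expensive, one-shot ISA probe."""
    CALLS.append("probe")
    return 16


@dataclasses.dataclass
class EagerLoopLevel:
    # The default expression is evaluated once, while the class body runs,
    # i.e. as soon as the defining module is imported.
    simd_nelements: int = pick_vec_width()


calls_after_eager_definition = len(CALLS)


@dataclasses.dataclass
class LazyLoopLevel:
    simd_nelements: int = dataclasses.field(init=False, default=0)

    def __post_init__(self):
        # Deferred: the probe runs only when an instance is created.
        self.simd_nelements = pick_vec_width()


calls_after_lazy_definition = len(CALLS)
level = LazyLoopLevel()
calls_after_lazy_instance = len(CALLS)
```

Defining `LazyLoopLevel` does not trigger the probe, so a code path that imports the module without ever instantiating the class pays no cost for it.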
Yanbo Liang	1819fe1324	Revert "Extend Inductor to support the third-party backend (#100706 )" (#106652 ) This reverts commit `05bd24bb35`. It caused compilation time regression on torchbench, huggingface and dynamic models. Pull Request resolved: https://github.com/pytorch/pytorch/pull/106652 Approved by: https://github.com/davidberard98, https://github.com/voznesenskym	2023-08-05 06:41:08 +00:00
haozhe.zhu	60237ccbdf	fix bf16 constant accuracy (#105827 ) This PR aims to sort out the data type for `constant`. The constant should be promoted to float (see https://github.com/pytorch/pytorch/pull/105440), so several changes are needed: - Data type propagation should propagate a constant node to `float` dtype if its original dtype is `bfloat16`. - We do not need to insert `to_dtype` after the `constant` node; directly initializing an `fp32` constant is faster. ``` vectorized<bfloat16> tmp(value); vectorized <float> tmp1 = cvt_bf16_fp32(tmp); -> vectorized<float> tmp(value); ``` - Move `constant` out of the list of `all operations can support bf16 without converting to fp32`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/105827 Approved by: https://github.com/jgong5, https://github.com/jansel	2023-08-03 01:17:50 +00:00
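Note on #105827: why initializing the constant directly as `fp32` helps accuracy can be seen by rounding a value to bfloat16 and widening it back. The sketch below emulates the rounding with bit manipulation in plain Python; `round_to_bf16` and the other helper names are hypothetical, NaN handling is deliberately ignored, and none of this is the generated C++ code.

```python
import struct


def f32_bits(x: float) -> int:
    """Bit pattern of x after rounding it to float32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_f32(bits: int) -> float:
    """Value of a 32-bit float32 bit pattern."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def round_to_bf16(x: float) -> float:
    """Round to bfloat16 (round-to-nearest-even), then widen back.

    Simplified sketch: NaN inputs are not handled specially.
    """
    bits = f32_bits(x)
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return bits_f32((bits + rounding_bias) & 0xFFFF0000)


constant = 0.01
as_fp32 = bits_f32(f32_bits(constant))  # constant created directly as fp32
as_bf16 = round_to_bf16(constant)       # constant created as bf16, then converted

error_fp32 = abs(as_fp32 - constant)
error_bf16 = abs(as_bf16 - constant)
```

The value that went through bfloat16 keeps only 8 significant bits, so its error is several orders of magnitude larger than that of the constant created directly in `fp32`.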
Wang, Eikan	05bd24bb35	Extend Inductor to support the third-party backend (#100706 ) This PR intends to extend Inductor to support the third-party backend that only focuses on the code generation just like what C++/OpenMP and Triton backend have done. Currently, the generated code by Inductor contains two major parts. One is the kernel, and the other is the Python wrapper to glue the kernel. Therefore, the third-party backend needs to customize the two parts to generate its specific code. - Python wrapper code generation Inductor provides a `WrapperCodeGen` class to generate the Python wrapper code to glue the kernel. Therefore, it is straightforward for the third-party backend to generate the backend-specific Python wrapper code. It just needs to inherit the `WrapperCodeGen` class and purposely override the particular member functions. - Kernel code generation It is driven by different `Scheduling`. Hence, the third-party backend needs to provide a custom `Scheduling` for its specific kernel code generation. Currently, `CppScheduling` and `TritonScheduling` are for C++/OpenMP and Triton backend, respectively. But there is no common `Scheduling` class. Based on the scheduling invocation, this PR abstracts a common `Scheduling` class containing the following member functions. - [group_fn](`71c4becda7/torch/_inductor/scheduler.py (LL649C64-L649C64)`) - [flush](`71c4becda7/torch/_inductor/scheduler.py (L1150)`) - [can_fuse_vertical](`71c4becda7/torch/_inductor/scheduler.py (L1006)`) - [can_fuse_horizontal](`71c4becda7/torch/_inductor/scheduler.py (LL1008C45-L1008C64)`) - [codegen_template](`71c4becda7/torch/_inductor/scheduler.py (L1234)`) _This function is only available for triton. If the third-party backend behaves as a sub-class of `TritonScheduling`, it can override it or reuse it._ - [codegen_nodes](`71c4becda7/torch/_inductor/scheduler.py (L1234)`) - [codegen_sync](`71c4becda7/torch/_inductor/scheduler.py (LL1251C1-L1251C1)`). 
_This function is only available for triton debug purpose. But it might also be useful for other computation devices. Therefore, we'd prefer to keep this function._ The third-party backend needs to inherit from the `Scheduling` class and implement these functions. Regarding some other classes like `CppKernel` and `TritonKernel` for code generation, they are used by or part of the logic of either `Scheduling` or `WrapperCodeGen`. Hence, this PR does not define the interface and leaves the flexibility to the third-party backend. The third-party backend can decide to implement these classes from scratch or reuse them by inheriting and overriding them. Pull Request resolved: https://github.com/pytorch/pytorch/pull/100706 Approved by: https://github.com/jansel	2023-08-02 05:13:51 +00:00
haozhe.zhu	952021934f	inductor: legalize fp16 (#100857 ) This PR aims to vectorize FP16 for CPU with what BF16 has done. Pull Request resolved: https://github.com/pytorch/pytorch/pull/100857 Approved by: https://github.com/jgong5, https://github.com/jansel	2023-07-27 02:31:40 +00:00
PyTorch MergeBot	dfc9874740	Revert "inductor: promote half/bfloat16 constant to float for cpu vectorization path (#105440 )" This reverts commit `18bcf62bbc`. Reverted https://github.com/pytorch/pytorch/pull/105440 on behalf of https://github.com/XiaobingSuper due to introduce core dumped when init bfloat16 zero tensor ([comment](https://github.com/pytorch/pytorch/pull/105440#issuecomment-1643079005))	2023-07-20 03:56:44 +00:00
Justin Chu	cb7a30f656	[BE] Enable ruff's UP rules and autoformat inductor/ (#105431 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/105431 Approved by: https://github.com/albanD	2023-07-19 13:45:00 +00:00
XiaobingSuper	18bcf62bbc	inductor: promote half/bfloat16 constant to float for cpu vectorization path (#105440 ) As in the scalar path, we should also promote half/bfloat16 constants to float for better accuracy. After this PR, the TIMM ```dm_nfnet``` model passes on the amp path. Pull Request resolved: https://github.com/pytorch/pytorch/pull/105440 Approved by: https://github.com/jgong5, https://github.com/jansel	2023-07-19 06:53:23 +00:00
XiaobingSuper	4b3c261a2e	inductor: fix issue of vectorization when the store's index is constant value (#105314 ) Fix #104515 Pull Request resolved: https://github.com/pytorch/pytorch/pull/105314 Approved by: https://github.com/jgong5, https://github.com/desertfire	2023-07-18 04:54:25 +00:00
lezcano	87a3ed58cb	Fix ranges for range vars (#104987 ) Ranges are inclusive on both ends... We take this chance to delete a stale comment Pull Request resolved: https://github.com/pytorch/pytorch/pull/104987 Approved by: https://github.com/jgong5, https://github.com/eellison	2023-07-14 13:43:05 +00:00
Peter Bell	66fb83293e	[inductor] Add min/max to index propagation pass (#105020 ) This allows `ops.minimum` and `ops.maximum` to be hoisted for indirect indexing into direct indexing expressions. I also add support to the cpp printer for Min/Max and fix the triton printer to support multi-argument Min/Max. Pull Request resolved: https://github.com/pytorch/pytorch/pull/105020 Approved by: https://github.com/lezcano	2023-07-12 19:03:01 +00:00
Peter Bell	e80787c8e1	[inductor] Split ops.reduction into reduction and store_reduction (#102737 ) This is intended as a first step towards reductions with multiple outputs. This also incidentally improves CSE of reductions under C++ codegen. For example, ```python def fn(x): return torch.argmin(x, dim=-1), torch.argmin(x, dim=-1) ``` Currently this generates two reductions, where the common load is CSEd ```cpp for(long i1=static_cast<long>(0L); i1<static_cast<long>(10); i1+=static_cast<long>(1L)) { auto tmp0 = in_ptr0[static_cast<long>(i1 + (10L*i0))]; if (tmp_acc0.value > tmp0) { tmp_acc0.index = i1; tmp_acc0.value = tmp0; } if (tmp_acc1.value > tmp0) { tmp_acc1.index = i1; tmp_acc1.value = tmp0; } } auto tmp1 = tmp_acc0.index; out_ptr0[static_cast<long>(i0)] = tmp1; auto tmp2 = tmp_acc1.index; out_ptr1[static_cast<long>(i0)] = tmp2; ``` but with this change it gets CSEd to a single accumulator ```cpp for(long i1=static_cast<long>(0L); i1<static_cast<long>(10L); i1+=static_cast<long>(1L)) { auto tmp0 = in_ptr0[static_cast<long>(i1 + (10L*i0))]; if (tmp_acc0.value > tmp0) { tmp_acc0.index = i1; tmp_acc0.value = tmp0; } } auto tmp1 = tmp_acc0.index; out_ptr0[static_cast<long>(i0)] = tmp1; out_ptr1[static_cast<long>(i0)] = tmp1; ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/102737 Approved by: https://github.com/jgong5, https://github.com/lezcano	2023-07-08 20:48:29 +00:00
Peter Bell	0ceca92f80	[inductor] Add single pass "var_unnormalized" reduction_type (#102486 ) This is a bit inefficient because it computes the mean and throws it away since ir.Reduction nodes only have 1 output. However, the mean can at least be scheduled into the same loop as the variance now since there is no data dependency. Thus we can take fewer passes over the data. Pull Request resolved: https://github.com/pytorch/pytorch/pull/102486 Approved by: https://github.com/lezcano, https://github.com/jansel	2023-07-08 20:48:29 +00:00
lezcano	710abc41cc	Implement bound_sympy (#104559 ) The analysis for SymPy expressions was incorrect: even though it said that the assumption was "smoothness", the assumption was, in fact, that the formula was monotone in every variable. In other words, it was assuming that the derivative does not change sign in any variable (!!). We implement a function that, given bounds on the values of the free symbols of a sympy expression, gives a bound on the expression itself. We reshuffle a few things in value_ranges.py to create a `SymPyValueRangeAnalysis` class, but we do not change any code really. The only relevant change in that file is the addition of the `sympy_bound`s function. We do this because we don't want to inadvertently use any fallbacks in this case. Pull Request resolved: https://github.com/pytorch/pytorch/pull/104559 Approved by: https://github.com/eellison	2023-07-07 23:52:14 +00:00
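Note on #104559: the failure mode of assuming monotonicity can be reproduced with a toy interval type. The sketch below is not the PyTorch value-range code; `Interval`, `endpoint_only_bound`, and `square` are hypothetical helpers, and the bounds are treated as inclusive on both ends.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Interval:
    lower: float  # inclusive
    upper: float  # inclusive


def endpoint_only_bound(f: Callable[[float], float], x: Interval) -> Interval:
    """Bound f over x by evaluating only the endpoints.

    This is only valid when f is monotone on x.
    """
    a, b = f(x.lower), f(x.upper)
    return Interval(min(a, b), max(a, b))


def square(x: Interval) -> Interval:
    """Sound bound for x**2: the minimum is 0 whenever the interval contains 0."""
    candidates = [x.lower ** 2, x.upper ** 2]
    lower = 0.0 if x.lower <= 0.0 <= x.upper else min(candidates)
    return Interval(lower, max(candidates))


x = Interval(-1.0, 2.0)
naive = endpoint_only_bound(lambda v: v * v, x)  # misses the minimum at v = 0
sound = square(x)
```

Here the endpoint-only bound claims the expression is at least 1, although it reaches 0 inside the interval; a sound analysis has to propagate bounds operation by operation instead.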
PyTorch MergeBot	8ca63ff9a8	Revert "[inductor] Add single pass "var_unnormalized" reduction_type (#102486 )" This reverts commit `7e098f9559`. Reverted https://github.com/pytorch/pytorch/pull/102486 on behalf of https://github.com/clee2000 due to sorry but this seems to have broken inductor/test_torchinductor.py::CpuTests::test_std_cpu on mac x86 64 machines `7e098f9559` https://github.com/pytorch/pytorch/actions/runs/5479008241/jobs/9981443710 ([comment](https://github.com/pytorch/pytorch/pull/102486#issuecomment-1624739465))	2023-07-07 04:57:20 +00:00
PyTorch MergeBot	1280b19827	Revert "[inductor] Split ops.reduction into reduction and store_reduction (#102737 )" This reverts commit `59b8d5be74`. Reverted https://github.com/pytorch/pytorch/pull/102737 on behalf of https://github.com/clee2000 due to sorry but i need to revert this to revert the other one in the stack ([comment](https://github.com/pytorch/pytorch/pull/102737#issuecomment-1624735108))	2023-07-07 04:53:14 +00:00
Brian Hirsh	2efe4d809f	[hotfix inductor test] disable cpp vectorization codegen in fbcode for inductor (#104560 ) Summary: After D46364355 landed, a few inductor internal tests started failing. When I ran this locally: ``` buck2 test fbcode//mode/dev-nosan fbcode//caffe2/test/inductor:config ``` The test appeared to hang with this output, until it would fail with a timeout after 10 minutes passed: ``` Test caffe2/test/inductor:config -- discovering tests [local_execute] ``` Eventually, I realized that inductor has a value `HAS_CPU` (https://www.internalfb.com/code/fbsource/[6cc47fa5eb77a93d91a519d3eb3df67ceddb8faa]/fbcode/caffe2/torch/testing/_internal/inductor_utils.py?lines=23) that is implemented lazily. Part of that implementation involves inspecting `/proc/cpuinfo` to figure out what vectorized instructions are available, and that call appeared to hang (https://www.internalfb.com/code/fbsource/[6cc47fa5eb77a93d91a519d3eb3df67ceddb8faa]/fbcode/caffe2/torch/_inductor/codecache.py?lines=568). Since vectorized codegen for inductor cpu internally already isn't working, I hardcoded that test to fail for now in fbcode. Test Plan: Confirmed that this passes: `buck2 test fbcode//mode/dev-nosan fbcode//caffe2/test/inductor:config` Differential Revision: D47199912 Pull Request resolved: https://github.com/pytorch/pytorch/pull/104560 Approved by: https://github.com/desertfire, https://github.com/bertmaher	2023-07-06 19:00:13 +00:00
XiaobingSuper	c4cf90aad1	inductor: fix assert error when load a bfloat16 inf constant (#104614 ) Fix ```nanogpt_generate``` bfloat16 path error. Pull Request resolved: https://github.com/pytorch/pytorch/pull/104614 Approved by: https://github.com/jgong5, https://github.com/desertfire	2023-07-06 17:01:04 +00:00
Peter Bell	59b8d5be74	[inductor] Split ops.reduction into reduction and store_reduction (#102737 ) This is intended as a first step towards reductions with multiple outputs. This also incidentally improves CSE of reductions under C++ codegen. For example, ```python def fn(x): return torch.argmin(x, dim=-1), torch.argmin(x, dim=-1) ``` Currently this generates two reductions, where the common load is CSEd ```cpp for(long i1=static_cast<long>(0L); i1<static_cast<long>(10); i1+=static_cast<long>(1L)) { auto tmp0 = in_ptr0[static_cast<long>(i1 + (10L*i0))]; if (tmp_acc0.value > tmp0) { tmp_acc0.index = i1; tmp_acc0.value = tmp0; } if (tmp_acc1.value > tmp0) { tmp_acc1.index = i1; tmp_acc1.value = tmp0; } } auto tmp1 = tmp_acc0.index; out_ptr0[static_cast<long>(i0)] = tmp1; auto tmp2 = tmp_acc1.index; out_ptr1[static_cast<long>(i0)] = tmp2; ``` but with this change it gets CSEd to a single accumulator ```cpp for(long i1=static_cast<long>(0L); i1<static_cast<long>(10L); i1+=static_cast<long>(1L)) { auto tmp0 = in_ptr0[static_cast<long>(i1 + (10L*i0))]; if (tmp_acc0.value > tmp0) { tmp_acc0.index = i1; tmp_acc0.value = tmp0; } } auto tmp1 = tmp_acc0.index; out_ptr0[static_cast<long>(i0)] = tmp1; out_ptr1[static_cast<long>(i0)] = tmp1; ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/102737 Approved by: https://github.com/jgong5, https://github.com/lezcano	2023-07-06 16:22:19 +00:00
Peter Bell	7e098f9559	[inductor] Add single pass "var_unnormalized" reduction_type (#102486 ) This is a bit inefficient because it computes the mean and throws it away since ir.Reduction nodes only have 1 output. However, the mean can at least be scheduled into the same loop as the variance now since there is no data dependency. Thus we can take fewer passes over the data. Pull Request resolved: https://github.com/pytorch/pytorch/pull/102486 Approved by: https://github.com/lezcano, https://github.com/jansel	2023-07-06 00:00:59 +00:00
leslie-fang-intel	ea4d5c4538	[Quant][PT2E] Enable vec code gen for pair of quant/dequant (#104503 ) Summary We already support vectorized code gen for the `dequant-relu-quant` pattern, in which `to_uint8` is the last node of the quant pattern before the store to memory. However, there is another case, `dequant1-relu-quant2-dequant2-relu-quant3`, in which `quant2` sits in the middle of the fusion pattern. This PR enables vectorized code gen of `quant2-dequant2`. Test Plan ``` python -u -m pytest -s -v test_cpu_repro.py -k test_dequant_relu_quant_dequant_relu_quant_lowering ``` Next Step * For better performance, we can add another pass to eliminate paired `float_to_uint8` and `uint8_to_float` nodes. * For better performance, we should annotate `dequant1` and `quant2` as sharing an observer in the quantization recipe. Then we can lower `dequant1-relu-quant2` into a QReLU node to fully eliminate the calculation of `dequant1` and `quant2`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/104503 Approved by: https://github.com/jgong5, https://github.com/jansel	2023-07-05 01:59:00 +00:00
lezcano	7ae100628e	Move most SymPy functions to their own file (#104556 ) All these are standalone implementations of some functions and they don't depend on anything else, so we better have them under the `_sympy/` folder on their own Pull Request resolved: https://github.com/pytorch/pytorch/pull/104556 Approved by: https://github.com/ezyang	2023-07-04 03:53:48 +00:00
leslie-fang-intel	707d265db2	[Inductor][Quant]Refactor load and store vectorization code generation with uint8 data type (#104075 ) Summary Refactor the vectorization code generation of uint8 input data type. Previously, we combine the uint8 data load and uint8 to float data convert into one step as `load_uint8_as_float` and `store_float_as_uint8`. After refactor, we split them into 2 steps of load/store and data type convert to make the behavior same as BFloat16 data type . The previous generated code is: ``` #pragma omp for for(long i0=static_cast<long>(0L); i0<static_cast<long>(432L); i0+=static_cast<long>(16L)) { auto tmp0 = at::vec::load_uint8_as_float(in_ptr0 + static_cast<long>(i0)); auto tmp1 = (tmp0); auto tmp2 = at::vec::Vectorized<float>(static_cast<float>(100.0)); auto tmp3 = tmp1 - tmp2; auto tmp4 = at::vec::Vectorized<float>(static_cast<float>(0.01)); auto tmp5 = tmp3 * tmp4; auto tmp6 = at::vec::clamp_min(tmp5, decltype(tmp5)(0)); auto tmp7 = tmp6 * tmp2; auto tmp8 = tmp7.round(); auto tmp9 = tmp8 + tmp2; auto tmp10 = at::vec::Vectorized<float>(static_cast<float>(0.0)); auto tmp11 = at::vec::maximum(tmp9, tmp10); auto tmp12 = at::vec::Vectorized<float>(static_cast<float>(255.0)); auto tmp13 = at::vec::minimum(tmp11, tmp12); auto tmp14 = (tmp13); at::vec::store_float_as_uint8(tmp14, out_ptr0 + static_cast<long>(i0)); } ``` After this PR, the generated code is: ``` #pragma omp for for(long i0=static_cast<long>(0L); i0<static_cast<long>(432L); i0+=static_cast<long>(16L)) { auto tmp0 = at::vec::Vectorized<uint8_t>::loadu(in_ptr0 + static_cast<long>(i0), 16); auto tmp1 = cvt_uint8_to_fp32_with_same_elem_num(tmp0); auto tmp2 = at::vec::Vectorized<float>(static_cast<float>(100.0)); auto tmp3 = tmp1 - tmp2; auto tmp4 = at::vec::Vectorized<float>(static_cast<float>(0.01)); auto tmp5 = tmp3 * tmp4; auto tmp6 = at::vec::clamp_min(tmp5, decltype(tmp5)(0)); auto tmp7 = tmp6 * tmp2; auto tmp8 = tmp7.round(); auto tmp9 = tmp8 + tmp2; auto tmp10 = 
at::vec::Vectorized<float>(static_cast<float>(0.0)); auto tmp11 = at::vec::maximum(tmp9, tmp10); auto tmp12 = at::vec::Vectorized<float>(static_cast<float>(255.0)); auto tmp13 = at::vec::minimum(tmp11, tmp12); auto tmp14 = cvt_fp32_to_uint8(tmp13); tmp14.store(out_ptr0 + static_cast<long>(i0), 16); } ``` Test Plan ``` python -m pytest test_cpu_repro.py -k test_decomposed_dequant_relu_quant python -m pytest test_cpu_repro.py -k test_tile2d_load_decomposed_dequant_add_relu_quant python -m pytest test_cpu_repro.py -k test_tile2d_store_channel_shuffle_cl_quant_output ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/104075 Approved by: https://github.com/jgong5, https://github.com/jansel	2023-07-01 23:12:43 +00:00
Brian Hirsh	624d20c3de	kill inductor.config.disable_cpp_codegen in internal (#104351 ) Summary: This diff adds a path in inductor to invoke gcc through Remote Execution, when run from within fbcode. This should (hopefully) let us kill the `inductor.disable_cpp_codegen` flag, since we should now be able to invoke clang at runtime from within fbcode to compile c++ code. This was preventing https://github.com/pytorch/pytorch/pull/100115 from landing, which fixed one of the last remaining models in torchbench that was failing with `torch.compile` (hf_Longformer). Enumeration of changes: - updated inductor to invoke `_run_build_command()` when in fbcode, which hooks into Remote Execution - When inductor invokes g++ normally, it includes a bunch of absolute paths, to stuff like the pytorch header paths, and the input and output path. I changed these all to relative paths when in fbcode, and copied everything we needed into a temp dir that we send to Remote Execution. - updated `triton/fb/make_build_paths.py` to let us grab paths to openmp, sleef, and ld from within the Remote Execution environment. I'm not sure if there's a better way to do this (but this way appeared to work, thanks to Bert's suggestion from https://www.internalfb.com/diff/D46482550?dst_version_fbid=231706286239076&transaction_fbid=229345569847706) - factored `triton/fb/build.py` (it had a function to create a triton build command and run it all in one go, I separated the bit that takes in an arbitrary command (our clang command), and runs it with RE) - a few tweaks to the include paths that inductor uses: it adds those two extra paths (sleef and openmp), and it also does not manually include the `-ltorch`,`-lc10`,`-ltorch_python`,`-ltorch_cpu` libs - the linker was complaining that it couldn't find those libs, and not including those flags ends up working - I added a few more missing headers. Maybe with D46527002 this won't be necessary? - I had a basic manual test in `scripts/hirsheybar/tmp2.py`. 
We probably want to try running an actual job in MAST to make sure this works. Test Plan: `scripts/hirsheybar/pt2/tmp2.py` has a basic test, but I'm also planning on testing by kicking off a MAST job with cmf_10x (thanks to a bunch of help from Bert) Reviewed By: bertmaher Differential Revision: D46364355 Pull Request resolved: https://github.com/pytorch/pytorch/pull/104351 Approved by: https://github.com/bertmaher	2023-06-30 13:32:16 +00:00
XiaobingSuper	a704251628	inductor: fix compile error of bfloat16 broadcast operation (#104319 ) For the bfloat16 broadcast, there is always a compile error: ``` error: could not convert ‘tmp2’ from ‘Vectorized<float>’ to ‘Vectorized<c10::BFloat16>’ ``` This PR fixes this issue. Pull Request resolved: https://github.com/pytorch/pytorch/pull/104319 Approved by: https://github.com/jgong5, https://github.com/jansel	2023-06-30 04:14:38 +00:00
leslie-fang-intel	f8ac569365	[Inductor][Quant]Fix tile2d code generation issue with uint8 data type (#104074) Summary The previous vectorized code generation of tile2d doesn't support the uint8 input data type: it still treats the data as float and generates a wrong result. This PR fixes this issue. Take UT `test_tile2d_load_decomposed_dequant_add_relu_quant` in this PR as an example: The previously generated code is: ``` #pragma GCC ivdep for(long i1=static_cast<long>(0L); i1<static_cast<long>(192L); i1+=static_cast<long>(16L)) { unsigned char tmp0[16*16] __attribute__ ((aligned (16))); at::vec::transpose_mxn<unsigned char,16,16>(in_ptr0 + static_cast<long>(i0 + (1024L*i1)), static_cast<long>(1024L), tmp0, 16); unsigned char tmp7[16*16] __attribute__ ((aligned (16))); at::vec::transpose_mxn<unsigned char,16,16>(in_ptr1 + static_cast<long>(i0 + (1024L*i1)), static_cast<long>(1024L), tmp7, 16); for (long i0_inner = 0; i0_inner < 16; i0_inner++) { auto tmp1 = at::vec::Vectorized<float>::loadu(tmp0 + static_cast<long>(16L*i0_inner)); auto tmp8 = at::vec::Vectorized<float>::loadu(tmp7 + static_cast<long>(16L*i0_inner)); auto tmp2 = (tmp1); auto tmp3 = at::vec::Vectorized<float>(static_cast<float>(1.0)); auto tmp4 = tmp2 - tmp3; auto tmp5 = at::vec::Vectorized<float>(static_cast<float>(0.01)); auto tmp6 = tmp4 * tmp5; auto tmp9 = (tmp8); auto tmp10 = at::vec::Vectorized<float>(static_cast<float>(2.0)); auto tmp11 = tmp9 - tmp10; auto tmp12 = at::vec::Vectorized<float>(static_cast<float>(0.02)); auto tmp13 = tmp11 * tmp12; auto tmp14 = tmp6 + tmp13; auto tmp15 = at::vec::clamp_min(tmp14, decltype(tmp14)(0)); auto tmp16 = at::vec::Vectorized<float>(static_cast<float>(33.333333333333336)); auto tmp17 = tmp15 * tmp16; auto tmp18 = tmp17.round(); auto tmp19 = at::vec::Vectorized<float>(static_cast<float>(3.0)); auto tmp20 = tmp18 + tmp19; auto tmp21 = at::vec::Vectorized<float>(static_cast<float>(0.0)); auto tmp22 = at::vec::maximum(tmp20, tmp21); auto tmp23 = 
at::vec::Vectorized<float>(static_cast<float>(255.0)); auto tmp24 = at::vec::minimum(tmp22, tmp23); auto tmp25 = (tmp24); at::vec::store_float_as_uint8(tmp25, out_ptr0 + static_cast<long>(i1 + (196L*i0) + (196L*i0_inner))); } } ``` After this PR, the generated code is: ``` #pragma GCC ivdep for(long i1=static_cast<long>(0L); i1<static_cast<long>(192L); i1+=static_cast<long>(16L)) { unsigned char tmp0[16*16] __attribute__ ((aligned (16))); at::vec::transpose_mxn<unsigned char,16,16>(in_ptr0 + static_cast<long>(i0 + (1024L*i1)), static_cast<long>(1024L), tmp0, 16); unsigned char tmp7[16*16] __attribute__ ((aligned (16))); at::vec::transpose_mxn<unsigned char,16,16>(in_ptr1 + static_cast<long>(i0 + (1024L*i1)), static_cast<long>(1024L), tmp7, 16); for (long i0_inner = 0; i0_inner < 16; i0_inner++) { auto tmp1 = at::vec::load_uint8_as_float(tmp0 + static_cast<long>(16L*i0_inner)); auto tmp8 = at::vec::load_uint8_as_float(tmp7 + static_cast<long>(16L*i0_inner)); auto tmp2 = (tmp1); auto tmp3 = at::vec::Vectorized<float>(static_cast<float>(1.0)); auto tmp4 = tmp2 - tmp3; auto tmp5 = at::vec::Vectorized<float>(static_cast<float>(0.01)); auto tmp6 = tmp4 * tmp5; auto tmp9 = (tmp8); auto tmp10 = at::vec::Vectorized<float>(static_cast<float>(2.0)); auto tmp11 = tmp9 - tmp10; auto tmp12 = at::vec::Vectorized<float>(static_cast<float>(0.02)); auto tmp13 = tmp11 * tmp12; auto tmp14 = tmp6 + tmp13; auto tmp15 = at::vec::clamp_min(tmp14, decltype(tmp14)(0)); auto tmp16 = at::vec::Vectorized<float>(static_cast<float>(33.333333333333336)); auto tmp17 = tmp15 * tmp16; auto tmp18 = tmp17.round(); auto tmp19 = at::vec::Vectorized<float>(static_cast<float>(3.0)); auto tmp20 = tmp18 + tmp19; auto tmp21 = at::vec::Vectorized<float>(static_cast<float>(0.0)); auto tmp22 = at::vec::maximum(tmp20, tmp21); auto tmp23 = at::vec::Vectorized<float>(static_cast<float>(255.0)); auto tmp24 = at::vec::minimum(tmp22, tmp23); auto tmp25 = (tmp24); at::vec::store_float_as_uint8(tmp25, out_ptr0 + 
static_cast<long>(i1 + (196L*i0) + (196L*i0_inner))); } } ``` Test Plan ``` python -m pytest test_cpu_repro.py -k test_tile2d_load_decomposed_dequant_add_relu_quant python -m pytest test_cpu_repro.py -k test_tile2d_store_channel_shuffle_cl_quant_output ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/104074 Approved by: https://github.com/jgong5, https://github.com/jansel	2023-06-27 00:59:05 +00:00
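The difference between the two loads in the commit above can be illustrated in plain Python. This is only a sketch: `struct` stands in for the vector load, and the helper names (`reinterpret_bytes_as_float32`, `convert_uint8_to_float`, `dequantize`) are hypothetical, not part of `at::vec`. It shows why reading a uint8 buffer as if it held floats gives unusable values, while a converting load followed by the `(x - zero_point) * scale` step from the generated code behaves as expected.

```python
import struct


def reinterpret_bytes_as_float32(buf):
    """Emulate loading raw uint8 storage as if it were float32 data."""
    count = len(buf) // 4
    return list(struct.unpack("<%df" % count, buf[: count * 4]))


def convert_uint8_to_float(buf):
    """Emulate a converting load: every byte becomes one float value."""
    return [float(b) for b in buf]


def dequantize(values, zero_point, scale):
    """Same arithmetic as `tmp4 = tmp2 - tmp3; tmp6 = tmp4 * tmp5` above."""
    return [(v - zero_point) * scale for v in values]


raw = bytes([1, 2, 3, 4])
wrong = reinterpret_bytes_as_float32(raw)  # one meaningless float
right = convert_uint8_to_float(raw)        # four floats: 1.0 .. 4.0
dequantized = dequantize(right, zero_point=1.0, scale=0.01)
print(wrong, right, dequantized)
```

Four bytes collapse into a single float in the reinterpreting path, so both the values and the element count are wrong.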
Antoni Viros i Martin	0d653730ce	Refactor bits for the codegen cache (#103452) Pull Request resolved: https://github.com/pytorch/pytorch/pull/103452 Approved by: https://github.com/ezyang	2023-06-22 13:04:22 +00:00
XiaobingSuper	01abccf63f	inductor: fix CppTile2D bf16 store compiler error for cpp backend (#103659) Pull Request resolved: https://github.com/pytorch/pytorch/pull/103659 Approved by: https://github.com/jgong5, https://github.com/jansel	2023-06-19 00:46:30 +00:00
XiaobingSuper	b287cb816c	inductor: make sure the vec_transpose's tiling stride doesn't depend on out_idx and tiling_idx (#103651) For the TIMM swin_base_patch4_window7_224 dynamic shape path, there is an accuracy issue with horizontal reduction with vec_transpose: ``` #pragma omp for for(long i0=static_cast<long>(0L); i0<static_cast<long>(ks0); i0+=static_cast<long>(1L)) { #pragma GCC ivdep for(long i1=static_cast<long>(0L); i1<static_cast<long>(3136L); i1+=static_cast<long>(16L)) { { #pragma omp declare reduction(+:at::vec::Vectorized<float>:omp_out = omp_out + omp_in) initializer(omp_priv={{0}}) float tmp_acc0 = 0; auto tmp_acc0_vec = at::vec::Vectorized<float>(tmp_acc0); for(long i2=static_cast<long>(0L); i2<static_cast<long>(128L); i2+=static_cast<long>(16L)) { float tmp1[16*16] __attribute__ ((aligned (16))); at::vec::transpose_mxn<float,16,16>(in_ptr1 + static_cast<long>(i2 + (128L*(static_cast<long>((static_cast<long>(i1) % static_cast<long>(56L))) % static_cast<long>(7L))) + (896L*(static_cast<long>(at::native::div_floor_integer(i1, 56L)) % static_cast<long>(7L))) + (6272L*(at::native::div_floor_integer((static_cast<long>(i1) % static_cast<long>(56L)), 7L))) + (50176L*(at::native::div_floor_integer(i1, 392L))) + (401408L*i0)), static_cast<long>(((-50176L)*(at::native::div_floor_integer(i1, 392L))) + ((-6272L)*(at::native::div_floor_integer((static_cast<long>(i1) % static_cast<long>(56L)), 7L))) + ((-896L)*(static_cast<long>(at::native::div_floor_integer(i1, 56L)) % static_cast<long>(7L))) + ((-128L)*(static_cast<long>((static_cast<long>(i1) % static_cast<long>(56L))) % static_cast<long>(7L))) + (128L*(static_cast<long>((static_cast<long>((1L + i1)) % static_cast<long>(56L))) % static_cast<long>(7L))) + (896L*(static_cast<long>(at::native::div_floor_integer((1L + i1), 56L)) % static_cast<long>(7L))) + (6272L*(at::native::div_floor_integer((static_cast<long>((1L + i1)) % static_cast<long>(56L)), 7L))) + (50176L*(at::native::div_floor_integer((1L + i1), 392L)))), tmp1, 16); for 
(long i2_inner = 0; i2_inner < 16; i2_inner++) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(i1 + (3136L*i2) + (3136L*i2_inner) + (401408L*i0))); auto tmp2 = at::vec::Vectorized<float>::loadu(tmp1 + static_cast<long>(16L*i2_inner)); auto tmp3 = tmp0 + tmp2; tmp_acc0_vec = tmp_acc0_vec + tmp3; } } tmp_acc0_vec.store(out_ptr0 + static_cast<long>(i1 + (3136L*i0))); } } } ``` The ```ld_src``` argument of ```transpose_mxn``` depends on ```i1```, which is not expected. This PR adds a check to make sure the tiling stride doesn't depend on out_idx (```i2```) or tiling_idx (```i1```). After this PR, the generated code looks like this: ``` #pragma omp for for(long i0=static_cast<long>(0L); i0<static_cast<long>(ks0); i0+=static_cast<long>(1L)) { #pragma GCC ivdep for(long i1=static_cast<long>(0L); i1<static_cast<long>(3136L); i1+=static_cast<long>(16L)) { { #pragma omp declare reduction(+:at::vec::Vectorized<float>:omp_out = omp_out + omp_in) initializer(omp_priv={{0}}) float tmp_acc0 = 0; auto tmp_acc0_vec = at::vec::Vectorized<float>(tmp_acc0); for(long i2=static_cast<long>(0L); i2<static_cast<long>(128L); i2+=static_cast<long>(16L)) { for (long i2_inner = 0; i2_inner < 16; i2_inner++) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(i1 + (3136L*i2) + (3136L*i2_inner) + (401408L*i0))); auto tmp1 = ([&]() { __at_align__ float tmpbuf[16]; for (long i1_inner = 0; i1_inner < 16; i1_inner++) tmpbuf[i1_inner] = in_ptr1[static_cast<long>(i2 + i2_inner + (128L*(static_cast<long>((static_cast<long>((i1 + i1_inner)) % static_cast<long>(56L))) % static_cast<long>(7L))) + (896L*(static_cast<long>(at::native::div_floor_integer((i1 + i1_inner), 56L)) % static_cast<long>(7L))) + (6272L*(at::native::div_floor_integer((static_cast<long>((i1 + i1_inner)) % static_cast<long>(56L)), 7L))) + (50176L*(at::native::div_floor_integer((i1 + i1_inner), 392L))) + (401408L*i0))]; return at::vec::Vectorized<float>::loadu(tmpbuf); })(); auto tmp2 = tmp0 + tmp1; 
tmp_acc0_vec = tmp_acc0_vec + tmp2; } } tmp_acc0_vec.store(out_ptr0 + static_cast<long>(i1 + (3136L*i0))); } } } ``` How to reproduce this issue: ``` python -m torch.backends.xeon.run_cpu --node_id 0 benchmarks/dynamo/timm_models.py --accuracy --float32 -dcpu --inference -n5 --inductor --dynamic-shapes --only swin_base_patch4_window7_224 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/103651 Approved by: https://github.com/jgong5, https://github.com/jansel	2023-06-16 03:56:39 +00:00
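Why a stride that depends on `i1` is a problem can be checked numerically. The sketch below transcribes the `in_ptr1` index expression from the generated code above into Python (Python's `//` matches `div_floor_integer` for the non-negative values used here). The function names are hypothetical, and this is not Inductor's actual check, only an illustration of the property it guards: the distance between consecutive `i1` positions must be the same across the whole 16-wide tile for one `ld_src` value to be valid.

```python
def src_index(i1, i2=0, i0=0):
    """Index into in_ptr1, transcribed from the generated kernel."""
    return (
        i2
        + 128 * ((i1 % 56) % 7)
        + 896 * ((i1 // 56) % 7)
        + 6272 * ((i1 % 56) // 7)
        + 50176 * (i1 // 392)
        + 401408 * i0
    )


def stride_along_i1(i1):
    """The value the old code passed as ld_src: index(i1 + 1) - index(i1)."""
    return src_index(i1 + 1) - src_index(i1)


def stride_is_constant(tile_start, tile=16):
    """True only if a single stride describes the whole tile."""
    strides = {stride_along_i1(i1) for i1 in range(tile_start, tile_start + tile)}
    return len(strides) == 1


print([stride_along_i1(i1) for i1 in range(16)])
print(stride_is_constant(0))
```

Within the first tile the stride is 128 for `i1` in 0..5 but jumps at `i1 = 6`, so the stride measured at the tile start does not describe the rest of the tile, which is consistent with the accuracy issue reported in the commit.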
XiaobingSuper	da21273ad5	inductor: support rsqrt for dynamic shape (#103579 ) Fix compiler error for HF hf_BigBird dynamic shape path. Pull Request resolved: https://github.com/pytorch/pytorch/pull/103579 Approved by: https://github.com/jgong5, https://github.com/jansel	2023-06-15 07:02:18 +00:00
Nikita Shulga	5c252f2c7c	[Inductor/cpp] Fix reduction on pre clang-10 (#103347 ) `#pragma omp declare reduction` is not supported before clang-10 and results in a misleading compiler error in the following example: ```c++ template<typename T> T max_propagate_nan(T, T); extern "C" void cpp_fused_argmax_max_sum_0(const float* in_ptr0, float* out_ptr0, float* out_ptr1, long* out_ptr2) { float tmp_acc0 = 0; float tmp_acc1 = -std::numeric_limits<float>::infinity(); float tmp_acc2 = std::numeric_limits<float>::infinity(); struct IndexValue_7 {size_t index; float value;}; IndexValue_7 tmp_acc3{0, -std::numeric_limits<float>::infinity()}; #pragma omp declare reduction(argmax : IndexValue_7 : omp_out.value = omp_in.value < omp_out.value ? omp_out.value : omp_in.value, omp_out.index = omp_in.value < omp_out.value ? omp_out.index : omp_in.index) initializer(omp_priv = {0, -std::numeric_limits<float>::infinity()}) for(long i0=static_cast<long>(0L); i0<static_cast<long>(3L); i0+=static_cast<long>(1L)) { auto tmp0 = in_ptr0[static_cast<long>(i0)]; tmp_acc0 = tmp_acc0 + tmp0; tmp_acc1 = max_propagate_nan(tmp_acc1, tmp0); if (tmp_acc3.value < tmp0) { tmp_acc3.index = i0; tmp_acc3.value = tmp0; } } out_ptr0[static_cast<long>(0L)] = tmp_acc0; out_ptr1[static_cast<long>(0L)] = tmp_acc1; out_ptr2[static_cast<long>(0L)] = tmp_acc3.index; } ``` ``` % clang++-10 -std=c++17 -fopenmp bar.cpp -c -O3 % clang++-9 -std=c++17 -fopenmp bar.cpp -c -O3 bar.cpp:17:149: error: expected ')' #pragma omp declare reduction(argmax : IndexValue_7 : omp_out.value = omp_in.value < omp_out.value ? omp_out.value : omp_in.value, omp_out.index = omp_in.value < omp_out.value ? omp_out.index : omp_in.index) initializer(omp_priv = {0, -std::numeric_limits<float>::infinity()}) ^ bar.cpp:17:34: note: to match this '(' #pragma omp declare reduction(argmax : IndexValue_7 : omp_out.value = omp_in.value < omp_out.value ? omp_out.value : omp_in.value, omp_out.index = omp_in.value < omp_out.value ? 
omp_out.index : omp_in.index) initializer(omp_priv = {0, -std::numeric_limits<float>::infinity()}) ^ 1 error generated. ``` Also, remove unnecessary `struct` keyword in front of type, as C++ compiler already assumes that (and again, it causes problem with clang++-10 implementation) Pull Request resolved: https://github.com/pytorch/pytorch/pull/103347 Approved by: https://github.com/voznesenskym	2023-06-10 02:53:37 +00:00
David Berard	cde4657284	[inductor] Support complex fallback for convert_element_type, _fft_c2c, view_as_real to support GoogleFnet with cpp wrapper (#103183) Fixes #102752 These 3 fallback kernels appear in GoogleFnet because they take complex arguments - i.e., usually they aren't fallback kernels. To support this model, we added support for these 3 ops. Details: 1. Add these 3 ops to the allowlist. I assume that we eventually want to support all fallback kernels, but for now we just add these 3 ops to the allowlist. 2. Support complex64 in cpp codegen 3. Support List[] arguments and ScalarType arguments in cpp codegen 4. Allow alias_info in schema arguments. In the original PR supporting fallback kernels for cpp wrapper, ops with schemas with non-null alias_info for any of the arguments were disallowed; but I don't think there's any reason we need to disallow these in cpp wrapper code. Caveats: * This has not added support for complex32 or complex128 * It only works with static shapes, not dynamic shapes. It seems like the dynamic shapes issue is unrelated to cpp wrapper, since it fails in the test_torchinductor_dynamic_shapes.py test. I checked these `test_fft_.*` tests, which I added in this PR, and verified that they were broken with dynamic shapes before any of the code changes from this PR. Test: ``` benchmarks/dynamo/huggingface.py --inductor --amp --accuracy --inference --device cuda --cpp-wrapper --only GoogleFnet ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/103183 Approved by: https://github.com/desertfire, https://github.com/jgong5, https://github.com/chunyuan-w	2023-06-09 21:12:41 +00:00
XiaobingSuper	8e5b7ce5db	inductor: fix bf16 legalization issue for fp32 load with to bf16 case (#103080) Given the following IR: ``` def body(self, ops): get_index = self.get_index('index0') index_expr = ops.index_expr(get_index, torch.int32) constant = ops.constant(4, torch.int32) lt = ops.lt(index_expr, constant) masked_subblock1 = self.masked_subblock1(lt, 0.0) get_index_1 = self.get_index('index3') load = ops.load('arg2_1', get_index_1) to_dtype = ops.to_dtype(load, torch.bfloat16) where = ops.where(lt, masked_subblock1, to_dtype) get_index_2 = self.get_index('index3') store = ops.store('buf0', get_index_2, where, None) return store def masked_subblock2(self, ops): get_index = self.get_index('index2') load = ops.load('arg1_1', get_index) return load def masked_subblock1(self, ops): get_index = self.get_index('index1') index_expr = ops.index_expr(get_index, torch.int32) constant = ops.constant(1, torch.int32) ge = ops.ge(index_expr, constant) get_index_1 = self.get_index('index1') index_expr_1 = ops.index_expr(get_index_1, torch.int32) constant_1 = ops.constant(3, torch.int32) lt = ops.lt(index_expr_1, constant_1) and_ = ops.and_(ge, lt) masked_subblock2 = self.masked_subblock2(and_, 0.0) get_index_2 = self.get_index('index3') load = ops.load('arg2_1', get_index_2) to_dtype = ops.to_dtype(load, torch.bfloat16) where = ops.where(and_, masked_subblock2, to_dtype) return where ``` Before this PR, ```masked_subblock2``` is legalized as ```load_bf16+to_fp32```, so its output type is ```fp32```. But for ```load = ops.load('arg2_1', get_index_2), to_dtype = ops.to_dtype(load, torch.bfloat16)```, we didn't convert ```to_bf16``` to ```to_fp32```, so ```ops.where``` performs a mixed-type computation and fails with a compiler error: ```error: operands to ?: have different types ‘float’ and ‘c10::BFloat16’```. This PR always converts ```to_bf16``` to ```to_fp32``` to fix this issue. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103080 Approved by: https://github.com/jgong5, https://github.com/desertfire	2023-06-09 00:33:10 +00:00
Yanbo Liang	686d7e4c48	[Inductor] Fix x.view(dtype) decomp and make inductor support it (#102920 ) Fixes #99804 Pull Request resolved: https://github.com/pytorch/pytorch/pull/102920 Approved by: https://github.com/jansel, https://github.com/ngimel	2023-06-07 17:10:54 +00:00
haozhe.zhu	adcefcb378	insert to dtype for fused mem copy scheduler node (#101042) Fix https://github.com/pytorch/pytorch/issues/100830. For the inplace node, there will be a `copy_` generated and the `copy_` will be `realized` as a `scheduler buffer` since it is a mutation. This `scheduler buffer` is a memory copy, but after fusing with the previous buffer it is no longer a memory-copy-only buffer. This PR solves the issue by removing `load_bf16_as_fp32` and `store_bf16_from_fp32`. Instead, enable fp32/bf16 vec conversion in `to_dtype`. Then we always store bf16. ```python import torch import torch.nn as nn torch.manual_seed(420) from torch._inductor import config x = torch.randn(1, 18, dtype=torch.bfloat16) class ExampleModel(nn.Module): def __init__(self): super(ExampleModel, self).__init__() self.relu = nn.ReLU(inplace=True) # nn.ReLU(inplace=False) def forward(self, input1): out = self.relu(input1) # input1.copy_(out) return out func = ExampleModel() with torch.no_grad(): func.train(False) res1 = func(x) # without jit print(res1) jit_func = torch.compile(func) res2 = jit_func(x) print(res2) ``` Generated code without this PR: (the `tmp3` store is wrong: `tmp3` is `float` while `out_ptr1` is `bf16`) ``` auto tmp0 = load_bf16_as_float(out_ptr1 + static_cast<long>(i0)); auto tmp1 = (tmp0); auto tmp2 = at::vec::clamp_min(tmp1, decltype(tmp1)(0)); auto tmp3 = (tmp2); store_float_as_bf16(out_ptr0 + static_cast<long>(i0), tmp3); tmp3.store(out_ptr1 + static_cast<long>(i0), 16); ``` Generated code with this PR: ``` auto tmp0 = at::vec::Vectorized<bfloat16>::loadu(out_ptr1 + static_cast<long>(i0), 16); auto tmp1 = cvt_bf16_to_fp32(tmp0); auto tmp2 = at::vec::clamp_min(tmp1, decltype(tmp1)(0)); auto tmp3 = cvt_fp32_to_bf16(tmp2); tmp3.store(out_ptr0 + static_cast<long>(i0), 16); tmp3.store(out_ptr1 + static_cast<long>(i0), 16); ``` This PR also fixes the data type propagation for `masked_subblock`. Previously, the masked_subblock's dtype was propagated from its input, which is wrong. 
``` opcode name target args kwargs ----------- --------- --------- -------------------------- -------- call_module masked_subblock1 masked_subblock1 (and__2, -inf) ``` Now we propagate it from the subblock with the same name: ``` # graph for body.subblocks['masked_subblock1'] opcode name target args kwargs ----------- --------- --------- -------------------------- -------- placeholder ops ops () {} call_module get_index get_index ('index2',) {} call_method load load (ops, 'arg0_1', get_index) {} call_method to_dtype to_dtype (ops, load, torch.float32) {} output output output (to_dtype,) {} ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/101042 Approved by: https://github.com/jgong5, https://github.com/jansel	2023-06-07 15:55:25 +00:00
Aleksandar Samardžić	51e0f9e858	Add missing decompositions/lowerings for logical/bitwise operators (#102566) Pull Request resolved: https://github.com/pytorch/pytorch/pull/102566 Approved by: https://github.com/lezcano, https://github.com/alexsio27444, https://github.com/jgong5	2023-06-02 14:27:17 +00:00
XiaobingSuper	1204463bd0	inductor: fix bfloat16 reduction crash issue which stores a float value to bfloat16 (#102719) For bfloat16 reduction, there is a wrong store that writes a float value as bfloat16: Before: ``` extern "C" void kernel(const bfloat16* in_ptr0, bfloat16* out_ptr0, float* out_ptr1) { #pragma omp parallel num_threads(40) { { #pragma omp for for(long i0=static_cast<long>(0L); i0<static_cast<long>(16L); i0+=static_cast<long>(16L)) { { #pragma omp declare reduction(max:at::vec::Vectorized<float>:omp_out = at::vec::maximum(omp_out, omp_in)) initializer(omp_priv={{-std::numeric_limits<float>::infinity()}}) float tmp_acc0 = -std::numeric_limits<float>::infinity(); auto tmp_acc0_vec = at::vec::Vectorized<float>(tmp_acc0); for(long i1=static_cast<long>(0L); i1<static_cast<long>(32L); i1+=static_cast<long>(1L)) { auto tmp0 = load_bf16_as_float(in_ptr0 + static_cast<long>(i0 + (16L*i1))); auto tmp1 = (tmp0); tmp_acc0_vec = at::vec::maximum(tmp_acc0_vec, tmp1); } tmp_acc0_vec.store(out_ptr0 + static_cast<long>(i0)); } } } #pragma omp single { { for(long i0=static_cast<long>(0L); i0<static_cast<long>(16L); i0+=static_cast<long>(16L)) { auto tmp0 = load_bf16_as_float(out_ptr0 + static_cast<long>(i0)); auto tmp1 = (tmp0); tmp1.store(out_ptr1 + static_cast<long>(i0)); } } } } } ''') ``` after: ``` extern "C" void kernel(const bfloat16* in_ptr0, bfloat16* out_ptr0, float* out_ptr1) { #pragma omp parallel num_threads(40) { { #pragma omp for for(long i0=static_cast<long>(0L); i0<static_cast<long>(16L); i0+=static_cast<long>(16L)) { { #pragma omp declare reduction(max:at::vec::Vectorized<float>:omp_out = at::vec::maximum(omp_out, omp_in)) initializer(omp_priv={{-std::numeric_limits<float>::infinity()}}) float tmp_acc0 = -std::numeric_limits<float>::infinity(); auto tmp_acc0_vec = at::vec::Vectorized<float>(tmp_acc0); for(long i1=static_cast<long>(0L); i1<static_cast<long>(32L); i1+=static_cast<long>(1L)) { auto tmp0 = load_bf16_as_float(in_ptr0 + 
static_cast<long>(i0 + (16L*i1))); auto tmp1 = (tmp0); tmp_acc0_vec = at::vec::maximum(tmp_acc0_vec, tmp1); } store_float_as_bf16(out_ptr0 + static_cast<long>(i0), tmp_acc0_vec); } } } #pragma omp single { { for(long i0=static_cast<long>(0L); i0<static_cast<long>(16L); i0+=static_cast<long>(16L)) { auto tmp0 = load_bf16_as_float(out_ptr0 + static_cast<long>(i0)); auto tmp1 = (tmp0); tmp1.store(out_ptr1 + static_cast<long>(i0)); } } } } } ''') ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/102719 Approved by: https://github.com/jansel, https://github.com/jgong5	2023-06-02 08:34:29 +00:00
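The fix above replaces a plain float store with a converting store. As a rough illustration of what such a conversion involves, the following sketch emulates float32 to bfloat16 narrowing on the bit level with round-to-nearest-even. The helper names are hypothetical and this is not the implementation of `store_float_as_bf16`; NaN handling is deliberately left out.

```python
import struct


def float_to_bf16_bits(x):
    """Narrow a float32 value to a 16-bit bfloat16 pattern (NaN not handled)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest, ties to even, on the 16 bits that get dropped.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def bf16_bits_to_float(b):
    """Widen a bfloat16 bit pattern back to float32 by zero-filling the low bits."""
    return struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))[0]


FLOAT32_BYTES = struct.calcsize("<f")  # 4 bytes per element
BF16_BYTES = 2                         # a bfloat16 element is half that size

one = float_to_bf16_bits(1.0)
neg_inf = bf16_bits_to_float(float_to_bf16_bits(float("-inf")))
approx_pi = bf16_bits_to_float(float_to_bf16_bits(3.14159))
print(hex(one), neg_inf, approx_pi)
```

Because a float32 element is twice as wide as a bfloat16 element, writing unconverted float data through a `bfloat16*` pointer produces wrong values and can write past the intended region of the buffer, which is in line with the crash described in the commit.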
Peter Bell	2f96981e5a	[inductor] Reduce duplication of reduction combine functions (#99661) Currently reduction bodies are duplicated in several different places. This reduces duplication by taking the `combine_fn` definition used in `_unroll_reduction_fn` and using it in the triton codegen. For cpp this also makes better use of `reduction_combine{,_vec}` by using them to generate the `omp declare reduction` line and the `vec_reduce_all` call. For triton the only change is that the combine step gets spread over two lines, e.g. instead of: ```python _tmp1 = tl.where(rmask & xmask, triton_helpers.maximum(_tmp1, tmp0), _tmp1) ``` we get ```python tmp2 = triton_helpers.maximum(_tmp1, tmp0) _tmp1 = tl.where(rmask & xmask, tmp2, _tmp1) ``` For cpp the only change is that inplace reduction operations are now written as an out-of-place operation and an assignment, e.g. instead of ```cpp omp_out += omp_in ``` we generate ```cpp omp_out = omp_out + omp_in ``` which is a purely cosmetic change. Pull Request resolved: https://github.com/pytorch/pytorch/pull/99661 Approved by: https://github.com/lezcano, https://github.com/ngimel	2023-06-01 18:02:17 +00:00
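The idea of deriving several pieces of generated code from one combine function can be sketched as follows. Everything here is hypothetical (the function names, signatures, and the string-based approach are assumptions for illustration, not Inductor's actual `reduction_combine` code); the point is only that when the combine step is a single function over expressions, both the `omp declare reduction` line and the accumulator update come out in the out-of-place `a = a + b` form described above.

```python
def combine_sum(a, b):
    """Hypothetical combine function for a sum reduction, over C++ expressions."""
    return a + " + " + b


def combine_max_vec(a, b):
    """Hypothetical combine function for a vectorized max reduction."""
    return "at::vec::maximum(" + a + ", " + b + ")"


def omp_declare_reduction(name, dtype, combine, init):
    """Render the pragma from the same combine function used for accumulation."""
    return (
        "#pragma omp declare reduction(" + name + ":" + dtype + ":"
        + "omp_out = " + combine("omp_out", "omp_in") + ") "
        + "initializer(omp_priv={{" + init + "}})"
    )


def accumulate_line(acc, value, combine):
    """Render the loop-body update as an out-of-place op plus assignment."""
    return acc + " = " + combine(acc, value) + ";"


vec = "at::vec::Vectorized<float>"
sum_pragma = omp_declare_reduction("+", vec, combine_sum, "0")
sum_update = accumulate_line("tmp_acc0_vec", "tmp3", combine_sum)
max_update = accumulate_line("tmp_acc0_vec", "tmp1", combine_max_vec)
print(sum_pragma)
print(sum_update)
print(max_update)
```

The rendered strings match the shape of the pragma and accumulator lines that appear in the generated kernels elsewhere in this log.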
XiaobingSuper	49cd184f89	inductor: improve the index range check for index_expr vec check (#102263 ) Fix https://github.com/pytorch/pytorch/issues/102065. Pull Request resolved: https://github.com/pytorch/pytorch/pull/102263 Approved by: https://github.com/lezcano, https://github.com/peterbell10, https://github.com/jgong5	2023-06-01 03:07:14 +00:00
kshitij12345	b1bc8aecf5	[inductor] erfinv: CPU/CUDA lowering (#101863) Add `erfinv` lowering for CUDA. On CPU, we just fall back to the aten operator. Pull Request resolved: https://github.com/pytorch/pytorch/pull/101863 Approved by: https://github.com/lezcano, https://github.com/peterbell10	2023-05-29 15:31:54 +00:00
Wang, Eikan	ce41faa2ae	Add cpp.max_horizontal_fusion_size to control the granularity of horizontal fusion (#99828 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/99828 Approved by: https://github.com/jansel, https://github.com/jgong5	2023-05-26 05:20:49 +00:00
Wang, Eikan	6f464e0cf8	Invoke the bf16 load w/o #elements to bypass the temporary buffer allocation from the performance perspective. (#99822 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/99822 Approved by: https://github.com/jgong5, https://github.com/jansel	2023-05-26 02:10:41 +00:00
Wang, Eikan	c3550d8376	Add fast path for BF16 kernel if all the operations within the kernel support bf16 (#99814 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/99814 Approved by: https://github.com/jgong5, https://github.com/jansel	2023-05-26 02:08:53 +00:00
XiaobingSuper	4882cd0801	inductor: align cpp floordiv with python floordiv for dynamic shape path (#102068) This PR does the following things: - Align the C++ behavior with Python for FloorDiv. - Always return the expr dtype for some ops that do not use the expr's dtype to do the computation. After this PR, the TIMM ```levit_128``` and ```volo_d1_224``` accuracy tests pass for the dynamic shape path. Pull Request resolved: https://github.com/pytorch/pytorch/pull/102068 Approved by: https://github.com/jgong5, https://github.com/ngimel	2023-05-25 10:18:45 +00:00
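The mismatch this commit addresses comes from the two languages rounding integer division differently: C++ `/` truncates toward zero, while Python `//` rounds toward negative infinity. A minimal sketch, with hypothetical helper names that are not taken from the PR, emulates the C++ behavior in Python and shows the usual correction that turns a truncating quotient into a floored one.

```python
def c_style_div(a, b):
    """Integer division truncating toward zero, like C++ `/` on integers."""
    q = abs(a) // abs(b)
    return q if (a < 0) == (b < 0) else -q


def floor_div_from_trunc(a, b):
    """Floor division built on top of truncating division."""
    q = c_style_div(a, b)
    r = a - q * b
    # A nonzero remainder whose sign differs from the divisor means the
    # truncated quotient was rounded up, so step it down by one.
    if r != 0 and ((r < 0) != (b < 0)):
        q -= 1
    return q


print(c_style_div(-7, 2), -7 // 2, floor_div_from_trunc(-7, 2))
```

The two forms agree whenever the operands have the same sign or divide evenly, so a discrepancy only shows up for negative, non-exact quotients.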
Bin Bao	431344f2d0	[inductor] Refactor generate_kernel_call (#102018 ) Summary: Refactor generate_kernel_call to support codegen call to Triton kernel Pull Request resolved: https://github.com/pytorch/pytorch/pull/102018 Approved by: https://github.com/jansel, https://github.com/jgong5	2023-05-23 15:54:49 +00:00


169 Commits