Commit Graph

83 Commits

Nikita Karetnikov
938c5da61e [inductor] do not generate loops when the condition doesn't hold (#98185)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98185
Approved by: https://github.com/jgong5, https://github.com/ezyang
2023-04-06 17:22:16 +00:00
Bin Bao
348dcf51e5 [inductor] Combine CppWrapperCodeGen and CppAotWrapperCodeGen (#98088)
Summary: Make CppAotWrapperCodeGen generate the kernels and the wrapper in one
file, which unifies the codegen for the AOT and non-AOT modes. There will be
more refactoring for the AOT part.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98088
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-04-06 15:59:55 +00:00
Liao, Xuan
177994eb54 [inductor] [cpp] fix bitwise codegen (#98056)
Fixes #97968

Fix to maintain the data type after performing bitwise operations.
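
A minimal pattern of the kind affected might look like this (a hypothetical repro sketch, not taken from the linked issue):

```python
import torch
import torch._dynamo

def fn(x, y):
    # the result dtype of the bitwise ops should stay uint8 through codegen
    return torch.bitwise_or(torch.bitwise_and(x, y), y)

x = torch.randint(0, 255, (64,), dtype=torch.uint8)
y = torch.randint(0, 255, (64,), dtype=torch.uint8)
opt_fn = torch._dynamo.optimize("inductor")(fn)
assert opt_fn(x, y).dtype == torch.uint8
```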

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98056
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-04-04 01:33:31 +00:00
Bin Bao
96f548a1ac [inductor] Add an AOT mode for the Triton backend (#98214)
Summary:
This is a copy of https://github.com/pytorch/pytorch/pull/97152 to make
the landing easier.

This PR implements a two-pass wrapper codegen for the Triton
backend to achieve ahead-of-time compilation. In the first pass, the
regular python wrapper code will be generated, and then the generated
code will be executed to perform Triton compilation and autotuning.
After that, the second pass wrapper codegen will generate C++ wrapper
with proper CUDA API to load and launch Triton-generated CUDA kernels.

As with the AOT mode for the cpp backend, the next step would be to provide
a more complete API for AOT.
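
A rough sketch of the two-pass flow described above (all helper names here are illustrative placeholders, not the actual Inductor API):

```python
# illustrative pseudocode of the two-pass AOT wrapper codegen for the Triton
# backend; every function name below is a hypothetical placeholder
def aot_compile_triton(graph):
    py_wrapper = codegen_python_wrapper(graph)     # pass 1: regular python wrapper
    run(py_wrapper)                                # running it compiles + autotunes the Triton kernels
    cpp_wrapper = codegen_cpp_wrapper(graph)       # pass 2: C++ wrapper using CUDA APIs
    return compile_to_shared_library(cpp_wrapper)  # loads and launches the generated kernels
```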

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98214
Approved by: https://github.com/eellison
2023-04-03 22:19:18 +00:00
Jiong Gong
5d62d12557 [Inductor] support transpose vertical reduction in cpp (#97781)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97781
Approved by: https://github.com/jansel
2023-04-03 02:02:15 +00:00
Jiong Gong
bf22ecba2a [Inductor] support vertical reduction in cpp (#97644)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97644
Approved by: https://github.com/jansel
2023-04-03 01:29:12 +00:00
Jiong Gong
8e5f491623 [Inductor] simplify CPP backend Tile2D code and support non-contiguous load/store (#97626)
Remove `CppTile2DTailKernel` and `CppTile2DKernelChecker` and reuse `CppVecKernel` and `CppVecKernelChecker` in their place. Add vectorization with a scalar fallback for load/store in `CppVecKernel` to cover the non-contiguous load/store needed by `CppTile2DTailKernel`.

This PR also adds functional support for transposed copies of the bfloat16 data type. Better performance would require vectorized intrinsics implemented for `at::vec::transpose_mxn`. cc @soumith @voznesenskym @penguinwu @anijain2305 @EikanWang @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @desertfire

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97626
Approved by: https://github.com/jansel
2023-04-03 01:11:20 +00:00
chunyuan
004bb34f42 inductor: fix vision_maskrcnn dynamic_shapes error on CPU (#97312)
Fix several C++ compilation errors in `vision_maskrcnn` in dynamic_shapes cases:
1. convert `ceiling` to `std::ceil` in `CppPrinter`:
```bash
error: ‘ceiling’ was not declared in this scope
   17 |                 for(long i1=0; i1<ceiling(1.8735363483429*ks1); i1+=1)
```

2. convert index in `store` to `INDEX_TYPE`:
```bash
error: invalid types ‘float*[double]’ for array subscript
   52 |                         out_ptr0[i2 + (i1*(floor(1.8735363483429*ks2))) + (i0*(std::ceil((1.87353634834290*ks1)))*(floor(1.8735363483429*ks2)))] = tmp30;
```

3. convert offset, size, steps in loop to  `INDEX_TYPE`:
```bash
error: invalid controlling predicate
   16 |                 for(long i1=0; i1<std::ceil((1.87353634834290*ks1)); i1+=1)
```

4. convert index in `load` to  `INDEX_TYPE`:
```bash
error: invalid types ‘float*[double]’ for array subscript
   64 |                     auto tmp0 = out_ptr0[i1 + (i0*(floor(1.8735363483429*ks2)))];
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97312
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/jansel
2023-03-29 10:24:57 +00:00
Jason Ansel
bc86af0d37 Remove DeferredIndentedBuffer (#97616)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97616
Approved by: https://github.com/desertfire
2023-03-28 23:13:41 +00:00
haozhe.zhu
a1ada050f8 do not insert to_dtype for memory copy only buffers (#97147)
Remove redundant to_dtype conversions such as
`load_bf16 + to_fp32 + to_bf16 + store_bf16` => `load_bf16 + store_bf16`.
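
A copy-only bf16 buffer of the kind this targets could come from something like the following (a hedged sketch; the actual tests may differ):

```python
import torch
import torch._dynamo

def copy_only(x):
    # a pure memory copy: the bf16 value is loaded and stored unchanged, so the
    # intermediate to_fp32/to_bf16 pair is redundant and can be dropped
    return x.clone()

x = torch.randn(1024, dtype=torch.bfloat16)
opt_fn = torch._dynamo.optimize("inductor")(copy_only)
opt_fn(x)
```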

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97147
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/jansel
2023-03-27 14:55:41 +00:00
Liao, Xuan
a331cd4314 [inductor] fix cpp legalize bf16 reduction (#97228)
When legalizing bf16 for reductions, operators with a result dtype of torch.int64, like argmax, currently hit an assertion error. This PR fixes the int64 case, enabling several bf16 models (hf_Reformer, doctr_reco_predictor) to run successfully.
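
A sketch of the pattern this enables (a hypothetical repro, not taken from the listed models):

```python
import torch
import torch._dynamo

def fn(x):
    # a reduction over a bf16 input whose result dtype is torch.int64
    return torch.argmax(x, dim=-1)

x = torch.randn(8, 16, dtype=torch.bfloat16)
opt_fn = torch._dynamo.optimize("inductor")(fn)
assert opt_fn(x).dtype == torch.int64
```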

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97228
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/desertfire
2023-03-23 08:52:25 +00:00
Jiong Gong
35439e8610 [Inductor] add guards to guarantee vector int32 only used by comparison ops (for masked load) (#97144)
Fix https://github.com/pytorch/pytorch/issues/97124 and https://github.com/pytorch/pytorch/issues/97127

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97144
Approved by: https://github.com/EikanWang, https://github.com/jansel
2023-03-23 03:12:50 +00:00
Wang, Eikan
c5d7ed9423 [Inductor] Fix the issue that cannot pass lint check for debug mode (#97249)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97249
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/jansel
2023-03-22 04:25:44 +00:00
Jiong Gong
4733de18fd [Inductor] Add debug logging to explain reasons of disabling vectorization (#97108)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97108
Approved by: https://github.com/EikanWang, https://github.com/jansel
2023-03-22 02:38:34 +00:00
Wang, Eikan
915cbf8208 [Inductor] Eliminate redundant to_dtype node (#96650)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96650
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-03-18 01:51:38 +00:00
Bin Bao
f03db8d6cb [reland2][inductor] Add an AOT compilation mode for Inductor CPP backend (#96520)
Summary: This is a reland of https://github.com/pytorch/pytorch/pull/94822.
Solved the long compilation issue for inductor cpp tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96520
Approved by: https://github.com/huydhn, https://github.com/malfet
2023-03-14 16:10:54 +00:00
Wang, Eikan
bdd09e68e4 [Inductor] Legalize BF16 (#96183)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96183
Approved by: https://github.com/jansel, https://github.com/jgong5
2023-03-14 10:16:15 +00:00
Jiong Gong
190e284bd3 [Inductor] apply vec float mask on logical comparison ops in cpp (#96502)
Fix https://github.com/pytorch/pytorch/issues/96446
The root cause is that the logical comparison op works on an integer vector which is later consumed by the `where` op, which expects a float vector.
1. Make sure the float vec mask is applied on logical comparison ops.
2. Fix the vec int specialization of `to_float_mask`: it assumes an int mask as input and returns the float mask via a reinterpret cast.
3. Add a no-op specialization of the `to_float_mask` function for float vec input.
4. Pass value instead of ref to `to_float_mask`. Passing by value should be efficient enough.
5. Remove a conditional check `!=0` in `masked()` since `to_float_mask` is guaranteed to return a float mask.
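
A minimal pattern exercising this path might look like the following (a hedged sketch, not the repro from the linked issues):

```python
import torch
import torch._dynamo

def fn(x, y):
    # the comparison produces the mask consumed by where; on the CPP vec path
    # the mask must be a float vector before blending
    return torch.where(x > y, x, y)

x = torch.randn(64)
y = torch.randn(64)
opt_fn = torch._dynamo.optimize("inductor")(fn)
opt_fn(x, y)
```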

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96502
Approved by: https://github.com/EikanWang, https://github.com/XiaobingSuper, https://github.com/jansel
2023-03-14 08:47:14 +00:00
XiaobingSuper
279ada515a inductor(cpu): make variable number used of masked vectorization path align with scalar path (#96510)
Fix https://github.com/pytorch/pytorch/issues/96484. For the CPP reduction vectorization path, there is an assumption that the variable numbering of the vectorized path is aligned with the scalar path; currently, `masked` does not meet this requirement and triggers an undefined-variable error.

before:
```
{
    {
        {
            #pragma omp declare reduction(min:at::vec::Vectorized<float>:omp_out = at::vec::minimum(omp_out, omp_in)) initializer(omp_priv={{std::numeric_limits<float>::infinity()}})
            float tmp7 = std::numeric_limits<float>::infinity();
            auto tmp7_vec = at::vec::Vectorized<float>(tmp7);
            for(long i0=0; i0<0; i0+=1)
            {
                auto tmp5 = at::vec::Vectorized<float>::loadu(in_ptr1 + 16*i0);
                auto tmp0 = at::vec::Vectorized<int>(static_cast<int>(0));
                auto tmp1 = at::vec::Vectorized<int>(static_cast<int>(2));
                auto tmp2 = tmp0 < tmp1;
                auto tmp3 = at::vec::Vectorized<float>(0.0);
                {
                    auto tmp4 = at::vec::Vectorized<float>(in_ptr0[0]);
                    tmp3 = decltype(tmp4)::blendv(tmp3, tmp4, to_float_mask(tmp2) != at::vec::Vectorized<float>(0));
                }
                auto tmp6 = tmp3 + tmp5;
                tmp7_vec = at::vec::minimum(tmp7_vec, tmp6);
            }
            #pragma omp simd simdlen(8)  reduction(min:tmp8)
            for(long i0=0; i0<2; i0+=1)
            {
                auto tmp6 = in_ptr1[i0];
                auto tmp0 = static_cast<long>(0);
                auto tmp1 = static_cast<long>(2);
                auto tmp2 = tmp0 < tmp1;
                auto tmp3 = [&]
                {
                    auto tmp4 = in_ptr0[0];
                    return tmp4;
                }
                ;
                auto tmp5 = tmp2 ? tmp3() : static_cast<decltype(tmp3())>(0.0);
                auto tmp7 = tmp5 + tmp6;
                tmp8 = std::min(tmp8, tmp7);
            }
            tmp7 = std::min(tmp7, at::vec::vec_reduce_all<float>([](at::vec::Vectorized<float>& x, at::vec::Vectorized<float>&y) {return at::vec::minimum(x, y);}, tmp7_vec));
            out_ptr0[0] = tmp8;
        }
    }
}
```
after:

```
{
    {
        {
            #pragma omp declare reduction(min:at::vec::Vectorized<float>:omp_out = at::vec::minimum(omp_out, omp_in)) initializer(omp_priv={{std::numeric_limits<float>::infinity()}})
            float tmp8 = std::numeric_limits<float>::infinity();
            auto tmp8_vec = at::vec::Vectorized<float>(tmp8);
            for(long i0=0; i0<0; i0+=1)
            {
                auto tmp6 = at::vec::Vectorized<float>::loadu(in_ptr1 + 16*i0);
                auto tmp0 = at::vec::Vectorized<int>(static_cast<int>(0));
                auto tmp1 = at::vec::Vectorized<int>(static_cast<int>(2));
                auto tmp2 = tmp0 < tmp1;
                auto tmp3 = [&]
                {
                    auto tmp4 = at::vec::Vectorized<float>(in_ptr0[0]);
                    return tmp4;
                }
                ;
                auto tmp5 = decltype(tmp3())::blendv(at::vec::Vectorized<float>(0.0), tmp3(), to_float_mask(tmp2) != at::vec::Vectorized<float>(0));
                auto tmp7 = tmp5 + tmp6;
                tmp8_vec = at::vec::minimum(tmp8_vec, tmp7);
            }
            #pragma omp simd simdlen(8)  reduction(min:tmp8)
            for(long i0=0; i0<2; i0+=1)
            {
                auto tmp6 = in_ptr1[i0];
                auto tmp0 = static_cast<long>(0);
                auto tmp1 = static_cast<long>(2);
                auto tmp2 = tmp0 < tmp1;
                auto tmp3 = [&]
                {
                    auto tmp4 = in_ptr0[0];
                    return tmp4;
                }
                ;
                auto tmp5 = tmp2 ? tmp3() : static_cast<decltype(tmp3())>(0.0);
                auto tmp7 = tmp5 + tmp6;
                tmp8 = std::min(tmp8, tmp7);
            }
            tmp8 = std::min(tmp8, at::vec::vec_reduce_all<float>([](at::vec::Vectorized<float>& x, at::vec::Vectorized<float>&y) {return at::vec::minimum(x, y);}, tmp8_vec));
            out_ptr0[0] = tmp8;
        }
    }
}

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96510
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/jansel
2023-03-13 09:41:23 +00:00
PyTorch MergeBot
fe05266fda Revert "[reland][inductor] Add an AOT compilation mode for Inductor CPP backend (#95985)"
This reverts commit deaf9e5e65.

Reverted https://github.com/pytorch/pytorch/pull/95985 on behalf of https://github.com/huydhn due to Sorry for reverting this. It increased the test time significantly for ASAN (and may be other test shards). ASAN tests on PR passed but it was barely not timing out. I have updated my initial findings in https://github.com/pytorch/pytorch/issues/96378
2023-03-09 01:45:24 +00:00
Bin Bao
deaf9e5e65 [reland][inductor] Add an AOT compilation mode for Inductor CPP backend (#95985)
Summary: This is a reland of https://github.com/pytorch/pytorch/pull/94822

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95985
Approved by: https://github.com/jansel
2023-03-08 20:02:32 +00:00
Jason Ansel
fe4fec37a4 [inductor] Refactor IR printing (#96024)
Reland #95567 part 2.  The previous version of this had a bug which the
added test triggers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96024
Approved by: https://github.com/ngimel
2023-03-07 02:23:06 +00:00
Liao, Xuan
e168dbb90a [inductor] improve cpp vec implementation of square (#96072)
For cpp vectorization of `square`, the current implementation is not efficient. This also affects the performance of `batch normalization`, which uses `square` when calculating variance. This PR replaces the `power` call with a `multiplication` to gain more performance.
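
A batch-norm-style variance computation of the kind affected (illustrative only; the replacement of `power` by `multiplication` happens inside the generated cpp code):

```python
import torch
import torch._dynamo

def variance_like(x):
    # variance uses square; the vectorized cpp codegen now emits it as x * x
    # rather than a pow call (conceptual illustration of this PR's change)
    mean = x.mean(dim=0, keepdim=True)
    return ((x - mean) ** 2).mean(dim=0)

x = torch.randn(128, 64)
opt_fn = torch._dynamo.optimize("inductor")(variance_like)
opt_fn(x)
```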

Micro-benchmark performance for eager v.s. inductor:
op=`aten.native_batch_norm.default`

suite | improvement_0.2 | improvement_0.5 | improvement_0.8 | current_speedup_0.2 | new_speedup_0.2 | current_speedup_0.5 | new_speedup_0.5 | current_speedup_0.8 | new_speedup_0.8
-- | -- | -- | -- | -- | -- | -- | -- | -- | --
torchbench | 8.82% | 5.53% | 32.19% | 0.608006834 | 0.661613139 | 0.691743711 | 0.729987622 | 0.76176223 | 1.00694842
timm | 59.30% | 63.01% | 94.77% | 0.650648524 | 1.036498047 | 0.676425152 | 1.102667387 | 0.695693384 | 1.354992423


Model training performance for eager v.s. inductor:

model | improvement | current_speedup | new_speedup
-- | -- | -- | --
lcnet_050 multi-thread | 5.16% | 1.046 | 1.1
lcnet_050 single-thread | 21.81% | 0.94 | 1.145
mobilenet_v2 multi-thread | 3.88% | 1.135 | 1.179
mobilenet_v2 single-thread | 37.46% | 0.929 | 1.277


Pull Request resolved: https://github.com/pytorch/pytorch/pull/96072
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/desertfire
2023-03-07 01:13:39 +00:00
zhuhaozhe
ebaf9af76e use float to init reduction value (#95452)
Fixes https://github.com/pytorch/pytorch/issues/95195, https://github.com/pytorch/pytorch/issues/95185

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95452
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-03-06 08:49:36 +00:00
Jason Ansel
43dd043ea7 Revert "[inductor] Improve error messages (#95567)" (#96014)
This reverts commit 62b775583f.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96014
Approved by: https://github.com/Chillee
2023-03-04 04:03:31 +00:00
Nikita Karetnikov
feffcafe09 [inductor] use FP div in CPP expr printer (#95698)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95698
Approved by: https://github.com/ezyang, https://github.com/jgong5
2023-03-03 20:38:18 +00:00
PyTorch MergeBot
879400e4e8 Revert "[inductor] Add an AOT compilation mode for Inductor CPP backend (#94822)"
This reverts commit 73b66098b2.

Reverted https://github.com/pytorch/pytorch/pull/94822 on behalf of https://github.com/clee2000 due to broke inductor_tmm_cpu_accuracy, 73b66098b2 (11745396725)
2023-03-03 17:33:27 +00:00
Bin Bao
73b66098b2 [inductor] Add an AOT compilation mode for Inductor CPP backend (#94822)
Summary: The AOT mode currently works for the CPP backend. When turned on, Inductor compiles the model code into a .so file with aot_inductor_entry as the entry function. If the AOT compilation fails, Inductor will explicitly fail.
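
Conceptually, the produced artifact could be consumed like this (an illustrative sketch only; the output path and the actual calling convention of the entry point are not specified here):

```python
# illustrative only: the .so exposes aot_inductor_entry as its entry function,
# as described in the summary above; the real ABI is defined by the generated
# C++ wrapper, and "aot_model.so" is a hypothetical output name
import ctypes

lib = ctypes.CDLL("./aot_model.so")
entry = lib.aot_inductor_entry
```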

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94822
Approved by: https://github.com/jansel
2023-03-03 14:18:09 +00:00
Jason Ansel
62b775583f [inductor] Improve error messages (#95567)
Example error message before/after (710 to 131 lines):
https://gist.github.com/jansel/6fecad057738089fa95bf08c3de9fc8a

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95567
Approved by: https://github.com/mlazos
2023-03-02 02:20:55 +00:00
Wang, Eikan
9da903f180 [Inductor] Fix the logical_and/logical_or vectorization issue (#95609)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95609
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-03-01 08:21:57 +00:00
Wang, Eikan
c1f5e50fd1 [Inductor] Vectorize channels-last adaptive_avg_pool2d (#95608)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95608
Approved by: https://github.com/jansel
2023-03-01 08:21:57 +00:00
Kazuaki Ishizaki
9a4cb9bcaf Fix typos under torch/_inductor directory (#95601)
This PR fixes typos in comments and messages of `.py` files under the `torch/_inductor` directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95601
Approved by: https://github.com/ezyang
2023-02-27 19:00:17 +00:00
XiaobingSuper
4846d52212 inductor: fix compiler error when trying to vectorize logical_and and logical_or (#95361)
Currently, `operator&&` and `operator||` don't have a vectorized implementation, so disable them for now as a quick fix for the 2.0 release.
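
The kind of graph that previously hit the missing vectorized `operator&&`/`operator||` might look like this (a hedged sketch):

```python
import torch
import torch._dynamo

def fn(x, y):
    # logical_and / logical_or on boolean masks; vectorization of these ops is
    # disabled by this PR until operator&& / operator|| get vector implementations
    return torch.logical_and(x > 0, y > 0) | torch.logical_or(x < 0, y < 0)

x = torch.randn(64)
y = torch.randn(64)
opt_fn = torch._dynamo.optimize("inductor")(fn)
opt_fn(x, y)
```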

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95361
Approved by: https://github.com/ngimel, https://github.com/EikanWang
2023-02-24 02:30:13 +00:00
XiaobingSuper
1d7133c542 inductor(cpu): fix C++ compile error when sigmoid's post ops is a reduction op (#94890)
For the timm **nfnet_l0** model, the CPU path has the following error: `torch._dynamo.exc.BackendCompilerFailed: inductor raised CppCompileError: C++ compile error`.

Here is a simple test case:

```
def fn(x):
    x = torch.ops.aten.sigmoid.default(x)
    return torch.ops.aten.mean.dim(x, [-1, -2], True)

x = torch.randn((1, 8, 8, 8))
opt_fn = torch._dynamo.optimize("inductor")(fn)
opt_fn(x)

real_out = fn(x)
compiled_out = opt_fn(x)
tol = 0.0001
print(torch.allclose(real_out, compiled_out, atol=tol, rtol=tol))

```

before:

```
extern "C" void kernel(float* __restrict__ in_out_ptr0,
                       const float* __restrict__ in_ptr0)
{
    auto out_ptr0 = in_out_ptr0;
    {
        #pragma GCC ivdep
        for(long i0=0; i0<8; i0+=1)
        {
            {
                #pragma omp declare reduction(+:at::vec::Vectorized<float>:omp_out += omp_in) initializer(omp_priv={{0}})
                float tmp2 = 0;
                auto tmp2_vec = at::vec::Vectorized<float>(tmp2);
                for(long i1=0; i1<4; i1+=1)
                {
                    auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + (16*i1) + (64*i0));
                    auto tmp1 = decltype(tmp0)(1)/(decltype(tmp0)(1) + tmp0.neg().exp());
                    tmp2_vec += tmp1;
                }
                #pragma omp simd simdlen(8)  reduction(+:tmp3)
                for(long i1=64; i1<64; i1+=1)
                {
                    auto tmp0 = in_ptr0[i1 + (64*i0)];
                    auto tmp1 = std::exp(-tmp0);
                    auto tmp2 = 1 / (1 + tmp1);
                    tmp3 += tmp2;
                }
                tmp2 += at::vec::vec_reduce_all<float>([](at::vec::Vectorized<float>& x, at::vec::Vectorized<float>&y) {return x + y;}, tmp2_vec);
                out_ptr0[i0] = tmp3;
            }
        }
    }
    {
        for(long i0=0; i0<0; i0+=1)
        {
            auto tmp0 = at::vec::Vectorized<float>::loadu(out_ptr0 + 16*i0);
            auto tmp1 = at::vec::Vectorized<float>(static_cast<float>(64));
            auto tmp2 = tmp0 / tmp1;
            tmp2.store(in_out_ptr0 + 16*i0);
        }
        #pragma omp simd simdlen(8)
        for(long i0=0; i0<8; i0+=1)
        {
            auto tmp0 = out_ptr0[i0];
            auto tmp1 = static_cast<float>(64);
            auto tmp2 = tmp0 / tmp1;
            in_out_ptr0[i0] = tmp2;
        }
    }
}
```

after:
```
extern "C" void kernel(float* __restrict__ in_out_ptr0,
                       const float* __restrict__ in_ptr0)
{
    auto out_ptr0 = in_out_ptr0;
    #pragma omp parallel num_threads(40)
    {
        {
            #pragma omp for
            for(long i0=0; i0<8; i0+=1)
            {
                {
                    #pragma omp declare reduction(+:at::vec::Vectorized<float>:omp_out += omp_in) initializer(omp_priv={{0}})
                    float tmp2 = 0;
                    auto tmp2_vec = at::vec::Vectorized<float>(tmp2);
                    for(long i1=0; i1<4; i1+=1)
                    {
                        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + (16*i1) + (64*i0));
                        auto tmp1 = decltype(tmp0)(1)/(decltype(tmp0)(1) + tmp0.neg().exp());
                        tmp2_vec += tmp1;
                    }
                    #pragma omp simd simdlen(8)  reduction(+:tmp2)
                    for(long i1=64; i1<64; i1+=1)
                    {
                        auto tmp0 = in_ptr0[i1 + (64*i0)];
                        auto tmp1 = decltype(tmp0)(1) / (decltype(tmp0)(1) + std::exp(-tmp0));
                        tmp2 += tmp1;
                    }
                    tmp2 += at::vec::vec_reduce_all<float>([](at::vec::Vectorized<float>& x, at::vec::Vectorized<float>&y) {return x + y;}, tmp2_vec);
                    out_ptr0[i0] = tmp2;
                }
            }
        }
        #pragma omp single
        {
            {
                for(long i0=0; i0<0; i0+=1)
                {
                    auto tmp0 = at::vec::Vectorized<float>::loadu(out_ptr0 + 16*i0);
                    auto tmp1 = at::vec::Vectorized<float>(static_cast<float>(64));
                    auto tmp2 = tmp0 / tmp1;
                    tmp2.store(in_out_ptr0 + 16*i0);
                }
                #pragma omp simd simdlen(8)
                for(long i0=0; i0<8; i0+=1)
                {
                    auto tmp0 = out_ptr0[i0];
                    auto tmp1 = static_cast<float>(64);
                    auto tmp2 = tmp0 / tmp1;
                    in_out_ptr0[i0] = tmp2;
                }
            }
        }
    }
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94890
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/lezcano
2023-02-15 17:13:45 +00:00
Fabio Rocha
1dbaa5c290 Use decompositions for some fallbacks introduced in #94039 (#94206)
In some cases, this implements the required inductor primitives.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94206
Approved by: https://github.com/jansel, https://github.com/ngimel
2023-02-14 09:31:30 +00:00
Xuehai Pan
5b1cedacde [BE] [2/3] Rewrite super() calls in functorch and torch (#94588)
Rewrite Python built-in class `super()` calls. Only non-semantic changes should be applied.

- #94587
- #94588
- #94592

Also, methods with only a `super()` call are removed:

```diff
class MyModule(nn.Module):
-   def __init__(self):
-       super().__init__()
-
    def forward(self, ...):
        ...
```

Cases where the rewrite would change the semantics are kept unchanged. E.g.:

f152a79be9/caffe2/python/net_printer.py (L184-L190)

f152a79be9/test/test_jit_fuser_te.py (L2628-L2635)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94588
Approved by: https://github.com/ezyang, https://github.com/albanD
2023-02-10 21:16:33 +00:00
XiaobingSuper
02b8a7f473 inductor: don't do transpose vectorization if the input ld depends on the innermost var (#94493)
Fixed https://github.com/pytorch/pytorch/issues/94269.

For the following case:

```
import torch
import torchvision
#import intel_extension_for_pytorch

import torch._dynamo
from torch._inductor import config

class Model(torch.nn.Module):
    def __init__(self):
        super(Model, self).__init__()

    def forward(self, x):
        constant_pad_nd = x
        # File: /home/xiaobing/miniconda3/envs/pytorch_te_binary/lib/python3.8/site-packages/timm/models/layers/halo_attn.py:195, code: kv = kv.unfold(2, self.win_size, self.block_size).unfold(3, self.win_size, self.block_size)
        as_strided: f32[1, 384, 2, 20, 12] = torch.ops.aten.as_strided.default(constant_pad_nd, [1, 384, 2, 20, 12], [153600, 1, 61440, 384, 7680]);  constant_pad_nd = None
        as_strided_1: f32[1, 384, 2, 2, 12, 12] = torch.ops.aten.as_strided.default(as_strided, [1, 384, 2, 2, 12, 12], [153600, 1, 61440, 3072, 7680, 384]);  as_strided = None

        # File: /home/xiaobing/miniconda3/envs/pytorch_te_binary/lib/python3.8/site-packages/timm/models/layers/halo_attn.py:197, code: kv = kv.reshape(
        clone_1: f32[1, 384, 2, 2, 12, 12] = torch.ops.aten.clone.default(as_strided_1, memory_format = torch.contiguous_format);  as_strided_1 = None
        _unsafe_view_1: f32[8, 48, 4, 144] = torch.ops.aten._unsafe_view.default(clone_1, [8, 48, 4, 144]);  clone_1 = None
        permute_2: f32[8, 4, 144, 48] = torch.ops.aten.permute.default(_unsafe_view_1, [0, 2, 3, 1]);  _unsafe_view_1 = None
        # File: /home/xiaobing/miniconda3/envs/pytorch_te_binary/lib/python3.8/site-packages/timm/models/layers/halo_attn.py:202, code: k, v = torch.split(kv, [self.dim_head_qk, self.dim_head_v], dim=-1)
        split_with_sizes = torch.ops.aten.split_with_sizes.default(permute_2, [16, 32], -1);  permute_2 = None
        getitem: f32[8, 4, 144, 16] = split_with_sizes[0]
        getitem_1: f32[8, 4, 144, 32] = split_with_sizes[1];  split_with_sizes = None
        permute_3: f32[8, 4, 16, 144] = torch.ops.aten.permute.default(getitem, [0, 1, 3, 2]);  getitem = None
        expand_1: f32[8, 4, 16, 144] = torch.ops.aten.expand.default(permute_3, [8, 4, 16, 144]);  permute_3 = None
        clone_3: f32[8, 4, 16, 144] = torch.ops.aten.clone.default(expand_1, memory_format = torch.contiguous_format);  expand_1 = None
        return clone_3

model = Model().eval()
opt_model = torch._dynamo.optimize('inductor')(model)
x = torch.randn(1, 384, 20, 20).to(memory_format=torch.channels_last)

ref = model(x)

with torch.no_grad():
    for i in range(3):
        out = opt_model(x)

print(torch.equal(ref, out))
```

The generated code before this PR is:

```
from ctypes import c_void_p, c_long
import torch
import random
from torch import empty_strided, as_strided, device
from torch._inductor.codecache import AsyncCompile
from torch._inductor.select_algorithm import extern_kernels

aten = torch.ops.aten
assert_size_stride = torch._C._dynamo.guards.assert_size_stride
async_compile = AsyncCompile()

kernel_cpp_0 = async_compile.cpp('''
#include "/tmp/torchinductor_xiaobing/ni/cniims6nap7c5wars7cmtbjr3mw6b5cxyoyxmsu7ro2l5fkrwatl.h"
extern "C" void kernel(const float* __restrict__ in_ptr0,
                       float* __restrict__ out_ptr0)
{
    {
        #pragma GCC ivdep
        for(long i0=0; i0<8; i0+=1)
        {
            #pragma GCC ivdep
            for(long i1=0; i1<4; i1+=1)
            {
                #pragma GCC ivdep
                for(long i2=0; i2<1; i2+=1)
                {
                    #pragma GCC ivdep
                    for(long i3=0; i3<9; i3+=1)
                    {
                        float tmp0[16*16] __attribute__ ((aligned (16)));
                        at::vec::transpose_mxn<float,16,16>(in_ptr0 + (16*i2) + (48*i0) + (384*((16*i3) % 12)) + (3072*(i1 % 2)) + (7680*(((4*i3) / 3))) + (61440*(i1 / 2)), ((-7680)*(i3 / 12)) + ((-384)*(i3 % 12)) + (384*((1 + i3) % 12)) + (7680*(((1 + i3) / 12))), tmp0, 16);
                        for (long i2_inner = 0; i2_inner < 16; i2_inner++)
                        {
                            auto tmp1 = at::vec::Vectorized<float>::loadu(tmp0 + 16*i2_inner);
                            tmp1.store(out_ptr0 + (16*i3) + (144*i2_inner) + (2304*i1) + (2304*i2) + (9216*i0));
                        }
                    }
                    #pragma GCC ivdep
                    for(long i3=144; i3<144; i3+=1)
                    {
                        for (long i2_inner = 0; i2_inner < 16; i2_inner++)
                        {
                            auto tmp0 = in_ptr0[i2_inner + (16*i2) + (48*i0) + (384*(i3 % 12)) + (3072*(i1 % 2)) + (7680*(i3 / 12)) + (61440*(i1 / 2))];
                            out_ptr0[i3 + (144*i2_inner) + (2304*i1) + (2304*i2) + (9216*i0)] = tmp0;
                        }
                    }
                }
                #pragma GCC ivdep
                for(long i2=16; i2<16; i2+=1)
                {
                    #pragma GCC ivdep
                    for(long i3=0; i3<144; i3+=1)
                    {
                        auto tmp0 = in_ptr0[i2 + (48*i0) + (384*(i3 % 12)) + (3072*(i1 % 2)) + (7680*(i3 / 12)) + (61440*(i1 / 2))];
                        out_ptr0[i3 + (144*i2) + (2304*i1) + (9216*i0)] = tmp0;
                    }
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, = args
    args.clear()
    buf0 = empty_strided((8, 4, 16, 144), (9216, 2304, 144, 1), device='cpu', dtype=torch.float32)
    kernel_cpp_0(c_void_p(arg0_1.data_ptr()), c_void_p(buf0.data_ptr()))
    del arg0_1
    return (buf0, )
```

After:

```
from ctypes import c_void_p, c_long
import torch
import random
from torch import empty_strided, as_strided, device
from torch._inductor.codecache import AsyncCompile
from torch._inductor.select_algorithm import extern_kernels

aten = torch.ops.aten
assert_size_stride = torch._C._dynamo.guards.assert_size_stride
async_compile = AsyncCompile()

kernel_cpp_0 = async_compile.cpp('''
#include "/tmp/torchinductor_xiaobing/dm/cdmaihqxwe73zkb3he2zizktpq5uujetg2db26c3r4lgsmlx3b4c.h"
extern "C" void kernel(const float* __restrict__ in_ptr0,
                       float* __restrict__ out_ptr0)
{
    {
        #pragma GCC ivdep
        for(long i0=0; i0<8; i0+=1)
        {
            #pragma GCC ivdep
            for(long i1=0; i1<4; i1+=1)
            {
                #pragma GCC ivdep
                for(long i2=0; i2<16; i2+=1)
                {
                    #pragma GCC ivdep
                    for(long i3=0; i3<144; i3+=1)
                    {
                        auto tmp0 = in_ptr0[i2 + (48*i0) + (384*(i3 % 12)) + (3072*(i1 % 2)) + (7680*(i3 / 12)) + (61440*(i1 / 2))];
                        out_ptr0[i3 + (144*i2) + (2304*i1) + (9216*i0)] = tmp0;
                    }
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, = args
    args.clear()
    buf0 = empty_strided((8, 4, 16, 144), (9216, 2304, 144, 1), device='cpu', dtype=torch.float32)
    kernel_cpp_0(c_void_p(arg0_1.data_ptr()), c_void_p(buf0.data_ptr()))
    del arg0_1
    return (buf0, )

if __name__ == "__main__":
    from torch._dynamo.testing import rand_strided
    from torch._inductor.utils import print_performance
    arg0_1 = rand_strided((1, 384, 20, 20), (153600, 1, 7680, 384), device='cpu', dtype=torch.float32)
    print_performance(lambda: call([arg0_1]))

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94493
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/EikanWang
2023-02-10 09:04:45 +00:00
Wang, Eikan
1767026d1e Abstract the optimization context information as a dedicated class to better organize the code (#92057)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92057
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/jansel
2023-02-08 08:25:22 +00:00
Wang, Eikan
88ef4739b2 Check the semantic of loading the mask value (#91755)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91755
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-02-08 02:34:22 +00:00
Jiong Gong
a28a062938 [Inductor] Fix CPU vectorized implementation of mask calculation that breaks torch.where (#93922)
Fix https://github.com/pytorch/pytorch/issues/93374

The cause of the issue is that the original vectorized float mask calculation doesn't consider the broadcast case. This PR adds the support.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93922
Approved by: https://github.com/XiaobingSuper, https://github.com/desertfire, https://github.com/jansel
2023-02-07 11:30:21 +00:00
Wang, Eikan
9895c19a7a To vectorize long datatype as mask index (#91076)
In this PR, we record the fx node currently being executed to cache additional information and simplify the vectorization checker. In addition, we support `masked` in this PR by simplifying it to a `mask_load`, which enables `max_pool2d`.
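
A sketch of the enabled pattern (a hypothetical example; the pooling parameters are illustrative):

```python
import torch
import torch.nn.functional as F
import torch._dynamo

def fn(x):
    # max_pool2d with padding lowers to masked loads over the pooling window,
    # exercising the long-dtype mask index path this PR vectorizes
    return F.max_pool2d(x, kernel_size=3, stride=2, padding=1)

x = torch.randn(1, 16, 56, 56)
opt_fn = torch._dynamo.optimize("inductor")(fn)
opt_fn(x)
```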

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91076
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/jansel
2023-02-05 03:36:22 +00:00
Liao, Xuan
11de399447 [inductor] fix cpu implement of torch.neg (#94035)
Fixes #93380

Fix to maintain the data type after applying neg.
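
A hypothetical repro of the dtype issue (not taken from the linked issue; the dtype is chosen only for illustration):

```python
import torch
import torch._dynamo

def fn(x):
    # the result of neg should keep the input dtype
    return torch.neg(x)

x = torch.randint(-10, 10, (64,), dtype=torch.int32)
opt_fn = torch._dynamo.optimize("inductor")(fn)
assert opt_fn(x).dtype == x.dtype
```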

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94035
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/jansel
2023-02-04 03:13:11 +00:00
Peter Bell
b7a5c79399 [inductor] Fix type inference in CPU masked operations (#93842)
Fixes #93351

The existing code guesses that `tmp3` is probably a `float`, and so truncates
any `double` values

```cpp
float tmp3 = 0.0;
if(tmp2)
{
    auto tmp4 = in_ptr0[i0];
    tmp3 = tmp4;
}
```

The proposed change is to generate a lambda expression that represents the body
of the masked operation, and infer the type from the return value:
```cpp
auto tmp3 = [&]
{
    auto tmp4 = in_ptr0[i0];
    return tmp4;
}
;
auto tmp5 = tmp2 ? tmp3() : static_cast<decltype(tmp3())>(0.0);
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93842
Approved by: https://github.com/jgong5, https://github.com/Valentine233, https://github.com/jansel
2023-02-02 22:42:19 +00:00
Jason Ansel
45eadc2c4d ConfigModule for _{dynamo,inductor}.config (#93252)
This refactors the way dynamo/inductor configs are handled to check for invalid configs and add options like patching and serialization.
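
For example, temporarily patching an inductor config option might look like this (a hedged sketch; the specific flag is only illustrative):

```python
import torch._inductor.config as inductor_config

# ConfigModule-style patching: the override applies inside the context and is
# restored afterwards; "debug" is just an illustrative flag
with inductor_config.patch(debug=True):
    pass  # compile / run something here
```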

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93252
Approved by: https://github.com/voznesenskym
2023-02-01 19:38:05 +00:00
min-jean-cho
68a40a47a0 [Inductor] Lower aten.tan (#92837)
Related #92047

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92837
Approved by: https://github.com/jgong5, https://github.com/lezcano
2023-01-24 16:35:40 +00:00
Horace He
19c9b09449 Replace IndexingDiv with FloorDiv in Inductor (#92878)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92878
Approved by: https://github.com/ezyang
2023-01-24 15:06:22 +00:00
Peter Bell
5644059489 [inductor] Lower torch.exp2 and use it for torch.pow(2, x) (#92632)
Before
```python
    tmp0 = 2.0
    tmp2 = tl.libdevice.pow(tmp0, tmp1)
```

After
```python
    tmp1 = tl.libdevice.exp2(tmp0)
```

I've benchmarked on CPU and CUDA with the following examples
```
@torch._dynamo.optimize()
def exp2(x):
    return torch.pow(2, x)

@torch._dynamo.optimize()
def logaddexp2(a, b):
    m = torch.maximum(a, b)
    return m + torch.log2(1 + torch.pow(2, -torch.abs(a-b)))
```

On CUDA, Triton is able to specialize `pow(2, x)` so this makes
no difference, but on CPU I see a surprisingly large speedup.

| device | Function  | Master (us) | This PR (us) | Speedup |
|--------|-----------|-------------|--------------|---------|
| CUDA   | exp2      | 64          | 63           | 1.0     |
|        | logaddexp | 109         | 107          | 1.0     |
| CPU    | exp2      | 220         | 40           | 5.5     |
|        | logaddexp | 282         | 140          | 2.0     |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92632
Approved by: https://github.com/lezcano, https://github.com/ngimel
2023-01-20 22:06:23 +00:00
Liao, Xuan
119d5e425c [Inductor] decompose expm1 for CPP vec (#92289)
For the micro-benchmark op `aten.elu.default` in TIMM, performance is not good even with vectorization. `Elu` uses `expm1` as a sub-op. It turns out that Inductor invokes the sleef `expm1` function while aten decomposes it as `exp - 1`, and the former performs worse than the latter. This PR decomposes `expm1` for cpp vectorization to bring the performance back.
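
A sketch of an op that benefits (illustrative; the decomposition itself happens in the cpp vec codegen):

```python
import torch
import torch.nn.functional as F
import torch._dynamo

def fn(x):
    # elu uses expm1 internally; on the cpp vectorized path expm1 is now
    # decomposed as exp(x) - 1 instead of calling the sleef expm1 routine
    return F.elu(x)

x = torch.randn(1024)
opt_fn = torch._dynamo.optimize("inductor")(fn)
opt_fn(x)
```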

Performance data for eager v.s. inductor:

suite | improved_ratio_speedup | speedup_old | RSD(3) | speedup_new | RSD(3)
-- | -- | -- | -- | -- | --
timm | 114.38% | 0.803447768 | 8.39% | 1.722458 | 27.74%


Pull Request resolved: https://github.com/pytorch/pytorch/pull/92289
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-01-20 05:29:32 +00:00
Peter Bell
30f2026863 [inductor] Promote half-precision CPU constants to float (#91224)
Currently `aten.where` can fail with the following C++ compiler error:
```
error: operands to '?:' have different types 'c10::Half' and 'float'
```

This happens because `ops.load` is overridden to cast Half inputs to float, but
`ops.constant` will load a Half without promoting to float.
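
A pattern of the kind that hit this (a hypothetical sketch, not the repro from the PR):

```python
import torch
import torch._dynamo

def fn(x):
    # mixes a loaded half value with a scalar constant inside where; without
    # promoting the constant to float the generated ?: has mismatched types
    return torch.where(x > 0, x, 0.5)

x = torch.randn(64, dtype=torch.half)
opt_fn = torch._dynamo.optimize("inductor")(fn)
opt_fn(x)
```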

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91224
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/ngimel
2023-01-18 01:04:36 +00:00
Jiong Gong
2eaa7a25d0 Fix model accuracy issue caused by vectorized transpose (#92299)
Fix accuracy issues from models: jx_nest_base, cait_m36_384, XLNetLMHeadModel, Super_SloMo
https://github.com/pytorch/torchdynamo/issues/2038
https://github.com/pytorch/torchdynamo/issues/2037
https://github.com/pytorch/torchdynamo/issues/2036
https://github.com/pytorch/torchdynamo/issues/2035

The inner loop list should be newly created in loop.clone().
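
The bug class is plain Python list aliasing; a hedged illustration follows (the names are hypothetical, not the actual Inductor classes):

```python
# hypothetical illustration of the aliasing bug class: clone() must create a
# new inner-loop list rather than sharing the original object's list
class LoopNest:
    def __init__(self, inner=None):
        self.inner = inner if inner is not None else []

    def clone(self):
        # fresh list, not a reference to self.inner
        return LoopNest(inner=list(self.inner))
```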

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92299
Approved by: https://github.com/desertfire
2023-01-17 17:53:45 +00:00