Commit Graph

79 Commits

Author SHA1 Message Date
Jiong Gong
5d62d12557 [Inductor] support transpose vertical reduction in cpp (#97781)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97781
Approved by: https://github.com/jansel
2023-04-03 02:02:15 +00:00
Jiong Gong
bf22ecba2a [Inductor] support vertical reduction in cpp (#97644)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97644
Approved by: https://github.com/jansel
2023-04-03 01:29:12 +00:00
Jiong Gong
8e5f491623 [Inductor] simplify CPP backend Tile2D code and support non-contiguous load/store (#97626)
Remove `CppTile2DTailKernel` and `CppTile2DKernelChecker`, reusing `CppVecKernel` and `CppVecKernelChecker` instead. Add vectorization with fallback for load/store in `CppVecKernel` to cover the non-contiguous load/store needed by `CppTile2DTailKernel`.

This PR also adds functional support for transposed copy of bfloat16 data. Better performance requires vectorized intrinsics implemented for `at::vec::transpose_mxn`. cc @soumith @voznesenskym @penguinwu @anijain2305 @EikanWang @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @desertfire

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97626
Approved by: https://github.com/jansel
2023-04-03 01:11:20 +00:00
chunyuan
004bb34f42 inductor: fix vision_maskrcnn dynamic_shapes error on CPU (#97312)
Fix several C++ compilation errors for `vision_maskrcnn` in dynamic_shapes cases:
1. convert `ceiling` to `std::ceil` in `CppPrinter` (see the sketch after this list):
```bash
error: ‘ceiling’ was not declared in this scope
   17 |                 for(long i1=0; i1<ceiling(1.8735363483429*ks1); i1+=1)
```

2. convert index in `store` to `INDEX_TYPE`:
```bash
error: invalid types ‘float*[double]’ for array subscript
   52 |                         out_ptr0[i2 + (i1*(floor(1.8735363483429*ks2))) + (i0*(std::ceil((1.87353634834290*ks1)))*(floor(1.8735363483429*ks2)))] = tmp30;
```

3. convert loop offset, size, and steps to `INDEX_TYPE`:
```bash
error: invalid controlling predicate
   16 |                 for(long i1=0; i1<std::ceil((1.87353634834290*ks1)); i1+=1)
```

4. convert index in `load` to `INDEX_TYPE`:
```bash
error: invalid types ‘float*[double]’ for array subscript
   64 |                     auto tmp0 = out_ptr0[i1 + (i0*(floor(1.8735363483429*ks2)))];
```
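
A rough sketch of the first and fourth ideas combined (hypothetical helper names, not the actual `CppPrinter` code): map sympy's `ceiling` to `std::ceil` when printing, and cast the printed index expression to an integral `INDEX_TYPE` before it is used as a subscript or loop bound.

```python
# Illustrative only: print a sympy index expression as C++, mapping ceiling()
# to std::ceil and casting the result to an integral index type so it is valid
# as an array subscript or loop bound. INDEX_TYPE and as_index are assumed
# names for this sketch, not the actual torch._inductor code.
import sympy
from sympy import ccode

INDEX_TYPE = "long"

def as_index(expr: sympy.Expr) -> str:
    printed = ccode(expr, user_functions={"ceiling": "std::ceil"})
    # std::ceil returns a floating-point value, so an explicit cast is still
    # needed before the expression can index a float* buffer.
    return f"static_cast<{INDEX_TYPE}>({printed})"

ks1 = sympy.Symbol("ks1", positive=True)
print(as_index(sympy.ceiling(2 * ks1)))
# -> something like: static_cast<long>(std::ceil(2*ks1))
```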

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97312
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/jansel
2023-03-29 10:24:57 +00:00
Jason Ansel
bc86af0d37 Remove DeferredIndentedBuffer (#97616)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97616
Approved by: https://github.com/desertfire
2023-03-28 23:13:41 +00:00
haozhe.zhu
a1ada050f8 do not insert to_dtype for memory copy only buffers (#97147)
Remove redundant `to_dtype` nodes for buffers that are only memory copies, e.g.
`load_bf16 + to_fp32 + to_bf16 + store_bf16` => `load_bf16 + store_bf16`
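
A toy model of the pattern being removed (the helper below is hypothetical, not the actual Inductor pass): when a buffer is only a memory copy, the `to_fp32`/`to_bf16` round trip is a no-op and both conversions can be dropped.

```python
# Illustrative sketch: drop a to(float32) immediately followed by to(bfloat16)
# in a buffer that is otherwise just copied. Not part of torch._inductor.
def elide_dtype_roundtrip(ops):
    """ops: list of (op, dtype) pairs describing one buffer's access chain."""
    out, i = [], 0
    while i < len(ops):
        if (i + 1 < len(ops)
                and ops[i] == ("to_dtype", "float32")
                and ops[i + 1] == ("to_dtype", "bfloat16")):
            i += 2  # skip the redundant round trip
        else:
            out.append(ops[i])
            i += 1
    return out

chain = [("load", "bfloat16"), ("to_dtype", "float32"),
         ("to_dtype", "bfloat16"), ("store", "bfloat16")]
print(elide_dtype_roundtrip(chain))
# [('load', 'bfloat16'), ('store', 'bfloat16')]
```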

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97147
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/jansel
2023-03-27 14:55:41 +00:00
Liao, Xuan
a331cd4314 [inductor] fix cpp legalize bf16 reduction (#97228)
When legalizing bf16 for reductions, operators whose result dtype is torch.int64, such as argmax, currently hit an assertion error. This PR handles the int64 case, enabling several bf16 models (hf_Reformer, doctr_reco_predictor) to run successfully.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97228
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/desertfire
2023-03-23 08:52:25 +00:00
Jiong Gong
35439e8610 [Inductor] add guards to guarantee vector int32 only used by comparison ops (for masked load) (#97144)
Fix https://github.com/pytorch/pytorch/issues/97124 and https://github.com/pytorch/pytorch/issues/97127

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97144
Approved by: https://github.com/EikanWang, https://github.com/jansel
2023-03-23 03:12:50 +00:00
Wang, Eikan
c5d7ed9423 [Inductor] Fix the issue that cannot pass lint check for debug mode (#97249)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97249
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/jansel
2023-03-22 04:25:44 +00:00
Jiong Gong
4733de18fd [Inductor] Add debug logging to explain reasons of disabling vectorization (#97108)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97108
Approved by: https://github.com/EikanWang, https://github.com/jansel
2023-03-22 02:38:34 +00:00
Wang, Eikan
915cbf8208 [Inductor] Eliminate redundant to_dtype node (#96650)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96650
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-03-18 01:51:38 +00:00
Bin Bao
f03db8d6cb [reland2][inductor] Add an AOT compilation mode for Inductor CPP backend (#96520)
Summary: This is a reland of https://github.com/pytorch/pytorch/pull/94822.
Solved the long compilation issue for inductor cpp tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96520
Approved by: https://github.com/huydhn, https://github.com/malfet
2023-03-14 16:10:54 +00:00
Wang, Eikan
bdd09e68e4 [Inductor] Legalize BF16 (#96183)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96183
Approved by: https://github.com/jansel, https://github.com/jgong5
2023-03-14 10:16:15 +00:00
Jiong Gong
190e284bd3 [Inductor] apply vec float mask on logical comparison ops in cpp (#96502)
Fix https://github.com/pytorch/pytorch/issues/96446
The root cause is that the logical comparison op works on the integer vector which is later used in the `where` op that expects a float vector.
1. Make sure a float vec mask is applied to logical comparison ops.
2. Fix the vec int specialization for `to_float_mask`: assume an int mask as input and return the float mask via reinterpret cast.
3. Add a no-op specialization of `to_float_mask` for float vec input.
4. Pass `to_float_mask` arguments by value instead of by reference; passing by value should be efficient enough.
5. Remove a conditional check `!=0` in `masked()` since `to_float_mask` is guaranteed to return a float mask.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96502
Approved by: https://github.com/EikanWang, https://github.com/XiaobingSuper, https://github.com/jansel
2023-03-14 08:47:14 +00:00
XiaobingSuper
279ada515a inductor(cpu): make variable numbering used by the masked vectorization path align with the scalar path (#96510)
Fix https://github.com/pytorch/pytorch/issues/96484. The CPP reduction vectorization path assumes that the variable numbering used in the vectorized path stays aligned with the scalar path, but `masked` currently breaks that assumption and leads to a "variable not defined" error.

before:
```
{
    {
        {
            #pragma omp declare reduction(min:at::vec::Vectorized<float>:omp_out = at::vec::minimum(omp_out, omp_in)) initializer(omp_priv={{std::numeric_limits<float>::infinity()}})
            float tmp7 = std::numeric_limits<float>::infinity();
            auto tmp7_vec = at::vec::Vectorized<float>(tmp7);
            for(long i0=0; i0<0; i0+=1)
            {
                auto tmp5 = at::vec::Vectorized<float>::loadu(in_ptr1 + 16*i0);
                auto tmp0 = at::vec::Vectorized<int>(static_cast<int>(0));
                auto tmp1 = at::vec::Vectorized<int>(static_cast<int>(2));
                auto tmp2 = tmp0 < tmp1;
                auto tmp3 = at::vec::Vectorized<float>(0.0);
                {
                    auto tmp4 = at::vec::Vectorized<float>(in_ptr0[0]);
                    tmp3 = decltype(tmp4)::blendv(tmp3, tmp4, to_float_mask(tmp2) != at::vec::Vectorized<float>(0));
                }
                auto tmp6 = tmp3 + tmp5;
                tmp7_vec = at::vec::minimum(tmp7_vec, tmp6);
            }
            #pragma omp simd simdlen(8)  reduction(min:tmp8)
            for(long i0=0; i0<2; i0+=1)
            {
                auto tmp6 = in_ptr1[i0];
                auto tmp0 = static_cast<long>(0);
                auto tmp1 = static_cast<long>(2);
                auto tmp2 = tmp0 < tmp1;
                auto tmp3 = [&]
                {
                    auto tmp4 = in_ptr0[0];
                    return tmp4;
                }
                ;
                auto tmp5 = tmp2 ? tmp3() : static_cast<decltype(tmp3())>(0.0);
                auto tmp7 = tmp5 + tmp6;
                tmp8 = std::min(tmp8, tmp7);
            }
            tmp7 = std::min(tmp7, at::vec::vec_reduce_all<float>([](at::vec::Vectorized<float>& x, at::vec::Vectorized<float>&y) {return at::vec::minimum(x, y);}, tmp7_vec));
            out_ptr0[0] = tmp8;
        }
    }
}
```
after:

```
{
    {
        {
            #pragma omp declare reduction(min:at::vec::Vectorized<float>:omp_out = at::vec::minimum(omp_out, omp_in)) initializer(omp_priv={{std::numeric_limits<float>::infinity()}})
            float tmp8 = std::numeric_limits<float>::infinity();
            auto tmp8_vec = at::vec::Vectorized<float>(tmp8);
            for(long i0=0; i0<0; i0+=1)
            {
                auto tmp6 = at::vec::Vectorized<float>::loadu(in_ptr1 + 16*i0);
                auto tmp0 = at::vec::Vectorized<int>(static_cast<int>(0));
                auto tmp1 = at::vec::Vectorized<int>(static_cast<int>(2));
                auto tmp2 = tmp0 < tmp1;
                auto tmp3 = [&]
                {
                    auto tmp4 = at::vec::Vectorized<float>(in_ptr0[0]);
                    return tmp4;
                }
                ;
                auto tmp5 = decltype(tmp3())::blendv(at::vec::Vectorized<float>(0.0), tmp3(), to_float_mask(tmp2) != at::vec::Vectorized<float>(0));
                auto tmp7 = tmp5 + tmp6;
                tmp8_vec = at::vec::minimum(tmp8_vec, tmp7);
            }
            #pragma omp simd simdlen(8)  reduction(min:tmp8)
            for(long i0=0; i0<2; i0+=1)
            {
                auto tmp6 = in_ptr1[i0];
                auto tmp0 = static_cast<long>(0);
                auto tmp1 = static_cast<long>(2);
                auto tmp2 = tmp0 < tmp1;
                auto tmp3 = [&]
                {
                    auto tmp4 = in_ptr0[0];
                    return tmp4;
                }
                ;
                auto tmp5 = tmp2 ? tmp3() : static_cast<decltype(tmp3())>(0.0);
                auto tmp7 = tmp5 + tmp6;
                tmp8 = std::min(tmp8, tmp7);
            }
            tmp8 = std::min(tmp8, at::vec::vec_reduce_all<float>([](at::vec::Vectorized<float>& x, at::vec::Vectorized<float>&y) {return at::vec::minimum(x, y);}, tmp8_vec));
            out_ptr0[0] = tmp8;
        }
    }
}

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96510
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/jansel
2023-03-13 09:41:23 +00:00
PyTorch MergeBot
fe05266fda Revert "[reland][inductor] Add an AOT compilation mode for Inductor CPP backend (#95985)"
This reverts commit deaf9e5e65.

Reverted https://github.com/pytorch/pytorch/pull/95985 on behalf of https://github.com/huydhn due to: Sorry for reverting this. It increased the test time significantly for ASAN (and maybe other test shards). ASAN tests on the PR passed but were barely not timing out. I have updated my initial findings in https://github.com/pytorch/pytorch/issues/96378
2023-03-09 01:45:24 +00:00
Bin Bao
deaf9e5e65 [reland][inductor] Add an AOT compilation mode for Inductor CPP backend (#95985)
Summary: This is a reland of https://github.com/pytorch/pytorch/pull/94822

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95985
Approved by: https://github.com/jansel
2023-03-08 20:02:32 +00:00
Jason Ansel
fe4fec37a4 [inductor] Refactor IR printing (#96024)
Reland #95567 part 2. The previous version of this had a bug that the added test triggers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96024
Approved by: https://github.com/ngimel
2023-03-07 02:23:06 +00:00
Liao, Xuan
e168dbb90a [inductor] improve cpp vec implementation of square (#96072)
The current cpp vectorized implementation of `square` is not efficient. It also affects the performance of `batch normalization`, which uses `square` when calculating variance. This PR replaces the power call with a multiplication to gain more performance.
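
A small sketch of the idea (the helper names are made up for illustration; the real change is in the CPP vec codegen): emit a multiply instead of a power call when the exponent is 2.

```python
# Illustrative only: specialize pow(x, 2) to x * x in the emitted C++ text.
def square_expr(x: str) -> str:
    return f"{x} * {x}"          # avoids calling a pow routine

def pow_expr(x: str, exp: int) -> str:
    if exp == 2:
        return square_expr(x)    # the case batch norm's variance hits
    return f"{x}.pow({exp})"

print(pow_expr("tmp0", 2))  # tmp0 * tmp0
```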

Micro-benchmark performance for eager vs. inductor:
op=`aten.native_batch_norm.default`

suite | improvement_0.2 | improvement_0.5 | improvement_0.8 | current_speedup_0.2 | new_speedup_0.2 | current_speedup_0.5 | new_speedup_0.5 | current_speedup_0.8 | new_speedup_0.8
-- | -- | -- | -- | -- | -- | -- | -- | -- | --
torchbench | 8.82% | 5.53% | 32.19% | 0.608006834 | 0.661613139 | 0.691743711 | 0.729987622 | 0.76176223 | 1.00694842
timm | 59.30% | 63.01% | 94.77% | 0.650648524 | 1.036498047 | 0.676425152 | 1.102667387 | 0.695693384 | 1.354992423

Model training performance for eager vs. inductor:

model | improvement | current_speedup | new_speedup
-- | -- | -- | --
lcnet_050 multi-thread | 5.16% | 1.046 | 1.1
lcnet_050 single-thread | 21.81% | 0.94 | 1.145
mobilenet_v2 multi-thread | 3.88% | 1.135 | 1.179
mobilenet_v2 single-thread | 37.46% | 0.929 | 1.277

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96072
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/desertfire
2023-03-07 01:13:39 +00:00
zhuhaozhe
ebaf9af76e use float to init reduction value (#95452)
Fixes https://github.com/pytorch/pytorch/issues/95195, https://github.com/pytorch/pytorch/issues/95185

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95452
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-03-06 08:49:36 +00:00
Jason Ansel
43dd043ea7 Revert "[inductor] Improve error messages (#95567)" (#96014)
This reverts commit 62b775583f.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96014
Approved by: https://github.com/Chillee
2023-03-04 04:03:31 +00:00
Nikita Karetnikov
feffcafe09 [inductor] use FP div in CPP expr printer (#95698)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95698
Approved by: https://github.com/ezyang, https://github.com/jgong5
2023-03-03 20:38:18 +00:00
PyTorch MergeBot
879400e4e8 Revert "[inductor] Add an AOT compilation mode for Inductor CPP backend (#94822)"
This reverts commit 73b66098b2.

Reverted https://github.com/pytorch/pytorch/pull/94822 on behalf of https://github.com/clee2000 due to broke inductor_tmm_cpu_accuracy, 73b66098b2 (11745396725)
2023-03-03 17:33:27 +00:00
Bin Bao
73b66098b2 [inductor] Add an AOT compilation mode for Inductor CPP backend (#94822)
Summary: The AOT mode currently works for the CPP backend. When turned on, Inductor compiles the model code into a .so file with aot_inductor_entry as the entry function. If the AOT compilation fails, Inductor will explicitly fail.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94822
Approved by: https://github.com/jansel
2023-03-03 14:18:09 +00:00
Jason Ansel
62b775583f [inductor] Improve error messages (#95567)
Example error message before/after (710 to 131 lines):
https://gist.github.com/jansel/6fecad057738089fa95bf08c3de9fc8a

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95567
Approved by: https://github.com/mlazos
2023-03-02 02:20:55 +00:00
Wang, Eikan
9da903f180 [Inductor] Fix the logical_and/logical_or vectorization issue (#95609)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95609
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-03-01 08:21:57 +00:00
Wang, Eikan
c1f5e50fd1 [Inductor] Vectorize channels-last adaptive_avg_pool2d (#95608)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95608
Approved by: https://github.com/jansel
2023-03-01 08:21:57 +00:00
Kazuaki Ishizaki
9a4cb9bcaf Fix typos under torch/_inductor directory (#95601)
This PR fixes typos in comments and messages of `.py` files under the `torch/_inductor` directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95601
Approved by: https://github.com/ezyang
2023-02-27 19:00:17 +00:00
XiaobingSuper
4846d52212 inductor: fix compiler error when trying to vectorize logical_and and logical_or (#95361)
Currently, `operator&&` and `operator||` don't have vectorized implementations, so disable them for now as a quick fix for the 2.0 release.
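
A tiny sketch of the shape of such a quick fix (hypothetical names, not the actual `CppVecKernelChecker` code): the checker declines to vectorize kernels that use these ops, so they fall back to the scalar path.

```python
# Hypothetical illustration: refuse to vectorize kernels containing ops that
# have no at::vec implementation yet; they use the scalar C++ kernel instead.
UNVECTORIZABLE_OPS = {"logical_and", "logical_or"}

def can_vectorize(ops_used):
    return not (set(ops_used) & UNVECTORIZABLE_OPS)

print(can_vectorize(["load", "add", "store"]))          # True
print(can_vectorize(["load", "logical_and", "store"]))  # False -> scalar path
```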

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95361
Approved by: https://github.com/ngimel, https://github.com/EikanWang
2023-02-24 02:30:13 +00:00
XiaobingSuper
1d7133c542 inductor(cpu): fix C++ compile error when sigmoid's post ops is a reduction op (#94890)
For the timm **nfnet_l0** model, the CPU path fails with the following error: `torch._dynamo.exc.BackendCompilerFailed: inductor raised CppCompileError: C++ compile error`.

Here is a simple test case:

```
def fn(x):
    x = torch.ops.aten.sigmoid.default(x)
    return torch.ops.aten.mean.dim(x, [-1, -2], True)

x = torch.randn((1, 8, 8, 8))
opt_fn = torch._dynamo.optimize("inductor")(fn)
opt_fn(x)

real_out = fn(x)
compiled_out = opt_fn(x)
tol = 0.0001
print(torch.allclose(real_out, compiled_out, atol=tol, rtol=tol))

```

before:

```
extern "C" void kernel(float* __restrict__ in_out_ptr0,
                       const float* __restrict__ in_ptr0)
{
    auto out_ptr0 = in_out_ptr0;
    {
        #pragma GCC ivdep
        for(long i0=0; i0<8; i0+=1)
        {
            {
                #pragma omp declare reduction(+:at::vec::Vectorized<float>:omp_out += omp_in) initializer(omp_priv={{0}})
                float tmp2 = 0;
                auto tmp2_vec = at::vec::Vectorized<float>(tmp2);
                for(long i1=0; i1<4; i1+=1)
                {
                    auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + (16*i1) + (64*i0));
                    auto tmp1 = decltype(tmp0)(1)/(decltype(tmp0)(1) + tmp0.neg().exp());
                    tmp2_vec += tmp1;
                }
                #pragma omp simd simdlen(8)  reduction(+:tmp3)
                for(long i1=64; i1<64; i1+=1)
                {
                    auto tmp0 = in_ptr0[i1 + (64*i0)];
                    auto tmp1 = std::exp(-tmp0);
                    auto tmp2 = 1 / (1 + tmp1);
                    tmp3 += tmp2;
                }
                tmp2 += at::vec::vec_reduce_all<float>([](at::vec::Vectorized<float>& x, at::vec::Vectorized<float>&y) {return x + y;}, tmp2_vec);
                out_ptr0[i0] = tmp3;
            }
        }
    }
    {
        for(long i0=0; i0<0; i0+=1)
        {
            auto tmp0 = at::vec::Vectorized<float>::loadu(out_ptr0 + 16*i0);
            auto tmp1 = at::vec::Vectorized<float>(static_cast<float>(64));
            auto tmp2 = tmp0 / tmp1;
            tmp2.store(in_out_ptr0 + 16*i0);
        }
        #pragma omp simd simdlen(8)
        for(long i0=0; i0<8; i0+=1)
        {
            auto tmp0 = out_ptr0[i0];
            auto tmp1 = static_cast<float>(64);
            auto tmp2 = tmp0 / tmp1;
            in_out_ptr0[i0] = tmp2;
        }
    }
}
```

after:
```
extern "C" void kernel(float* __restrict__ in_out_ptr0,
                       const float* __restrict__ in_ptr0)
{
    auto out_ptr0 = in_out_ptr0;
    #pragma omp parallel num_threads(40)
    {
        {
            #pragma omp for
            for(long i0=0; i0<8; i0+=1)
            {
                {
                    #pragma omp declare reduction(+:at::vec::Vectorized<float>:omp_out += omp_in) initializer(omp_priv={{0}})
                    float tmp2 = 0;
                    auto tmp2_vec = at::vec::Vectorized<float>(tmp2);
                    for(long i1=0; i1<4; i1+=1)
                    {
                        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + (16*i1) + (64*i0));
                        auto tmp1 = decltype(tmp0)(1)/(decltype(tmp0)(1) + tmp0.neg().exp());
                        tmp2_vec += tmp1;
                    }
                    #pragma omp simd simdlen(8)  reduction(+:tmp2)
                    for(long i1=64; i1<64; i1+=1)
                    {
                        auto tmp0 = in_ptr0[i1 + (64*i0)];
                        auto tmp1 = decltype(tmp0)(1) / (decltype(tmp0)(1) + std::exp(-tmp0));
                        tmp2 += tmp1;
                    }
                    tmp2 += at::vec::vec_reduce_all<float>([](at::vec::Vectorized<float>& x, at::vec::Vectorized<float>&y) {return x + y;}, tmp2_vec);
                    out_ptr0[i0] = tmp2;
                }
            }
        }
        #pragma omp single
        {
            {
                for(long i0=0; i0<0; i0+=1)
                {
                    auto tmp0 = at::vec::Vectorized<float>::loadu(out_ptr0 + 16*i0);
                    auto tmp1 = at::vec::Vectorized<float>(static_cast<float>(64));
                    auto tmp2 = tmp0 / tmp1;
                    tmp2.store(in_out_ptr0 + 16*i0);
                }
                #pragma omp simd simdlen(8)
                for(long i0=0; i0<8; i0+=1)
                {
                    auto tmp0 = out_ptr0[i0];
                    auto tmp1 = static_cast<float>(64);
                    auto tmp2 = tmp0 / tmp1;
                    in_out_ptr0[i0] = tmp2;
                }
            }
        }
    }
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94890
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/lezcano
2023-02-15 17:13:45 +00:00
Fabio Rocha
1dbaa5c290 Use decompositions for some fallbacks introduced in #94039 (#94206)
In some cases, this implements the required inductor primitives.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94206
Approved by: https://github.com/jansel, https://github.com/ngimel
2023-02-14 09:31:30 +00:00
Xuehai Pan
5b1cedacde [BE] [2/3] Rewrite super() calls in functorch and torch (#94588)
Rewrite Python built-in class `super()` calls. Only non-semantic changes should be applied.

- #94587
- #94588
- #94592

Also, methods with only a `super()` call are removed:

```diff
class MyModule(nn.Module):
-   def __init__(self):
-       super().__init__()
-
    def forward(self, ...):
        ...
```

Cases where the rewrite would change semantics are left unchanged, e.g.:

f152a79be9/caffe2/python/net_printer.py (L184-L190)

f152a79be9/test/test_jit_fuser_te.py (L2628-L2635)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94588
Approved by: https://github.com/ezyang, https://github.com/albanD
2023-02-10 21:16:33 +00:00
XiaobingSuper
02b8a7f473 inductor: don't do transpose vectorization if input ld depends on most inner var (#94493)
Fixed https://github.com/pytorch/pytorch/issues/94269.

For the following case:

```
import torch
import torchvision
#import intel_extension_for_pytorch

import torch._dynamo
from torch._inductor import config

class Model(torch.nn.Module):
    def __init__(self):
        super(Model, self).__init__()

    def forward(self, x):
        constant_pad_nd = x
        # File: /home/xiaobing/miniconda3/envs/pytorch_te_binary/lib/python3.8/site-packages/timm/models/layers/halo_attn.py:195, code: kv = kv.unfold(2, self.win_size, self.block_size).unfold(3, self.win_size, self.block_size)
        as_strided: f32[1, 384, 2, 20, 12] = torch.ops.aten.as_strided.default(constant_pad_nd, [1, 384, 2, 20, 12], [153600, 1, 61440, 384, 7680]);  constant_pad_nd = None
        as_strided_1: f32[1, 384, 2, 2, 12, 12] = torch.ops.aten.as_strided.default(as_strided, [1, 384, 2, 2, 12, 12], [153600, 1, 61440, 3072, 7680, 384]);  as_strided = None

        # File: /home/xiaobing/miniconda3/envs/pytorch_te_binary/lib/python3.8/site-packages/timm/models/layers/halo_attn.py:197, code: kv = kv.reshape(
        clone_1: f32[1, 384, 2, 2, 12, 12] = torch.ops.aten.clone.default(as_strided_1, memory_format = torch.contiguous_format);  as_strided_1 = None
        _unsafe_view_1: f32[8, 48, 4, 144] = torch.ops.aten._unsafe_view.default(clone_1, [8, 48, 4, 144]);  clone_1 = None
        permute_2: f32[8, 4, 144, 48] = torch.ops.aten.permute.default(_unsafe_view_1, [0, 2, 3, 1]);  _unsafe_view_1 = None
        # File: /home/xiaobing/miniconda3/envs/pytorch_te_binary/lib/python3.8/site-packages/timm/models/layers/halo_attn.py:202, code: k, v = torch.split(kv, [self.dim_head_qk, self.dim_head_v], dim=-1)
        split_with_sizes = torch.ops.aten.split_with_sizes.default(permute_2, [16, 32], -1);  permute_2 = None
        getitem: f32[8, 4, 144, 16] = split_with_sizes[0]
        getitem_1: f32[8, 4, 144, 32] = split_with_sizes[1];  split_with_sizes = None
        permute_3: f32[8, 4, 16, 144] = torch.ops.aten.permute.default(getitem, [0, 1, 3, 2]);  getitem = None
        expand_1: f32[8, 4, 16, 144] = torch.ops.aten.expand.default(permute_3, [8, 4, 16, 144]);  permute_3 = None
        clone_3: f32[8, 4, 16, 144] = torch.ops.aten.clone.default(expand_1, memory_format = torch.contiguous_format);  expand_1 = None
        return clone_3

model = Model().eval()
opt_model = torch._dynamo.optimize('inductor')(model)
x = torch.randn(1, 384, 20, 20).to(memory_format=torch.channels_last)

ref = model(x)

with torch.no_grad():
    for i in range(3):
        out = opt_model(x)

print(torch.equal(ref, out))
```

The generated code before this PR is:

```
from ctypes import c_void_p, c_long
import torch
import random
from torch import empty_strided, as_strided, device
from torch._inductor.codecache import AsyncCompile
from torch._inductor.select_algorithm import extern_kernels

aten = torch.ops.aten
assert_size_stride = torch._C._dynamo.guards.assert_size_stride
async_compile = AsyncCompile()

kernel_cpp_0 = async_compile.cpp('''
#include "/tmp/torchinductor_xiaobing/ni/cniims6nap7c5wars7cmtbjr3mw6b5cxyoyxmsu7ro2l5fkrwatl.h"
extern "C" void kernel(const float* __restrict__ in_ptr0,
                       float* __restrict__ out_ptr0)
{
    {
        #pragma GCC ivdep
        for(long i0=0; i0<8; i0+=1)
        {
            #pragma GCC ivdep
            for(long i1=0; i1<4; i1+=1)
            {
                #pragma GCC ivdep
                for(long i2=0; i2<1; i2+=1)
                {
                    #pragma GCC ivdep
                    for(long i3=0; i3<9; i3+=1)
                    {
                        float tmp0[16*16] __attribute__ ((aligned (16)));
                        at::vec::transpose_mxn<float,16,16>(in_ptr0 + (16*i2) + (48*i0) + (384*((16*i3) % 12)) + (3072*(i1 % 2)) + (7680*(((4*i3) / 3))) + (61440*(i1 / 2)), ((-7680)*(i3 / 12)) + ((-384)*(i3 % 12)) + (384*((1 + i3) % 12)) + (7680*(((1 + i3) / 12))), tmp0, 16);
                        for (long i2_inner = 0; i2_inner < 16; i2_inner++)
                        {
                            auto tmp1 = at::vec::Vectorized<float>::loadu(tmp0 + 16*i2_inner);
                            tmp1.store(out_ptr0 + (16*i3) + (144*i2_inner) + (2304*i1) + (2304*i2) + (9216*i0));
                        }
                    }
                    #pragma GCC ivdep
                    for(long i3=144; i3<144; i3+=1)
                    {
                        for (long i2_inner = 0; i2_inner < 16; i2_inner++)
                        {
                            auto tmp0 = in_ptr0[i2_inner + (16*i2) + (48*i0) + (384*(i3 % 12)) + (3072*(i1 % 2)) + (7680*(i3 / 12)) + (61440*(i1 / 2))];
                            out_ptr0[i3 + (144*i2_inner) + (2304*i1) + (2304*i2) + (9216*i0)] = tmp0;
                        }
                    }
                }
                #pragma GCC ivdep
                for(long i2=16; i2<16; i2+=1)
                {
                    #pragma GCC ivdep
                    for(long i3=0; i3<144; i3+=1)
                    {
                        auto tmp0 = in_ptr0[i2 + (48*i0) + (384*(i3 % 12)) + (3072*(i1 % 2)) + (7680*(i3 / 12)) + (61440*(i1 / 2))];
                        out_ptr0[i3 + (144*i2) + (2304*i1) + (9216*i0)] = tmp0;
                    }
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, = args
    args.clear()
    buf0 = empty_strided((8, 4, 16, 144), (9216, 2304, 144, 1), device='cpu', dtype=torch.float32)
    kernel_cpp_0(c_void_p(arg0_1.data_ptr()), c_void_p(buf0.data_ptr()))
    del arg0_1
    return (buf0, )
```

After:

```
from ctypes import c_void_p, c_long
import torch
import random
from torch import empty_strided, as_strided, device
from torch._inductor.codecache import AsyncCompile
from torch._inductor.select_algorithm import extern_kernels

aten = torch.ops.aten
assert_size_stride = torch._C._dynamo.guards.assert_size_stride
async_compile = AsyncCompile()

kernel_cpp_0 = async_compile.cpp('''
#include "/tmp/torchinductor_xiaobing/dm/cdmaihqxwe73zkb3he2zizktpq5uujetg2db26c3r4lgsmlx3b4c.h"
extern "C" void kernel(const float* __restrict__ in_ptr0,
                       float* __restrict__ out_ptr0)
{
    {
        #pragma GCC ivdep
        for(long i0=0; i0<8; i0+=1)
        {
            #pragma GCC ivdep
            for(long i1=0; i1<4; i1+=1)
            {
                #pragma GCC ivdep
                for(long i2=0; i2<16; i2+=1)
                {
                    #pragma GCC ivdep
                    for(long i3=0; i3<144; i3+=1)
                    {
                        auto tmp0 = in_ptr0[i2 + (48*i0) + (384*(i3 % 12)) + (3072*(i1 % 2)) + (7680*(i3 / 12)) + (61440*(i1 / 2))];
                        out_ptr0[i3 + (144*i2) + (2304*i1) + (9216*i0)] = tmp0;
                    }
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, = args
    args.clear()
    buf0 = empty_strided((8, 4, 16, 144), (9216, 2304, 144, 1), device='cpu', dtype=torch.float32)
    kernel_cpp_0(c_void_p(arg0_1.data_ptr()), c_void_p(buf0.data_ptr()))
    del arg0_1
    return (buf0, )

if __name__ == "__main__":
    from torch._dynamo.testing import rand_strided
    from torch._inductor.utils import print_performance
    arg0_1 = rand_strided((1, 384, 20, 20), (153600, 1, 7680, 384), device='cpu', dtype=torch.float32)
    print_performance(lambda: call([arg0_1]))

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94493
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/EikanWang
2023-02-10 09:04:45 +00:00
Wang, Eikan
1767026d1e Abstract the optimization context information as a dedicated class to better organize the code (#92057)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92057
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/jansel
2023-02-08 08:25:22 +00:00
Wang, Eikan
88ef4739b2 Check the semantic of loading the mask value (#91755)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91755
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-02-08 02:34:22 +00:00
Jiong Gong
a28a062938 [Inductor] Fix CPU vectorized implementation of mask calculation that breaks torch.where (#93922)
Fix https://github.com/pytorch/pytorch/issues/93374

The cause of the issue is that the original vectorized float mask calculation doesn't consider the broadcast case. This PR adds support for it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93922
Approved by: https://github.com/XiaobingSuper, https://github.com/desertfire, https://github.com/jansel
2023-02-07 11:30:21 +00:00
Wang, Eikan
9895c19a7a To vectorize long datatype as mask index (#91076)
In this PR, we record the current fx node being executed to cache additional information and simplify the vectorization checker. In addition, we support `masked` by simplifying it as `mask_load`, which enables `max_pool2d`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91076
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/jansel
2023-02-05 03:36:22 +00:00
Liao, Xuan
11de399447 [inductor] fix cpu implementation of torch.neg (#94035)
Fixes #93380

The fix maintains the data type after applying `neg`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94035
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/jansel
2023-02-04 03:13:11 +00:00
Peter Bell
b7a5c79399 [inductor] Fix type inference in CPU masked operations (#93842)
Fixes #93351

The existing code guesses that `tmp3` is probably a `float`, and so truncates
any `double` values

```cpp
float tmp3 = 0.0;
if(tmp2)
{
    auto tmp4 = in_ptr0[i0];
    tmp3 = tmp4;
}
```

The proposed change is to generate a lambda expression that represents the body
of the masked operation, and infer the type from the return value:
```cpp
auto tmp3 = [&]
{
    auto tmp4 = in_ptr0[i0];
    return tmp4;
}
;
auto tmp5 = tmp2 ? tmp3() : static_cast<decltype(tmp3())>(0.0);
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93842
Approved by: https://github.com/jgong5, https://github.com/Valentine233, https://github.com/jansel
2023-02-02 22:42:19 +00:00
Jason Ansel
45eadc2c4d ConfigModule for _{dynamo,inductor}.config (#93252)
This refactors the way dynamo/inductor configs are handled to check for invalid configs and add options like patching and serialization.
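
A compact sketch of what such a config wrapper can enable (names and fields are made up, not the real `torch._dynamo.config`/`torch._inductor.config` surface): unknown options are rejected, and values can be patched temporarily or serialized.

```python
# Illustrative ConfigModule-like wrapper: validates option names, supports a
# patch() context manager, and can serialize its state. Not the real API.
import contextlib
import json

class ConfigSketch:
    _defaults = {"debug": False, "cpp_threads": 8}  # hypothetical options

    def __init__(self):
        object.__setattr__(self, "_values", dict(self._defaults))

    def __setattr__(self, name, value):
        if name not in self._defaults:
            raise AttributeError(f"unknown config option: {name}")
        self._values[name] = value

    def __getattr__(self, name):
        if name in self._values:
            return self._values[name]
        raise AttributeError(name)

    @contextlib.contextmanager
    def patch(self, **overrides):
        saved = dict(self._values)
        try:
            for key, value in overrides.items():
                setattr(self, key, value)
            yield
        finally:
            object.__setattr__(self, "_values", saved)

    def to_json(self) -> str:
        return json.dumps(self._values)

config = ConfigSketch()
with config.patch(debug=True):
    print(config.debug)    # True inside the patch
print(config.debug)        # False again afterwards
print(config.to_json())    # {"debug": false, "cpp_threads": 8}
```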

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93252
Approved by: https://github.com/voznesenskym
2023-02-01 19:38:05 +00:00
min-jean-cho
68a40a47a0 [Inductor] Lower aten.tan (#92837)
Related #92047

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92837
Approved by: https://github.com/jgong5, https://github.com/lezcano
2023-01-24 16:35:40 +00:00
Horace He
19c9b09449 Replace IndexingDiv with FloorDiv in Inductor (#92878)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92878
Approved by: https://github.com/ezyang
2023-01-24 15:06:22 +00:00
Peter Bell
5644059489 [inductor] Lower torch.exp2 and use it for torch.pow(2, x) (#92632)
Before
```python
    tmp0 = 2.0
    tmp2 = tl.libdevice.pow(tmp0, tmp1)
```

After
```python
    tmp1 = tl.libdevice.exp2(tmp0)
```

I've benchmarked on CPU and CUDA with the following examples
```
@torch._dynamo.optimize()
def exp2(x):
    return torch.pow(2, x)

@torch._dynamo.optimize()
def logaddexp2(a, b):
    m = torch.maximum(a, b)
    return m + torch.log2(1 + torch.pow(2, -torch.abs(a-b)))
```

triton is able to specialize `pow(2, x)` such that this makes
no difference, but on CPU I see a surprisingly large speedup.

| device | Function  | Master (us) | This PR (us) | Speedup |
|--------|-----------|-------------|--------------|---------|
| CUDA   | exp2      | 64          | 63           | 1.0     |
|        | logaddexp | 109         | 107          | 1.0     |
| CPU    | exp2      | 220         | 40           | 5.5     |
|        | logaddexp | 282         | 140          | 2.0     |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92632
Approved by: https://github.com/lezcano, https://github.com/ngimel
2023-01-20 22:06:23 +00:00
Liao, Xuan
119d5e425c [Inductor] decompose expm1 for CPP vec (#92289)
For the micro-bench op `aten.elu.default` in TIMM, performance is not good even with vectorization. `Elu` uses `expm1` as a sub-op. It turns out that inductor invokes the Sleef `expm1` function, while aten decomposes it as `exp - 1`; the former performs worse than the latter. This PR decomposes `expm1` for cpp vectorization to bring the performance back.
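
The direction of the decomposition, shown with public torch ops (the actual PR changes the CPP vectorized codegen rather than adding a Python function like this):

```python
# Sketch of the decomposition described above: expm1(x) -> exp(x) - 1.
import torch

def expm1_decomposed(x: torch.Tensor) -> torch.Tensor:
    return torch.exp(x) - 1.0

x = torch.randn(1024)
print(torch.allclose(expm1_decomposed(x), torch.expm1(x), atol=1e-6))  # True
```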

Performance data for eager vs. inductor:

suite | improved_ratio_speedup | speedup_old | RSD(3) | speedup_new | RSD(3)
-- | -- | -- | -- | -- | --
timm | 114.38% | 0.803447768 | 8.39% | 1.722458 | 27.74%

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92289
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-01-20 05:29:32 +00:00
Peter Bell
30f2026863 [inductor] Promote half-precision CPU constants to float (#91224)
Currently `aten.where` can fail with the following C++ compiler error:
```
error: operands to '?:' have different types 'c10::Half' and 'float'
```

This happens because `ops.load` is overridden to cast Half inputs to float, but
`ops.constant` will load a Half without promoting to float.
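
A minimal sketch of the intended behavior (hypothetical helper, not the actual `ops.constant` override): promote half-precision constant dtypes to float so they match what `ops.load` produces.

```python
# Hypothetical illustration: float16 constants are emitted as float so the
# generated '?:' expression sees matching operand types.
import torch

def promoted_constant_dtype(dtype: torch.dtype) -> torch.dtype:
    # assumption for this sketch: only float16 needs the promotion here
    if dtype == torch.float16:
        return torch.float32
    return dtype

print(promoted_constant_dtype(torch.float16))  # torch.float32
print(promoted_constant_dtype(torch.float64))  # torch.float64
```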

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91224
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/ngimel
2023-01-18 01:04:36 +00:00
Jiong Gong
2eaa7a25d0 Fix model accuracy issue caused by vectorized transpose (#92299)
Fix accuracy issues from models: jx_nest_base, cait_m36_384, XLNetLMHeadModel, Super_SloMo
https://github.com/pytorch/torchdynamo/issues/2038
https://github.com/pytorch/torchdynamo/issues/2037
https://github.com/pytorch/torchdynamo/issues/2036
https://github.com/pytorch/torchdynamo/issues/2035

The inner loop list should be newly created in loop.clone().

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92299
Approved by: https://github.com/desertfire
2023-01-17 17:53:45 +00:00
Jiong Gong
7c641eaaf0 [Inductor] Support vectorized transpose in CPP backend (#91532)
Fix https://github.com/pytorch/torchdynamo/issues/1915
This PR adds the vectorization support for transposed operations in TorchInductor CPP backend. It contains the following changes:
1. `CppTile2DKernelChecker` is added to check the eligibility of applying the optimization. We only address a narrow set of situations. All of the following conditions should be met: 1) There exists one and only one fp32 load/store with the outer loop var having contiguous buffer accesses. 2) When a load/store doesn't have contiguous access in an outer loop var, the access should be vectorizable from the inner-most dim. 3) No reduction. More scenarios/operations will be supported in future PRs.
2. If `CppTile2DKernelChecker` reports the optimization is doable, `CppKernelProxy` would split/tile the loops from both the outer loop var having contiguous buffer access and the inner-most loop var.
3. The main loop split from the outer loop var is further split at the inner-most level and then handled by `CppTile2DKernel` and `CppTile2DTailKernel`, which generate the transposed load/store. The former kernel does the vectorized transposed load/store on tiles and then does vectorized load/store/compute along the inner-most loop axis; the vectorized transpose micro-kernel implementation borrows/refers to that from FBGEMM. The latter kernel simply does scalar operations. (See the rough tiling sketch after this list.)
4. The tail loop split from the outer loop var directly calls `CppKernel` with scalar operations.
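
A rough NumPy sketch of the 2D tiling scheme from item 3 (tile size and helper name are assumptions for illustration; the real kernels emit C++ and use `at::vec::transpose_mxn` for the tile transpose):

```python
# Illustrative only: main loops handle full TILE x TILE blocks (the Tile2D
# kernel), tail loops handle the remainder with plain scalar-style copies.
import numpy as np

def tiled_transpose(src: np.ndarray, tile: int = 16) -> np.ndarray:
    m, n = src.shape
    dst = np.empty((n, m), dtype=src.dtype)
    m_main, n_main = m - m % tile, n - n % tile
    # main loops over full tiles
    for i in range(0, m_main, tile):
        for j in range(0, n_main, tile):
            dst[j:j + tile, i:i + tile] = src[i:i + tile, j:j + tile].T
    # tail rows: remaining i, all columns
    for i in range(m_main, m):
        dst[:, i] = src[i, :]
    # tail columns: remaining j, only the rows covered by the main tiles
    for j in range(n_main, n):
        dst[j, :m_main] = src[:m_main, j]
    return dst

a = np.random.rand(37, 50).astype(np.float32)
print(np.allclose(tiled_transpose(a), a.T))  # True
```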

Next steps:
1. Support vectorized transpose with smaller tile size at one dim but bigger tile size at the other, e.g., 3x784.
2. Support reduction vectorized on the outer loop var (contiguous from outer loop var, not with inner-most loop var)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91532
Approved by: https://github.com/EikanWang, https://github.com/jansel
2023-01-12 17:20:39 +00:00
Natalia Gimelshein
44413f2525 properly convert fill value to x dtype in constant_pad (#92045)
Fixes #92038

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92045
Approved by: https://github.com/desertfire
2023-01-12 05:41:10 +00:00
Jiong Gong
859ac58c54 [Inductor] Support loop split at given depth in CPP codegen (#91397)
This PR refactors the loop-related data structures to support loop splits at a given depth. Before this PR, loop splits were only supported at the inner-most level. With this PR, tiling can be supported at outer levels and at more than one level. The `LoopNest` data structure is extended to support loop splits at various levels and renamed to `LoopNestWithSplit`. The `codegen_loops` function is also rewritten to be general enough to support arbitrary kernels set at the leaves of the loop structure.

This PR also improves the handling of reduction loops with splits. The main loop and tail loop now work on their own reduction variables in parallel, without the data dependency they previously had. With this, two workarounds can be removed from `CppVecKernel`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91397
Approved by: https://github.com/EikanWang, https://github.com/jansel
2023-01-06 12:53:46 +00:00
Wu, Chunyuan
c99a2a43ad [inductor] decompose tanh in CPP backend (#91687)
## Description
The decomposition of `tanh` has been removed in https://github.com/pytorch/pytorch/pull/90889.
```python
@register_decomposition([aten.tanh])
def tanh(x):
    return 2.0 / (1.0 + torch.exp(-2.0 * x)) - 1.0
```
We've observed a performance regression on CPU for `lennard_jones` in the TorchBench suite.
This PR decomposes `tanh` in the CPP backend to fix the regression.

### Performance

- Model: lennard_jones
- Machine: IceLake (32 cores per socket)
- Configuration: single instance, 32 cores per instance
- jemalloc and iomp enabled

```bash
python benchmarks/dynamo/torchbench.py  --inductor-settings --inductor --performance --float32 -dcpu -n500  --no-skip --dashboard --only=lennard_jones --quiet
```

Time before regression | Time after regression | Time with this PR
-- | -- | --
0.000262036 | 0.0003618 | 0.000267888

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91687
Approved by: https://github.com/jgong5, https://github.com/desertfire
2023-01-06 10:05:36 +00:00