pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
Jiong Gong	cb48f7855a	[inductor cpu] fix uint8 add and sub (#113253 ) Fix https://github.com/pytorch/pytorch/issues/113016 and https://github.com/pytorch/pytorch/issues/113020 and https://github.com/pytorch/pytorch/issues/113141 and https://github.com/pytorch/pytorch/issues/113143 and https://github.com/pytorch/pytorch/issues/113144 Explicit typecast result of add/sub to uint8 (similar to how we fixed mul previously) to avoid implicit type promotion from C. Pull Request resolved: https://github.com/pytorch/pytorch/pull/113253 Approved by: https://github.com/lezcano, https://github.com/jansel	2023-11-10 04:06:42 +00:00
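The promotion issue behind this fix can be sketched in Python. This is only an emulation of C's integer promotion, not Inductor's generated code; `add_promoted`, `to_u8`, `add_u8`, and `sub_u8` are hypothetical helper names introduced for illustration.

```python
def add_promoted(a: int, b: int) -> int:
    """Emulate C: uint8 operands are promoted to int before '+', so no wrap occurs."""
    return a + b


def to_u8(value: int) -> int:
    """Emulate an explicit cast of an int result back to uint8 (wraps modulo 256)."""
    return value & 0xFF


def add_u8(a: int, b: int) -> int:
    """uint8 addition with the explicit cast applied to the result."""
    return to_u8(a + b)


def sub_u8(a: int, b: int) -> int:
    """uint8 subtraction with the explicit cast applied to the result."""
    return to_u8(a - b)


print(add_promoted(200, 100))  # 300: the promoted result no longer fits in uint8
print(add_u8(200, 100))        # 44: wrapped to the uint8 range
print(sub_u8(0, 1))            # 255: wrapped instead of going negative
```

Without the explicit cast, later operations would see the promoted value (300 or -1) instead of the wrapped uint8 value.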
Jiong Gong	8c704f7a0e	[inductor cpp] fix argmax with >1 reduction dims (#113168 ) Fix #113013. The argmax (and argmin) implementation doesn't handle the index compute properly when the number of reduction dims is larger than 1. It wrongly assumed only one reduction dim. With the given reproducer, the generated code before the change: ```c++ #include "/tmp/torchinductor_jgong5/tb/ctbgktuhgnnlel6ipqkfk76lfztr5pledachdkcq3asdqtlxpzt6.h" extern "C" void kernel(const double* in_ptr0, long* out_ptr0) { { { struct IndexValue_1 {size_t index; double value;}; IndexValue_1 tmp_acc0{0, -std::numeric_limits<double>::infinity()}; #if !defined(__clang_major__) || __clang_major__ > 9 #pragma omp declare reduction(argmax : IndexValue_1 :\ omp_out.value = omp_in.value < omp_out.value ? omp_out.value : omp_in.value,\ omp_out.index = omp_in.value < omp_out.value ? omp_out.index : omp_in.index)\ initializer(omp_priv = {0, -std::numeric_limits<double>::infinity()}) #endif for(long x0=static_cast<long>(0L); x0<static_cast<long>(9L); x0+=static_cast<long>(1L)) { for(long x1=static_cast<long>(0L); x1<static_cast<long>(2L); x1+=static_cast<long>(1L)) { auto tmp0 = c10::convert<long>(0); auto tmp1 = c10::convert<long>(1); auto tmp2 = tmp0 < tmp1; auto tmp3 = c10::convert<long>(at::native::div_floor_integer((3L*x1), 2L)); auto tmp4 = c10::convert<long>(2L + (at::native::div_floor_integer((3L*x1), 2L))); auto tmp5 = tmp3 < tmp4; auto tmp6 = tmp2 & tmp5; auto tmp7 = [&] { auto tmp8 = in_ptr0[static_cast<long>((3L*x0) + (at::native::div_floor_integer((3L*x1), 2L)))]; return tmp8; } ; auto tmp9 = tmp6 ? tmp7() : static_cast<decltype(tmp7())>(0.0); auto tmp10 = c10::convert<long>(1L + (at::native::div_floor_integer((3L*x1), 2L))); auto tmp11 = tmp10 < tmp4; auto tmp12 = tmp2 & tmp11; auto tmp13 = [&] { auto tmp14 = in_ptr0[static_cast<long>(1L + (3L*x0) + (at::native::div_floor_integer((3L*x1), 2L)))]; return tmp14; } ; auto tmp15 = tmp12 ? tmp13() : static_cast<decltype(tmp13())>(0.0); auto tmp16 = tmp15 + tmp9; auto tmp17 = [&] { auto tmp18 = c10::convert<double>(1.0); return tmp18; } ; auto tmp19 = tmp6 ? tmp17() : static_cast<decltype(tmp17())>(0.0); auto tmp20 = [&] { auto tmp21 = c10::convert<double>(1.0); return tmp21; } ; auto tmp22 = tmp12 ? tmp20() : static_cast<decltype(tmp20())>(0.0); auto tmp23 = tmp22 + tmp19; auto tmp24 = tmp16 / tmp23; if (tmp_acc0.value < tmp24) { tmp_acc0.index = x1; tmp_acc0.value = tmp24; // both x0 and x1 are reduction vars while only x1 is assigned to tmp_acc0.index } } } out_ptr0[static_cast<long>(0L)] = tmp_acc0.index; } } } ``` After fix: ```c++ #include "/tmp/torchinductor_jgong5/tb/ctbgktuhgnnlel6ipqkfk76lfztr5pledachdkcq3asdqtlxpzt6.h" extern "C" void kernel(const double* in_ptr0, long* out_ptr0) { { { struct IndexValue_1 {size_t index; double value;}; IndexValue_1 tmp_acc0{0, -std::numeric_limits<double>::infinity()}; #if !defined(__clang_major__) || __clang_major__ > 9 #pragma omp declare reduction(argmax : IndexValue_1 :\ omp_out.value = omp_in.value < omp_out.value ? omp_out.value : omp_in.value,\ omp_out.index = omp_in.value < omp_out.value ? omp_out.index : omp_in.index)\ initializer(omp_priv = {0, -std::numeric_limits<double>::infinity()}) #endif for(long x0=static_cast<long>(0L); x0<static_cast<long>(9L); x0+=static_cast<long>(1L)) { for(long x1=static_cast<long>(0L); x1<static_cast<long>(2L); x1+=static_cast<long>(1L)) { auto tmp0 = c10::convert<long>(0); auto tmp1 = c10::convert<long>(1); auto tmp2 = tmp0 < tmp1; auto tmp3 = c10::convert<long>(at::native::div_floor_integer((3L*x1), 2L)); auto tmp4 = c10::convert<long>(2L + (at::native::div_floor_integer((3L*x1), 2L))); auto tmp5 = tmp3 < tmp4; auto tmp6 = tmp2 & tmp5; auto tmp7 = [&] { auto tmp8 = in_ptr0[static_cast<long>((3L*x0) + (at::native::div_floor_integer((3L*x1), 2L)))]; return tmp8; } ; auto tmp9 = tmp6 ? tmp7() : static_cast<decltype(tmp7())>(0.0); auto tmp10 = c10::convert<long>(1L + (at::native::div_floor_integer((3L*x1), 2L))); auto tmp11 = tmp10 < tmp4; auto tmp12 = tmp2 & tmp11; auto tmp13 = [&] { auto tmp14 = in_ptr0[static_cast<long>(1L + (3L*x0) + (at::native::div_floor_integer((3L*x1), 2L)))]; return tmp14; } ; auto tmp15 = tmp12 ? tmp13() : static_cast<decltype(tmp13())>(0.0); auto tmp16 = tmp15 + tmp9; auto tmp17 = [&] { auto tmp18 = c10::convert<double>(1.0); return tmp18; } ; auto tmp19 = tmp6 ? tmp17() : static_cast<decltype(tmp17())>(0.0); auto tmp20 = [&] { auto tmp21 = c10::convert<double>(1.0); return tmp21; } ; auto tmp22 = tmp12 ? tmp20() : static_cast<decltype(tmp20())>(0.0); auto tmp23 = tmp22 + tmp19; auto tmp24 = tmp16 / tmp23; if (tmp_acc0.value < tmp24) { tmp_acc0.index = static_cast<long>(x1 + (2L*x0)); tmp_acc0.value = tmp24; } } } out_ptr0[static_cast<long>(0L)] = tmp_acc0.index; } } } ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/113168 Approved by: https://github.com/lezcano, https://github.com/jansel	2023-11-09 11:47:51 +00:00
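The index fix in the argmax commit above can be illustrated with a small pure-Python sketch. It is not the Inductor implementation; `argmax_inner_only` and `argmax_flat` are hypothetical names, and the 3x2 input is made-up illustration data.

```python
from itertools import product


def argmax_inner_only(values, dim0, dim1):
    """Buggy variant: records only the innermost loop variable as the index."""
    best_index, best_value = 0, float("-inf")
    for x0, x1 in product(range(dim0), range(dim1)):
        if best_value < values[x0][x1]:
            best_index = x1  # ignores x0, although x0 is also a reduction variable
            best_value = values[x0][x1]
    return best_index


def argmax_flat(values, dim0, dim1):
    """Fixed variant: combines both reduction variables into a flat index."""
    best_index, best_value = 0, float("-inf")
    for x0, x1 in product(range(dim0), range(dim1)):
        if best_value < values[x0][x1]:
            best_index = x1 + dim1 * x0  # same shape as x1 + (2L*x0) in the fix
            best_value = values[x0][x1]
    return best_index


data = [[0.0, 1.0], [5.0, 2.0], [3.0, 4.0]]  # maximum 5.0 sits at x0=1, x1=0
print(argmax_inner_only(data, 3, 2))  # 0
print(argmax_flat(data, 3, 2))        # 2
```

The buggy variant returns the position within the last dimension only, while the fixed variant returns the position in the flattened input.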
Jez Ng	297c26bb8e	Support fp8 in AOTInductor + support optional<> in C ABI (#112527 ) This was originally ipiszy's PR: https://github.com/pytorch/pytorch/pull/112358 It turns out that we need to add support for optional types in order to support fp8 gemm (i.e. scaled_mm). Since our ABI-stable C interface can't support optional<> directly, I am passing in optional types via pointer instead. `AtenTensorHandle`s are already pointers, so nothing needs to change there. Only value types need to change. We decided on this approach instead of adding an extra `bool` param to the callee because this simplifies things. Having the same number of arguments regardless of whether we are emitting Python / C++ / ABI-compatible C++ makes codegen easier. There are a number of existing ABI-compatible functions that have optional-typed value parameters. Previously, they just assumed they would never be passed a `nullopt` / `None` at runtime. Changing them to use pointer types now would break ABI stability, so I have created an exclude list for those functions. Finally, I think the current implementation is kind of messy, and only works for FallbackKernels, even though technically ExternKernels could also have the same issue. It also doesn't support optional types nested in lists. I've left FIXME comments for both issues. Differential Revision: [D51084289](https://our.internmc.facebook.com/intern/diff/D51084289) Pull Request resolved: https://github.com/pytorch/pytorch/pull/112527 Approved by: https://github.com/chenyang78, https://github.com/desertfire	2023-11-08 22:56:48 +00:00
Jez Ng	dc63248b76	Make dynamo configs more amenable to static type checking (#112130 ) `install_config_module` makes a regular module into a ConfigModule with extra methods defined on it. mypy thinks those extra methods (or module functions) are undefined since it cannot analyze something so dynamic. As a workaround, I've created a fake module that defines these extra functions, which I import into the config modules during type checking. As part of this change, I've also added more types to config_utils.py and enabled typechecking for torch/_dynamo/config.py. Pull Request resolved: https://github.com/pytorch/pytorch/pull/112130 Approved by: https://github.com/jansel	2023-11-08 21:17:45 +00:00
Aaron Gokaslan	8219bf051b	[BE]: Apply RUF015 to torch folder (#113025 ) Removes unnecessary allocations of iterators. There is a small chance this may have side effects as the entire iterator is no longer consumed, but this is a way more efficient method for retrieving the first element. Pull Request resolved: https://github.com/pytorch/pytorch/pull/113025 Approved by: https://github.com/ezyang, https://github.com/malfet	2023-11-07 00:48:15 +00:00
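A minimal sketch of the pattern RUF015 targets, and of the side-effect caveat mentioned in the commit above. The helper names `first_via_list`, `first_via_next`, and `numbers` are hypothetical and exist only for this illustration.

```python
def first_via_list(iterable):
    """Old pattern: materializes the whole iterable just to read element 0."""
    return list(iterable)[0]


def first_via_next(iterable):
    """Pattern suggested by RUF015: stops after the first element."""
    return next(iter(iterable))


def numbers(log):
    """Generator with a visible side effect: records every value it produces."""
    for i in range(3):
        log.append(i)
        yield i


log_a, log_b = [], []
print(first_via_list(numbers(log_a)), log_a)  # 0 [0, 1, 2]
print(first_via_next(numbers(log_b)), log_b)  # 0 [0]
```

Both helpers return the same element, but only the first one runs the generator to completion, which is the behavioral difference the commit message warns about.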
Peter Bell	718035791d	Prefer `e.is_number` over `not e.free_symbols` in SymPy (#112688 ) We spend somewhere on the order of 1% in `sympy.Expr.free_symbols` as it is called millions of times. Most of the time we actually just want to know "is this a constant", however `e.is_constant()` is horribly slow. It turns out though that there is another property `is_number` that does what we want. > property is_number: > > Returns True if self has no free symbols and no undefined functions (AppliedUndef, to be precise). It will be faster > than if not self.free_symbols, however, since is_number will fail as soon as it hits a free symbol or undefined > function. Even further, we also avoid the overhead of building the unnecessary set object. Pull Request resolved: https://github.com/pytorch/pytorch/pull/112688 Approved by: https://github.com/lezcano	2023-11-06 20:05:13 +00:00
Jiong Gong	e061144aaf	[inductor] replace ops.div with ops.truediv (#112243 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/112243 Approved by: https://github.com/lezcano ghstack dependencies: #112234	2023-11-01 05:50:51 +00:00
Shunting Zhang	fbafff3668	[reland][inductor] benchmark fusion (#112450 ) reland https://github.com/pytorch/pytorch/pull/108193 Pull Request resolved: https://github.com/pytorch/pytorch/pull/112450 Approved by: https://github.com/jansel	2023-10-31 18:17:06 +00:00
Jiong Gong	a1c56df1f0	[inductor cpp] vectorize support for truediv (#112234 ) Ops like group_norm use `ops.truediv`, which doesn't have vectorization support yet. This PR adds the support. `test_group_norm_vec` Before: ```c++ extern "C" void kernel(const float* in_ptr0, const float* in_ptr1, const float* in_ptr2, float* out_ptr0, float* out_ptr1, float* out_ptr2) { #pragma omp parallel num_threads(64) { { #pragma omp for for(long x0=static_cast<long>(0L); x0<static_cast<long>(64L); x0+=static_cast<long>(1L)) { { #pragma omp declare reduction(welford:Welford<float>:omp_out = welford_combine(omp_out, omp_in)) initializer(omp_priv={Welford<float>()}) #pragma omp declare reduction(welford:Welford<at::vec::Vectorized<float>>:omp_out = welford_combine(omp_out, omp_in)) initializer(omp_priv={Welford<at::vec::Vectorized<float>>()}) Welford<float> tmp_acc0 = Welford<float>(); Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>(); for(long x1=static_cast<long>(0L); x1<static_cast<long>(1024L); x1+=static_cast<long>(16L)) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x1 + (1024L*x0))); tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0); } tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec)); out_ptr0[static_cast<long>(x0)] = static_cast<float>(tmp_acc0.mean); out_ptr1[static_cast<long>(x0)] = static_cast<float>(tmp_acc0.m2); } } } { #pragma omp for collapse(2) for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L)) { for(long x1=static_cast<long>(0L); x1<static_cast<long>(32L); x1+=static_cast<long>(1L)) { #pragma GCC ivdep for(long x2=static_cast<long>(0L); x2<static_cast<long>(1024L); x2+=static_cast<long>(1L)) { auto tmp0 = in_ptr0[static_cast<long>(x2 + (1024L*x1) + (32768L*x0))]; auto tmp1 = out_ptr0[static_cast<long>(x1 + (32L*x0))]; auto tmp3 = out_ptr1[static_cast<long>(x1 + (32L*x0))]; auto tmp10 = in_ptr1[static_cast<long>(x1)]; auto tmp12 = in_ptr2[static_cast<long>(x1)]; auto tmp2 = tmp0 - tmp1; auto tmp4 = c10::convert<float>(1024.0); auto tmp5 = tmp3 / tmp4; auto tmp6 = c10::convert<float>(1e-05); auto tmp7 = tmp5 + tmp6; auto tmp8 = 1 / std::sqrt(tmp7); auto tmp9 = decltype(tmp2)(tmp2 * tmp8); auto tmp11 = decltype(tmp9)(tmp9 * tmp10); auto tmp13 = tmp11 + tmp12; out_ptr2[static_cast<long>(x2 + (1024L*x1) + (32768L*x0))] = tmp13; } } } } } } ``` After: ```c++ extern "C" void kernel(const float* in_ptr0, const float* in_ptr1, const float* in_ptr2, float* out_ptr0, float* out_ptr1, float* out_ptr2) { #pragma omp parallel num_threads(64) { { #pragma omp for for(long x0=static_cast<long>(0L); x0<static_cast<long>(64L); x0+=static_cast<long>(1L)) { { #pragma omp declare reduction(welford:Welford<float>:omp_out = welford_combine(omp_out, omp_in)) initializer(omp_priv={Welford<float>()}) #pragma omp declare reduction(welford:Welford<at::vec::Vectorized<float>>:omp_out = welford_combine(omp_out, omp_in)) initializer(omp_priv={Welford<at::vec::Vectorized<float>>()}) Welford<float> tmp_acc0 = Welford<float>(); Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>(); for(long x1=static_cast<long>(0L); x1<static_cast<long>(1024L); x1+=static_cast<long>(16L)) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x1 + (1024L*x0))); tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0); } tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec)); out_ptr0[static_cast<long>(x0)] = static_cast<float>(tmp_acc0.mean); out_ptr1[static_cast<long>(x0)] = static_cast<float>(tmp_acc0.m2); } } } { #pragma omp for collapse(2) for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L)) { for(long x1=static_cast<long>(0L); x1<static_cast<long>(32L); x1+=static_cast<long>(1L)) { for(long x2=static_cast<long>(0L); x2<static_cast<long>(1024L); x2+=static_cast<long>(16L)) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x2 + (1024L*x1) + (32768L*x0))); auto tmp1 = at::vec::Vectorized<float>(static_cast<float>(out_ptr0[static_cast<long>(x1 + (32L*x0))])); auto tmp3 = at::vec::Vectorized<float>(static_cast<float>(out_ptr1[static_cast<long>(x1 + (32L*x0))])); auto tmp10 = at::vec::Vectorized<float>(static_cast<float>(in_ptr1[static_cast<long>(x1)])); auto tmp12 = at::vec::Vectorized<float>(static_cast<float>(in_ptr2[static_cast<long>(x1)])); auto tmp2 = tmp0 - tmp1; auto tmp4 = at::vec::Vectorized<float>(static_cast<float>(1024.0)); auto tmp5 = tmp3 / tmp4; auto tmp6 = at::vec::Vectorized<float>(static_cast<float>(1e-05)); auto tmp7 = tmp5 + tmp6; auto tmp8 = tmp7.rsqrt(); auto tmp9 = tmp2 * tmp8; auto tmp11 = tmp9 * tmp10; auto tmp13 = tmp11 + tmp12; tmp13.store(out_ptr2 + static_cast<long>(x2 + (1024L*x1) + (32768L*x0))); } } } } } } ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/112234 Approved by: https://github.com/lezcano, https://github.com/jansel	2023-10-31 17:15:21 +00:00
PyTorch MergeBot	fc0b0820fc	Revert "Readded device_assert skipping in index and index_put (and also added (#112093 )" This reverts commit `b110d87ac2`. Reverted https://github.com/pytorch/pytorch/pull/112093 on behalf of https://github.com/ZainRizvi due to Stack breaks internal builds ([comment](https://github.com/pytorch/pytorch/pull/112093#issuecomment-1785922905))	2023-10-30 19:45:41 +00:00
chilli	b110d87ac2	Readded device_assert skipping in index and index_put (and also added (#112093 ) copy to noop pass) Pull Request resolved: https://github.com/pytorch/pytorch/pull/112093 Approved by: https://github.com/oulgen, https://github.com/lezcano	2023-10-27 18:23:49 +00:00
PyTorch MergeBot	64fd027f2e	Revert "[inductor] benchmark fusion (#108193 )" This reverts commit `73cc5d1cdd`. Reverted https://github.com/pytorch/pytorch/pull/108193 on behalf of https://github.com/izaitsevfb due to Trying to unblock the revert of #108690, please rebase and reland. ([comment](https://github.com/pytorch/pytorch/pull/108193#issuecomment-1782157638))	2023-10-27 01:40:06 +00:00
PyTorch MergeBot	0a3199dd7e	Revert "Readded device_assert skipping in index and index_put (and also added (#112093 )" This reverts commit `e38347f490`. Reverted https://github.com/pytorch/pytorch/pull/112093 on behalf of https://github.com/izaitsevfb due to Sorry, trying to resolve a conflict with intern, and unblock the revert of #108690 ([comment](https://github.com/pytorch/pytorch/pull/112093#issuecomment-1782154814))	2023-10-27 01:37:33 +00:00
Shunting Zhang	73cc5d1cdd	[inductor] benchmark fusion (#108193 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/108193 Approved by: https://github.com/jansel	2023-10-26 22:18:37 +00:00
PyTorch MergeBot	485cc0faae	Revert "[inductor] benchmark fusion (#108193 )" This reverts commit `ec0cdcdf6a`. Reverted https://github.com/pytorch/pytorch/pull/108193 on behalf of https://github.com/ZainRizvi due to This test is breaking trunk. In the future please make sure to add the ciflow/trunk label before force merging any PR to ensure your code doesn't break those tests ([comment](https://github.com/pytorch/pytorch/pull/108193#issuecomment-1781473282))	2023-10-26 16:41:20 +00:00
chilli	e38347f490	Readded device_assert skipping in index and index_put (and also added (#112093 ) copy to noop pass) Pull Request resolved: https://github.com/pytorch/pytorch/pull/112093 Approved by: https://github.com/oulgen, https://github.com/lezcano ghstack dependencies: #111990	2023-10-26 07:54:44 +00:00
Shunting Zhang	ec0cdcdf6a	[inductor] benchmark fusion (#108193 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/108193 Approved by: https://github.com/jansel	2023-10-26 04:14:22 +00:00
Guilherme Leobas	f97c2dabd9	Move negative index checking to common.py - Fix issue 97365 (#108690 ) Fixes https://github.com/pytorch/pytorch/issues/97365 Pull Request resolved: https://github.com/pytorch/pytorch/pull/108690 Approved by: https://github.com/lezcano	2023-10-24 17:27:54 +00:00
Jiong Gong	8bc04f46fe	[inductor cpp] use c10::bit_cast to avoid violating strict-aliasing (#110809 ) Fix https://github.com/pytorch/pytorch/issues/110807 Pull Request resolved: https://github.com/pytorch/pytorch/pull/110809 Approved by: https://github.com/jansel	2023-10-10 11:16:31 +00:00
Peter Bell	dc794ec32c	[dynamo] Trace through builtin `abs` (#110398 ) In python `abs(x)` does nothing but delegate to `x.__abs__()` so we should do the same in dynamo. This also adds `SymNode.__abs__` so we can trace through indexing expressions involving `abs`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/110398 Approved by: https://github.com/jansel, https://github.com/lezcano	2023-10-03 19:25:37 +00:00
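The delegation described in the `abs` commit above can be seen with a toy tracing object. This is a sketch only: `Traced` is a hypothetical class for illustration and is unrelated to dynamo's actual variable trackers or to `SymNode`.

```python
class Traced:
    """Toy symbolic value that records the operations applied to it."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def __abs__(self):
        # The builtin abs(x) dispatches here, so a tracer only needs to
        # handle __abs__ in order to see calls to abs().
        expr = f"abs({self.name})"
        self.log.append(expr)
        return Traced(expr, self.log)


log = []
result = abs(Traced("x", log))
print(result.name)  # abs(x)
print(log)          # ['abs(x)']
```

Because the builtin does nothing beyond calling `__abs__`, handling that one method is enough for the toy tracer to observe every `abs` call.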
Alexander Grund	e0348ceceb	Avoid undefined behavior in JIT-generated conversion code (#110212 ) The inductor/dynamo JIT generator creates C++ code using `static_cast` for type conversions. This can be undefined behavior for e.g. `static_cast<uint8_t>(floatVal)` where `floatVal` is a negative value. To avoid this in the "regular" C++ code `c10::convert` is used. So use it in the JIT generated code too. Fixes #110077 Pull Request resolved: https://github.com/pytorch/pytorch/pull/110212 Approved by: https://github.com/ezyang, https://github.com/jgong5, https://github.com/desertfire	2023-10-02 12:56:41 +00:00
leslie-fang-intel	7eeb392eb3	[Inductor] Enable the item() and nonzero() codegen test on CPU (#110262 ) Summary Follow-up to https://github.com/pytorch/pytorch/pull/109893, which has an issue with CPU support as reported in https://github.com/pytorch/pytorch/issues/109897. This fix mainly includes 2 changes: - The current implementation of `rename_indexing` `10c646295d/torch/_inductor/codegen/common.py (L1023)` only adds symbols whose names start with `s` or `ps` to `kernel.args.sizevars`. However, the name of an `Unbacked symint` starts with `i`, so we extend the implementation of `rename_indexing` to support symbols starting with `i`. - Currently, the names of internal loop indices also start with `i`. Since `i` is already used for `Unbacked symint`, change the loop index names to start with `x`, which should align with Triton. Test Plan ``` python -u -m pytest -s -v test_torchinductor_dynamic_shapes.py -k test_bool_mask_nobreak python -u -m pytest -s -v test_torchinductor_dynamic_shapes.py -k test_nonzero_size_factory_nobreak python -u -m pytest -s -v test_torchinductor_dynamic_shapes.py -k test_item_zeros_nobreak ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/110262 Approved by: https://github.com/ezyang, https://github.com/jgong5	2023-09-30 00:13:20 +00:00
Edward Z. Yang	d1a13129bb	Add support for item() and nonzero() codegen in Inductor (#109893 ) This is another version of https://github.com/pytorch/pytorch/pull/109262 that I think is more harmonious with inductor design. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/109893 Approved by: https://github.com/jansel	2023-09-28 23:37:31 +00:00
Sam Larsen	7ed06e8317	[inductor] enable mypy checking in torch/_inductor/codegen/cpp.py (#109729 ) Summary: Add enough typehints / ignores to enable mypy checking in torch/_inductor/codegen/cpp.py Test Plan: lintrunner Pull Request resolved: https://github.com/pytorch/pytorch/pull/109729 Approved by: https://github.com/Skylion007	2023-09-25 22:53:05 +00:00
Ying Zhang	bbdce93571	Basic fp8 support in Inductor (#109168 ) Add basic fp8 support in Inductor, including: * Fix fp8 Triton codegen issues; * Add min_elements_per_thread requirement for fp8 related dtype conversions. More details on Triton implementation can be found from `10f59d8ce0/lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp (L10)`. Note that the current implementation only works for Pointwise. Will create follow-up PRs for Reduction. Pull Request resolved: https://github.com/pytorch/pytorch/pull/109168 Approved by: https://github.com/drisspg	2023-09-23 04:41:41 +00:00
Nikita Shulga	a9bf1031d4	[BE] Do not use `numpy` in `torch._inductor.codegen.cpp` (#109324 ) `s/numpy.iinfo(numpy.int32)/torch.iinfo(torch.int32)/` as those two are interchangeable Partially addresses https://github.com/pytorch/pytorch/issues/109387 Pull Request resolved: https://github.com/pytorch/pytorch/pull/109324 Approved by: https://github.com/albanD	2023-09-15 17:29:10 +00:00
Ying Zhang	097fd43f8c	[Inductor CUTLASS backend] Step 4: CUDA (template) kernels (#107931 ) This is the step 4 to add cutlass as an alternative inductor backend. Full tests can be found from the last PR in the stack. Feature request: https://github.com/pytorch/pytorch/issues/106991. Pull Request resolved: https://github.com/pytorch/pytorch/pull/107931 Approved by: https://github.com/aakhundov, https://github.com/jansel, https://github.com/kadeng ghstack dependencies: #107802, #107847, #107901	2023-09-12 17:44:38 +00:00
Sherlock Huang	b9dfdc091b	[AOTInductor][Reland] Proxy Executor for Extern Fallback kernels (#107279 ) (#108350 ) Summary: This is a prototype for running extern fallback kernels with a host side proxy executor. Sample of generated cpp wrapper call: ``` at::Tensor buf0; // output buffer void* tensor_args_var_0[] = {&arg0_1, &arg0_1, &arg1_1, &arg0_1, &arg1_1, &buf0}; int64_t int_args_var_1[] = {81, 81, 7, 7, 7, 81}; proxy_executor->call_function("buf0", int_args_var_1, tensor_args_var_0); ``` - In my current implementation, the proxy executor interprets the raw pointers according to the op's schema. This assumes that a custom op MUST have a valid schema registered with the Dispatcher. (I would like to validate this assumption) - I am using the callboxed() API of the custom kernels. This is inevitable, as we wish to have a single call_function API for all possible custom kernels. - These are all the input argument types I have supported so far. union Argument { # Bool value does not matter 1: bool asNone; 2: TensorArgument asTensor; 3: list<TensorArgument> asTensors; 5: i64 asInt; 7: list<i64> asInts; 8: double asFloat; 9: list<double> asFloats; 10: string asString; 10.5: list<string> asStrings; 11: SymIntArgument asSymInt; 12: list<SymIntArgument> asSymInts; 13: ScalarType asScalarType; 14: MemoryFormat asMemoryFormat; 15: Layout asLayout; 16: Device asDevice; 17: bool asBool; 18: list<bool> asBools; } - Need a policy for handling unpopulated arguments with default values. Here are the options, which have BC implications. 1. Require the exported fx graph to explicitly populate default values if the user doesn't specify them. 2. Require the cpp wrapper to explicitly populate default values if the fx graph doesn't specify them. 3. Have the proxy executor look up default values from the opSchema. 
For fixing T162112344 Test Plan: frontend: buck2 run mode/dev-sand mode/inplace -c fbcode.enable_gpu_sections=True sigmoid/frontend:export_main test: buck2 run mode/dev-sand //deeplearning/aot_inductor/test:test_custom_ops backend: buck2 run mode/dev-nosan //deeplearning/aot_inductor/fb:main buck2 test 'fbcode//mode/opt' fbcode//caffe2/torch/fb/model_transform/experimental/benchmark/test:test_aot_inductor_benchmark -- --exact 'caffe2/torch/fb/model_transform/experimental/benchmark/test:test_aot_inductor_benchmark - test_aot_inductor_benchmark_cmf30x (caffe2.torch.fb.model_transform.experimental.benchmark.test.test_aot_inductor_benchmark.AOTInductorBenchmark)' Reviewed By: suo Differential Revision: D48747417 Pull Request resolved: https://github.com/pytorch/pytorch/pull/108350 Approved by: https://github.com/izaitsevfb	2023-09-02 17:14:10 +00:00
Shunting Zhang	7cb4bf675b	[inductor] no-side-effect codegen (#107617 ) Inductor kernel codegen previously had the following side effects: - in `Kernel.__exit__ `, we add locally used buffers to graph.removed_buffers - during codegen, we do memory allocation/free. These make it hard to generate multiple versions of code for the same kernel. This PR refactors the code so that kernel codegen does not change graph-level state. After generating code for a kernel, the graph-level state is unchanged, so we can go on to generate another version of the kernel if we want. Pull Request resolved: https://github.com/pytorch/pytorch/pull/107617 Approved by: https://github.com/jansel	2023-08-31 00:25:17 +00:00
leslie-fang-intel	fdbc2ec5cb	[Quant][Inductor] Fix the non contiguous load with uint8 data type (#106958 ) Summary Currently, load vectorization codegen for `non_contiguous` loads with the `uint8` data type has an issue determining the data type. It caused wrong results in the `shufflenet_v2_x1_0` model after we enabled the `cat` quantization recipe. - Previous codegen for the example in this PR: ``` cpp_fused_clone_view_0 = async_compile.cpp(''' #include "/tmp/torchinductor_root/i5/ci5uspp363v3ky6jkccllm3bxudy2fkdpqinkqhmpehfihejs7ko.h" extern "C" void kernel(const unsigned char* in_ptr0, float* out_ptr0) { #pragma omp parallel num_threads(56) { { #pragma omp for for(long i0=static_cast<long>(0L); i0<static_cast<long>(232L); i0+=static_cast<long>(1L)) { for(long i1=static_cast<long>(0L); i1<static_cast<long>(784L); i1+=static_cast<long>(16L)) { auto tmp0 = ([&]() { __at_align__ float tmpbuf[16]; for (long i1_inner = 0; i1_inner < 16; i1_inner++) tmpbuf[i1_inner] = flag_to_float_scalar(in_ptr0[static_cast<long>((116L*(static_cast<long>(i0) % static_cast<long>(2L))) + (232L*i1) + (232L*i1_inner) + (at::native::div_floor_integer(i0, 2L)))]); return at::vec::Vectorized<uint8_t>::loadu_one_fourth(tmpbuf); })(); auto tmp1 = at::vec::convert_uint8_to_float(tmp0); auto tmp2 = at::vec::Vectorized<float>(static_cast<float>(0.0)); auto tmp3 = tmp1 - tmp2; auto tmp4 = at::vec::Vectorized<float>(static_cast<float>(1.0)); auto tmp5 = tmp3 * tmp4; auto tmp6 = tmp5 * tmp4; auto tmp7 = tmp6.round(); auto tmp8 = tmp7 + tmp2; auto tmp9 = at::vec::maximum(tmp8, tmp2); auto tmp10 = at::vec::Vectorized<float>(static_cast<float>(255.0)); auto tmp11 = at::vec::minimum(tmp9, tmp10); auto tmp12 = at::vec::convert_float_to_uint8(tmp11); auto tmp13 = at::vec::convert_uint8_to_float(tmp12); auto tmp14 = tmp13 - tmp2; auto tmp15 = tmp14 * tmp4; tmp15.store(out_ptr0 + static_cast<long>(i1 + (784L*i0))); } } } } } ''') ``` - After this PR, the code gen is: ``` cpp_fused_clone_view_0 = async_compile.cpp(''' #include "/tmp/torchinductor_root/i5/ci5uspp363v3ky6jkccllm3bxudy2fkdpqinkqhmpehfihejs7ko.h" extern "C" void kernel(const unsigned char* in_ptr0, float* out_ptr0) { #pragma omp parallel num_threads(56) { { #pragma omp for for(long i0=static_cast<long>(0L); i0<static_cast<long>(232L); i0+=static_cast<long>(1L)) { for(long i1=static_cast<long>(0L); i1<static_cast<long>(784L); i1+=static_cast<long>(16L)) { auto tmp0 = ([&]() { __at_align__ unsigned char tmpbuf[16]; for (long i1_inner = 0; i1_inner < 16; i1_inner++) tmpbuf[i1_inner] = in_ptr0[static_cast<long>((116L*(static_cast<long>(i0) % static_cast<long>(2L))) + (232L*i1) + (232L*i1_inner) + (at::native::div_floor_integer(i0, 2L)))]; return at::vec::Vectorized<uint8_t>::loadu_one_fourth(tmpbuf); })(); auto tmp1 = at::vec::convert_uint8_to_float(tmp0); auto tmp2 = at::vec::Vectorized<float>(static_cast<float>(0.0)); auto tmp3 = tmp1 - tmp2; auto tmp4 = at::vec::Vectorized<float>(static_cast<float>(1.0)); auto tmp5 = tmp3 * tmp4; auto tmp6 = tmp5 * tmp4; auto tmp7 = tmp6.round(); auto tmp8 = tmp7 + tmp2; auto tmp9 = at::vec::maximum(tmp8, tmp2); auto tmp10 = at::vec::Vectorized<float>(static_cast<float>(255.0)); auto tmp11 = at::vec::minimum(tmp9, tmp10); auto tmp12 = at::vec::convert_float_to_uint8(tmp11); auto tmp13 = at::vec::convert_uint8_to_float(tmp12); auto tmp14 = tmp13 - tmp2; auto tmp15 = tmp14 * tmp4; tmp15.store(out_ptr0 + static_cast<long>(i1 + (784L*i0))); } } } } } ''') ``` Test Plan ``` clear && python -m pytest test_cpu_repro.py -k test_non_contiguous_load_buf_quant ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/106958 Approved by: https://github.com/jgong5, https://github.com/eellison ghstack dependencies: #106836, #106838	2023-08-26 16:58:45 +00:00
XiaobingSuper	d2105a8688	inductor: support masked load for cpu path (#107670 ) For max_pooling code: ``` #pragma GCC ivdep for(long i2=static_cast<long>(0L); i2<static_cast<long>(56L); i2+=static_cast<long>(1L)) { for(long i3=static_cast<long>(0L); i3<static_cast<long>(64L); i3+=static_cast<long>(16L)) { auto tmp0 = at::vec::Vectorized<int>(static_cast<int>((-1L) + (2L*i1))); auto tmp1 = at::vec::Vectorized<int>(static_cast<int>(0)); auto tmp2 = to_float_mask(tmp0 >= tmp1); auto tmp3 = at::vec::Vectorized<int>(static_cast<int>(112)); auto tmp4 = to_float_mask(tmp0 < tmp3); auto tmp5 = tmp2 & tmp4; auto tmp6 = at::vec::Vectorized<int>(static_cast<int>((-1L) + (2L*i2))); auto tmp7 = to_float_mask(tmp6 >= tmp1); auto tmp8 = to_float_mask(tmp6 < tmp3); auto tmp9 = tmp7 & tmp8; auto tmp10 = tmp5 & tmp9; auto tmp11 = [&] { auto tmp12 = at::vec::Vectorized<bfloat16>::loadu(in_ptr0 + static_cast<long>((-7232L) + i3 + (128L*i2) + (14336L*i1) + (802816L*i0)), 16); auto tmp13 = cvt_lowp_fp_to_fp32<bfloat16>(tmp12); return tmp13; } ; auto tmp14 = decltype(tmp11())::blendv(at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity()), tmp11(), to_float_mask(tmp10)); ``` The index used to load `tmp12` may be invalid: for example, with `i1=0, i2=0, i3=0` the index is `-7232L`, so calling `tmp11()` may cause a segmentation fault. The value can only be read safely when `tmp10` (the index check variable) is true. This PR adds masked_load support to fix this issue. Pull Request resolved: https://github.com/pytorch/pytorch/pull/107670 Approved by: https://github.com/jgong5, https://github.com/jansel	2023-08-25 21:11:09 +00:00
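The idea of a masked load, as described in the commit above, can be sketched in pure Python. This is a conceptual analogue only, not the vectorized CPU implementation; `masked_load` and the sample buffer are hypothetical. Note that Python does not fault on a negative index but silently reads from the end of the list, which stands in here for the invalid access in C++.

```python
NEG_INF = float("-inf")


def masked_load(buffer, indices, mask, fill):
    """Read buffer[i] only where the mask is true; use `fill` elsewhere."""
    return [buffer[i] if ok else fill for i, ok in zip(indices, mask)]


buffer = [1.0, 2.0, 3.0]
indices = [-1, 0, 1]                               # the first lane falls in the padding
mask = [0 <= i < len(buffer) for i in indices]     # plays the role of the index check

print([buffer[i] for i in indices])                  # [3.0, 1.0, 2.0] (wrong lane 0)
print(masked_load(buffer, indices, mask, NEG_INF))   # [-inf, 1.0, 2.0]
```

The fill value is negative infinity because the surrounding operation is a max pooling, so a padded lane never wins the maximum.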
lezcano	2b6249e209	Wrap indirect indexing on CUDA (#105055 ) Lifting this to CPU should be rather easy. @jgong5 Partially fixes https://github.com/pytorch/pytorch/issues/97365. I'd wait to close that issue once this works on CPU as well. This fix works with dynamic shapes as well. @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @ngimel @yf225 @chenyang78 @kadeng @muchulee8 Pull Request resolved: https://github.com/pytorch/pytorch/pull/105055 Approved by: https://github.com/peterbell10, https://github.com/jansel	2023-08-23 11:59:20 +00:00
XiaobingSuper	610f64d72a	inductor: also check index_exp when select tiling var (#106765 ) When selecting the tiling var, we currently consider only loads and stores and ignore index expressions, which causes accuracy issues. Before (the index expression ```i1-1``` cannot be vectorized): ``` cpp_fused_constant_pad_nd_mul_0 = async_compile.cpp(''' #include "/tmp/torchinductor_xiaobing/i5/ci5uspp363v3ky6jkccllm3bxudy2fkdpqinkqhmpehfihejs7ko.h" extern "C" void kernel(const float* in_ptr0, const float* in_ptr1, float* out_ptr0) { #pragma omp parallel num_threads(40) { { #pragma omp for for(long i0=static_cast<long>(0L); i0<static_cast<long>(64L); i0+=static_cast<long>(1L)) { #pragma GCC ivdep for(long i1=static_cast<long>(0L); i1<static_cast<long>(3136L); i1+=static_cast<long>(16L)) { #pragma GCC ivdep for(long i2=static_cast<long>(0L); i2<static_cast<long>(8L); i2+=static_cast<long>(1L)) { auto tmp0 = at::vec::Vectorized<int>(static_cast<int>((-1L) + i1)); auto tmp1 = at::vec::Vectorized<int>(static_cast<int>(0)); auto tmp2 = to_float_mask(tmp0 >= tmp1); auto tmp3 = [&] { auto tmp4 = ([&]() { __at_align__ float tmpbuf[16]; for (long i1_inner = 0; i1_inner < 16; i1_inner++) tmpbuf[i1_inner] = in_ptr0[static_cast<long>((-8L) + i2 + (8L*i1) + (8L*i1_inner) + (25088L*i0))]; return at::vec::Vectorized<float>::loadu(tmpbuf); })(); auto tmp5 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>((-1L) + i1 + (3136L*i2) + (25088L*i0))); auto tmp6 = tmp4 * tmp5; return tmp6; } ; auto tmp7 = decltype(tmp3())::blendv(at::vec::Vectorized<float>(0.0), tmp3(), to_float_mask(tmp2)); { __at_align__ float tmpbuf[16*sizeof(float)/sizeof(float)]; tmp7.store(tmpbuf); for (long i1_inner = 0; i1_inner < 16; i1_inner++) out_ptr0[static_cast<long>(i2 + (8L*i1) + (8L*i1_inner) + (25096L*i0))] = tmpbuf[i1_inner]; } } } #pragma GCC ivdep for(long i1=static_cast<long>(3136L); i1<static_cast<long>(3137L); i1+=static_cast<long>(1L)) { #pragma GCC ivdep for(long i2=static_cast<long>(0L); i2<static_cast<long>(8L); 
i2+=static_cast<long>(1L)) { auto tmp0 = static_cast<long>((-1L) + i1); auto tmp1 = static_cast<long>(0); auto tmp2 = tmp0 >= tmp1; auto tmp3 = [&] { auto tmp4 = in_ptr0[static_cast<long>((-8L) + i2 + (8L*i1) + (25088L*i0))]; auto tmp5 = in_ptr1[static_cast<long>((-1L) + i1 + (3136L*i2) + (25088L*i0))]; auto tmp6 = decltype(tmp4)(tmp4 * tmp5); return tmp6; } ; auto tmp7 = tmp2 ? tmp3() : static_cast<decltype(tmp3())>(0.0); out_ptr0[static_cast<long>(i2 + (8L*i1) + (25096L*i0))] = tmp7; } } } } } } ``` after: ``` cpp_fused_constant_pad_nd_mul_0 = async_compile.cpp(''' #include "/tmp/torchinductor_xiaobing/i5/ci5uspp363v3ky6jkccllm3bxudy2fkdpqinkqhmpehfihejs7ko.h" extern "C" void kernel(const float* in_ptr0, const float* in_ptr1, float* out_ptr0) { #pragma omp parallel num_threads(40) { { #pragma omp for for(long i0=static_cast<long>(0L); i0<static_cast<long>(64L); i0+=static_cast<long>(1L)) { #pragma GCC ivdep for(long i1=static_cast<long>(0L); i1<static_cast<long>(3137L); i1+=static_cast<long>(1L)) { #pragma omp simd simdlen(8) for(long i2=static_cast<long>(0L); i2<static_cast<long>(8L); i2+=static_cast<long>(1L)) { auto tmp0 = static_cast<long>((-1L) + i1); auto tmp1 = static_cast<long>(0); auto tmp2 = tmp0 >= tmp1; auto tmp3 = [&] { auto tmp4 = in_ptr0[static_cast<long>((-8L) + i2 + (8L*i1) + (25088L*i0))]; auto tmp5 = in_ptr1[static_cast<long>((-1L) + i1 + (3136L*i2) + (25088L*i0))]; auto tmp6 = decltype(tmp4)(tmp4 * tmp5); return tmp6; } ; auto tmp7 = tmp2 ? tmp3() : static_cast<decltype(tmp3())>(0.0); out_ptr0[static_cast<long>(i2 + (8L*i1) + (25096L*i0))] = tmp7; } } } } } } ''') ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/106765 Approved by: https://github.com/jgong5, https://github.com/jansel	2023-08-23 07:16:14 +00:00
PyTorch MergeBot	b282787409	Revert "Wrap indirect indexing on CUDA (#105055 )" This reverts commit `85c673e6b2`. Reverted https://github.com/pytorch/pytorch/pull/105055 on behalf of https://github.com/peterbell10 due to Causes failure in inductor_torchbench ([comment](https://github.com/pytorch/pytorch/pull/105055#issuecomment-1688871947))	2023-08-22 20:24:41 +00:00
lezcano	85c673e6b2	Wrap indirect indexing on CUDA (#105055 ) Lifting this to CPU should be rather easy. @jgong5 Partially fixes https://github.com/pytorch/pytorch/issues/97365. I'd wait to close that issue once this works on CPU as well. This fix works with dynamic shapes as well. @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @ngimel @yf225 @chenyang78 @kadeng @muchulee8 Pull Request resolved: https://github.com/pytorch/pytorch/pull/105055 Approved by: https://github.com/peterbell10, https://github.com/jansel	2023-08-22 01:06:35 +00:00
Peter Bell	18b1c2907d	[inductor] Add ir.WelfordReduction with multiple outputs (#104725 ) This replaces `var_unnormalized` reduction type with `welford_reduce` which takes the input data and outputs not just the variance, but also the mean and weights which account for the full welford accumulator state. Thus we can avoid re-computing the mean, and we now have enough information to create a multilayer reduction which I implement here by adding a second reduction type called `welford_combine` which reduces over all three inputs simultaneously. Multi-layer support is particularly important as normalization operators like BatchNorm are being split in many timm models, which meant `var_unnormalized` had to fall back to two-pass variance calculation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/104725 Approved by: https://github.com/lezcano	2023-08-18 08:18:01 +00:00
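The accumulator state described in the commit above (mean, unnormalized variance, weight) and the way two such states merge can be sketched in plain Python. This is a minimal illustration of the standard Welford update and the pairwise combine step, not Inductor's code; `WelfordState` is a hypothetical type, and the two function names only echo the reduction type names from the commit message.

```python
from typing import Iterable, NamedTuple


class WelfordState(NamedTuple):
    mean: float
    m2: float      # sum of squared deviations from the mean (unnormalized variance)
    weight: float  # number of elements accumulated so far


def welford_reduce(values: Iterable[float]) -> WelfordState:
    # Single pass over the data, updating mean and m2 one element at a time.
    mean, m2, weight = 0.0, 0.0, 0.0
    for x in values:
        weight += 1.0
        delta = x - mean
        mean += delta / weight
        m2 += delta * (x - mean)
    return WelfordState(mean, m2, weight)


def welford_combine(a: WelfordState, b: WelfordState) -> WelfordState:
    # Merge two partial states; this is what makes a multi-layer reduction possible.
    weight = a.weight + b.weight
    if weight == 0.0:
        return WelfordState(0.0, 0.0, 0.0)
    delta = b.mean - a.mean
    mean = a.mean + delta * (b.weight / weight)
    m2 = a.m2 + b.m2 + delta * delta * a.weight * b.weight / weight
    return WelfordState(mean, m2, weight)


data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
full = welford_reduce(data)
merged = welford_combine(welford_reduce(data[:3]), welford_reduce(data[3:]))
```

Reducing the two halves separately and combining them gives the same state as a single pass over all the data, up to floating-point rounding.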
Wang, Eikan	9921b48558	Extend Inductor to support the third-party backend (#106874 ) ## Summary This PR re-lands https://github.com/pytorch/pytorch/pull/100706 and addresses its compilation latency regression. ## Root Cause For the C++/OpenMP backend, `codecache.pick_vec_isa()`, which checks the vectorization ISA, is a time-consuming, one-shot operation. It makes importing the `codegen.cpp` package slower because the package's `LoopLevel` is decorated with `@dataclasses.dataclass`, and `codecache.pick_vec_isa()` is invoked to initialize the `simd_nelements` of `LoopLevel`. `c14cf312c9/torch/_inductor/codegen/cpp.py (L2883C53-L2883C53)` The Triton backend does not need to touch this code, but we'd prefer to keep the code uniform. Therefore, the new design registers both `CppScheduling` for CPU and `TritonScheduling` for Triton regardless of whether the current backend is Triton. This brings additional overhead to the Triton backend. ```python def init_backend_registration(self): if get_scheduling_for_device("cpu") is None: from .codegen.cpp import CppScheduling register_backend_for_device("cpu", CppScheduling, WrapperCodeGen) if get_scheduling_for_device("cuda") is None: from .codegen.triton import TritonScheduling register_backend_for_device("cuda", TritonScheduling, WrapperCodeGen) ``` ## Solution To resolve the compilation latency regression for the Triton backend, we changed `LoopLevel` slightly ([new code changes](https://github.com/pytorch/pytorch/pull/106874/files#diff-5ab7b0235e2076a5fc6629ba0b109208940f5b94f5c13babc3e0f87cf4fcec82R2893-R2904)) by moving the initialization of `simd_nelements` to `__post_init__`, which restores the compilation performance. 
## Compilation Latency Performance Result We ran a single model benchmark and reproduced the compilation regression: - Run `python benchmarks/dynamo/torchbench.py -dcuda --training --performance --inductor --only hf_Bart` - W/ PR #100706, the compilation latency is about 57~58 ``` dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks cuda,hf_Bart,4,1.556712,109.676554,57.055242,0.936330,5.760698,6.152422,642,1,8,7 cuda,hf_Bart,4,1.646658,109.621747,57.909817,0.936330,5.760698,6.152422,642,1,8,7 ``` - W/O PR #100706, the compilation latency is about 46~47 ``` dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks cuda,hf_Bart,4,1.599065,108.702480,47.490346,0.936330,5.760698,6.152422,642,1,8,7 cuda,hf_Bart,4,1.588419,108.431411,46.983041,0.936330,5.760698,6.152422,642,1,8,7 ``` This PR fixed the compilation performance regression. - W/ this PR #106874, the compilation latency is about 47~48 ``` dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks cuda,hf_Bart,4,1.586261,108.149467,47.481058,0.936330,5.760698,6.152422,642,1,8,7 cuda,hf_Bart,4,1.758915,108.613899,47.925633,0.936330,5.760698,6.152422,642,1,8,7 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/106874 Approved by: https://github.com/jansel	2023-08-16 04:11:36 +00:00
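Why moving the initialization into `__post_init__` helps can be shown with a small standalone sketch. It assumes the original cost came from a dataclass field default being evaluated when the class body executes (that is, at import time); `pick_vec_width`, `EagerLoopLevel`, and `LazyLoopLevel` are hypothetical names, and `pick_vec_width` merely stands in for an expensive one-shot probe such as `codecache.pick_vec_isa()`.

```python
import dataclasses
from typing import Optional

probe_calls = []  # records every invocation of the (pretend) expensive probe


def pick_vec_width() -> int:
    # Hypothetical stand-in for an expensive, one-shot ISA check.
    probe_calls.append("probe")
    return 16


@dataclasses.dataclass
class EagerLoopLevel:
    # The default is evaluated once, while the class body runs,
    # so merely importing the module pays for the probe.
    simd_nelements: int = pick_vec_width()


@dataclasses.dataclass
class LazyLoopLevel:
    # No probe at class-definition time; it runs only when an instance
    # is created without an explicit value.
    simd_nelements: Optional[int] = None

    def __post_init__(self) -> None:
        if self.simd_nelements is None:
            self.simd_nelements = pick_vec_width()


calls_after_import = len(probe_calls)  # only the eager class has triggered a probe
level = LazyLoopLevel()
calls_after_instance = len(probe_calls)
```

In this sketch a backend that never instantiates the lazy class never pays for the probe, which matches the motivation given in the commit message.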
Yanbo Liang	1819fe1324	Revert "Extend Inductor to support the third-party backend (#100706 )" (#106652 ) This reverts commit `05bd24bb35`. It caused compilation time regression on torchbench, huggingface and dynamic models. Pull Request resolved: https://github.com/pytorch/pytorch/pull/106652 Approved by: https://github.com/davidberard98, https://github.com/voznesenskym	2023-08-05 06:41:08 +00:00
haozhe.zhu	60237ccbdf	fix bf16 constant accuracy (#105827 ) This PR aims to sort out the data type for `constant`. The constant should be promoted to float (see https://github.com/pytorch/pytorch/pull/105440). So there are several changes: - Data type propagation should propagate a constant node to `float` dtype if its original dtype is `bfloat16` - We do not need to insert `to_dtype` after the `constant` node; directly initializing an `fp32` constant is faster. ``` vectorized<bfloat16> tmp(value); vectorized <float> tmp1 = cvt_bf16_fp32(tmp); -> vectorized<float> tmp(value); ``` - move `constant` out of the list for `all operations can support bf16 without converting to fp32` Pull Request resolved: https://github.com/pytorch/pytorch/pull/105827 Approved by: https://github.com/jgong5, https://github.com/jansel	2023-08-03 01:17:50 +00:00
Wang, Eikan	05bd24bb35	Extend Inductor to support the third-party backend (#100706 ) This PR intends to extend Inductor to support the third-party backend that only focuses on the code generation just like what C++/OpenMP and Triton backend have done. Currently, the generated code by Inductor contains two major parts. One is the kernel, and the other is the Python wrapper to glue the kernel. Therefore, the third-party backend needs to customize the two parts to generate its specific code. - Python wrapper code generation Inductor provides a `WrapperCodeGen` class to generate the Python wrapper code to glue the kernel. Therefore, it is straightforward for the third-party backend to generate the backend-specific Python wrapper code. It just needs to inherit the `WrapperCodeGen` class and purposely override the particular member functions. - Kernel code generation It is driven by different `Scheduling`. Hence, the third-party backend needs to provide a custom `Scheduling` for its specific kernel code generation. Currently, `CppScheduling` and `TritonScheduling` are for C++/OpenMP and Triton backend, respectively. But there is no common `Scheduling` class. Based on the scheduling invocation, this PR abstracts a common `Scheduling` class containing the following member functions. - [group_fn](`71c4becda7/torch/_inductor/scheduler.py (LL649C64-L649C64)`) - [flush](`71c4becda7/torch/_inductor/scheduler.py (L1150)`) - [can_fuse_vertical](`71c4becda7/torch/_inductor/scheduler.py (L1006)`) - [can_fuse_horizontal](`71c4becda7/torch/_inductor/scheduler.py (LL1008C45-L1008C64)`) - [codegen_template](`71c4becda7/torch/_inductor/scheduler.py (L1234)`) _This function is only available for triton. If the third-party backend behaves as a sub-class of `TritonScheduling`, it can override it or reuse it._ - [codegen_nodes](`71c4becda7/torch/_inductor/scheduler.py (L1234)`) - [codegen_sync](`71c4becda7/torch/_inductor/scheduler.py (LL1251C1-L1251C1)`). 
_This function is only available for triton debug purpose. But it might also be useful for other computation devices. Therefore, we'd prefer to keep this function._ The third-party backend needs to inherit from the `Scheduling` class and implement these functions. Regarding some other classes like `CppKernel` and `TritonKernel` for code generation, they are used by or part of the logic of either `Scheduling` or `WrapperCodeGen`. Hence, this PR does not define the interface and leaves the flexibility to the third-party backend. The third-party backend can decide to implement these classes from scratch or reuse them by inheriting and overriding them. Pull Request resolved: https://github.com/pytorch/pytorch/pull/100706 Approved by: https://github.com/jansel	2023-08-02 05:13:51 +00:00
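The extension pattern described in the commit above (subclass a common `Scheduling` class, then register it per device) can be sketched without importing PyTorch. Everything below is a toy re-implementation for illustration: the method names come from the commit message, but the signatures, the registry dictionary, the `"my_device"` device name, and `MyScheduling` are assumptions, not Inductor's actual API.

```python
from abc import ABC, abstractmethod


class Scheduling(ABC):
    """Toy interface; method names follow the commit message, signatures are assumed."""

    @abstractmethod
    def group_fn(self, sizes): ...

    @abstractmethod
    def can_fuse_vertical(self, node1, node2): ...

    @abstractmethod
    def can_fuse_horizontal(self, node1, node2): ...

    @abstractmethod
    def codegen_nodes(self, nodes): ...

    @abstractmethod
    def flush(self): ...

    def codegen_template(self, template_node, epilogue_nodes):
        # Optional hook: described as Triton-only in the commit message.
        raise NotImplementedError

    def codegen_sync(self):
        # Optional hook: a no-op unless the device needs explicit synchronization.
        pass


class WrapperCodeGen:
    """Toy placeholder for the Python wrapper generator."""


_backends = {}  # device name -> (scheduling class, wrapper class)


def register_backend_for_device(device, scheduling_cls, wrapper_cls):
    _backends[device] = (scheduling_cls, wrapper_cls)


def get_scheduling_for_device(device):
    entry = _backends.get(device)
    return entry[0] if entry else None


class MyScheduling(Scheduling):
    """Hypothetical third-party backend that just records what it would emit."""

    def __init__(self):
        self.emitted = []

    def group_fn(self, sizes):
        return tuple(tuple(s) for s in sizes)

    def can_fuse_vertical(self, node1, node2):
        return False

    def can_fuse_horizontal(self, node1, node2):
        return False

    def codegen_nodes(self, nodes):
        self.emitted.extend(f"kernel({n})" for n in nodes)

    def flush(self):
        out, self.emitted = self.emitted, []
        return out


if get_scheduling_for_device("my_device") is None:
    register_backend_for_device("my_device", MyScheduling, WrapperCodeGen)
```

The abstract base class makes a backend that forgets one of the required member functions fail at instantiation time rather than midway through code generation.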
haozhe.zhu	952021934f	inductor: legalize fp16 (#100857 ) This PR aims to vectorize FP16 for CPU with what BF16 has done. Pull Request resolved: https://github.com/pytorch/pytorch/pull/100857 Approved by: https://github.com/jgong5, https://github.com/jansel	2023-07-27 02:31:40 +00:00
PyTorch MergeBot	dfc9874740	Revert "inductor: promote half/bfloat16 constant to float for cpu vectorization path (#105440 )" This reverts commit `18bcf62bbc`. Reverted https://github.com/pytorch/pytorch/pull/105440 on behalf of https://github.com/XiaobingSuper due to introduce core dumped when init bfloat16 zero tensor ([comment](https://github.com/pytorch/pytorch/pull/105440#issuecomment-1643079005))	2023-07-20 03:56:44 +00:00
Justin Chu	cb7a30f656	[BE] Enable ruff's UP rules and autoformat inductor/ (#105431 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/105431 Approved by: https://github.com/albanD	2023-07-19 13:45:00 +00:00
XiaobingSuper	18bcf62bbc	inductor: promote half/bfloat16 constant to float for cpu vectorization path (#105440 ) As in the scalar path, we should also promote half/bfloat16 constants to float for better accuracy. After this PR, the TIMM ```dm_nfnet``` model passes on the AMP path. Pull Request resolved: https://github.com/pytorch/pytorch/pull/105440 Approved by: https://github.com/jgong5, https://github.com/jansel	2023-07-19 06:53:23 +00:00
XiaobingSuper	4b3c261a2e	inductor: fix issue of vectorization when the store's index is constant value (#105314 ) Fix #104515 Pull Request resolved: https://github.com/pytorch/pytorch/pull/105314 Approved by: https://github.com/jgong5, https://github.com/desertfire	2023-07-18 04:54:25 +00:00
lezcano	87a3ed58cb	Fix ranges for range vars (#104987 ) Ranges are inclusive on both ends... We take this chance to delete a stale comment Pull Request resolved: https://github.com/pytorch/pytorch/pull/104987 Approved by: https://github.com/jgong5, https://github.com/eellison	2023-07-14 13:43:05 +00:00
Peter Bell	66fb83293e	[inductor] Add min/max to index propagation pass (#105020 ) This allows `ops.minimum` and `ops.maximum` to be hoisted for indirect indexing into direct indexing expressions. I also add support to the cpp printer for Min/Max and fix the triton printer to support multi-argument Min/Max. Pull Request resolved: https://github.com/pytorch/pytorch/pull/105020 Approved by: https://github.com/lezcano	2023-07-12 19:03:01 +00:00
Peter Bell	e80787c8e1	[inductor] Split ops.reduction into reduction and store_reduction (#102737 ) This is intended as a first step towards reductions with multiple outputs. This also incidentally improves CSE of reductions under C++ codegen. For example, ```python def fn(x): return torch.argmin(x, dim=-1), torch.argmin(x, dim=-1) ``` Currently this generates two reductions, where the common load is CSEd ```cpp for(long i1=static_cast<long>(0L); i1<static_cast<long>(10); i1+=static_cast<long>(1L)) { auto tmp0 = in_ptr0[static_cast<long>(i1 + (10L*i0))]; if (tmp_acc0.value > tmp0) { tmp_acc0.index = i1; tmp_acc0.value = tmp0; } if (tmp_acc1.value > tmp0) { tmp_acc1.index = i1; tmp_acc1.value = tmp0; } } auto tmp1 = tmp_acc0.index; out_ptr0[static_cast<long>(i0)] = tmp1; auto tmp2 = tmp_acc1.index; out_ptr1[static_cast<long>(i0)] = tmp2; ``` but with this change it gets CSEd to a single accumulator ```cpp for(long i1=static_cast<long>(0L); i1<static_cast<long>(10L); i1+=static_cast<long>(1L)) { auto tmp0 = in_ptr0[static_cast<long>(i1 + (10L*i0))]; if (tmp_acc0.value > tmp0) { tmp_acc0.index = i1; tmp_acc0.value = tmp0; } } auto tmp1 = tmp_acc0.index; out_ptr0[static_cast<long>(i0)] = tmp1; out_ptr1[static_cast<long>(i0)] = tmp1; ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/102737 Approved by: https://github.com/jgong5, https://github.com/lezcano	2023-07-08 20:48:29 +00:00
Peter Bell	0ceca92f80	[inductor] Add single pass "var_unnormalized" reduction_type (#102486 ) This is a bit inefficient because it computes the mean and throws it away since ir.Reduction nodes only have 1 output. However, the mean can at least be scheduled into the same loop as the variance now since there is no data dependency. Thus we can take fewer passes over the data. Pull Request resolved: https://github.com/pytorch/pytorch/pull/102486 Approved by: https://github.com/lezcano, https://github.com/jansel	2023-07-08 20:48:29 +00:00
lezcano	710abc41cc	Implement bound_sympy (#104559 ) The analysis for SymPy expressions was incorrect: even though it said that the assumption was "smoothness", the assumption was, in fact, that the formula was monotone in every variable. In other words, it was assuming that the derivative does not change sign in any variable (!!). We implement a function that, given bounds on the values of the free symbols of a sympy expression, gives a bound on the expression itself. We reshuffle a few things in value_ranges.py to create a `SymPyValueRangeAnalysis` class, but we do not change any code really. The only relevant change in that file is the addition of the `bound_sympy` function. We do this because we don't want to inadvertently use any fallbacks in this case. Pull Request resolved: https://github.com/pytorch/pytorch/pull/104559 Approved by: https://github.com/eellison	2023-07-07 23:52:14 +00:00
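The pitfall named in the commit above, assuming an expression is monotone in every variable, can be demonstrated with a few lines of interval arithmetic. This is a hedged sketch in plain Python, not the `value_ranges.py` implementation; `Interval` and `endpoint_bound` are hypothetical helpers, and ranges are treated as inclusive on both ends.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Interval:
    lower: float
    upper: float  # inclusive on both ends

    def __add__(self, other: "Interval") -> "Interval":
        return Interval(self.lower + other.lower, self.upper + other.upper)

    def __mul__(self, other: "Interval") -> "Interval":
        # The extremes of a product lie among the four endpoint products.
        corners = [a * b for a, b in product((self.lower, self.upper),
                                             (other.lower, other.upper))]
        return Interval(min(corners), max(corners))

    def contains(self, value: float) -> bool:
        return self.lower <= value <= self.upper


def endpoint_bound(f, lower, upper) -> Interval:
    # "Monotone" shortcut: evaluate only at the two ends of the range.
    a, b = f(lower), f(upper)
    return Interval(min(a, b), max(a, b))


def square(v):
    return v * v


x = Interval(-2, 3)
naive = endpoint_bound(square, x.lower, x.upper)  # misses the minimum at 0
sound = square(x)                                 # interval arithmetic on x * x
```

The endpoint shortcut yields `[4, 9]`, which wrongly excludes `square(0) == 0`, while the interval product yields `[-6, 9]`, which is looser than the true range `[0, 9]` but contains every attainable value.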

200 Commits