pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
leslie-fang-intel	73a6a40346	[Inductor][CPP] Fix outer loop fusion buffer removed (#144243 ) Summary Fix issue: https://github.com/pytorch/pytorch/issues/144186. For the test case reported in the issue, we have saw some nodes with `LoopNest` - `LoopNest(loops=[LoopLevel(var=x0, size=8, offset=0, tiled_size=0, steps=1, parallel=0, simd_omp=False, simd_vec=False, collapsed=False, is_reduction=False), LoopLevel(var=x1, size=8, offset=0, tiled_size=0, steps=1, parallel=0, simd_omp=False, simd_vec=False, collapsed=False, is_reduction=True)], kernel=<torch._inductor.codegen.cpp.CppKernelProxy object at 0x7fc724426680>)` - `LoopNest(loops=[LoopLevel(var=x0, size=8, offset=0, tiled_size=0, steps=16, parallel=0, simd_omp=False, simd_vec=True, collapsed=False, is_reduction=False), LoopLevel(var=x1, size=8, offset=0, tiled_size=0, steps=16, parallel=0, simd_omp=False, simd_vec=True, collapsed=False, is_reduction=True)], kernel=<torch._inductor.codegen.cpp.CppKernelProxy object at 0x7fc75c2cae60>)` Although, these 2 `LoopNest` have same `range` and `var`, but different `steps` 1 and 16. So, they will fail to be merged with outer loops. And since when we localize the buffer, we have removed the global buffers. We need to restore the status of `V.graph.removed_buffers` before fallback to codegen without outer loop fusion. Test Plan ``` python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_outer_loop_fusion_buffer_remove ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/144243 Approved by: https://github.com/jgong5	2025-01-07 01:17:46 +00:00
blzheng	c09bf71bd6	[Inductor][CPU] Fix C++ compile error of torch.max on bool type (#143848 ) Fix https://github.com/pytorch/pytorch/issues/143568 Before: ![image](https://github.com/user-attachments/assets/3e1e869e-7ae7-45c0-a334-8a663028e003) After: ![image](https://github.com/user-attachments/assets/91f72920-64bd-449a-a6c6-6048409c1450) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143848 Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel	2025-01-03 09:00:43 +00:00
xinan.lin	01034e963c	[AOTI] Not use AOTI_TORCH_CHECK in non AOTI mode. (#143970 ) Fix #143967 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143970 Approved by: https://github.com/EikanWang, https://github.com/jansel	2024-12-31 06:28:32 +00:00
leslie-fang-intel	74028cfd0c	[Inductor][CPP] Fix Data Type issue of frexp (#143746 ) Summary Fix issue: https://github.com/pytorch/pytorch/issues/143729. `frexp` has 1 input but 2 output tensor with different data type, current `deduce_dtype_for_cpp_cse_variable` can't deduce the data type for each output correctly due to missing of output index. In this PR, we set the data type of cse var in the codegen of `frexp` and avoid it being overridden in the following flow. Test Plan ``` python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_frexp ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/143746 Approved by: https://github.com/jgong5	2024-12-28 06:00:13 +00:00
leslie-fang-intel	607884c9af	[Inductor][CPP] Fix bitwise shift with corner inputs (#143635 ) Summary Fix issue https://github.com/pytorch/pytorch/issues/143555 and https://github.com/pytorch/pytorch/issues/143566, we can align the implementation with Eager: `29b586bbad/aten/src/ATen/native/cpu/BinaryOpsKernel.cpp (L501)` at these corner inputs. Test Plan ``` python test/inductor/test_cpu_repro.py -k test_bitwise_shift_corner_inputs ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/143635 Approved by: https://github.com/jgong5	2024-12-20 13:47:40 +00:00
leslie-fang-intel	00b0210139	[Inductor] Use sleef implementation for CPP backend asinh codegen (#142360 ) Summary Fix https://github.com/pytorch/pytorch/issues/142345. Previously, we use `asinh(x) = log(x + sqrt(1 + x2))` to calculate the result of `asinh`, the issue happens when input with `-10000.1`, which makes `x + sqrt(1 + x2)` close to 0 and log(0) is invalid. We use the `sleef` implementation in this PR to fix this issue. Test Plan ``` python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_asinh_with_corner_inputs ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/142360 Approved by: https://github.com/jgong5	2024-12-14 00:27:55 +00:00
Tom Ritchford	da67a6a7bb	[inductor] Replace set by OrderedSet (#138466 ) Uses the set_linter from https://github.com/pytorch/pytorch/pull/138454 and considerable manual editing Pull Request resolved: https://github.com/pytorch/pytorch/pull/138466 Approved by: https://github.com/eellison	2024-12-13 16:08:45 +00:00
eellison	b731ced91f	Prologue Fusion (#134532 ) This PR extends our ability to fuse pointwise nodes onto triton templates with the ability to fuse pointwise nodes into triton templates - prologue fusion. Similar to the store_output api: `{{store_output(("idx_m", "idx_n"), "acc", "mask")}}` And the modification api: ``` {{ modification( subgraph_number=0, output_name="post_mod_scores", score="qk", out="qk" ) \| indent_except_first(1) }} ``` We have: ```{{load_input("B", "b", ("idx_m", "idx_n"), mask=None if EVEN_K else "b_mask", indent_width=8)}}``` Because we are now loading the input with explicit indices and mask, I needed to rewrite the mm kernel to no longer update the [pointers by BLOCK_K](`bb03ef7aca/torch/_inductor/kernel/mm.py (L110-L111)`) on every iteration and instead on each iteration compute indices from the the k_idx of each loop. This did not have any perf difference. There are a couple main use cases for prologue fusion: - Fusing dequants into a matmul. particularly for more bandwidth bound scenarios. - Fusing gather into a matmul. This is useful particularly in MOE. See https://github.com/pytorch/pytorch/issues/134535 for more details. Prologue fusion is generally much less profitable than epilogue fusion, because it must be applied to an element of an input on each loop of the matmul, compared to only once in the epilogue (gather into matmul is a potential exception). Accordingly, we are much less aggressive in attempting to fuse prologue fusion. We only attempt fusion if it does not increase the number of memory bytes read instead the triton template, multipled by a small factor to allow gathers. This restricts reliably unprofitable fusions like fp32->fp16 inside kernel. In future pr we could potentially have api of being more aggressive if we know we are in a bandwidth bound regime. See: https://github.com/pytorch/pytorch/pull/134532/files#diff-d2539c9c8dc6a3d7e457767a880612e96d3c85752a77ead49a9e4e00a3e4c3c7R3060-R3066 Other notes: By default we will upcast to fp32 inside every kernel. This matches eager numerics. This is fine enough for epilogue because it is only done once (although it is probably unnecessary for say a relu) but tanks perf for prologue. I am currently using the `codegen_upcast_to_fp32` option to avoid it, but that will not work for libdevice calls that require fp32. We will need https://github.com/pytorch/pytorch/pull/136778/ and dtype-aware codegen to upcast fp16 ops into libdevice calls. With prologue fusion, we now have essentially separate kernels for each input, and for the output. I had to increase the number of fields that are swapped out in `set_subgraph_body` by a large number :/ I also update the fusion logic because the inputs will have a different group than the outputs. Maybe as part of enabling multiple outputs, this could get cleaned up a bit so.. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134532 Approved by: https://github.com/jansel	2024-12-13 04:18:25 +00:00
Tom Ritchford	dc23f1944a	Remove unused Python variables in torch/[_-a]* (#133492 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133492 Approved by: https://github.com/albanD	2024-12-12 17:39:14 +00:00
PyTorch MergeBot	cd1b5924d5	Revert "[Inductor] Use sleef implementation for CPP backend asinh codegen (#142360 )" This reverts commit `79cf8fa751`. Reverted https://github.com/pytorch/pytorch/pull/142360 on behalf of https://github.com/jeanschmidt due to seems to have broken macos tests ([comment](https://github.com/pytorch/pytorch/pull/142360#issuecomment-2539143039))	2024-12-12 14:42:55 +00:00
leslie-fang-intel	79cf8fa751	[Inductor] Use sleef implementation for CPP backend asinh codegen (#142360 ) Summary Fix https://github.com/pytorch/pytorch/issues/142345. Previously, we use `asinh(x) = log(x + sqrt(1 + x2))` to calculate the result of `asinh`, the issue happens when input with `-10000.1`, which makes `x + sqrt(1 + x2)` close to 0 and log(0) is invalid. We use the `sleef` implementation in this PR to fix this issue. Test Plan ``` python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_asinh_with_corner_inputs ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/142360 Approved by: https://github.com/jgong5	2024-12-12 05:40:48 +00:00
leslie-fang-intel	06075d3d18	[Inductor][CPP] Fix Mask Dtype mismatch (#142103 ) Summary Fix issue: https://github.com/pytorch/pytorch/issues/141559. The `vec_mask` store data type doesn't aligned when doing `bitwise_and`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/142103 Approved by: https://github.com/jgong5	2024-12-12 01:21:32 +00:00
PyTorch MergeBot	233853a66f	Revert "Prologue Fusion (#134532 )" This reverts commit `59ab3825e7`. Reverted https://github.com/pytorch/pytorch/pull/134532 on behalf of https://github.com/clee2000 due to A couple of PRs in this stack are breaking internally on different tests ([comment](https://github.com/pytorch/pytorch/pull/134532#issuecomment-2536643675))	2024-12-11 17:32:26 +00:00
PyTorch MergeBot	5c97ac9721	Revert "Remove unused Python variables in torch/[_-a]* (#133492 )" This reverts commit `fda975a7b3`. Reverted https://github.com/pytorch/pytorch/pull/133492 on behalf of https://github.com/clee2000 due to Sorry, I need to revert this in order to revert something else. The only thing you need to do is rebase and remerge ([comment](https://github.com/pytorch/pytorch/pull/133492#issuecomment-2536635516))	2024-12-11 17:29:12 +00:00
Tom Ritchford	fda975a7b3	Remove unused Python variables in torch/[_-a]* (#133492 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133492 Approved by: https://github.com/albanD	2024-12-10 21:48:44 +00:00
eellison	59ab3825e7	Prologue Fusion (#134532 ) This PR extends our ability to fuse pointwise nodes onto triton templates with the ability to fuse pointwise nodes into triton templates - prologue fusion. Similar to the store_output api: `{{store_output(("idx_m", "idx_n"), "acc", "mask")}}` And the modification api: ``` {{ modification( subgraph_number=0, output_name="post_mod_scores", score="qk", out="qk" ) \| indent_except_first(1) }} ``` We have: ```{{load_input("B", "b", ("idx_m", "idx_n"), mask=None if EVEN_K else "b_mask", indent_width=8)}}``` Because we are now loading the input with explicit indices and mask, I needed to rewrite the mm kernel to no longer update the [pointers by BLOCK_K](`bb03ef7aca/torch/_inductor/kernel/mm.py (L110-L111)`) on every iteration and instead on each iteration compute indices from the the k_idx of each loop. This did not have any perf difference. There are a couple main use cases for prologue fusion: - Fusing dequants into a matmul. particularly for more bandwidth bound scenarios. - Fusing gather into a matmul. This is useful particularly in MOE. See https://github.com/pytorch/pytorch/issues/134535 for more details. Prologue fusion is generally much less profitable than epilogue fusion, because it must be applied to an element of an input on each loop of the matmul, compared to only once in the epilogue (gather into matmul is a potential exception). Accordingly, we are much less aggressive in attempting to fuse prologue fusion. We only attempt fusion if it does not increase the number of memory bytes read instead the triton template, multipled by a small factor to allow gathers. This restricts reliably unprofitable fusions like fp32->fp16 inside kernel. In future pr we could potentially have api of being more aggressive if we know we are in a bandwidth bound regime. See: https://github.com/pytorch/pytorch/pull/134532/files#diff-d2539c9c8dc6a3d7e457767a880612e96d3c85752a77ead49a9e4e00a3e4c3c7R3060-R3066 Other notes: By default we will upcast to fp32 inside every kernel. This matches eager numerics. This is fine enough for epilogue because it is only done once (although it is probably unnecessary for say a relu) but tanks perf for prologue. I am currently using the `codegen_upcast_to_fp32` option to avoid it, but that will not work for libdevice calls that require fp32. We will need https://github.com/pytorch/pytorch/pull/136778/ and dtype-aware codegen to upcast fp16 ops into libdevice calls. With prologue fusion, we now have essentially separate kernels for each input, and for the output. I had to increase the number of fields that are swapped out in `set_subgraph_body` by a large number :/ I also update the fusion logic because the inputs will have a different group than the outputs. Maybe as part of enabling multiple outputs, this could get cleaned up a bit so.. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134532 Approved by: https://github.com/jansel	2024-12-10 16:25:57 +00:00
Jason Ansel	0367a31401	[inductor] Minor typing changes (#142219 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/142219 Approved by: https://github.com/Skylion007, https://github.com/yanboliang	2024-12-07 17:48:37 +00:00
Mwiza Kunda	f8a64c324e	Broadcast constants on vectorised stores in `CppTile2DKernel` (#140262 ) Currently constants are not broadcasted on vectorised stores in `CppTile2DKernel`. This leads to errors like the following: ```shell error:: request for member 'store' in 'tmp1', which is of non-class type 'signed char' 61 \| tmp1.store(tmp2 + static_cast<int64_t>(8L*x0_inner), static_cast<int64_t>(8)); \| ^~~~~ ``` This PR adds the required broadcasting. Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/140262 Approved by: https://github.com/jgong5	2024-12-03 09:15:17 +00:00
Aaron Gokaslan	08db735629	[BE]: Update mypy to 1.13.0 (#140808 ) Update mypy to 1.13.0 . Should hopefully reduce linting time. Has support for orjson cache serialization which should improve mypy cache perf if orjson is installed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140808 Approved by: https://github.com/ezyang, https://github.com/malfet	2024-12-03 02:50:10 +00:00
PyTorch MergeBot	daa77f3d9f	Revert "[BE]: Update mypy to 1.13.0 (#140808 )" This reverts commit `00134d68af`. Reverted https://github.com/pytorch/pytorch/pull/140808 on behalf of https://github.com/huydhn due to This is failing a distributed test in trunk, target determination missed this test and did not run it on PR ([comment](https://github.com/pytorch/pytorch/pull/140808#issuecomment-2512788426))	2024-12-02 20:47:43 +00:00
Aaron Gokaslan	00134d68af	[BE]: Update mypy to 1.13.0 (#140808 ) Update mypy to 1.13.0 . Should hopefully reduce linting time. Has support for orjson cache serialization which should improve mypy cache perf if orjson is installed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140808 Approved by: https://github.com/ezyang, https://github.com/malfet	2024-12-02 18:47:54 +00:00
Sun, Jiayi	a964f31d7b	[inductor] modify the heuristic for loop split optimization (#137550 ) ### Summary 1. Improve the heuristic for loop split optimization: The divisor needs to be an integer and cannot be too small (needs to be greater than 8, this threshold has been tuned). 2. Improve the heuristic for disabling vectorization: add quantity_threshold and relax ratio_threshold for the number of non-contiguous load/store/index_expr in the loop body. This PR will bring performance improvements for two torchbench models(functorch_dp_cifar10, opacus_cifar10) and one timm model(sebotnet33ts_256). Pull Request resolved: https://github.com/pytorch/pytorch/pull/137550 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jansel	2024-11-25 09:16:30 +00:00
haozhe.zhu	d0fd42eb3a	[inductor] refine loop split logic (#128812 ) This PR aims to improves parallelization by collapsing vectorized loop. https://github.com/pytorch/pytorch/issues/122281 For such case, the parallel level is only `2`. And the vectorized loop cannot be collapsed. ``` #pragma omp for for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L)) { for(long x1=static_cast<long>(0L); x1<static_cast<long>(199984L); x1+=static_cast<long>(16L)) { auto tmp0 = at::vec::VectorizedN<int64_t,2>::loadu(in_ptr0 + static_cast<long>(x1 + (199985Lx0)), 16); tmp0.store(out_ptr0 + static_cast<long>(x1 + (209985Lx0)), 16); } #pragma omp simd simdlen(8) for(long x1=static_cast<long>(199984L); x1<static_cast<long>(199985L); x1+=static_cast<long>(1L)) { auto tmp0 = in_ptr0[static_cast<long>(x1 + (199985Lx0))]; out_ptr0[static_cast<long>(x1 + (209985Lx0))] = tmp0; } } ``` After this PR, we will gen code ``` #pragma omp for collapse(2) for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L)) { for(long x1=static_cast<long>(0L); x1<static_cast<long>(199985L); x1+=static_cast<long>(16L)) { if (x1 >= 0 && x1 <199984) { auto tmp0 = at::vec::VectorizedN<int64_t,2>::loadu(in_ptr0 + static_cast<long>(x1 + (199985Lx0)), 16); tmp0.store(out_ptr0 + static_cast<long>(x1 + (209985Lx0)), 16); } if (x1 >= 199984 && x1 <199985) { auto tmp0 = in_ptr0[static_cast<long>(x1 + (199985Lx0))]; out_ptr0[static_cast<long>(x1 + (209985Lx0))] = tmp0; } } } ``` ### Highlight For reduction case, we have some side-effect here. For below case, we vectorized `x1` dim and reduction at `x2` dim. ``` #pragma omp for for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(39L); x0+=static_cast<int64_t>(1L)) { for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(16L); x1+=static_cast<int64_t>(8L)) { { float tmp_acc0 = -std::numeric_limits<float>::infinity(); at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity()); for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(18L); x2+=static_cast<int64_t>(1L)) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<int64_t>(x1 + (17Lx2) + (306Lx0)), 8); tmp_acc0_vec = at::vec::maximum(tmp_acc0_vec, tmp0); } [&] { __at_align__ std::array<float, 8> tmpbuf; tmp_acc0_vec.store(tmpbuf.data(), 8); #pragma GCC unroll 8 for (long x1_inner = 0; x1_inner < 8; x1_inner++) { out_ptr1[static_cast<int64_t>(x0 + (39Lx1) + (39Lx1_inner))] = tmpbuf[x1_inner]; } } () ; } } #pragma omp simd simdlen(4) for(int64_t x1=static_cast<int64_t>(16L); x1<static_cast<int64_t>(17L); x1+=static_cast<int64_t>(1L)) { { float tmp_acc0 = -std::numeric_limits<float>::infinity(); for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(18L); x2+=static_cast<int64_t>(1L)) { auto tmp0 = in_ptr1[static_cast<int64_t>(x1 + (17Lx2) + (306Lx0))]; tmp_acc0 = max_propagate_nan(tmp_acc0, tmp0); } out_ptr1[static_cast<int64_t>(x0 + (39Lx1))] = tmp_acc0; } } } ``` After collapse, the loop order will be `x1 -> x2 -> x1_tail_part`, thus we will need a `tmp_acc_arr` to store the reduction result for `x1_tail_part`. And for `reduction_stores`, we also need to check `x1`'s value like what we do in the `loopbody` since the `reduction_stores` happened between `x1` and `x2` loops. ``` #pragma omp for collapse(2) for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(39L); x0+=static_cast<int64_t>(1L)) { for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(17L); x1+=static_cast<int64_t>(8L)) { { float tmp_acc0_arr[8]; ######### need an array to hold acc result for tail part for (int i = 0; i < 8; i++) { tmp_acc0_arr[i] = -std::numeric_limits<float>::infinity(); } float tmp_acc0 = -std::numeric_limits<float>::infinity(); at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity()); for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(18L); x2+=static_cast<int64_t>(1L)) { { if(C10_LIKELY(x1 >= static_cast<int64_t>(0) && x1 < static_cast<int64_t>(16L))) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<int64_t>(x1 + (17Lx2) + (306Lx0)), 8); tmp_acc0_vec = at::vec::maximum(tmp_acc0_vec, tmp0); } if(C10_UNLIKELY(x1 >= static_cast<int64_t>(16L) && x1 < static_cast<int64_t>(17L))) { for (long x1_tail = static_cast<int64_t>(16L); x1_tail < static_cast<int64_t>(17L); x1_tail++) { auto tmp0 = in_ptr1[static_cast<int64_t>(x1_tail + (17Lx2) + (306Lx0))]; tmp_acc0_arr[x1_tail - static_cast<int64_t>(16L)] = max_propagate_nan(tmp_acc0_arr[x1_tail - static_cast<int64_t>(16L)], tmp0); } } } } ############### reduction stores if(C10_LIKELY(x1 >= static_cast<int64_t>(0) && x1 < static_cast<int64_t>(16L))) { [&] { __at_align__ std::array<float, 8> tmpbuf; tmp_acc0_vec.store(tmpbuf.data(), 8); #pragma GCC unroll 8 for (long x1_inner = 0; x1_inner < 8; x1_inner++) { out_ptr1[static_cast<int64_t>(x0 + (39Lx1) + (39Lx1_inner))] = tmpbuf[x1_inner]; } } () ; } if(C10_UNLIKELY(x1 >= static_cast<int64_t>(16L) && x1 < static_cast<int64_t>(17L))) { for (long x1_tail = static_cast<int64_t>(16L); x1_tail < static_cast<int64_t>(17L); x1_tail++) { out_ptr1[static_cast<int64_t>(x0 + (39Lx1_tail))] = tmp_acc0_arr[x1_tail - static_cast<int64_t>(16L)]; } } } } } ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128812 Approved by: https://github.com/jgong5	2024-11-25 04:46:07 +00:00
Aaron Gokaslan	12e95aa4ee	[BE]: Apply PERF401 autofixes from ruff (#140980 ) * Automatically applies ruff rule 401. Turns loops into equivalent list comprehensions which are faster and do not leak the scope of the loop variables. * list comprehensions not only often have better typing, but are 50+% faster than for loops on overhead. They also preserve length information etc and are better for the interpreter to optimize. * Manually went back and made mypy happy after the change. * Also fixed style lints in files covered by flake8 but not by pyfmt Pull Request resolved: https://github.com/pytorch/pytorch/pull/140980 Approved by: https://github.com/justinchuby, https://github.com/malfet	2024-11-20 17:52:07 +00:00
eellison	eff22171d2	Add Current Mask Var To CSE Cache Key (#140838 ) This torch.cat kernel has multiple subblocks which load from the same input. We were incorrectly reusing the mask vars from the first load for the second load. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140838 Approved by: https://github.com/jansel ghstack dependencies: #140841	2024-11-20 00:55:56 +00:00
Valentine233	6fcef86cfa	[inductor] fix the unligned variable ranges issue in fuse node (#138568 ) Fixes #138550. ### Description In the fusion of two nodes, one node with less variables (`node_to_recomp`) would make its variable ranges aligned with the other node (`ref_node`). In detail, `node_to_recomp` would change its variable ranges to the original ranges of `ref_node`. However, if both of the nodes have changed its ranges, i.e., the simplified variable ranges are different from its original ones, the issue comes up. ### Solution For the case where the `ref_node` also changes its variable ranges, we recompute the size and body for it, to ensure the nodes are simplified to the same size. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138568 Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel	2024-11-07 01:17:58 +00:00
Aaron Orenstein	06f619d999	typing ir.py - part 2 (#131846 ) See #131852 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131846 Approved by: https://github.com/eellison ghstack dependencies: #139238	2024-11-06 00:01:15 +00:00
Sun, Jiayi	3337439dc0	[inductor] modify the heuristic for disabling vectorization (#136422 ) Summary Since we have already implemented tail loop mask vectorization (https://github.com/pytorch/pytorch/pull/126526), I re-tuned the heuristics for disabling vectorization from performance perspective. I changed the heuristic to: when the total number of elements along the vec dim is less than `tiling_factor/4` and the number of operations is less than 10, we disable the vectorization. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136422 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jansel	2024-11-04 07:33:32 +00:00
Jason Ansel	ed30fa74ab	[inductor] sympy.Integer([01]) -> sympy.S.(Zero\|One) (#139523 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139523 Approved by: https://github.com/ezyang ghstack dependencies: #139364, #139365, #139370, #139452	2024-11-04 04:28:40 +00:00
PyTorch MergeBot	98e11b0021	Revert "[inductor] sympy.Integer([01]) -> sympy.S.(Zero\|One) (#139523 )" This reverts commit `c53beab377`. Reverted https://github.com/pytorch/pytorch/pull/139523 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing lots of internal tests in D65345157 ([comment](https://github.com/pytorch/pytorch/pull/139364#issuecomment-2452897337))	2024-11-02 06:49:10 +00:00
Jason Ansel	c53beab377	[inductor] sympy.Integer([01]) -> sympy.S.(Zero\|One) (#139523 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139523 Approved by: https://github.com/ezyang ghstack dependencies: #139364, #139365, #139370, #139452	2024-11-02 03:04:22 +00:00
Scott Wolchok	a3de067975	[PyTorch] Use 128-bit vectors for ARM64 (#137426 ) The correct vector length for ARM64 is 128 bits (16 bytes). We were previously using double this, apparently just because that would be the same length as AVX2. Differential Revision: [D63984039](https://our.internmc.facebook.com/intern/diff/D63984039/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137426 Approved by: https://github.com/jgong5, https://github.com/malfet ghstack dependencies: #138486, #138542, #138655, #138716, #138744	2024-10-26 00:20:35 +00:00
Scott Wolchok	6aa673377b	[PyTorch] Fix inductor CPU masked() body codegen when result dtype is bool and operator is where (#138486 ) In this case, it looks like we expect the body to be a VecMask (unify_mask_base_type is called by where()), but we didn't make it a VecMask. Now we do. Differential Revision: [D64702918](https://our.internmc.facebook.com/intern/diff/D64702918/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138486 Approved by: https://github.com/leslie-fang-intel, https://github.com/malfet	2024-10-24 19:36:41 +00:00
chilli	0e4d42634e	Port Inductor dataclasses to be kw_only (#137768 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137768 Approved by: https://github.com/ezyang	2024-10-14 10:33:43 +00:00
PyTorch MergeBot	41977a0531	Revert "Port Inductor dataclasses to be kw_only (#137768 )" This reverts commit `65d665bae5`. Reverted https://github.com/pytorch/pytorch/pull/137768 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it seem to fail test_loop_ordering in trunk ([comment](https://github.com/pytorch/pytorch/pull/137768#issuecomment-2409203115))	2024-10-13 22:25:19 +00:00
chilli	65d665bae5	Port Inductor dataclasses to be kw_only (#137768 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137768 Approved by: https://github.com/ezyang	2024-10-13 14:55:45 +00:00
Bin Bao	c04b35a5ae	[AOTI] Add standalone version of TORCH_CHECK (#136873 ) Summary: In the standalone mode, TORCH_CHECK throws std::runtime_error, instead of c10::Error. The goal is to cut dependency on libtorch. Specifically, AOTI generates CPU code which may call ATen vectorization ops and we need to make sure those ops are self-contained. Differential Revision: [D63911928](https://our.internmc.facebook.com/intern/diff/D63911928) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136873 Approved by: https://github.com/albanD, https://github.com/chenyang78	2024-10-08 15:30:01 +00:00
Jez Ng	71aac59e93	Add Triton CPU as an Inductor backend (#133408 ) The goal is to use Inductor-generated kernels to stress test the new Triton CPU backend. Differential Revision: [D63298968](https://our.internmc.facebook.com/intern/diff/D63298968) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133408 Approved by: https://github.com/jansel, https://github.com/blaine-rister, https://github.com/malfet	2024-09-30 20:24:52 +00:00
PyTorch MergeBot	36428f91e9	Revert "Add Triton CPU as an Inductor backend (#133408 )" This reverts commit `31c0467594`. Reverted https://github.com/pytorch/pytorch/pull/133408 on behalf of https://github.com/int3 due to internal tests failing ([comment](https://github.com/pytorch/pytorch/pull/133408#issuecomment-2379692517))	2024-09-27 16:54:27 +00:00
Jez Ng	31c0467594	Add Triton CPU as an Inductor backend (#133408 ) The goal is to use Inductor-generated kernels to stress test the new Triton CPU backend. Differential Revision: [D63298968](https://our.internmc.facebook.com/intern/diff/D63298968) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133408 Approved by: https://github.com/jansel, https://github.com/blaine-rister, https://github.com/malfet	2024-09-26 15:35:26 +00:00
Wu, Chunyuan	c3fdf587b5	[inductor] [cpp] fix the check of template_buffer_has_other_users if no epilogue_nodes (#136518 ) The `template_buffer_has_other_users` function checks the case where there're epilogue nodes and the template output has users other than these epilogue nodes. When there's no epilogue nodes, the function could return `False` directly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136518 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5 ghstack dependencies: #136418	2024-09-25 10:25:07 +00:00
Wu, Chunyuan	44c871c34b	[inductor] [cpp] add index check when fusing epilogue with GEMM template (#135661 ) ## Description Fixes the accuracy failure of FP32 `jx_nest_base` of max-autotune. The current epilogue fusion implementation in GEMM template assumes that the read of template buffer and the write of epilogue output in the epilogue node have the same index (the layout could be different but the index should be the same). If the condition is not satisfied, the computation is wrong, leading to correctness issue for FP32 `jx_nest_base`. This PR disabled the epilogue fusion with GEMM template when the above condition is not satisfied. ### Unsupported epilogue: `buf1` is the template buffer and `buf2` is the epilogue output buffer. The store of `buf2`: 401408 * d0 + 100352 * d1 + *7168 d2 + 1792 * d3** + 128 * d4 + d5 The load of `buf1` in the epilogue node: 401408 * d0 + 100352 * d1 + *1792 d2 + 25088 * d3** + 128 * d4 + d5 The above two indexes are different. ``` CppTemplateBuffer(name='buf1', layout=FixedLayout('cpu', torch.float32, size=[25088, 128], stride=[128, 1])) ComputedBuffer(name='buf2', layout=FixedLayout('cpu', torch.float32, size=[8, 4, 14, 4, 14, 128], stride=[401408, 100352, 7168, 1792, 128, 1]), data=Pointwise( 'cpu', torch.float32, def inner_fn(index): i0, i1, i2, i3, i4, i5 = index tmp0 = ops.load(arg5_1, i5 + 128 * i4 + 1792 * i2 + 25088 * i3 + 100352 * i1 + 401408 * i0) tmp1 = ops.load(buf0, i5 + 128 * i4 + 1792 * i2 + 25088 * i3 + 100352 * i1 + 401408 * i0) tmp2 = tmp0 + tmp1 tmp3 = ops.load(buf1, i5 + 128 * i4 + 1792 * i2 + 25088 * i3 + 100352 * i1 + 401408 * i0) tmp4 = tmp2 + tmp3 return tmp4 , ranges=[8, 4, 14, 4, 14, 128], origin_node=clone, origins=OrderedSet([clone]) )) ``` ### Supported epilogue: `buf1` is the template buffer and `buf2` is the epilogue output buffer. The store of `buf2`: d0 + 576 * d1 + 32 * d2 The load of `buf1` in the epilogue node: d0 + 576 * d1 + 32 * d2 The above two indexes are the same. The layout of `buf2` and `buf1` are different though which is handled by the reindexer: `buf1`: `size=[324, 32], stride=[32, 1]` `buf2`: `size=[1, 32, 18, 18], stride=[10368, 1, 576, 32]` ``` CppTemplateBuffer(name='buf1', layout=FixedLayout('cpu', torch.bfloat16, size=[324, 32], stride=[32, 1])) ComputedBuffer(name='buf2', layout=FixedLayout('cpu', torch.bfloat16, size=[1, 32, 18, 18], stride=[10368, 1, 576, 32]), data=Pointwise( 'cpu', torch.bfloat16, def inner_fn(index): _, i1, i2, i3 = index tmp0 = ops.load(buf1, i1 + 32 * i3 + 576 * i2) tmp1 = ops.to_dtype(tmp0, torch.float32, src_dtype=torch.bfloat16) tmp2 = ops.load(_frozen_param4, i1) tmp3 = tmp1 * tmp2 tmp4 = ops.load(arg7_1, i1 + 32 * i3 + 576 * i2) tmp5 = tmp3 + tmp4 tmp6 = ops.to_dtype(tmp5, torch.bfloat16, src_dtype=torch.float32) return tmp6 , ranges=[1, 32, 18, 18], origin_node=convert_element_type_4, origins=OrderedSet([add, mul, convert_element_type_4]) )) ``` ## TODO Add the support for fusions when the indexes are different in a follow-up PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135661 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5	2024-09-24 05:25:28 +00:00
Aaron Orenstein	06909803cc	Existing mypy issues (#136236 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136236 Approved by: https://github.com/bobrenjc93, https://github.com/Skylion007	2024-09-24 01:02:07 +00:00
Sun, Jiayi	687e5cf8c5	[inductor] Relax the conditions for loop split (#135335 ) Summary This PR Relaxes the conditions for loop split to support dynamic shape cases. Now the conditions that need to be met to apply loop split optimization are as follows: 1. No reduction and no mudular index for all nodes. 2. The indexing_exprs of all nodes contain only one (or more, but all the same) division, where the divisor is an integer, the dividend is one of the iter_vars, and this var, i.e. the dimension that needs to be split, is contiguous in all other indexing_exprs. Example: ``` import torch import torch.nn as nn class GN(torch.nn.Module): def __init__(self, num_groups, num_channels): super(GN, self).__init__() self.gn = nn.GroupNorm(num_groups, num_channels) def forward(self, x): return self.gn(x) input = torch.randn(2, 960, 96, 96).to(memory_format=torch.channels_last) m = GN(32, 960).eval() compiled_m = torch.compile(m, dynamic=True) with torch.no_grad(): compiled_m(input) ``` Before loop split, the node's var_ranges: `{z0: s0, z1: s2, z2: s2, z3: 960}` and indexing_exprs: `{'index0': 960s22z0 + 960s2z1 + 960z2 + z3, 'index1': 32z0 + (z3//30), 'index2': 30s22, 'index3': z3, 'index4': 960s2z0((s2*2//s2)) + 960z1((s22//s2)) + 960z2 + z3}`. After loop split `z3` will split to `30z3 + z4`, then the node's var_ranges will be changed to `{z0: s0, z1: s2, z2: s2, z3: 32, z4: 30}` and indexing_exprs will be changed to `{'index0': 960s2*2z0 + 960s2z1 + 960z2 + 30z3 + z4, 'index1': 32z0 + z3, 'index2': 30s2*2, 'index3': 30z3 + z4, 'index4': 960s2z0((s22//s2)) + 960z1((s22//s2)) + 960z2 + 30z3 + z4}` Generated code: - Before: ``` cpp_fused_native_group_norm_0 = async_compile.cpp_pybinding(['const float', 'const float', 'const float', 'float', 'float', 'float', 'const int64_t', 'const int64_t'], ''' #include "/tmp/torchinductor_jiayisun/32/c32dcqa3qidvmunis4lucp3dhoicleq5qjfjfgvpiadbbzfp6ofy.h" extern "C" void kernel(const float in_ptr0, const float* in_ptr1, const float* in_ptr2, float* out_ptr0, float* out_ptr1, float* out_ptr2, const int64_t ks0, const int64_t ks1) { #pragma omp parallel num_threads(112) { int tid = omp_get_thread_num(); { #pragma omp for collapse(2) for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(ks0); x0+=static_cast<int64_t>(1L)) { for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(32L); x1+=static_cast<int64_t>(1L)) { { Welford<float> tmp_acc0 = Welford<float>(); Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>(); Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec = Welford<at::vec::Vectorized<float>>(); static WeightRecp<at::vec::Vectorized<float>> wrecps0(static_cast<int64_t>(c10::div_floor_integer(static_cast<int64_t>((15L(static_cast<int64_t>(ks1ks1)))), static_cast<int64_t>(8L)))); for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(static_cast<int64_t>(ks1ks1)); x2+=static_cast<int64_t>(1L)) { for(int64_t x3=static_cast<int64_t>(0L); x3<static_cast<int64_t>(16L); x3+=static_cast<int64_t>(16L)) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x3 + (30Lx1) + (960Lx2) + (960Lx0(static_cast<int64_t>(ks1ks1)))), static_cast<int64_t>(16)); tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &wrecps0); } for(int64_t x3=static_cast<int64_t>(16L); x3<static_cast<int64_t>(30L); x3+=static_cast<int64_t>(14L)) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x3 + (30Lx1) + (960Lx2) + (960Lx0(static_cast<int64_t>(ks1ks1)))), static_cast<int64_t>(14L)); masked_tmp_acc0_vec = welford_combine(masked_tmp_acc0_vec, tmp0, static_cast<int64_t>(14L), &wrecps0); } } tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(masked_tmp_acc0_vec)); tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec)); out_ptr0[static_cast<int64_t>(x1 + (32Lx0))] = static_cast<float>(tmp_acc0.mean); out_ptr1[static_cast<int64_t>(x1 + (32Lx0))] = static_cast<float>(tmp_acc0.m2); } } } } { #pragma omp for collapse(2) for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(ks0); x0+=static_cast<int64_t>(1L)) { for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(ks1); x1+=static_cast<int64_t>(1L)) { #pragma GCC ivdep for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(ks1); x2+=static_cast<int64_t>(1L)) { #pragma GCC ivdep for(int64_t x3=static_cast<int64_t>(0L); x3<static_cast<int64_t>(960L); x3+=static_cast<int64_t>(1L)) { auto tmp0 = in_ptr0[static_cast<int64_t>(x3 + (960Lx2) + (960Lks1x1) + (960Lx0(static_cast<int64_t>(ks1ks1))))]; auto tmp1 = out_ptr0[static_cast<int64_t>((32Lx0) + (c10::div_floor_integer(static_cast<int64_t>(x3), static_cast<int64_t>(30L))))]; auto tmp3 = out_ptr1[static_cast<int64_t>((32Lx0) + (c10::div_floor_integer(static_cast<int64_t>(x3), static_cast<int64_t>(30L))))]; auto tmp11 = in_ptr1[static_cast<int64_t>(x3)]; auto tmp13 = in_ptr2[static_cast<int64_t>(x3)]; auto tmp2 = decltype(tmp0)(tmp0 - tmp1); auto tmp4 = 30L(static_cast<int64_t>(ks1ks1)); auto tmp5 = c10::convert<float>(tmp4); auto tmp6 = tmp3 / tmp5; auto tmp7 = static_cast<float>(1e-05); auto tmp8 = decltype(tmp6)(tmp6 + tmp7); auto tmp9 = 1 / std::sqrt(tmp8); auto tmp10 = decltype(tmp2)(tmp2 tmp9); auto tmp12 = decltype(tmp10)(tmp10 * tmp11); auto tmp14 = decltype(tmp12)(tmp12 + tmp13); out_ptr2[static_cast<int64_t>(x3 + (960Lx2) + (960Lx1(c10::div_floor_integer(static_cast<int64_t>((static_cast<int64_t>(ks1ks1))), static_cast<int64_t>(ks1)))) + (960Lks1x0(c10::div_floor_integer(static_cast<int64_t>((static_cast<int64_t>(ks1ks1))), static_cast<int64_t>(ks1)))))] = tmp14; } } } } } } } ''') async_compile.wait(globals()) del async_compile def call(args): arg0_1, arg1_1, arg2_1, arg3_1, arg4_1 = args args.clear() s0 = arg2_1 s2 = arg3_1 assert_size_stride(arg0_1, (960, ), (1, )) assert_size_stride(arg1_1, (960, ), (1, )) assert_size_stride(arg4_1, (s0, 960, s2, s2), (960(s2s2), 1, 960s2, 960)) buf0 = empty_strided_cpu((s0, 32, 1, 1), (32, 1, 32s0, 32s0), torch.float32) buf1 = empty_strided_cpu((s0, 32, 1, 1), (32, 1, 32s0, 32s0), torch.float32) buf3 = empty_strided_cpu((s0, 960, s2, s2), (960s2((s2s2) // s2), 1, 960((s2s2) // s2), 960), torch.float32) cpp_fused_native_group_norm_0(arg4_1, arg0_1, arg1_1, buf0, buf1, buf3, s0, s2) del arg0_1 del arg1_1 del arg4_1 return (buf3, ) ``` After: ``` cpp_fused_native_group_norm_0 = async_compile.cpp_pybinding(['const float', 'const float', 'const float', 'float', 'float', 'float', 'const int64_t', 'const int64_t'], ''' #include "/tmp/torchinductor_jiayisun/32/c32dcqa3qidvmunis4lucp3dhoicleq5qjfjfgvpiadbbzfp6ofy.h" extern "C" void kernel(const float* in_ptr0, const float* in_ptr1, const float* in_ptr2, float* out_ptr0, float* out_ptr1, float* out_ptr2, const int64_t ks0, const int64_t ks1) { #pragma omp parallel num_threads(112) { int tid = omp_get_thread_num(); { #pragma omp for collapse(2) for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(ks0); x0+=static_cast<int64_t>(1L)) { for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(32L); x1+=static_cast<int64_t>(1L)) { { Welford<float> tmp_acc0 = Welford<float>(); Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>(); Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec = Welford<at::vec::Vectorized<float>>(); static WeightRecp<at::vec::Vectorized<float>> wrecps0(static_cast<int64_t>(c10::div_floor_integer(static_cast<int64_t>((15L(static_cast<int64_t>(ks1ks1)))), static_cast<int64_t>(8L)))); for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(static_cast<int64_t>(ks1ks1)); x2+=static_cast<int64_t>(1L)) { for(int64_t x3=static_cast<int64_t>(0L); x3<static_cast<int64_t>(16L); x3+=static_cast<int64_t>(16L)) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x3 + (30Lx1) + (960Lx2) + (960Lx0(static_cast<int64_t>(ks1ks1)))), static_cast<int64_t>(16)); tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &wrecps0); } for(int64_t x3=static_cast<int64_t>(16L); x3<static_cast<int64_t>(30L); x3+=static_cast<int64_t>(14L)) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x3 + (30Lx1) + (960Lx2) + (960Lx0(static_cast<int64_t>(ks1ks1)))), static_cast<int64_t>(14L)); masked_tmp_acc0_vec = welford_combine(masked_tmp_acc0_vec, tmp0, static_cast<int64_t>(14L), &wrecps0); } } tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(masked_tmp_acc0_vec)); tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec)); out_ptr0[static_cast<int64_t>(x1 + (32Lx0))] = static_cast<float>(tmp_acc0.mean); out_ptr1[static_cast<int64_t>(x1 + (32Lx0))] = static_cast<float>(tmp_acc0.m2); } } } } { #pragma omp for collapse(2) for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(ks0); x0+=static_cast<int64_t>(1L)) { for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(ks1); x1+=static_cast<int64_t>(1L)) { #pragma GCC ivdep for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(ks1); x2+=static_cast<int64_t>(1L)) { #pragma GCC ivdep for(int64_t x3=static_cast<int64_t>(0L); x3<static_cast<int64_t>(32L); x3+=static_cast<int64_t>(1L)) { for(int64_t x4=static_cast<int64_t>(0L); x4<static_cast<int64_t>(16L); x4+=static_cast<int64_t>(16L)) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x4 + (30Lx3) + (960Lx2) + (960Lks1x1) + (960Lx0(static_cast<int64_t>(ks1ks1)))), static_cast<int64_t>(16)); auto tmp1 = out_ptr0[static_cast<int64_t>(x3 + (32Lx0))]; auto tmp4 = out_ptr1[static_cast<int64_t>(x3 + (32Lx0))]; auto tmp13 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<int64_t>(x4 + (30Lx3)), static_cast<int64_t>(16)); auto tmp15 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<int64_t>(x4 + (30Lx3)), static_cast<int64_t>(16)); auto tmp2 = at::vec::Vectorized<float>(tmp1); auto tmp3 = tmp0 - tmp2; auto tmp5 = 30L(static_cast<int64_t>(ks1ks1)); auto tmp6 = c10::convert<float>(tmp5); auto tmp7 = tmp4 / tmp6; auto tmp8 = static_cast<float>(1e-05); auto tmp9 = decltype(tmp7)(tmp7 + tmp8); auto tmp10 = 1 / std::sqrt(tmp9); auto tmp11 = at::vec::Vectorized<float>(tmp10); auto tmp12 = tmp3 * tmp11; auto tmp14 = tmp12 * tmp13; auto tmp16 = tmp14 + tmp15; tmp16.store(out_ptr2 + static_cast<int64_t>(x4 + (30Lx3) + (960Lx2) + (960Lx1(c10::div_floor_integer(static_cast<int64_t>((static_cast<int64_t>(ks1ks1))), static_cast<int64_t>(ks1)))) + (960Lks1x0(c10::div_floor_integer(static_cast<int64_t>((static_cast<int64_t>(ks1ks1))), static_cast<int64_t>(ks1)))))); } for(int64_t x4=static_cast<int64_t>(16L); x4<static_cast<int64_t>(30L); x4+=static_cast<int64_t>(14L)) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x4 + (30Lx3) + (960Lx2) + (960Lks1x1) + (960Lx0(static_cast<int64_t>(ks1ks1)))), static_cast<int64_t>(14L)); auto tmp1 = out_ptr0[static_cast<int64_t>(x3 + (32Lx0))]; auto tmp4 = out_ptr1[static_cast<int64_t>(x3 + (32Lx0))]; auto tmp13 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<int64_t>(x4 + (30Lx3)), static_cast<int64_t>(14L)); auto tmp15 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<int64_t>(x4 + (30Lx3)), static_cast<int64_t>(14L)); auto tmp2 = at::vec::Vectorized<float>(tmp1); auto tmp3 = tmp0 - tmp2; auto tmp5 = 30L(static_cast<int64_t>(ks1ks1)); auto tmp6 = c10::convert<float>(tmp5); auto tmp7 = tmp4 / tmp6; auto tmp8 = static_cast<float>(1e-05); auto tmp9 = decltype(tmp7)(tmp7 + tmp8); auto tmp10 = 1 / std::sqrt(tmp9); auto tmp11 = at::vec::Vectorized<float>(tmp10); auto tmp12 = tmp3 * tmp11; auto tmp14 = tmp12 * tmp13; auto tmp16 = tmp14 + tmp15; tmp16.store(out_ptr2 + static_cast<int64_t>(x4 + (30Lx3) + (960Lx2) + (960Lx1(c10::div_floor_integer(static_cast<int64_t>((static_cast<int64_t>(ks1ks1))), static_cast<int64_t>(ks1)))) + (960Lks1x0(c10::div_floor_integer(static_cast<int64_t>((static_cast<int64_t>(ks1ks1))), static_cast<int64_t>(ks1))))), static_cast<int64_t>(14L)); } } } } } } } } ''') async_compile.wait(globals()) del async_compile def call(args): arg0_1, arg1_1, arg2_1, arg3_1, arg4_1 = args args.clear() s0 = arg2_1 s2 = arg3_1 assert_size_stride(arg0_1, (960, ), (1, )) assert_size_stride(arg1_1, (960, ), (1, )) assert_size_stride(arg4_1, (s0, 960, s2, s2), (960(s2s2), 1, 960s2, 960)) buf0 = empty_strided_cpu((s0, 32, 1, 1), (32, 1, 32s0, 32s0), torch.float32) buf1 = empty_strided_cpu((s0, 32, 1, 1), (32, 1, 32s0, 32s0), torch.float32) buf3 = empty_strided_cpu((s0, 960, s2, s2), (960s2((s2s2) // s2), 1, 960((s2*s2) // s2), 960), torch.float32) cpp_fused_native_group_norm_0(arg4_1, arg0_1, arg1_1, buf0, buf1, buf3, s0, s2) del arg0_1 del arg1_1 del arg4_1 return (buf3, ) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135335 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jansel	2024-09-20 05:42:52 +00:00
PyTorch MergeBot	d0cebedb31	Revert "Add Triton CPU as an Inductor backend (#133408 )" This reverts commit `e498b02b47`. Reverted https://github.com/pytorch/pytorch/pull/133408 on behalf of https://github.com/jeanschmidt due to Broke internal signals, see D62737208 for more details ([comment](https://github.com/pytorch/pytorch/pull/133408#issuecomment-2353623816))	2024-09-16 18:33:33 +00:00
Jez Ng	e498b02b47	Add Triton CPU as an Inductor backend (#133408 ) The goal is to use Inductor-generated kernels to stress test the new Triton CPU backend. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133408 Approved by: https://github.com/jansel	2024-09-14 21:45:19 +00:00
Bin Bao	ea2ecab15b	[AOTI][reland] Fix assert_function call in cpu autotune template (#135920 ) Summary: Reland https://github.com/pytorch/pytorch/pull/135086. In the ABI-compatible mode, assert_function should be AOTI_TORCH_CHECK. Test Plan: CI Differential Revision: D62500592 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135920 Approved by: https://github.com/chenyang78	2024-09-13 12:21:57 +00:00
xinan.lin	13ee85ca5e	[Inductor] Generalize cuda cpp wrapper as common triton based GPU cpp wrapper, will be reused by xpu in next PR. (#135312 ) [Inductor] Generalize cuda cpp wrapper as common triton based GPU cpp wrapper, will be reused by xpu in next PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135312 Approved by: https://github.com/jansel, https://github.com/desertfire, https://github.com/eellison	2024-09-11 23:59:54 +00:00
PyTorch MergeBot	0a9d55d2ee	Revert "[AOTI] Fix assert_function call in cpu autotune template (#135086 )" This reverts commit `16c3b8f87c`. Reverted https://github.com/pytorch/pytorch/pull/135086 on behalf of https://github.com/izaitsevfb due to breaks internal tests, see D62405818 ([comment](https://github.com/pytorch/pytorch/pull/135086#issuecomment-2341889428))	2024-09-10 19:51:16 +00:00
Bin Bao	16c3b8f87c	[AOTI] Fix assert_function call in cpu autotune template (#135086 ) Summary: In the ABI-compatible mode, assert_function should be AOTI_TORCH_CHECK. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135086 Approved by: https://github.com/chenyang78, https://github.com/angelayi ghstack dependencies: #134857	2024-09-09 16:54:12 +00:00

1 2 3 4 5 ...

481 Commits