pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
eellison	fd35be2fd3	TritonTemplate dtype fixes (#141991 ) - Set the dtype of "acc" appropriately so that epilogue fusion will have args with dtype - Update dtype propagation to use `type_to_dtype` instead of instantiating tensor - Throw if we have a string arg where we should have a proper CSEVariable, unless we're doing the Modification Subgraph thing which is nyi. everything else is appropriately typed (cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov @drisspg ). Pull Request resolved: https://github.com/pytorch/pytorch/pull/141991 Approved by: https://github.com/drisspg ghstack dependencies: #139945, #140057, #141495, #141882	2024-12-04 17:24:23 +00:00
PyTorch MergeBot	38d10a1b17	Revert "[Inductor] Represent tiling as a dict (#141751 )" This reverts commit `5deca07c0d`. Reverted https://github.com/pytorch/pytorch/pull/141751 on behalf of https://github.com/atalman due to Failing internal builds ([comment](https://github.com/pytorch/pytorch/pull/141751#issuecomment-2517815899))	2024-12-04 15:43:16 +00:00
Shangdi Yu	7dfb439a2a	Only write predicate once when there are multiple torch.cond (#141528 ) Fixes #140606 TEST PLAN: ``` python test/inductor/test_aot_inductor.py -k cond_share python test/inductor/test_aot_inductor_arrayref.py -k cond_share ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/141528 Approved by: https://github.com/desertfire	2024-12-04 01:56:10 +00:00
Bin Bao	a51a048027	[AOTI][refactor] Move stack allocation related configs (#139093 ) Summary: Move allow_stack_allocation and use_minimal_arrayref_interface configs into the aot_inductor subclass. Test Plan: CI Differential Revision: D65064301 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139093 Approved by: https://github.com/chenyang78	2024-12-04 00:15:19 +00:00
Mwiza Kunda	f0b33658f8	Dont use constant mask if ynumel potentially overflows ygrids (#139751 ) If (ynumel / YBLOCK) > get_max_ygrids(), the z dimension will be used if znumel is None. However, if (ynumel / YBLOCK) % get_max_ygrids() != 0, there will be program launches with inputs that require masking, and so this needs to be considered when determining if the y dimension has a constant mask. Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/139751 Approved by: https://github.com/eellison Co-authored-by: George White <georgew@graphcore.ai>	2024-12-03 22:56:18 +00:00
Mwiza Kunda	f8a64c324e	Broadcast constants on vectorised stores in `CppTile2DKernel` (#140262 ) Currently constants are not broadcasted on vectorised stores in `CppTile2DKernel`. This leads to errors like the following: ```shell error:: request for member 'store' in 'tmp1', which is of non-class type 'signed char' 61 \| tmp1.store(tmp2 + static_cast<int64_t>(8L*x0_inner), static_cast<int64_t>(8)); \| ^~~~~ ``` This PR adds the required broadcasting. Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/140262 Approved by: https://github.com/jgong5	2024-12-03 09:15:17 +00:00
Aaron Gokaslan	08db735629	[BE]: Update mypy to 1.13.0 (#140808 ) Update mypy to 1.13.0 . Should hopefully reduce linting time. Has support for orjson cache serialization which should improve mypy cache perf if orjson is installed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140808 Approved by: https://github.com/ezyang, https://github.com/malfet	2024-12-03 02:50:10 +00:00
PyTorch MergeBot	daa77f3d9f	Revert "[BE]: Update mypy to 1.13.0 (#140808 )" This reverts commit `00134d68af`. Reverted https://github.com/pytorch/pytorch/pull/140808 on behalf of https://github.com/huydhn due to This is failing a distributed test in trunk, target determination missed this test and did not run it on PR ([comment](https://github.com/pytorch/pytorch/pull/140808#issuecomment-2512788426))	2024-12-02 20:47:43 +00:00
Benjamin Glass	54adbbf6b8	cpp_wrapper: Add support for MemoryFormat arguments (#141367 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/141367 Approved by: https://github.com/desertfire	2024-12-02 20:40:24 +00:00
Aaron Gokaslan	00134d68af	[BE]: Update mypy to 1.13.0 (#140808 ) Update mypy to 1.13.0 . Should hopefully reduce linting time. Has support for orjson cache serialization which should improve mypy cache perf if orjson is installed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140808 Approved by: https://github.com/ezyang, https://github.com/malfet	2024-12-02 18:47:54 +00:00
leslie-fang-intel	96d2a511ce	[Inductor][CPP] Fix issue in CPP GEMM Template Prune Tensor (#141798 ) Summary When addressing [issue #134998](https://github.com/pytorch/pytorch/issues/134998), we will verify if any node in the current graph shares the same storage as the node we intend to prune. In the implementation, we assumed that when creating the `GraphLowering` in post-grad phase, there would be no `submodules`, and all `get_attr` nodes would correspond to a `torch.Tensor`. However, this assumption proves incorrect when enabling `FlexAttention`. In this scenario, `submodules` are present as `get_attr` node in post-grad phase. For example: ``` V1128 23:23:47.071000 1965794 torch/_inductor/compile_fx.py:875] [0/1] [__post_grad_graphs] class sdpa_score30(torch.nn.Module): V1128 23:23:47.071000 1965794 torch/_inductor/compile_fx.py:875] [0/1] [__post_grad_graphs] def forward(self, arg0_1: "bf16[][]cpu", arg1_1: "i32[][]cpu", arg2_1: "i32[][]cpu", arg3_1: "i32[][]cpu", arg4_1: "i32[][]cpu"): V1128 23:23:47.071000 1965794 torch/_inductor/compile_fx.py:875] [0/1] [__post_grad_graphs] return arg0_1 V1128 23:23:45.482000 1965794 torch/_inductor/freezing.py:118] [0/1] sdpa_score30 = self.sdpa_score30 V1128 23:23:45.482000 1965794 torch/_inductor/freezing.py:118] [0/1] sdpa_mask30 = self.sdpa_mask30 V1128 23:23:45.482000 1965794 torch/_inductor/freezing.py:118] [0/1] flex_attention_30 = torch.ops.higher_order.flex_attention(add_276, index_put_60, index_put_61, sdpa_score30, (_frozen_param293, _frozen_param295, _frozen_param296, _frozen_param297, _frozen_param298, _frozen_param299, _frozen_param300, _frozen_param301, 64, 64, sdpa_mask30), 0.08838834764831843, {'SKIP_MASK_SCORE': True, 'PRESCALE_QK': False, 'ROWS_GUARANTEED_SAFE': False, 'BLOCKS_ARE_CONTIGUOUS': False, 'OUTPUT_LOGSUMEXP': False}, (), (_frozen_param294,)); add_276 = sdpa_score30 = sdpa_mask30 = None V1128 23:23:45.482000 1965794 torch/_inductor/freezing.py:118] [0/1] getitem_60: "bf16[1, 32, 1, 128]" = flex_attention_30[0]; flex_attention_30 = None ``` We added an extra check in the implementation to ensure only comparing the `get_attr` node with `torch.Tensor`. It is difficult to reproduce this issue using pure high-order operators. Adding a unit test after https://github.com/pytorch/pytorch/pull/141453 lands would be more straightforward. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141798 Approved by: https://github.com/jgong5	2024-12-02 07:38:57 +00:00
Adnan Akhundov	f16e08042c	[user triton] Fix grid codegen for configs with empty kwargs (#141824 ) Fixes #141823 by adding special handling of the codegen `if <config kwargs>: return <grid>` for the cases when there are no kwargs in the config. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141824 Approved by: https://github.com/Chillee	2024-12-02 04:17:21 +00:00
Jason Ansel	b2fe1b9409	[inductor] Fix 3d tiling (#141709 ) Fixes #141121 Pull Request resolved: https://github.com/pytorch/pytorch/pull/141709 Approved by: https://github.com/eellison	2024-12-01 19:47:41 +00:00
Blaine Burton Rister	5deca07c0d	[Inductor] Represent tiling as a dict (#141751 ) # Summary Preparatory refactor for https://github.com/pytorch/pytorch/pull/137243. This makes it easier to generalize to multi-dimensional reductions. This diff refactors `self.numels` from a tuple like `(8,16)` to a dict like `{"x": 8, "r": 16}`. Note: this is based off of https://github.com/pytorch/pytorch/pull/141738, which enables `tree.is_reduction`. That PR should land first. # Test plan The existing CI provides good coverage. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141751 Approved by: https://github.com/jansel	2024-12-01 09:54:34 +00:00
Blaine Burton Rister	c2fa544472	[Inductor] move block pointer analysis to a new module (#141733 ) # Summary Preparatory refactor for https://github.com/pytorch/pytorch/pull/137243. This refactors the ModularIndexing block pointer analysis into its own module. That way, we can call it from other places besides Triton codegen. In the parent PR, we will use this to find tiling splits that simplify the indexing. # Test plan Tested by the existing CI. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141733 Approved by: https://github.com/jansel	2024-11-30 23:21:24 +00:00
Blaine Burton Rister	49fde426ba	[Inductor] Use a helper function to tell if a tree or prefix is a reduction (#141738 ) Preparatory refactor for https://github.com/pytorch/pytorch/pull/137243. Previously, we would typically check for reductions by `tree.prefix == "r"`. This PR moves the check into a helper function. This makes it easier to generalize the code to multi-dimensional reductions, which could have multiple prefixes like `("r0_", "r1_")`. Tested by the existing CI. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141738 Approved by: https://github.com/jansel	2024-11-30 22:38:13 +00:00
Nan Zhang	5aacfa037b	[Inductor] fix broadcast logic for Triton (#141027 ) (#141693 ) Summary: Fix logic for inserting broadcast on kernel with load going directly to store. In the case where load is going directly to store, we insert a tl.broadcast on the store, regardless of the block size on the load. In the case where a broadcast is not required, the downstream Triton compiler is expected to remove this no-op broadcast instruction. Test Plan: Added tests under test_torchinductor_strided_blocks.py:test_expand_broadcast in OSS and internal test cases. Reviewed By: blaine-rister Differential Revision: D65518033 Pull Request resolved: https://github.com/pytorch/pytorch/pull/141693 Approved by: https://github.com/blaine-rister	2024-11-28 16:38:25 +00:00
eellison	f83361b274	inductor dtype propagation fixes (#141495 ) - Add in upcast_compute_type on creation of new tensors (loads, constants) - Fixes index_expr - right now we are sort of inconsistent in dtype and dont always respect the dtype specified. would be nice to fix but not doing in this pr. - bug fix in view dtype where we were always upcasting back to fp32 when input was in bf16/fp16. we should only be doing that if the output is also in bf16/fp16. - for masked, avoid calling dtype propagation and just use output dtype. Turns on the runtime dtype verification for opinfo tests. The separate test file is still useful because we can use it for testing turning off codegen_upcast_to_fp32. Follow ups: - We could consider requiring less explicit upcast_compute_types calls and do it automatically. That would potentially make things easier but be less flexible in the future. Maybe I should have done it this pr. - Be more consistent on our index expr dtype printing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141495 Approved by: https://github.com/blaine-rister, https://github.com/arui-meta, https://github.com/ezyang ghstack dependencies: #139945, #140057	2024-11-28 11:39:38 +00:00
PyTorch MergeBot	b33f770574	Revert "[inductor] Fix 3d tiling (#141709 )" This reverts commit `ca9bfa1a38`. Reverted https://github.com/pytorch/pytorch/pull/141709 on behalf of https://github.com/huydhn due to Sorry for reverting your change but there is one failed test showing up in trunk. It was missed by target determination ([comment](https://github.com/pytorch/pytorch/pull/141709#issuecomment-2505213481))	2024-11-28 03:55:31 +00:00
Jason Ansel	ca9bfa1a38	[inductor] Fix 3d tiling (#141709 ) Fixes #141121 Pull Request resolved: https://github.com/pytorch/pytorch/pull/141709 Approved by: https://github.com/eellison	2024-11-28 01:34:28 +00:00
Boyuan Feng	17fd53d8e5	[Inductor] Inplacing with Donated Buffer (#140113 ) Currently, inductor does not inplace update a buffer if it is an input buffer. Because we don't know if an input will be used by other functions. Donated buffer provides additional information that an input buffer will not be used by other functions. So we can inplace update donated buffer when possible. [Dashboard](https://hud.pytorch.org/benchmark/torchbench/inductor_dynamic?dashboard=torchinductor&startTime=Mon,%2011%20Nov%202024%2018:14:36%20GMT&stopTime=Mon,%2018%20Nov%202024%2018:14:36%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(a100)&lBranch=bf/donated-buffer-inplace&lCommit=5df0769c00e6f9000caeb10fd5cbf0b165f69c2a&rBranch=main&rCommit=2b39a8db7741b816b03677a9c6fec1af05640dee) ![image](https://github.com/user-attachments/assets/f19d961f-7973-418e-9de8-5c2a97950478) ![image](https://github.com/user-attachments/assets/df3bd6a9-58b8-4e8a-8397-9e3b1de9adfe) Pull Request resolved: https://github.com/pytorch/pytorch/pull/140113 Approved by: https://github.com/eellison	2024-11-27 18:51:52 +00:00
eellison	fd553b9817	Add remaining method and tests for dtype propagation (#140057 ) Adds the remaining unimplemented ops as well as an assertion failure if someone adds a new op without a dtype rule. We test all unique pointwise operators registered as lowerings which have an opinfo. There will be some follow ups for this to work well with both `codegen_upcast_to_fp32` as True and False. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140057 Approved by: https://github.com/arui-meta, https://github.com/blaine-rister, https://github.com/ezyang ghstack dependencies: #139945	2024-11-27 17:06:44 +00:00
eellison	566ceb3e7e	Refactor dtype propagation (#139945 ) A couple changes. - Tries to reuse dtype propagation rules that were already registered in inductor. These were present both with `pointwise_overrides_data` and the `boolean_ops` list. Additionally, the registration of pointwise ops already specified dtype propagation rules. Saves those registrations and reuses them later. - Factors out `get_promoted_dtype` which uses functools.lru_cache to take in non - CSEVariable args because those will not work with the functools cache. Tests get added later in the stack when everything is implemented. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139945 Approved by: https://github.com/blaine-rister, https://github.com/arui-meta, https://github.com/ezyang	2024-11-27 16:57:02 +00:00
leslie-fang-intel	aa827e319e	[Inductor][CPP] Extract common functions to be reused in other CPP Template (#141554 ) Summary Extract common internal functions from GEMM Template into public function, so these functions can be reused by the subsequent group GEMM template. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141554 Approved by: https://github.com/jgong5	2024-11-27 09:52:18 +00:00
PyTorch MergeBot	65dbd5cc2d	Revert "[Inductor] Inplacing with Donated Buffer (#140113 )" This reverts commit `eecc8e362c`. Reverted https://github.com/pytorch/pytorch/pull/140113 on behalf of https://github.com/BoyuanFeng due to break test_donated_buffer_inplace internally since donated_buffer = False if is_fbcode() else True ([comment](https://github.com/pytorch/pytorch/pull/140113#issuecomment-2501954300))	2024-11-26 21:20:59 +00:00
Isuru Fernando	44186a0a4e	Move Sympy printers to torch/utils/_sympy/printers.py (#140597 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/140597 Approved by: https://github.com/ezyang, https://github.com/anijain2305	2024-11-26 18:11:00 +00:00
Yidi Wu	000d4e9d43	[hop][inductor] remove codegen_subgraph_suffix and directly assign call function result to outer outputs (#141181 ) Before the PR: P1683356646 after the pr: P1683356585 Relevant changes: ``` @@ -231,7 +421,8 @@ true_graph_0_args = [true_graph_0_arg0_1, true_graph_0_arg1_1] del true_graph_0_arg0_1 del true_graph_0_arg1_1 + (buf5[0],) = true_graph_0(true_graph_0_args) - (true_graph_0_buf0,) = true_graph_0(true_graph_0_args) - buf5[0] = true_graph_0_buf0 else: # subgraph: false_graph_0 false_graph_0_arg0_1 = buf4 @@ -239,7 +430,8 @@ false_graph_0_args = [false_graph_0_arg0_1, false_graph_0_arg1_1] del false_graph_0_arg0_1 del false_graph_0_arg1_1 + (buf5[0],) = false_graph_0(false_graph_0_args) - (false_graph_0_buf0,) = false_graph_0(false_graph_0_args) - buf5[0] = false_graph_0_buf0 del arg2_1 del buf4 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/141181 Approved by: https://github.com/anijain2305 ghstack dependencies: #140334, #141172	2024-11-26 17:32:51 +00:00
Yidi Wu	aae581d921	[hop free symbols][inductor] remove un-used add_symbol_graph_inputs (#141172 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/141172 Approved by: https://github.com/Chillee ghstack dependencies: #140334	2024-11-26 17:32:50 +00:00
Boyuan Feng	eecc8e362c	[Inductor] Inplacing with Donated Buffer (#140113 ) Currently, inductor does not inplace update a buffer if it is an input buffer. Because we don't know if an input will be used by other functions. Donated buffer provides additional information that an input buffer will not be used by other functions. So we can inplace update donated buffer when possible. [Dashboard](https://hud.pytorch.org/benchmark/torchbench/inductor_dynamic?dashboard=torchinductor&startTime=Mon,%2011%20Nov%202024%2018:14:36%20GMT&stopTime=Mon,%2018%20Nov%202024%2018:14:36%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(a100)&lBranch=bf/donated-buffer-inplace&lCommit=5df0769c00e6f9000caeb10fd5cbf0b165f69c2a&rBranch=main&rCommit=2b39a8db7741b816b03677a9c6fec1af05640dee) ![image](https://github.com/user-attachments/assets/f19d961f-7973-418e-9de8-5c2a97950478) ![image](https://github.com/user-attachments/assets/df3bd6a9-58b8-4e8a-8397-9e3b1de9adfe) Pull Request resolved: https://github.com/pytorch/pytorch/pull/140113 Approved by: https://github.com/eellison	2024-11-26 17:19:50 +00:00
leslie-fang-intel	9d4c0527b3	[Inductor][CPP] Modularize the CPP GEMM Template (#141006 ) Summary Move the common template code, which may be reused in subsequent group GEMM templates, into the standalone sub-templates. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141006 Approved by: https://github.com/jgong5	2024-11-26 14:32:40 +00:00
xinan.lin	4742080ed9	[AOTI XPU] Enable Cpp wraper for Intel GPU. (#135318 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135318 Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/guangyey, https://github.com/desertfire	2024-11-26 11:51:32 +00:00
Colin Peppler	8f5edcb75c	[CUTLASS] Lift shape & stride information as kernel args (#138611 ) Differential Revision: [D64773324](https://our.internmc.facebook.com/intern/diff/D64773324) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138611 Approved by: https://github.com/chenyang78	2024-11-25 17:52:33 +00:00
Sun, Jiayi	a964f31d7b	[inductor] modify the heuristic for loop split optimization (#137550 ) ### Summary 1. Improve the heuristic for loop split optimization: The divisor needs to be an integer and cannot be too small (needs to be greater than 8, this threshold has been tuned). 2. Improve the heuristic for disabling vectorization: add quantity_threshold and relax ratio_threshold for the number of non-contiguous load/store/index_expr in the loop body. This PR will bring performance improvements for two torchbench models(functorch_dp_cifar10, opacus_cifar10) and one timm model(sebotnet33ts_256). Pull Request resolved: https://github.com/pytorch/pytorch/pull/137550 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jansel	2024-11-25 09:16:30 +00:00
haozhe.zhu	d0fd42eb3a	[inductor] refine loop split logic (#128812 ) This PR aims to improves parallelization by collapsing vectorized loop. https://github.com/pytorch/pytorch/issues/122281 For such case, the parallel level is only `2`. And the vectorized loop cannot be collapsed. ``` #pragma omp for for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L)) { for(long x1=static_cast<long>(0L); x1<static_cast<long>(199984L); x1+=static_cast<long>(16L)) { auto tmp0 = at::vec::VectorizedN<int64_t,2>::loadu(in_ptr0 + static_cast<long>(x1 + (199985Lx0)), 16); tmp0.store(out_ptr0 + static_cast<long>(x1 + (209985Lx0)), 16); } #pragma omp simd simdlen(8) for(long x1=static_cast<long>(199984L); x1<static_cast<long>(199985L); x1+=static_cast<long>(1L)) { auto tmp0 = in_ptr0[static_cast<long>(x1 + (199985Lx0))]; out_ptr0[static_cast<long>(x1 + (209985Lx0))] = tmp0; } } ``` After this PR, we will gen code ``` #pragma omp for collapse(2) for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L)) { for(long x1=static_cast<long>(0L); x1<static_cast<long>(199985L); x1+=static_cast<long>(16L)) { if (x1 >= 0 && x1 <199984) { auto tmp0 = at::vec::VectorizedN<int64_t,2>::loadu(in_ptr0 + static_cast<long>(x1 + (199985Lx0)), 16); tmp0.store(out_ptr0 + static_cast<long>(x1 + (209985Lx0)), 16); } if (x1 >= 199984 && x1 <199985) { auto tmp0 = in_ptr0[static_cast<long>(x1 + (199985Lx0))]; out_ptr0[static_cast<long>(x1 + (209985Lx0))] = tmp0; } } } ``` ### Highlight For reduction case, we have some side-effect here. For below case, we vectorized `x1` dim and reduction at `x2` dim. ``` #pragma omp for for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(39L); x0+=static_cast<int64_t>(1L)) { for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(16L); x1+=static_cast<int64_t>(8L)) { { float tmp_acc0 = -std::numeric_limits<float>::infinity(); at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity()); for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(18L); x2+=static_cast<int64_t>(1L)) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<int64_t>(x1 + (17Lx2) + (306Lx0)), 8); tmp_acc0_vec = at::vec::maximum(tmp_acc0_vec, tmp0); } [&] { __at_align__ std::array<float, 8> tmpbuf; tmp_acc0_vec.store(tmpbuf.data(), 8); #pragma GCC unroll 8 for (long x1_inner = 0; x1_inner < 8; x1_inner++) { out_ptr1[static_cast<int64_t>(x0 + (39Lx1) + (39Lx1_inner))] = tmpbuf[x1_inner]; } } () ; } } #pragma omp simd simdlen(4) for(int64_t x1=static_cast<int64_t>(16L); x1<static_cast<int64_t>(17L); x1+=static_cast<int64_t>(1L)) { { float tmp_acc0 = -std::numeric_limits<float>::infinity(); for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(18L); x2+=static_cast<int64_t>(1L)) { auto tmp0 = in_ptr1[static_cast<int64_t>(x1 + (17Lx2) + (306Lx0))]; tmp_acc0 = max_propagate_nan(tmp_acc0, tmp0); } out_ptr1[static_cast<int64_t>(x0 + (39Lx1))] = tmp_acc0; } } } ``` After collapse, the loop order will be `x1 -> x2 -> x1_tail_part`, thus we will need a `tmp_acc_arr` to store the reduction result for `x1_tail_part`. And for `reduction_stores`, we also need to check `x1`'s value like what we do in the `loopbody` since the `reduction_stores` happened between `x1` and `x2` loops. ``` #pragma omp for collapse(2) for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(39L); x0+=static_cast<int64_t>(1L)) { for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(17L); x1+=static_cast<int64_t>(8L)) { { float tmp_acc0_arr[8]; ######### need an array to hold acc result for tail part for (int i = 0; i < 8; i++) { tmp_acc0_arr[i] = -std::numeric_limits<float>::infinity(); } float tmp_acc0 = -std::numeric_limits<float>::infinity(); at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity()); for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(18L); x2+=static_cast<int64_t>(1L)) { { if(C10_LIKELY(x1 >= static_cast<int64_t>(0) && x1 < static_cast<int64_t>(16L))) { auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<int64_t>(x1 + (17Lx2) + (306Lx0)), 8); tmp_acc0_vec = at::vec::maximum(tmp_acc0_vec, tmp0); } if(C10_UNLIKELY(x1 >= static_cast<int64_t>(16L) && x1 < static_cast<int64_t>(17L))) { for (long x1_tail = static_cast<int64_t>(16L); x1_tail < static_cast<int64_t>(17L); x1_tail++) { auto tmp0 = in_ptr1[static_cast<int64_t>(x1_tail + (17Lx2) + (306Lx0))]; tmp_acc0_arr[x1_tail - static_cast<int64_t>(16L)] = max_propagate_nan(tmp_acc0_arr[x1_tail - static_cast<int64_t>(16L)], tmp0); } } } } ############### reduction stores if(C10_LIKELY(x1 >= static_cast<int64_t>(0) && x1 < static_cast<int64_t>(16L))) { [&] { __at_align__ std::array<float, 8> tmpbuf; tmp_acc0_vec.store(tmpbuf.data(), 8); #pragma GCC unroll 8 for (long x1_inner = 0; x1_inner < 8; x1_inner++) { out_ptr1[static_cast<int64_t>(x0 + (39Lx1) + (39Lx1_inner))] = tmpbuf[x1_inner]; } } () ; } if(C10_UNLIKELY(x1 >= static_cast<int64_t>(16L) && x1 < static_cast<int64_t>(17L))) { for (long x1_tail = static_cast<int64_t>(16L); x1_tail < static_cast<int64_t>(17L); x1_tail++) { out_ptr1[static_cast<int64_t>(x0 + (39Lx1_tail))] = tmp_acc0_arr[x1_tail - static_cast<int64_t>(16L)]; } } } } } ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128812 Approved by: https://github.com/jgong5	2024-11-25 04:46:07 +00:00
Jason Ansel	995e3079c9	[inductor] Fix for "Failed to find static RBLOCK" (#141434 ) Summary: I expect this to fix https://fb.workplace.com/groups/1075192433118967/permalink/1547962839175255/ Test Plan: Ask poster to confirm fix Differential Revision: D66413828 Pull Request resolved: https://github.com/pytorch/pytorch/pull/141434 Approved by: https://github.com/ezyang	2024-11-23 22:08:56 +00:00
PyTorch MergeBot	f23621ec56	Revert "Move Sympy printers to torch/utils/_sympy/printers.py (#140597 )" This reverts commit `c25b201583`. Reverted https://github.com/pytorch/pytorch/pull/140597 on behalf of https://github.com/huydhn due to Trunk is sad again after this lands, this looks like a landrace this time, so please do a rebase ([comment](https://github.com/pytorch/pytorch/pull/140597#issuecomment-2494052978))	2024-11-22 15:43:39 +00:00
Jason Ansel	3acc6eac49	[inductor] Add typing to ir.py 2 (#140915 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/140915 Approved by: https://github.com/aorenste	2024-11-22 04:56:54 +00:00
Isuru Fernando	c25b201583	Move Sympy printers to torch/utils/_sympy/printers.py (#140597 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/140597 Approved by: https://github.com/ezyang, https://github.com/anijain2305	2024-11-22 02:04:36 +00:00
sanchitintel	ca9813ea14	Simplify & rectify dequantized B buffer loading for AMX GEMM micro-kernel for WoQ int8 case (#140258 ) As suggested by @leslie-fang-intel in `4c83e4e751 (diff-139642bd981df977f70f4c18c1c34bd1a85c1d6b9ffa06aaa98426ed83942a31R537)` - all elements of `B` tiles (not referring to AMX tiles, but the tiles at the granularity of the micro-kernel) have contiguous elements since `B` matrix is pre-packed, so dequantized buffer loading logic can be simplified. While the previous approach kept elements to be loaded into a B AMX tile contiguous, the new approach doesn't entail any performance penalty either because that data is already in L1D, so loading AMX tiles from non-contiguous dequantized B elements doesn't adversely affect performance. Also rectified the size of the dequantized B buffer. Fixes #140208. A subsequent PR will factor out caching of dequantized int8 weights into a separate codegen function Pull Request resolved: https://github.com/pytorch/pytorch/pull/140258 Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel	2024-11-22 01:34:06 +00:00
Jason Ansel	6eca0aee76	[inductor] Refactor ir.Layout into ir.OutputSpec (#140910 ) This separate the concepts of a Layout (size/stride/etc) and an OutputSpec (which includes multiple outputs). Which should make typing easier. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140910 Approved by: https://github.com/ezyang ghstack dependencies: #140895	2024-11-21 20:01:57 +00:00
Colin Peppler	827f2f749e	[CUTLASS] Raise NotImplementedError if X & W aren't FixedLayout (#140985 ) Summary: title Differential Revision: D66131402 Pull Request resolved: https://github.com/pytorch/pytorch/pull/140985 Approved by: https://github.com/Skylion007	2024-11-21 19:59:19 +00:00
Sam Ginzburg	a847790400	[inductor] reset to zero support for user defined Triton kernels (#140982 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/140982 Approved by: https://github.com/aakhundov	2024-11-21 18:53:23 +00:00
PyTorch MergeBot	701e06b643	Revert "Move Sympy printers to torch/utils/_sympy/printers.py (#140597 )" This reverts commit `aefcdb3c9f`. Reverted https://github.com/pytorch/pytorch/pull/140597 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it fails inductor/test_padding in trunk. This is a target determination miss and that failed test was not run in your PR ([comment](https://github.com/pytorch/pytorch/pull/140597#issuecomment-2489641453))	2024-11-20 22:13:57 +00:00
Isuru Fernando	aefcdb3c9f	Move Sympy printers to torch/utils/_sympy/printers.py (#140597 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/140597 Approved by: https://github.com/ezyang, https://github.com/anijain2305	2024-11-20 20:26:49 +00:00
Jason Ansel	808f0f656d	[inductor] Refactor MutableBox to make IRNode typing easier (#140895 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/140895 Approved by: https://github.com/ezyang, https://github.com/Skylion007	2024-11-20 19:50:46 +00:00
Benjamin Glass	4ffce45100	AOTInductor: properly generate cpp_wrapper runtime assertions (#141050 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/141050 Approved by: https://github.com/desertfire ghstack dependencies: #141058	2024-11-20 19:17:47 +00:00
Benjamin Glass	5c684503a6	cpp_wrapper: Fix dtype_view wrapping reinterpret_view (#141058 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/141058 Approved by: https://github.com/desertfire	2024-11-20 19:17:47 +00:00
Aaron Gokaslan	12e95aa4ee	[BE]: Apply PERF401 autofixes from ruff (#140980 ) * Automatically applies ruff rule 401. Turns loops into equivalent list comprehensions which are faster and do not leak the scope of the loop variables. * list comprehensions not only often have better typing, but are 50+% faster than for loops on overhead. They also preserve length information etc and are better for the interpreter to optimize. * Manually went back and made mypy happy after the change. * Also fixed style lints in files covered by flake8 but not by pyfmt Pull Request resolved: https://github.com/pytorch/pytorch/pull/140980 Approved by: https://github.com/justinchuby, https://github.com/malfet	2024-11-20 17:52:07 +00:00
eellison	eff22171d2	Add Current Mask Var To CSE Cache Key (#140838 ) This torch.cat kernel has multiple subblocks which load from the same input. We were incorrectly reusing the mask vars from the first load for the second load. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140838 Approved by: https://github.com/jansel ghstack dependencies: #140841	2024-11-20 00:55:56 +00:00
Henry Tsang	4f2543c31d	[logs] Add dynamo_timed to get better compilation time breakdown for AOTI (#140198 ) Adding some dynamo timed for the purpose of better understanding AOTI compilation time. Probably would require a few more passes. A lot of time is spent in Scheduler.__init__, and not enough annotations are there. run_command_and_check takes a lot time as well. But there is probably not much we can do. Maybe we can add a config to tune C++ optimization level? traces: <img width="1205" alt="Screenshot 2024-11-08 at 4 41 10 PM" src="https://github.com/user-attachments/assets/61645264-b3af-4d4a-804d-700b0f831c7c"> Differential Revision: D65554141 Pull Request resolved: https://github.com/pytorch/pytorch/pull/140198 Approved by: https://github.com/desertfire	2024-11-19 18:54:17 +00:00

1 2 3 4 5 ...

1609 Commits