pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
Blaine Burton Rister	a2753e376b	[Inductor] Support tiling reduction dimensions (#137243 ) Fixes #134277 and https://github.com/pytorch/pytorch/issues/142317. Sub-PRs containing refactors from this one: - https://github.com/pytorch/pytorch/pull/141733 - https://github.com/pytorch/pytorch/pull/141738 - https://github.com/pytorch/pytorch/pull/141751 (based off the former) - https://github.com/pytorch/pytorch/pull/142249 - https://github.com/pytorch/pytorch/pull/142020 - https://github.com/pytorch/pytorch/pull/143135 These refactor PRs should land before the main one. # Feature Note: to minimize risk, multi-dimensional reductions are gated by the flag `config.triton.tile_reductions`, which defaults to False. Instead of having a single reduction dimension called `"r"`, we can now support 2D reductions with `"r0_"` and `"r1_"` dimensions. 2D reductions generate two nested loops, with different block pointer advancements in each loop body. Most of the implementation is generic to ND reductions, but for now the tiling algorithm sets a hard limit at 2D. Here's an example of a 2D persistent reduction kernel: ``` @triton.jit def triton_per_fused_sum_0(in_ptr0, out_ptr0, xnumel, r0_numel, r1_numel, XBLOCK : tl.constexpr): xnumel = 1 r0_numel = 15 R0_BLOCK: tl.constexpr = 16 r1_numel = 15 R1_BLOCK: tl.constexpr = 16 xoffset = tl.program_id(0) * XBLOCK xindex = xoffset + tl.arange(0, XBLOCK)[:, None, None] xmask = tl.full([XBLOCK, R0_BLOCK, R1_BLOCK], True, tl.int1) r0_index = tl.arange(0, R0_BLOCK)[None, :, None] r0_offset = 0 r0_mask = r0_index < r0_numel r1_index = tl.arange(0, R1_BLOCK)[None, None, :] r1_offset = 0 r1_mask = r1_index < r1_numel rnumel = r0_numel * r1_numel RBLOCK: tl.constexpr = R0_BLOCKR1_BLOCK roffset = r1_offset + (r0_offsetr1_numel) rindex = r1_index + (r0_indexr1_numel) r0_0 = r0_index r1_1 = r1_index tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[15, 15], strides=[30, 1], block_shape=[R0_BLOCK, R1_BLOCK], order=[1, 0], offsets=[r0_offset, r1_offset]), boundary_check=[0, 1], padding_option='zero')[None, :, :] tmp1 = tl.broadcast_to(tmp0, [XBLOCK, R0_BLOCK, R1_BLOCK]) tmp3 = tl.where(r0_mask & r1_mask, tmp1, 0) tmp4 = tl.reshape(tmp3, [XBLOCK, RBLOCK]) tmp5 = tl.sum(tmp4, 1)[:, None, None] tl.store(out_ptr0 + (tl.full([XBLOCK, 1, 1], 0, tl.int32)), tmp5, None) ''', device_str='cuda') ``` There are a few main differences between this kernel and what Inductor would generate without this PR. - Instead of an `r`/`RBLOCK` dimension, we have two reduction dimensions: `r0_`/`R0_BLOCK` and `r1_`/`R1_BLOCK`. - There are special size and indexing variables for reductions, which don't directly correspond to any kernel dimension. (`rindex`, `rnumel`, `RBLOCK`, and `roffset`.) These collapse N-D reduction sizes and indices indices into 1D. This simplifies the codegen for reductions, which sometimes want to access linear indices instead of N-dimensional ones. Doing things this way allows us to generate N-D loads and stores, but access this data as if it were 1D, minimizing the blast radius of this PR. Although this makes the code more verbose, it shouldn't have a perf impact because the triton compiler eliminates dead code. - We generate the line `tmp4 = tl.reshape(tmp3, [XBLOCK, RBLOCK])` before performing the actual reduction. This reshapes N reduction dimensions into 1D. This allows us to reduce over all N dimensions at once, simplifying the codegen and allowing the Triton complier to decide the order of processing under the hood. Here's an example of a looped reduction: ``` @triton.jit def triton_red_fused_sum_0(in_ptr0, out_ptr0, xnumel, r0_numel, r1_numel, XBLOCK : tl.constexpr, R0_BLOCK : tl.constexpr, R1_BLOCK : tl.constexpr): xnumel = 3 r0_numel = 43 r1_numel = 129 xoffset = tl.program_id(0) XBLOCK xindex = xoffset + tl.arange(0, XBLOCK)[:, None, None] xmask = xindex < xnumel r0_base = tl.arange(0, R0_BLOCK)[None, :, None] r1_base = tl.arange(0, R1_BLOCK)[None, None, :] rnumel = r0_numel * r1_numel RBLOCK: tl.constexpr = R0_BLOCKR1_BLOCK rbase = r1_base + (r0_baser1_numel) x0 = xindex block_ptr0 = tl.make_block_ptr(in_ptr0, shape=[3, 43, 129], strides=[11094, 258, 1], block_shape=[XBLOCK, R0_BLOCK, R1_BLOCK], order=[2, 1, 0], offsets=[xoffset, 0, 0]) _tmp2 = tl.full([XBLOCK, R0_BLOCK, R1_BLOCK], 0, tl.float32) for r0_offset in range(0, r0_numel, R0_BLOCK): r0_index = r0_offset + r0_base r0_mask = r0_index < r0_numel for r1_offset in range(0, r1_numel, R1_BLOCK): r1_index = r1_offset + r1_base r1_mask = r1_index < r1_numel roffset = r1_offset + (r0_offsetr1_numel) rindex = r1_index + (r0_indexr1_numel) r0_1 = r0_index r1_2 = r1_index tmp0 = tl.load(block_ptr0, boundary_check=[0, 1, 2], padding_option='zero', eviction_policy='evict_first') tmp1 = tl.broadcast_to(tmp0, [XBLOCK, R0_BLOCK, R1_BLOCK]) tmp3 = _tmp2 + tmp1 _tmp2 = tl.where(r0_mask & r1_mask & xmask, tmp3, _tmp2) block_ptr0 = tl.advance(block_ptr0, [0, 0, R1_BLOCK]) block_ptr0 = tl.advance(block_ptr0, [0, R0_BLOCK, (-1)R1_BLOCK((128 + R1_BLOCK) // R1_BLOCK)]) tmp4 = tl.reshape(_tmp2, [XBLOCK, RBLOCK]) tmp2 = tl.sum(tmp4, 1)[:, None, None] tl.store(tl.make_block_ptr(out_ptr0, shape=[3], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), tl.reshape(tmp2, [XBLOCK]).to(tl.float32), boundary_check=[0]) ''', device_str='cuda') ``` In addition to the aforementioned changes to the persistent reduction, multidimensional looped reductions have a few more lines of code: - They calculate indices inside the loop using `r0_base` and `r1_base`. For compatibility with existing codegen, these are collapsed to the 1D variant `rbase`. - Block pointer advancements are more nuanced for multidimensional loops. At the end of each loop body, we emit a `tl.advance` line which not only increments the pointer in its own dimension, but also undoes the cumulative increments of the previous loop level. This is equivalent to the usual practice in nested loops of starting with a fresh iteration variable at each level. Implementing this required refactoring the way we generate pointer advancements into a new `self.pointer_advancements` field of the kernel, which categorizes advancements by dimension. The biggest difficulty in implementing this feature was that we represented tiling with a tuple like `(5,2)`. In the existing codebase, the compiler can infer that the reduction dimension of `(5,2)` is `2`, since reductions are always the last dimension. This became cumbersome now that we have to support multiple reduction dimensions, so I refactored tiling into a dict like `{"x": 5, "r0_": 2, "r1_": 4}`. This required quite a few code changes, but I don't think it makes the underlying logic much more complex. This will also make it easier to eventually support simultaneous pointwise and reduction tiling, like `{"x": 5, "y": 5, "r0_": 2, "r1_": 4}`. (This is not supported today, but we might want to do it eventually.) The existing tiling algorithm generalized naturally to support reductions. For pointwise kernels, we tile the pointwise dimensions (`"x"`, `"y"`) as is. For reduction kernels, we never tile the `"x"` dimension, and only tile the reduction dimensions (`"r0_"`, `"r1_"`). Thus we only ever tile pointwise OR reduction dimensions, but not both. In principle it seems possible to support both, but it would likely require changes to the kernel fusion and autotuning logic. I thought it best to keep this PR as minimal as possible since it already touched a lot of different files. Unfortunately, these changes weren't enough to get block pointers in some seemingly simple test cases. In some tests for `argmax` and `var_mean`, we already collapse reduction dimensions into 1D and generate modular indexing expressions, prior to tiling. So it's not trivial to figure out how to expand the collapsed reduction dimension back to a shape that would simplify the indexing. To address these cases, this PR adds a new feature to the `config.prefer_nd_tiling` option, which analyzes reads and writes in the kernel, using the same mod-div pattern matching logic that generates block pointers later on. By matching this pattern, we can solve for the tiling splits which would simplify the indexing expression, and use then use that tiling to eliminate the modular indexing and emit a block pointer. This tiling mode is still off by default, but it's important for certain applications where we need to get as many block pointers as possible. # Test plan This touches pretty much anything that uses the Triton and Halide backends, so the existing CI provides good coverage. However, 2D reductions are gated behind a few feature flags like `config.prefer_nd_tiling` and `config.tile_reductions`, so this really only checks that the PR doesn't break 1D reductions. In addition to existing CI tests, this PR also adds some new tests that specifically stress 2D reductions: - `test_2d_reduction_odd_shapes`: test 2D reductions with a variety of ops and sizes. This covers the typical persistent and looped reductions. - `test_2d_reduce_no_x_dim`: test 2D reductions with no x dimension. - `test_2d_welford_reduction`: test 2D welford reductions with block pointers. - `test_welford_non_block_pointer`: test a 2D welford reduction when block pointer analysis fails. - `test_reduction_multiple_discontiguous_dims`: test reducing over more than one discontiguous dimension. We won't get a block pointer for this case, since that would require 3D tiling, but we're currently limited to 2D. - `test_2d_reduction_multi_kernel`: test multi kernel autotuning on a 2D softmax kernel. - `test_enable_tiled_reductions`: test that `config.triton.tile_reductions` enables/disables this feature. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137243 Approved by: https://github.com/jansel Co-authored-by: Yueming Hao <yhao@meta.com> Co-authored-by: Jason Ansel <jansel@meta.com>	2024-12-31 05:06:46 +00:00
Boyuan Feng	f3e5078c27	[Inductor] Relax size constraints for re-inplacing (#143884 ) Current reinplacing requires input buffer and output buffer has exactly the same storage size. However, matmul padding may increase the tensor size slightly for better performance, which prevents reinplacing. This PR changes the size constraints to be: - input and output buffer have exactly the same symbolic expression for storage size (i.e., sympy str). - it's statically known that 0.99 * input_size <= output_size <= input_size ### Apply on llm.c See the reuse of `buf1`. Before relaxing size requirements on re-inplacing: ([P1703512078](https://www.internalfb.com/phabricator/paste/view/P1703512078)) ![1](https://github.com/user-attachments/assets/1472f550-6eb8-4d5c-9965-49bbb20d81a9) After relaxing size requirements on re-inplacing: ([P1703513053](https://www.internalfb.com/phabricator/paste/view/P1703513053)) ![2](https://github.com/user-attachments/assets/416294dd-30eb-4e12-a36c-1aebf9af530b) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143884 Approved by: https://github.com/eellison	2024-12-31 03:52:47 +00:00
Nikita Shulga	11bb94b7ea	[MPSInductor] Fix index generation for transpose (#143973 ) Alas, PythonPrinter would not work here, not would CppPrinter, so start building MetalPrinter. `pytest test/inductor/test_torchinductor.py -k _mps` score is 474 failed, 277 passed, 32 skipped Before this change: `pytest test/inductor/test_torchinductor.py -k _mps` reported 506 failed, 245 passed, 32 skipped Pull Request resolved: https://github.com/pytorch/pytorch/pull/143973 Approved by: https://github.com/jansel ghstack dependencies: #143948, #143949	2024-12-31 02:04:50 +00:00
xinan.lin	934eaa503f	[Inductor XPU] Support max-autotune on XPU and reuse the corresponding Inductor UT. (#143266 ) This PR aims to add the functionality support of max-autotune for XPU. The current triton templates and configurations are not well optimized for XPU, so the performance is not ready yet. Also the `mm_plus_mm` template have accuracy issues in some cases. We will address these issues in the next PRs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143266 Approved by: https://github.com/EikanWang, https://github.com/jansel	2024-12-30 23:51:17 +00:00
Jason Ansel	2da7fb5320	[inductor] Make generated kernels deterministic (#143951 ) `"compile_id"` had slipped into our generated Triton code (in the metadata), which will defeat caching because the same kernels generated in a different order would not cache hit with eachother. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143951 Approved by: https://github.com/oulgen	2024-12-30 23:35:11 +00:00
Benjamin Glass	d260bc4476	cpp_wrapper: minimize pybind11 dependency (#143772 ) Only include the parts of `pybind11` that handle GIL management within `cpp_wrapper`. This dramatically improves compilation times by reducing the number of headers we compile. Improvements on my local system are on the order of 2x. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143772 Approved by: https://github.com/Skylion007	2024-12-30 20:41:02 +00:00
PyTorch MergeBot	1b0d19a2cb	Revert "[inductor] Make generated kernels deterministic (#143951 )" This reverts commit `79b354ee37`. Reverted https://github.com/pytorch/pytorch/pull/143951 on behalf of https://github.com/wdvr due to failing tests on trunk ([comment](https://github.com/pytorch/pytorch/pull/143951#issuecomment-2564952267))	2024-12-30 02:06:38 +00:00
Jason Ansel	79b354ee37	[inductor] Make generated kernels deterministic (#143951 ) `"compile_id"` had slipped into our generated Triton code (in the metadata), which will defeat caching because the same kernels generated in a different order would not cache hit with eachother. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143951 Approved by: https://github.com/oulgen	2024-12-29 19:53:33 +00:00
Nikita Shulga	d8c3900d80	[Inductor] Implement primitive Metal compiler (#143893 ) Still work in progress, only works for element wise operations. Current implementation could be used to turn something like ```python def f(x): return x[:,::2].sin() + x[:, 1::2].cos() ``` into the following shader ```python # Topologically Sorted Source Nodes: [sin, cos, add], Original ATen: [aten.sin, aten.cos, aten.add] # Source node to ATen node mapping: # add => add # cos => cos # sin => sin # Graph fragment: # %sin : [num_users=1] = call_function[target=torch.ops.aten.sin.default](args = (%slice_2,), kwargs = {}) # %cos : [num_users=1] = call_function[target=torch.ops.aten.cos.default](args = (%slice_4,), kwargs = {}) # %add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%sin, %cos), kwargs = {}) mps_lib = torch.mps._compile_shader(""" kernel void kernel_0( device float* out_ptr0, constant float* in_ptr0, uint xindex [[thread_position_in_grid]] ) { int x0 = xindex; auto tmp0 = in_ptr0[2x0]; auto tmp1 = metal::precise::sin(tmp0); auto tmp2 = in_ptr0[2x0 + 1]; auto tmp3 = metal::precise::cos(tmp2); auto tmp4 = tmp1 + tmp3; out_ptr0[x0] = static_cast<float>(tmp4); } """) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/143893 Approved by: https://github.com/jansel ghstack dependencies: #143891, #143892	2024-12-28 06:58:32 +00:00
leslie-fang-intel	74028cfd0c	[Inductor][CPP] Fix Data Type issue of frexp (#143746 ) Summary Fix issue: https://github.com/pytorch/pytorch/issues/143729. `frexp` has 1 input but 2 output tensor with different data type, current `deduce_dtype_for_cpp_cse_variable` can't deduce the data type for each output correctly due to missing of output index. In this PR, we set the data type of cse var in the codegen of `frexp` and avoid it being overridden in the following flow. Test Plan ``` python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_frexp ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/143746 Approved by: https://github.com/jgong5	2024-12-28 06:00:13 +00:00
Nikita Shulga	4a7cf0dbff	[Inductor] Add MPS device op overrides (#143892 ) Mostly dummy interface as MPS backend currently limited to a single device Pull Request resolved: https://github.com/pytorch/pytorch/pull/143892 Approved by: https://github.com/jansel ghstack dependencies: #143891	2024-12-28 02:11:45 +00:00
Colin Peppler	b54620f40f	[CUTLASS] fix bugs: extra data_ptr() call, wrong size symbol name, bias symbol not added (#143528 ) A few small things in this PR: - fixed a bug where `workspace.data_ptr().data_ptr()` showed up - for SM80 CUTLASS kernels, the symbol size for W.size(1) was never created - for addmm kernels, the ldc bias symbol never showed up Pull Request resolved: https://github.com/pytorch/pytorch/pull/143528 Approved by: https://github.com/henrylhtsang	2024-12-27 23:38:18 +00:00
bobrenjc93	c17d767686	remove allow-untyped-defs from _inductor/codegen/rocm/rocm_template_buffer.py (#143870 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143870 Approved by: https://github.com/aorenste, https://github.com/Skylion007	2024-12-27 23:28:51 +00:00
bobrenjc93	63d6e1f743	remove allow-untyped-defs from _inductor/codegen/aoti_hipify_utils.py (#143916 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143916 Approved by: https://github.com/Skylion007	2024-12-27 23:25:37 +00:00
Animesh Jain	969415885d	[inductor][invoke_subgraph] Support None/int as input/output of invoke_subgraph (#139373 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139373 Approved by: https://github.com/eellison	2024-12-27 06:46:09 +00:00
bobrenjc93	d60282c177	remove allow-untyped-defs from _inductor/codegen/cpu_device_op_overrides.py (#143881 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143881 Approved by: https://github.com/aorenste	2024-12-27 04:10:47 +00:00
PyTorch MergeBot	cc4e70b7c3	Revert "Use absolute path `path.resolve()` -> `path.absolute()` (#129409 )" This reverts commit `135c7db99d`. Reverted https://github.com/pytorch/pytorch/pull/129409 on behalf of https://github.com/malfet due to need to revert to as dependency of https://github.com/pytorch/pytorch/pull/129374 ([comment](https://github.com/pytorch/pytorch/pull/129409#issuecomment-2562969825))	2024-12-26 17:26:06 +00:00
PyTorch MergeBot	844e6108f6	Revert "[Inductor XPU] Support max-autotune on XPU and reuse the corresponding Inductor UT. (#143266 )" This reverts commit `ad750ae320`. Reverted https://github.com/pytorch/pytorch/pull/143266 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing some tests in trunk ([comment](https://github.com/pytorch/pytorch/pull/143266#issuecomment-2561303786))	2024-12-24 17:22:57 +00:00
Xuehai Pan	b77406a9ec	[BE][CI] bump `ruff` to 0.8.4 (#143753 ) Changes: 1. Bump `ruff` from 0.7.4 to 0.8.4 2. Change `%`-formatted strings to f-string 3. Change arguments with the `__`-prefix to positional-only arguments with the `/` separator in function signature. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143753 Approved by: https://github.com/Skylion007	2024-12-24 12:24:10 +00:00
Xuehai Pan	135c7db99d	Use absolute path `path.resolve()` -> `path.absolute()` (#129409 ) Changes: 1. Always explicit `.absolute()`: `Path(__file__)` -> `Path(__file__).absolute()` 2. Replace `path.resolve()` with `path.absolute()` if the code is resolving the PyTorch repo root directory. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129409 Approved by: https://github.com/albanD	2024-12-24 08:33:08 +00:00
xinan.lin	ad750ae320	[Inductor XPU] Support max-autotune on XPU and reuse the corresponding Inductor UT. (#143266 ) This PR aims to add the functionality support of max-autotune for XPU. The current triton templates and configurations are not well optimized for XPU, so the performance is not ready yet. Also the `mm_plus_mm` template have accuracy issues in some cases. We will address these issues in the next PRs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143266 Approved by: https://github.com/EikanWang, https://github.com/jansel	2024-12-24 05:42:36 +00:00
cyy	1feae27ed6	[16/N] Fix extra warnings brought by clang-tidy-17 (#143714 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/143714 Approved by: https://github.com/Skylion007, https://github.com/albanD	2024-12-24 03:29:38 +00:00
Kai Londenberg	434e0c2104	Inductor Cutlass backend: Eliminate unused code. (#143723 ) Summary: Eliminates an unused file and some smaller unused code fragments from the inductor cutlass codebase. Test Plan: CI Differential Revision: D67579837 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143723 Approved by: https://github.com/ColinPeppler	2024-12-23 09:35:03 +00:00
cyy	09c950cc87	Remove unused <ATen/core/Array.h> inclusion (#143701 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/143701 Approved by: https://github.com/albanD	2024-12-22 14:30:11 +00:00
Tom Ritchford	f1cbf4b1b5	Enable ruff's unused variable checking everywhere in pytorch (#136965 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136965 Approved by: https://github.com/cyyever, https://github.com/albanD	2024-12-22 02:33:11 +00:00
leslie-fang-intel	607884c9af	[Inductor][CPP] Fix bitwise shift with corner inputs (#143635 ) Summary Fix issue https://github.com/pytorch/pytorch/issues/143555 and https://github.com/pytorch/pytorch/issues/143566, we can align the implementation with Eager: `29b586bbad/aten/src/ATen/native/cpu/BinaryOpsKernel.cpp (L501)` at these corner inputs. Test Plan ``` python test/inductor/test_cpu_repro.py -k test_bitwise_shift_corner_inputs ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/143635 Approved by: https://github.com/jgong5	2024-12-20 13:47:40 +00:00
Rachel Guo	9275091d6e	[provenance_tracking] Dump inductor_triton_kernel_to_post_grad_nodes.json info in debug_trace (#143055 ) Summary: This diff mainly adds code changes to dump `inductor_triton_kernel_to_post_grad_nodes.json` artifact which contains mapping info from post_grad -> inductor kernel code: `{"inductor_triton_kernel_name": [post_grad_node_0, post_grad_node_1, ..., ], "..."}.` Example paste: P1695235000 verified on the test model. See "Test Plan": We use this artifact to demonstrate provenance tracking in the frontend 3-tab highlighter tool: https://github.com/YUNQIUGUO/compiler_explorer (copy/pasted the input files for demo purpose for now and will integrate with Shangdi's tool to 4-tab) https://pxl.cl/66BzK Note: Currently only supports mapping for inductor's`TritonKernel` type. TODO for enhancing more support for `ExternKernel` and other inductor generated kernel type, etc. Test Plan: test_model_coverage.sh: ``` #!/bin/sh MODEL_ENTITY_ID=644688112 SNAPSHOT_ID=32 MODULE=merge # buck2 build --show-output mode/opt -c=python.package_style=inplace -c fbcode.enable_gpu_sections=true -c fbcode.platform=platform010 -c fbcode.split-dwarf=true -c fbcode.nvcc_arch=a100,h100 caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark TORCH_COMPILE_DEBUG=1 CUDA_VISIBLE_DEVICES=0 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCH_LOGS="+inductor, schedule, fusion, output_code" TORCH_TRACE="tmp/guorachel_tt" TORCHINDUCTOR_MAX_AUTOTUNE=1 TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 ../buck-out/v2/gen/fbcode/d29ee94b913014f1/caffe2/torch/fb/model_transform/experimental/benchmark/__mts_gpu_benchmark__/mts_gpu_benchmark.par --model-path manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/gpu_lowering/input.predictor.disagg.gpu.merge --lower-backend AOT_INDUCTOR_EP --gpu-trace --aot-inductor-config="{'max_autotune': True}" 2>&1 \| tee output.txt ``` {F1973765026} ``` buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:provenance_tracing -- --exact 'caffe2/test/inductor:provenance_tracing - test_triton_kernel_post_grad_mapping_aot_inductor (caffe2.test.inductor.test_provenance_tracing.TestProvenanceTracingArtifact)' ``` ``` TORCH_LOGS="+inductor, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:provenance_tracing -- -r test_triton_kernel_post_grad_mapping_aot_inductor ``` Differential Revision: D66967510 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143055 Approved by: https://github.com/chenyang78	2024-12-18 06:51:50 +00:00
Jason Ansel	cf46eb3bf5	[inductor] Include types and size hints in MultiKernel cache key (#142349 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/142349 Approved by: https://github.com/eellison, https://github.com/shunting314	2024-12-17 09:26:38 +00:00
bobrenjc93	a42ca5a45b	remove allow-untyped-defs for _inductor/codegen/rocm/rocm_template_buffer.py (#143272 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143272 Approved by: https://github.com/aorenste	2024-12-17 05:34:22 +00:00
Bin Bao	467970d683	[AOTI] Relax input alignment assertion (#143236 ) Summary: https://github.com/pytorch/pytorch/pull/142136 added a runtime alignment assertion. But the assumption is probably too strict for more flexible use cases of AOTI, e.g. python deployment, see a recent error torchchat ran into for more details, https://github.com/pytorch/torchchat/actions/runs/12322072267/job/34394851280 . This PR relaxes the runtime check and implements copy_misaligned_inputs in cpp instead. Differential Revision: [D67287922](https://our.internmc.facebook.com/intern/diff/D67287922) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143236 Approved by: https://github.com/malfet, https://github.com/chenyang78	2024-12-17 00:17:39 +00:00
eellison	135a2d4483	Update low prec codegen for div/mod (#142350 ) Div/mod in fp16/bf16 requires a downcast to preserve its inputs' dtypes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/142350 Approved by: https://github.com/blaine-rister	2024-12-16 21:46:08 +00:00
PyTorch MergeBot	54ed13cdce	Revert "Update low prec codegen for div/mod (#142350 )" This reverts commit `ca973069ed`. Reverted https://github.com/pytorch/pytorch/pull/142350 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it. breaks an internal test ([comment](https://github.com/pytorch/pytorch/pull/142350#issuecomment-2546615951))	2024-12-16 20:05:14 +00:00
eellison	ca973069ed	Update low prec codegen for div/mod (#142350 ) Div/mod in fp16/bf16 requires a downcast to preserve its inputs' dtypes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/142350 Approved by: https://github.com/blaine-rister	2024-12-14 03:53:28 +00:00
leslie-fang-intel	00b0210139	[Inductor] Use sleef implementation for CPP backend asinh codegen (#142360 ) Summary Fix https://github.com/pytorch/pytorch/issues/142345. Previously, we use `asinh(x) = log(x + sqrt(1 + x2))` to calculate the result of `asinh`, the issue happens when input with `-10000.1`, which makes `x + sqrt(1 + x2)` close to 0 and log(0) is invalid. We use the `sleef` implementation in this PR to fix this issue. Test Plan ``` python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_asinh_with_corner_inputs ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/142360 Approved by: https://github.com/jgong5	2024-12-14 00:27:55 +00:00
eellison	8621b9ff0c	Infer whether prologues can be computed without upcasting to fp32 without changing numerics (#142402 ) For prologues which only do either loads like gathers or dtype conversions, and no actual arithmetic on lower-precision types, we can codegen them without upcasting to fp32 without changing numerics. Prologues that actually do arithmetic will need to use invoke quant. But I would like to to support upcasts/gathers out of the box. We could potentially extend this in the future to avoid upcasting max pooling operations as well, if there were perf benefits to be had (less likely). Pull Request resolved: https://github.com/pytorch/pytorch/pull/142402 Approved by: https://github.com/jansel ghstack dependencies: #142401	2024-12-13 23:25:15 +00:00
eellison	ad2faec8bb	Add a pass which analyzes whether a prologue preserves zero mask (#142401 ) We load inputs to prologue fusion with a mask. That mask must still be zero before we run `tl.dot`. Previously, we would always apply the mask: ``` tmp0 = tl.load(in_ptr1 + (tl.broadcast_to(xindex, xindex.shape)), a_mask, eviction_policy='evict_last') tmp1 = tmp0.to(tl.float32) a = tl.where(a_mask, tmp1, 0.0) ``` now we do not need to -> ``` tmp0 = tl.load(in_ptr1 + (tl.broadcast_to(xindex, xindex.shape)), a_mask, eviction_policy='evict_last') tmp1 = tmp0.to(tl.float32) a = tmp1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/142401 Approved by: https://github.com/jansel	2024-12-13 22:37:33 +00:00
Sam Larsen	60c54467db	[logging] Log runtime autotuning timing to scuba (#141919 ) See test plan in internal diff [D66679369](https://our.internmc.facebook.com/intern/diff/D66679369) Pull Request resolved: https://github.com/pytorch/pytorch/pull/141919 Approved by: https://github.com/jamesjwu, https://github.com/ezyang	2024-12-13 21:22:13 +00:00
Tom Ritchford	da67a6a7bb	[inductor] Replace set by OrderedSet (#138466 ) Uses the set_linter from https://github.com/pytorch/pytorch/pull/138454 and considerable manual editing Pull Request resolved: https://github.com/pytorch/pytorch/pull/138466 Approved by: https://github.com/eellison	2024-12-13 16:08:45 +00:00
eellison	b731ced91f	Prologue Fusion (#134532 ) This PR extends our ability to fuse pointwise nodes onto triton templates with the ability to fuse pointwise nodes into triton templates - prologue fusion. Similar to the store_output api: `{{store_output(("idx_m", "idx_n"), "acc", "mask")}}` And the modification api: ``` {{ modification( subgraph_number=0, output_name="post_mod_scores", score="qk", out="qk" ) \| indent_except_first(1) }} ``` We have: ```{{load_input("B", "b", ("idx_m", "idx_n"), mask=None if EVEN_K else "b_mask", indent_width=8)}}``` Because we are now loading the input with explicit indices and mask, I needed to rewrite the mm kernel to no longer update the [pointers by BLOCK_K](`bb03ef7aca/torch/_inductor/kernel/mm.py (L110-L111)`) on every iteration and instead on each iteration compute indices from the the k_idx of each loop. This did not have any perf difference. There are a couple main use cases for prologue fusion: - Fusing dequants into a matmul. particularly for more bandwidth bound scenarios. - Fusing gather into a matmul. This is useful particularly in MOE. See https://github.com/pytorch/pytorch/issues/134535 for more details. Prologue fusion is generally much less profitable than epilogue fusion, because it must be applied to an element of an input on each loop of the matmul, compared to only once in the epilogue (gather into matmul is a potential exception). Accordingly, we are much less aggressive in attempting to fuse prologue fusion. We only attempt fusion if it does not increase the number of memory bytes read instead the triton template, multipled by a small factor to allow gathers. This restricts reliably unprofitable fusions like fp32->fp16 inside kernel. In future pr we could potentially have api of being more aggressive if we know we are in a bandwidth bound regime. See: https://github.com/pytorch/pytorch/pull/134532/files#diff-d2539c9c8dc6a3d7e457767a880612e96d3c85752a77ead49a9e4e00a3e4c3c7R3060-R3066 Other notes: By default we will upcast to fp32 inside every kernel. This matches eager numerics. This is fine enough for epilogue because it is only done once (although it is probably unnecessary for say a relu) but tanks perf for prologue. I am currently using the `codegen_upcast_to_fp32` option to avoid it, but that will not work for libdevice calls that require fp32. We will need https://github.com/pytorch/pytorch/pull/136778/ and dtype-aware codegen to upcast fp16 ops into libdevice calls. With prologue fusion, we now have essentially separate kernels for each input, and for the output. I had to increase the number of fields that are swapped out in `set_subgraph_body` by a large number :/ I also update the fusion logic because the inputs will have a different group than the outputs. Maybe as part of enabling multiple outputs, this could get cleaned up a bit so.. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134532 Approved by: https://github.com/jansel	2024-12-13 04:18:25 +00:00
eellison	0b75b7ff2b	[Easy] factor out inductor ophandler decompositions (#142400 ) Factor out inductor operator decompositions Pull Request resolved: https://github.com/pytorch/pytorch/pull/142400 Approved by: https://github.com/Chillee, https://github.com/jansel	2024-12-12 19:03:26 +00:00
Tom Ritchford	dc23f1944a	Remove unused Python variables in torch/[_-a]* (#133492 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133492 Approved by: https://github.com/albanD	2024-12-12 17:39:14 +00:00
Blaine Burton Rister	520ba556cd	[Inductor] Refactor "r" reduction prefix to {"r0_", "r1_"}. (#142020 ) Preparatory refactor for https://github.com/pytorch/pytorch/pull/137243. # Feature This PR changes the `RINDEX` / `"r"` symbol type to `(R0_INDEX, R1_INDEX)` and `("r0_", "r1_")`, respectively. This allows the relevant code to support 2D (often ND) reductions. Unlike the parent PR, this one does not change the tiling algorithm, so `"r1_"` is never used. However, it prepares other parts of the system to handle `"r1_"` once we start using it. This should significantly reduce the chances of hitting merge conflicts, making the parent PR much easier to land. The only change to the generated triton code is to rename `"rindex"` -> `"r0_index"`, `"RBLOCK"` -> `"R0_BLOCK"`, etc. To maintain compatibilty with existing codegen, this also generates aliases to the old reduction variables like `rindex = r0_index`. If we generated 2D reductions (which this PR will not do), the aliases would be more complicated and would collapse 2D multi-indices to linear indices. See some example kernels in the parent PR. These aliases can be eliminated by the Triton compiler, and should not impact the final machine code running on the GPU. See the perf testing in the parent PR which confirms the aliases do not impact perf. # Test plan The existing CI provides good coverage. This PR modifies the expected code in a few places, renaming reduction variables from `r.` to `r0_.`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/142020 Approved by: https://github.com/jansel Co-authored-by: Jason Ansel <jansel@meta.com>	2024-12-12 17:22:20 +00:00
eellison	725526abc5	Fix scan dtypes (#143048 ) FIx for https://github.com/pytorch/pytorch/issues/142883. We weren't getting test coverage of scan because the tests were being skipped. see, https://github.com/pytorch/pytorch/issues/143053 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143048 Approved by: https://github.com/arui-meta, https://github.com/blaine-rister	2024-12-12 15:57:00 +00:00
PyTorch MergeBot	cd1b5924d5	Revert "[Inductor] Use sleef implementation for CPP backend asinh codegen (#142360 )" This reverts commit `79cf8fa751`. Reverted https://github.com/pytorch/pytorch/pull/142360 on behalf of https://github.com/jeanschmidt due to seems to have broken macos tests ([comment](https://github.com/pytorch/pytorch/pull/142360#issuecomment-2539143039))	2024-12-12 14:42:55 +00:00
leslie-fang-intel	79cf8fa751	[Inductor] Use sleef implementation for CPP backend asinh codegen (#142360 ) Summary Fix https://github.com/pytorch/pytorch/issues/142345. Previously, we use `asinh(x) = log(x + sqrt(1 + x2))` to calculate the result of `asinh`, the issue happens when input with `-10000.1`, which makes `x + sqrt(1 + x2)` close to 0 and log(0) is invalid. We use the `sleef` implementation in this PR to fix this issue. Test Plan ``` python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_asinh_with_corner_inputs ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/142360 Approved by: https://github.com/jgong5	2024-12-12 05:40:48 +00:00
leslie-fang-intel	06075d3d18	[Inductor][CPP] Fix Mask Dtype mismatch (#142103 ) Summary Fix issue: https://github.com/pytorch/pytorch/issues/141559. The `vec_mask` store data type doesn't aligned when doing `bitwise_and`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/142103 Approved by: https://github.com/jgong5	2024-12-12 01:21:32 +00:00
PyTorch MergeBot	233853a66f	Revert "Prologue Fusion (#134532 )" This reverts commit `59ab3825e7`. Reverted https://github.com/pytorch/pytorch/pull/134532 on behalf of https://github.com/clee2000 due to A couple of PRs in this stack are breaking internally on different tests ([comment](https://github.com/pytorch/pytorch/pull/134532#issuecomment-2536643675))	2024-12-11 17:32:26 +00:00
PyTorch MergeBot	f0b80d014d	Revert "Update low prec codegen for div/mod (#142350 )" This reverts commit `1fb3d5a4e3`. Reverted https://github.com/pytorch/pytorch/pull/142350 on behalf of https://github.com/clee2000 due to A couple of PRs in this stack are breaking internally on different tests ([comment](https://github.com/pytorch/pytorch/pull/134532#issuecomment-2536643675))	2024-12-11 17:32:26 +00:00
PyTorch MergeBot	829a93562a	Revert "[Easy] factor out inductor ophandler decompositions (#142400 )" This reverts commit `fa746e3eeb`. Reverted https://github.com/pytorch/pytorch/pull/142400 on behalf of https://github.com/clee2000 due to A couple of PRs in this stack are breaking internally on different tests ([comment](https://github.com/pytorch/pytorch/pull/134532#issuecomment-2536643675))	2024-12-11 17:32:26 +00:00
PyTorch MergeBot	9e88279737	Revert "Add a pass which analyzes whether a prologue preserves zero mask (#142401 )" This reverts commit `1a0bd40243`. Reverted https://github.com/pytorch/pytorch/pull/142401 on behalf of https://github.com/clee2000 due to A couple of PRs in this stack are breaking internally on different tests ([comment](https://github.com/pytorch/pytorch/pull/134532#issuecomment-2536643675))	2024-12-11 17:32:25 +00:00

1 2 3 4 5 ...

1706 Commits