pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
PyTorch MergeBot	784a6ec5a3	Revert "Migrate Inductor scheduler, dependencies, ir, and codegen/common to use OrderedSet (#130004 )" This reverts commit `13d744464f`. Reverted https://github.com/pytorch/pytorch/pull/130004 on behalf of https://github.com/clee2000 due to broke lint [GH job link](https://github.com/pytorch/pytorch/actions/runs/10183945999/job/28170099930) [HUD commit link](`13d744464f`) probably a landrace, the base is 21 hours old ([comment](https://github.com/pytorch/pytorch/pull/130004#issuecomment-2260946562))	2024-07-31 16:49:21 +00:00
eellison	13d744464f	Migrate Inductor scheduler, dependencies, ir, and codegen/common to use OrderedSet (#130004 ) Python's set is non deterministic. There is an internal failure which we recently ran into which did not consistently fail. See, repro here: P1453035092. Now, with these changes, it does consistently fail. In follow ups we could also consider adding a lintrule for uses of either set() or set literals. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130004 Approved by: https://github.com/oulgen	2024-07-31 16:22:11 +00:00
Yuzhen Huang	5298acb5c7	Back out "[1/2] PT2 Inductor ComboKernels - Foreach cases (#124969 )" (#132065 ) Summary: Original commit changeset: 1d8cfdcef69d Original Phabricator Diff: D54134695 back out: D54134695 Test Plan: more details see: https://docs.google.com/document/d/1noPTmTdNYHVDFyk7AJSSO7jQoNw6fTo4o6k9eTNeZh8/edit#heading=h.xeo30usu77nc Reviewed By: zw2326 Differential Revision: D60397377 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132065 Approved by: https://github.com/zw2326, https://github.com/qchip	2024-07-29 22:48:29 +00:00
Peter Bell	9ae288f4be	[inductor] Simplify multi-kernel codegen by unifying kernel args (#127724 ) Persistent kernels are sometimes able to remove intermediate buffers that would otherwise be needed for the non-persistent reduction kernel. This makes multi kernel's codegen more complicated as it needs to drop these extra arguments at runtime after selecting the correct kernel to run. Instead, this PR updates the persistent kernel's `must_keep_buffers` so these aren't dropped during codegen so both kernels have the same signature. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127724 Approved by: https://github.com/shunting314 ghstack dependencies: #131044	2024-07-26 00:12:43 +00:00
Yunqiu Guo	059f9fb30b	[BE][inductor] Type annotate `codecache.py` and `config.py` (#131427 ) As title. Referred to the raw JSON file for runtime types, and tried to cover all the missing annotations listed in the .json this time. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131427 Approved by: https://github.com/eellison, https://github.com/oulgen	2024-07-25 05:54:38 +00:00
Feng Shi	404d640c39	[1/2] PT2 Inductor ComboKernels - Foreach cases (#124969 ) Summary: A ComboKernel combines independent Inductor Triton kernels into a single one. Consolidation with Foreach kernel: 1) For the scheduler node, the logic is consolidated into ForeachKernelSchedulerNode 2) The backend kernel is consolidated into ComboKernel. (Note: this is part 1 which only deals with the 1st case above.) Details: 1. ComboKernel can be viewed as the extension of Foreach kernel (see the examples below). The main differences are: 1) the block size is tunable (but currently shared by the sub-kernels). 2) it supports multiple kernel types, like pointwise, reduce, and may extend to matmul as well (it doesn't support mixed 1d and 2d kernels yet, but it can be extended for such cases) 3) the blocks are interleaved among the sub-kernels (can be extended to other arrangements), 4) it is designed to be general enough to combine kernels without dependency and doesn't rely on certain patterns. 5) it doesn't support dynamic sizes yet but can be easily extended for it. 2. ComboKernel is used in two cases: 1) for existing foreach kernels, combo kernels are used as the backend kernel. The front-end kernel generation logic remains the same. 2) Added an extra optimization phase to the end of the scheduler to generate extra combo kernels if combo_kernels is True in config.py 3. The combo kernel generation in the added optimization phase is done in two steps: 1) in the front end inside the scheduler, it topologically sorts the schedule nodes to find all the nodes with no data dependency and creates a front-end schedule node for them. We currently limit the maximal number of sub-nodes for each combo kernel to 8 (but we still need to find what the optimal number is). 2) then, these sub-nodes are combined in the codegen phase to generate the combo kernel code for them based on a few rules. 
For example, 1d and 2d kernels are separated into different combo kernels, as mixing them is not supported yet. Note these algorithms we provide are very basic, and the users can register their customized combo kernel generation algorithms for both steps. 4. Performance-wise, combining small kernels almost always yields a performance gain. However, combining very large kernels may not yield any perf gain, and sometimes even causes a regression, possibly due to improper block sizes. Thus, a benchmark function is implemented to avoid such perf regression, and it is recommended to turn it on by setting benchmark_combo_kernels to True whenever combo_kernels is True. Example: - elementwise kernels Original PyTorch function: ``` def test_activations(a, b, c): a1 = torch.nn.functional.relu(a) b1 = torch.nn.functional.sigmoid(b) c1 = torch.nn.functional.tanh(c) return a1, b1, c1 ``` Generated combokernel: ``` @triton_heuristics.pointwise( size_hints=[512], tile_hint=TileHint.DEFAULT, filename=__file__, triton_meta={'signature': {0: 'fp32', 1: 'fp32', 2: 'fp32', 3: 'fp32', 4: 'fp32', 5: 'fp32'}, 'device': 0, 'device_type': 'cuda', 'constants': {}, 'configs': [AttrsDescriptor(divisible_by_16=(0, 1, 2, 3, 4, 5), equal_to_1=())]}, inductor_meta={'kernel_name': 'triton_poi_fused_0', 'mutated_arg_names': []} ) @triton.jit def triton_(in_ptr0, in_ptr1, in_ptr2, out_ptr0, out_ptr1, out_ptr2, XBLOCK : tl.constexpr): pid = tl.program_id(0) if pid % 3 == 0: pid_offset = pid // 3 xnumel = 100 rnumel = 1 xoffset = pid_offset * XBLOCK xindex = xoffset + tl.arange(0, XBLOCK)[:] xmask = xindex < xnumel x0 = xindex tmp0 = tl.load(in_ptr0 + (x0), xmask) tmp1 = triton_helpers.maximum(0, tmp0) tl.store(out_ptr0 + (x0), tmp1, xmask) elif pid % 3 == 1: pid_offset = pid // 3 xnumel = 400 rnumel = 1 xoffset = pid_offset * XBLOCK xindex = xoffset + tl.arange(0, XBLOCK)[:] xmask = xindex < xnumel x1 = xindex tmp2 = tl.load(in_ptr1 + (x1), xmask) tmp3 = tl.sigmoid(tmp2) tl.store(out_ptr1 + (x1), tmp3, xmask) elif pid % 3 == 2: 
pid_offset = pid // 3 xnumel = 100 rnumel = 1 xoffset = pid_offset * XBLOCK xindex = xoffset + tl.arange(0, XBLOCK)[:] xmask = xindex < xnumel x2 = xindex tmp4 = tl.load(in_ptr2 + (x2), xmask) tmp5 = libdevice.tanh(tmp4) tl.store(out_ptr2 + (x2), tmp5, xmask) else: pass ``` - reduction kernels Original PyTorch function: ``` def test_reduce(a, b, c): a1 = torch.sum(a, dim=0) b1 = torch.max(b, dim=0) c1 = torch.min(c, dim=0) return a1, b1, c1 ``` Generated combokernel: ``` @triton_heuristics.persistent_reduction( size_hints=[32, 32], reduction_hint=ReductionHint.DEFAULT, filename=__file__, triton_meta={'signature': {0: 'fp32', 1: 'fp32', 2: 'fp32', 3: 'fp32', 4: 'i64', 5: 'fp32', 6: 'i64', 7: 'fp32'}, 'device': 0, 'device_type': 'cuda', 'constants': {}, 'configs': [AttrsDescriptor(divisible_by_16=(0, 1, 2, 3, 4, 5, 6, 7), equal_to_1=())]}, inductor_meta={'kernel_name': 'triton_per_fused_0', 'mutated_arg_names': []} ) @triton.jit def triton_(in_ptr0, in_ptr1, in_ptr2, out_ptr0, out_ptr1, out_ptr2, out_ptr3, out_ptr4, XBLOCK : tl.constexpr): pid = tl.program_id(0) if pid % 3 == 0: pid_offset = pid // 3 xnumel = 20 rnumel = 20 RBLOCK_0: tl.constexpr = 32 xoffset = pid_offset * XBLOCK xindex = xoffset + tl.arange(0, XBLOCK)[:, None] xmask = xindex < xnumel rindex = tl.arange(0, RBLOCK_0)[None, :] roffset = 0 rmask = rindex < rnumel r1 = rindex x0 = xindex tmp0 = tl.load(in_ptr0 + (x0 + (20*r1)), rmask & xmask, other=0.0) tmp1 = tl.broadcast_to(tmp0, [XBLOCK, RBLOCK_0]) tmp3 = tl.where(rmask & xmask, tmp1, float("-inf")) tmp4 = triton_helpers.max2(tmp3, 1)[:, None] tmp6 = tl.broadcast_to(rindex, tmp3.shape) _, tmp5_tmp = triton_helpers.max_with_index(tmp3, tmp6, 1) tmp5 = tmp5_tmp[:, None] tl.store(out_ptr0 + (x0), tmp4, xmask) tl.store(out_ptr1 + (x0), tmp5, xmask) elif pid % 3 == 1: pid_offset = pid // 3 xnumel = 10 rnumel = 10 RBLOCK_1: tl.constexpr = 16 xoffset = pid_offset * XBLOCK xindex = xoffset + tl.arange(0, XBLOCK)[:, None] xmask = xindex < xnumel rindex = 
tl.arange(0, RBLOCK_1)[None, :] roffset = 0 rmask = rindex < rnumel r3 = rindex x2 = xindex tmp7 = tl.load(in_ptr1 + (x2 + (10*r3)), rmask & xmask, other=0.0) tmp8 = tl.broadcast_to(tmp7, [XBLOCK, RBLOCK_1]) tmp10 = tl.where(rmask & xmask, tmp8, float("inf")) tmp11 = triton_helpers.min2(tmp10, 1)[:, None] tmp13 = tl.broadcast_to(rindex, tmp10.shape) _, tmp12_tmp = triton_helpers.min_with_index(tmp10, tmp13, 1) tmp12 = tmp12_tmp[:, None] tl.store(out_ptr2 + (x2), tmp11, xmask) tl.store(out_ptr3 + (x2), tmp12, xmask) elif pid % 3 == 2: pid_offset = pid // 3 xnumel = 10 rnumel = 10 RBLOCK_2: tl.constexpr = 16 xoffset = pid_offset * XBLOCK xindex = xoffset + tl.arange(0, XBLOCK)[:, None] xmask = xindex < xnumel rindex = tl.arange(0, RBLOCK_2)[None, :] roffset = 0 rmask = rindex < rnumel r5 = rindex x4 = xindex tmp14 = tl.load(in_ptr2 + (x4 + (10*r5)), rmask & xmask, other=0.0) tmp15 = tl.broadcast_to(tmp14, [XBLOCK, RBLOCK_2]) tmp17 = tl.where(rmask & xmask, tmp15, 0) tmp18 = tl.sum(tmp17, 1)[:, None] tl.store(out_ptr4 + (x4), tmp18, xmask) else: pass ``` Note: ComboKernels uses masks to allow combination of kernels working with tensors of different sizes. Test Plan: ``` buck2 test mode/dev-nosan caffe2/test/inductor:foreach ``` ``` buck2 test mode/dev-nosan caffe2/test/inductor:combo_kernels ``` Differential Revision: D54134695 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124969 Approved by: https://github.com/mlazos	2024-07-23 17:34:28 +00:00
eellison	16a2a1aad3	Annotate graph.py (#131400 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131400 Approved by: https://github.com/shunting314	2024-07-23 07:04:12 +00:00
Xuehai Pan	b6d477fd56	[BE][Easy][16/19] enforce style for empty lines in import segments in `torch/_i*/` (#129768 ) See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter. You can review these PRs via: ```bash git diff --ignore-all-space --ignore-blank-lines HEAD~1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129768 Approved by: https://github.com/jansel	2024-07-20 16:20:58 +00:00
Peter Bell	27c2a0d63b	[inductor] Separate Buffer and Operation into two concepts (#130831 ) Resubmit of #128893 Currently a buffer represents both a tensor with physical storage and a computation that produces the tensor as a result. This PR attempts to split these into two different concepts in the scheduler. This should allow us to have multiple outputs from a single operation. Differential Revision: [D59876059](https://our.internmc.facebook.com/intern/diff/D59876059) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130831 Approved by: https://github.com/lezcano	2024-07-20 02:05:07 +00:00
eellison	e14d1d10ef	Unwrap Identity in prepare indexing (#130967 ) We wrap indexing calculation in the concat kernel in `Identity` so that we do not expand int32 intermediates to int64. This was causing an issue where the index simplified to an integer and would not hit an intended [path](`752c817898/torch/_inductor/codegen/triton.py (L1554)`) which would do wrapping with tl.full. I couldn't generate a minimal repro to add as test but I have a repro you can check here: P1483831261 There is already a test that we dont expand the int32 intermediates to int64. Differential Revision: [D59871850](https://our.internmc.facebook.com/intern/diff/D59871850) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130967 Approved by: https://github.com/Chillee, https://github.com/jansel	2024-07-18 00:43:53 +00:00
Colin Peppler	f272e0ab4a	[inductor] support unbacked symint divisors in vars_and_sizes (#130595 ) Scenario: ``` >>> nodes IterationRangesEntry( x2, divisor=192*u0 + 192576, length=s1, (xindex//(192*u0 + 192576)), {x0: 192, x1: u0 + 1003, x2: s1, x3: 192*s1*u0 + 192576*s1, x4: 192*u0 + 192576}) IterationRangesEntry( x1, divisor=192, length=u0 + 1003, ModularIndexing(xindex, 192, u0 + 1003), {x0: 192, x1: u0 + 1003, x2: s1, x3: 192*s1*u0 + 192576*s1, x4: 192*u0 + 192576}) IterationRangesEntry( x0, divisor=1, length=192, ModularIndexing(xindex, 1, 192), {x0: 192, x1: u0 + 1003, x2: s1, x3: 192*s1*u0 + 192576*s1, x4: 192*u0 + 192576}) ``` Think about whether using fallback is safe here. I think it's safe because the divisor of one IterationRangesEntry should be the product of the lengths of the preceding IterationRangesEntry? Unless, one of the lengths divides by an unbacked symint? Pull Request resolved: https://github.com/pytorch/pytorch/pull/130595 Approved by: https://github.com/aakhundov, https://github.com/ezyang	2024-07-16 16:21:38 +00:00
chilli	f9f85bfc0b	[Inductor] FlexAttention supports partial masking (#130415 ) (#130626 ) This is the new version of https://github.com/pytorch/pytorch/pull/130415 Updated test script: https://gist.github.com/yanboliang/7c34a82df611d4ea8869cb9e041bfbfc Updated perf numbers: ``` (pt) [ybliang@devgpu002.ash8 ~/local/debug]$ CUDA_VISIBLE_DEVICES=4 python debug7.py fwd speedup: 0.7166695598192317 bwd speedup: 0.7142133867805904 (pt) [ybliang@devgpu002.ash8 ~/local/debug]$ CUDA_VISIBLE_DEVICES=4 python debug7.py --partial-mask fwd speedup: 0.8428246087169973 bwd speedup: 0.8486261278030254 ``` Approved by: https://github.com/Chillee Pull Request resolved: https://github.com/pytorch/pytorch/pull/130626 Approved by: https://github.com/drisspg, https://github.com/yanboliang	2024-07-14 00:37:26 +00:00
PyTorch MergeBot	da030e7add	Revert "[Inductor] FlexAttention supports partial masking (#130415 )" This reverts commit `207564bab1`. Reverted https://github.com/pytorch/pytorch/pull/130415 on behalf of https://github.com/janeyx99 due to Windows trunk test_proxy_tensor test failures look relevant ([comment](https://github.com/pytorch/pytorch/pull/130415#issuecomment-2225575622))	2024-07-12 13:20:18 +00:00
Yanbo Liang	207564bab1	[Inductor] FlexAttention supports partial masking (#130415 ) This is the new version of #130235 Updated test script: https://gist.github.com/yanboliang/7c34a82df611d4ea8869cb9e041bfbfc Updated perf numbers: ``` (pt) [ybliang@devgpu002.ash8 ~/local/debug]$ CUDA_VISIBLE_DEVICES=4 python debug7.py fwd speedup: 0.7166695598192317 bwd speedup: 0.7142133867805904 (pt) [ybliang@devgpu002.ash8 ~/local/debug]$ CUDA_VISIBLE_DEVICES=4 python debug7.py --partial-mask fwd speedup: 0.8428246087169973 bwd speedup: 0.8486261278030254 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130415 Approved by: https://github.com/Chillee	2024-07-12 07:19:28 +00:00
Richard Zou	edf273edf4	Revert some PRs (#130303 ) Summary: Revert https://github.com/pytorch/pytorch/pull/129346 thru https://github.com/pytorch/pytorch/pull/128893 For S430832 Test Plan: Tests Differential Revision: D59503843 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130303 Approved by: https://github.com/bdhirsh	2024-07-09 14:46:00 +00:00
Peter Bell	fb078c20c1	[inductor] Separate Buffer and Operation into two concepts (#128893 ) Currently a buffer represents both a tensor with physical storage and a computation that produces the tensor as a result. This PR attempts to split these into two different concepts in the scheduler. This should allow us to have multiple outputs from a single operation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128893 Approved by: https://github.com/lezcano	2024-07-02 23:49:57 +00:00
Peter Bell	90d5a6f001	[inductor] Add lowering and codegen for aten.sort (#128458 ) Closes #125633 Benchmarks: \| Shape \| dim \| stable \| compiled \| eager \| speedup \| \|-------------\|-----\|--------\|----------\|---------\|---------\| \| (256, 4096) \| 0 \| False \| 0.73 ms \| 1.26 ms \| 1.7 \| \| (256, 4096) \| 0 \| True \| 0.75 ms \| 1.27 ms \| 1.7 \| \| (4096, 256) \| 1 \| False \| 0.20 ms \| 0.73 ms \| 3.7 \| \| (4096, 256) \| 1 \| True \| 0.21 ms \| 0.73 ms \| 3.5 \| \| (255, 4096) \| 0 \| False \| 1.05 ms \| 1.48 ms \| 1.4 \| \| (255, 4096) \| 0 \| True \| 1.03 ms \| 1.47 ms \| 1.4 \| \| (4096, 255) \| 1 \| False \| 0.52 ms \| 0.98 ms \| 1.9 \| \| (4096, 255) \| 1 \| True \| 0.54 ms \| 1.00 ms \| 1.9 \| Pull Request resolved: https://github.com/pytorch/pytorch/pull/128458 Approved by: https://github.com/lezcano, https://github.com/eellison	2024-06-26 01:36:39 +00:00
Jason Ansel	feb3f3ad77	[inductor] Refactors for Halide backend (#129024 ) Pulling these inductor-related refactors out of the larger Halide backend PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129024 Approved by: https://github.com/shunting314, https://github.com/eellison	2024-06-21 16:53:35 +00:00
Peter Bell	d7fc871175	[inductor] Improve superfluous mask handling in triton codegen (#128518 ) This takes the logic from `filter_masks` and factors it out into `_has_constant_mask`. I also improve support for `persistent_reduction` kernels by making use of the static RBLOCK value and potentially XBLOCK too in the `no_x_dim` case. I then use this helper when generating the `xmask` and `rmask`, so we can generate them as constants meaning triton can optimize them even if they are included. e.g. `compiled_sum(torch.randn(1024, 512, device="cuda"), dim=-1)` before: ```python @triton.jit def triton_(in_ptr0, out_ptr0, xnumel, rnumel): xnumel = 1024 XBLOCK: tl.constexpr = 1 rnumel = 512 RBLOCK: tl.constexpr = 512 xoffset = tl.program_id(0) * XBLOCK xindex = tl.full([1], xoffset, tl.int32) xmask = xindex < xnumel rindex = tl.arange(0, RBLOCK)[:] roffset = 0 rmask = rindex < rnumel r1 = rindex x0 = xindex tmp0 = tl.load(in_ptr0 + (r1 + (512*x0)), rmask & xmask, other=0.0) tmp1 = tl.broadcast_to(tmp0, [RBLOCK]) tmp3 = tl.where(rmask & xmask, tmp1, 0) tmp4 = triton_helpers.promote_to_tensor(tl.sum(tmp3, 0)) tl.store(out_ptr0 + (x0), tmp4, xmask) ``` after: ```python @triton.jit def triton_(in_ptr0, out_ptr0, xnumel, rnumel): xnumel = 1024 XBLOCK: tl.constexpr = 1 rnumel = 512 RBLOCK: tl.constexpr = 512 xoffset = tl.program_id(0) * XBLOCK xindex = tl.full([1], xoffset, tl.int32) xmask = tl.full([RBLOCK], True, tl.int1) rindex = tl.arange(0, RBLOCK)[:] roffset = 0 rmask = tl.full([RBLOCK], True, tl.int1) r1 = rindex x0 = xindex tmp0 = tl.load(in_ptr0 + (r1 + (512*x0)), None) tmp1 = tl.broadcast_to(tmp0, [RBLOCK]) tmp3 = triton_helpers.promote_to_tensor(tl.sum(tmp1, 0)) tl.store(out_ptr0 + (x0), tmp3, None) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128518 Approved by: https://github.com/lezcano	2024-06-14 17:52:55 +00:00
Isuru Fernando	e397ad6883	Improve codegen for ops.masked in triton (#128054 ) Fixes https://github.com/pytorch/pytorch/issues/127930 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128054 Approved by: https://github.com/peterbell10, https://github.com/lezcano	2024-06-14 11:52:56 +00:00
Jason Ansel	c897651392	[inductor] Add BackendFeature gating (#128266 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128266 Approved by: https://github.com/shunting314	2024-06-13 07:31:51 +00:00
Aaron Orenstein	ea614fb2b1	Flip default value for mypy disallow_untyped_defs [2/11] (#127839 ) See #127836 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127839 Approved by: https://github.com/oulgen	2024-06-08 18:23:08 +00:00
Shunting Zhang	0c7f4353e5	[inductor] simplify indexing (#127661 ) This is a short term fix for: https://github.com/pytorch/pytorch/issues/124002 We found that the bad perf of the int8_unpack kernel is caused by sub-optimal indexing. In this PR we introduce 2 indexing optimizations: 1. expand FloorDiv to the entire expression when feasible. E.g. `x1 * 1024 + x2 // 2` will be transformed to `(x1 * 2048 + x2) // 2`. The motivation is that we have more chance to simplify loops for `x1 * 2048 + x2`. 2. merge ModularIndexing pairs: `ModularIndexing(ModularIndexing(x, 1, a), 1, b)` can be simplified to `ModularIndexing(x, 1, b)` if a is a multiple of b. With both indexing optimizations, we improve int8_unpack perf by 1.54x (183us -> 119us). Pull Request resolved: https://github.com/pytorch/pytorch/pull/127661 Approved by: https://github.com/jansel	2024-06-07 17:51:30 +00:00
PyTorch MergeBot	23c156cd2d	Revert "[inductor] simplify indexing (#127661 )" This reverts commit `901226ae83`. Reverted https://github.com/pytorch/pytorch/pull/127661 on behalf of https://github.com/atalman due to Sorry reverting because in conflict with https://github.com/pytorch/pytorch/pull/126905 which needs to be reverted, will be relanding it ([comment](https://github.com/pytorch/pytorch/pull/127661#issuecomment-2155115388))	2024-06-07 15:58:36 +00:00
Shunting Zhang	901226ae83	[inductor] simplify indexing (#127661 ) This is a short term fix for: https://github.com/pytorch/pytorch/issues/124002 We found that the bad perf of the int8_unpack kernel is caused by sub-optimal indexing. In this PR we introduce 2 indexing optimizations: 1. expand FloorDiv to the entire expression when feasible. E.g. `x1 * 1024 + x2 // 2` will be transformed to `(x1 * 2048 + x2) // 2`. The motivation is that we have more chance to simplify loops for `x1 * 2048 + x2`. 2. merge ModularIndexing pairs: `ModularIndexing(ModularIndexing(x, 1, a), 1, b)` can be simplified to `ModularIndexing(x, 1, b)` if a is a multiple of b. With both indexing optimizations, we improve int8_unpack perf by 1.54x (183us -> 119us). Pull Request resolved: https://github.com/pytorch/pytorch/pull/127661 Approved by: https://github.com/jansel	2024-06-06 23:57:45 +00:00
Aaron Gokaslan	2d47385f0f	[BE]: Enable ruff TCH rules and autofixes for better imports (#127688 ) Automated fixes to put imports that are only used in type hints into TYPE_CHECKING imports. This also enables the RUFF TCH rules which will automatically apply autofixes to move imports in and out of TYPE_CHECKING blocks as needed in the future, this will make the initial PyTorch import faster and will reduce cyclic dependencies. Co-authored-by: Xuehai Pan <XuehaiPan@pku.edu.cn> Pull Request resolved: https://github.com/pytorch/pytorch/pull/127688 Approved by: https://github.com/XuehaiPan, https://github.com/ezyang, https://github.com/malfet	2024-06-06 16:55:58 +00:00
chilli	e0fc1ab625	Forward fix for templates + views (#127446 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127446 Approved by: https://github.com/eellison	2024-05-30 02:34:35 +00:00
lezcano	8a21532e53	Fix constant propagation pass (#114471 ) This pass was broken in a number of ways, as we were not generating asserts whenever we took it, even though we need to. While doing so, we found that the analysis we were using for choosing whether to generate asserts or not for dynamic shapes was completely broken. Eliminating indirect indexing in this way allows for a number of optimisations. In particular, we can now fuse against these kernels (indirect indexing disallows fusions). The new strategy is as follows: - We always propagate sympy expressions if we can. - If an expression was an indirect_indexing, we call `check_bounds` - We also call `check_bounds` within `CSEProxy.indirect_indexing` - The checks are issued in the buffer where they would go if they were used in a load - This makes them always be codegen'd before the load and stores - In the case of stores, they will be generated potentially much earlier than the stores themselves, which is fine. We add quite a few asserts to preexisting tests to strengthen them. In particular, we make sure that issuing an assert plays well with all kinds of C++ vectorisation. For now, we rely on the logic within `_maybe_evaluate_static` to prove these bounds. This logic is rather limited though. In the future, we might want to rely on Z3 here to be able to prove bounds in a more general way. Supersedes https://github.com/pytorch/pytorch/pull/113068 Fixes https://github.com/pytorch/pytorch/issues/121251 Pull Request resolved: https://github.com/pytorch/pytorch/pull/114471 Approved by: https://github.com/peterbell10	2024-05-29 09:10:25 +00:00
chilli	ec8b254ef4	Refactored template codegen to explicitly set current body when generating code (#127144 ) The main motivation for this refactor is that today, when generating templates, this is what happens. ``` def_kernel() # registers hook for fully generating function definition store_output() # registers hook for generating the output store. also keeps a number of things generated on `self.body`. ``` Later on, when we codegen the template: `f8c4c268da/torch/_inductor/codegen/simd.py (L1402)` ``` epilogue_node.codegen() # Also writes to body! template.finalize() # Calls the above two hooks for def_kernel and store_output, which then reads from the accumulated `self.body` ``` Today, this is fine, as long as `store_output` is the last function called in the template. However, there's a couple things we probably want to do with kernels that makes this annoying. 1. In FlexAttention backwards, we might want a `modification` to be positioned after the `store_output` (just logically from a code organization POV). This doesn't work today because `modification` also needs to codegen a subgraph, but writing to `body` here conflicts with `store_output`'s implicit saved state on `self.body`. 2. If we want to support prologue fusion, we need to go through a bunch of contortions today to call the template hook finalization a couple times (https://github.com/pytorch/pytorch/pull/121211/files#diff-73b89475038a5b4705da805f1217783883fb90398ee1164995db392fc4a342c1R322) 3. The current code also makes it quite difficult to support fusion into multiple output nodes. To resolve this, I do two things: 1. I remove the default `self.body` on `TritonTemplateKernel`. Instead, I have a dict of `self.subgraph_bodies`, which can be enabled in a context with `TritonTemplateKernel.set_subgraph_body`. This allows multiple different template functions to write to their own isolated bodies. 2. I add functions that allow you to finalize specific hooks on `PartialRender`. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127144 Approved by: https://github.com/jansel	2024-05-28 09:49:13 +00:00
Jason Ansel	d5bf3a98db	[inductor] Refactor indexing() into triton.py (#127047 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127047 Approved by: https://github.com/shunting314 ghstack dependencies: #126944, #126945	2024-05-24 22:46:20 +00:00
Jason Ansel	92433217cb	[inductor] Misc refactors (#126945 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126945 Approved by: https://github.com/shunting314 ghstack dependencies: #126944	2024-05-24 22:46:20 +00:00
Jason Ansel	1b6e3e3bcb	[inductor] Refactor part of IterationRangesEntry into triton.py (#126944 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126944 Approved by: https://github.com/shunting314	2024-05-24 22:46:20 +00:00
Aaron Orenstein	e4623de4cf	typing scheduler.py [2/2]: Apply types (#126656 ) Add `# mypy: disallow-untyped-defs` to scheduler.py and then fix the resulting fallout. We probably should eventually add a new node between BaseSchedulerNode and all the non-FusedSchedulerNode types to indicate the split between nodes that have a valid `self.node` and ones that don't. That would cause a lot of the `assert self.node is not None` churn to go away - but was a bigger change because a lot of code makes assumptions about types that aren't reflected in the types themselves. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126656 Approved by: https://github.com/eellison	2024-05-22 20:33:31 +00:00
Bin Bao	b40fb2de59	[AOTI] Fix a codegen issue when .item() is used for kernel arg (#126575 ) Summary: fixes https://github.com/pytorch/pytorch/issues/126574 . Pass kernel argument type information into generate_args_decl, so it can generate the argument declaration instead of relying on string matching. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126575 Approved by: https://github.com/chenyang78 ghstack dependencies: #126369	2024-05-21 18:20:20 +00:00
Jason Ansel	b98decfc38	[halide-backend] Refactor codegen/triton.py into codegen/simd.py (#126415 ) This PR is primarily just moving stuff around. It creates a new common baseclass for TritonCodegen and the (upcoming) HalideCodegen. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126415 Approved by: https://github.com/shunting314	2024-05-18 02:43:42 +00:00
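
The two indexing rewrites described in "[inductor] simplify indexing (#127661 )" can be sanity-checked with plain integers. The sketch below is not Inductor code: it assumes `ModularIndexing(x, divisor, modulus)` means `(x // divisor) % modulus`, and the helper names are hypothetical, chosen only for this illustration. It brute-forces both identities over small non-negative ranges.

```python
def modular_indexing(x: int, divisor: int, modulus: int) -> int:
    """Integer model of ModularIndexing (assumed semantics: (x // divisor) % modulus)."""
    return (x // divisor) % modulus


def check_floordiv_expansion(x1_range: int, x2_range: int) -> bool:
    """Check x1 * 1024 + x2 // 2 == (x1 * 2048 + x2) // 2 over a small grid."""
    return all(
        x1 * 1024 + x2 // 2 == (x1 * 2048 + x2) // 2
        for x1 in range(x1_range)
        for x2 in range(x2_range)
    )


def check_modular_merge(a: int, b: int, x_range: int) -> bool:
    """Check ModularIndexing(ModularIndexing(x, 1, a), 1, b) == ModularIndexing(x, 1, b)."""
    return all(
        modular_indexing(modular_indexing(x, 1, a), 1, b) == modular_indexing(x, 1, b)
        for x in range(x_range)
    )


if __name__ == "__main__":
    print(check_floordiv_expansion(8, 64))   # FloorDiv expansion
    print(check_modular_merge(12, 4, 200))   # a is a multiple of b
    print(check_modular_merge(10, 4, 200))   # a is not a multiple of b
```

The last call shows why the commit message requires `a` to be a multiple of `b`: with `a = 10` and `b = 4`, the value `x = 10` gives `(10 % 10) % 4 = 0` but `10 % 4 = 2`, so the merge is not valid.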

35 Commits
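
The block interleaving used by the generated combo kernels in "[1/2] PT2 Inductor ComboKernels - Foreach cases (#124969 )" (`pid % 3` selects the sub-kernel, `pid // 3` selects its block, and `xmask` discards out-of-range lanes) can be simulated on the host. This is a rough sketch rather than Inductor's implementation: the function names are hypothetical, and the grid size (number of sub-kernels times the largest per-sub-kernel block count) is an assumption, since the commit message does not show how the launch grid is computed.

```python
def combo_dispatch(pid: int, num_sub_kernels: int) -> tuple:
    """Map a program id to (sub-kernel index, block index within that sub-kernel)."""
    return pid % num_sub_kernels, pid // num_sub_kernels


def covered_indices(xnumels: list, xblock: int) -> list:
    """Simulate the interleaved launch and collect the element indices each sub-kernel touches."""
    num = len(xnumels)
    # Assumed grid: every sub-kernel gets as many blocks as the largest one needs.
    blocks_per_kernel = max(-(-n // xblock) for n in xnumels)  # ceiling division
    covered = [[] for _ in xnumels]
    for pid in range(num * blocks_per_kernel):
        kernel, pid_offset = combo_dispatch(pid, num)
        xoffset = pid_offset * xblock
        for xindex in range(xoffset, xoffset + xblock):
            if xindex < xnumels[kernel]:  # plays the role of xmask
                covered[kernel].append(xindex)
    return covered


if __name__ == "__main__":
    # Sizes taken from the pointwise example in the commit message: 100, 400, 100.
    result = covered_indices([100, 400, 100], xblock=64)
    print([len(r) for r in result])
```

Under these assumptions each sub-kernel visits every one of its elements exactly once, and the smaller sub-kernels simply run some fully masked blocks, which is consistent with the commit's note that masks allow combining kernels over tensors of different sizes.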