Python's set is non-deterministic: its iteration order is not stable, and we recently ran into an internal failure that did not reproduce consistently because of it.
See repro here: P1453035092.
With these changes, it now fails consistently. In follow-ups we could also consider adding a lint rule for uses of either set() or set literals.
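A minimal illustration of the failure mode (not the internal repro): iterating a set of strings can give a different order on different runs, so anything derived from that order becomes non-reproducible.

```python
# Minimal sketch, not the internal repro: set iteration order depends on
# element hashes, and str hashes change across runs under PYTHONHASHSEED
# randomization, so any codegen or ordering decision driven by a set can
# differ from run to run.
buffer_names = {"buf0", "buf1", "arg0_1"}
print(list(buffer_names))        # order is an implementation detail

# A deterministic alternative: keep insertion order with a dict (or an
# OrderedSet-style container) and iterate that instead.
ordered_names = dict.fromkeys(["buf0", "buf1", "arg0_1"])
print(list(ordered_names))       # always ["buf0", "buf1", "arg0_1"]
```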
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130004
Approved by: https://github.com/oulgen
This fixes a few instances where we assumed indexing expressions were
non-negative. This is not valid when we have more complicated
expressions involving masking, e.g. pointwise cat.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131761
Approved by: https://github.com/ezyang
```
# Mode to emulate pytorch eager numerics for lower precision (fp16, bf16)
# Pytorch eager computes bf16/fp16 by upcasting inputs to fp32 and downcasting after
# For multiple, fused pointwise nodes, inductor will elide the intermediary upcasts and downcasts
# Typically this should be closer to fp64 ref numerics. However, it can be useful for debugging
# to emulate the eager numerics.
```
We add extra upcasts and downcasts for pointwise nodes that correspond to casts that existed in the original user program (excluding pointwise nodes that are emitted during decomposition). Since this is mostly for debugging, I added this information in the `meta` so that this mode does not have unintended side effects like changing pattern matching.
In theory there could also be some other casts with fused reduction -> reduction, although I haven't seen this in practice as much; that could be done as a follow-up. Note: this only works with the CUDA backend right now.
This mode was sufficient to eliminate compile differences from https://fb.workplace.com/groups/385893200869952/posts/464263173032954/?comment_id=465199259606012&reply_comment_id=465676792891592.
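As a concrete illustration of the numerics gap this mode closes, here is a minimal sketch in plain eager code (not the Inductor implementation): eager bf16 rounds back to bf16 after every pointwise op, while a fused kernel keeps intermediates in fp32 and rounds only once.

```python
import torch

a = torch.randn(1024, dtype=torch.bfloat16)
b = torch.randn(1024, dtype=torch.bfloat16)
c = torch.randn(1024, dtype=torch.bfloat16)

# Eager-style: upcast to fp32, compute, downcast to bf16 after *each* pointwise op.
eager = ((a.float() + b.float()).to(torch.bfloat16).float() * c.float()).to(torch.bfloat16)

# Fused-style: intermediates stay in fp32; a single downcast at the end.
fused = ((a.float() + b.float()) * c.float()).to(torch.bfloat16)

# The two generally differ by a few ULPs because of the intermediate rounding.
print((eager != fused).sum())
```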
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131595
Approved by: https://github.com/shunting314, https://github.com/bdhirsh, https://github.com/jansel
Closes #129507
This makes two changes to the sort kernel:
1. Use int16 for the indices since we only operate on small dims anyway
2. Instead of passing an explicit mask, we pass the rnumel and imply the mask from that, which saves an additional reduction in the sort kernel's inner loop (see the sketch below).
In my benchmarks, this gives enough of a perf improvement to bump up the
max rblock to 512.
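A rough sketch of point 2 (a hypothetical kernel, not the actual generated sort kernel): the mask is reconstructed from `rnumel` with a single comparison inside the kernel, rather than being passed in and re-reduced in the inner loop.

```python
import triton
import triton.language as tl

@triton.jit
def masked_row_kernel(in_ptr, out_ptr, rnumel, RBLOCK: tl.constexpr):
    # Small-dim indices; with rblock <= 512 these comfortably fit in int16.
    rindex = tl.arange(0, RBLOCK)
    # Imply the mask from rnumel instead of passing an explicit mask argument.
    rmask = rindex < rnumel
    x = tl.load(in_ptr + rindex, mask=rmask, other=float("-inf"))
    tl.store(out_ptr + rindex, x, mask=rmask)
```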
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131719
Approved by: https://github.com/eellison
This fixes a couple errors that come up when multi-kernel is used with
split-scan.
1. The split-scan was being marked as a persistent kernel, which allowed
   a multi-kernel to be created, but this isn't supported. The fix is to
   never mark split-scan as persistent.
2. Benchmark codegen was not handling WorkspaceArg, and would raise a
KeyError during codegen.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131044
Approved by: https://github.com/shunting314
Fixes #126338
## Issue Summary
When TorchInductor compiles the combination `functional_collective -> view.dtype -> wait`, a memory leak occurs. This happens because `view.dtype` is compiled into an out-of-place Triton kernel that copies the input data to a new tensor, even if the collective hasn't completed yet via the `wait` operation. The tensor used by the collective is only freed when the `wait` operation triggers the garbage collector, see [~WorkRegistry](https://github.com/pytorch/pytorch/blob/main/torch/csrc/distributed/c10d/Functional.cpp#L41). However, since `wait` now waits on a new tensor, the previous one is never freed. `view.dtype` should be a metadata-only operation instead of creating a new tensor; the current lowering violates its semantics and causes memory leaks.
See more discussion in #126338.
This kind of lowering also generates unnecessary Triton kernels for `view.dtype` when it can't be fused with other operations.
## Fix
The function `aten.view.dtype` is a CPU operation that changes the metadata of its input. After discussions with @eellison and @bdhirsh, we decided to change the lowering of `aten.view.dtype` to ensure it falls back properly to `aten.view.dtype` instead of generating a Triton kernel in some cases. This approach also preserves the semantics of the view operation.
When the model calls `aten.view.dtype` with a data type whose bit width matches the input's original data type, we lower it to the newly added `DtypeView` in IR, acting like a `ReinterpretView`. When the operation can be fused, its `make_loader` is called to maintain the correct type conversion for each load instruction. When the operation can't be fused, it falls back to `aten.view.dtype` to avoid Triton kernel generation.
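For reference, a small eager-mode sketch of the semantics the lowering has to preserve: when the bit widths match, `view(dtype)` only reinterprets metadata and shares the original storage, so no new buffer should be allocated.

```python
import torch

x = torch.randn(2, 2, dtype=torch.bfloat16)
y = x.view(torch.float16)   # same bit width: only the dtype metadata changes

assert y.dtype == torch.float16
# No copy was made: both tensors alias the same storage.
assert y.untyped_storage().data_ptr() == x.untyped_storage().data_ptr()
```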
## Example
```python
import torch

@torch.compile
def fn(x, y):
    x = x.view(torch.float16)
    y = y.view(torch.float16) + 1
    return x @ y

device = "cuda"  # the original snippet used self.device from the test harness
x = torch.randn((2, 2), device=device, dtype=torch.bfloat16)
y = torch.randn((2, 2), device=device, dtype=torch.bfloat16)
fn(x, y)
```
The output code generated before this fix looks like the following.
```python
triton_poi_fused_add_view_0...
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 4
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (x0), xmask).to(tl.float32)
    tmp1 = tmp0.to(tl.bfloat16).to(tl.float32, bitcast=True).to(tl.float32)
    tl.store(out_ptr0 + (x0), tmp1, xmask)

triton_poi_fused_add_view_1...
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 4
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (x0), xmask).to(tl.float32)
    tmp1 = tmp0.to(tl.bfloat16).to(tl.float32, bitcast=True).to(tl.float32)
    tmp2 = 1.0
    tmp3 = tmp1 + tmp2
    tl.store(out_ptr0 + (x0), tmp3, xmask)

def call(args):
    ...
    triton_poi_fused_view_0.run(arg0_1, buf0, 4, grid=grid(4), stream=stream0)
    del arg0_1
    buf1 = empty_strided_cuda((2, 2), (2, 1), torch.float16)
    # Source Nodes: [view_1, y], Original ATen: [aten.add, aten.view]
    triton_poi_fused_add_view_1.run(arg1_1, buf1, 4, grid=grid(4), stream=stream0)
    del arg1_1
    buf2 = empty_strided_cuda((2, 2), (2, 1), torch.float16)
    # Source Nodes: [matmul, view_1, x, y], Original ATen: [aten.add, aten.mm, aten.view]
    extern_kernels.mm(buf0, buf1, out=buf2)
```
As you can see, the two `view` operations are compiled to two kernels, `triton_poi_fused_view_0` and `triton_poi_fused_add_view_1`. Both of them have the line `tmp1 = tmp0.to(tl.bfloat16).to(tl.float32, bitcast=True).to(tl.float32)` which does the type conversion.
The main issue is that the first `view` operation doesn't do anything to the actual data, yet it generates a Triton kernel with a new output tensor. Another small issue is that this Triton kernel can't be compiled, because `bitcast=True` only supports conversions between types of the same bit width.
The following is the output code generated after this PR.
```python
triton_poi_fused_add_0...
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 4
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (x0), xmask).to(tl.float32)
    tmp1 = tmp0.to(tl.bfloat16).to(tl.float32)
    tmp2 = 1.0
    tmp3 = tmp1 + tmp2
    tl.store(out_ptr0 + (x0), tmp3, xmask)

def call(args):
    ...
    triton_poi_fused_add_0.run(arg1_1, buf0, 4, grid=grid(4), stream=stream0)
    del arg1_1
    buf1 = empty_strided_cuda((2, 2), (2, 1), torch.float16)
    # Source Nodes: [matmul, y], Original ATen: [aten.add, aten.mm]
    extern_kernels.mm(aten.view.dtype(arg0_1, torch.float16), buf0, out=buf1)
```
The first `view` operation has been replaced with `aten.view.dtype`, which is passed directly as an argument. The second one is still there because it is fused with the following add operation. The invalid bitcast operation is removed too.
The following two code snippets are for the upcasts and downcasts. For dtypes in `torch.float16, torch.bfloat16`, each load is upcast to float32 and then downcast to its original dtype to ensure values are used with the right precision.
7bda23ef84/torch/_inductor/codegen/triton.py (L1725-L1726)
7bda23ef84/torch/_inductor/codegen/triton.py (L629-L642)
Huge thanks to @eellison, @bdhirsh, @shunting314, and @desertfire.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128883
Approved by: https://github.com/eellison
**Summary**
Inductor currently uses modulo and division to compute indices into certain multi-dimensional tensors, such as those arising from row padding. This PR matches on that indexing pattern, replacing it with an N-D block pointer. This should be more efficient than computing indices with division and modulo, and it can easily map to DMAs on non-GPU hardware targets.
Because the 1D block size needs to map to an integer block shape in ND, we need to know that the ND block size evenly divides the size of the iteration range. This PR only generates ND block pointers when it can guarantee that the iteration order and number of elements loaded are unchanged. This means that the number of elements in a slice of the iteration range must either be:
- Powers of 2. Since Triton block sizes are powers of 2, any integer power of 2 either divides the block size or is greater than the block size. In the latter case, `CeilDiv(x, y)` rounds up to 1.
- Multiples of the maximum block size. Since block sizes are powers of 2, the maximum block size is a multiple of every possible block size.
Note that a *slice* of the iteration range does not include the leading dimension. Thus we can support arbitrary leading dimensions like `(5,8)`.
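A quick sanity check of the power-of-2 argument above (a standalone sketch, not Inductor code): for power-of-2 sizes, the slice size either divides the block size exactly or the ceiling division collapses to 1, so no partial ND block can straddle a slice boundary.

```python
import math

block_sizes = [2**i for i in range(0, 11)]   # candidate 1D Triton block sizes
slice_sizes = [2**i for i in range(0, 11)]   # power-of-2 slice sizes

for xblock in block_sizes:
    for dim in slice_sizes:
        # Either the slice divides the 1D block evenly, or it is larger than
        # the block and the ND block shape in that dimension rounds up to 1.
        assert xblock % dim == 0 or math.ceil(xblock / dim) == 1
```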
Feature proposal and discussion: https://github.com/pytorch/pytorch/issues/125077
Example kernel:
```
@triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 4096
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    tmp0 = tl.reshape(tl.load(tl.make_block_ptr(in_ptr0, shape=[32, 16, 8], strides=[1024, 32, 1], block_shape=[32 * (32 <= ((127 + XBLOCK) // 128)) + ((127 + XBLOCK) // 128) * (((127 + XBLOCK) // 128) < 32), 16 * (16 <= ((7 + XBLOCK) // 8)) + ((7 + XBLOCK) // 8) * (((7 + XBLOCK) // 8) < 16), 8 * (8 <= XBLOCK) + XBLOCK * (XBLOCK < 8)], order=[0, 1, 2], offsets=[(xoffset // 128), (xoffset // 8) % 16, xoffset % 8]), boundary_check=[0, 1, 2]), [XBLOCK])
    tmp1 = tmp0 + tmp0
    tl.store(tl.make_block_ptr(out_ptr0, shape=[4096], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), tl.broadcast_to(tmp1, [XBLOCK]).to(tl.float32))
''', device_str='cuda')
```
**Test Plan**
This PR adds a new CI test script to cover this feature. The tests can be grouped into a few main categories:
- Can we generate strided block pointers for the appropriate shapes?
- Powers of 2
- Non-power of 2, but multiple of the maximum block size
- Arbitrary leading dimensions, with power of 2 inner dimensions
- Weird strides and offsets
- Reductions
- Symbolic shapes that are multiples of the maximum block size (wasn't able to trace this through dynamo)
- Broadcasts (some variables are missing from the indexing expression)
- Do we still compile other cases correctly, even if we don't expect to be able to generate block pointers?
- Unsupported static shapes
- Unsupported symbolic shapes
- Mixing and matching these cases:
- Pointwise and reduction in the same kernel
- Sanity check the test harness
- Do we raise an exception if the expected number of block pointers and the actual number are different?
**Follow-ups**
There are a few important cases which this PR can't handle. I'm hoping these can be deferred to follow-up PRs:
- Handle non-divisible shapes
- Change the tiling algorithm to generate a 2D (X,Y) blocking, if doing so enables block pointers to be emitted.
- Pad unsupported loads up to the nearest divisible size, then mask/slice out the extra elements? This is probably the best solution, but I'm not yet sure how to go about it in triton.
- Take advantage of this analysis when `triton.use_block_ptr=False`. I'm guessing we can still avoid `%` and `/` without requiring block pointers. Maybe we could compute block indices with arange and broadcast instead?
Differential Revision: D56739375
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127342
Approved by: https://github.com/jansel, https://github.com/shunting314
At a high level, the idea behind this PR is:
* Make it clearer what the promotion and int/float rules for various Sympy operations are. Operators that previously were polymorphic over int/float are now split into separate operators for clarity. We never do mixed int/float addition/multiplication etc. in sympy; instead, we always promote to the appropriate operator. (However, equality is currently not done correctly.)
* Enforce strict typing on ValueRanges: if you have a ValueRange for a float, the lower and upper MUST be floats, and so forth for integers.
The story begins in **torch/utils/_sympy/functions.py**. Here, I make some changes to how we represent certain operations in sympy expressions:
* FloorDiv now only supports integer inputs; to do float floor division, do a truediv and then a trunc. Additionally, we remove the divide out addition by gcd optimization, because sympy gcd is over fields and is willing to generate rationals (but rationals are bad for ValueRange strict typing).
* ModularIndexing, LShift, RShift now assert they are given integer inputs.
* Mod only supports integer inputs; eventually we will support FloatMod (left for later work, when we build out Sympy support for floating operations). Unfortunately, I couldn't assert integer inputs here, because of a bad interaction with sympy's inequality solver that is used by the offline solver
* TrueDiv is split into FloatTrueDiv and IntTrueDiv. This allows us to eventually generate accurate code for Python-semantics IntTrueDiv, which is written in a special way to preserve precision when the inputs are >= 2**53, beyond what you would get by first coercing the integers to floats and then doing true division.
* Trunc is split to TruncToFloat and TruncToInt.
* Round is updated to return a float, not an int, making it consistent with the round op handler in Inductor. To get Python-style conversion to int, we call TruncToInt on the result.
* RoundDecimal updated to consistently only ever return a float
* Add ToFloat for explicit coercion to float (required so we can enforce strict ValueRanges typing)
In **torch/__init__.py**, we modify SymInt and SymFloat to appropriately call into new bindings that route to these refined sympy operations. Also, we modify `torch.sym_min` and `torch.sym_max` to have promotion semantics (if one argument is a float, the return result is always a float), making them inconsistent with builtins.min/max, but possible to do type analysis without runtime information.
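To see why the promotion rule matters for static type analysis, consider a small sketch with plain Python builtins (illustrative only): `min`/`max` return whichever argument wins, so the result type depends on runtime values, whereas the promotion semantics described above always yield a float when either argument is a float.

```python
# builtins.min returns the smaller argument unchanged, so the result type
# depends on the runtime values:
print(type(min(2, 3.0)))   # <class 'int'>   -- 2 wins
print(type(min(4, 3.0)))   # <class 'float'> -- 3.0 wins

# Under int/float promotion semantics, mixing an int and a float always
# produces a float, so the type is known without running the program:
def promoted_min(a, b):
    result = min(a, b)
    return float(result) if isinstance(a, float) or isinstance(b, float) else result

print(type(promoted_min(2, 3.0)))   # <class 'float'>
```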
We also need to introduce some new op handlers in **torch/_inductor/ops_handler.py**:
* `to_int` for truncation to int64, directly corresponding to TruncToInt; this can be implemented by trunc and dtype, but with a dedicated handler it is more convenient for roundtripping in Sympy
* `int_truediv` for Python-style integer true division, which has higher precision than casting to floats and then running `truediv`
These changes have consequences. First, we need to make some administrative changes:
* Actually wire up these Sympy functions from SymInt/SymFloat in **torch/fx/experimental/sym_node.py**, including the new promotion rules (promote2)
* Add support for new Sympy functions in **torch/utils/_sympy/interp.py**, **torch/utils/_sympy/reference.py**
* In particular, in torch.utils._sympy.reference, we have a strong preference to NOT do nontrivial compute, instead, everything in ops handler should map to a singular sympy function
* TODO: I chose to roundtrip mod back to our Mod function, but I think I'm going to have to deal with the C/Python inconsistency to fix tests here
* Add printer support for the Sympy functions in **torch/_inductor/codegen/common.py**, **torch/_inductor/codegen/cpp_utils.py**, **torch/_inductor/codegen/triton.py**. `int_truediv` and mixed precision equality is currently not implemented soundly, so we will lose precision in codegen for large values. TODO: The additions here are not exhaustive yet
* Update ValueRanges logic to use new sympy functions in **torch/utils/_sympy/value_ranges.py**. In general, we prefer to use the new Sympy function rather than try to roll things by hand, which is what was done previously for many VR analysis functions.
In **torch/fx/experimental/symbolic_shapes.py** we need to make some symbolic reasoning adjustments:
* Avoid generation of rational subexpressions by removing simplification of `x // y` into `floor(x / y)`. That simplification triggers a further simplification rule `(x + y) / c --> x / c + y / c`, which is bad because `x / c` is now a rational number
* `_assert_bound_is_rational` is no more; we no longer generate rational bounds
* Don't intersect non-int value ranges with the `int_range`
* Support more sympy Functions for guard SYMPY_INTERP
* Assert the type of value range is consistent with the variable type
The new asserts uncovered necessary bug fixes:
* **torch/_inductor/codegen/cpp.py**, **torch/_inductor/select_algorithm.py**, **torch/_inductor/sizevars.py** - Ensure Wild/Symbol manually allocated in Inductor is marked `is_integer` so it's accepted to build expressions
* **torch/_inductor/utils.py** - make sure you actually pass in sympy.Expr to these functions
* **torch/_inductor/ir.py** - make_contiguous_strides_for takes int/SymInt, not sympy.Expr!
* **torch/export/dynamic_shapes.py** - don't use infinity to represent int ranges, instead use sys.maxsize - 1
Because of the removal of some symbolic reasoning that produced rationals, some of our symbolic reasoning has gotten worse and we are unable to simplify some guards. Check the TODO at **test/test_proxy_tensor.py**
**Reland notes.** This requires this internal fbcode diff https://www.internalfb.com/phabricator/paste/view/P1403322587 but I cannot prepare the diff codev due to https://fb.workplace.com/groups/osssupport/posts/26343544518600814/
It also requires this Executorch PR https://github.com/pytorch/executorch/pull/3911 but the ET PR can be landed prior to this landing.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126905
Approved by: https://github.com/xadupre, https://github.com/lezcano
Summary:
1. Integrate NaN and INF checker with existing config, controllable by env var.
2. Move the injection point of the NaN & INF checker earlier; this prevents buffers from being freed before the check.
3. Inject the debugging code at the kernel level, which prevents us from trying to read buffers that are fused in place and into a single kernel.
Test Plan:
Debugging utility.
Tested and checked with existing tests using the env var:
```
TORCHINDUCTOR_NAN_ASSERTS=1 TORCHINDUCTOR_MAX_AUTOTUNE=0 python test/inductor/test_aot_inductor.py -k AOTInductorTestNonABICompatibleCuda.test_seq_non_abi_compatible_cuda
```
Reviewed By: ColinPeppler
Differential Revision: D57989176
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127574
Approved by: https://github.com/chenyang78, https://github.com/desertfire
Automated fixes to put imports that are only used in type hints into TYPE_CHECKING imports. This also enables the RUFF TCH rules, which will automatically apply autofixes to move imports in and out of TYPE_CHECKING blocks as needed in the future. This will make the initial PyTorch import faster and will reduce cyclic dependencies.
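For context, a minimal sketch of the pattern these autofixes apply (an illustrative module, not an actual PyTorch file): imports that are only needed for annotations move under `TYPE_CHECKING`, so they are skipped at runtime.

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only evaluated by type checkers; never imported at runtime, so it cannot
    # slow down module import or create an import cycle.
    from collections.abc import Sequence


def total(xs: Sequence[int]) -> int:
    return sum(xs)
```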
Co-authored-by: Xuehai Pan <XuehaiPan@pku.edu.cn>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127688
Approved by: https://github.com/XuehaiPan, https://github.com/ezyang, https://github.com/malfet
Fixes #123210
2f3d3ddd70/torch/_inductor/runtime/triton_heuristics.py (L1733-L1753)
If a kernel's y_grid is larger than 65535, it will be split into multiple z grids. The grid_fn above does this split before the kernel launch; however, the computations for yoffset and y_grid are incorrect. For example, if we have an xy numel of `(1*XBLOCK, 65537*YBLOCK)`, this function will return an [xyz]_grid of (1, 32768, 2). XBLOCK and YBLOCK here are used for the following `get_grid_dim`. Let's use their default values (4, 1024).
2f3d3ddd70/torch/_inductor/runtime/triton_heuristics.py (L1734)
[xyz]_grid = (1, 32768, 2) means the workload is divided into two z grids. Because the Triton kernel generation still follows the x/y dimensions, an example of a generated kernel is shown below.
```python
@triton.jit
def triton_(in_ptr0, out_ptr0, ynumel, xnumel, YBLOCK : tl.constexpr, XBLOCK : tl.constexpr):
    ynumel = 65537*1024
    xnumel = 1*4
    yoffset = tl.program_id(1) * (tl.program_id(2) + 1) * YBLOCK
    yindex = yoffset + tl.arange(0, YBLOCK)[None, :]
    ymask = yindex < ynumel
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
    xmask = xindex < xnumel
    x2 = xindex
    y0 = yindex % 128
    y1 = (yindex // 128)
    y3 = yindex
    tmp0 = tl.load(in_ptr0 + (y0 + (128*x2) + (512*y1)), xmask, eviction_policy='evict_last')
    tl.store(out_ptr0 + (x2 + (4*y3)), tmp0, xmask)
```
For a Triton block with xyz index (0, 0, 1), its yoffset and xoffset are both 0 based on the computations `yoffset = tl.program_id(1) * (tl.program_id(2) + 1) * YBLOCK` and `xoffset = tl.program_id(0) * XBLOCK`. So this Triton block will access the very first elements of the input. However, the correct yoffset should be `(y_index + z_index * y_grid) * YBLOCK`, which is the starting position of the second z grid.
At the same time, because we used `y_grid = y_grid // div` to compute the maximum number of elements in the y dimension, y_grid is 32768. The total number of y grids is 32768*2 = 65536, which is less than the required 65537. So we should use `y_grid = ceildiv(y_grid, div)` to compute the y grid and cover the remaining grids.
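The arithmetic can be sanity-checked with a small standalone sketch (the numbers come from the example above; this is not the Inductor heuristics code):

```python
import math

y_blocks_needed = 65537   # y grids required by the kernel in the example
div = 2                   # number of z grids the work is split into

# Floor division drops the final block: only 65536 of 65537 y grids launch.
floor_y_grid = y_blocks_needed // div
assert floor_y_grid * div < y_blocks_needed

# Ceiling division covers everything.
ceil_y_grid = math.ceil(y_blocks_needed / div)
assert ceil_y_grid * div >= y_blocks_needed

# Offset bug: pid_y * (pid_z + 1) sends block (pid_y=0, pid_z=1) back to
# offset 0, while pid_y + pid_z * y_grid starts it after the first z grid.
def wrong_offset(pid_y, pid_z):
    return pid_y * (pid_z + 1)

def right_offset(pid_y, pid_z, y_grid):
    return pid_y + pid_z * y_grid

assert wrong_offset(0, 1) == 0
assert right_offset(0, 1, ceil_y_grid) == ceil_y_grid
```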
#123210 is not about AOTInductor; the root cause is the Triton kernel generated by TorchInductor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127448
Approved by: https://github.com/eellison
For a masked `tl.load` operation, the Triton language specifies that values masked out (i.e. where the mask evaluates to false) are undefined in the output of the load. Triton provides an optional `other` parameter which, when included, provides an explicit value to use for masked out values from the load. If the output from a masked load without the `other` parameter is used in a conditional, unexpected behavior can occur.
Despite the language specification, all Triton backends currently in use by PyTorch Inductor (NVIDIA, AMD, and Intel) 0-initialize masked loads if `other` is not present (we recently changed the Intel backend behavior to match NVIDIA and AMD because that's what our users expect, even if we are not following the Triton spec to the tee). This PR attempts to "future-proof" Inductor for new backends (or perhaps changes in the current backends? We did not see any performance change from 0-initializing in the Intel XPU backend, but one could imagine compiler optimizations removing paths that depend on undefined values) by adding an explicit `other` in instances where later conditionals depend on the `tl.load` output. I also removed an exception to the `other` behavior for boolean loads, which was put in place for a Triton bug that should be fixed. I added `other` to the getting-started documentation as a clue that masked load behavior requires explicit initialization, even though I don't expect `undef` values to cause the example code to fail if the underlying output is not 0-initialized. Finally, I added `other` to the `make_load` function in `select_algorithm.py`, though I wasn't able to determine whether that function is actually being called.
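A minimal sketch of the pattern this PR makes explicit (a hypothetical kernel, not code from the PR): when a value produced by a masked load feeds a conditional, `other` pins the masked-out lanes to a defined value.

```python
import triton
import triton.language as tl

@triton.jit
def relu_like_kernel(in_ptr, out_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    # Per the Triton spec, masked-out lanes are undefined without `other`;
    # other=0.0 makes the comparison below well defined on every lane.
    x = tl.load(in_ptr + offs, mask=mask, other=0.0)
    y = tl.where(x > 0, x, 0.0)
    tl.store(out_ptr + offs, y, mask=mask)
```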
Fixes #126535
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127311
Approved by: https://github.com/jansel
Before, when calling ops.where, masks were not properly propagated. We
now restrict the optimisation to `ops.masked`, which I think is what
the original code intended to do.
I'm not 100% sure that even in the masked case this code is not
introducing some bugs, but this is a strict improvement over the
previous state.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125574
Approved by: https://github.com/peterbell10
ghstack dependencies: #114471, #126783
This pass was broken in a number of ways, as we were not generating
asserts whenever we took it, even though we need to. While fixing this,
we found that the analysis we were using to choose whether or not to
generate asserts for dynamic shapes was completely broken.
Eliminating indirect indexing in this way allows for a number of optimisations.
In particular, we can now fuse against these kernels (indirect indexing disallows fusions).
The new strategy is as follows:
- We always propagate sympy expressions if we can.
- If an expression was an indirect_indexing, we call `check_bounds`
- We also call `check_bounds` within `CSEProxy.indirect_indexing`
- The checks are issued in the buffer where they would go if they were used in a load
- This makes them always be codegen'd before the load and stores
- In the case of stores, they will be generated potentially much earlier than the stores themselves, which is fine.
We add quite a few asserts to preexisting tests to strengthen them. In particular, we make sure
that issuing an assert plays well with all kinds of C++ vectorisation.
For now, we rely on the logic within `_maybe_evaluate_static` to prove
these bounds. This logic is rather limited though. In the future, we might want
to rely on Z3 here to be able to prove bounds in a more general way.
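Roughly, the emitted check looks like a device-side assert placed before the indirect load, with masked-off lanes exempted. The sketch below is illustrative (a hypothetical kernel), not the exact Inductor codegen.

```python
import triton
import triton.language as tl

@triton.jit
def gather_kernel(idx_ptr, src_ptr, out_ptr, n, src_n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    idx = tl.load(idx_ptr + offs, mask=mask, other=0)
    # Bounds check issued before the indirect load; lanes outside the mask
    # are exempted so they cannot trigger a spurious failure.
    tl.device_assert(((idx >= 0) & (idx < src_n)) | ~mask, "index out of bounds")
    val = tl.load(src_ptr + idx, mask=mask, other=0.0)
    tl.store(out_ptr + offs, val, mask=mask)
```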
Supersedes https://github.com/pytorch/pytorch/pull/113068
Fixes https://github.com/pytorch/pytorch/issues/121251
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114471
Approved by: https://github.com/peterbell10
Add `# mypy: disallow-untyped-defs` to scheduler.py and then fix the resulting fallout.
We probably should eventually add a new node between BaseSchedulerNode and all the non-FusedSchedulerNode types to indicate the split between nodes that have a valid `self.node` and ones that don't. That would cause a lot of the `assert self.node is not None` churn to go away - but was a bigger change because a lot of code makes assumptions about types that aren't reflected in the types themselves.
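For reference, a tiny sketch of what the pragma enforces (a generic example, not scheduler.py itself): with the module-level comment in place, mypy rejects any function whose signature is not fully annotated.

```python
# mypy: disallow-untyped-defs
from typing import Optional


def describe(node: Optional[str]) -> str:
    # With strict Optional handling, narrowing (or asserting) is required
    # before use -- the source of the "assert ... is not None" churn.
    assert node is not None
    return node.upper()


# def untyped(node):          # mypy error: function is missing type annotations
#     return node.upper()
```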
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126656
Approved by: https://github.com/eellison