## Summary
This is a re-land PR for https://github.com/pytorch/pytorch/pull/100706 that addresses the compilation latency performance regression.
## Root Cause
For the C++/OpenMP backend, `codecache.pick_vec_isa()`, which checks the vectorization ISA, is a time-consuming one-shot operation. It slows down importing the `codegen.cpp` package because the package's `LoopLevel` class is decorated with `@dataclasses.dataclass`, and the decorator invokes `codecache.pick_vec_isa()` to initialize the default value of `simd_nelements` on `LoopLevel`.
c14cf312c9/torch/_inductor/codegen/cpp.py (L2883C53-L2883C53)
The Triton backend does not need this check at all, but we prefer uniform code. Therefore, the new design registers `CppScheduling` for CPU and `TritonScheduling` for Triton simultaneously, regardless of which backend is currently in use. This brings additional overhead to the Triton backend.
```python
def init_backend_registration(self):
    if get_scheduling_for_device("cpu") is None:
        from .codegen.cpp import CppScheduling
        register_backend_for_device("cpu", CppScheduling, WrapperCodeGen)
    if get_scheduling_for_device("cuda") is None:
        from .codegen.triton import TritonScheduling
        register_backend_for_device("cuda", TritonScheduling, WrapperCodeGen)
```
## Solution
To resolve the compilation latency regression for the Triton backend, we changed `LoopLevel` slightly ([new code changes](https://github.com/pytorch/pytorch/pull/106874/files#diff-5ab7b0235e2076a5fc6629ba0b109208940f5b94f5c13babc3e0f87cf4fcec82R2893-R2904)) by moving the `simd_nelements` initialization into `__post_init__`, so `pick_vec_isa()` is no longer invoked at import time and compilation performance is restored.
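A minimal sketch of the pattern (simplified from the actual `LoopLevel` class; everything other than the `simd_nelements` field name is illustrative):
```python
import dataclasses
from typing import Optional

def pick_vec_isa() -> Optional[int]:
    """Stand-in for the expensive codecache.pick_vec_isa() probe; here it
    simply returns the SIMD width (or None if no ISA is available)."""
    return 16

@dataclasses.dataclass
class LoopLevel:
    # Before the fix, the field default called pick_vec_isa() directly, so the
    # probe ran as soon as the module defining LoopLevel was imported.
    simd_nelements: Optional[int] = None

    def __post_init__(self):
        # After the fix, the probe only runs when a LoopLevel is constructed.
        if self.simd_nelements is None:
            self.simd_nelements = pick_vec_isa() or 0
```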
## Compilation Latency Performance Result
We ran a single model benchmark and reproduced the compilation regression:
- Run `python benchmarks/dynamo/torchbench.py -dcuda --training --performance --inductor --only hf_Bart`
- W/ PR #100706, the compilation latency is about **57~58** seconds
```
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cuda,hf_Bart,4,1.556712,109.676554,57.055242,0.936330,5.760698,6.152422,642,1,8,7
cuda,hf_Bart,4,1.646658,109.621747,57.909817,0.936330,5.760698,6.152422,642,1,8,7
```
- W/O PR #100706, the compilation latency is about **46~47** seconds
```
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cuda,hf_Bart,4,1.599065,108.702480,47.490346,0.936330,5.760698,6.152422,642,1,8,7
cuda,hf_Bart,4,1.588419,108.431411,46.983041,0.936330,5.760698,6.152422,642,1,8,7
```
This PR fixes the compilation performance regression.
- W/ this PR #106874, the compilation latency is about **47~48** seconds
```
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cuda,hf_Bart,4,1.586261,108.149467,47.481058,0.936330,5.760698,6.152422,642,1,8,7
cuda,hf_Bart,4,1.758915,108.613899,47.925633,0.936330,5.760698,6.152422,642,1,8,7
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106874
Approved by: https://github.com/jansel
I found that for a tiled kernel over a tensor with shape [a, b], we map 'a' to XBLOCK and 'b' to YBLOCK. However, 'a' should actually be the outer loop while 'b' corresponds to the inner loop; this order is picked by our loop-ordering algorithm. Mapping 'a' to XBLOCK effectively assigns 'a' to the inner loop instead.
For a simple 'A + B.t()' kernel, making the loop order consistent brings a 1.027x speedup (1.938ms -> 1.887ms). Here are the dumps of the kernels:
- before fix: https://gist.github.com/shunting314/4dacf73cf495cdd7e84dede7c3e0872d
- after fix (this one is done manually): https://gist.github.com/shunting314/441e8839d24e1878c313e539b1ebd551
I tried this on DistillGPT2 and found perf is neutral, but that's because DistillGPT2 has only a single tiled pointwise kernel in its backward graph. Will check the dashboard.
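For reference, a minimal repro of the 'A + B.t()' case above could look like the following (the shapes and setup here are assumptions, not the configuration used for the numbers quoted):
```python
import torch

def add_transpose(a, b):
    # pointwise add of a contiguous tensor and a transposed one forces a
    # tiled (XBLOCK/YBLOCK) kernel in Inductor
    return a + b.t()

compiled = torch.compile(add_transpose)
a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")
out = compiled(a, b)
```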
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106827
Approved by: https://github.com/jansel
`JITFunction._key_of` uses the value of the argument to distinguish between
i32 and i64, but this fails if the value is used in indexing calculations where
the value exceeds `INT_MAX`.
Instead, we should use `index_dtype` which means all indexing calculations are
performed in the same dtype.
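A toy illustration of the failure mode (the numbers are made up): the argument's own value fits in i32, but an indexing expression derived from it does not.
```python
INT32_MAX = 2**31 - 1

stride, num_rows = 1_000_000, 3_000      # each fits comfortably in i32
offset = stride * num_rows               # 3_000_000_000 exceeds INT32_MAX
assert stride <= INT32_MAX and num_rows <= INT32_MAX
assert offset > INT32_MAX                # indexing math must be done in i64
```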
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106870
Approved by: https://github.com/lezcano
ghstack dependencies: #106626
This PR extends Inductor to support third-party backends that only need to customize code generation, just like the existing C++/OpenMP and Triton backends.
Currently, the code generated by Inductor has two major parts: the kernels, and the Python wrapper that glues the kernels together. A third-party backend therefore needs to customize both parts to generate its specific code.
- Python wrapper code generation
Inductor provides a `WrapperCodeGen` class to generate the Python wrapper code that glues the kernels together. It is therefore straightforward for a third-party backend to generate backend-specific Python wrapper code: it just needs to inherit from `WrapperCodeGen` and override the relevant member functions.
- Kernel code generation
Kernel code generation is driven by a `Scheduling` class, so a third-party backend needs to provide a custom `Scheduling` for its kernel code generation. Currently, `CppScheduling` and `TritonScheduling` serve the C++/OpenMP and Triton backends respectively, but there was no common `Scheduling` base class. Based on how scheduling is invoked, this PR abstracts a common `Scheduling` class containing the following member functions.
- [group_fn](71c4becda7/torch/_inductor/scheduler.py (LL649C64-L649C64))
- [flush](71c4becda7/torch/_inductor/scheduler.py (L1150))
- [can_fuse_vertical](71c4becda7/torch/_inductor/scheduler.py (L1006))
- [can_fuse_horizontal](71c4becda7/torch/_inductor/scheduler.py (LL1008C45-L1008C64))
- [codegen_template](71c4becda7/torch/_inductor/scheduler.py (L1234)) _This function is only available for Triton. If the third-party backend subclasses `TritonScheduling`, it can override or reuse it._
- [codegen_nodes](71c4becda7/torch/_inductor/scheduler.py (L1234))
- [codegen_sync](71c4becda7/torch/_inductor/scheduler.py (LL1251C1-L1251C1)) _This function is currently only used by Triton for debugging purposes, but it might also be useful for other compute devices, so we'd prefer to keep it._
The third-party backend needs to inherit from the `Scheduling` class and implement these functions.
Other code-generation classes such as `CppKernel` and `TritonKernel` are used by, or are part of, the logic of either `Scheduling` or `WrapperCodeGen`. Hence, this PR does not define an interface for them and leaves that flexibility to the third-party backend, which can implement such classes from scratch or reuse the existing ones by inheriting and overriding them.
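As a rough sketch of how a third-party backend might plug into this registration mechanism (the device name "xpu", the class names, and the import paths below are assumptions for illustration, not part of this PR):
```python
# Hypothetical third-party backend hooking into the registration pattern shown
# earlier; only the structure matters, not the specific names.
from torch._inductor.codegen.common import (          # assumed import path
    get_scheduling_for_device,
    register_backend_for_device,
)
from torch._inductor.codegen.wrapper import WrapperCodeGen  # assumed import path

class MyWrapperCodeGen(WrapperCodeGen):
    """Override only the hooks that need backend-specific glue code."""

class MyScheduling:
    """Implements the common Scheduling interface listed above."""
    def group_fn(self, sizes): ...
    def flush(self): ...
    def can_fuse_vertical(self, node1, node2): ...
    def can_fuse_horizontal(self, node1, node2): ...
    def codegen_nodes(self, nodes): ...
    def codegen_sync(self): ...

if get_scheduling_for_device("xpu") is None:
    register_backend_for_device("xpu", MyScheduling, MyWrapperCodeGen)
```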
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100706
Approved by: https://github.com/jansel
Previously, when fusing a single node into a foreach op, the scheduler would iterate over each subnode and check whether it could be fused. This PR adds a mapping so that the node to fuse with can be found more quickly by checking dependencies.
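Conceptually, the lookup works like the sketch below (the names and dict-based "nodes" are illustrative, not the scheduler's actual API):
```python
# Index the subnodes of a foreach op by the buffers they produce, so a fusion
# candidate can be located directly from the incoming node's read deps
# instead of scanning every subnode.
def build_write_index(foreach_subnodes):
    writer_by_buffer = {}
    for subnode in foreach_subnodes:
        for buf_name in subnode["writes"]:
            writer_by_buffer[buf_name] = subnode
    return writer_by_buffer

def find_fusion_target(node, writer_by_buffer):
    for buf_name in node["reads"]:
        if buf_name in writer_by_buffer:
            return writer_by_buffer[buf_name]
    return None

# usage with toy nodes
subnodes = [{"writes": ["buf0"]}, {"writes": ["buf1"]}]
index = build_write_index(subnodes)
assert find_fusion_target({"reads": ["buf1"]}, index) is subnodes[1]
```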
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106008
Approved by: https://github.com/jansel
dependencies.py tracks reads and writes, which are used to identify dependencies between buffers: i.e. if buffer X reads buffer Y, then X depends on Y. ops.bucketize() reads from an offsets tensor, so we should track that read in dependencies.py to get dependencies right. Since bucketize performs a binary search over the offsets tensor, the dependency is marked as a StarDep to indicate that the entire tensor is needed.
Use case: we find that jagged tensor dense_to_jagged ops - which use bucketize() to map jagged indices to dense indices - perform better if the bucketize() kernel is separated from the gather kernel. Previously, because bucketize() wasn't marked as reading anything, it would just get inlined.
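A minimal sketch of the idea (the helper below is illustrative; the real change lives in the ops.bucketize handling inside dependencies.py):
```python
# Because the binary search may touch any element of the offsets buffer, the
# read is recorded as a StarDep on the whole buffer rather than a per-element
# MemoryDep keyed by an index expression.
from torch._inductor.dependencies import StarDep

def record_bucketize_read(reads: set, offsets_buffer_name: str) -> None:
    reads.add(StarDep(offsets_buffer_name))
```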
Differential Revision: [D47422704](https://our.internmc.facebook.com/intern/diff/D47422704)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105102
Approved by: https://github.com/eellison
When running BertForMaskedLM, I found that if I enable the kernel benchmark, essentially identical kernels get defined once per call site. The reason is that the benchmark harness of those kernels uses a different seed_offset for each invocation. It should be safe to just force seed_offset to 0 so we can deduplicate identical kernel definitions.
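To illustrate why this matters (toy example; the template string below is made up, not Inductor's actual benchmark harness): kernels are deduplicated by their source text, so any per-call-site constant defeats the dedup.
```python
harness_template = "def benchmark_kernel():\n    run_kernel(seed_offset={})\n"

# Different seed offsets per call site -> textually distinct definitions.
assert harness_template.format(0) != harness_template.format(8)

# Forcing seed_offset to 0 everywhere makes the definitions identical, so a
# set (or dict) keyed on source text collapses them into one entry.
deduped = {harness_template.format(0), harness_template.format(0)}
assert len(deduped) == 1
```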
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105099
Approved by: https://github.com/jansel
This allows indirect indexing expressions containing `ops.minimum` and `ops.maximum` to be hoisted into direct indexing expressions. I also add support for Min/Max to the cpp printer and fix the triton printer to support multi-argument Min/Max.
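A hypothetical pattern that benefits (example mine, not taken from the PR): an index clamped through minimum/maximum before an indirect load can now be folded into the direct indexing expression.
```python
import torch

def gather_clamped(x, idx):
    # clamp lowers to ops.maximum / ops.minimum on the index expression
    idx = idx.clamp(min=0, max=x.size(0) - 1)
    return x[idx]

compiled = torch.compile(gather_clamped)
out = compiled(torch.randn(128), torch.randint(-10, 140, (32,)))
```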
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105020
Approved by: https://github.com/lezcano
Fixes #101684
Before this change, we get a float constant in triton
```
tmp0 = 0.2
```
which in triton IR becomes a float32 value
```
%cst_0 = arith.constant dense<2.000000e-01> : tensor<2xf32>
```
After, we get a tensor with an explicit type
```
tmp0 = tl.full([1], 0.2, tl.float64)
```
which does generate a float64 in the triton IR
```
%cst_0 = arith.constant dense<2.000000e-01> : tensor<2xf64>
```
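A possible repro of the kind of code affected (this exact snippet is an assumption; the linked issue has the authoritative reproducer):
```python
import torch

@torch.compile
def scale(x):
    # with a float64 input, the 0.2 literal must materialize as tl.float64
    # in the generated kernel to avoid losing precision
    return x * 0.2

scale(torch.randn(2, dtype=torch.float64, device="cuda"))
```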
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104830
Approved by: https://github.com/lezcano
This is intended as a first step towards reductions with multiple outputs. This
also incidentally improves CSE of reductions under C++ codegen. For example,
```python
def fn(x):
    return torch.argmin(x, dim=-1), torch.argmin(x, dim=-1)
```
Currently this generates two reductions, where the common load is CSEd
```cpp
for(long i1=static_cast<long>(0L); i1<static_cast<long>(10); i1+=static_cast<long>(1L))
{
    auto tmp0 = in_ptr0[static_cast<long>(i1 + (10L*i0))];
    if (tmp_acc0.value > tmp0) {
        tmp_acc0.index = i1; tmp_acc0.value = tmp0;
    }
    if (tmp_acc1.value > tmp0) {
        tmp_acc1.index = i1; tmp_acc1.value = tmp0;
    }
}
auto tmp1 = tmp_acc0.index;
out_ptr0[static_cast<long>(i0)] = tmp1;
auto tmp2 = tmp_acc1.index;
out_ptr1[static_cast<long>(i0)] = tmp2;
```
but with this change it gets CSEd to a single accumulator
```cpp
for(long i1=static_cast<long>(0L); i1<static_cast<long>(10L); i1+=static_cast<long>(1L))
{
    auto tmp0 = in_ptr0[static_cast<long>(i1 + (10L*i0))];
    if (tmp_acc0.value > tmp0) {
        tmp_acc0.index = i1; tmp_acc0.value = tmp0;
    }
}
auto tmp1 = tmp_acc0.index;
out_ptr0[static_cast<long>(i0)] = tmp1;
out_ptr1[static_cast<long>(i0)] = tmp1;
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102737
Approved by: https://github.com/jgong5, https://github.com/lezcano
This is a bit inefficient because it computes the mean and throws it
away since ir.Reduction nodes only have 1 output. However, the mean
can at least be scheduled into the same loop as the variance now since
there is no data dependency. Thus we can take fewer passes over the
data.
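For context, a minimal call that exercises this path might look like the following (an assumed example; the PR itself does not include one):
```python
import torch

@torch.compile
def variance(x):
    # lowering the variance goes through a mean computation; with this change
    # the (discarded) mean reduction can share a loop with the variance one
    return torch.var(x, dim=-1)

variance(torch.randn(32, 1024))
```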
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102486
Approved by: https://github.com/lezcano, https://github.com/jansel
Background/problem: ops.bucketize needs to take a value `offsets_size`, which is the length of the `offsets` tensor. It is used, e.g., for the bounds of the binary search over the `offsets` tensor. The previous implementation of `ops.bucketize` expected `offsets_size` to be a CSEVariable; i.e. we'd pass `offsets_size = ops.index_expr(offsets.get_size()[0])` into `ops.bucketize()`. However, `ops.index_expr` will sometimes broadcast, turning the scalar `offsets_size` into a tensor. That caused errors, because [triton_helpers.bucketize_binary_search](a2fe6953bc/torch/_inductor/triton_helpers.py (L153-L155)) expects `offsets_size` to be a scalar. [Link - where the broadcasting happens](a2fe6953bc/torch/_inductor/codegen/triton.py (L1056))
Solution (this PR): Instead of passing `offsets_size` into `ops.bucketize` as a CSEVariable, pass in a sympy.Expr. Then, inside ops.bucketize, convert the sympy.Expr into a string that can be used in the generated triton code.
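A simplified sketch of the new calling convention (variable names are illustrative; the real code lives in the bucketize lowering and Triton codegen):
```python
import sympy

# The length of the offsets tensor stays a scalar sympy expression ...
offsets_size = sympy.Symbol("s0") + 1

# ... and is only rendered to text when the Triton source is emitted, so it
# can never be broadcast into a tensor the way an ops.index_expr result can.
offsets_size_str = str(offsets_size)   # e.g. "s0 + 1" in the generated kernel
```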
Differential Revision: [D47282413](https://our.internmc.facebook.com/intern/diff/D47282413)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104756
Approved by: https://github.com/jansel
In binary-search Triton implementations (#104007), num_elements_per_warp=32 performs a lot better than larger values.
This PR adds an autotuning config option for this purpose. But since autotuning can affect compile times and this config isn't generally useful, we only try it when a bucketize is present. This is done by adding an extra field to triton_meta, which is consumed by the pointwise autotuner.
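Conceptually, the gating looks like the sketch below (the metadata field and the config objects are hypothetical placeholders, not the real triton_heuristics code):
```python
def select_pointwise_configs(base_configs, triton_meta, bucketize_config):
    # Only pay the extra autotuning cost when the kernel actually contains a
    # bucketize; ordinary pointwise kernels keep the default config set.
    configs = list(base_configs)
    if triton_meta.get("has_bucketize"):   # assumed metadata field
        configs.append(bucketize_config)
    return configs

# usage sketch with placeholder config objects
configs = select_pointwise_configs(["default_cfg"], {"has_bucketize": True}, "warp32_cfg")
assert configs == ["default_cfg", "warp32_cfg"]
```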
Performance: reused https://gist.github.com/davidberard98/066fd2115f59f5889ef61e4527d1eba5.
Before:
```
Eager 0.30088499188423157 ms
PT2 0.9296960234642029 ms
```
After:
```
Eager 0.3011910021305084 ms
PT2 0.22977299988269806 ms
```
Differential Revision: [D47237103](https://our.internmc.facebook.com/intern/diff/D47237103)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104456
Approved by: https://github.com/eellison
We recently added an optimization that squashes the x dimension for persistent reduction kernels when we are confident that XBLOCK will always be 1. We need to update the code so that the coordinate descent tuner does not tune XBLOCK in this case.
Test command (fails before the fix and passes after):
```
TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 python benchmarks/dynamo/huggingface.py --backend inductor --amp --accuracy --only BertForMaskedLM --inference
```
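The intent can be summarized with a small sketch (function and field names are illustrative, not the tuner's actual API):
```python
def tunable_fields(kernel_meta: dict) -> list:
    # When the x dimension has been squashed away (persistent reduction with
    # XBLOCK fixed at 1), XBLOCK must not be offered to the coordinate
    # descent tuner at all.
    fields = ["XBLOCK", "RBLOCK", "num_warps"]
    if kernel_meta.get("no_x_dim"):
        fields.remove("XBLOCK")
    return fields

assert "XBLOCK" not in tunable_fields({"no_x_dim": True})
```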
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104692
Approved by: https://github.com/jansel
**TL;DR**: This PR is a first step in adding lowerings for torch.bucketize. It adds an initial lowering for this op - but because this implementation is not currently efficient, it registers the lowering for prims._inductor_bucketize. After we make the implementation more efficient, we'll remove prims._inductor_bucketize and add the lowering directly to torch.bucketize.
**Background - torch.bucketize**: torch.bucketize(values, boundaries, right=False): for an arbitrary tensor of values and a non-decreasing 1D tensor of boundaries that define buckets, it returns the index of the bucket that each of the values will fall in. e.g. for values [0, 1, 2, 3, 4] and boundaries [1, 3], it will return [0, 0, 1, 1, 2].
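A quick sanity check of those semantics in eager mode (example mine):
```python
import torch

values = torch.tensor([0, 1, 2, 3, 4])
boundaries = torch.tensor([1, 3])

torch.bucketize(values, boundaries)               # tensor([0, 0, 1, 1, 2])
torch.bucketize(values, boundaries, right=True)   # tensor([0, 1, 1, 2, 2])
```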
**Implementation**: This PR adds a new inductor op called "bucketize". In this PR it only has a triton implementation - for CPU it is a fallback. The triton implementation uses a binary search in `triton_helpers.py`. This PR also adds a new prim `_inductor_bucketize()` for testing purposes and adds lowering for this op.
~~**"right"**: The current behavior of the "right" kwarg in the inductor op is the opposite of the behavior of the torch op. "right" controls how the op treats a value that is equal to one of the boundary values. In the torch op, "right=True" means "if a value is equal to a boundary value, then put it in the bucket to the right". In the inductor op, "right=True" means "the right boundary of a bucket is closed". These are opposite. **I'm open to switching the behavior of the inductor op** - but I chose to implement this way because I think it makes more sense, and I think the torch.bucketize behavior may have been a mistake (it's the opposite of numpy.digitize).~~ Switched the behavior of the inductor bucketize op to match the torch op
* places where "right" means "if a value is equal to a boundary value, then put it in the bucket to the right" (i.e. current torch.bucketize behavior)
+ current torch.bucketize behavior
+ table in [torch.bucketize docs](https://pytorch.org/docs/stable/generated/torch.bucketize.html)
* places where "right" means "the right boundary of a bucket is closed":
+ the text description of [torch.bucketize docs](https://pytorch.org/docs/stable/generated/torch.bucketize.html) (observed in #91580)
+ [numpy.digitize](https://numpy.org/doc/stable/reference/generated/numpy.digitize.html) (which is basically the same op)
**Performance**: Benchmark script: "values" as a [16, 1024, 1024] float32 tensor and "boundaries" as a [1025] tensor (i.e. defining 1024 buckets).
As is:
```
Eager 0.30117499828338623 ms
PT2 0.9298200011253357 ms
```
But performance improves significantly if we add an additional pointwise autotuning config (WIP in #104456):
```
Eager 0.3015420138835907 ms
PT2 0.23028500378131866 ms
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104007
Approved by: https://github.com/jansel
I doubt there's much difference in performance, but this improves readability of
the generated code, e.g.
```python
tmp8 = triton_helpers.max2(tmp7, 1)[:, None]
```
becomes
```python
tmp8 = triton_helpers.any(tmp7, 1)[:, None]
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103974
Approved by: https://github.com/lezcano
Fixes #103481
Normally triton tensors have shape `[XBLOCK, RBLOCK]`, or some variation where the lengths are 1 but the number of dimensions is the same. The `no_x_dim` change, in addition to removing the x dimension, also removed the r dimension from certain values such as the results of reductions and the `xindex` variable.
This fixes those two cases to correctly produce tensors of shape `[1]`,
equivalent to the old shape `[XBLOCK, 1]` with the x-dimension dropped.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103527
Approved by: https://github.com/ngimel
ValueRanges can't handle symbolic bounds. Be a bit more careful about detecting if you try to pass in expressions with free symbols, and fall back to "don't know" range if this occurs.
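The fallback behaves roughly like this sketch (a minimal illustration, not the actual ValueRanges API):
```python
import sympy

def bounds_or_unknown(expr: sympy.Expr):
    # Expressions that still contain free symbols cannot be bounded to a
    # concrete interval, so fall back to the unbounded "don't know" range.
    if expr.free_symbols:
        return (-sympy.oo, sympy.oo)
    return (expr, expr)   # a constant bounds itself exactly

s = sympy.Symbol("s0")
assert bounds_or_unknown(s + 1) == (-sympy.oo, sympy.oo)
assert bounds_or_unknown(sympy.Integer(3)) == (3, 3)
```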
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103470
Approved by: https://github.com/eellison
Originally, my goal for this PR was to remove the `dynamic_shapes` tests in torch/_dynamo/variables/builder.py. However, one thing led to another, and it turns out that it was easiest to do all of the following in one go:
* Unconditionally allocate a ShapeEnv, no matter if dynamic_shapes is enabled or not (torch/_dynamo/output_graph.py). There is a small adjustment to export torch/_dynamo/eval_frame.py to account for the fact that a ShapeEnv always exists, even if you're not doing symbolic export.
* Remove dynamic_shapes test from unspec logic (torch/_dynamo/variables/builder.py), the original goal
* Specialize strides and storage offset if all sizes are dynamic (torch/fx/experimental/symbolic_shapes.py). This is required to deal with unconditional ShapeEnv: if a ShapeEnv exist, fake tensor-ification may choose to allocate symbols. The idea is that with `automatic_dynamic_shapes == False`, Dynamo should never request dynamic sizes, but this invariant was not upheld for nontrivial strides/offset.
The rest are just auxiliary fixups from the above:
* Workaround bug in FakeTensorProp where sometimes it doesn't return a FakeTensor (torch/fx/passes/fake_tensor_prop.py), see https://github.com/pytorch/pytorch/pull/103395 for follow up
* Make ShapeProp correctly handle int inputs (torch/fx/passes/shape_prop.py)
* Disable indexing strength reduction if `assume_static_by_default` is False (torch/_inductor/codegen/triton.py)
* Fix hf_T5_generate to NOT toggle `assume_static_by_default` if dynamic shapes is not enabled (benchmarks/dynamo/common.py); technically this is not necessary anymore but it's in for safety.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103302
Approved by: https://github.com/voznesenskym
This helps with kernels that make use of caching like mid-range softmax
which reads the data three times.
Selecting `eviction_policy=evict_first` in the last loop of the softmax
operation seems to give a 7-10% speed-up vs. selecting `evict_last` which
was the previous option. I'll put up some benchmarks soon™.
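A hand-written illustration of the hint (this is not Inductor's generated code; the kernel below just shows where the `eviction_policy` argument of `tl.load` goes for a final pass whose input will not be re-read):
```python
import triton
import triton.language as tl

@triton.jit
def final_pass(in_ptr, denom_ptr, out_ptr, n, BLOCK: tl.constexpr):
    offs = tl.arange(0, BLOCK)
    mask = offs < n
    # Data read here is not needed again, so evicting it first leaves more
    # cache for values that other loops still want to reuse.
    x = tl.load(in_ptr + offs, mask=mask, other=0.0, eviction_policy="evict_first")
    denom = tl.load(denom_ptr)
    tl.store(out_ptr + offs, x / denom, mask=mask)
```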
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91316
Approved by: https://github.com/ngimel, https://github.com/jansel
Currently, reduction bodies are duplicated in several different places.
This reduces the duplication by reusing the `combine_fn` definition from
`_unroll_reduction_fn` in the triton codegen. For cpp
this also makes better use of `reduction_combine{,_vec}` by using them
to generate the `omp declare reduction` line and the `vec_reduce_all`
call.
For triton the only change is that the combine step gets spread
over two lines, e.g. instead of:
```python
_tmp1 = tl.where(rmask & xmask, triton_helpers.maximum(_tmp1, tmp0), _tmp1)
```
we get
```python
tmp2 = triton_helpers.maximum(_tmp1, tmp0)
_tmp1 = tl.where(rmask & xmask, tmp2, _tmp1)
```
For cpp the only change is that in-place reduction operations are now written as
an out-of-place operation and an assignment, e.g. instead of
```cpp
omp_out += omp_in
```
we generate
```cpp
omp_out = omp_out + omp_in
```
This is a purely cosmetic change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99661
Approved by: https://github.com/lezcano, https://github.com/ngimel