Example of when the `evict_first` heuristic helps.
```python
import torch
from torch._inductor.utils import do_bench

@torch.compile
def f(a, b):
    return (a * b).sum(dim=-1)

N = 512
inps = (torch.randn(N, N, N).permute(2, 1, 0), torch.randn(N, N, N).permute(1, 2, 0))
print(do_bench(lambda: f(*inps)))
```
This generates code like this: http://ix.io/4HFs
```
Original: 3.8 ms
This PR: 3.54 ms
Always `evict_first`: 5.4 ms
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108841
Approved by: https://github.com/lezcano, https://github.com/jansel
Fixes #109196
When we have a split reduction and the tensor is not an even multiple of the split size,
we use `ops.masked` to pad to an even multiple. In the case here we generated:
```python
tmp5 = tl.where(mask, tmp4, 0)
```
which implicitly promotes our boolean value to `int32`. The fix is to give the default
value the same dtype as `result`.
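As a rough, hypothetical sketch (not necessarily the exact code Inductor emits after the fix), a dtype-consistent version of that line would construct the default value in the result's dtype, e.g. for a boolean result:
```python
# Hypothetical fixed form: build the masked default in the result's dtype
# (tl.int1 for a boolean) instead of letting the literal 0 promote to int32.
tmp5 = tl.where(mask, tmp4, tl.full([1], 0, tl.int1))
```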
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109325
Approved by: https://github.com/lezcano
Summary: This PR adds dynamic-shape support for AOTInductor
* On the runtime/interface side, we added two structs, StaticDimInfo
and DynamicDimInfo, to hold values for static and dynamic dimensions,
respectively. Dynamic dimensions are tracked by an unordered map field
defined in AOTInductorModelBase. At inference time, the run method
assigns the actual value of each dynamic dimension before executing
any kernel.
* On the CUDA wrapper codegen side, we generate dynamic symbols
appropriately for shape computations. We simulate kernel launch grids
in the C++ land by re-using the grid functions from the Python world.
The returned grid configs, which may contain symbolic expressions,
are printed out in their C++ forms via the CppPrinter. Note that
when dynamic shapes are involved, we have to compute grid configs
for each kernel at runtime in the same way as we do for launching
the corresponding Triton kernel. Otherwise, we may end up with
memory-access failures or mis-computations caused by invalid indices
for fetching or storing data in device memory.
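To illustrate the last point, here is a minimal sketch (made-up sizes and a simplified grid helper, not the actual AOTInductor code) of why launch grids must be recomputed per run when a dimension is dynamic:
```python
from math import ceil

def grid(xnumel, XBLOCK=1024):
    # one program instance per XBLOCK-sized chunk of the flattened iteration space
    return (ceil(xnumel / XBLOCK),)

# s0 stands for a dynamic dimension whose value is only known at inference time,
# so the grid has to be re-evaluated for every run, just like the Triton launch does.
for s0 in (128, 4096):
    print(s0, grid(s0 * 512))
```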
Differential Revision: D49100472
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109012
Approved by: https://github.com/khabinov, https://github.com/desertfire, https://github.com/hl475
This adds the `ir.Scan` node (currently only supported on CUDA) which re-uses the existing reduction kernel machinery to support different kinds of non-pointwise ops. Just like reductions it supports prologue and epilogue fusions and has both persistent and non-persistent kernel generation.
Currently this doesn't support the equivalent of `Reduction.create_multilayer` and will instead fall back to eager in those cases. This is because splitting into multiple kernel invocations ends up being far slower than cub's single kernel strategy which matches the performance of a copy kernel.
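For context, a typical op that lowers to `ir.Scan` is a cumulative sum; a small example (assumed, exercising prologue and epilogue fusion around the scan) would be:
```python
import torch

@torch.compile
def f(x, y):
    # the multiply is a prologue and the add an epilogue fused around the cumsum scan
    return torch.cumsum(x * y, dim=-1) + 1

x = torch.randn(1024, 4096, device="cuda")
y = torch.randn(1024, 4096, device="cuda")
out = f(x, y)
```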
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106581
Approved by: https://github.com/lezcano, https://github.com/atalman
Inductor kernel codegen previously had the following side effects:
- in `Kernel.__exit__`, we add locally used buffers to graph.removed_buffers
- during codegen, we do memory allocation/free.
These side effects make it hard to do multiple versions of codegen for the same kernel. This PR refactors the code so that kernel codegen does not change graph-level state. After codegening a kernel, the graph-level state is unchanged, so we can go on to codegen another version of the kernel if we want.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107617
Approved by: https://github.com/jansel
We'd like to benchmark fusion (either for autotuning or for gathering data to find patterns that can guide optimizations). There is a deadlock here that prevents us from doing this: to benchmark fusion, we need to do codegen before all the fusions are done. However, codegen currently relies on xSchedulerNode.last_usage information to decide which buffers are not needed at all and thus don't even need to be allocated/written (Scheduler.removed_buffers tracks this). xSchedulerNode.last_usage can only be computed once the order of all the nodes has been decided. But each fusion pass (`fuse_nodes_once`) can also change the node order, so we know the final node order only after all the fusions have completed. That blocks us from doing codegen during fusion (before all fusions are done).
Here I just show the above with a chain of dependencies to make it easier to understand (a -> b means a depends on b, or b has to happen before a):
```
benchmark one fusion decision -> codegen -> xSchedulerNode.last_usage -> node order -> all fusions have completed
```
Actually, we only need to decide whether a buffer has only local usages (if yes, it's a candidate for removal). This can be decided if we know all the users of each buffer, which lets us avoid relying on xSchedulerNode.last_usage here.
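A minimal sketch of that check, with assumed names rather than the actual Scheduler code: a buffer whose users all live inside the kernel currently being generated never needs a global allocation and can be removed.
```python
def has_only_local_users(buffer_name, users_by_buffer, nodes_in_kernel):
    # users_by_buffer: map from buffer name to the set of nodes that read it
    # nodes_in_kernel: the (possibly fused) nodes being codegened right now
    users = users_by_buffer.get(buffer_name, set())
    return bool(users) and users <= nodes_in_kernel

# Example with hypothetical node names:
users = {"buf0": {"node1", "node2"}, "buf1": {"node2", "node5"}}
kernel_nodes = {"node1", "node2"}
print(has_only_local_users("buf0", users, kernel_nodes))  # True  -> removable
print(has_only_local_users("buf1", users, kernel_nodes))  # False -> must be allocated
```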
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107320
Approved by: https://github.com/peterbell10, https://github.com/jansel
This PR relands https://github.com/pytorch/pytorch/pull/106827, which was reverted because it caused a compilation error for some ads models.
Yanbo provided a repro in one of the 14k models (`pytest ./generated/test_KaiyangZhou_deep_person_reid.py -k test_044`). This is also the model I used to confirm the fix and come up with a unit test. In this model, we call `triton_heuristics.triton_config` with size_hints [2048, 2]. Previously this would result in a triton config with XBLOCK=2048 and YBLOCK=2. But since we changed the mapping between size_hints and the XYZ dimensions, we now generate a triton config with XBLOCK=2 and YBLOCK=2048. This fails compilation since we set the max YBLOCK to 1024.
My fix is to make sure we never generate a triton config that exceeds the maximum block size.
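A sketch of that guard with assumed limits (the real per-dimension maximums in Inductor may differ):
```python
# Assumed per-dimension maximum block sizes, for illustration only.
MAX_BLOCK = {"X": 2048, "Y": 1024, "Z": 64}

def clamp_block(dim, block):
    # never emit a config whose block size exceeds the per-dimension limit
    return min(block, MAX_BLOCK[dim])

print(clamp_block("Y", 2048))  # -> 1024, avoiding the failing YBLOCK=2048 config
```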
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107902
Approved by: https://github.com/jansel
Summary:
D48295371 caused a batch fusion failure, which will block mc proposals on all mc models.
e.g. cmf f470938179
Test Plan: Without revert, f469732293. With revert diff f472266199.
Differential Revision: D48610062
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107796
Approved by: https://github.com/yanboliang
I found that the upsample bicubic lowering was generating this line
```python
ops.index_expr(0.244094488188976*x0, torch.float32)
```
which is not good because triton's `ops.index_expr` expects integer expressions and dtypes.
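One plausible shape for a corrected lowering (a hypothetical fragment of generated code, not the exact change in this PR, and not runnable on its own) keeps the `index_expr` integral and performs the scaling as an explicit floating-point op:
```python
idx = ops.index_expr(x0, torch.int64)                      # integral index expression
scale = ops.constant(0.244094488188976, torch.float32)     # the bicubic scale factor
scaled = ops.mul(ops.to_dtype(idx, torch.float32), scale)  # float math done explicitly
```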
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105021
Approved by: https://github.com/lezcano
This replaces `var_unnormalized` reduction type with `welford_reduce` which takes the input data and outputs not just the variance, but also the mean and weights which account for the full welford accumulator state. Thus we can avoid re-computing the mean, and we now have enough information to create a multilayer reduction which I implement here by adding a second reduction type called `welford_combine` which reduces over all three inputs simultaneously.
Multi-layer support is particularly important as normalization operators like BatchNorm are being split in many timm models, which meant `var_unnormalized` had to fall back to two-pass variance calculation.
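For reference, here is a plain-Python version (not Inductor's implementation) of the two primitives: `welford_reduce` folds one value into a (mean, m2, weight) accumulator and `welford_combine` merges two accumulators, which is exactly what the multilayer/split reduction needs.
```python
def welford_reduce(acc, x):
    # fold a single value into the running (mean, m2, weight) state
    mean, m2, weight = acc
    weight += 1
    delta = x - mean
    mean += delta / weight
    m2 += delta * (x - mean)
    return mean, m2, weight

def welford_combine(a, b):
    # merge two accumulators produced from disjoint chunks of the data
    mean_a, m2_a, w_a = a
    mean_b, m2_b, w_b = b
    w = w_a + w_b
    delta = mean_b - mean_a
    mean = mean_a + delta * (w_b / w)
    m2 = m2_a + m2_b + delta * delta * (w_a * w_b / w)
    return mean, m2, w

data = [1.0, 2.0, 4.0, 7.0]
acc_lo, acc_hi = (0.0, 0.0, 0.0), (0.0, 0.0, 0.0)
for v in data[:2]:
    acc_lo = welford_reduce(acc_lo, v)
for v in data[2:]:
    acc_hi = welford_reduce(acc_hi, v)
mean, m2, w = welford_combine(acc_lo, acc_hi)
print(mean, m2 / w)  # 3.5 and the (biased) variance 5.25 of the full data
```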
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104725
Approved by: https://github.com/lezcano
## Summary
This is a re-land of https://github.com/pytorch/pytorch/pull/100706 that addresses the compilation latency regression.
## Root Cause
For the C++/OpenMP backend, `codecache.pick_vec_isa()`, which checks the vectorization ISA, is a time-consuming, one-shot operation. It makes importing the `codegen.cpp` package slower because the package's `LoopLevel` is decorated with `@dataclasses.dataclass`, and evaluating its field defaults invokes `codecache.pick_vec_isa()` to initialize the `simd_nelements` of `LoopLevel`.
c14cf312c9/torch/_inductor/codegen/cpp.py (L2883C53-L2883C53)
The Triton backend does not need this, but we'd prefer to keep the code uniform. Therefore, the new design registers both `CppScheduling` for CPU and `TritonScheduling` for Triton regardless of whether the current backend is Triton, which brings additional overhead to the Triton backend.
```python
def init_backend_registration(self):
if get_scheduling_for_device("cpu") is None:
from .codegen.cpp import CppScheduling
register_backend_for_device("cpu", CppScheduling, WrapperCodeGen)
if get_scheduling_for_device("cuda") is None:
from .codegen.triton import TritonScheduling
register_backend_for_device("cuda", TritonScheduling, WrapperCodeGen)
```
## Solution
To resolve the compilation latency regression for the Triton backend, we changed `LoopLevel` slightly ([new code changes](https://github.com/pytorch/pytorch/pull/106874/files#diff-5ab7b0235e2076a5fc6629ba0b109208940f5b94f5c13babc3e0f87cf4fcec82R2893-R2904)) by moving the initialization of `simd_nelements` into `__post_init__`, which brings the compilation performance back.
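A condensed sketch of the idea (the stand-in `pick_vec_isa` below is not the real function): dataclass field defaults are evaluated when the class body executes, i.e. at import time, whereas `__post_init__` runs only when an instance is created.
```python
import dataclasses

def pick_vec_isa():
    # stand-in for the expensive, one-shot codecache.pick_vec_isa() probe
    return 16

# Before (simplified): the default is computed while the class body executes,
# i.e. when the module is imported.
#
# @dataclasses.dataclass
# class LoopLevel:
#     simd_nelements: int = pick_vec_isa()

# After (simplified): compute it lazily, so importing the module stays cheap.
@dataclasses.dataclass
class LoopLevel:
    simd_nelements: int = 0

    def __post_init__(self):
        if not self.simd_nelements:
            self.simd_nelements = pick_vec_isa()
```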
## Compilation Latency Performance Result
We ran a single model benchmark and reproduced the compilation regression:
- Run `python benchmarks/dynamo/torchbench.py -dcuda --training --performance --inductor --only hf_Bart`
- W/ PR #100706, the compilation latency is about **57~58** seconds
```
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cuda,hf_Bart,4,1.556712,109.676554,57.055242,0.936330,5.760698,6.152422,642,1,8,7
cuda,hf_Bart,4,1.646658,109.621747,57.909817,0.936330,5.760698,6.152422,642,1,8,7
```
- W/O PR #100706, the compilation latency is about **46~47** seconds
```
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cuda,hf_Bart,4,1.599065,108.702480,47.490346,0.936330,5.760698,6.152422,642,1,8,7
cuda,hf_Bart,4,1.588419,108.431411,46.983041,0.936330,5.760698,6.152422,642,1,8,7
```
This PR fixed the compilation performance regression.
- W/ this PR #106874, the compilation latency is about **47~48** seconds
```
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cuda,hf_Bart,4,1.586261,108.149467,47.481058,0.936330,5.760698,6.152422,642,1,8,7
cuda,hf_Bart,4,1.758915,108.613899,47.925633,0.936330,5.760698,6.152422,642,1,8,7
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106874
Approved by: https://github.com/jansel
I found that for a tiled kernel over a tensor with shape [a, b], we map 'a' to XBLOCK and 'b' to YBLOCK. However, 'a' should actually be the outer loop while 'b' corresponds to the inner loop. This order is picked by our loop-ordering algorithm. Mapping 'a' to XBLOCK instead has the semantics of assigning 'a' to the inner loop.
For a simple 'A + B.t()' kernel, making the loop order consistent brings a 1.027x speedup (1.938ms -> 1.887ms). Here are the dumps of the kernels:
- before fix: https://gist.github.com/shunting314/4dacf73cf495cdd7e84dede7c3e0872d
- after fix (this one is done manually): https://gist.github.com/shunting314/441e8839d24e1878c313e539b1ebd551
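For reference, the 'A + B.t()' pattern above is simply the following (sizes assumed), where the two operands disagree on which dimension is contiguous, so the tiling/loop-order choice matters:
```python
import torch

@torch.compile
def add_t(a, b):
    # a is contiguous along its last dim while b.t() is not: a 2D-tiled pointwise kernel
    return a + b.t()

a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")
out = add_t(a, b)
```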
I tried this on DistillGPT2 and found perf is neutral, but that's because DistillGPT2 has only a single tiled pointwise kernel in its backward graph. Will check the dashboard.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106827
Approved by: https://github.com/jansel
`JITFunction._key_of` uses the value of the argument to distinguish between
i32 and i64, but this fails if the value is used in indexing calculations where
the value exceeds `INT_MAX`.
Instead, we should use `index_dtype` which means all indexing calculations are
performed in the same dtype.
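A small illustration with made-up sizes: the argument itself fits in 32 bits, so `_key_of` would specialize it as i32, but an index derived from it does not.
```python
INT_MAX = 2**31 - 1

numel_per_row = 70_000
row = 40_000
print(numel_per_row <= INT_MAX, row <= INT_MAX)  # True True: both args look like i32

# ...but the indexing arithmetic overflows i32, so it must be done in the index dtype (i64).
offset = row * numel_per_row
print(offset > INT_MAX)  # True
```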
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106870
Approved by: https://github.com/lezcano
ghstack dependencies: #106626
This PR extends Inductor to support third-party backends that focus only on code generation, just like the existing C++/OpenMP and Triton backends.
Currently, the code generated by Inductor contains two major parts: the kernels, and the Python wrapper that glues the kernels together. Therefore, a third-party backend needs to customize both parts to generate its backend-specific code.
- Python wrapper code generation
Inductor provides a `WrapperCodeGen` class to generate the Python wrapper code that glues the kernels together. Therefore, it is straightforward for the third-party backend to generate backend-specific Python wrapper code: it just needs to inherit from the `WrapperCodeGen` class and override the relevant member functions.
- Kernel code generation
Kernel code generation is driven by the different `Scheduling` classes. Hence, the third-party backend needs to provide a custom `Scheduling` for its specific kernel code generation. Currently, `CppScheduling` and `TritonScheduling` serve the C++/OpenMP and Triton backends, respectively, but there is no common `Scheduling` base class. Based on how scheduling is invoked, this PR abstracts a common `Scheduling` class containing the following member functions.
- [group_fn](71c4becda7/torch/_inductor/scheduler.py (LL649C64-L649C64))
- [flush](71c4becda7/torch/_inductor/scheduler.py (L1150))
- [can_fuse_vertical](71c4becda7/torch/_inductor/scheduler.py (L1006))
- [can_fuse_horizontal](71c4becda7/torch/_inductor/scheduler.py (LL1008C45-L1008C64))
- [codegen_template](71c4becda7/torch/_inductor/scheduler.py (L1234)) _This function is only available for triton. If the third-party backend behaves as a sub-class of `TritonScheduling`, it can override it or reuse it._
- [codegen_nodes](71c4becda7/torch/_inductor/scheduler.py (L1234))
- [codegen_sync](71c4becda7/torch/_inductor/scheduler.py (LL1251C1-L1251C1)). _This function is only available for triton debug purpose. But it might also be useful for other computation devices. Therefore, we'd prefer to keep this function._
The third-party backend needs to inherit from the `Scheduling` class and implement these functions.
Some other code-generation classes, like `CppKernel` and `TritonKernel`, are used by or are part of the logic of either `Scheduling` or `WrapperCodeGen`. Hence, this PR does not define an interface for them and leaves the flexibility to the third-party backend, which can implement these classes from scratch or reuse them by inheriting and overriding.
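Putting it together, a third-party backend would roughly do the following (the device name and import paths are assumptions for illustration):
```python
# Hypothetical registration for a custom device; real module locations may differ.
from torch._inductor.codegen.common import register_backend_for_device
from torch._inductor.codegen.wrapper import WrapperCodeGen

class MyDeviceWrapperCodeGen(WrapperCodeGen):
    """Override only the wrapper pieces that differ for the custom device."""

class MyDeviceScheduling:
    """Implements the common Scheduling interface: group_fn, flush,
    can_fuse_vertical/horizontal, codegen_nodes, codegen_sync, ..."""
    def __init__(self, scheduler):
        self.scheduler = scheduler

register_backend_for_device("mydevice", MyDeviceScheduling, MyDeviceWrapperCodeGen)
```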
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100706
Approved by: https://github.com/jansel
Previously, when fusing a single node into a foreach op, the scheduler would iterate over each subnode and check whether it could be fused. This PR adds a mapping so that the node to fuse with can be found more quickly by checking dependencies.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106008
Approved by: https://github.com/jansel
dependencies.py is used for tracking reads and writes, which is used for identifying dependencies between buffers: i.e. if buffer X reads buffer Y, then X depends on Y. ops.bucketize() reads from an offsets tensor, so we should track it in dependencies.py to correctly track dependencies. Since bucketize performs a binary search over the offsets tensor, the dependency is marked as a StarDep to indicate that the entire tensor is needed.
Use case: we find that jagged tensor dense_to_jagged ops - which use bucketize() to map jagged indices to dense indices - perform better if the bucketize() kernel is separated from the gather kernel. Previously, because bucketize() wasn't marked as reading anything, it would just get inlined.
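For context, a hypothetical dense_to_jagged-style use of bucketize(): every lookup binary-searches the whole offsets tensor, which is why the read is recorded as a StarDep rather than a per-element dependency.
```python
import torch

values = torch.tensor([0, 3, 5, 9, 14])   # jagged indices (example data)
offsets = torch.tensor([0, 4, 8, 12])     # row offsets of the jagged layout

# Each value is binary-searched against the entire offsets tensor.
rows = torch.bucketize(values, offsets, right=True) - 1
print(rows)  # tensor([0, 0, 1, 2, 3])
```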
Differential Revision: [D47422704](https://our.internmc.facebook.com/intern/diff/D47422704)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105102
Approved by: https://github.com/eellison
When running BertForMaskedLM, I found that if I enable the kernel benchmark, essentially identical kernels get defined once per call site. The reason is that the benchmark harness of those kernels uses a different seed_offset for each invocation. We should be safe to just force seed_offset to 0 so we can deduplicate identical kernel definitions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105099
Approved by: https://github.com/jansel