pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
PyTorch MergeBot	ff7c06a01b	Revert "limit fused kernel num args. (#113131 )" This reverts commit `7b442c2b0a`. Reverted https://github.com/pytorch/pytorch/pull/113131 on behalf of https://github.com/albanD due to Breaks lint on trunk ([comment](https://github.com/pytorch/pytorch/pull/113131#issuecomment-1817548349))	2023-11-18 16:14:08 +00:00
Han, Xu	7b442c2b0a	limit fused kernel num args. (#113131 ) Fixes #97361 When a fused kernel has more than 1024 parameters, ctypes throws an error. Limiting the number of args is a mechanism to protect stack memory. As we know, C++ passes args via stack memory, and stack memory has a size limit. Code change: 1. The cpp backend checks the fused nodes' arg count; if it reaches the limit, it sets the flush status to ready. 2. The scheduler checks the `ready_to_flush` API and helps the backend flush codegen. 3. Add the `ready_to_flush` API to `BaseScheduling`; the Triton backend returns False because it does not support it yet. Pull Request resolved: https://github.com/pytorch/pytorch/pull/113131 Approved by: https://github.com/jgong5, https://github.com/mlazos	2023-11-18 03:55:52 +00:00
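The flush mechanism described in the commit above can be sketched roughly as follows. This is a toy illustration, not Inductor's code: `ToyBackend`, `ToyScheduler`, `add_node`, `flush`, and `MAX_FUSED_ARGS` are hypothetical names; only `ready_to_flush` and the 1024 limit come from the commit message.

```python
MAX_FUSED_ARGS = 1024  # limit mentioned in the commit message


class ToyBackend:
    """Hypothetical backend that accumulates kernel args until flushed."""

    def __init__(self):
        self.pending_args = 0
        self.flushed_kernels = []

    def add_node(self, num_args):
        self.pending_args += num_args

    def ready_to_flush(self):
        # True once the accumulated arg count reaches the limit.
        return self.pending_args >= MAX_FUSED_ARGS

    def flush(self):
        # Emit one "kernel" holding everything accumulated so far.
        if self.pending_args:
            self.flushed_kernels.append(self.pending_args)
            self.pending_args = 0


class ToyScheduler:
    """Hypothetical scheduler that asks the backend when to flush."""

    def __init__(self, backend):
        self.backend = backend

    def codegen(self, node_arg_counts):
        for num_args in node_arg_counts:
            self.backend.add_node(num_args)
            # The check runs after a node is added, so this toy assumes
            # each node contributes few args relative to the limit.
            if self.backend.ready_to_flush():
                self.backend.flush()
        self.backend.flush()  # emit whatever is left at the end


backend = ToyBackend()
ToyScheduler(backend).codegen([512, 512, 10])
print(backend.flushed_kernels)
```

A backend that does not support early flushing would simply return `False` from `ready_to_flush`, which is what the commit message says the Triton backend does.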
Jez Ng	4667e20b3f	Delete a bunch of type-ignores (#113990 ) * Replaced `ignore[import]` by mypy config file entries * Removed a bunch of ignores around previously-fixed attr-defined / call-arg issues * Fixed some invalid / undefined types; added a few more type-ignores to squelch the downstream errors this exposed Pull Request resolved: https://github.com/pytorch/pytorch/pull/113990 Approved by: https://github.com/eellison, https://github.com/Skylion007 ghstack dependencies: #113979	2023-11-18 02:48:38 +00:00
Jez Ng	204ec11e6d	[inductor][easy] Fix fusion logging (#113308 ) We should use %s instead of %d as the numel may be sympy Exprs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/113308 Approved by: https://github.com/lezcano	2023-11-09 03:19:39 +00:00
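A small sketch of why `%s` is the safer format specifier here. `FakeSymExpr` is a hypothetical stand-in for a symbolic size (such as a sympy expression), used so the example runs without extra dependencies, and the message text is made up.

```python
class FakeSymExpr:
    """Hypothetical stand-in for a symbolic size expression."""

    def __init__(self, text):
        self.text = text

    def __str__(self):
        return self.text


numel = FakeSymExpr("4*s0")

# %s calls str() on the argument, so symbolic values format fine.
ok_message = "fusing nodes with numel=%s" % numel

# %d requires a real number and rejects the symbolic stand-in.
try:
    "fusing nodes with numel=%d" % numel
    d_failed = False
except TypeError:
    d_failed = True

print(ok_message, d_failed)
```

The `logging` module applies the same `%` formatting to its arguments when a record is emitted, so the same distinction applies to log calls.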
Jez Ng	dc63248b76	Make dynamo configs more amenable to static type checking (#112130 ) `install_config_module` makes a regular module into a ConfigModule with extra methods defined on it. mypy thinks those extra methods (or module functions) are undefined since it cannot analyze something so dynamic. As a workaround, I've created a fake module that defines these extra functions, which I import into the config modules during type checking. As part of this change, I've also added more types to config_utils.py and enabled typechecking for torch/_dynamo/config.py. Pull Request resolved: https://github.com/pytorch/pytorch/pull/112130 Approved by: https://github.com/jansel	2023-11-08 21:17:45 +00:00
drisspg	74c24d2367	Fixes a bug in inductor.triton.load (#113047 ) Letting CI/CD tell me if there is anything wrong with this Original bug: ``` Shell r1 = rindex tmp37 = tl.load(out_ptr2 + (r1 + (8192*x0)), rmask, eviction_policy='evict_first', other=0) ^ AssertionError('cannot cast int32[constexpr[1],constexpr[2048]] to <[1, 2048], fp8e4nv>') ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/113047 Approved by: https://github.com/Skylion007, https://github.com/ipiszy	2023-11-07 04:06:54 +00:00
Aaron Gokaslan	8219bf051b	[BE]: Apply RUF015 to torch folder (#113025 ) Removes unnecessary allocations of iterators. There is a small chance this may have side effects as the entire iterator is no longer consumed, but this is a way more efficient method for retrieving the first element. Pull Request resolved: https://github.com/pytorch/pytorch/pull/113025 Approved by: https://github.com/ezyang, https://github.com/malfet	2023-11-07 00:48:15 +00:00
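The RUF015 rewrite and the side-effect caveat mentioned above can be illustrated with a generator that records what it yields. The names `numbers` and `consumed` are made up for this example.

```python
consumed = []


def numbers():
    """Generator with a visible side effect for each item it yields."""
    for i in range(3):
        consumed.append(i)
        yield i


# Old pattern: materializes the whole iterable, then takes element 0.
first_old = list(numbers())[0]
consumed_old = list(consumed)

consumed.clear()

# RUF015 pattern: stops after the first item, so later side effects never run.
first_new = next(iter(numbers()))
consumed_new = list(consumed)

print(first_old, consumed_old, first_new, consumed_new)
```

Both forms return the same first element; the difference is only in how much of the iterator gets consumed.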
Shunting Zhang	493ae78201	[inductor] nan-checker (#112091 ) This PR is split out of https://github.com/pytorch/pytorch/pull/108193 . It adds the ability to add assertions after each triton kernel call to make sure all tensor arguments are not nan/inf. It helped me find a few bugs when working on benchmark fusion (due to messing up some kernel/graph level states when generating kernel code). Right now we have to disable cudagraphs to enable the nan/inf checks. Otherwise we will see errors like: https://gist.github.com/shunting314/053db66c4f121e5f4c5de159bf0032ed . My best guess is it's due to GPU->CPU copy during capturing for cudagraphs. cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler @eellison if there is an easy way to make it work with cudagraphs. But even if the nan-checker is not compatible with cudagraphs, it's probably still fine since it's just for debugging purposes. Test command: ``` TORCHINDUCTOR_BENCHMARK_KERNEL=1 TORCHINDUCTOR_NAN_ASSERTS=1 python benchmarks/dynamo/huggingface.py --backend inductor --amp --performance --only BertForMaskedLM --training --disable-cudagraphs ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/112091 Approved by: https://github.com/eellison, https://github.com/jansel	2023-11-02 02:32:04 +00:00
David Berard	8191fb3e06	[Reland2] [inductor][BE] split triton_meta and inductor_meta (#112351 ) triton_meta is intended to be passed directly to triton. Previously we were also putting other metadata into triton_meta; but we should split out the other metadata into a separate dict to avoid possible conflicts in the future. This PR splits out triton_meta and inductor_meta so we have a place to put additional metadata that isn't intended to be passed to triton. Tests - wait for CI Differential Revision: [D50864493](https://our.internmc.facebook.com/intern/diff/D50864493) Pull Request resolved: https://github.com/pytorch/pytorch/pull/112351 Approved by: https://github.com/eellison	2023-11-02 00:40:12 +00:00
Jiong Gong	e061144aaf	[inductor] replace ops.div with ops.truediv (#112243 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/112243 Approved by: https://github.com/lezcano ghstack dependencies: #112234	2023-11-01 05:50:51 +00:00
Shunting Zhang	a1e222ef02	metric table (#109245 ) In dynamo/inductor, sometimes it helps to gather metrics/statistics for each model at different levels: model level, graph level, kernel level, or pair-of-fusion-nodes level. This kind of thing would be very easy to do with Scuba, but we only have Scuba in fbcode. This PR builds metric tables to solve part of the problem. Q: why not log to stdout/err directly A: sometimes we need more structured data. E.g., it would be helpful to gather all the stats in a CSV and then do post-processing (like calculating a geomean etc.). Also, the metric table tags each row with the model name, which is helpful. Q: what's the difference with speedup_indcutor.csv A: speedup_indcutor.csv is a special case that gathers statistics at the model level: i.e., we have one row for each model. But recording statistics at a finer-grained level, like per graph, is also helpful. Example use cases: - As a followup on the benchmark fusion PR, I want to gather all the 'slow' fusions and analyze them. With the metric table, I can easily log slow fusions for each model into a csv file. Here is the log gathered for huggingface: https://gist.github.com/shunting314/964e73cc98368b301414ec7b7ad4c702 . - To help understand the effect of the 'loop ordering after fusion' PR, it would be helpful to gather stats like how many fusions happen for each graph. Previously we logged the metric to stderr directly. But logging these metrics in a structured way is useful. - gather number of registers, register spills, shared memory usage for each kernel in each model with runnable kernel code logged. Pull Request resolved: https://github.com/pytorch/pytorch/pull/109245 Approved by: https://github.com/jansel, https://github.com/mlazos	2023-11-01 02:33:42 +00:00
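The idea of a metric table that tags each row with the model name and serializes to CSV can be sketched with the standard library. `ToyMetricTable` and its methods are hypothetical and do not reflect the API added by the PR; the kernel names and register counts are made-up sample data.

```python
import csv
import io


class ToyMetricTable:
    """Hypothetical table that tags every row with the current model name."""

    def __init__(self, column_names, model_name):
        self.column_names = ["model_name"] + list(column_names)
        self.model_name = model_name
        self.rows = []

    def add_row(self, **values):
        row = {"model_name": self.model_name}
        row.update(values)
        self.rows.append(row)

    def to_csv(self):
        buf = io.StringIO()
        writer = csv.DictWriter(
            buf, fieldnames=self.column_names, lineterminator="\n"
        )
        writer.writeheader()
        writer.writerows(self.rows)
        return buf.getvalue()


table = ToyMetricTable(["kernel", "num_registers"], model_name="BertForMaskedLM")
table.add_row(kernel="kernel_a", num_registers=32)  # made-up sample values
table.add_row(kernel="kernel_b", num_registers=48)
csv_text = table.to_csv()
print(csv_text)
```

Keeping the rows structured like this makes post-processing (for example, computing a geomean per model) a matter of reading the CSV back in.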
Shunting Zhang	fbafff3668	[reland][inductor] benchmark fusion (#112450 ) reland https://github.com/pytorch/pytorch/pull/108193 Pull Request resolved: https://github.com/pytorch/pytorch/pull/112450 Approved by: https://github.com/jansel	2023-10-31 18:17:06 +00:00
PyTorch MergeBot	64fd027f2e	Revert "[inductor] benchmark fusion (#108193 )" This reverts commit `73cc5d1cdd`. Reverted https://github.com/pytorch/pytorch/pull/108193 on behalf of https://github.com/izaitsevfb due to Trying to unblock the revert of #108690, please rebase and reland. ([comment](https://github.com/pytorch/pytorch/pull/108193#issuecomment-1782157638))	2023-10-27 01:40:06 +00:00
Shunting Zhang	73cc5d1cdd	[inductor] benchmark fusion (#108193 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/108193 Approved by: https://github.com/jansel	2023-10-26 22:18:37 +00:00
PyTorch MergeBot	485cc0faae	Revert "[inductor] benchmark fusion (#108193 )" This reverts commit `ec0cdcdf6a`. Reverted https://github.com/pytorch/pytorch/pull/108193 on behalf of https://github.com/ZainRizvi due to This test is breaking trunk. In the future please make sure to add the ciflow/trunk label before force merging any PR to ensure your code doesn't break those tests ([comment](https://github.com/pytorch/pytorch/pull/108193#issuecomment-1781473282))	2023-10-26 16:41:20 +00:00
Shunting Zhang	ec0cdcdf6a	[inductor] benchmark fusion (#108193 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/108193 Approved by: https://github.com/jansel	2023-10-26 04:14:22 +00:00
Guilherme Leobas	f97c2dabd9	Move negative index checking to common.py - Fix issue 97365 (#108690 ) Fixes https://github.com/pytorch/pytorch/issues/97365 Pull Request resolved: https://github.com/pytorch/pytorch/pull/108690 Approved by: https://github.com/lezcano	2023-10-24 17:27:54 +00:00
PyTorch MergeBot	e62c887bab	Revert "[inductor][BE] split triton_meta and inductor_meta (#111397 )" This reverts commit `070b94dc08`. Reverted https://github.com/pytorch/pytorch/pull/111397 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/111397#issuecomment-1776282039))	2023-10-24 00:52:24 +00:00
David Berard	070b94dc08	[inductor][BE] split triton_meta and inductor_meta (#111397 ) triton_meta is intended to be passed directly to triton. Previously we were also putting other metadata into triton_meta; but we should split out the other metadata into a separate dict to avoid possible conflicts in the future. This PR splits out triton_meta and inductor_meta so we have a place to put additional metadata that isn't intended to be passed to triton. Tests - wait for CI Differential Revision: [D50442547](https://our.internmc.facebook.com/intern/diff/D50442547) Pull Request resolved: https://github.com/pytorch/pytorch/pull/111397 Approved by: https://github.com/shunting314, https://github.com/eellison	2023-10-23 21:38:21 +00:00
Jon Chuang	9c7f464eef	[inductor]: Better debugging of `can_fuse` decisions with `TORCH_LOGS=fusion` (#110415 ) Fixes https://github.com/pytorch/pytorch/issues/110393 Example logs (for adagrad on main). In this case, it clearly identifies device mismatch as a potential red flag, which is indeed the obstacle to adagrad's successful fusion. (see: https://github.com/pytorch/pytorch/pull/110339)
```
[2023-10-03 21:50:24,084] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] ===== attempting fusion (1/10): 18 nodes =====
[2023-10-03 21:50:24,084] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (foreach:3): candidate consumer has no dep in any foreach producer
[2023-10-03 21:50:24,084] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (foreach:3): candidate consumer has no dep in any foreach producer
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (foreach:3): candidate consumer has no dep in any foreach producer
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (foreach:3): candidate consumer has no dep in any foreach producer
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (7): device mismatch (node1: cuda:0, node2: cpu)
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (7): device mismatch (node1: cuda:0, node2: cpu)
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (7): device mismatch (node1: cuda:0, node2: cpu)
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (7): device mismatch (node1: cuda:0, node2: cpu)
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (foreach:3): candidate consumer has no dep in any foreach producer
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (7): device mismatch (node1: cuda:0, node2: cpu)
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (foreach:3): candidate consumer has no dep in any foreach producer
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (foreach:3): candidate consumer has no dep in any foreach producer
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (7): device mismatch (node1: cuda:0, node2: cpu)
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (7): device mismatch (node1: cuda:0, node2: cpu)
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (7): device mismatch (node1: cuda:0, node2: cpu)
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (foreach:3): candidate consumer has no dep in any foreach producer
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] 13 possible fusions:
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] (ForeachKernelSchedulerNode(nodes=buf0_buf1_buf2_buf3), ForeachKernelSchedulerNode(nodes=buf4_buf5_buf6_buf7))
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] (ForeachKernelSchedulerNode(nodes=buf4_buf5_buf6_buf7), SchedulerNode(name='buf8'))
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] (ForeachKernelSchedulerNode(nodes=buf4_buf5_buf6_buf7), SchedulerNode(name='buf10'))
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] (ForeachKernelSchedulerNode(nodes=buf0_buf1_buf2_buf3), SchedulerNode(name='buf12'))
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] (ForeachKernelSchedulerNode(nodes=buf0_buf1_buf2_buf3), SchedulerNode(name='buf14'))
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] (ForeachKernelSchedulerNode(nodes=buf4_buf5_buf6_buf7), SchedulerNode(name='buf9'))
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] (ForeachKernelSchedulerNode(nodes=buf4_buf5_buf6_buf7), SchedulerNode(name='buf11'))
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] (ForeachKernelSchedulerNode(nodes=buf0_buf1_buf2_buf3), SchedulerNode(name='buf13'))
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] (ForeachKernelSchedulerNode(nodes=buf0_buf1_buf2_buf3), SchedulerNode(name='buf15'))
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] (SchedulerNode(name='buf25'), SchedulerNode(name='buf33'))
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] (SchedulerNode(name='buf43'), SchedulerNode(name='buf51'))
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] (SchedulerNode(name='buf34'), SchedulerNode(name='buf42'))
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] (SchedulerNode(name='buf16'), SchedulerNode(name='buf24'))
[2023-10-03 21:50:24,087] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] completed fusion round (1/10): fused 18 nodes into 5 nodes
[2023-10-03 21:50:24,087] [0/0] torch._inductor.scheduler.__schedule: [DEBUG]
[2023-10-03 21:50:24,087] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] ===== attempting fusion (2/10): 5 nodes =====
[2023-10-03 21:50:24,087] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (7): device mismatch (node1: cuda:0, node2: cpu)
[2023-10-03 21:50:24,087] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (7): device mismatch (node1: cuda:0, node2: cpu)
[2023-10-03 21:50:24,087] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (7): device mismatch (node1: cuda:0, node2: cpu)
[2023-10-03 21:50:24,087] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (7): device mismatch (node1: cuda:0, node2: cpu)
[2023-10-03 21:50:24,087] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] 0 possible fusions:
[2023-10-03 21:50:24,087] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] completed fusion round (2/10): fused 5 nodes into 5 nodes
[2023-10-03 21:50:24,087] [0/0] torch._inductor.scheduler.__schedule: [DEBUG]
[2023-10-03 21:50:24,087] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] ===== fusion complete (2 iterations) =====
```
CC @jansel @ngimel @mlazos @shunting314 @peterbell10 as code owners Pull Request resolved: https://github.com/pytorch/pytorch/pull/110415 Approved by: https://github.com/mlazos	2023-10-13 00:36:45 +00:00
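Logs in the format quoted in this commit can be summarized with a short script. The sketch below counts `cannot fuse` reasons in a three-line excerpt of the log; the regular expression is an assumption based only on the lines shown in the commit message, not on a documented log format.

```python
import re
from collections import Counter

# Excerpt copied from the example log in the commit message.
sample_log = """\
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (foreach:3): candidate consumer has no dep in any foreach producer
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (7): device mismatch (node1: cuda:0, node2: cpu)
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (7): device mismatch (node1: cuda:0, node2: cpu)
"""

# Group 1 is the tag in parentheses, group 2 is the stated reason.
pattern = re.compile(r"cannot fuse \(([^)]*)\): (.*)$")

reasons = Counter()
for line in sample_log.splitlines():
    match = pattern.search(line)
    if match:
        reasons[match.group(2)] += 1

top_reason, top_count = reasons.most_common(1)[0]
print(top_reason, top_count)
```

On a full log, the most frequent reason points at the likeliest fusion obstacle, which in the adagrad example is the device mismatch.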
Jack Taylor	96f616a054	Revert tl.int1 casting change for ROCm to avoid hangs (#110531 ) Seeing hangs on ROCm seemingly after this PR https://github.com/pytorch/pytorch/pull/110388 https://ossci-raw-job-status.s3.amazonaws.com/log/17381916785 `inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_exp2_cuda_bool Command took >30min, returning 124` Conditionalising out of this while we investigate. Pull Request resolved: https://github.com/pytorch/pytorch/pull/110531 Approved by: https://github.com/peterbell10	2023-10-06 08:53:45 +00:00
Kazuaki Ishizaki	434a996c42	Fix typo under torch/_inductor directory (#110530 ) This PR fixes typo of comments and messages in files under `torch/_dynamo` directory. Pull Request resolved: https://github.com/pytorch/pytorch/pull/110530 Approved by: https://github.com/kit1980	2023-10-05 02:17:20 +00:00
Peter Bell	dc794ec32c	[dynamo] Trace through builtin `abs` (#110398 ) In python `abs(x)` does nothing but delegate to `x.__abs__()` so we should do the same in dynamo. This also adds `SymNode.__abs__` so we can trace through indexing expressions involving `abs`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/110398 Approved by: https://github.com/jansel, https://github.com/lezcano	2023-10-03 19:25:37 +00:00
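The delegation that this commit relies on is plain Python behavior and can be checked without dynamo. `Offset` is a made-up class for the illustration.

```python
class Offset:
    """Made-up value type that defines its own absolute value."""

    def __init__(self, value):
        self.value = value

    def __abs__(self):
        return Offset(abs(self.value))


offset = Offset(-5)

# The builtin abs() dispatches to the object's __abs__ method.
via_builtin = abs(offset).value
via_method = offset.__abs__().value
print(via_builtin, via_method)
```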
Levy Zhao	7f0a659ccc	Script to compare measured (trace) runtimes with estimated runtimes (#108037 ) (#109076 ) Summary: X-link: https://github.com/pytorch/benchmark/pull/1856 Reviewed By: xmfan, xuzhao9 Differential Revision: D48523883 Pull Request resolved: https://github.com/pytorch/pytorch/pull/109076 Approved by: https://github.com/xw285cornell	2023-10-03 17:05:35 +00:00
Peter Bell	01b2f25ebd	[inductor] Cast loads from boolean tensors to `tl.int1` (#110388 ) Triton currently loads pointer to `tl.int1` as `tl.int8`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/110388 Approved by: https://github.com/lezcano, https://github.com/Skylion007	2023-10-02 22:52:08 +00:00
chilli	13681382d5	Add heuristic for when `evict_first` should be set (and some other minor things) (#108841 ) Example of when the `evict_first` heuristic helps. ``` @torch.compile def f(a, b): return (a * b).sum(dim=-1) N = 512 inps = (torch.randn(N, N, N).permute(2, 1, 0), torch.randn(N, N, N).permute(1, 2, 0)) from torch._inductor.utils import do_bench print(do_bench(lambda: f(*inps))) ``` This generates code like this: http://ix.io/4HFs ``` Original: 3.8 ms This PR: 3.54 ms Always `evict_first: 5.4ms ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/108841 Approved by: https://github.com/lezcano, https://github.com/jansel	2023-10-01 17:06:12 +00:00
Jon Chuang	6aae636f69	chore(inductor): Simplify `will_fusion_create_cycle` and cleanup to `node.ancestors` (#109976 ) recursive_predecessors == ancestors so rename. Improve comments Simplify `will_fusion_create_cycle` - make it easier to read and add detailed comments. Diagram to illustrate clarification of shortcut. ![Inductor Deep Dive](https://github.com/pytorch/pytorch/assets/9093549/7a30e088-8a33-4a9c-a8a7-81199cd086e2) CC: @ngimel Pull Request resolved: https://github.com/pytorch/pytorch/pull/109976 Approved by: https://github.com/jansel	2023-09-27 20:48:53 +00:00
Peter Bell	92d86cd1ad	[inductor] Fix triton compiler error in multilayer any (#109325 ) Fixes #109196 When we have a split reduction and the tensor is not an even multiple of the split size, we use `ops.masked` to pad to an even multiple. In the case here we generated: ```python tmp5 = tl.where(mask, tmp4, 0) ``` which implicitly promotes our boolean value to `int32`. The fix is to give the default value the same dtype as `result`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/109325 Approved by: https://github.com/lezcano	2023-09-26 12:29:29 +00:00
Ying Zhang	bbdce93571	Basic fp8 support in Inductor (#109168 ) Add basic fp8 support in Inductor, including: * Fix fp8 Triton codegen issues; * Add min_elements_per_thread requirement for fp8 related dtype conversions. More details on Triton implementation can be found from `10f59d8ce0/lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp (L10)`. Note that the current implementation only works for Pointwise. Will create follow-up PRs for Reduction. Pull Request resolved: https://github.com/pytorch/pytorch/pull/109168 Approved by: https://github.com/drisspg	2023-09-23 04:41:41 +00:00
Edward Z. Yang	3268b039ec	Handle unbacked symints in Triton size hints (#109609 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/109609 Approved by: https://github.com/yf225	2023-09-22 03:16:53 +00:00
PyTorch MergeBot	169ae7540d	Revert "Handle unbacked symints in Triton size hints (#109609 )" This reverts commit `654731a52b`. Reverted https://github.com/pytorch/pytorch/pull/109609 on behalf of https://github.com/ezyang due to this seems to regress HF perf ([comment](https://github.com/pytorch/pytorch/pull/109609#issuecomment-1729688883))	2023-09-21 14:25:42 +00:00
Edward Z. Yang	654731a52b	Handle unbacked symints in Triton size hints (#109609 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/109609 Approved by: https://github.com/yf225 ghstack dependencies: #109603	2023-09-20 18:03:54 +00:00
Sam Larsen	85d26f7868	[inductor] Enable mypy checking for torch/_inductor/codegen/triton.py (#109146 ) Summary: enable mypy checking for torch/_inductor/codegen/triton.py and make the minimum number of fixes / ignores to get the linter to pass Test Plan: `lintrunner -a` Pull Request resolved: https://github.com/pytorch/pytorch/pull/109146 Approved by: https://github.com/peterbell10	2023-09-19 23:01:03 +00:00
PyTorch MergeBot	800c665618	Revert "[inductor] Add ir.Scan and lower aten.cumsum on CUDA (#106581 )" This reverts commit `5976a08eea`. Reverted https://github.com/pytorch/pytorch/pull/106581 on behalf of https://github.com/peterbell10 due to This combined with #108803 uncovered a triton bug openai/triton#2298 ([comment](https://github.com/pytorch/pytorch/pull/106581#issuecomment-1719811113))	2023-09-14 16:58:52 +00:00
Yang Chen	9cd4548f01	AOTInductor dynamic shape (#109012 ) Summary: This PR adds dynamic-shape support for AOTInductor * On the runtime/interface side, we added two structs, StaticDimInfo and DynamicDimInfo, to hold values for static and dynamic dimensions, respectively. Dynamic dimensions are tracked by an unordered map field defined in AOTInductorModelBase. At inference time, the inference run method will assign the current real dimensional value to each dynamic dimension before executing any kernel. * On the CUDA wrapper codegen side, we generate dynamic symbols appropriately for shape computations. We simulate kernel launch grids in the C++ land by re-using the grid functions from the Python world. The returned grid configs, which may contain symbolic expressions, are printed out in their C++ forms via the CppPrinter. Note that when dynamic shapes are involved, we have to compute grid configs for each kernel at runtime in the same way as we do for launching the corresponding Triton kernel. Otherwise, we may end up with memory-access failures or mis-computations caused by invalid indices for fetching or storing data in device memory. Differential Revision: D49100472 Pull Request resolved: https://github.com/pytorch/pytorch/pull/109012 Approved by: https://github.com/khabinov, https://github.com/desertfire, https://github.com/hl475	2023-09-14 08:00:30 +00:00
Ying Zhang	097fd43f8c	[Inductor CUTLASS backend] Step 4: CUDA (template) kernels (#107931 ) This is the step 4 to add cutlass as an alternative inductor backend. Full tests can be found from the last PR in the stack. Feature request: https://github.com/pytorch/pytorch/issues/106991. Pull Request resolved: https://github.com/pytorch/pytorch/pull/107931 Approved by: https://github.com/aakhundov, https://github.com/jansel, https://github.com/kadeng ghstack dependencies: #107802, #107847, #107901	2023-09-12 17:44:38 +00:00
Peter Bell	5976a08eea	[inductor] Add ir.Scan and lower aten.cumsum on CUDA (#106581 ) This adds the `ir.Scan` node (currently only supported on CUDA) which re-uses the existing reduction kernel machinery to support different kinds of non-pointwise ops. Just like reductions it supports prologue and epilogue fusions and has both persistent and non-persistent kernel generation. Currently this doesn't support the equivalent of `Reduction.create_multilayer` and will instead fall back to eager in those cases. This is because splitting into multiple kernel invocations ends up being far slower than cub's single kernel strategy which matches the performance of a copy kernel. Pull Request resolved: https://github.com/pytorch/pytorch/pull/106581 Approved by: https://github.com/lezcano, https://github.com/atalman	2023-09-11 18:44:10 +00:00
David Berard	ed7f9cac91	[inductor] Add CPU-side profiler event names for templates and foreach kernels (#108449 ) This passes in the descriptive kernel name as part of the triton_meta dict that gets passed to the CachingAutotuner, for foreach kernels and templates. Before: <img width="684" alt="Screenshot 2023-09-01 at 11 56 02 AM" src="https://github.com/pytorch/pytorch/assets/5067123/c14e13fc-0d9e-425a-a08b-613ef42aa264"> After: <img width="562" alt="Screenshot 2023-09-01 at 2 13 00 PM" src="https://github.com/pytorch/pytorch/assets/5067123/551bb9a9-865b-401e-b6e0-8ebbe5431565"> This PR also refactors the "magic strings" (KERNEL_NAME and DESCRIPTIVE_KRNL_NAME) into an enum in utils.py. Pull Request resolved: https://github.com/pytorch/pytorch/pull/108449 Approved by: https://github.com/jansel	2023-09-09 02:11:13 +00:00
PyTorch MergeBot	8ba23e48fa	Revert "[inductor] Add ir.Scan and lower aten.cumsum on CUDA (#106581 )" This reverts commit `53a27021c5`. Reverted https://github.com/pytorch/pytorch/pull/106581 on behalf of https://github.com/atalman due to Sorry for reverting your change, but it broke rocm CI ([comment](https://github.com/pytorch/pytorch/pull/106581#issuecomment-1710776610))	2023-09-07 21:13:42 +00:00
Peter Bell	53a27021c5	[inductor] Add ir.Scan and lower aten.cumsum on CUDA (#106581 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/106581 Approved by: https://github.com/lezcano	2023-09-07 17:40:45 +00:00
Shunting Zhang	7cb4bf675b	[inductor] no-side-effect codegen (#107617 ) Inductor kernel codegen previously had the following side effects: - in `Kernel.__exit__ `, we add local used buffers in graph.removed_buffers - during codegen, we do memory allocation/free. These make it hard to do multiple versions of codegen for the same kernel. The PR refactors the code so that kernel codegen does not change graph-level state. After codegening a kernel, the graph-level state is unchanged, so we can go on to codegen another version of the kernel if we want. Pull Request resolved: https://github.com/pytorch/pytorch/pull/107617 Approved by: https://github.com/jansel	2023-08-31 00:25:17 +00:00
Shunting Zhang	556bfe7cb5	[inductor] let codegen not rely on node order (#107320 ) We'd like to benchmark fusion (either for autotuning or for gathering data to find some patterns that can guide optimizations). There is a deadlock here that prevents us from doing this: to benchmark fusion, we need to do codegen before all the fusions are done. However, codegen currently relies on xSchedulerNode.last_usage information to decide which buffers are not needed at all and thus don't even need to be allocated/written (Scheduler.removed_buffers tracks this). xSchedulerNode.last_usage information can only be computed once the order of all the nodes has been decided. But each fusion pass (`fuse_nodes_once`) can also change node orders. So we know the final node orders only after all the fusions have completed. That blocks us from doing codegen during fusion (before all fusions are done). Here I just show the above with a chain of dependencies to make it easier to understand (a -> b means a depends on b, or b has to happen before a): ``` benchmark one fusion decision -> codegen -> xSchedulerNode.last_usage -> node order -> all fusions have completed ``` Actually we only need to decide if a buffer has only local usages (if yes, it's a candidate for removal). This can be decided if we know all the users of each buffer. We can avoid using xSchedulerNode.last_usage in this case. Pull Request resolved: https://github.com/pytorch/pytorch/pull/107320 Approved by: https://github.com/peterbell10, https://github.com/jansel	2023-08-30 02:34:20 +00:00
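The "only local usages" test described in this commit needs just the set of users per buffer, not the node order. A rough sketch under that reading follows; `removal_candidates` and the dict-of-sets representation are hypothetical, not the scheduler's real data structures.

```python
def removal_candidates(written_buffers, buffer_users, kernel_nodes):
    """Return buffers written by the kernel whose users all live inside it.

    buffer_users maps a buffer name to the set of node names that read it.
    This toy ignores graph outputs, which a real check would have to keep.
    """
    kernel_nodes = set(kernel_nodes)
    return sorted(
        buf
        for buf in written_buffers
        if buffer_users.get(buf, set()) <= kernel_nodes
    )


buffer_users = {
    "buf0": {"node1"},           # read only inside the kernel
    "buf1": {"node1", "node3"},  # also read by a node outside the kernel
}
candidates = removal_candidates(
    written_buffers=["buf0", "buf1"],
    buffer_users=buffer_users,
    kernel_nodes=["node0", "node1"],
)
print(candidates)
```

Because the check depends only on who uses a buffer, it gives the same answer however the nodes end up being ordered.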
Elias Ellison	d040d5b9ee	Fix multi output layout error in indexing dtype calculation (#108085 ) Differential Revision: [D48757829](https://our.internmc.facebook.com/intern/diff/D48757829) Pull Request resolved: https://github.com/pytorch/pytorch/pull/108085 Approved by: https://github.com/yanboliang, https://github.com/davidberard98, https://github.com/jansel, https://github.com/peterbell10	2023-08-29 05:43:44 +00:00
Michael Lazos	d4a99631dd	Handle 2D blocking with foreach (#107840 ) Previously blocking in foreach ops was only 1D. This PR allows handling kernels with 2D blocking with foreach as well. Code when at least one dim matches: [example code + output](https://gist.github.com/mlazos/9f100b21cfe2540f0a24303a8349c196) Code when neither X or Y dim matches: [example code + output](https://gist.github.com/mlazos/14e2a455f635896dface09be601595dd) Pull Request resolved: https://github.com/pytorch/pytorch/pull/107840 Approved by: https://github.com/jansel	2023-08-26 11:02:46 +00:00
Shunting Zhang	95cacb7fa9	[reland][inductor] make thread order consistent with loop order (#107902 ) This PR relands https://github.com/pytorch/pytorch/pull/106827 which got reverted because it caused a compilation error for some ads model. Yanbo provided a repro in one of the 14k models ( `pytest ./generated/test_KaiyangZhou_deep_person_reid.py -k test_044`). This is also the model I used to confirm the fix and come up with a unit test. In this model, we call `triton_heuristics.triton_config` with size_hints [2048, 2]. Previously this would result in a triton config with XBLOCK=2048 and YBLOCK=2 . But since we changed the mapping between size_hints and XYZ dimensions, we now generate a triton config with XBLOCK=2 and YBLOCK=2048. This fails compilation since we set max YBLOCK to be 1024. My fix is to make sure we never generate a triton config that exceeds the maximum block size. Pull Request resolved: https://github.com/pytorch/pytorch/pull/107902 Approved by: https://github.com/jansel	2023-08-26 02:56:20 +00:00
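The fix described above amounts to clamping each block size to its per-dimension maximum. A minimal sketch, assuming a simple dict-based config: the `Y` limit of 1024 comes from the commit message, while the `X` and `Z` limits and the names `MAX_BLOCK` and `clamp_blocks` are placeholders for illustration.

```python
# Only the Y limit (1024) is stated in the commit message;
# the X and Z values here are assumed placeholders.
MAX_BLOCK = {"X": 2048, "Y": 1024, "Z": 1024}


def clamp_blocks(blocks):
    """Cap every block size at the maximum allowed for its dimension."""
    return {dim: min(size, MAX_BLOCK[dim]) for dim, size in blocks.items()}


# The failing case from the commit message: XBLOCK=2, YBLOCK=2048.
clamped = clamp_blocks({"X": 2, "Y": 2048})
print(clamped)
```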
PyTorch MergeBot	d35d7de60e	Revert "Handle 2D blocking with foreach (#107840 )" This reverts commit `f87ffe473d`. Reverted https://github.com/pytorch/pytorch/pull/107840 on behalf of https://github.com/huydhn due to Sorry for reverting this, but test_2d_blocking is failing in trunk, probably a landrace as PR was green ([comment](https://github.com/pytorch/pytorch/pull/107840#issuecomment-1694009217))	2023-08-25 22:49:15 +00:00
Michael Lazos	f87ffe473d	Handle 2D blocking with foreach (#107840 ) Previously blocking in foreach ops was only 1D. This PR allows handling kernels with 2D blocking with foreach as well. Code when at least one dim matches: [example code + output](https://gist.github.com/mlazos/9f100b21cfe2540f0a24303a8349c196) Code when neither X or Y dim matches: [example code + output](https://gist.github.com/mlazos/14e2a455f635896dface09be601595dd) Pull Request resolved: https://github.com/pytorch/pytorch/pull/107840 Approved by: https://github.com/jansel	2023-08-25 20:32:36 +00:00
Jackie (Jiaqi) Xu	398f4ae451	Back out "[inductor] make thread order consistent with loop order (#106827 )" (#107796 ) Summary: D48295371 causes batch fusion failure, which will block mc proposals on all mc models. e.g. cmf f470938179 Test Plan: Without revert, f469732293. With revert diff f472266199. Differential Revision: D48610062 Pull Request resolved: https://github.com/pytorch/pytorch/pull/107796 Approved by: https://github.com/yanboliang	2023-08-23 18:02:54 +00:00
lezcano	2b6249e209	Wrap indirect indexing on CUDA (#105055 ) Lifting this to CPU should be rather easy. @jgong5 Partially fixes https://github.com/pytorch/pytorch/issues/97365. I'd wait to close that issue once this works on CPU as well. This fix works with dynamic shapes as well. @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @ngimel @yf225 @chenyang78 @kadeng @muchulee8 Pull Request resolved: https://github.com/pytorch/pytorch/pull/105055 Approved by: https://github.com/peterbell10, https://github.com/jansel	2023-08-23 11:59:20 +00:00
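Assuming "wrapping" here means Python-style handling of negative indices (the linked issue is described elsewhere in this log as negative index checking), the idea can be sketched in plain Python. `gather_with_wrap` is a hypothetical helper, not generated kernel code.

```python
def gather_with_wrap(values, indices):
    """Gather values[i] for each index, mapping negative i to i + len(values)."""
    size = len(values)
    out = []
    for idx in indices:
        # Raw pointer arithmetic in a kernel would not wrap on its own,
        # so the wrap is made explicit before the load.
        wrapped = idx + size if idx < 0 else idx
        assert 0 <= wrapped < size, "index out of bounds"
        out.append(values[wrapped])
    return out


result = gather_with_wrap([10, 20, 30, 40], [0, -1, 2, -4])
print(result)
```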
PyTorch MergeBot	b282787409	Revert "Wrap indirect indexing on CUDA (#105055 )" This reverts commit `85c673e6b2`. Reverted https://github.com/pytorch/pytorch/pull/105055 on behalf of https://github.com/peterbell10 due to Causes failure in inductor_torchbench ([comment](https://github.com/pytorch/pytorch/pull/105055#issuecomment-1688871947))	2023-08-22 20:24:41 +00:00


249 Commits