pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
Jason Ansel	637074983e	[inductor] Make load_mask() codegen determinstic (#126017 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126017 Approved by: https://github.com/shunting314	2024-05-13 17:36:52 +00:00
Jiong Gong	3267814d53	[inductor] refactor: device dispatch inside do_bench (#125736 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125736 Approved by: https://github.com/shunting314	2024-05-09 23:50:02 +00:00
lezcano	320af5eaa6	Compute bounds for the variables created during codegen (#123100 ) Before we would just bail out on these bounds for all variables that did not come from the FX graph. Now we propagate the bounds whenever we have a rule for that op. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123100 Approved by: https://github.com/jgong5, https://github.com/peterbell10	2024-05-08 08:14:06 +00:00
PyTorch MergeBot	2a42c40791	Revert "Compute bounds for the variables created during codegen (#123100 )" This reverts commit `bb668c6468`. Reverted https://github.com/pytorch/pytorch/pull/123100 on behalf of https://github.com/huydhn due to Sorry for reverting you change but it is failing inductor tests `bb668c6468` ([comment](https://github.com/pytorch/pytorch/pull/123100#issuecomment-2096837821))	2024-05-06 20:23:39 +00:00
lezcano	bb668c6468	Compute bounds for the variables created during codegen (#123100 ) Before we would just bail out on these bounds for all variables that did not come from the FX graph. Now we propagate the bounds whenever we have a rule for that op. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123100 Approved by: https://github.com/jgong5, https://github.com/peterbell10	2024-05-06 18:12:15 +00:00
Aaron Gokaslan	1dd42e42c4	[BE]: Try TCH autofixes on torch/ (#125536 ) Tries TCH autofixes and see what breaks Pull Request resolved: https://github.com/pytorch/pytorch/pull/125536 Approved by: https://github.com/ezyang	2024-05-05 23:13:59 +00:00
Edward Z. Yang	6f70d22277	Extend torch.utils._sympy.symbol for more Inductor symbols (#125419 ) I'm still missing a few, cdzq at least Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/125419 Approved by: https://github.com/lezcano ghstack dependencies: #125395	2024-05-04 09:05:00 +00:00
Edward Z. Yang	5503c29357	Introduce torch.utils._sympy.symbol (#125395 ) This provides utilities for creating and querying properties on sympy.Symbol. I want to use this refactor to get a better handle on how the 's' prefix is being used in Inductor. To start, I only do symbolic_shapes code because that's what I'm familiar with. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/125395 Approved by: https://github.com/Skylion007	2024-05-03 21:24:23 +00:00
Kazuaki Ishizaki	9fec26e231	Fix typo under torch/_inductor directory (#119658 ) This PR fixes typo in comments and msgs under `torch/_inductor` directory, and also changes the corresponding test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119658 Approved by: https://github.com/colesbury	2024-04-30 22:28:56 +00:00
Sam Larsen	254128c16e	[inductor] Remove usage of device_interface from _inductor.runtime (#124592 ) Differential Revision: [D56723770](https://our.internmc.facebook.com/intern/diff/D56723770) Co-authored-by: Sam Larsen <slarsen@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124592 Approved by: https://github.com/masnesral	2024-04-30 16:54:16 +00:00
Shunting Zhang	dc514df2af	[inductor] add triton code to SchedulerNode.debug_str (#125091 ) Here is an example print: https://gist.github.com/shunting314/75c161368a833a535bd0d240b8099d7e Pull Request resolved: https://github.com/pytorch/pytorch/pull/125091 Approved by: https://github.com/jansel ghstack dependencies: #125090	2024-04-30 08:27:53 +00:00
chilli	7321005dd8	Add support for capturing tensors with score_mod (#124444 ) ``` import torch from torch import nn import torch.nn.functional as F import torch._inductor.config as config # torch.set_default_device('cuda') import torch from torch.nn.attention._templated_attention import _templated_attention as templated_attention from triton.testing import do_bench from torch.nn.attention import SDPBackend, sdpa_kernel index = torch.ops.aten torch.manual_seed(0) B = 16 H = 16 S = 2048 D = 64 head_scale = torch.randn(H, device='cuda') def alibi(score, batch, head, token_q, token_kv): return score + torch.ops.aten.index(head_scale, [head]) * (token_q - token_kv) bias = torch.randn(H, S, S, dtype=torch.float16, device='cuda') query = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16) key = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16) value = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16) compiled = torch.compile(templated_attention) out = compiled(query, key, value, score_mod=alibi) out2 = templated_attention(query, key, value,score_mod=alibi) print((out - out2).abs().mean()) assert (out - out2).abs().mean() < 1e-3 print("Flash (no mask): ", do_bench(lambda: F.scaled_dot_product_attention(query, key, value))) print("Flash (mask): ", do_bench(lambda: F.scaled_dot_product_attention(query, key, value, attn_mask=bias))) print("flexattention: ", do_bench(lambda: compiled(query, key, value, score_mod=alibi))) ``` <img width="324" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/18c175d0-2720-4dfd-8747-85b8a8f609f5"> Differential Revision: [D56583900](https://our.internmc.facebook.com/intern/diff/D56583900) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124444 Approved by: https://github.com/jansel, https://github.com/drisspg	2024-04-26 01:02:28 +00:00
PyTorch MergeBot	f6ce94dca5	Revert "[inductor] Remove usage of device_interface from _inductor.runtime (#124592 )" This reverts commit `5d45eb77f1`. Reverted https://github.com/pytorch/pytorch/pull/124592 on behalf of https://github.com/jeanschmidt due to breaking internal tests, check D56522594 ([comment](https://github.com/pytorch/pytorch/pull/124592#issuecomment-2076957668))	2024-04-25 11:28:23 +00:00
PyTorch MergeBot	0ca1ff3dce	Revert "Add support for capturing tensors with score_mod (#124444 )" This reverts commit `7c253a7776`. Reverted https://github.com/pytorch/pytorch/pull/124444 on behalf of https://github.com/jeanschmidt due to Breaking internal tests, check D56522566 ([comment](https://github.com/pytorch/pytorch/pull/124444#issuecomment-2076908582))	2024-04-25 10:56:38 +00:00
chilli	7c253a7776	Add support for capturing tensors with score_mod (#124444 ) ``` import torch from torch import nn import torch.nn.functional as F import torch._inductor.config as config # torch.set_default_device('cuda') import torch from torch.nn.attention._templated_attention import _templated_attention as templated_attention from triton.testing import do_bench from torch.nn.attention import SDPBackend, sdpa_kernel index = torch.ops.aten torch.manual_seed(0) B = 16 H = 16 S = 2048 D = 64 head_scale = torch.randn(H, device='cuda') def alibi(score, batch, head, token_q, token_kv): return score + torch.ops.aten.index(head_scale, [head]) * (token_q - token_kv) bias = torch.randn(H, S, S, dtype=torch.float16, device='cuda') query = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16) key = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16) value = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16) compiled = torch.compile(templated_attention) out = compiled(query, key, value, score_mod=alibi) out2 = templated_attention(query, key, value,score_mod=alibi) print((out - out2).abs().mean()) assert (out - out2).abs().mean() < 1e-3 print("Flash (no mask): ", do_bench(lambda: F.scaled_dot_product_attention(query, key, value))) print("Flash (mask): ", do_bench(lambda: F.scaled_dot_product_attention(query, key, value, attn_mask=bias))) print("flexattention: ", do_bench(lambda: compiled(query, key, value, score_mod=alibi))) ``` <img width="324" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/18c175d0-2720-4dfd-8747-85b8a8f609f5"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124444 Approved by: https://github.com/jansel, https://github.com/drisspg	2024-04-23 17:54:08 +00:00
Jason Ansel	5d45eb77f1	[inductor] Remove usage of device_interface from _inductor.runtime (#124592 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124592 Approved by: https://github.com/masnesral	2024-04-23 17:51:25 +00:00
PyTorch MergeBot	4f3e1f1c93	Revert "Add support for capturing tensors with score_mod (#124444 )" This reverts commit `e0c5113dec`. Reverted https://github.com/pytorch/pytorch/pull/124444 on behalf of https://github.com/malfet due to This is weird, but somehow profile test started to timeout after this PR, see https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=noGPU_AVX512 ([comment](https://github.com/pytorch/pytorch/pull/124444#issuecomment-2072506731))	2024-04-23 14:39:37 +00:00
chilli	e0c5113dec	Add support for capturing tensors with score_mod (#124444 ) ``` import torch from torch import nn import torch.nn.functional as F import torch._inductor.config as config # torch.set_default_device('cuda') import torch from torch.nn.attention._templated_attention import _templated_attention as templated_attention from triton.testing import do_bench from torch.nn.attention import SDPBackend, sdpa_kernel index = torch.ops.aten torch.manual_seed(0) B = 16 H = 16 S = 2048 D = 64 head_scale = torch.randn(H, device='cuda') def alibi(score, batch, head, token_q, token_kv): return score + torch.ops.aten.index(head_scale, [head]) * (token_q - token_kv) bias = torch.randn(H, S, S, dtype=torch.float16, device='cuda') query = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16) key = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16) value = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16) compiled = torch.compile(templated_attention) out = compiled(query, key, value, score_mod=alibi) out2 = templated_attention(query, key, value,score_mod=alibi) print((out - out2).abs().mean()) assert (out - out2).abs().mean() < 1e-3 print("Flash (no mask): ", do_bench(lambda: F.scaled_dot_product_attention(query, key, value))) print("Flash (mask): ", do_bench(lambda: F.scaled_dot_product_attention(query, key, value, attn_mask=bias))) print("flexattention: ", do_bench(lambda: compiled(query, key, value, score_mod=alibi))) ``` <img width="324" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/18c175d0-2720-4dfd-8747-85b8a8f609f5"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124444 Approved by: https://github.com/jansel, https://github.com/drisspg	2024-04-23 06:20:13 +00:00
Jason Ansel	0093735ccd	[inductor] Use compile time config values in runtime (#124561 ) This removes usage of torch._inductor.config from `torch._inductor.runtime`. Fixing two issues: 1) If configs change we should really use the compile time ones 2) In compile workers, we want to use the parent process config Pull Request resolved: https://github.com/pytorch/pytorch/pull/124561 Approved by: https://github.com/yanboliang ghstack dependencies: #124552, #124553, #124557, #124559, #124560, #124569	2024-04-22 18:46:40 +00:00
Jason Ansel	cb9fe91f5c	[inductor] Remove config check for 3D tiling (#124569 ) This makes the check per-kernel (if 3D tiling is used), rather than global config. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124569 Approved by: https://github.com/yanboliang ghstack dependencies: #124552, #124553, #124557, #124559, #124560	2024-04-22 18:46:40 +00:00
Jason Ansel	7fd8870e6b	[inductor] Refactor runtime files into torch._inductor.runtime (part 3) (#124557 ) I am planning to make the compile_worker process not import torch so it can start up much faster. This stack is prep for that. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124557 Approved by: https://github.com/yanboliang ghstack dependencies: #124552, #124553	2024-04-22 18:46:24 +00:00
Jason Ansel	bb8815bc31	[inductor] Refactor runtime files into torch._inductor.runtime (part 2) (#124553 ) I am planning to make the compile_worker process not import torch so it can start up much faster. This stack is prep for that. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124553 Approved by: https://github.com/yanboliang ghstack dependencies: #124552	2024-04-22 18:46:20 +00:00
Jason Ansel	480585fd2b	[inductor] Refactor runtime files into torch._inductor.runtime (part 1) (#124552 ) I am planning to make the compile_worker process not import torch so it can start up much faster. This stack is prep for that. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124552 Approved by: https://github.com/yanboliang	2024-04-22 18:41:12 +00:00
PyTorch MergeBot	16eea7c6a5	Revert "[inductor] Refactor runtime files into torch._inductor.runtime (part 1) (#124552 )" This reverts commit `a7035cc11a`. Reverted https://github.com/pytorch/pytorch/pull/124552 on behalf of https://github.com/jeanschmidt due to There are internal breakages, already discussed with author and he'll FF ([comment](https://github.com/pytorch/pytorch/pull/124552#issuecomment-2070548223))	2024-04-22 18:28:05 +00:00
PyTorch MergeBot	56714cb497	Revert "[inductor] Refactor runtime files into torch._inductor.runtime (part 2) (#124553 )" This reverts commit `f4d47f5bbb`. Reverted https://github.com/pytorch/pytorch/pull/124553 on behalf of https://github.com/jeanschmidt due to There are internal breakages, already discussed with author and he'll FF ([comment](https://github.com/pytorch/pytorch/pull/124552#issuecomment-2070548223))	2024-04-22 18:28:05 +00:00
PyTorch MergeBot	0b90af0bf5	Revert "[inductor] Refactor runtime files into torch._inductor.runtime (part 3) (#124557 )" This reverts commit `fcf28b0ad5`. Reverted https://github.com/pytorch/pytorch/pull/124557 on behalf of https://github.com/jeanschmidt due to There are internal breakages, already discussed with author and he'll FF ([comment](https://github.com/pytorch/pytorch/pull/124552#issuecomment-2070548223))	2024-04-22 18:28:05 +00:00
PyTorch MergeBot	8973c5b846	Revert "[inductor] Remove config check for 3D tiling (#124569 )" This reverts commit `317c0af149`. Reverted https://github.com/pytorch/pytorch/pull/124569 on behalf of https://github.com/jeanschmidt due to There are internal breakages, already discussed with author and he'll FF ([comment](https://github.com/pytorch/pytorch/pull/124552#issuecomment-2070548223))	2024-04-22 18:28:05 +00:00
PyTorch MergeBot	30dec1da84	Revert "[inductor] Use compile time config values in runtime (#124561 )" This reverts commit `3af12447f8`. Reverted https://github.com/pytorch/pytorch/pull/124561 on behalf of https://github.com/jeanschmidt due to There are internal breakages, already discussed with author and he'll FF ([comment](https://github.com/pytorch/pytorch/pull/124561#issuecomment-2070537634))	2024-04-22 18:24:38 +00:00
Jason Ansel	3af12447f8	[inductor] Use compile time config values in runtime (#124561 ) This removes usage of torch._inductor.config from `torch._inductor.runtime`. Fixing two issues: 1) If configs change we should really use the compile time ones 2) In compile workers, we want to use the parent process config Pull Request resolved: https://github.com/pytorch/pytorch/pull/124561 Approved by: https://github.com/yanboliang ghstack dependencies: #124552, #124553, #124557, #124559, #124560, #124569	2024-04-22 04:51:30 +00:00
Jason Ansel	317c0af149	[inductor] Remove config check for 3D tiling (#124569 ) This makes the check per-kernel (if 3D tiling is used), rather than global config. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124569 Approved by: https://github.com/yanboliang ghstack dependencies: #124552, #124553, #124557, #124559, #124560	2024-04-22 04:51:30 +00:00
Jason Ansel	fcf28b0ad5	[inductor] Refactor runtime files into torch._inductor.runtime (part 3) (#124557 ) I am planning to make the compile_worker process not import torch so it can start up much faster. This stack is prep for that. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124557 Approved by: https://github.com/yanboliang ghstack dependencies: #124552, #124553	2024-04-22 04:51:15 +00:00
Jason Ansel	f4d47f5bbb	[inductor] Refactor runtime files into torch._inductor.runtime (part 2) (#124553 ) I am planning to make the compile_worker process not import torch so it can start up much faster. This stack is prep for that. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124553 Approved by: https://github.com/yanboliang ghstack dependencies: #124552	2024-04-22 04:51:09 +00:00
Jason Ansel	a7035cc11a	[inductor] Refactor runtime files into torch._inductor.runtime (part 1) (#124552 ) I am planning to make the compile_worker process not import torch so it can start up much faster. This stack is prep for that. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124552 Approved by: https://github.com/yanboliang	2024-04-22 04:51:05 +00:00
Shunting Zhang	c5a4ba2257	[inductor] consider pointwise nodes when deciding reduction hint (#124131 ) In certain rare scenarios, inductor can generate a reduction kernel with really bad perf. E.g., if - the reduction kernel contains a reduction node followed by a pointwise node - And the pointwise node use a transposed layout. - the reduction node is an inner reduction - and rnumel <= 1024 , then inductor will generate a persistent reduction kernel and it causes really bad perf when doing tl.store for the pointwise node since we use a very skinny tile `(XBLOCK=1, RBLOCK=next_power_of_2(rnumel))` . I've tried a few version of fix. - The first version is, if I found any pointwise node in a reduction kernel uses a non-contiguous dependency, we use ReductionHint.DEFAULT. This cause 8s compilation time increase for huggingface with no perf wins... The reason is ReductionHint.DEFAULT does more autotunings. - Then I changed the code to be more specific. We change the hint from INNER to DEFAULT if we are sure that the pointwise kernel can use a >1 stride for the lowest dimension. Kernels meet this condition should mostly have really bad perf anyways. The situation mentioned above is rare. But it's reported by internal users. I'll also run one more perf test. Testing script: https://gist.github.com/shunting314/9d3389891fa43633b49b8b7564ad6d8b . Something equivalent is also added as a unit test. For this specific test from user reports, we improve the mentioned reduction kernels perf by 4.14x (451us -> 109us) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124131 Approved by: https://github.com/jansel	2024-04-20 05:07:56 +00:00
eellison	000d55870a	Enable in oss (#124031 ) Biggest movement is 4% HF inference, 9% TIMM inference. Note, this is max-autotune mode so we are more tolerant of compilation increases. We could improve compilation time by limiting: ``` # Take how many of the top triton kernels to benchmark epilogue max_epilogue_benchmarked_choices = 3 ``` There is a hf_Whisper failure which you can repro on main without this stack with `TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_BACKENDS=TRITON TORCHINDUCTOR_MAX_AUTOTUNE=1 python benchmarks/dynamo/torchbench.py --backend inductor --amp --accuracy --training --only hf_Whisper`. When you turn off epilogue fusion, it fixes the accuracy. I bisected the failure to an epilogue, however when you compare the results of that epilogue with the corresponding separate kernels the results of the output are equivalent. Inference: <img width="1686" alt="image" src="https://github.com/pytorch/pytorch/assets/11477974/0b240080-cd33-4c08-89d3-583103b1fb0c"> Training: <img width="1329" alt="Screenshot 2024-04-16 at 6 16 30 PM" src="https://github.com/pytorch/pytorch/assets/11477974/db0afcc9-7288-4c27-84ce-4fc1a5690788"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124031 Approved by: https://github.com/Chillee, https://github.com/shunting314 ghstack dependencies: #124030, #122642, #123229, #122825	2024-04-19 20:28:55 +00:00
Xuehai Pan	93e249969b	[BE] enable `ruff` rule `RSE` and remove useless parentheses in `raise` statements (#124261 ) Remove useless parentheses in `raise` statements if the exception type is raised with no argument. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124261 Approved by: https://github.com/albanD	2024-04-17 19:29:34 +00:00
Edward Z. Yang	efa36ef092	Natively support int truncation, don't guard on positive/negative (#122827 ) This doesn't entirely fix the original problem that prompted this, but it seems to just be getting stuck in export constraint formatting now which seems like progress to me. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122827 Approved by: https://github.com/avikchaudhuri	2024-04-11 15:22:32 +00:00
leslie-fang-intel	bac2a39aee	[Inductor] [ReImplement] Outer Loop Fusion for CPP Backend (#121625 ) Summary Re-implement of https://github.com/pytorch/pytorch/pull/121064 Test Plan ``` python -u -m pytest -s -v test_cpu_repro.py -k test_outer_loop_fusion ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/121625 Approved by: https://github.com/lezcano, https://github.com/jgong5	2024-04-05 06:24:57 +00:00
Gao Tianlin	aaef246c74	remove log2 decomposition; add log2 lowering (#123112 ) Same reason as `log10`. `log2` is a core aten op, we should not decompose it. As https://github.com/pytorch/pytorch/pull/110882 suggested, it often maps to a hardware intrinsic; Furthermore, decomposing it will negatively impact the numerical precision of the output. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123112 Approved by: https://github.com/peterbell10	2024-04-02 16:16:26 +00:00
Peter Bell	09c72eaa3f	[inductor] Remove identity from ops.scan (#119727 ) Currently scan has an `init` argument which must be the identity of the combine function. This isn't strictly necessary if we are more careful about keeping track of the first element and avoid combining it with anything. This does additionally require that there are no active load masks, since we can't do the `where_cond` any more. However, this shouldn't be possible anyway since scans are always realized and only fused via the scheduler. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119727 Approved by: https://github.com/lezcano	2024-04-01 22:47:26 +00:00
Edward Z. Yang	3178ba0dc9	Don't use sympy Float functions, use an opaque one with no reasoning (#122823 ) Sympy simplifications don't obey floating point semantics, so don't use Sympy for this. Keep them as is, only evaluate with the reference implementations when all arguments are known. This may end up getting subsumed by some other changes later, but I wanted to understand if this was easy and it seems to be easy. This doesn't actually depend on the earlier diffs on the stack and I can detach it. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122823 Approved by: https://github.com/lezcano	2024-03-29 19:13:55 +00:00
eellison	cbbed46377	Defer selection of triton template (#120275 ) Our prior approach to epilogue fusion was to select from a choice from a set of triton templates and extern calls based on benchmarking inputs, then unconditionally fuse epilogues. This can be sub-optimal in following ways: - We select an extern kernel, however an epilogue like relu() exists such that choosing a triton template + relu would have been faster - We select a triton template, epilogue fuse, and register spilling occurs causing it to be slower than not epilogue fusing. In this PR we wait to select either the Triton Template or Extern Kernel based on benchmarking results from the kernel itself and its epilogue. As soon as a successful fusion occurs where a fused Triton Template + epilogue is faster than the unfused choice we finalize the MultiTemplateBuffer as a specific template. If no fusion occurs we'll finalize the MultiTemplateBuffer after fusion. Note: if there are multiple epilogue fusions (not super likely), even though we select a template after the first fusion, we will still benchmark to see if subsequent epilogue are worth fusing. We could potentially defer choosing template in this case in a follow up at expense of compile time. Gives 4% HF training win, 10% TIMM inference win. Increases compilation time which I will be trying to address more in follow up prs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120275 Approved by: https://github.com/jansel ghstack dependencies: #121996	2024-03-20 01:40:33 +00:00
Isuru Fernando	409b1a6081	Add lowering for cummax, cummin (#120429 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120429 Approved by: https://github.com/peterbell10	2024-03-15 19:04:38 +00:00
eellison	6ca9ae4f86	Express y grid > 2^16 in terms of z grid (#121554 ) CUDA has a max y_grid of 65535. If we're computing larger than that we can compose it in terms of z grid, which is currently unused in inductor codegen. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121554 Approved by: https://github.com/aakhundov	2024-03-12 02:36:19 +00:00
Elias Ellison	5b5d423c2e	Benchmark templates (#118880 ) Adding support for benchmarking templates in `benchmark_fusion` Pull Request resolved: https://github.com/pytorch/pytorch/pull/118880 Approved by: https://github.com/shunting314	2024-03-11 23:55:13 +00:00
Peter Bell	168a04e752	[inductor] Changes to support newer triton pin (#121267 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121267 Approved by: https://github.com/lezcano ghstack dependencies: #121438	2024-03-09 18:17:36 +00:00
Peter Bell	459c5bca58	[inductor] Refactor common triton imports into one function (#121438 ) This means when codegen depends on a particular import we only need to add it in one place and it's applied to all triton kernels. This also changes codegen slightly so instead of generating `@pointwise` we now generate `@triton_heuristics.pointwise` just so the imports are the same for all kernel types. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121438 Approved by: https://github.com/lezcano	2024-03-09 18:17:36 +00:00
Peter Bell	8887c95004	[inductor] Skip welford combine on first reduciton loop iteration (#121488 ) On the first iteration we short circuit `welford_reduce` since we know the accumulators are filled with the default values. This is split out from #120330 to hopefully avoid the meta-internal failure. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121488 Approved by: https://github.com/lezcano	2024-03-08 23:40:48 +00:00
Oguz Ulgen	6566b3db67	Add an autotune cache for inductor generated kernels (#120963 ) Summary: Inductor currently has a best config cache for kernels that it generates. This is a local cache done via writing to the file system. This diff takes this local cache to remote by reusing the existing triton caching mechanism built via Memcache internally and Redis externally. Test Plan: tested locally using `TORCH_INDUCTOR_AUTOTUNE_REMOTE_CACHE =1` Look at scuba to verify the local testing: https://fburl.com/scuba/triton_remote_cache/z6pypznk The plan is to land this diff with this turned off and gradually introduce this. Differential Revision: D54398076 Pull Request resolved: https://github.com/pytorch/pytorch/pull/120963 Approved by: https://github.com/jansel	2024-03-04 16:58:37 +00:00
PyTorch MergeBot	0b924d7cde	Revert "[inductor] Optimize welford reduction (#120330 )" This reverts commit `7eb7ac815f`. Reverted https://github.com/pytorch/pytorch/pull/120330 on behalf of https://github.com/kit1980 due to Broke internal tests, see D54230858 ([comment](https://github.com/pytorch/pytorch/pull/120330#issuecomment-1971878323))	2024-02-29 20:12:50 +00:00

1 2 3 4 5 ...

362 Commits