pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
PyTorch MergeBot	9a7c2f1f64	Revert "Add torch compile force disable caches alias (#158072 )" This reverts commit `2ecf083b72`. Reverted https://github.com/pytorch/pytorch/pull/158072 on behalf of https://github.com/jeffdaily due to fails on rocm, signal ignored while rocm was unstable ([comment](https://github.com/pytorch/pytorch/pull/158072#issuecomment-3086740829))	2025-07-18 04:58:24 +00:00
Jack Taylor	7ebbf2cae7	Revert "[PT2][fusion] ban fusions with large accumulated reads (#157563)" (#158550) This reverts commit `8554c8007d` (#157563) because it caused a few breakages on ROCm. Reverted expected_results.csv to `26807dcf27`. > @xuanzhang816 Sorry, but I have to revert this PR yet again because it clearly reintroduced failures on ROCm after the remerge: `f4d8bc46c7/2` and the failures are still showing up on tip-of-tree on HUD. Context: https://github.com/pytorch/pytorch/pull/157563#issuecomment-3083350857 Needs to be relanded in a non-BC-breaking way, or sanity-checked for correctness. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158550 Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily	2025-07-17 19:47:41 +00:00
Oguz Ulgen	2ecf083b72	Add torch compile force disable caches alias (#158072) A bunch of people keep thinking the current alias only disables the inductor cache because it has the name inductor in it. Let's globalize the name. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158072 Approved by: https://github.com/ezyang	2025-07-17 15:40:36 +00:00
Shangdi Yu	82a1ee1135	Refactor Provenance Tracking (#158399 ) Summary: As inductor provenance tracking is getting more use cases, we want to separate the inductor provenance tracking guarding flag from the general `trace.enabled`, so we can enable provenance tracking without all the overhead of `trace.enabled` - change the guard flag from `trace.enabled` to `trace.provenance_tracking`. It is turned on by either `TORCH_COMPILE_DEBUG=1` or `INDUCTOR_PROVENANCE=1`. - Move the provenance tracking logic and variables out of DebugContext, because DebugContext is only enabled with `trace.enabled`. Since the variables are now global variables, added `reset_provenance_globals()` context manager to reset them for each `compile_fx()` call. - Move `set_kernel_post_grad_provenance_tracing` from `util.py` to `debug.py` so now all provenance related logic is in `debug.py`. In the future, if we want to enable it further, we can change the provenance tracking flag to be enabled when `TORCH_TRACE` is set. I think we should do that in a separate PR, so it's easier to revert if this flag change creates any problem. See more motivation in internal Diff Test Plan: ``` buck2 run mode/dev-nosan fbcode//caffe2/test:fx -- -r test_graph_transform_observer buck run mode/dev-nosan fbcode//caffe2/test:fx -- -r graph_provenance buck2 run mode/dev-nosan fbcode//caffe2/test/inductor:provenance_tracing ``` Differential Revision: D78287976 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158399 Approved by: https://github.com/angelayi	2025-07-17 00:23:00 +00:00
Xuan Zhang	8554c8007d	[PT2][fusion] ban fusions with large accumulated reads (#157563) Problem: Fusion can accumulate a large number of reads, which leads to a significant increase in peak memory utilization. Imagine we have the following code snippet ``` total = torch.rand(N, N) for _ in range(r): x = torch.rand(N, N) total = total + x ``` The default execution is memory efficient as only two tensors of size N-by-N are in memory at any given time. However, with fusion, the additions are fused into a single operation and the execution becomes something like: ``` x_1 = torch.rand(N, N) x_2 = torch.rand(N, N) ... x_r = torch.rand(N, N) total = x_1 + x_2 + ... + x_r ``` Though this is run-time efficient, in the case of large `N` and/or large `r`, this is not memory efficient. [internal only] see [post](https://fb.workplace.com/groups/1075192433118967/permalink/1703374333634104/) for additional details Solution: Our proposed solution is to ban fusions in cases where a large amount of reads is accumulated. This is in addition to some existing logic during torch compile. * During lowering (i.e., `ir.py`), the config `realize_acc_reads_threshold`, which defaults to 8, controls _the number of_ buffers that can be accumulated for a single operator. However, this is oblivious to the size of the buffers. Hence, we additionally introduce a config `realize_acc_reads_size_threshold` to control _the amount of buffers_ in size that can be accumulated. * During scheduling (i.e., `scheduler.py`), additional fusion will be performed and thus we also need to capture such patterns there. The decisions are implemented under `choices.py`. Results: For a small example similar to the one in the test case (but with larger `N` and a higher number of loop repeats), the memory snapshots before and after are shown below. Note the snapshot on the right is zoomed out so that the y-axes of the two snapshots match.
<img width="1328" alt="image" src="https://github.com/user-attachments/assets/670b5961-8454-4379-ae0f-62d4e7946c64" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/157563 Approved by: https://github.com/jansel, https://github.com/mlazos	2025-07-16 01:05:25 +00:00
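The peak-memory argument in the #157563 message above can be made concrete with a little arithmetic. The following is a minimal sketch in plain Python, not Inductor code: every function name is hypothetical, float32 elements (4 bytes) are assumed, and the threshold check only illustrates the idea of a size-aware limit rather than how `realize_acc_reads_size_threshold` is actually evaluated.

```python
# Back-of-the-envelope model of the peak-memory argument in #157563.
# All names are hypothetical illustrations; none of this is Inductor code.

def tensor_bytes(n: int, itemsize: int = 4) -> int:
    """Bytes held by one N-by-N tensor (itemsize=4 assumes float32)."""
    return n * n * itemsize


def peak_bytes_unfused(n: int, r: int, itemsize: int = 4) -> int:
    """Unfused loop: only `total` and the current `x` are live, regardless of r."""
    return 2 * tensor_bytes(n, itemsize)


def peak_bytes_fused(n: int, r: int, itemsize: int = 4) -> int:
    """Fused form: x_1 .. x_r are all materialized before the single add."""
    return r * tensor_bytes(n, itemsize)


def exceeds_size_threshold(num_reads: int, bytes_per_read: int, threshold_bytes: int) -> bool:
    """Hypothetical size-aware check: compare accumulated read bytes to a budget."""
    return num_reads * bytes_per_read > threshold_bytes


n, r = 4096, 32
print(peak_bytes_unfused(n, r) // 2**20, "MiB unfused")  # 128 MiB unfused
print(peak_bytes_fused(n, r) // 2**20, "MiB fused")      # 2048 MiB fused
```

A count-only limit such as `realize_acc_reads_threshold` treats eight small reads and eight large reads the same, which is why the message introduces a second threshold expressed in size.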
PyTorch MergeBot	26807dcf27	Revert "[PT2][fusion] ban fusions with large accumulated reads (#157563)" This reverts commit `c062550a35`. Reverted https://github.com/pytorch/pytorch/pull/157563 on behalf of https://github.com/clee2000 due to broke test_linear_and_cel on main `c062550a35`, caused OOM? Also broken on PR, Dr. CI classification is wrong (claims the test is disabled by an issue but the issue is for a different test). Also I'm pretty sure the expected results json is supposed to have a ton of empty lines, it's to prevent merge conflicts, I will add it to the linter ([comment](https://github.com/pytorch/pytorch/pull/157563#issuecomment-3074355331))	2025-07-15 16:35:55 +00:00
Xiangyang (Mark) Guo	156a377f4c	[AOTI][CPP] add flag TORCHINDUCTOR_CPP_FORCE_INLINE_KERNEL (#157949 ) Summary: Add flag TORCHINDUCTOR_CPP_FORCE_INLINE_KERNEL to force inline the kernel function when TORCHINDUCTOR_CPP_FORCE_INLINE_KERNEL=1. It's disabled by default because force inlining may increase the build time. Differential Revision: D77915987 Pull Request resolved: https://github.com/pytorch/pytorch/pull/157949 Approved by: https://github.com/desertfire	2025-07-15 10:51:43 +00:00
Xuan Zhang	c062550a35	[PT2][fusion] ban fusions with large accumulated reads (#157563) Problem: Fusion can accumulate a large number of reads, which leads to a significant increase in peak memory utilization. Imagine we have the following code snippet ``` total = torch.rand(N, N) for _ in range(r): x = torch.rand(N, N) total = total + x ``` The default execution is memory efficient as only two tensors of size N-by-N are in memory at any given time. However, with fusion, the additions are fused into a single operation and the execution becomes something like: ``` x_1 = torch.rand(N, N) x_2 = torch.rand(N, N) ... x_r = torch.rand(N, N) total = x_1 + x_2 + ... + x_r ``` Though this is run-time efficient, in the case of large `N` and/or large `r`, this is not memory efficient. [internal only] see [post](https://fb.workplace.com/groups/1075192433118967/permalink/1703374333634104/) for additional details Solution: Our proposed solution is to ban fusions in cases where a large amount of reads is accumulated. This is in addition to some existing logic during torch compile. * During lowering (i.e., `ir.py`), the config `realize_acc_reads_threshold`, which defaults to 8, controls _the number of_ buffers that can be accumulated for a single operator. However, this is oblivious to the size of the buffers. Hence, we additionally introduce a config `realize_acc_reads_size_threshold` to control _the amount of buffers_ in size that can be accumulated. * During scheduling (i.e., `scheduler.py`), additional fusion will be performed and thus we also need to capture such patterns there. The decisions are implemented under `choices.py`. Results: For a small example similar to the one in the test case (but with larger `N` and a higher number of loop repeats), the memory snapshots before and after are shown below. Note the snapshot on the right is zoomed out so that the y-axes of the two snapshots match.
<img width="1328" alt="image" src="https://github.com/user-attachments/assets/670b5961-8454-4379-ae0f-62d4e7946c64" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/157563 Approved by: https://github.com/jansel, https://github.com/mlazos	2025-07-14 22:27:21 +00:00
PyTorch MergeBot	6ea91f0672	Revert "[Inductor] Set the default value of min_chunk_size to 512 (#150762 )" This reverts commit `3321acc92e`. Reverted https://github.com/pytorch/pytorch/pull/150762 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but an inductor compilation error shows up in trunk ([comment](https://github.com/pytorch/pytorch/pull/150762#issuecomment-3070286787))	2025-07-14 16:58:13 +00:00
Sun, Jiayi	3321acc92e	[Inductor] Set the default value of min_chunk_size to 512 (#150762 ) Change the default value of min_chunk_size from 4096 to 512 to allow more for loops to be parallelized. I tested the Inductor benchmark with this PR on CPU, and saw ~10% improvement in torchbench geomean speedup, and no change in huggingface/timm_models. There are about 15 torchbench models with different degrees of performance improvement, among which functorch_dp_cifar10, opacus_cifar10, hf_Reformer, and pyhpc_turbulent_kinetic_energy have more than 50% performance improvement. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150762 Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel	2025-07-14 01:14:30 +00:00
bobrenjc93	5221448574	multi-kernel matmuls based on varying hint sizes (#156628 ) The core idea is to generate multiple matmul kernels using different hints for symbolic variables, then select the most appropriate one at runtime for each unique shape we encounter. You can find some early experimentation details in these posts: https://fb.workplace.com/groups/8940092306109185/posts/9803850776399996/ https://fb.workplace.com/groups/8940092306109185/posts/9695805170537891/ https://fb.workplace.com/groups/257735836456307/posts/906589324904285/ Here’s a graph illustrating the empirically observed worst-case performance if an oracle always selected the least optimal hint for a given runtime size: ![image](https://github.com/user-attachments/assets/6d90ee06-a572-453e-9cba-03006f343301) This graph illustrates the performance of a hint size of 64 relative to the worst case. Notice that as the runtime sizes increase, the performance gradually approaches the worst case: ![image](https://github.com/user-attachments/assets/85ad49fe-165a-474c-8d03-db2e57654213) This graph shows the performance of a hint size of 4096 — very poor for small sizes, and also suboptimal for some mid-sized shapes: ![image](https://github.com/user-attachments/assets/adea1106-3bc8-40f3-97b0-20d940fb74f1) Finally, here’s the graph that motivated this PR. It illustrates the performance when selecting the best of three kernels generated with three different hints — 64, 256, and 4096: ![image](https://github.com/user-attachments/assets/a7cb0ce5-8139-48b1-b5c9-7670e75cbfce) ## How to review this PR At a high level, this extends @shunting314's multi-kernel abstraction to support varying GEMM choices driven by different hints. A few key points: 1. Unlike reduction kernels, triton template matmuls pass their grid as arguments to the kernel. This PR updates `MultiKernelCall` to support kernels with varying arguments. 2. 
The `V.graph.sizevars.size_hints` API is extended to accept a `hint_override`, allowing us to substitute the example input’s size hint with a custom value when generating multiple kernels. 3. The choice generation and benchmarking logic is updated to support multiple hint values. One kernel is generated per value in `torch._inductor.config.multi_kernel_hints`, and at runtime, we select the most suitable kernel for the current shape. 4. This PR does not add support for cpp wrapper codegen to keep it scoped. That will be added in the next PR. ## Results The following is a basic test that shows our basic multi kernel working where we no longer show significant variance based on the original hint size: https://gist.github.com/bobrenjc93/ba711d529e65fd65839b34799f6323ec Before ``` Hint\Runtime \| 64 \| 256 \| 4096 --------------------------------------------------- 64 \| 0.0948 \| 0.3124 \| 4.9477 256 \| 0.2243 \| 0.2256 \| 3.3880 4096 \| 0.3384 \| 0.3404 \| 3.3010 ``` After ``` Hint\Runtime \| 64 \| 256 \| 4096 --------------------------------------------------- 64 \| 0.0951 \| 0.2289 \| 3.3013 256 \| 0.0952 \| 0.2258 \| 3.4045 4096 \| 0.0957 \| 0.2231 \| 3.3146 ``` We also see an average speedup of 5.04% for the matrix of all hint/runtime pairs in [64, 4096] for every increment of 64: https://docs.google.com/spreadsheets/d/12TmYUDrAAFASGuP3POXTKPeAvQWIRzKzdrVSIb3vQkA/edit?gid=480268938#gid=480268938 ![Worst Case, multi-kernel](https://github.com/user-attachments/assets/712df23b-87e2-4d9d-95c2-cc25305ba2ed) NB: This is just the beginning and I plan on doing more investigation to further improve on this initial result.
For posterity the script used to generate that matrix is here: https://gist.github.com/bobrenjc93/c211fd0bd97fad8f46b91ad9dee76ad0 HUD benchmark runs: base: https://github.com/pytorch/pytorch/actions/runs/15889871988 head: https://github.com/pytorch/pytorch/actions/runs/15889876842 Pull Request resolved: https://github.com/pytorch/pytorch/pull/156628 Approved by: https://github.com/jansel	2025-07-12 15:08:21 +00:00
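The core idea of #156628, generating one kernel per hint and picking the most suitable one for each runtime shape, can be sketched in plain Python. This is only an illustration under stated assumptions: `BestOfHints` and its methods are hypothetical names, not the `MultiKernelCall` implementation, and the timings are copied from the "Before" table in the commit message (their unit is not stated there).

```python
# TIMINGS[hint][runtime_size]: timing of the kernel generated with `hint`
# when run at `runtime_size`, copied from the "Before" table in #156628.
TIMINGS = {
    64:   {64: 0.0948, 256: 0.3124, 4096: 4.9477},
    256:  {64: 0.2243, 256: 0.2256, 4096: 3.3880},
    4096: {64: 0.3384, 256: 0.3404, 4096: 3.3010},
}


class BestOfHints:
    """Hypothetical dispatcher: benchmark each candidate once per shape, cache the winner."""

    def __init__(self, hints, benchmark):
        self.hints = list(hints)
        self.benchmark = benchmark  # callable (hint, runtime_size) -> cost
        self._choice = {}           # runtime_size -> winning hint

    def pick(self, runtime_size):
        # Benchmark only the first time a shape is seen; reuse the cached choice afterwards.
        if runtime_size not in self._choice:
            self._choice[runtime_size] = min(
                self.hints, key=lambda hint: self.benchmark(hint, runtime_size)
            )
        return self._choice[runtime_size]


dispatcher = BestOfHints(TIMINGS, lambda hint, size: TIMINGS[hint][size])
for size in (64, 256, 4096):
    print(size, "->", dispatcher.pick(size))
```

With the table's numbers, each runtime size selects the kernel generated with the matching hint, which is the behavior the "After" table approximates regardless of the original hint.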
Xuehai Pan	7f14b42adf	[BE][2/16] fix typos in torch/ (torch/_*/) (#156312 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/156312 Approved by: https://github.com/albanD	2025-07-12 05:47:06 +00:00
PyTorch MergeBot	e90148c91d	Revert "[PT2][fusion] ban fusions with large accumulated reads (#157563 )" This reverts commit `4b9a6f7211`. Reverted https://github.com/pytorch/pytorch/pull/157563 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but I suspect that it might contribute to a string of OOM error in trunk ([comment](https://github.com/pytorch/pytorch/pull/157563#issuecomment-3064678929))	2025-07-12 04:52:11 +00:00
PyTorch MergeBot	e15f4248ad	Revert "[BE][2/16] fix typos in torch/ (torch/_*/) (#156312 )" This reverts commit `7a92b51196`. Reverted https://github.com/pytorch/pytorch/pull/156312 on behalf of https://github.com/XuehaiPan due to landrace ([comment](https://github.com/pytorch/pytorch/pull/156312#issuecomment-3064672250))	2025-07-12 04:40:52 +00:00
PyTorch MergeBot	9c189ed29a	Revert "multi-kernel matmuls based on varying hint sizes (#156628 )" This reverts commit `6c79530637`. Reverted https://github.com/pytorch/pytorch/pull/156628 on behalf of https://github.com/huydhn due to Sorry for reverting your change but some ROCM jobs went crazy after this lands, so I try to see if reverting helps ([comment](https://github.com/pytorch/pytorch/pull/156628#issuecomment-3064617123))	2025-07-12 03:48:39 +00:00
Xuehai Pan	7a92b51196	[BE][2/16] fix typos in torch/ (torch/_*/) (#156312 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/156312 Approved by: https://github.com/albanD	2025-07-12 01:47:22 +00:00
Xuan Zhang	4b9a6f7211	[PT2][fusion] ban fusions with large accumulated reads (#157563) Problem: Fusion can accumulate a large number of reads, which leads to a significant increase in peak memory utilization. Imagine we have the following code snippet ``` total = torch.rand(N, N) for _ in range(r): x = torch.rand(N, N) total = total + x ``` The default execution is memory efficient as only two tensors of size N-by-N are in memory at any given time. However, with fusion, the additions are fused into a single operation and the execution becomes something like: ``` x_1 = torch.rand(N, N) x_2 = torch.rand(N, N) ... x_r = torch.rand(N, N) total = x_1 + x_2 + ... + x_r ``` Though this is run-time efficient, in the case of large `N` and/or large `r`, this is not memory efficient. [internal only] see [post](https://fb.workplace.com/groups/1075192433118967/permalink/1703374333634104/) for additional details Solution: Our proposed solution is to ban fusions in cases where a large amount of reads is accumulated. This is in addition to some existing logic during torch compile. * During lowering (i.e., `ir.py`), the config `realize_acc_reads_threshold`, which defaults to 8, controls _the number of_ buffers that can be accumulated for a single operator. However, this is oblivious to the size of the buffers. Hence, we additionally introduce a config `realize_acc_reads_size_threshold` to control _the amount of buffers_ in size that can be accumulated. * During scheduling (i.e., `scheduler.py`), additional fusion will be performed and thus we also need to capture such patterns there. The decisions are implemented under `choices.py`. Results: For a small example similar to the one in the test case (but with larger `N` and a higher number of loop repeats), the memory snapshots before and after are shown below. Note the snapshot on the right is zoomed out so that the y-axes of the two snapshots match.
<img width="1328" alt="image" src="https://github.com/user-attachments/assets/670b5961-8454-4379-ae0f-62d4e7946c64" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/157563 Approved by: https://github.com/jansel, https://github.com/mlazos	2025-07-11 21:07:57 +00:00
bobrenjc93	6c79530637	multi-kernel matmuls based on varying hint sizes (#156628 ) The core idea is to generate multiple matmul kernels using different hints for symbolic variables, then select the most appropriate one at runtime for each unique shape we encounter. You can find some early experimentation details in these posts: https://fb.workplace.com/groups/8940092306109185/posts/9803850776399996/ https://fb.workplace.com/groups/8940092306109185/posts/9695805170537891/ https://fb.workplace.com/groups/257735836456307/posts/906589324904285/ Here’s a graph illustrating the empirically observed worst-case performance if an oracle always selected the least optimal hint for a given runtime size: ![image](https://github.com/user-attachments/assets/6d90ee06-a572-453e-9cba-03006f343301) This graph illustrates the performance of a hint size of 64 relative to the worst case. Notice that as the runtime sizes increase, the performance gradually approaches the worst case: ![image](https://github.com/user-attachments/assets/85ad49fe-165a-474c-8d03-db2e57654213) This graph shows the performance of a hint size of 4096 — very poor for small sizes, and also suboptimal for some mid-sized shapes: ![image](https://github.com/user-attachments/assets/adea1106-3bc8-40f3-97b0-20d940fb74f1) Finally, here’s the graph that motivated this PR. It illustrates the performance when selecting the best of three kernels generated with three different hints — 64, 256, and 4096: ![image](https://github.com/user-attachments/assets/a7cb0ce5-8139-48b1-b5c9-7670e75cbfce) ## How to review this PR At a high level, this extends @shunting314's multi-kernel abstraction to support varying GEMM choices driven by different hints. A few key points: 1. Unlike reduction kernels, triton template matmuls pass their grid as arguments to the kernel. This PR updates `MultiKernelCall` to support kernels with varying arguments. 2. 
The `V.graph.sizevars.size_hints` API is extended to accept a `hint_override`, allowing us to substitute the example input’s size hint with a custom value when generating multiple kernels. 3. The choice generation and benchmarking logic is updated to support multiple hint values. One kernel is generated per value in `torch._inductor.config.multi_kernel_hints`, and at runtime, we select the most suitable kernel for the current shape. 4. This PR does not add support for cpp wrapper codegen to keep it scoped. That will be added in the next PR. ## Results The following is a basic test that shows our basic multi kernel working where we no longer show significant variance based on the original hint size: https://gist.github.com/bobrenjc93/ba711d529e65fd65839b34799f6323ec Before ``` Hint\Runtime \| 64 \| 256 \| 4096 --------------------------------------------------- 64 \| 0.0948 \| 0.3124 \| 4.9477 256 \| 0.2243 \| 0.2256 \| 3.3880 4096 \| 0.3384 \| 0.3404 \| 3.3010 ``` After ``` Hint\Runtime \| 64 \| 256 \| 4096 --------------------------------------------------- 64 \| 0.0951 \| 0.2289 \| 3.3013 256 \| 0.0952 \| 0.2258 \| 3.4045 4096 \| 0.0957 \| 0.2231 \| 3.3146 ``` We also see an average speedup of 5.04% for the matrix of all hint/runtime pairs in [64, 4096] for every increment of 64: https://docs.google.com/spreadsheets/d/12TmYUDrAAFASGuP3POXTKPeAvQWIRzKzdrVSIb3vQkA/edit?gid=480268938#gid=480268938 ![Worst Case, multi-kernel](https://github.com/user-attachments/assets/712df23b-87e2-4d9d-95c2-cc25305ba2ed) NB: This is just the beginning and I plan on doing more investigation to further improve on this initial result.
For posterity the script used to generate that matrix is here: https://gist.github.com/bobrenjc93/c211fd0bd97fad8f46b91ad9dee76ad0 HUD benchmark runs: base: https://github.com/pytorch/pytorch/actions/runs/15889871988 head: https://github.com/pytorch/pytorch/actions/runs/15889876842 Pull Request resolved: https://github.com/pytorch/pytorch/pull/156628 Approved by: https://github.com/jansel	2025-07-11 19:38:10 +00:00
Xu Han	c4cdcda754	[aot] add format_consts_to_cpp function for further development. (#157608) Changes: 1. Split the `format_consts_to_asm` function, which is the current way to convert consts to an object. 2. Add the `format_consts_to_cpp` function, which would support more compilers, such as `msvc` and `icx`. 3. Add `config.aot_inductor.use_consts_asm_build` to control the choice between `format_consts_to_asm` and `format_consts_to_cpp`. 4. Add UT for `format_consts_to_cpp`. For `format_consts_to_cpp`, I have tested it locally: Case: https://docs.pytorch.org/docs/main/torch.compiler_aot_inductor.html Run it and `cat` the cpp code: <img width="674" alt="image" src="https://github.com/user-attachments/assets/d47ccf84-06d2-47f5-8a0d-9a43a9020aa3" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/157608 Approved by: https://github.com/desertfire, https://github.com/jansel	2025-07-11 17:02:41 +00:00
Mwiza Kunda	ed508cc018	[inductor][triton] Add experimental use_tensor_descriptor config option (#157906) Refactor to allow TMA descriptors to be used in general codegen. TMA descriptors can only be generated if the conditions listed in the triton documentation for [make_tensor_descriptor](https://triton-lang.org/main/python-api/generated/triton.language.make_tensor_descriptor.html) are met. Some implementation details: - The `TMACompatibilityChecker` class holds and checks the conditions required for a load / store operation to be represented by a TMA descriptor load / store - The current TMA API requires that the innermost block size loads at least 16 bytes of data, e.g. if the block shape is [YBLOCK, XBLOCK] and the tensor dtype is float32, this requires that XBLOCK >= 4. It is therefore required that the triton heuristics are aware of the minimum block sizes for the IO operations in the kernel. The minimum block sizes are determined in the `TMACompatibilityChecker` class and are passed to the triton heuristics when the block sizes are not static. The heuristic config options are then filtered to ensure that the minimum block size restriction is met. Testing: - Refactored test_torchinductor_strided_blocks.py to also test the `use_tensor_descriptor` option. This requires an upgrade to Triton version 3.4.0: https://github.com/pytorch/pytorch/issues/154206 Pull Request resolved: https://github.com/pytorch/pytorch/pull/157906 Approved by: https://github.com/jansel	2025-07-11 09:32:40 +00:00
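The 16-byte rule in the #157906 message above reduces to simple arithmetic: the innermost block must cover at least 16 bytes, so the minimum XBLOCK is 16 divided by the element size, rounded up. The sketch below assumes nothing beyond that rule; `min_inner_block` and `filter_configs` are hypothetical names, and the config dictionaries only stand in for the heuristic config options the message mentions, not for `TMACompatibilityChecker` itself.

```python
TMA_MIN_INNER_BYTES = 16  # from the commit message: innermost block must load at least 16 bytes


def min_inner_block(itemsize: int) -> int:
    """Smallest innermost block size (in elements) that covers 16 bytes."""
    return -(-TMA_MIN_INNER_BYTES // itemsize)  # ceiling division


def filter_configs(configs, itemsize: int):
    """Keep only the (hypothetical) configs whose XBLOCK meets the minimum."""
    min_x = min_inner_block(itemsize)
    return [cfg for cfg in configs if cfg["XBLOCK"] >= min_x]


# float32 has 4-byte elements, so XBLOCK >= 4, matching the example in the message.
print(min_inner_block(4))  # 4
# float16 / bfloat16 have 2-byte elements, so XBLOCK >= 8.
print(min_inner_block(2))  # 8

candidates = [{"YBLOCK": 32, "XBLOCK": 2}, {"YBLOCK": 16, "XBLOCK": 4}, {"YBLOCK": 8, "XBLOCK": 16}]
print(filter_configs(candidates, itemsize=4))
```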
Sam Larsen	5bd7804be2	Support caching if joint_custom_pre_pass/joint_custom_post_pass implement the proper interface (#157990 ) Summary: Essentially, treat joint_custom_pre_pass/joint_custom_post_pass the same as post_grad_custom_post_pass/post_grad_custom_pre_pass. Test Plan: More unit tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/157990 Approved by: https://github.com/oulgen	2025-07-10 19:17:11 +00:00
Shangdi Yu	4781d72faa	[AOTI] codegen for static linkage (#157129 ) Design doc: https://docs.google.com/document/d/1ncV7RpJ8xDwy8-_aCBfvZmpTTL824C-aoNPBLLVkOHM/edit?tab=t.0 (internal) - Add codegen for static linkage - refactor test code for test_compile_after_package tests For now, the following options must be used together with `"aot_inductor.compile_standalone": True`. "aot_inductor.package_cpp_only": True, Will change `"aot_inductor.package_cpp_only"` to be automatically set to True in followup PR. ``` python test/inductor/test_aot_inductor_package.py -k test_compile_after_package python test/inductor/test_aot_inductor_package.py -k test_run_static_linkage_model ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/157129 Approved by: https://github.com/desertfire	2025-07-10 16:03:50 +00:00
IvanKobzarev	8dff457f42	[simple_fsdp] Port fx pass to bucket reduce_scatters (#157780 ) Porting fx passes for reduce_scatters bucketing (similar to all_gather bucketing) for simple_fsdp and autoparallel testing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/157780 Approved by: https://github.com/wconstab	2025-07-10 14:04:43 +00:00
Paul Zhang	4cfc0a3208	[Inductor] Introduce Lookup Table for Overriding Triton Kernel autotune configs post fusion (#157924 ) Summary: Introduce lookup table for kernels post fusion, hashing on inductor generated source code Rollback Plan: Differential Revision: D77866885 Pull Request resolved: https://github.com/pytorch/pytorch/pull/157924 Approved by: https://github.com/jansel	2025-07-10 03:23:50 +00:00
Xiangyang (Mark) Guo	b354328ecd	[AOTI] add flag AOT_INDUCTOR_ENABLE_LTO (#157773 ) Add env var AOT_INDUCTOR_ENABLE_LTO to enable clang's ThinLTO by setting AOT_INDUCTOR_ENABLE_LTO=1. The LTO is disabled by default because it may increase the build time. Rollback Plan: Differential Revision: D77899195 Pull Request resolved: https://github.com/pytorch/pytorch/pull/157773 Approved by: https://github.com/desertfire	2025-07-09 16:54:19 +00:00
Shangdi Yu	effe376db0	Adding aoti_standalone config (#157731 ) Summary: When `compile_standalone` is True, we set `package_cpp_only` to True as well. We raise an error if `package_cpp_only` is explicitly set to False in config. Test Plan: ``` buck2 run mode/dev-nosan fbcode//caffe2/test/inductor:test_aot_inductor -- -r TestAOTInductorConfig ``` Rollback Plan: Differential Revision: D77889754 Pull Request resolved: https://github.com/pytorch/pytorch/pull/157731 Approved by: https://github.com/desertfire	2025-07-09 04:30:04 +00:00
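The config rule described in #157731 (standalone compilation implies `package_cpp_only`, and an explicit `False` is an error) can be sketched as a small resolver. This is an assumption-laden illustration: the function name, the use of `None` to mean "not set by the user", and the `ValueError` exception type are all hypothetical, not taken from the PyTorch source.

```python
from typing import Optional


def resolve_package_cpp_only(compile_standalone: bool, package_cpp_only: Optional[bool]) -> bool:
    """Hypothetical resolver; `None` means the user did not set package_cpp_only."""
    if compile_standalone:
        if package_cpp_only is False:
            # Explicitly contradicting settings are rejected rather than silently overridden.
            raise ValueError("compile_standalone=True requires package_cpp_only=True")
        return True
    return bool(package_cpp_only)


print(resolve_package_cpp_only(True, None))   # True
print(resolve_package_cpp_only(False, None))  # False
```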
Sam Larsen	7a41f20794	[inductor] Quiesce Triton compile worker pool after each dynamo compile (#156187) For internal usages, keeping the Triton compile worker pool active for the lifetime of the process has caused some challenges, e.g., it slows down and muddies profiling due to the huge number of threads on a box: N threads = 8 ranks * 32 subprocs * M threads started by torch. Also, each subproc can use more than 1GB. This PR adds the functionality to shut down worker subprocs after each dynamo compile when using the SubprocPool implementation. The idea is to leave the main sidecar process running, but signal it to tear down its internal ProcessPoolExecutor when compile is finished. Restarting the ProcessPoolExecutor is relatively fast, e.g., 500ms because the ProcessPoolExecutor forks from the sidecar. Changes: * Do not start the ProcessPoolExecutor automatically when compile_fx is imported. Instead, start the sidecar process only. The sidecar process imports torch, so is still slow to start. * Introduce wakeup() and quiesce() calls to the implementation to start and stop the ProcessPoolExecutor. * Add a context manager to automatically quiesce() at the end of dynamo compilation. * Signal a wakeup() in compile_fx only when we have cuda devices. * Add a killswitch so we can turn off quiescing. Testing: For correctness, the stacked change at https://github.com/pytorch/pytorch/pull/156534 enables the feature for OSS so it's exercised in CI. For performance, because of recent compile-time variance (see https://github.com/pytorch/pytorch/issues/152566), it's pretty hard to glean whether there's a regression....
* Training: https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Tue%2C%2017%20Jun%202025%2021%3A32%3A04%20GMT&stopTime=Tue%2C%2024%20Jun%202025%2021%3A32%3A04%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(h100)&lBranch=gh/masnesral/210/head&lCommit=1b7315031c3bfad66a1a01700167a9ca1a2ae5f1&rBranch=main&rCommit=eab45643f22e58ee12d95d8b0162d51ca0a50801 * Inference: https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Tue%2C%2017%20Jun%202025%2021%3A32%3A04%20GMT&stopTime=Tue%2C%2024%20Jun%202025%2021%3A32%3A04%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=gh/masnesral/210/head&lCommit=1b7315031c3bfad66a1a01700167a9ca1a2ae5f1&rBranch=main&rCommit=eab45643f22e58ee12d95d8b0162d51ca0a50801 The wins (mostly for inference) don't make sense, but I'm also skeptical of the losses (mostly for training). I can't repro any of the slowdowns locally. Furthermore, check out the benchmarking results for the stacked diff, which actually enables the quiescing functionality for OSS. That should only slow down compile since there can only be overhead to stop and start the workers. 
But the results are somehow better: * Training: https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Tue%2C%2017%20Jun%202025%2021%3A32%3A04%20GMT&stopTime=Tue%2C%2024%20Jun%202025%2021%3A32%3A04%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(h100)&lBranch=gh/masnesral/214/head&lCommit=41943253882a019b8ceafcd2bf4cd6acbe0cbca9&rBranch=main&rCommit=eab45643f22e58ee12d95d8b0162d51ca0a50801 * Inference: https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Tue%2C%2017%20Jun%202025%2021%3A32%3A04%20GMT&stopTime=Tue%2C%2024%20Jun%202025%2021%3A32%3A04%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=gh/masnesral/214/head&lCommit=41943253882a019b8ceafcd2bf4cd6acbe0cbca9&rBranch=main&rCommit=eab45643f22e58ee12d95d8b0162d51ca0a50801 Pull Request resolved: https://github.com/pytorch/pytorch/pull/156187 Approved by: https://github.com/aorenste, https://github.com/jansel	2025-07-08 22:53:13 +00:00
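The wakeup()/quiesce() lifecycle described in #156187 can be illustrated with a small wrapper around `concurrent.futures`. This is a simplified sketch, not the SubprocPool implementation: it omits the sidecar process entirely, and `QuiescablePool`, `quiesce_after`, and the `executor_factory` parameter are hypothetical names introduced here.

```python
from concurrent.futures import ProcessPoolExecutor
from contextlib import contextmanager


class QuiescablePool:
    """Hypothetical pool wrapper whose workers exist only between wakeup() and quiesce()."""

    def __init__(self, max_workers: int, executor_factory=ProcessPoolExecutor):
        self._max_workers = max_workers
        self._factory = executor_factory  # injectable so other executors can be swapped in
        self._executor = None             # nothing is started at construction time

    @property
    def running(self) -> bool:
        return self._executor is not None

    def wakeup(self) -> None:
        """Start the worker pool if it is not already running."""
        if self._executor is None:
            self._executor = self._factory(max_workers=self._max_workers)

    def quiesce(self) -> None:
        """Tear down the workers; a later wakeup() creates a fresh pool."""
        if self._executor is not None:
            self._executor.shutdown(wait=True)
            self._executor = None

    def submit(self, fn, *args):
        self.wakeup()
        return self._executor.submit(fn, *args)


@contextmanager
def quiesce_after(pool: QuiescablePool):
    """Quiesce the pool when the enclosed work (e.g. one compile) finishes."""
    try:
        yield pool
    finally:
        pool.quiesce()
```

Wrapping each compile in `with quiesce_after(pool):` releases the worker processes between compiles, at the cost of restarting them on the next `submit`; the commit message puts that restart at roughly 500ms in its setting.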
IvanKobzarev	7b392bac13	all_gather_bucketing fx pass (#157396) Porting passes to bucket all_gathers. The main logic of the pass is done via 1. Searching for all all_gathers from the buckets Copying tests from @wconstab PR to test compatibility with reordering. The test checks only compatibility: because of (3), the joint all_gather will already be scheduled as early as possible, leaving no room for reordering. Pass changes: Using mutation ops to match the performance of fsdp; in the future, the perfect scenario will be to have only a functional graph, so that inductor does all memory optimizations on its own without mutable ops. Inductor changes: Adding foreach_copy_ lowering Pull Request resolved: https://github.com/pytorch/pytorch/pull/157396 Approved by: https://github.com/wconstab	2025-07-03 22:07:42 +00:00
Nicolas Macchioni	3bdd5ae334	[PT2] deprecate `force_same_precision`, guarded by JK (#156789) Summary: cuBLAS used to have strict alignment requirements for TF32 usage, even if TF32 was enabled by users; this caused a numeric SEV in the past, when Triton would use TF32 even if cuBLAS could not due to failing the alignment checks. We believe that cuBLAS no longer has alignment requirements for TF32 usage, based on some testing in D77265581; we'd like to deprecate `force_same_precision` since it no longer functions as expected. Changing the default to False in fbcode, guarded by a JK so that we can quickly revert to the original behavior if needed. Test Plan: CI Rollback Plan: Differential Revision: D77265930 Pull Request resolved: https://github.com/pytorch/pytorch/pull/156789 Approved by: https://github.com/jhadidjojo, https://github.com/masnesral	2025-06-27 00:43:06 +00:00
Nicolas Macchioni	13efb2c858	[BE] Deprecate `search_autotune_cache` (#155302 ) We haven't had the offline cache populated in > 1 year, this should be safe; if this passes, we can finally go through and rip out the offline cache logic Pull Request resolved: https://github.com/pytorch/pytorch/pull/155302 Approved by: https://github.com/masnesral	2025-06-26 17:30:08 +00:00
James Wu	e581f015ee	Bump STATIC_CUDA_LAUNCHER_VERSION to 2 (#156726 ) Differential Revision: [D77241813](https://our.internmc.facebook.com/intern/diff/D77241813) Pull Request resolved: https://github.com/pytorch/pytorch/pull/156726 Approved by: https://github.com/oulgen	2025-06-26 01:50:51 +00:00
Boyuan Feng	1044934878	[CUDAGraph] add config `cudagraph_capture_sizes` (#156551 ) Users may want CUDAGraph for certain sizes and fallback for other sizes. As discussed in Issue #121968, we would like to use cudagraph for [batch size [1,2,3,...,16]](https://github.com/pytorch/pytorch/issues/121968#issuecomment-2259942345) and fallback for others. Another use case is [vllm](https://github.com/vllm-project/vllm/blob/main/vllm/compilation/cuda_piecewise_backend.py#L114-L119), where 67 batch sizes (i.e., [1,2,4,8,16,24,32,...,512]) are captured and all other sizes fallback. This PR implements the feature with `torch._inductor.config.triton.cudagraph_capture_sizes`. When it is specified, we only capture cudagraph for these shapes. When it is None (by default), we capture cudagraph for all shapes. Example:

```python
import torch

torch._inductor.config.triton.cudagraph_capture_sizes = [(2,3), (4,5), (6, 2), (7,3)]

def f(x):
    return x + 1

f = torch.compile(f, mode="reduce-overhead", dynamic=False)

def run(batch_size, seq_len, d):
    x = torch.randn((batch_size, seq_len, d), device="cuda")
    # Need to mark the dimension as dynamic. Automated-dynamic
    # may have some ux issues on matching `cudagraph_capture_sizes`
    # with the actual dynamic shapes, since there are specialization and
    # multiple dynamo graphs.
    torch._dynamo.mark_dynamic(x, 0)
    torch._dynamo.mark_dynamic(x, 1)
    for _ in range(3):
        f(x)

for i in range(2, 10):
    for j in range(2, 10):
        run(i, j, 8)

num_cudagraph = torch._inductor.cudagraph_trees.get_container(0).tree_manager.new_graph_id()
assert num_cudagraph.id == 4
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156551 Approved by: https://github.com/bobrenjc93	2025-06-24 05:14:49 +00:00
Xuehai Pan	6ff6630375	[BE][3/16] fix typos in torch/ (torch/_inductor/) (#156313 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/156313 Approved by: https://github.com/jingsh	2025-06-23 02:57:12 +00:00
leslie-fang-intel	c55eef79f8	[Inductor][CPP] Enable a config to use a small dequant buffer for woq int4 (#156395 ) Summary Add a configuration option to enable a smaller dequantization buffer for WOQ INT4 CPP GEMM template. This can improve the performance of the WOQ INT4 GEMM template in cases where M is small. In such scenarios, matrix B cannot be effectively reused across matrix A, and we found that reducing the Kc block size can lead to better performance. Test Plan ``` python test/inductor/test_cpu_select_algorithm.py -k test_int4_woq_mm_with_small_buffer_config ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/156395 Approved by: https://github.com/jansel ghstack dependencies: #156407, #156387	2025-06-23 02:00:42 +00:00
Jack Taylor	03023f178c	FlexAttn config refactor + ROCm optimisations (#156307 ) This PR primarily unifies the flex attention config logic with the GEMM/Conv config approach (https://github.com/pytorch/pytorch/pull/147452); this will make it much easier to handle optimisation pathways for particular Triton backends. This PR also: 1. Introduces an exhaustive tuning mode for flex attention via TORCHINDUCTOR_MAX_AUTOTUNE_FLEX_SEARCH_SPACE="EXHAUSTIVE" to allow for wide-scale benchmarking for perf investigation use cases. 3. Updates configs for the ROCm flex autotune path, providing perf optimisations. AMD perf numbers on score mod benchmark (default inputs):

| flex_attn | mode | Speedup (Avg) | Speedup (Max) |
| -- | -- | -- | -- |
| fwd | autotune before PR | 2.608 | 20.56 |
| fwd | autotune after PR | 2.862 | 22 |
| fwd | exhaustive_autotune | 2.943 | 22.471 |
| bwd | autotune before PR | 2.196 | 9.831 |
| bwd | autotune after PR | 2.423 | 11.331 |
| bwd | exhaustive_autotune | 2.566 | 13.87 |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156307 Approved by: https://github.com/drisspg, https://github.com/jansel	2025-06-22 22:27:38 +00:00
PyTorch MergeBot	f1331f3f1b	Revert "[BE][3/16] fix typos in torch/ (torch/_inductor/) (#156313 )" This reverts commit `3627270bdf`. Reverted https://github.com/pytorch/pytorch/pull/156313 on behalf of https://github.com/atalman due to export/test_torchbind.py::TestCompileTorchbind::test_compile_error_on_input_aliasing_contents_backend_aot_eager [GH job link](https://github.com/pytorch/pytorch/actions/runs/15804799771/job/44548489912) [HUD commit link](`c95f7fa874`) ([comment](https://github.com/pytorch/pytorch/pull/156313#issuecomment-2994171213))	2025-06-22 12:31:57 +00:00
Xuehai Pan	3627270bdf	[BE][3/16] fix typos in torch/ (torch/_inductor/) (#156313 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/156313 Approved by: https://github.com/jingsh	2025-06-22 08:43:09 +00:00
Menglu Yu	0ad88a2224	Support environment var for autotune log (#156254 ) Summary: Titled Test Plan: See the sandcastle signal Rollback Plan: Differential Revision: D76860928 Pull Request resolved: https://github.com/pytorch/pytorch/pull/156254 Approved by: https://github.com/Mingming-Ding	2025-06-20 23:06:33 +00:00
Shunting Zhang	39270430c9	[inductor] force min num-split (off by default) (#155941 ) This is a fix for the 10% QPS regression of some internal model (internal doc: [here](https://docs.google.com/document/d/19EiSZSS_SNUNfRg3jmevyrDs9nVpyvyGX_LHfiz-SbU/edit?tab=t.0#heading=h.dim0r28ztzu5) and [here](https://docs.google.com/document/d/1DjRWJPl1cgpceaj8YXTyw6FubGb43Vw-lTAETF9XXnI/edit?tab=t.0#heading=h.ld0vvn8o77sp) ). The regression is caused by unrepresentative example inputs for compilation with dynamic shapes. While the general problem is hard to solve and requires more work, there is a quick fix for this specific case. When we compile LayerNormBackward with small xnumel and large rnumel, we do split reduction. With unrepresentative inputs, rnumel may be in the 4K range and we pick a small num-split (9 in this specific case). Later on, when we get an input with larger rnumel (100K range; no recompile because dynamic shape is enabled), the small num-split does not introduce enough parallelism and causes sub-optimal performance. The quick fix is to force a minimum value for num_split. Let's say we split a reduction [xnumel, rnumel] into two kernels, in this order: - [xnumel * num_split, rnumel / num_split] - [xnumel, num_split] A larger num_split always introduces more parallelism for kernel 1. It may result in more work in kernel 2. But if we set the minimum num_split to something not too large (like 256), each row in kernel 2 may still be reducible by a few warps or even a single warp, so kernel 2 may not slow down. Here are some benchmarking results.
```
import torch
from triton.testing import do_bench
import functools
from torch._inductor import config
from torch._dynamo.decorators import mark_dynamic
import os

@torch.compile(dynamic=True)
def f(x):
    return x.sum(dim=0)

N = 512
C = functools.partial(torch.randn, device="cuda")
x_small = C(4096, N)
x_large = C(4096 * 1000, N)

if os.getenv("HINT_WITH_SMALL_INPUT") == "1":
    x = x_small
else:
    x = x_large

mark_dynamic(x, 0)
f(x)
ms = do_bench(lambda: f(x_large))
# 4.03ms if hint with large input. Output code: https://gist.github.com/shunting314/0be562a0c14f8ec0852b12bbf53d7a15
# 8.32ms if hint with small input. Output code: https://gist.github.com/shunting314/79b924c266d5c562703c3bdfb48d8272
# 3.92ms if hint with small input, and force min num split: Output code: https://gist.github.com/shunting314/c82917a1849b698bf4d2be2fde2fd2ba
print(ms)
```

This test mimics what we see in the original problem.
- If we compile with large inputs and benchmark for large inputs, latency is 4.03ms.
- If we compile with small input but benchmark for large inputs, we get more than 2x slowdown; latency is 8.32ms.
- With the fix, even if we compile with small input and benchmark for large inputs, latency is 3.92ms. The perf is slightly better than the first case, so it's possible that the heuristic to decide num-split has room to improve.

The minimum num-split restriction could be applied to the dynamic shape case only, but I found it can also help static shape cases a little. So I plan to apply it without checking dynamic shape for now, unless I see red signals in a thorough perf test.
- Outer reduction with static shape: https://gist.github.com/shunting314/6a670a818e63533479399c4dbea5b29a . The fix improves perf from 0.01 ms to 0.009 ms.
- Inner reduction with static shape: https://gist.github.com/shunting314/f12f20099126130b953e55ad325c0f62 Perf is neutral (0.011 ms vs. 0.011ms).

A thorough perf test is running here: https://github.com/pytorch/pytorch/actions/runs/15642912325

# Update for not applying the change to static shape

From the perf test result [here](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2009%20Jun%202025%2020%3A57%3A15%20GMT&stopTime=Mon%2C%2016%20Jun%202025%2020%3A57%3A15%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(h100)&lBranch=gh/shunting314/210/head&lCommit=62b8e191e027842d402fb046a429732616f87570&rBranch=main&rCommit=5b9db4335e61c1c903cb0769282cbea588e49036), it looks like the change hurts perf for the static shape case. I think one reason is that the change may increase the number of kernels and lose some fusion opportunities. Check the following code for example:

```
import torch
from torch._inductor import config

aten = torch.ops.aten

def f(x):
    return aten.bernoulli(x).sum()

x = torch.randn(8000 * 3, dtype=torch.bfloat16, device="cuda")
torch.compile(f)(x)
```

With the change, the bernoulli kernel would NOT be able to fuse with the first-layer reduction because 8000 * 3 is not divisible by 256. Potentially we could improve the change to always pick a num-split that is at least 256 and evenly divides rnumel. But I'll simply apply the change for dynamic shape for now, since that's the original issue.
Another perf test only applying min-num-split to dynamic shape [here](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Wed%2C%2011%20Jun%202025%2018%3A14%3A04%20GMT&stopTime=Wed%2C%2018%20Jun%202025%2018%3A14%3A04%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(h100)&lBranch=gh/shunting314/210/head&lCommit=e7b2cf55f30a585acd4d907fc9127fcb30a256cc&rBranch=main&rCommit=d3d655ad14ee4cd1c135ac57bbf75d5623fc9fa6) Differential Revision: [D76625617](https://our.internmc.facebook.com/intern/diff/D76625617) Pull Request resolved: https://github.com/pytorch/pytorch/pull/155941 Approved by: https://github.com/jansel, https://github.com/bobrenjc93	2025-06-20 18:01:28 +00:00
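The two-kernel split described in the commit above can be illustrated with plain shape arithmetic. This is a minimal sketch, not Inductor's actual heuristic: `split_reduction_shapes`, `divides_evenly` and `MIN_NUM_SPLIT` are hypothetical names, and the floor of 256 is simply the example value mentioned in the commit message.

```python
MIN_NUM_SPLIT = 256  # assumed floor, taken from the example in the commit message


def split_reduction_shapes(xnumel, rnumel, num_split, min_num_split=MIN_NUM_SPLIT):
    """Return the (x, r) shapes of the two kernels of a split reduction.

    Hypothetical helper: kernel 1 reduces chunks of the original reduction
    dimension, kernel 2 reduces the num_split partial results per row.
    """
    # Clamp the heuristic's choice to the forced minimum.
    num_split = max(num_split, min_num_split)
    # Ceil-divide so every reduction element is covered by some chunk.
    r_per_split = -(-rnumel // num_split)
    kernel1 = (xnumel * num_split, r_per_split)
    kernel2 = (xnumel, num_split)
    return kernel1, kernel2


def divides_evenly(rnumel, num_split):
    """True when the reduction dimension splits into equal chunks."""
    return rnumel % num_split == 0


# Small-hint case from the commit: the heuristic picked 9, the floor lifts it to 256.
k1, k2 = split_reduction_shapes(xnumel=512, rnumel=4096, num_split=9)
print(k1, k2)

# Static-shape example from the commit: 8000 * 3 is not a multiple of 256.
print(divides_evenly(8000 * 3, 256))
```

With the floor applied, kernel 1 has many more independent rows to parallelise over, while kernel 2 still reduces only `num_split` elements per row.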
CaoE	159a39ad34	Add an option for cpp_wrapper to compile entry and kernel separately (#156050 ) Fixes #156037. Compiling entry and kernel separately has a non-negligible impact on the performance. This PR is to add an option for cpp_wrapper to control whether to compile entry and kernel separately, and turn it off by default. Pull Request resolved: https://github.com/pytorch/pytorch/pull/156050 Approved by: https://github.com/leslie-fang-intel, https://github.com/benjaminglass1, https://github.com/jansel	2025-06-20 01:11:16 +00:00
Shangdi Yu	eaf704914e	[aoti] package weights to disk and dedup (#155241 ) We package the weights and save them in `data/weights/` (`WEIGHTS_DIR`). In addition, we store a `weights_config.json` in the model folder for each model to specify which weight file corresponds to which weight name. Models can share weights. We dedup the weights based on their underlying storage (`tensor.untyped_storage()`). - Use `"aot_inductor.package_constants_on_disk": True` config to produce the `Weights` in aot_compile - If we see `Weights` in aoti_files, we'll automatically package them to disk - `"aot_inductor.package_constants_on_disk"` config and `"aot_inductor.package_constants_in_so"` config work independently. - Use `load_pt2(package_path, load_weights_from_disk=True)` to load the weights from disk. `load_weights_from_disk` defaults to False. Test Plan: ``` buck2 run @//mode/dev-nosan //caffe2/test/inductor:aot_inductor_package -- -r "test_package_shared_weights" ``` Tested with whisper at https://github.com/pytorch-labs/torchnative/pull/7 Rollback Plan: Differential Revision: D74747190 Pull Request resolved: https://github.com/pytorch/pytorch/pull/155241 Approved by: https://github.com/desertfire	2025-06-19 17:17:17 +00:00
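The dedup idea in the commit above (weights that share one underlying storage are written once and referenced by name) can be sketched in plain Python. This is an illustration under stated assumptions, not the AOTInductor implementation: `dedup_weights` and the `weight_N` file names are hypothetical, and `bytearray` objects stand in for tensor storages so the sketch runs without PyTorch.

```python
def dedup_weights(named_buffers):
    """Map each weight name to a file name, sharing one file per distinct buffer.

    Hypothetical helper: buffers are compared by object identity, which plays
    the role that the underlying tensor storage plays in the commit message.
    """
    file_for_buffer = {}  # id(buffer) -> file name
    weights_config = {}   # weight name -> file name
    for name, buf in named_buffers.items():
        key = id(buf)
        if key not in file_for_buffer:
            # First time this buffer is seen: assign it a new file.
            file_for_buffer[key] = f"weight_{len(file_for_buffer)}"
        weights_config[name] = file_for_buffer[key]
    return weights_config


shared = bytearray(16)  # one buffer referenced by two weight names
config = dedup_weights({
    "encoder.embed": shared,
    "decoder.embed": shared,
    "decoder.proj": bytearray(16),
})
print(config)
```

The resulting mapping is the kind of name-to-file table that a per-model `weights_config.json` would need to record.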
Ruben Rodriguez Buchillon	bdb1553b77	[inductor][cutlass] binary remote cache (#156248 ) Summary: # Why speed up cutlass kernel generation and retrieval # What using the _ManifoldCache, make a KernelBinaryCache that uploads/downloads kernels and their error files. only register the handler internally this is the OSS only part of the change, to facilitate integration Test Plan: ## prove that we can upload successfully ``` buck2 run @mode/opt scripts/coconutruben/torchmm:experiment 2>&1 ``` ``` manifold ls coconutruben-test-01/tree/cutlass_concept_2 673184 cfkykew2fw5572hjr4e7jbog7oix7xjkegtn2ovikyhxe6pr4tcw.so 649776 cpjqda67c6ojj75z3ddnmfbxinpm7yp7rc2q2oxwsrtwsnacklqv.so ``` ## prove that we can download successfully ``` buck2 run @mode/opt scripts/coconutruben/torchmm:experiment 2>&1 ``` ``` I0611 12:48:38.759000 935012 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:65] Successfully downloaded /var/tmp/torchinductor_coconutruben/fk/cfkykew2fw5572hjr4e7jbog7oix7xjkegtn2ovikyhxe6pr4tcw.so I0611 12:48:38.760000 935012 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:65] Successfully downloaded /var/tmp/torchinductor_coconutruben/pj/cpjqda67c6ojj75z3ddnmfbxinpm7yp7rc2q2oxwsrtwsnacklqv.so ``` ## prove that we can upload errors successfully ``` buck2 run @mode/opt scripts/coconutruben/torchmm:experiment 2>&1 ``` ``` manifold ls coconutruben-test-01/tree/cutlass_concept_2 4846 cqiq4vjbvytdofutoxisa3pqjplgpgmt2sh7dtatiw4bqt5rtjgc.so.error 4846 cqymdwsfsirhkqglv7sbjyvqkrt3ryql4mtb45tekt76347ee6sx.so.error ``` ## prove that we can download errors successfully ``` buck2 run @mode/opt scripts/coconutruben/torchmm:experiment 2>&1 ``` ``` I0611 12:56:14.078000 1001022 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:74] Successfully downloaded 
/var/tmp/torchinductor_coconutruben/qi/cqiq4vjbvytdofutoxisa3pqjplgpgmt2sh7dtatiw4bqt5rtjgc.so.error I0611 12:56:14.079000 1001022 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:74] Successfully downloaded /var/tmp/torchinductor_coconutruben/qy/cqymdwsfsirhkqglv7sbjyvqkrt3ryql4mtb45tekt76347ee6sx.so.error ``` ## showing timing information ``` I0616 11:22:29.169000 2249769 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:71] Successfully downloaded /var/tmp/torchinductor_coconutruben/fk/cfkykew2fw5572hjr4e7jbog7oix7xjkegtn2ovikyhxe6pr4tcw.so (download: 0.842s, write: 0.000s, total: 0.842s) I0616 11:22:29.169000 2249769 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:71] Successfully downloaded /var/tmp/torchinductor_coconutruben/pj/cpjqda67c6ojj75z3ddnmfbxinpm7yp7rc2q2oxwsrtwsnacklqv.so (download: 0.838s, write: 0.001s, total: 0.838s) ``` Reviewed By: henrylhtsang Pull Request resolved: https://github.com/pytorch/pytorch/pull/156248 Approved by: https://github.com/henrylhtsang	2025-06-18 06:51:22 +00:00
Oguz Ulgen	8e02cd9c5a	Skip cache related configs for cache config serialization (#156195 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/156195 Approved by: https://github.com/masnesral	2025-06-17 21:24:07 +00:00
PyTorch MergeBot	ec08eb8ba2	Revert "[inductor][cutlass] binary remote cache (#156106 )" This reverts commit `9a2c669425`. Reverted https://github.com/pytorch/pytorch/pull/156106 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/156106#issuecomment-2981533904))	2025-06-17 19:07:49 +00:00
Ruben Rodriguez Buchillon	9a2c669425	[inductor][cutlass] binary remote cache (#156106 ) Summary: # Why speed up cutlass kernel generation and retrieval # What using the _ManifoldCache, make a KernelBinaryCache that uploads/downloads kernels and their error files. only register the handler internally Test Plan: ## prove that we can upload successfully ``` buck2 run mode/opt scripts/coconutruben/torchmm:experiment 2>&1 ``` ``` manifold ls coconutruben-test-01/tree/cutlass_concept_2 673184 cfkykew2fw5572hjr4e7jbog7oix7xjkegtn2ovikyhxe6pr4tcw.so 649776 cpjqda67c6ojj75z3ddnmfbxinpm7yp7rc2q2oxwsrtwsnacklqv.so ``` ## prove that we can download successfully ``` buck2 run mode/opt scripts/coconutruben/torchmm:experiment 2>&1 ``` ``` I0611 12:48:38.759000 935012 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:65] Successfully downloaded /var/tmp/torchinductor_coconutruben/fk/cfkykew2fw5572hjr4e7jbog7oix7xjkegtn2ovikyhxe6pr4tcw.so I0611 12:48:38.760000 935012 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:65] Successfully downloaded /var/tmp/torchinductor_coconutruben/pj/cpjqda67c6ojj75z3ddnmfbxinpm7yp7rc2q2oxwsrtwsnacklqv.so ``` ## prove that we can upload errors successfully ``` buck2 run mode/opt scripts/coconutruben/torchmm:experiment 2>&1 ``` ``` manifold ls coconutruben-test-01/tree/cutlass_concept_2 4846 cqiq4vjbvytdofutoxisa3pqjplgpgmt2sh7dtatiw4bqt5rtjgc.so.error 4846 cqymdwsfsirhkqglv7sbjyvqkrt3ryql4mtb45tekt76347ee6sx.so.error ``` ## prove that we can download errors successfully ``` buck2 run mode/opt scripts/coconutruben/torchmm:experiment 2>&1 ``` ``` I0611 12:56:14.078000 1001022 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:74] Successfully downloaded /var/tmp/torchinductor_coconutruben/qi/cqiq4vjbvytdofutoxisa3pqjplgpgmt2sh7dtatiw4bqt5rtjgc.so.error I0611 12:56:14.079000 1001022 
/data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:74] Successfully downloaded /var/tmp/torchinductor_coconutruben/qy/cqymdwsfsirhkqglv7sbjyvqkrt3ryql4mtb45tekt76347ee6sx.so.error ``` ## showing timing information ``` I0616 11:22:29.169000 2249769 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:71] Successfully downloaded /var/tmp/torchinductor_coconutruben/fk/cfkykew2fw5572hjr4e7jbog7oix7xjkegtn2ovikyhxe6pr4tcw.so (download: 0.842s, write: 0.000s, total: 0.842s) I0616 11:22:29.169000 2249769 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:71] Successfully downloaded /var/tmp/torchinductor_coconutruben/pj/cpjqda67c6ojj75z3ddnmfbxinpm7yp7rc2q2oxwsrtwsnacklqv.so (download: 0.838s, write: 0.001s, total: 0.838s) ``` Rollback Plan: Reviewed By: henrylhtsang Differential Revision: D76454741 Pull Request resolved: https://github.com/pytorch/pytorch/pull/156106 Approved by: https://github.com/henrylhtsang Co-authored-by: atalman <atalman@fb.com>	2025-06-17 16:24:10 +00:00
penknife6153	3e38feb05f	[inductor] Add configuration control for CUTLASS operation selection. (#155770 ) Added a new configuration option `cutlass_enabled_ops` that allows users to control which operations use CUTLASS lowerings. By default, CUTLASS is enabled for all operations (maintaining backward compatibility), but users can now selectively enable it only for specific operations to optimize compilation time. Fixes #155718

## Usage Examples

```bash
# Enable CUTLASS for all operations (default behavior)
export TORCHINDUCTOR_CUTLASS_ENABLED_OPS="ALL"

# Enable CUTLASS only for matrix multiplication operations
export TORCHINDUCTOR_CUTLASS_ENABLED_OPS="mm,addmm"

# Enable CUTLASS only for batch operations
export TORCHINDUCTOR_CUTLASS_ENABLED_OPS="bmm,baddbmm"

# Disable CUTLASS for all operations
export TORCHINDUCTOR_CUTLASS_ENABLED_OPS=""
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155770 Approved by: https://github.com/henrylhtsang	2025-06-14 08:19:54 +00:00
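The semantics of the environment variable in the commit above ("ALL" enables everything, a comma-separated list enables only those ops, an empty string disables CUTLASS) could be interpreted roughly as follows. This is a sketch only: `is_cutlass_enabled_for` is a hypothetical helper, and the whitespace and case handling are assumptions rather than Inductor's documented behaviour.

```python
import os


def is_cutlass_enabled_for(op_name, raw_value):
    """Decide whether `op_name` should use CUTLASS given the raw setting.

    Hypothetical helper mirroring the usage examples: "ALL" enables every op,
    "" disables all of them, anything else is a comma-separated allow-list.
    """
    value = raw_value.strip()
    if value.upper() == "ALL":
        return True
    # Empty entries (e.g. from "" or trailing commas) are ignored.
    enabled = {op.strip() for op in value.split(",") if op.strip()}
    return op_name in enabled


# Read the variable named in the commit message, defaulting to "ALL".
raw = os.environ.get("TORCHINDUCTOR_CUTLASS_ENABLED_OPS", "ALL")
for op in ("mm", "addmm", "bmm", "baddbmm"):
    print(op, is_cutlass_enabled_for(op, raw))
```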
Sean McGovern	297805fd8f	Typo fixes for "overridden" in comments and function names (#155944 ) This word appears often in class descriptions and is not consistently spelled. Update comments and some function names to use the correct spelling consistently. Facilitates searching the codebase. Pull Request resolved: https://github.com/pytorch/pytorch/pull/155944 Approved by: https://github.com/Skylion007	2025-06-14 03:37:38 +00:00
Bin Bao	f151b20123	[AOTI] Remove the emit_current_arch_binary option (#155768 ) Summary: Remove the option as generating fatbin with PTX only doesn't work on H100, so switch to always include one PTX and one SASS for fatbin. Pull Request resolved: https://github.com/pytorch/pytorch/pull/155768 Approved by: https://github.com/angelayi	2025-06-13 02:06:07 +00:00
Brian Hirsh	a2b0b2698d	inductor codecache: include private inductor configs in cache key (#153672 ) Fixes https://github.com/pytorch/torchtitan/issues/1185 It looks like inductor's logic to include inductor configs in the cache key skips configs with a leading underscore by default. This came up in torchtitan - there's an asyncTP pipelining pass in inductor gated by a private config, and by not caching on the config we were attempting to use asyncTP when we shouldn't be. I'm not sure how worried we should be on the blast radius of this change. On the one hand: (1) it technically fixes any silent correctness issues in the cache around any other private inductor configs (it looks like there are a few) (2) there is some risk that there are some "harmless" configs that we are now including in the key, which may increase false negatives. I do see that there is an explicit list for "configs we want to ignore for caching" (`_save_config_ignore`), so my hope is that all harmless configs are already encapsulated there. Pull Request resolved: https://github.com/pytorch/pytorch/pull/153672 Approved by: https://github.com/oulgen	2025-06-11 01:33:24 +00:00
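Why skipping underscore-prefixed configs can cause a stale cache hit is easiest to see on a toy cache key. The sketch below is not the Inductor codecache logic: `config_cache_key`, the `include_private` switch and the config names `max_autotune`, `_private_pass` and `verbose_logging` are all made up for illustration; only the idea of an explicit ignore list (like `_save_config_ignore`) comes from the commit message.

```python
import hashlib
import json


def config_cache_key(config, ignore=frozenset(), include_private=True):
    """Hash a config dict into a cache key.

    Hypothetical helper: keys in `ignore` are always dropped; keys starting
    with "_" are dropped only when `include_private` is False (the old
    behaviour described in the commit message).
    """
    kept = {
        k: v
        for k, v in config.items()
        if k not in ignore and (include_private or not k.startswith("_"))
    }
    payload = json.dumps(kept, sort_keys=True, default=repr)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


base = {"max_autotune": False, "_private_pass": False, "verbose_logging": True}
toggled = dict(base, _private_pass=True)
ignore = frozenset({"verbose_logging"})

# Old behaviour: the private flag does not affect the key, so the cache collides.
print(config_cache_key(base, ignore, include_private=False)
      == config_cache_key(toggled, ignore, include_private=False))
# New behaviour: toggling the private flag changes the key.
print(config_cache_key(base, ignore) == config_cache_key(toggled, ignore))
```

Configs that are genuinely harmless belong in the ignore set, which keeps them from causing unnecessary cache misses.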
Michael Lazos	5dfe1787b5	[Inductor] Limit fusions to a node distance of 64 (#154688 ) fix for https://github.com/pytorch/pytorch/issues/154652 and https://fb.workplace.com/groups/1075192433118967/permalink/1484799079148049/ [window 128 dashboard run here w/ no regressions](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Sun%2C%2001%20Jun%202025%2006%3A38%3A41%20GMT&stopTime=Sun%2C%2008%20Jun%202025%2006%3A38%3A41%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(a100)&lBranch=mlazos/fuse-window&lCommit=8576f00ebfa53567d7bddc89d9882df9eb990561&rBranch=main&rCommit=9d59b516e9b3026948918e3ff8c2ef55a33d13ad) Pull Request resolved: https://github.com/pytorch/pytorch/pull/154688 Approved by: https://github.com/eellison, https://github.com/Raymo111	2025-06-10 07:32:23 +00:00
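A distance limit of this kind can be pictured as a filter on candidate node pairs by their position in the scheduling order. The following is a simplified sketch, not the scheduler's real code: `within_fusion_window`, `candidate_pairs` and `MAX_FUSION_DISTANCE` are hypothetical names, and treating the bound as inclusive is an assumption.

```python
MAX_FUSION_DISTANCE = 64  # value from the commit title


def within_fusion_window(index_a, index_b, max_distance=MAX_FUSION_DISTANCE):
    """True when two nodes are close enough in the node order to be considered.

    Assumes an inclusive bound; the real comparison may differ.
    """
    return abs(index_a - index_b) <= max_distance


def candidate_pairs(num_nodes, max_distance=MAX_FUSION_DISTANCE):
    """Enumerate (i, j) pairs with i < j that fall inside the window."""
    return [
        (i, j)
        for i in range(num_nodes)
        # Only look ahead as far as the window allows, instead of at every node.
        for j in range(i + 1, min(num_nodes, i + max_distance + 1))
    ]


print(within_fusion_window(0, 64), within_fusion_window(0, 65))
print(len(candidate_pairs(1000)), "pairs instead of", 1000 * 999 // 2)
```

Bounding the window makes the number of candidate pairs grow roughly linearly with graph size rather than quadratically.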


674 Commits