Summary:
As inductor provenance tracking is getting more use cases, we want to separate the inductor provenance tracking guarding flag from the general `trace.enabled`, so we can enable provenance tracking without all the overhead of `trace.enabled`
- change the guard flag from `trace.enabled` to `trace.provenance_tracking`. It is turned on by either `TORCH_COMPILE_DEBUG=1` or `INDUCTOR_PROVENANCE=1`.
- Move the provenance tracking logic and variables out of DebugContext, because DebugContext is only enabled with `trace.enabled`. Since the variables are now global variables, added `reset_provenance_globals()` context manager to reset them for each `compile_fx()` call.
- Move `set_kernel_post_grad_provenance_tracing` from `util.py` to `debug.py` so now all provenance related logic is in `debug.py`.
In the future, if we want to enable it further, we can change the provenance tracking flag to be enabled when `TORCH_TRACE` is set. I think we should do that in a separate PR, so it's easier to revert if this flag change creates any problem.
See more motivation in internal Diff
Test Plan:
```
buck2 run mode/dev-nosan fbcode//caffe2/test:fx -- -r test_graph_transform_observer
buck run mode/dev-nosan fbcode//caffe2/test:fx -- -r graph_provenance
buck2 run mode/dev-nosan fbcode//caffe2/test/inductor:provenance_tracing
```
Differential Revision: D78287976
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158399
Approved by: https://github.com/angelayi
**Problem:**
Fusion can accumulate large amount of reads, which leads to significant increase in peak memory utilization. Imagine we have the following code snippet
```
total = torch.rand(N, N)
for _ in range(r):
x = torch.rand(N, N)
total = total + x
```
The default execution is memory efficient as only two tensors of size N-by-N is in memory at any given time. However, with fusion, the additions are fused into a single operation and the execution becomes something like:
```
x_1 = torch.rand(N, N)
x_2 = torch.rand(N, N)
...
x_r = torch.rand(N, N)
total = x_1 + x_2 + ... + x_r
```
Though this is run-time efficient, in the case of large `N` and/or large `r`, this is not memory efficient.
[internal only] see [post](https://fb.workplace.com/groups/1075192433118967/permalink/1703374333634104/) for additional details
**Solution:**
Our proposed solution is to ban fusions in case where a large amount of reads are accumulated. This is in addition to some existing logics during torch compile.
* During lowering (i.e., `ir.py`), the config `realize_acc_reads_threshold`, which is default to be 8, controls _the number of_ buffers can be accumulated for a single operator. However, this is oblivious to the size of the buffers. Hence, we additionally introduce a config `realize_acc_reads_size_threshold` to control _the amount of buffers_ in size that can be accumulated.
* During scheduling (i.e., `scheduler.py`), additional fusion will be performed and thus we also need to capture such pattern there. The decisions are implemented under `choices.py`.
**Results:**
For a small example similar to be one in the test case (but with larger `N` and higher number of loop repeats), the memory snapshot before and after are shown below. Note the snapshot on the right is zoomed out so that the y-axis of the two snapshots match.
<img width="1328" alt="image" src="https://github.com/user-attachments/assets/670b5961-8454-4379-ae0f-62d4e7946c64" />
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157563
Approved by: https://github.com/jansel, https://github.com/mlazos
Summary: Add flag TORCHINDUCTOR_CPP_FORCE_INLINE_KERNEL to force inline the kernel function when TORCHINDUCTOR_CPP_FORCE_INLINE_KERNEL=1. It's disabled by default because force inlining may increase the build time.
Differential Revision: D77915987
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157949
Approved by: https://github.com/desertfire
**Problem:**
Fusion can accumulate large amount of reads, which leads to significant increase in peak memory utilization. Imagine we have the following code snippet
```
total = torch.rand(N, N)
for _ in range(r):
x = torch.rand(N, N)
total = total + x
```
The default execution is memory efficient as only two tensors of size N-by-N is in memory at any given time. However, with fusion, the additions are fused into a single operation and the execution becomes something like:
```
x_1 = torch.rand(N, N)
x_2 = torch.rand(N, N)
...
x_r = torch.rand(N, N)
total = x_1 + x_2 + ... + x_r
```
Though this is run-time efficient, in the case of large `N` and/or large `r`, this is not memory efficient.
[internal only] see [post](https://fb.workplace.com/groups/1075192433118967/permalink/1703374333634104/) for additional details
**Solution:**
Our proposed solution is to ban fusions in case where a large amount of reads are accumulated. This is in addition to some existing logics during torch compile.
* During lowering (i.e., `ir.py`), the config `realize_acc_reads_threshold`, which is default to be 8, controls _the number of_ buffers can be accumulated for a single operator. However, this is oblivious to the size of the buffers. Hence, we additionally introduce a config `realize_acc_reads_size_threshold` to control _the amount of buffers_ in size that can be accumulated.
* During scheduling (i.e., `scheduler.py`), additional fusion will be performed and thus we also need to capture such pattern there. The decisions are implemented under `choices.py`.
**Results:**
For a small example similar to be one in the test case (but with larger `N` and higher number of loop repeats), the memory snapshot before and after are shown below. Note the snapshot on the right is zoomed out so that the y-axis of the two snapshots match.
<img width="1328" alt="image" src="https://github.com/user-attachments/assets/670b5961-8454-4379-ae0f-62d4e7946c64" />
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157563
Approved by: https://github.com/jansel, https://github.com/mlazos
Change the default value of min_chunk_size from 4096 to 512 to allow more for loops to be parallelized.
I tested the Inductor benchmark with this PR on CPU, and saw ~10% improvement in torchbench geomean speedup, and no change in huggingface/timm_models. There are about 15 torchbench models with different degrees of performance improvement, among which functorch_dp_cifar10, opacus_cifar10, hf_Reformer, and pyhpc_turbulent_kinetic_energy have more than 50% performance improvement.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150762
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
**Problem:**
Fusion can accumulate large amount of reads, which leads to significant increase in peak memory utilization. Imagine we have the following code snippet
```
total = torch.rand(N, N)
for _ in range(r):
x = torch.rand(N, N)
total = total + x
```
The default execution is memory efficient as only two tensors of size N-by-N is in memory at any given time. However, with fusion, the additions are fused into a single operation and the execution becomes something like:
```
x_1 = torch.rand(N, N)
x_2 = torch.rand(N, N)
...
x_r = torch.rand(N, N)
total = x_1 + x_2 + ... + x_r
```
Though this is run-time efficient, in the case of large `N` and/or large `r`, this is not memory efficient.
[internal only] see [post](https://fb.workplace.com/groups/1075192433118967/permalink/1703374333634104/) for additional details
**Solution:**
Our proposed solution is to ban fusions in case where a large amount of reads are accumulated. This is in addition to some existing logics during torch compile.
* During lowering (i.e., `ir.py`), the config `realize_acc_reads_threshold`, which is default to be 8, controls _the number of_ buffers can be accumulated for a single operator. However, this is oblivious to the size of the buffers. Hence, we additionally introduce a config `realize_acc_reads_size_threshold` to control _the amount of buffers_ in size that can be accumulated.
* During scheduling (i.e., `scheduler.py`), additional fusion will be performed and thus we also need to capture such pattern there. The decisions are implemented under `choices.py`.
**Results:**
For a small example similar to be one in the test case (but with larger `N` and higher number of loop repeats), the memory snapshot before and after are shown below. Note the snapshot on the right is zoomed out so that the y-axis of the two snapshots match.
<img width="1328" alt="image" src="https://github.com/user-attachments/assets/670b5961-8454-4379-ae0f-62d4e7946c64" />
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157563
Approved by: https://github.com/jansel, https://github.com/mlazos
Refactor to allow TMA descriptors to be used in general codegen. TMA descriptors can only be generated if the conditions listed in the triton documentation for [make_tensor_descriptor](https://triton-lang.org/main/python-api/generated/triton.language.make_tensor_descriptor.html) are met.
Some implementation details:
- The `TMACompatibilityChecker` class holds and checks the conditions required for a load / store operation to be represented by a tma descriptor load / store
- The current TMA API requires that the innermost block size loads atleast 16 bytes of data. e.g. if the block shape is [YBLOCK, XBLOCK] and the tensor dtype is float32, this requires that XBLOCK >= 4. It is therefore required that the triton heuristics are aware of the minimum block sizes for the IO operations in the kernel. The minimum block sizes are determined in the `TMACompatibilityChecker` class and are passed to the triton heuristics when the block sizes are not static. The heuristic config options are then filtered to ensure that the minimum block size restriction is met.
Testing:
- Refactored test_torchinductor_strided_blocks.py to also test the `use_tensor_descriptor` option.
This requires an upgrade to Triton version 3.4.0: https://github.com/pytorch/pytorch/issues/154206
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157906
Approved by: https://github.com/jansel
Design doc: https://docs.google.com/document/d/1ncV7RpJ8xDwy8-_aCBfvZmpTTL824C-aoNPBLLVkOHM/edit?tab=t.0 (internal)
- Add codegen for static linkage
- refactor test code for test_compile_after_package tests
For now, the following options must be used together with `"aot_inductor.compile_standalone": True`.
"aot_inductor.package_cpp_only": True,
Will change `"aot_inductor.package_cpp_only"` to be automatically set to True in followup PR.
```
python test/inductor/test_aot_inductor_package.py -k test_compile_after_package
python test/inductor/test_aot_inductor_package.py -k test_run_static_linkage_model
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157129
Approved by: https://github.com/desertfire
Add env var AOT_INDUCTOR_ENABLE_LTO to enable clang's ThinLTO by setting AOT_INDUCTOR_ENABLE_LTO=1. The LTO is disabled by default because it may increase the build time.
Rollback Plan:
Differential Revision: D77899195
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157773
Approved by: https://github.com/desertfire
Summary: When `compile_standalone` is True, we set `package_cpp_only` to True as well. We raise an error if `package_cpp_only` is explicitly set to False in config.
Test Plan:
```
buck2 run mode/dev-nosan fbcode//caffe2/test/inductor:test_aot_inductor -- -r TestAOTInductorConfig
```
Rollback Plan:
Differential Revision: D77889754
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157731
Approved by: https://github.com/desertfire
Porting passes to bucket all_gathers
The main logic of the pass is done via
1. Searching for all all_gathers from the buckets
Copying tests from @wconstab PR to test compatibility with reordering.
Test checks only compatibility, as because of (3) the joint all_gather will be scheduled already as early as possible and no space for reordering.
Pass changes:
Using mutation ops to match performance of fsdp, in future the perfect scenario will be to have only functional graph, that inductor does all memory optimizations on its own without mutable ops.
Inductor changes:
Adding foreach_copy_ lowering
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157396
Approved by: https://github.com/wconstab
Summary:
cuBLAS used to have strict alignment requirements for TF32 usage, even if TF32 was enabled by users; this caused a numeric SEV in the past, when Triton would use TF32 even if cuBLAS could not due to failing the alignment checks
we believe that cuBLAS no longer has alignment requirements for TF32 usage, based on some testing in D77265581; we'd like to deprecate `force_same_precision` since it no longer functions as expected
changing the default to False in fbcode, guarded by a jk so that we can quickly revert to the original behavior if needed
Test Plan:
CI
Rollback Plan:
Differential Revision: D77265930
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156789
Approved by: https://github.com/jhadidjojo, https://github.com/masnesral
Users may want CUDAGraph for certain sizes and fallback for other sizes.
As discussed in Issue #121968, we would like to use cudagraph for [batch size [1,2,3,...,16]](https://github.com/pytorch/pytorch/issues/121968#issuecomment-2259942345) and fallback for others.
Another use case is [vllm](https://github.com/vllm-project/vllm/blob/main/vllm/compilation/cuda_piecewise_backend.py#L114-L119), where 67 batch sizes (i.e., [1,2,4,8,16,24,32,...,512]) are captured and all other sizes fallback.
This PR implements the feature with `torch._inductor.config.triton.cudagraph_capture_sizes`. When it is specified, we only capture cudagraph for these shapes. When it is None (by default), we capture cudagraph for all shapes.
Example:
```python
import torch
torch._inductor.config.triton.cudagraph_capture_sizes = [(2,3), (4,5), (6, 2), (7,3)]
def f(x):
return x + 1
f = torch.compile(f, mode="reduce-overhead", dynamic=False)
def run(batch_size, seq_len, d):
x = torch.randn((batch_size, seq_len, d), device="cuda")
# Need to mark the dimension as dynamic. Automated-dynamic
# may have some ux issues on matching `cudagraph_capture_sizes`
# with the actual dynamic shapes, since there are specialization and
# multiple dynamo graphs.
torch._dynamo.mark_dynamic(x, 0)
torch._dynamo.mark_dynamic(x, 1)
for _ in range(3):
f(x)
for i in range(2, 10):
for j in range(2, 10):
run(i, j, 8)
num_cudagraph = torch._inductor.cudagraph_trees.get_container(0).tree_manager.new_graph_id()
assert num_cudagraph.id == 4
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156551
Approved by: https://github.com/bobrenjc93
**Summary**
Add a configuration option to enable a smaller dequantization buffer for WOQ INT4 CPP GEMM template. This can improve the performance of the WOQ INT4 GEMM template in cases where M is small. In such scenarios, matrix B cannot be effectively reused across matrix A, and we found that reducing the Kc block size can lead to better performance.
**Test Plan**
```
python test/inductor/test_cpu_select_algorithm.py -k test_int4_woq_mm_with_small_buffer_config
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156395
Approved by: https://github.com/jansel
ghstack dependencies: #156407, #156387
We package the weights and save them in `data/weights/` (`WEIGHTS_DIR`). In addition, we store a `weights_config.json` in the model folder for each model to specify which weight file corresponding to which weight name.
Models can share weights. We dedup the weights based on their underlying storage (`tensor.untyped_storate()`).
- Use `"aot_inductor.package_constants_on_disk": True` config to produce the `Weights` in aot_compile
- If we see `Weights` in aoti_files, we'll automatically package them to disk
- `"aot_inductor.package_constants_on_disk"` config and `"aot_inductor.package_constants_in_so"` config work independently.
- Use `load_pt2(package_path, load_weights_from_disk=True)` to load the weights from disk. `load_weights_from_disk` defaults to False.
Test Plan:
```
buck2 run @//mode/dev-nosan //caffe2/test/inductor:aot_inductor_package -- -r "test_package_shared_weights"
```
Tested with whisper at https://github.com/pytorch-labs/torchnative/pull/7
Rollback Plan:
Differential Revision: D74747190
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155241
Approved by: https://github.com/desertfire
Summary:
# Why
speed up cutlass kernel generation and retrieval
# What
using the _ManifoldCache, make a KernelBinaryCache that uploads/downloads kernels and their error files. only register the handler internally
this is the OSS only part of the change, to facilitate integration
Test Plan:
## prove that we can upload successfully
```
buck2 run @mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```
```
manifold ls coconutruben-test-01/tree/cutlass_concept_2
673184 cfkykew2fw5572hjr4e7jbog7oix7xjkegtn2ovikyhxe6pr4tcw.so
649776 cpjqda67c6ojj75z3ddnmfbxinpm7yp7rc2q2oxwsrtwsnacklqv.so
```
## prove that we can download successfully
```
buck2 run @mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```
```
I0611 12:48:38.759000 935012 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:65] Successfully downloaded /var/tmp/torchinductor_coconutruben/fk/cfkykew2fw5572hjr4e7jbog7oix7xjkegtn2ovikyhxe6pr4tcw.so
I0611 12:48:38.760000 935012 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:65] Successfully downloaded /var/tmp/torchinductor_coconutruben/pj/cpjqda67c6ojj75z3ddnmfbxinpm7yp7rc2q2oxwsrtwsnacklqv.so
```
## prove that we can upload errors successfully
```
buck2 run @mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```
```
manifold ls coconutruben-test-01/tree/cutlass_concept_2
4846 cqiq4vjbvytdofutoxisa3pqjplgpgmt2sh7dtatiw4bqt5rtjgc.so.error
4846 cqymdwsfsirhkqglv7sbjyvqkrt3ryql4mtb45tekt76347ee6sx.so.error
```
## prove that we can download errors successfully
```
buck2 run @mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```
```
I0611 12:56:14.078000 1001022 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:74] Successfully downloaded /var/tmp/torchinductor_coconutruben/qi/cqiq4vjbvytdofutoxisa3pqjplgpgmt2sh7dtatiw4bqt5rtjgc.so.error
I0611 12:56:14.079000 1001022 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:74] Successfully downloaded /var/tmp/torchinductor_coconutruben/qy/cqymdwsfsirhkqglv7sbjyvqkrt3ryql4mtb45tekt76347ee6sx.so.error
```
## showing timing information
```
I0616 11:22:29.169000 2249769 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:71] Successfully downloaded /var/tmp/torchinductor_coconutruben/fk/cfkykew2fw5572hjr4e7jbog7oix7xjkegtn2ovikyhxe6pr4tcw.so (download: 0.842s, write: 0.000s, total: 0.842s)
I0616 11:22:29.169000 2249769 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:71] Successfully downloaded /var/tmp/torchinductor_coconutruben/pj/cpjqda67c6ojj75z3ddnmfbxinpm7yp7rc2q2oxwsrtwsnacklqv.so (download: 0.838s, write: 0.001s, total: 0.838s)
```
Reviewed By:
henrylhtsang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156248
Approved by: https://github.com/henrylhtsang
Added a new configuration option `cutlass_enabled_ops` that allows users to control which operations use CUTLASS lowerings. By default, CUTLASS is enabled for all operations (maintaining backward compatibility), but users can now selectively enable it only for specific operations to optimize compilation time.
**Fixes #155718**
## Usage Examples
```bash
# Enable CUTLASS for all operations (default behavior)
export TORCHINDUCTOR_CUTLASS_ENABLED_OPS="ALL"
# Enable CUTLASS only for matrix multiplication operations
export TORCHINDUCTOR_CUTLASS_ENABLED_OPS="mm,addmm"
# Enable CUTLASS only for batch operations
export TORCHINDUCTOR_CUTLASS_ENABLED_OPS="bmm,baddbmm"
# Disable CUTLASS for all operations
export TORCHINDUCTOR_CUTLASS_ENABLED_OPS=""
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155770
Approved by: https://github.com/henrylhtsang
This word appears often in class descriptions and is not consistently spelled. Update comments and some function names to use the correct spelling consistently. Facilitates searching the codebase.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155944
Approved by: https://github.com/Skylion007
Fixes https://github.com/pytorch/torchtitan/issues/1185
It looks like inductor's logic to include inductor configs in the cache key skips configs with a leading underscore by default. This came up in torchtitan - there's an asyncTP pipelining pass in inductor gated by a private config, and by not caching on the config we were attempting to use asyncTP when we shouldn't be.
I'm not sure how worried we should be on the blast radius of this change. On the one hand:
(1) it technically fixes any silent correctness issues in the cache around any other private inductor configs (it looks like there are a few)
(2) there is some risk that there are some "harmless" configs that we are now including in the key, which may increase false negatives. I do see that there is an explicit list for "configs we want to ignore for caching" (`_save_config_ignore`), so my hope is that all harmless configs are already encapsulated there.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153672
Approved by: https://github.com/oulgen