Summary:
This diff mainly adds code to dump an `inductor_triton_kernel_to_post_grad_nodes.json` artifact, which maps each Inductor Triton kernel to the post_grad graph nodes it was generated from:
`{"inductor_triton_kernel_name": [post_grad_node_0, post_grad_node_1, ..., ], "..."}.`
Example paste: P1695235000, verified on the test model (see "Test Plan" below).
We use this artifact to demonstrate provenance tracking in the frontend 3-tab highlighter tool:
https://github.com/YUNQIUGUO/compiler_explorer (the input files are copy/pasted for demo purposes for now; this will be integrated with Shangdi's tool into a 4-tab view)
https://pxl.cl/66BzK
Note: Currently this only supports mapping for Inductor's `TritonKernel` type. TODO: extend support to `ExternKernel` and other Inductor-generated kernel types.
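Since the artifact is plain JSON, it can be post-processed directly. Below is a minimal sketch (the file path and kernel/node names are hypothetical) that loads the dumped mapping and inverts it, so a post_grad node can be looked up to find the Triton kernels generated from it:
```python
import json
from collections import defaultdict

# Hypothetical location of the dumped artifact; in practice it lives in the
# inductor debug/trace output directory.
with open("inductor_triton_kernel_to_post_grad_nodes.json") as f:
    kernel_to_nodes = json.load(f)  # {"triton_kernel_name": ["post_grad_node_0", ...], ...}

# Invert the mapping: post_grad node -> list of Triton kernels it lowered into.
node_to_kernels = defaultdict(list)
for kernel, nodes in kernel_to_nodes.items():
    for node in nodes:
        node_to_kernels[node].append(kernel)

for node, kernels in sorted(node_to_kernels.items()):
    print(f"{node} -> {kernels}")
```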
Test Plan:
test_model_coverage.sh:
```
#!/bin/sh
MODEL_ENTITY_ID=644688112
SNAPSHOT_ID=32
MODULE=merge
# buck2 build --show-output mode/opt -c=python.package_style=inplace -c fbcode.enable_gpu_sections=true -c fbcode.platform=platform010 -c fbcode.split-dwarf=true -c fbcode.nvcc_arch=a100,h100 caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark
TORCH_COMPILE_DEBUG=1 CUDA_VISIBLE_DEVICES=0 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCH_LOGS="+inductor, schedule, fusion, output_code" TORCH_TRACE="tmp/guorachel_tt" TORCHINDUCTOR_MAX_AUTOTUNE=1 TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 ../buck-out/v2/gen/fbcode/d29ee94b913014f1/caffe2/torch/fb/model_transform/experimental/benchmark/__mts_gpu_benchmark__/mts_gpu_benchmark.par --model-path manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/gpu_lowering/input.predictor.disagg.gpu.merge --lower-backend AOT_INDUCTOR_EP --gpu-trace --aot-inductor-config="{'max_autotune': True}" 2>&1 | tee output.txt
```
{F1973765026}
```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:provenance_tracing -- --exact 'caffe2/test/inductor:provenance_tracing - test_triton_kernel_post_grad_mapping_aot_inductor (caffe2.test.inductor.test_provenance_tracing.TestProvenanceTracingArtifact)'
```
```
TORCH_LOGS="+inductor, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:provenance_tracing -- -r test_triton_kernel_post_grad_mapping_aot_inductor
```
Differential Revision: D66967510
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143055
Approved by: https://github.com/chenyang78
We found that [this commit](6eca0aee76) caused a ~6% performance drop in ViT INT8. This was due to changes to the `numbytes_hint` for `NoneLayout`. In this PR, we revert the `numbytes_hint` changes to allow more fusions. The following model is used to compare the generated code:
```
import torch


class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.dense = torch.nn.Linear(768, 768)
        self.layernorm = torch.nn.LayerNorm(768, eps=1e-12)

    def forward(self, context_layer, hidden_states):
        attention_output = self.dense(context_layer)
        hidden_states = attention_output + hidden_states
        layer_output = self.layernorm(hidden_states)
        return layer_output
```
The generated code before (left) and after (right) this PR is as follows:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141766
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jansel
Summary: With autotune_at_compile_time enabled, AOTI can now perform CUDA codegen in one pass. CUDA-kernel-related code is generated in a deferred way, after autotuning is done. This one-pass implementation eliminates issues caused by the disparity between passes in the previous two-pass implementation (which led to multiple bug reports in the past). The one-pass implementation also avoids cloning mutated inputs, which the two-pass implementation required, reducing GPU memory consumption.
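For context, a minimal sketch (not from this PR) of an AOTI compile that exercises this path; the option keys below are assumptions, so check `torch._inductor.config` for the exact names in your build:
```python
import torch

class M(torch.nn.Module):
    def forward(self, x, y):
        return torch.mm(x, y).relu()

model = M().cuda()
example_inputs = (
    torch.randn(256, 256, device="cuda"),
    torch.randn(256, 256, device="cuda"),
)

# AOT-compile with autotuning; with compile-time autotuning enabled,
# kernel selection and codegen both happen during this single compile step.
so_path = torch._export.aot_compile(
    model,
    example_inputs,
    options={
        "max_autotune": True,                     # autotune GEMM/Triton kernels
        "triton.autotune_at_compile_time": True,  # assumed knob: tune at compile time
    },
)
print("compiled shared library:", so_path)
```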
Differential Revision: [D66739414](https://our.internmc.facebook.com/intern/diff/D66739414)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141980
Approved by: https://github.com/chenyang78
# Summary
This PR adds an alternative triton lowering for _scaled_mm. This uses an updated mm template that utilizes persistent scheduling + TMAs on A and B matrices.
Limitations:
* This implementation does not work with bias values: 0602676c8d/torch/_inductor/kernel/mm_scaled.py (L106). The plan is to remove this workaround and enforce that both scaling and bias are properly applied as epilogues onto the existing templates.
* K dim must be 32 or greater for these to take effect
* Gated by a config flag (currently defaults to off; maybe it should be on).
## Testing
We don't have any tests exercising this code in CI/CD, but I updated the relevant tests in test_fp8 and they are all green:
<img width="1680" alt="Screenshot 2024-12-05 at 7 24 07 PM" src="https://github.com/user-attachments/assets/9c520541-d97a-416f-9af7-e68b366ec90f">
## Follow Ups
* Work to update the base mm triton templates and utilize the same template from mm/addmm/scaled_mm w/ respective epilogues
* Tuning on Persistent kernel configs. I found ones that work for my problem shapes but need to do some more NCU work
### Some profiling code I was using
Code I am using to iterate with:
```Python
import torch
from dataclasses import dataclass
from jsonargparse import CLI
import logging
from pathlib import Path
from transformer_nuggets.utils.benchmark import ProfileConfig, profile_function
from torchao.float8.inference import (
    addmm_float8_unwrapped_inference,
    preprocess_data,
    Float8MMConfig,
)
from transformer_nuggets.fp8.fp8_matmul import (
    matmul_persistent,
    matmul_tma_persistent,
    matmul_device_tma_persistent,
)
from enum import Enum

logging.getLogger("transformer_nuggets").setLevel(logging.INFO)


class FP8Kernel(Enum):
    PERSISTENT = "Persistent"
    PERSISTENT_TMA = "Persistent-TMA"
    DEVICE_TMA = "Device-TMA"
    SCALED_MM = "Scaled-MM"


class ScalingStrategy(Enum):
    PER_TENSOR = "PerTensor"
    PER_ROW = "PerRow"


@dataclass(frozen=True)
class ExperimentConfig:
    M: int
    K: int
    N: int
    scaling_strategy: ScalingStrategy
    fp8_kernel: FP8Kernel
    compile: bool


def get_fp8_matmul(
    A: torch.Tensor,
    B: torch.Tensor,
    scaling_strategy: ScalingStrategy,
    fp8_kernel: FP8Kernel,
):
    A_fp8 = A.to(torch.float8_e4m3fn)
    B_fp8 = B.to(torch.float8_e4m3fn)
    A_fp8, B_fp8 = preprocess_data(A_fp8, B_fp8, Float8MMConfig(use_fast_accum=True))

    if scaling_strategy == ScalingStrategy.PER_TENSOR:
        a_scale = torch.tensor(1, device="cuda", dtype=torch.float32)
        b_scale = torch.tensor(1, device="cuda", dtype=torch.float32)
    elif scaling_strategy == ScalingStrategy.PER_ROW:
        a_scale = torch.ones((A_fp8.size(0), 1), device="cuda", dtype=torch.float32)
        b_scale = torch.ones((B_fp8.size(1), 1), device="cuda", dtype=torch.float32).T
    else:
        raise ValueError(f"Invalid scaling strategy: {scaling_strategy}")

    assert fp8_kernel == FP8Kernel.SCALED_MM
    return lambda: addmm_float8_unwrapped_inference(
        A_fp8, a_scale, B_fp8, b_scale, output_dtype=torch.bfloat16, use_fast_accum=True
    )


def run_matmul(config: ExperimentConfig):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    A = torch.randn(config.M, config.K, device=device, dtype=torch.bfloat16)
    B = torch.randn(config.K, config.N, device=device, dtype=torch.bfloat16)

    fp8_matmul = get_fp8_matmul(A, B, config.scaling_strategy, config.fp8_kernel)

    if config.compile and config.fp8_kernel == FP8Kernel.SCALED_MM:
        fp8_matmul = torch.compile(fp8_matmul, mode="max-autotune-no-cudagraphs")

    _ = fp8_matmul()
    return


def main():
    torch.random.manual_seed(123)
    # Define your experiment configuration here
    config = ExperimentConfig(
        M=8192,
        K=8192,
        N=8192,
        scaling_strategy=ScalingStrategy.PER_TENSOR,
        fp8_kernel=FP8Kernel.SCALED_MM,
        compile=True,
    )
    run_matmul(config)


if __name__ == "__main__":
    CLI(main)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142045
Approved by: https://github.com/eellison
- Add in upcast_compute_type on creation of new tensors (loads, constants)
- Fixes index_expr - right now we are somewhat inconsistent in dtype and don't always respect the dtype specified. It would be nice to fix, but I'm not doing it in this PR.
- Bug fix in view dtype, where we were always upcasting back to fp32 when the input was in bf16/fp16; we should only be doing that if the output is also in bf16/fp16.
- For masked, avoid calling dtype propagation and just use the output dtype.
Turns on the runtime dtype verification for opinfo tests. The separate test file is still useful because we can use it to test turning off codegen_upcast_to_fp32.
Follow ups:
- We could consider requiring fewer explicit upcast_compute_type calls and doing it automatically. That would potentially make things easier but be less flexible in the future. Maybe I should have done that in this PR.
- Be more consistent in our index expr dtype printing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141495
Approved by: https://github.com/blaine-rister, https://github.com/arui-meta, https://github.com/ezyang
ghstack dependencies: #139945, #140057
* Automatically applies ruff rule 401. Turns loops into equivalent list comprehensions, which are faster and do not leak the loop variables into the enclosing scope (see the sketch after this list).
* List comprehensions not only often have better typing, but are also 50%+ faster than for loops in terms of overhead. They also preserve length information, etc., and are easier for the interpreter to optimize.
* Manually went back and made mypy happy after the change.
* Also fixed style lints in files covered by flake8 but not by pyfmt
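For illustration, a small hedged example (hypothetical data, not taken from the PR) of the kind of rewrite this rule performs:
```python
items = ["alpha", "beta", "gamma", "delta"]

# Before: an append loop that also leaks `item` into the enclosing scope.
upper = []
for item in items:
    if len(item) > 4:
        upper.append(item.upper())

# After: an equivalent list comprehension with less interpreter overhead.
upper = [item.upper() for item in items if len(item) > 4]
print(upper)  # ['ALPHA', 'GAMMA', 'DELTA']
```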
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140980
Approved by: https://github.com/justinchuby, https://github.com/malfet
Fixes https://github.com/pytorch/pytorch/issues/140462.
Horace found that when we implicitly fall back to eager, some eager kernels may not work correctly if Inductor provides non-contiguous inputs (due to padding, etc.). The original issue was found in the backward op of weight_norm. The fix in this PR is a general one: we force the inputs to all implicit fallback kernels to be contiguous.
I had to refactor the code a bit to make this work. Previously we applied layout constraints in `GraphLowering.run_node`, but we look for implicit fallbacks in `call_function`. The problem is that when we set up an implicit fallback in `call_function` with a layout constraint, we never get a chance to apply that constraint. The refactor moves the code that applies layout constraints into `call_function`.
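A conceptual sketch of the idea behind the fix (simplified, not Inductor's actual layout-constraint machinery): before an implicit fallback dispatches to its eager kernel, every tensor argument is forced to be contiguous.
```python
import torch

def call_fallback_with_contiguous_inputs(eager_kernel, *args, **kwargs):
    # Force-contiguous wrapper: eager kernels that silently assume contiguous
    # inputs then behave correctly even when Inductor padded/strided the data.
    def to_contig(x):
        return x.contiguous() if isinstance(x, torch.Tensor) else x

    args = tuple(to_contig(a) for a in args)
    kwargs = {k: to_contig(v) for k, v in kwargs.items()}
    return eager_kernel(*args, **kwargs)

# Usage sketch with a padded (non-contiguous) input:
x = torch.randn(8, 10)[:, :6]  # non-contiguous view
out = call_fallback_with_contiguous_inputs(torch.ops.aten.relu.default, x)
print(out.shape, out.is_contiguous())
```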
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140996
Approved by: https://github.com/jansel
Adding some `dynamo_timed` annotations for the purpose of better understanding AOTI compilation time.
This will probably require a few more passes. A lot of time is spent in Scheduler.__init__, and not enough annotations are there yet.
run_command_and_check takes a lot of time as well, but there is probably not much we can do there. Maybe we can add a config to tune the C++ optimization level?
traces:
<img width="1205" alt="Screenshot 2024-11-08 at 4 41 10 PM" src="https://github.com/user-attachments/assets/61645264-b3af-4d4a-804d-700b0f831c7c">
Differential Revision: D65554141
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140198
Approved by: https://github.com/desertfire
Here's the overview:
There's a new contextmanager singleton called MetricsContext. Entering the MetricsContext is how we demarcate the boundary on which we'll create a single CompilationMetrics object, and therefore, a single dynamo_compile log entry. While we're inside the MetricsContext, we can update/set many different metrics. Most importantly: `dynamo_timed` can also update the in-progress MetricsContext. In the proposal here, we tell `dynamo_timed` that we want it to do so by providing the name of the MetricsContext field to increment. There can be many `dynamo_timed` calls in different parts of the code updating different fields. Then when the MetricsContext exits, that's when the logging of everything gathered finally happens. One potential footgun is trying to use `dynamo_timed` when we haven't entered the MetricsContext, but we assert on that problem. Another problem is that we re-enter the context recursively, but we watch for that and do the logging only when the outermost exits.
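A minimal sketch of those semantics (simplified; not the actual torch._dynamo implementation): updates are only legal while the context is active, re-entrant entry is tolerated, and a single log call is made when the outermost context exits.
```python
class MetricsContextSketch:
    def __init__(self, on_exit):
        self._on_exit = on_exit  # e.g. record_compilation_metrics
        self._depth = 0
        self._metrics = {}

    def __enter__(self):
        if self._depth == 0:
            self._metrics = {}  # fresh metrics record per outermost entry
        self._depth += 1
        return self

    def __exit__(self, exc_type, exc, tb):
        self._depth -= 1
        if self._depth == 0:
            self._on_exit(self._metrics)  # one dynamo_compile-style log entry

    def increment(self, field, value):
        # What a dynamo_timed-style caller would invoke, naming the field to bump.
        assert self._depth > 0, "cannot update metrics outside the MetricsContext"
        self._metrics[field] = self._metrics.get(field, 0) + value

# Usage sketch: nested entries accumulate into one logged record.
ctx = MetricsContextSketch(on_exit=print)
with ctx:
    ctx.increment("dynamo_time_us", 120)
    with ctx:  # recursive re-entry is tolerated; no extra log entry
        ctx.increment("inductor_time_us", 450)
# prints once: {'dynamo_time_us': 120, 'inductor_time_us': 450}
```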
Some specifics:
* Introduce MetricsContext - a context manager that on exit, records the CompilationMetrics (which also logs to dynamo_compile).
* Completely remove the concept of frame_phase_timing. Instead, update the MetricsContext during compilation, either directly or via dynamo_timed.
* Remove some globals we previously used to accumulate counters to later populate a CompilationMetrics. We use CompilationMetrics set/update/increment APIs instead.
* `record_compilation_metrics` is now called on exit from MetricsContext.
* Populate legacy CompilationMetrics fields right before logging, inside `record_compilation_metrics`.
* Remove the one-off `add_remote_cache_time_saved` helper; capture that timing directly into the MetricsContext.
And specifically, several changes to dynamo_timed:
* "Modernize" the parameters and update all callsites accordingly.
* Move the backwards logging of the CompilationMetrics to the backwards compile location.
* Add a parameter for which CompilationMetrics field to update
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139849
Approved by: https://github.com/ezyang
Summary:
Recent tries with the bandwidth profiler did not behave as expected. I have observed a few issues and tried to fix them in this diff:
1. The return value of the DebugAutotuner class.
2. Profiling results show a really large overhead: DebugAutotuner.run() returns a benchmark time of around 45ms, while CachingAutotuner.run() returns around 0.45ms. The `_find_names` and `re.match` calls account for the 45ms: P1669186358. After commenting out `_find_names` and `re.match`, the benchmark time becomes consistent with the non-profiling mode: P1669185589.
3. Introduce a variable `bandwidth_info` to control the path taken in DebugAutotuner.run(); during benchmarking for configuration selection, `bandwidth_info` should be turned off (see the sketch below).
After applying this diff, the profiling issues mentioned above are fixed: P1669273172
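A conceptual sketch of point 3 (simplified stand-in class, not the actual Inductor autotuner): the expensive name-lookup/reporting path is only taken when `bandwidth_info` is set, so plain configuration-selection benchmarking stays cheap.
```python
class DebugAutotunerSketch:
    def run(self, *launch_args, bandwidth_info=True):
        elapsed_ms = self._benchmark_kernel(*launch_args)
        if bandwidth_info:
            # Only resolve kernel names (the _find_names / re.match work that
            # accounted for the ~45ms overhead) when actually reporting bandwidth.
            for name in self._find_kernel_names():
                print(f"{name}: {elapsed_ms:.3f} ms")
        return elapsed_ms

    def _benchmark_kernel(self, *launch_args):
        return 0.45  # placeholder timing for the sketch

    def _find_kernel_names(self):
        return ["triton_poi_fused_example_0"]  # hypothetical kernel name

# Configuration selection: bandwidth_info off, so no expensive reporting.
DebugAutotunerSketch().run(bandwidth_info=False)
# Final profiling pass: bandwidth_info on, names resolved once for the report.
DebugAutotunerSketch().run(bandwidth_info=True)
```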
Test Plan:
```
TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_PROFILE=1 TORCHINDUCTOR_PROFILE_OUTPUT=~/tmp/profile.txt TORCH_LOGS='+inductor,+schedule,output_code' TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 TORCHINDUCTOR_BENCHMARK_KERNEL=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 CUDA_VISIBLE_DEVICES=5 buck run mode/{opt,inplace} scripts/wwei6/triton_examples:test_mat 2>&1 | tee profiling-5.log
```
If we want to disable the ATen backend, just add TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_BACKENDS="TRITON".
Differential Revision: D64883079
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139607
Approved by: https://github.com/chenyang78
- Remove "mypy: allow-untyped-defs" and mark functions individually with "no-untyped-def"
- Mark some trivial functions with the proper return types (`None` and `torch.dtype`); see the sketch after this list.
- Fixed a type bug in the signature of supported_dtype_of_cpp_wrapper()
- Ran `ruff check torch/_inductor/ir.py --select ANN --fix --unsafe-fixes` and then fixed up things that looked incorrectly applied.
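For illustration, a small hedged sketch (hypothetical function names, not from ir.py) of the style of change: trivial functions gain explicit annotations, and anything left untyped is suppressed per-function instead of via a file-level directive.
```python
import torch

def dtype_of(t: torch.Tensor) -> torch.dtype:
    # Trivially annotatable: explicit torch.dtype return type.
    return t.dtype

def reset_cache(cache: dict) -> None:
    # Trivially annotatable: explicit None return type.
    cache.clear()

def still_untyped(node):  # type: ignore[no-untyped-def]
    # Not yet annotated; suppressed individually rather than with a
    # file-level "mypy: allow-untyped-defs" comment.
    return node
```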
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139238
Approved by: https://github.com/Skylion007, https://github.com/ezyang