Commit Graph

1215 Commits

Author SHA1 Message Date
Tugsbayasgalan Manlaibaatar
5790b00975 [RELAND] Close some sources of fake tensor leakage (#161589)
Reland of https://github.com/pytorch/pytorch/pull/159923

Couple of fixes:
1. When we run into an operation we didn't proxy, we end up emitting fake constants. We detect this and warn using the FQN of the lifted constant. We warn because some internal users complained it was regressing their exportability.

2. Previous attribute mutation detection logic in non-strict didn't account for nested module structure. This fixes silent incorrectness issue of exporting esm and qwen in non-strict

3. We modify yolov3 to fix the previous silent incorrect behaviour
4. We use strict export for levit_128 because it errors in non-strict due to more strict side effect checking

When upgrading torchbench pin, opacus_cifar10 seems to not run on eager anymore. I verified this by pushing a temporary PR on master with new pin. So i added it to expect_fail list.

Differential Revision: [D81133908](https://our.internmc.facebook.com/intern/diff/D81133908)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161589
Approved by: https://github.com/avikchaudhuri
2025-08-28 09:46:42 +00:00
rzou
199c3633bf Fix Inductor Periodic (#161617)
Models are now passing accuracy. # of graph breaks is larger because
these were not actually tested in CI (if the model fails accuracy we
do not assert on # of graph breaks).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161617
Approved by: https://github.com/anijain2305
2025-08-28 02:36:08 +00:00
Animesh Jain
07a4e9fea8 [benchmarks] Skip mobilenetv3_large_100 in CI for accuracy (#161570)
To keep the CI green - https://github.com/pytorch/pytorch/issues/161419

Its unclear if this is a real failure. And debugging it is non trivial.
Skipping for now to keep the CI greenst

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161570
Approved by: https://github.com/BoyuanFeng, https://github.com/zou3519
2025-08-27 03:44:04 +00:00
rzou
4e19c1906a Get Inductor periodic CI green (#161297)
I'll file hi-pri issues for the things that need looking into.

Test Plan:
- wait for CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161297
Approved by: https://github.com/angelayi
2025-08-26 00:49:49 +00:00
Yiming Zhou
9d882fd9ff [benchmark] Add torchscript jit.trace to benchmark option (#161223)
For comparing NativeRT and TorchScript. We add `torchscript-jit-trace` as an option in the benchmark. With this option, we can run trace a model and run inference with the traced module using TorchScript interpreter

```
python ./benchmarks/dynamo/huggingface.py --performance --inference --torchscript-jit-trace

python ./benchmarks/dynamo/timm_models.py --performance --inference --torchscript-jit-trace

python ./benchmarks/dynamo/torchbench.py --performance --inference --torchscript-jit-trace
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161223
Approved by: https://github.com/huydhn
2025-08-22 21:38:28 +00:00
Will Feng
8047cde0f3 Try to fix Inductor CI periodic tests (#160932)
- hf_Reformer: this one starts failing due to increased graph breaks due to transformers pin bump (#159291). We can likely just bump the expected graph break count.
- dla102: this one starts timing out on 8/13 Wed between commit 6e8865f and ee1b041. But based on the PT2 dashboard, this model actually doesn't have compile time or runtime regression. Will try to bump up the timeout and see if it can work.
- hf_BigBird: this one has its accuracy status improved since today. Will update hf_BigBird accuracy status.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160932
Approved by: https://github.com/zou3519, https://github.com/huydhn, https://github.com/malfet
2025-08-20 20:36:46 +00:00
Nichols A. Romero
0298ebc97a [ROCm][inductor][dashboard] Add GPT2ForSequenceClassification to use_larger_multiplier_for_smaller_tensor list (#160001)
GPT2ForSequenceClassification Hugging Face (HF) model fails on ROCm for bfloat16. The failure is numerically small.  This PRs adds this model to an exception list for small tensors. The exception list already includes two models. This increases the multiplier factor to 10.0 instead of 3 (default) for this model used in `torch/_dynamo/utils.py`.

In the PR comment below, I include a short analysis of the numerics.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160001
Approved by: https://github.com/anijain2305, https://github.com/jataylo, https://github.com/jeffdaily
2025-08-18 15:33:30 +00:00
Sun, Jiayi
95e456fcc5 [inductor] pack linear for FP32 dynamic mode (#157542)
Summary:
Currently, Linear in FP32 dynamic mode(batch_size has free symbols) does not support weight prepacking since MKL Linear does not support dynamic mode. This PR uses oneDNN Linear to support Linear weight prepacking in FP32 dynamic mode.
I tested the Inductor benchmark in FP32 dynamic mode on CPU using this PR, and saw ~8% improvement in timm_models geomean speedup, ~2%  improvement in torchbench geomean speedup, and no change in huggingface. There are about 18 models with different degrees of performance improvement, among which BERT_pytorch, soft_actor_critic, BlenderbotForCausalLM, ElectraForCausalLM, crossvit_9_240, mobilevit_s, twins_pcpvt_base have more than 20% performance improvement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157542
Approved by: https://github.com/CaoE, https://github.com/jansel
2025-08-18 10:18:46 +00:00
Laith Sakka
6f0f4e0c3e reduce threshold to suggest changes to expected results (#160463)
Since we increase threshold to 10% i would like suggestions to show up to update those +-2% instead of 3.3% now

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160463
Approved by: https://github.com/jamesjwu
2025-08-14 09:11:27 +00:00
Laith Sakka
dd21c8a578 refresh expected results (#160537)
regression introduced  by https://github.com/pytorch/pytorch/pull/160314
not much worried about it since it did not effect other inductor benchmarks could not repo locally

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160537
Approved by: https://github.com/eellison
2025-08-14 00:56:14 +00:00
Laith Sakka
96bd33b2de Fix get_free_symbol_uses for several nodes (#160314)
get_free_symbol_uses is used to know what unbacked symbols are used by a given node.
not having correct get_free_symbol_uses defined properly leads to :

- eliminating of some nodes due to not detection of any users. (See the added unit test)
- Incorrect topological sort.

Fix get_free_symbol_uses , NopKernel , ConcarKernel, InputsKerenl, external kernel.
for ComputedBuffer with NonOwningLayout its interesting case.
when layout is NonOwningLayout we need to access the actual view op base layout and use
detect symbols in it. Because when we codegen the ComputedBuffer we uses those symbols.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160314
Approved by: https://github.com/eellison
2025-08-13 12:28:29 +00:00
Animesh Jain
01bcf9a40d Bump transformers pin (#159291)
Trying to update hf pin.

Benchmarking run to figure out issues

<img width="1356" height="123" alt="image" src="https://github.com/user-attachments/assets/fbc435f3-a7cb-4280-9636-2ea6d15d7b6d" />

Retrying - https://github.com/pytorch/pytorch/pull/156118

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159291
Approved by: https://github.com/BoyuanFeng, https://github.com/huydhn

Co-authored-by: Huy Do <huydhn@gmail.com>
2025-08-12 05:14:17 +00:00
Yiming Zhou
017259f9c6 [benchmarks] Add nativert benchmark (#159922)
Add NativeRT as an option in the PT2 OSS benchmark

```
python ./benchmarks/dynamo/huggingface.py --performance --inference --export-nativert

python ./benchmarks/dynamo/timm_models.py --performance --inference --export-nativert

python ./benchmarks/dynamo/torchbench.py --performance --inference --export-nativert
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159922
Approved by: https://github.com/angelayi
2025-08-08 03:38:32 +00:00
Laith Sakka
1bb5e6c076 update expected results (#159867)
refresh due to https://github.com/pytorch/pytorch/pull/159696

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159867
Approved by: https://github.com/masnesral
2025-08-07 01:18:36 +00:00
Laith Sakka
978e3a9142 refresh expected results (#159727)
Just regular update due to recent <10% changes CI is stable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159727
Approved by: https://github.com/anijain2305
2025-08-03 22:47:50 +00:00
Aaron Orenstein
3f86076775 gc before warming up benchmarking (#159670)
#158649 turned off automatic GCs during cudagraph recording. This is causing a small uptick in some internal benchmark numbers because of memory the benchmark is leaving around before the benchmark starts - so GC before warming up the model.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159670
Approved by: https://github.com/oulgen
2025-08-02 19:37:24 +00:00
Huy Do
465fe4d9f7 Enable sample nightly PT2 benchmark on B200 (#158011)
Per the discussion with @nWEIdia, this resumes the work on https://github.com/pytorch/pytorch/pull/157870 to enable PT2 benchmark on B200

### Testing

https://github.com/pytorch/pytorch/actions/runs/16615101382

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158011
Approved by: https://github.com/nWEIdia, https://github.com/atalman
2025-08-01 23:47:44 +00:00
LifengWang
838924436e update the baseline for nightly max_autotune tests (#154973)
Hi @desertfire, according to the latest test [results](https://github.com/pytorch/pytorch/actions/runs/15385952839) from the inductor nightly for max_autotune tests, we plan to update the baseline data:

In the latest nightly test, two models require baseline updates:

- vision_maskrcnn: This model shows improved graph breaks, so I’ve updated the baseline accordingly.
- detectron2_fcos_r_50_fpn: This model has a different number of graph breaks. However, since its accuracy result still shows fail_accuracy, so I skipped the graph break check for this model.

```
vision_maskrcnn                     IMPROVED:           graph_breaks=29, expected=30
Improvement: 1 models have fixed dynamo graph breaks:
    vision_maskrcnn
```

```
detectron2_fcos_r_50_fpn            XFAIL
detectron2_fcos_r_50_fpn            FAIL:               graph_breaks=24, expected=22
Error: 1 models have new dynamo graph breaks:
    detectron2_fcos_r_50_fpn
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154973
Approved by: https://github.com/desertfire
2025-07-31 11:38:55 +00:00
Arsh Zahed
24d07b3a67 [inductor] Fix mm decomposition evaluating symints (#158998)
Fixes #154111

Resolves an issue during compilation with dynamic shapes where `torch._inductor.decomposition.mm` evaluates the SymInt expression for the input tensor due to a for loop, and thus the output tensor is not dynamically shaped. This issue is limited to (Mx1)x(1xN) small matrix multiplications, and creates an explicit error with tensor subclasses such as DTensor.

The proposed fix replaces the loop with a simple product instead. Benchmark currently running https://hud.pytorch.org/benchmark/compilers

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158998
Approved by: https://github.com/jansel, https://github.com/BoyuanFeng
2025-07-30 16:34:15 +00:00
Animesh Jain
8c0c5c58c7 [benchmarks] Set model name early to keep warmup and main model same (#159231)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159231
Approved by: https://github.com/williamwen42
ghstack dependencies: #159209
2025-07-28 18:18:16 +00:00
Xuehai Pan
f5e2de928b [BE] fix remaining flake8 v7 warnings (#159044)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159044
Approved by: https://github.com/Skylion007
ghstack dependencies: #159043
2025-07-25 02:56:34 +00:00
James Wu
f55c5d085e [Precompile] Various small bugfixes, add CachingPrecompile to torchbench (#158847)
This PR addresses a few small bugfixes needed to make NanoGPT inference work, and also adds a new `--caching-precompile` argument to torchbench. With `--caching-precompile`, after every benchmark we save precompile artifacts to DynamoCache, allowing us to test caching precompile on all existing benchmarks.

The following bugfixes are in this PR to make all of this work:
- Fix global variables being pruned with DUPLICATE_INPUT guards. DUPLICATE_INPUT guards have additional vars from the second input, which we track with additional_local_vars, but we never tracked additional global variables. This fixes the issue. (See torch/_dynamo/guards.py changes)
- Return None from PRecompileContext.serialize() if no new dynamo compiles occurred. There's no reason to save artifacts (i.e. autotuning artifacts, etc) if no dynamo_compile occurred, so we return None early. We may later want to support editing existing dynamo artifacts as a TODO, but that's upcoming.
- log `dynamo_start` on CompilePackage.load: This is only needed so that tlparse doesn't ignore TORCH_TRACE logs generated when caching precompile hits. If there are no actual compiles, we never log a "dynamo_start" entry, which makes internal tlparse ignore the TORCH_TRACE file.

## Test Plan

After this PR, the following now works:
```
TORCH_LOGS=dynamo tlp python benchmarks/dynamo/torchbench.py --only nanogpt --performance  --inference --backend inductor  --caching-precompile --warm-start-latency
```
tlparse result (internal):
Cold Start (6 seconds):
https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpAWe0zD/dedicated_log_torch_trace_vk9nkp4m.log/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

Warm Start (~1 s):
https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpAWe0zD/dedicated_log_torch_trace_5l4iwrpm.log/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

The 1 second of warm start here can be improved: the costs here are mostly in starting up workers and triton and initializing CUDA, a lot of which should not be included in the compile time cost in real world scenarios where these are already loaded before training begins.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158847
Approved by: https://github.com/zhxchen17
2025-07-24 14:09:54 +00:00
Aditya Tewari
7001d6fbc9 Skip slow tests for aarch64-inductor-benchmarks (#158842)
This PR suggests adding some models to `cpu_skip_list` which are currently being run in TIMM and Torchbench.
The suggested models takes a long time which leads to the benchmark runs being `timeout`.  [benchmark runs for aarch64](https://github.com/pytorch/pytorch/actions/workflows/inductor-perf-test-nightly-aarch64.yml)

•	The issue stems from unoptimized groupwise convolution (BF16 /F16 dtype) kernels for aarch64 platforms  , which significantly slow down execution leading to the timeout.
**Action:**
•	An optimized BF16 groupwise convolution kernel is currently being developed in oneDNN, targeted for release in Q4 2025.

To maintain dashboard consistency and signal clarity, I’ve skipped the affected tests in:
      * timm benchmarks
      * torchbench benchmarks

 As suggested, skip is applied at the CPU - arch level, explicitly branching for aarch64 and adding models which needs to be skipped. This keeps the logic clean, but:
•	An alternative considered was increasing shard counts for aarch64 runners, but given the known performance bottleneck, skipping avoids wasted compute cycles. Suggestions around this will be appreciated.

Benchmark does not timeout after the suggested change: https://github.com/pytorch/pytorch/actions/runs/16447200138

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158842
Approved by: https://github.com/malfet
2025-07-24 00:21:38 +00:00
Guilherme Leobas
b67f97c166 Correctly handle OP_CONTAINS (#158660)
CPython can fallback to `__iter__` if object doesn't implement
`__contains__`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158660
Approved by: https://github.com/zou3519
2025-07-23 22:31:51 +00:00
PyTorch MergeBot
76be282e3a Revert "[Precompile] Various small bugfixes, add CachingPrecompile to torchbench (#158847)"
This reverts commit d898d0d437.

Reverted https://github.com/pytorch/pytorch/pull/158847 on behalf of https://github.com/jithunnair-amd due to Broke ROCm CI jobs on MI200 and MI300 ([comment](https://github.com/pytorch/pytorch/pull/158847#issuecomment-3109664713))
2025-07-23 18:25:46 +00:00
James Wu
d898d0d437 [Precompile] Various small bugfixes, add CachingPrecompile to torchbench (#158847)
This PR addresses a few small bugfixes needed to make NanoGPT inference work, and also adds a new `--caching-precompile` argument to torchbench. With `--caching-precompile`, after every benchmark we save precompile artifacts to DynamoCache, allowing us to test caching precompile on all existing benchmarks.

The following bugfixes are in this PR to make all of this work:
- Fix global variables being pruned with DUPLICATE_INPUT guards. DUPLICATE_INPUT guards have additional vars from the second input, which we track with additional_local_vars, but we never tracked additional global variables. This fixes the issue. (See torch/_dynamo/guards.py changes)
- Return None from PRecompileContext.serialize() if no new dynamo compiles occurred. There's no reason to save artifacts (i.e. autotuning artifacts, etc) if no dynamo_compile occurred, so we return None early. We may later want to support editing existing dynamo artifacts as a TODO, but that's upcoming.
- log `dynamo_start` on CompilePackage.load: This is only needed so that tlparse doesn't ignore TORCH_TRACE logs generated when caching precompile hits. If there are no actual compiles, we never log a "dynamo_start" entry, which makes internal tlparse ignore the TORCH_TRACE file.

## Test Plan

After this PR, the following now works:
```
TORCH_LOGS=dynamo tlp python benchmarks/dynamo/torchbench.py --only nanogpt --performance  --inference --backend inductor  --caching-precompile --warm-start-latency
```
tlparse result (internal):
Cold Start (6 seconds):
https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpAWe0zD/dedicated_log_torch_trace_vk9nkp4m.log/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

Warm Start (~1 s):
https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpAWe0zD/dedicated_log_torch_trace_5l4iwrpm.log/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

The 1 second of warm start here can be improved: the costs here are mostly in starting up workers and triton and initializing CUDA, a lot of which should not be included in the compile time cost in real world scenarios where these are already loaded before training begins.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158847
Approved by: https://github.com/zhxchen17
2025-07-23 15:06:54 +00:00
Boyuan Feng
a155f742ad [benchmark] allow default mode for compile (#158792)
Allow default mode for compile when users cannot run "max-autotune-no-cudagraphs" due to compilation time. Overall, "default" mode is slower than "[max-autotune-no-cudagraphs](https://github.com/pytorch/pytorch/pull/158536)" depending on input shapes.

<img width="3564" height="2368" alt="CrossEntropyBackward_bench" src="https://github.com/user-attachments/assets/5d25c0e4-6714-42bb-a544-b7ef9cbc1b17" />
<img width="3564" height="2368" alt="CrossEntropyForward_bench" src="https://github.com/user-attachments/assets/40e0bbf9-657f-48f2-ac0c-1f0fd6a0ac1d" />
<img width="3564" height="2368" alt="LayerNormBackward_bench" src="https://github.com/user-attachments/assets/db582bb2-d8d4-414a-9de7-b9af061ad0cd" />
<img width="3564" height="2368" alt="LayerNormForward_bench" src="https://github.com/user-attachments/assets/2ce18bd8-73fc-434a-820f-46aa9ad9ddce" />
<img width="3564" height="2368" alt="RMSNormBackward_bench" src="https://github.com/user-attachments/assets/f4cb5f4b-93d3-4d96-973f-37643912325a" />
<img width="3564" height="2368" alt="RMSNormForward_bench" src="https://github.com/user-attachments/assets/231c5805-b156-4587-9c5f-504a33b60883" />
<img width="3564" height="2368" alt="SoftmaxBackward_bench" src="https://github.com/user-attachments/assets/f651c578-813b-4a8e-bffc-b5b34bd879fc" />
<img width="3564" height="2368" alt="SoftmaxForward_bench" src="https://github.com/user-attachments/assets/bfdcc043-4370-4355-af84-9f463426b21a" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158792
Approved by: https://github.com/zou3519
2025-07-22 03:07:22 +00:00
Benjamin Glass
22920c9138 Grab bag of (mostly) typing improvements (#158075)
Collects some scattershot improvements made while attempting to enable training for AOTInductor. Non-typing changes are:

1. Swapping a few custom searches for the output node in an FX graph for calling `graph.output_node()`.
2. Removing two unused parameters from `torch.export._unlift._unlift`.
3. Switching handles to constants in `cpp_wrapper_cpu` to use C++ references for memory efficiency.
4. Cleaning out unused, unexported imports from `torch/export/__init__.py`, and adding one missing export to `__all__`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158075
Approved by: https://github.com/Skylion007
2025-07-21 19:17:01 +00:00
Boyuan Feng
22d82222c6 GenAI Layer Benchmark (#158536)
This PR adds GenAI layer benchmark. It compares pytorch eager, pytorch compiler, liger, and quack.

It covers all kernels supported by [quack](https://github.com/Dao-AILab/quack?tab=readme-ov-file#kernels-) (CrossEntropy Fwd/Bwd, Softmax Fwd/Bwd, RMSNorm Fwd/Bwd, LayerNorm Fwd) and LayerNormBwd.

## Motivations

- Many OSS users asked how to properly benchmark torch.compile generated kernels. One common error is to compile a kernel/layer for one shape (e.g., batch size=1) and benchmark for another shape (e.g., batch size = 1024), which leads to bad performance. This provides an simple & clear example for proper benchmark.
- We recently added GenAI model benchmark (based on [vLLM](https://hud.pytorch.org/benchmark/llms?repoName=vllm-project%2Fvllm)). But it's usually hard to optimize models directly due to complexity. Layer benchmarks are easier to reason and optimize.

## Key Settings

- Avoid reusing a kernel specializing on 1 shape for benchmark on another shape.
```python
torch._dynamo.config.automatic_dynamic_shapes = False
# Needed since changing args to function causes recompiles
torch._dynamo.config.recompile_limit = 1000000
```

- For forward, people may mark batch size as dynamic to avoid runtime recompilation. We respect the setting in this kernel-level benchmark.
```
torch._dynamo.mark_dynamic(x, 0)
```

GPU: H100 (devvm006.dkl0)

Results: [P1874246170](https://www.internalfb.com/phabricator/paste/view/P1874246170)

Note: for numerical accuracy, we use the default tolerance of torch.testing.assert_close (i.e., for `torch.bfloat16`, use rtol `1.6e-2` and atol `1e-5`). It shows numerical issues for some backends and kernels.

Next step is to add roofline analysis, add to ci for checking regression, cover more GenAI Kernels, and include GenAI Layers for common fusion patterns.

<img width="3564" height="2368" alt="CrossEntropyBackward_bench" src="https://github.com/user-attachments/assets/7aa77ad1-83eb-41ea-a27d-50fd5b1dd6be" />
<img width="3564" height="2368" alt="CrossEntropyForward_bench" src="https://github.com/user-attachments/assets/a26ec028-3791-4a41-a12a-05e10f60e9aa" />
<img width="3564" height="2368" alt="LayerNormBackward_bench" src="https://github.com/user-attachments/assets/cc6673ed-c148-4dd2-a729-5f02e717ab3e" />
<img width="3564" height="2368" alt="LayerNormForward_bench" src="https://github.com/user-attachments/assets/f71f9f9d-7b45-4ce7-89d0-e9bce727efae" />
<img width="3564" height="2368" alt="RMSNormBackward_bench" src="https://github.com/user-attachments/assets/e012821a-b7e6-4e83-a24c-c97fa8cd37b5" />
<img width="3564" height="2368" alt="RMSNormForward_bench" src="https://github.com/user-attachments/assets/2d52ee1e-9a8c-4bd1-a180-97b93f07171d" />
<img width="3564" height="2368" alt="SoftmaxBackward_bench" src="https://github.com/user-attachments/assets/02aad056-3ce1-4b40-8cfe-adae81fd017a" />
<img width="3564" height="2368" alt="SoftmaxForward_bench" src="https://github.com/user-attachments/assets/779f6b0d-a102-4164-8300-86fff0329ddf" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158536
Approved by: https://github.com/yf225, https://github.com/eellison
2025-07-19 05:41:01 +00:00
Laith Sakka
64dabb2cf5 only fail regressions>10% on pr_time benchmarks (#158577)
Moving to a new framework, maintaitning the pr_time benchmark test right now is hard and often breaking.
1. only fail PRs >10% regressions.
2. post monitor with pr_time benchmarks dashboard (oncall), and update expected results (frequently or on big changes)
(supposed to already be doing https://www.internalfb.com/unidash/dashboard/pt2_diff_time_metrics)
3. setting up some one detections  detectors warnings that would be triggered at regressions and notify internally post land
https://www.internalfb.com/monitoring/detector/1140915271179237

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158577
Approved by: https://github.com/xmfan, https://github.com/janeyx99
2025-07-19 04:35:31 +00:00
Jack Taylor
7ebbf2cae7 Revert "[PT2][fusion] ban fusions with large accumulated reads (#157563) (#158550)
This reverts commit 8554c8007d #157563 due to causing a few breakages on ROCm

Reverted expected_results.csv to 26807dcf27

> @xuanzhang816 Sorry, but I have to revert this PR yet again because it clearly reintroduced failures on ROCm after the remerge: f4d8bc46c7/2
and the failures are still showing up on tip-of-tree on HUD

Context
https://github.com/pytorch/pytorch/pull/157563#issuecomment-3083350857

Needs to be relanded in non bc-breaking way, or sanity checked for correctness.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158550
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
2025-07-17 19:47:41 +00:00
Laith Sakka
306dd19216 update expeced results (#158497)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158497
Approved by: https://github.com/xmfan
2025-07-17 00:02:52 +00:00
Xuan Zhang
8554c8007d [PT2][fusion] ban fusions with large accumulated reads (#157563)
**Problem:**
Fusion can accumulate large amount of reads, which leads to significant increase in peak memory utilization. Imagine we have the following code snippet
```
total = torch.rand(N, N)
for _ in range(r):
    x = torch.rand(N, N)
    total = total + x
```
The default execution is memory efficient as only two tensors of size N-by-N is in memory at any given time. However, with fusion, the additions are fused into a single operation and the execution becomes something like:
```
x_1 = torch.rand(N, N)
x_2 =  torch.rand(N, N)
...
x_r = torch.rand(N, N)
total = x_1 + x_2 + ... + x_r
```
Though this is run-time efficient, in the case of large `N` and/or large `r`, this is not memory efficient.

[internal only] see [post](https://fb.workplace.com/groups/1075192433118967/permalink/1703374333634104/) for additional details

**Solution:**
Our proposed solution is to ban fusions in case where a large amount of reads are accumulated. This is in addition to some existing logics during torch compile.
* During lowering (i.e., `ir.py`), the config `realize_acc_reads_threshold`, which is default to be 8, controls _the number of_ buffers can be accumulated for a single operator. However, this is oblivious to the size of the buffers. Hence, we additionally introduce a config `realize_acc_reads_size_threshold` to control _the amount of buffers_ in size that can be accumulated.
* During scheduling (i.e., `scheduler.py`), additional fusion will be performed and thus we also need to capture such pattern there. The decisions are implemented under `choices.py`.

**Results:**
For a small example similar to be one in the test case (but with larger `N` and higher number of loop repeats), the memory snapshot before and after are shown below. Note the snapshot on the right is zoomed out so that the y-axis of the two snapshots match.

<img width="1328" alt="image" src="https://github.com/user-attachments/assets/670b5961-8454-4379-ae0f-62d4e7946c64" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157563
Approved by: https://github.com/jansel, https://github.com/mlazos
2025-07-16 01:05:25 +00:00
PyTorch MergeBot
26807dcf27 Revert "[PT2][fusion] ban fusions with large accumulated reads (#157563)"
This reverts commit c062550a35.

Reverted https://github.com/pytorch/pytorch/pull/157563 on behalf of https://github.com/clee2000 due to broke test_linear_and_cel on main c062550a35, caused OOM? Also broken on PR, Dr. CI classification is wrong (claims the test is disabled by an issue but the issue is for a different test).  Also I'm pretty sure the expected results json is supposed to have a ton of empty lines, its to prevent merge conflicts, I will add it to the linter ([comment](https://github.com/pytorch/pytorch/pull/157563#issuecomment-3074355331))
2025-07-15 16:35:55 +00:00
PyTorch MergeBot
4f36743f5e Revert "[simple_fsdp][inductor_collectives] rewrite reorder_collectives, sink_waits_iterative (#158062)"
This reverts commit 5a54db14e3.

Reverted https://github.com/pytorch/pytorch/pull/158062 on behalf of https://github.com/clee2000 due to sorry I want to revert something else and this is causing a merge conflict, all you should need to do is rebase and remerged ([comment](https://github.com/pytorch/pytorch/pull/158062#issuecomment-3074342140))
2025-07-15 16:31:13 +00:00
IvanKobzarev
5a54db14e3 [simple_fsdp][inductor_collectives] rewrite reorder_collectives, sink_waits_iterative (#158062)
Differential Revision: [D78159013](https://our.internmc.facebook.com/intern/diff/D78159013)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158062
Approved by: https://github.com/wconstab
2025-07-15 14:27:57 +00:00
Xuan Zhang
c062550a35 [PT2][fusion] ban fusions with large accumulated reads (#157563)
**Problem:**
Fusion can accumulate large amount of reads, which leads to significant increase in peak memory utilization. Imagine we have the following code snippet
```
total = torch.rand(N, N)
for _ in range(r):
    x = torch.rand(N, N)
    total = total + x
```
The default execution is memory efficient as only two tensors of size N-by-N is in memory at any given time. However, with fusion, the additions are fused into a single operation and the execution becomes something like:
```
x_1 = torch.rand(N, N)
x_2 =  torch.rand(N, N)
...
x_r = torch.rand(N, N)
total = x_1 + x_2 + ... + x_r
```
Though this is run-time efficient, in the case of large `N` and/or large `r`, this is not memory efficient.

[internal only] see [post](https://fb.workplace.com/groups/1075192433118967/permalink/1703374333634104/) for additional details

**Solution:**
Our proposed solution is to ban fusions in case where a large amount of reads are accumulated. This is in addition to some existing logics during torch compile.
* During lowering (i.e., `ir.py`), the config `realize_acc_reads_threshold`, which is default to be 8, controls _the number of_ buffers can be accumulated for a single operator. However, this is oblivious to the size of the buffers. Hence, we additionally introduce a config `realize_acc_reads_size_threshold` to control _the amount of buffers_ in size that can be accumulated.
* During scheduling (i.e., `scheduler.py`), additional fusion will be performed and thus we also need to capture such pattern there. The decisions are implemented under `choices.py`.

**Results:**
For a small example similar to be one in the test case (but with larger `N` and higher number of loop repeats), the memory snapshot before and after are shown below. Note the snapshot on the right is zoomed out so that the y-axis of the two snapshots match.

<img width="1328" alt="image" src="https://github.com/user-attachments/assets/670b5961-8454-4379-ae0f-62d4e7946c64" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157563
Approved by: https://github.com/jansel, https://github.com/mlazos
2025-07-14 22:27:21 +00:00
PyTorch MergeBot
e90148c91d Revert "[PT2][fusion] ban fusions with large accumulated reads (#157563)"
This reverts commit 4b9a6f7211.

Reverted https://github.com/pytorch/pytorch/pull/157563 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but I suspect that it might contribute to a string of OOM error in trunk ([comment](https://github.com/pytorch/pytorch/pull/157563#issuecomment-3064678929))
2025-07-12 04:52:11 +00:00
Xuan Zhang
4b9a6f7211 [PT2][fusion] ban fusions with large accumulated reads (#157563)
**Problem:**
Fusion can accumulate large amount of reads, which leads to significant increase in peak memory utilization. Imagine we have the following code snippet
```
total = torch.rand(N, N)
for _ in range(r):
    x = torch.rand(N, N)
    total = total + x
```
The default execution is memory efficient as only two tensors of size N-by-N is in memory at any given time. However, with fusion, the additions are fused into a single operation and the execution becomes something like:
```
x_1 = torch.rand(N, N)
x_2 =  torch.rand(N, N)
...
x_r = torch.rand(N, N)
total = x_1 + x_2 + ... + x_r
```
Though this is run-time efficient, in the case of large `N` and/or large `r`, this is not memory efficient.

[internal only] see [post](https://fb.workplace.com/groups/1075192433118967/permalink/1703374333634104/) for additional details

**Solution:**
Our proposed solution is to ban fusions in case where a large amount of reads are accumulated. This is in addition to some existing logics during torch compile.
* During lowering (i.e., `ir.py`), the config `realize_acc_reads_threshold`, which is default to be 8, controls _the number of_ buffers can be accumulated for a single operator. However, this is oblivious to the size of the buffers. Hence, we additionally introduce a config `realize_acc_reads_size_threshold` to control _the amount of buffers_ in size that can be accumulated.
* During scheduling (i.e., `scheduler.py`), additional fusion will be performed and thus we also need to capture such pattern there. The decisions are implemented under `choices.py`.

**Results:**
For a small example similar to be one in the test case (but with larger `N` and higher number of loop repeats), the memory snapshot before and after are shown below. Note the snapshot on the right is zoomed out so that the y-axis of the two snapshots match.

<img width="1328" alt="image" src="https://github.com/user-attachments/assets/670b5961-8454-4379-ae0f-62d4e7946c64" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157563
Approved by: https://github.com/jansel, https://github.com/mlazos
2025-07-11 21:07:57 +00:00
Xuehai Pan
af3d069094 [BE][Easy] remove unused build-time dependency astunparse and change astunparse.unparse -> ast.unparse (#157907)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157907
Approved by: https://github.com/Skylion007
2025-07-10 07:04:42 +00:00
Ryan Guo
f742b32a2f [dynamo] Avoid recompiling over unused objects (#156891)
Dynamo was aggressively specializing on lazy VTs over `set_name_hint` in
`STORE_FAST`, etc., and `isinstance` in `LOAD_FAST_CHECK`. This causes
regional `torch.compile` from optimizing ComfyUI GGUF + LoRA to either
(1). exceed the recompialtion limit of 8, which results in suboptimal
performance, and (2). even if recompilation limit is increased, the
compilation time gets unnecessarily high (180s v.s. 20s for Flux).

This patch fixes the recompilation issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156891
Approved by: https://github.com/williamwen42, https://github.com/mlazos
2025-07-09 20:14:34 +00:00
Weishi.Deng
44d0800d60 [Intel GPU] Set higher tolerance for squeezenet1_1 with bf16 (#156920)
We need to increase the tolerance slightly to ensure that certain models pass the accuracy check on the XPU device.
This pull request preserves the original tolerance threshold for CUDA/CPU devices and introduces a new key, higher_bf16_xpu, which only affects the XPU device.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156920
Approved by: https://github.com/soulitzer
2025-07-08 17:49:54 +00:00
Jason Ansel
41f6acef83 Update pr_time_benchmarks expected results (#157214)
The job has been unstable

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157214
Approved by: https://github.com/laithsakka
2025-06-29 19:12:13 +00:00
Laith Sakka
e6ed4074e8 update expected results (#157010)
<img width="1490" alt="Screenshot 2025-06-26 at 12 30 46 PM" src="https://github.com/user-attachments/assets/4df626d4-3010-4362-974c-fb96fa68b29f" />

<img width="904" alt="Screenshot 2025-06-26 at 12 28 29 PM" src="https://github.com/user-attachments/assets/42626892-27e1-4e69-9efc-c9baf80c5384" />

<img width="752" alt="Screenshot 2025-06-26 at 12 29 05 PM" src="https://github.com/user-attachments/assets/0b1afb30-5868-4ba6-9985-2cc7994a4227" />
PR https://github.com/pytorch/pytorch/pull/152011
added slight regression

<br class="Apple-interchange-newline">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157010
Approved by: https://github.com/zou3519
2025-06-26 21:56:57 +00:00
Laith Sakka
85df746892 refresh expected numbers (#156877)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156877
Approved by: https://github.com/huydhn
2025-06-26 00:03:09 +00:00
IvanKobzarev
313a6a8ef9 [pt2][pr_time_benchmarks] Refresh instructions count after disabled test (#156738)
https://github.com/pytorch/pytorch/issues/153987

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156738
Approved by: https://github.com/laithsakka
2025-06-24 23:45:02 +00:00
Tom Ritchford
e2c9d8d641 Fix non-bitwise type annotations for Tensor operators (see #145838) (#146845)
Fix https://github.com/pytorch/pytorch/issues/145838

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146845
Approved by: https://github.com/Skylion007
2025-06-24 15:41:34 +00:00
PyTorch MergeBot
e600e044a7 Revert "[aotd] Support mutations of the same input in fw and bw (#155354)"
This reverts commit 3f920f3d8f.

Reverted https://github.com/pytorch/pytorch/pull/155354 on behalf of https://github.com/malfet due to Not sure why CI was green, but it breaks tons of tests, see 930b575389/1 ([comment](https://github.com/pytorch/pytorch/pull/155354#issuecomment-2998780884))
2025-06-24 04:42:14 +00:00
IvanKobzarev
3f920f3d8f [aotd] Support mutations of the same input in fw and bw (#155354)
Original issue: https://github.com/pytorch/pytorch/issues/154820

The issue happens when there is a mutation for the same input in forward AND in backward.

AOTD emited copy_ after joint_function tracing. This made this fx-node to correspond to the side effects of both mutations (in forward and in backward).
After that partitioner can put it either in forward or in backward.

The fix:

1/ Introduce joint_function.handle that allows to set "post_forward" callback, to be able to check inputs state after forward

We do not want to apply the mutation after joint, if we already applied it in forward. For that we need "mutation_counter" and memorize the version of mutation that we applied for  forward mutation.

2/ Exposing mutation_counter to python

We want to keep invariant that copy_ exist only in the end of joint graph.

3/ We memorize mutation_counter and state of the inputs after forward, using the handle post_forward.
Emit post_forward mutations after joint graph fully traced.

add for post_forward mutations "must_be_in_forward" tag (similar to existing "must_be_in_backward") to keep them in forward.

4/ Ban recompute of the source of mutation. Recompute can apply the same op (e.g. add) in forward and backward.
For this set MUST_SAVE for the source of mutation in forward.

proxy_tensor changes:

By default proxy tensor updates tensor_tracker. In this case applied mutations will be chained.
But we want that this copy_ will be independent and applied just to primals.
For this introducing a contextmanager to be able to disable update of tensor_tracker for adding forward mutations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155354
Approved by: https://github.com/bdhirsh
2025-06-23 22:25:45 +00:00
Boyuan Feng
a95504b10f [torchbench] update environment setup script (#156465)
Existing torchbench `Makefile` installs all models from torchbench, which could easily take 30 minutes, even if a developer only want to run 1 model.

This PR adds a config to only install torchbench models we want to run.

Example usage:
```
# Install 1 torchbench model
make build-deps TORCHBENCH_MODELS="alexnet"

# Install 3 torchbench models
make build-deps TORCHBENCH_MODELS="alexnet basic_gnn_gcn BERT_pytorch"

# Install all models
make build-deps

# Install all models
make build-deps TORCHBENCH_MODELS=""
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156465
Approved by: https://github.com/ezyang
2025-06-23 17:41:29 +00:00
Edward Yang
333e0e6147 Make build-deps drop builds into current venv again (#156200)
Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156200
Approved by: https://github.com/malfet
2025-06-22 00:45:02 +00:00
Xuan Zhang
c2d1b225e6 [PT2][partitioners] raise getitems in partitioners to allow earlier release of buffers (#155809)
**Problem & Solution:**
Assume we have something like:
```
x = some_op(...)
x0 = x[0]
do_something_with_and_is_last_use_of(x0)
do_a_bunch_of_other_things()
x1 = x[1]
```
In this case, the memory associated with `x0` cannot be released until `x1 = x[1]`. Since `x1 = x[1]` does not use additional memory, it would be beneficial to move and `x1 = x[1]` and all such `getitem` operations to be immediately after `x = some_op(...)` such as
```
x = some_op(...)
x0 = x[0]
x1 = x[1]
do_something_with_and_is_last_use_of(x0)
do_a_bunch_of_other_things()
```

**Results:**
For instance, for the `res2net101_26w_4s` model in pytorch benchmark, when running with `aot_eager` backend and with `activation_memory_budget=0.4`, the peak memory are
* baseline: 7.73GiB
* with the chage: 6.45GiB

As a sanity check, for the same setting with `inductor` backend, the peak memory is not regressed.

cc and credit to @ShatianWang for noticing this issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155809
Approved by: https://github.com/fmassa, https://github.com/bdhirsh
2025-06-21 19:57:21 +00:00
PyTorch MergeBot
754c04aa06 Revert "[dynamo] raise hard error if error is encountered while tracing resume function prologue (#154564)"
This reverts commit 0aed855b2b.

Reverted https://github.com/pytorch/pytorch/pull/154564 on behalf of https://github.com/ezyang due to regresses functorch_maml_omniglot ([comment](https://github.com/pytorch/pytorch/pull/154564#issuecomment-2992685744))
2025-06-20 20:18:24 +00:00
PyTorch MergeBot
96d082d06b Revert "[InductorBench] Fix accuracy validation logic for MPS (#156385)"
This reverts commit 242eb19c83.

Reverted https://github.com/pytorch/pytorch/pull/156385 on behalf of https://github.com/malfet due to Has some bug in error handling ([comment](https://github.com/pytorch/pytorch/pull/156385#issuecomment-2992441769))
2025-06-20 18:17:18 +00:00
William Wen
0aed855b2b [dynamo] raise hard error if error is encountered while tracing resume function prologue (#154564)
This should prevent bad resume function prologues from slipping by. In particular, graph breaks in resume function prologues will now hard error.

Implementation details:
- The resume function prologue is surrounded by `LOAD_CONST arg, STORE_FAST __is_tracing_resume_prologue` instructions. The first sequence has `arg=True` and the second sequence has `arg=False`.
- InstructionTranslator will know when it is tracing a resume function prologue when it detects `STORE_FAST __is_tracing_resume_prologue`. The top of stack will be True to mark the start of the prologue, False to mark the end.
- When `convert_frame.py` detects that an error occurred while the InstructionTranslator was tracing a resume function prologue, we will wrap the exception and hard error

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154564
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289, #154782, #155166
2025-06-20 07:03:29 +00:00
atalman
ebab279942 Forward fix inductor benchmark after #150287 (#156455)
Looks like https://github.com/pytorch/pytorch/pull/150287 stack fixed some inductor tests
HUD: https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=inductor-periodic%20%2F%20linux-jammy-cpu-py3.9-gcc11-inductor

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156455
Approved by: https://github.com/huydhn
2025-06-20 00:04:15 +00:00
Nikita Shulga
242eb19c83 [InductorBench] Fix accuracy validation logic for MPS (#156385)
As it does not support full fp64, validate against float32

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156385
Approved by: https://github.com/Skylion007
2025-06-19 05:37:51 +00:00
Xuehai Pan
42015db6a9 [BE] fix typos in benchmarks/ (#156077)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156077
Approved by: https://github.com/Skylion007, https://github.com/malfet
ghstack dependencies: #156069
2025-06-17 13:12:18 +00:00
Oguz Ulgen
a2a75be0f8 Rename inductor cache (#156128)
Requested by Simon on a different PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156128
Approved by: https://github.com/xmfan
2025-06-17 03:57:18 +00:00
Laith Sakka
4bb936d8b7 refresh expected results (#155817)
some changes landed when the test is recently unstable with out updating the results.
<img width="564" alt="Screenshot 2025-06-12 at 9 26 32 AM" src="https://github.com/user-attachments/assets/9a83f18b-f2a8-485d-a58e-67d8c161eb18" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155817
Approved by: https://github.com/yushangdi
2025-06-12 19:14:21 +00:00
Laith Sakka
cc09d3a5ba remove float args benchmark (#155674)
This benchmark very sensitive. removing it for now until we make it better .

<img width="755" alt="Screenshot 2025-06-11 at 12 01 25 AM" src="https://github.com/user-attachments/assets/01a45ae5-2028-42a2-b819-c30d4db3b5d4" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155674
Approved by: https://github.com/bdhirsh, https://github.com/bobrenjc93
2025-06-11 20:34:58 +00:00
Oguz Ulgen
d1947a8707 Migrate from lru_cache to cache (#155613)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155613
Approved by: https://github.com/ezyang
ghstack dependencies: #155612
2025-06-11 19:44:18 +00:00
Animesh Jain
c881f2ddf3 [reland][dynamo] Mark a vt unspecialized nn module variable source earlier (#155099)
Reland of https://github.com/pytorch/pytorch/pull/154780

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155099
Approved by: https://github.com/williamwen42
2025-06-04 23:05:36 +00:00
Jeff Daily
3ce5102927 [ROCm] fix CI failures from inductor periodic (#154896)
Similar idea as https://github.com/pytorch/pytorch/pull/154497, but for ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154896
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-06-04 15:28:43 +00:00
PyTorch MergeBot
a99a01a677 Revert "[dynamo] Mark a vt unspecialized nn module variable source earlier (#154780)"
This reverts commit cc96febb97.

Reverted https://github.com/pytorch/pytorch/pull/154780 on behalf of https://github.com/seemethere due to This fails internal testing see, https://fburl.com/diff/b0yuxk4w ([comment](https://github.com/pytorch/pytorch/pull/154780#issuecomment-2940381691))
2025-06-04 15:03:34 +00:00
eellison
40a8770154 Incorporate coalesce analysis in codegen (#153751)
This pr uses the coalescing information in generating a tiling. The previous tiling heuristic would have each dependency generate a tiling. Then, we sum up the score for each generated tiling, preferring any 2d tiling over the default. The new tiling heuristics scores each tiling by its global coalesced memory. This gives both a potentially better tiling (especially for more complicated, 3d patterns) as well as information we can use in generating block sizes.

In triton heuristics, for generating 3d tiled reductions, we take the same total block size that the 2d reduction would use, then distribute the block according to whichever block coalesces the most memory.

The motivating kernel is in https://github.com/pytorch/pytorch/issues/149982 which is a 32 element reduction. A smaller version of it is [here](https://gist.github.com/eellison/0fa9396f5479eb4dba09756e3bf6ff2a). We need to run this kernel once in the forward per linear layer on a contiguous tensor, and once in the backward on a transposed tensor.

While the contiguous kernel has coalesced accesses, and is performant on master, the transposed version accesses uncoalesced memory on main and is ~2.8x slower. See, this [full log](https://gist.github.com/eellison/fa644bfd9d0ae11dadb62e17a5d48a83) from the above repro. Now, with this PR, it is only ~1.15x slower. See the [updated log](https://gist.github.com/eellison/0b2b653309494d28cf7b48929a022075).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153751
Approved by: https://github.com/jansel
ghstack dependencies: #153723, #153730, #153748
2025-06-04 00:22:57 +00:00
Animesh Jain
cc96febb97 [dynamo] Mark a vt unspecialized nn module variable source earlier (#154780)
I am working on providing some skip guard helper functions to allow users to reduce guard overhead. This is a refactor to allow that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154780
Approved by: https://github.com/StrongerXi, https://github.com/jansel
2025-06-03 19:19:47 +00:00
bobrenjc93
28f27886eb Vary batch size when running dynamic shapes benchmarks (#154805)
This better measures the actual runtime performance of dynamic shapes
where we aren't guaranteed to have similar shapes as the original hint.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154805
Approved by: https://github.com/Skylion007
ghstack dependencies: #154802, #154826, #154822, #154823
2025-06-02 18:56:18 +00:00
bobrenjc93
b90fc2ec27 [ez] delete code that died a long time ago (#154802)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154802
Approved by: https://github.com/Skylion007
2025-06-01 14:57:03 +00:00
Pearu Peterson
6a781619bf Temporarily disable sparse tensor validation when loading from external storage. (#154758)
As in the title per https://github.com/pytorch/pytorch/issues/153143#issuecomment-2917793067 .

The plan is to workout a solution that will allow (1) disabling pinned memory check to fix the original issue and (2) switching off the sparse tensor validation for maximal performance in loading sparse tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154758
Approved by: https://github.com/amjames, https://github.com/ngimel
2025-05-31 19:45:44 +00:00
Aaron Orenstein
fc0135ca11 Re-enable FakeTensor caching for SymInts (#152662)
Summary:

This backs out D60320595 which itself turned off FakeTensor caching when a SymInt was present.

There has been a lot of dynamic shape fixes done this year and tests pass so I'm assuming some of that work fixed what was breaking previously.

Test Plan: Reran the tests listed in T196779132 and they pass.

## Perf
### Instruction Counter Benchmark:
- 26% win on add_loop_eager_dynamic
- 13% win on add_loop_inductor_dynamic_gpu
### Perf Dashboard
Compilation Latency wins across the board but especially strong on the dynamic tests (like cudagraphs_dynamic) - for example MobileBertForMaskedLM went from 66s -> 50s.

Differential Revision: [D75467694](https://our.internmc.facebook.com/intern/diff/D75467694)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152662
Approved by: https://github.com/anijain2305
2025-05-30 17:23:36 +00:00
Isuru Fernando
9ba67e99bb [dynamo] keep C++ symbolic shape guards disabled for benchmarks (#151225)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151225
Approved by: https://github.com/anijain2305
2025-05-29 23:29:39 +00:00
Laith Sakka
39df901b2a introduce definitely_contiguous and use it for reshape and tensor meta data computation. (#153432)
when a tensor has unbacked symbols it can be general enough to represent both contiguous and non contiguous tensors.
in that case we cant really evaluate is_contiguous. In many places in the code base, we check for is_contiguous to take a fast path. but the general path usually works for both contiguous and not contiguous in that case we probably want
to use definitely _contiguous API.

This is appleid for reshape in this PR and also to  tensor meta data computation, the meta data now will have an attribute that says that its contiguous when its always contiguous. We would store that only if definitely _contiguous is true  now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153432
Approved by: https://github.com/bobrenjc93
2025-05-28 03:41:26 +00:00
Boyuan Feng
514409d032 update torchvision pin (#154255)
Fixes #153985

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154255
Approved by: https://github.com/desertfire
2025-05-27 16:15:25 +00:00
PyTorch MergeBot
3f64502c98 Revert "Re-enable FakeTensor caching for SymInts (#152662)"
This reverts commit 7d11c61c26.

Reverted https://github.com/pytorch/pytorch/pull/152662 on behalf of https://github.com/malfet due to Looks like it broke bunch of inductor tests, see 187d38185e/1 ([comment](https://github.com/pytorch/pytorch/pull/152662#issuecomment-2910293593))
2025-05-26 17:13:22 +00:00
Aaron Orenstein
7d11c61c26 Re-enable FakeTensor caching for SymInts (#152662)
Summary:

This backs out D60320595 which itself turned off FakeTensor caching when a SymInt was present.

There has been a lot of dynamic shape fixes done this year and tests pass so I'm assuming some of that work fixed what was breaking previously.

Test Plan: Reran the tests listed in T196779132 and they pass.

## Perf
### Instruction Counter Benchmark:
- 26% win on add_loop_eager_dynamic
- 13% win on add_loop_inductor_dynamic_gpu
### Perf Dashboard
Compilation Latency wins across the board but especially strong on the dynamic tests (like cudagraphs_dynamic) - for example MobileBertForMaskedLM went from 66s -> 50s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152662
Approved by: https://github.com/anijain2305
2025-05-26 04:17:56 +00:00
Eddie Yan
76ed9db468 [cuBLAS][cuBLASLt] Use cuBLAS default workspace size in Lt (#153556)
Also enables unified workspaces by default for non-FBCODE use cases.
Default Lt workspace size is also updated to match cuBLAS logic for default, including for Blackwell (SM 10.0) and GeForce Blackwell (SM 12.0).

Recommended defaults are documented here:
https://docs.nvidia.com/cuda/cublas/#cublassetworkspace

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153556
Approved by: https://github.com/Skylion007, https://github.com/ngimel
2025-05-24 03:43:35 +00:00
Laith Sakka
9e089bb5b6 change guard_or impl for better perf and simplicity (#153674)
PR time benchmarks has been showing regressions as we move to guard_or_false, reason is that prev implementation do not cache.
This new approach will propagate the fallback value to eval and return it. allowing eval to cache and reducing scamming logs and complexity.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153674
Approved by: https://github.com/bobrenjc93
2025-05-23 15:24:28 +00:00
Huy Do
7509b150af Don't upload compiler benchmark debug info to the benchmark database (#153769)
During our debug session, @wdvr and I found out that the benchmark database is growing much faster than we expect.  After taking a closer look, the majority of them coming from TorchInductor benchmark and the top 3 are all debug information not used by any dashboard atm.  In the period of 7 days, there are close to 6 millions records ([query](https://paste.sh/GUVCBa0v#UzszFCZaWQxh7oSVsZtfZdVE))

```
Benchmark,Metric,Count
"TorchInductor","user_stack","1926014"
"TorchInductor","reason","1926014"
"TorchInductor","model","1926014"
```

Let's skip uploading them to avoid bloating the database.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153769
Approved by: https://github.com/malfet
2025-05-23 01:18:26 +00:00
Benjamin Glass
768cb734ec cpp_wrapper: build non-performance-sensitive code at O1 (#148773)
Builds on #148212, applying the same improvements to `cpp_wrapper` mode.

Benchmark results:

* [A100 Benchmarks](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Wed%2C%2014%20May%202025%2015%3A10%3A05%20GMT&stopTime=Wed%2C%2021%20May%202025%2015%3A10%3A05%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(a100)&lBranch=gh/benjaminglass1/77/orig&lCommit=ca7d0a3f16e3c511534d2cd03d695be8524570d3&rBranch=main&rCommit=1075bb37d34e483763a09c7810790d5491441e13)
* [x86 Benchmarks](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Wed%2C%2014%20May%202025%2015%3A10%3A05%20GMT&stopTime=Wed%2C%2021%20May%202025%2015%3A10%3A05%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cpu%20(x86)&lBranch=gh/benjaminglass1/77/orig&lCommit=ca7d0a3f16e3c511534d2cd03d695be8524570d3&rBranch=main&rCommit=1075bb37d34e483763a09c7810790d5491441e13)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148773
Approved by: https://github.com/desertfire
2025-05-23 00:51:20 +00:00
Huy Do
6cd9d66b7f Allow higher fp16 tolerance for phlippe_resnet on CUDA 12.8 (#154109)
After https://github.com/pytorch/pytorch/pull/154004, one of the model `phlippe_resnet` needs higher tolerance for fp16 on CUDA 12.8.  I can reproduce it locally with:

```
python benchmarks/dynamo/torchbench.py --accuracy --timing --explain --print-compilation-time --inductor --device cuda --training --amp --only phlippe_resnet

E0522 02:47:12.392000 2130213 site-packages/torch/_dynamo/utils.py:2949] RMSE (res-fp64): 0.00144, (ref-fp64): 0.00036 and shape=torch.Size([]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000, use_larger_multiplier_for_smaller_tensor: 0
```

I'm not sure what exactly happens behind the scene, but this should help fix the CI failure.

Also remove some left over expected accuracy results for CUDA 12.4 which we are not using anymore on CI for benchmark jobs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154109
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-05-22 14:25:12 +00:00
Gabriel Ferns
254293b777 Add flag _metrics_log_runtime to disable runtime metric logging by default (#153506)
https://github.com/pytorch/pytorch/pull/152708 expanded support of `get_estimated_runtime` to many more types of `SchedulerNodes`. This caused an increase in compile time because we're always calling `get_estimated_runtime` to populate the metrics table. This PR adds a flag for this logging, which reduces the instruction count by 8%. Long term, we should probably merge metrics.py with TORCH_LOGS/tlparse (suggestion from @xmfan).

Update: added support for TORCH_LOGS for the metrics logging.

Test Plan:
mm_loop.py and many existing tests cover.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153506
Approved by: https://github.com/eellison
2025-05-22 01:02:11 +00:00
PyTorch MergeBot
3443627e07 Revert "[BE]: Enable RUFF TRY400 rule - log.exception (#153473)"
This reverts commit 4f4ecc583e.

Reverted https://github.com/pytorch/pytorch/pull/153473 on behalf of https://github.com/jeanschmidt due to seems to have broken internal signals, @albanD may I count on you to help the author merge his PR? D74837988 ([comment](https://github.com/pytorch/pytorch/pull/153473#issuecomment-2886017075))
2025-05-16 08:29:26 +00:00
PyTorch MergeBot
4d073af58c Revert "[inductor][dynamo] Include operator name in size/stride/alignment assertion (#152353)"
This reverts commit 725bbb6b5f.

Reverted https://github.com/pytorch/pytorch/pull/152353 on behalf of https://github.com/jeanschmidt due to seems to have broken a few internal tests, @jansel may you help the author get his PR merged? ([comment](https://github.com/pytorch/pytorch/pull/152353#issuecomment-2885997862))
2025-05-16 08:20:39 +00:00
Nikita Shulga
754b758ea1 [BE] Extend empty_gpu_cache to mps (#153657)
And replace `if: elif:` with `getattr()`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153657
Approved by: https://github.com/atalman, https://github.com/wdvr, https://github.com/ZainRizvi
2025-05-16 01:08:54 +00:00
Aaron Gokaslan
4f4ecc583e [BE]: Enable RUFF TRY400 rule - log.exception (#153473)
Change logging.error to logging.exception to log additional information when relevant.  A few places have slipped in logging.errors in try except since I last did a clean up here and the rule is stabilized so I am enabling it codebase wide. I have NOQA'd much of our custom exception stack trace handling for RPC calls and distributed and tried to a fix a few errors based on whether we immediately reraised it or if we didn't print any exception handling where it could be useful.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153473
Approved by: https://github.com/albanD, https://github.com/cyyever
2025-05-15 13:36:59 +00:00
karthickai
725bbb6b5f [inductor][dynamo] Include operator name in size/stride/alignment assertion (#152353)
Fixes #151930

This PR updates the `assert_size_stride` and `assert_alignment` functions in [guards.cpp](https://github.com/pytorch/pytorch/blob/main/torch/csrc/dynamo/guards.cpp) to accept an optional `op_name` argument and includes it in the error messages.

The corresponding type stubs in [guards.pyi](https://github.com/pytorch/pytorch/blob/main/torch/_C/_dynamo/guards.pyi) are updated to match the new function arg.

In [inductor/ir.py](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/ir.py) extracts the operator name from the FX graph and passes it into the `codegen_size_asserts` and `codegen_alignment_asserts` functions, so that generated assertions in Triton code include the op name for better debugging.

Added unit tests inside [test_torchinductor.py](https://github.com/pytorch/pytorch/blob/main/test/inductor/test_torchinductor.py).
- Verified both successful and failing assertion cases include the operator name.
- Verified that generated Triton code contains the op name inside the asserts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152353
Approved by: https://github.com/jansel
2025-05-15 02:33:57 +00:00
Animesh Jain
03d01860fd [dynamo][compile-time] Compute logging related flags once (#153426)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153426
Approved by: https://github.com/jansel
2025-05-14 21:19:06 +00:00
Animesh Jain
8f3d7972ad [dynamo][compile-time] Cache the function signature to speedup inlining (#153396)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153396
Approved by: https://github.com/jansel, https://github.com/StrongerXi
ghstack dependencies: #153333
2025-05-14 14:01:46 +00:00
Animesh Jain
864a5f4434 [dynamo][compile-time] Cache the cleaned insturctions while inlining (#153333)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153333
Approved by: https://github.com/StrongerXi, https://github.com/jansel, https://github.com/williamwen42
2025-05-14 09:26:26 +00:00
Animesh Jain
11c64b7cf8 [dynamo][compile-time] Cache whether a function is inlineable (#153192)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153192
Approved by: https://github.com/StrongerXi, https://github.com/jansel, https://github.com/williamwen42
ghstack dependencies: #153458
2025-05-14 05:40:25 +00:00
Benjamin Glass
e8596c291b Fix misleadingly high AOT Inductor dashboard performance (#153060)
Fixes misleadingly high AOTInductor performance benchmark numbers in scenarios where a model updates internal parameters during `torch.export.export`. Since `FakeTensorMode` is enabled during export, all such parameters become `FakeTensor`s, slowing down future eager-mode runs using that model substantively. This, in turn, causes misleading performance stats, where the slowness of eager-mode makes `AOTInductor` look _very_ good.

An [example benchmark](https://hud.pytorch.org/benchmark/timm_models/inductor_aot_inductor?dashboard=torchinductor&startTime=Wed%2C%2030%20Apr%202025%2015%3A54%3A04%20GMT&stopTime=Wed%2C%2007%20May%202025%2015%3A54%3A04%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=main&lCommit=1dd36ad2d440a4f3faf724b3a8e13925e3180c24&rBranch=main&rCommit=cc7346bf19c019255dcb4484694a75850ed74d5a&model=convit_base) with this issue. The equivalent `cpp_wrapper` benchmark run shows a 2x performance gain, not 20x.

Only two benchmarks we regularly run are affected by this, both in the TIMM set.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153060
Approved by: https://github.com/desertfire
2025-05-13 20:59:59 +00:00
Michael Lazos
ff039d39ec [Dynamo] Optimize dedupe region ancestor tracking (#152589)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152589
Approved by: https://github.com/anijain2305
ghstack dependencies: #152389, #152505, #152410, #152506, #152570, #152572
2025-05-13 12:17:59 +00:00
Laith Sakka
c4fb0b6f33 refresh expected results (#150166)
@huydhn when do you think we will have the APIs to access results on oss storage available so we do not
have to worry about this racing again?
Also is there a way to accelerate unstability in this after we land it?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150166
Approved by: https://github.com/bobrenjc93, https://github.com/eellison, https://github.com/anijain2305
2025-05-13 04:04:42 +00:00
Aaron Gokaslan
3555ebb63d [BE]: Update ruff to 0.11.8 (#153249)
Fixes a ton of false negatives throughout the codebase. RUFF also properly validates NOQA comments now and most of the changes are fixing typos there or removing filewide flake8 suppressions that were also silencing ruff issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153249
Approved by: https://github.com/cyyever, https://github.com/albanD, https://github.com/seemethere
2025-05-12 18:30:52 +00:00
PyTorch MergeBot
aa7fe6af41 Revert "[Dynamo] Optimize dedupe region ancestor tracking (#152589)"
This reverts commit b5f1345f72.

Reverted https://github.com/pytorch/pytorch/pull/152589 on behalf of https://github.com/jeanschmidt due to Breaking internal signal citadel-fbcode-test-mode-opt-for-pt2_stack_for_internal-linux-0 please see diff [D74531503](https://www.internalfb.com/diff/D74531503) for more details ([comment](https://github.com/pytorch/pytorch/pull/152410#issuecomment-2871168679))
2025-05-12 07:15:09 +00:00
Michael Lazos
b5f1345f72 [Dynamo] Optimize dedupe region ancestor tracking (#152589)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152589
Approved by: https://github.com/anijain2305
ghstack dependencies: #152389, #152505, #152410, #152506, #152570, #152572
2025-05-10 08:27:56 +00:00
Pian Pawakapan
d808a3e203 [dynamic shapes] guard_or_false for computeStorageNbytes (#150483)
removes fast path for computing storage, fixes some adjacent tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150483
Approved by: https://github.com/laithsakka
2025-05-09 19:31:19 +00:00
Pian Pawakapan
8ea95d2e73 [inductor] dtype promotion error in cat decomp (#152995)
cloning single tensor wasn't following dtype promotion rules
for SAM model: https://github.com/pytorch/pytorch/issues/152606

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152995
Approved by: https://github.com/yushangdi, https://github.com/eellison
2025-05-09 16:58:58 +00:00
Animesh Jain
ab829ec629 [dynamo][pr_time_benchmark] Add dynamo benchmark to stress test inlining (#153159)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153159
Approved by: https://github.com/laithsakka
ghstack dependencies: #152883, #153105
2025-05-09 00:09:19 +00:00