Commit Graph

1215 Commits

Author SHA1 Message Date
Edward Yang
333e0e6147 Make build-deps drop builds into current venv again (#156200)
Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156200
Approved by: https://github.com/malfet
2025-06-22 00:45:02 +00:00
Xuan Zhang
c2d1b225e6 [PT2][partitioners] raise getitems in partitioners to allow earlier release of buffers (#155809)
**Problem & Solution:**
Assume we have something like:
```
x = some_op(...)
x0 = x[0]
do_something_with_and_is_last_use_of(x0)
do_a_bunch_of_other_things()
x1 = x[1]
```
In this case, the memory associated with `x0` cannot be released until `x1 = x[1]`. Since `x1 = x[1]` does not use additional memory, it would be beneficial to move `x1 = x[1]` and all such `getitem` operations to immediately after `x = some_op(...)`, such as:
```
x = some_op(...)
x0 = x[0]
x1 = x[1]
do_something_with_and_is_last_use_of(x0)
do_a_bunch_of_other_things()
```
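
The reordering can be sketched on a toy schedule. This is a minimal illustration under stated assumptions, not the partitioner's implementation: the schedule is a plain list of `(name, op, inputs)` tuples rather than an FX graph, and `raise_getitems` is a hypothetical helper name.

```python
# Toy schedule: each node is (name, op, inputs). This representation is an
# assumption for illustration; the real partitioner works on graph nodes.
def raise_getitems(nodes):
    """Move each getitem node to directly after the node that produces its input."""
    names = {name for name, _, _ in nodes}
    getitems_of = {}
    for node in nodes:
        name, op, inputs = node
        if op == "getitem" and inputs[0] in names:
            getitems_of.setdefault(inputs[0], []).append(node)
    hoisted = {n[0] for group in getitems_of.values() for n in group}

    reordered = []

    def emit(node):
        # Emit the node, then every getitem that reads from it (recursively).
        reordered.append(node)
        for child in getitems_of.get(node[0], []):
            emit(child)

    for node in nodes:
        if node[0] not in hoisted:
            emit(node)
    return reordered


schedule = [
    ("x", "some_op", ()),
    ("x0", "getitem", ("x", 0)),
    ("use_x0", "consume", ("x0",)),
    ("other", "other_work", ()),
    ("x1", "getitem", ("x", 1)),
]
print([name for name, _, _ in raise_getitems(schedule)])
```

Because a `getitem` only depends on its producer, placing it right after that producer keeps the schedule valid while letting `x` be dropped as soon as its pieces are extracted.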

**Results:**
For instance, for the `res2net101_26w_4s` model in the PyTorch benchmark suite, when running with the `aot_eager` backend and `activation_memory_budget=0.4`, the peak memory is:
* baseline: 7.73GiB
* with the change: 6.45GiB

As a sanity check, for the same setting with `inductor` backend, the peak memory is not regressed.

cc and credit to @ShatianWang for noticing this issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155809
Approved by: https://github.com/fmassa, https://github.com/bdhirsh
2025-06-21 19:57:21 +00:00
PyTorch MergeBot
754c04aa06 Revert "[dynamo] raise hard error if error is encountered while tracing resume function prologue (#154564)"
This reverts commit 0aed855b2b.

Reverted https://github.com/pytorch/pytorch/pull/154564 on behalf of https://github.com/ezyang due to regresses functorch_maml_omniglot ([comment](https://github.com/pytorch/pytorch/pull/154564#issuecomment-2992685744))
2025-06-20 20:18:24 +00:00
PyTorch MergeBot
96d082d06b Revert "[InductorBench] Fix accuracy validation logic for MPS (#156385)"
This reverts commit 242eb19c83.

Reverted https://github.com/pytorch/pytorch/pull/156385 on behalf of https://github.com/malfet due to Has some bug in error handling ([comment](https://github.com/pytorch/pytorch/pull/156385#issuecomment-2992441769))
2025-06-20 18:17:18 +00:00
William Wen
0aed855b2b [dynamo] raise hard error if error is encountered while tracing resume function prologue (#154564)
This should prevent bad resume function prologues from slipping by. In particular, graph breaks in resume function prologues will now hard error.

Implementation details:
- The resume function prologue is surrounded by `LOAD_CONST arg, STORE_FAST __is_tracing_resume_prologue` instructions. The first sequence has `arg=True` and the second sequence has `arg=False`.
- InstructionTranslator will know when it is tracing a resume function prologue when it detects `STORE_FAST __is_tracing_resume_prologue`. The top of stack will be True to mark the start of the prologue, False to mark the end.
- When `convert_frame.py` detects that an error occurred while the InstructionTranslator was tracing a resume function prologue, we will wrap the exception and raise a hard error.
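
The marker scheme in the list above can be sketched in plain Python. This is a simplified model, not Dynamo's code: instructions are `(opname, arg)` pairs, `step` stands in for the per-instruction tracing work, and `ResumePrologueError` and `trace` are hypothetical names.

```python
MARKER = "__is_tracing_resume_prologue"


class ResumePrologueError(RuntimeError):
    """Hypothetical wrapper for errors raised inside a resume function prologue."""


def trace(instructions, step):
    """Run `step` on each instruction; wrap failures that occur inside the prologue."""
    stack = []
    in_prologue = False
    for opname, arg in instructions:
        if opname == "LOAD_CONST":
            stack.append(arg)
        elif opname == "STORE_FAST" and arg == MARKER:
            # Top of stack is True at the start of the prologue, False at the end.
            in_prologue = bool(stack.pop())
        else:
            try:
                step(opname, arg)
            except Exception as exc:
                if in_prologue:
                    raise ResumePrologueError(
                        f"error while tracing resume prologue at {opname}"
                    ) from exc
                raise  # outside the prologue, the error propagates unchanged
    return in_prologue
```

An error raised by `step` between the two marker stores surfaces as `ResumePrologueError`; the same error after the closing marker propagates as-is.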

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154564
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289, #154782, #155166
2025-06-20 07:03:29 +00:00
atalman
ebab279942 Forward fix inductor benchmark after #150287 (#156455)
Looks like https://github.com/pytorch/pytorch/pull/150287 stack fixed some inductor tests
HUD: https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=inductor-periodic%20%2F%20linux-jammy-cpu-py3.9-gcc11-inductor

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156455
Approved by: https://github.com/huydhn
2025-06-20 00:04:15 +00:00
Nikita Shulga
242eb19c83 [InductorBench] Fix accuracy validation logic for MPS (#156385)
As it does not support full fp64, validate against float32

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156385
Approved by: https://github.com/Skylion007
2025-06-19 05:37:51 +00:00
Xuehai Pan
42015db6a9 [BE] fix typos in benchmarks/ (#156077)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156077
Approved by: https://github.com/Skylion007, https://github.com/malfet
ghstack dependencies: #156069
2025-06-17 13:12:18 +00:00
Oguz Ulgen
a2a75be0f8 Rename inductor cache (#156128)
Requested by Simon on a different PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156128
Approved by: https://github.com/xmfan
2025-06-17 03:57:18 +00:00
Laith Sakka
4bb936d8b7 refresh expected results (#155817)
Some changes landed while the test was recently unstable, without updating the expected results.
<img width="564" alt="Screenshot 2025-06-12 at 9 26 32 AM" src="https://github.com/user-attachments/assets/9a83f18b-f2a8-485d-a58e-67d8c161eb18" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155817
Approved by: https://github.com/yushangdi
2025-06-12 19:14:21 +00:00
Laith Sakka
cc09d3a5ba remove float args benchmark (#155674)
This benchmark is very sensitive. Removing it for now until we make it better.

<img width="755" alt="Screenshot 2025-06-11 at 12 01 25 AM" src="https://github.com/user-attachments/assets/01a45ae5-2028-42a2-b819-c30d4db3b5d4" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155674
Approved by: https://github.com/bdhirsh, https://github.com/bobrenjc93
2025-06-11 20:34:58 +00:00
Oguz Ulgen
d1947a8707 Migrate from lru_cache to cache (#155613)
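For context, `functools.cache` (Python 3.9+) is equivalent to `functools.lru_cache(maxsize=None)`. A minimal sketch, assuming the migration replaces unbounded `lru_cache` uses; the function names are illustrative only:

```python
import functools


@functools.lru_cache(maxsize=None)
def square_before(n):
    # Unbounded cache spelled the older way.
    return n * n


@functools.cache
def square_after(n):
    # Same behavior with the shorter decorator.
    return n * n
```

Both decorated functions expose `cache_info()` and `cache_clear()`, so callers that inspect or reset the cache keep working.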
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155613
Approved by: https://github.com/ezyang
ghstack dependencies: #155612
2025-06-11 19:44:18 +00:00
Animesh Jain
c881f2ddf3 [reland][dynamo] Mark a vt unspecialized nn module variable source earlier (#155099)
Reland of https://github.com/pytorch/pytorch/pull/154780

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155099
Approved by: https://github.com/williamwen42
2025-06-04 23:05:36 +00:00
Jeff Daily
3ce5102927 [ROCm] fix CI failures from inductor periodic (#154896)
Similar idea as https://github.com/pytorch/pytorch/pull/154497, but for ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154896
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-06-04 15:28:43 +00:00
PyTorch MergeBot
a99a01a677 Revert "[dynamo] Mark a vt unspecialized nn module variable source earlier (#154780)"
This reverts commit cc96febb97.

Reverted https://github.com/pytorch/pytorch/pull/154780 on behalf of https://github.com/seemethere due to This fails internal testing see, https://fburl.com/diff/b0yuxk4w ([comment](https://github.com/pytorch/pytorch/pull/154780#issuecomment-2940381691))
2025-06-04 15:03:34 +00:00
eellison
40a8770154 Incorporate coalesce analysis in codegen (#153751)
This PR uses the coalescing information when generating a tiling. The previous tiling heuristic would have each dependency generate a tiling. Then, we summed up the score for each generated tiling, preferring any 2d tiling over the default. The new tiling heuristic scores each tiling by its global coalesced memory. This gives both a potentially better tiling (especially for more complicated, 3d patterns) as well as information we can use in generating block sizes.

In triton heuristics, for generating 3d tiled reductions, we take the same total block size that the 2d reduction would use, then distribute the block according to whichever block coalesces the most memory.

The motivating kernel is in https://github.com/pytorch/pytorch/issues/149982 which is a 32 element reduction. A smaller version of it is [here](https://gist.github.com/eellison/0fa9396f5479eb4dba09756e3bf6ff2a). We need to run this kernel once in the forward per linear layer on a contiguous tensor, and once in the backward on a transposed tensor.

While the contiguous kernel has coalesced accesses and is performant on main, the transposed version accesses uncoalesced memory on main and is ~2.8x slower. See this [full log](https://gist.github.com/eellison/fa644bfd9d0ae11dadb62e17a5d48a83) from the above repro. With this PR, it is only ~1.15x slower. See the [updated log](https://gist.github.com/eellison/0b2b653309494d28cf7b48929a022075).
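
To see why the transposed access pattern is costly, here is a rough sketch that counts how many memory segments a group of neighbouring threads touches. The numbers (32 threads, 4-byte elements, 128-byte segments) and the helper name are illustrative assumptions, not values taken from the PR or from the inductor heuristic.

```python
def segments_touched(stride_elems, threads=32, elem_bytes=4, segment_bytes=128):
    """Count distinct memory segments touched when `threads` consecutive threads
    each load one element, with `stride_elems` elements between neighbours."""
    return len(
        {(t * stride_elems * elem_bytes) // segment_bytes for t in range(threads)}
    )


contiguous = segments_touched(stride_elems=1)     # neighbours are adjacent in memory
transposed = segments_touched(stride_elems=1024)  # neighbours are a full row apart
print(contiguous, transposed)
```

A tiling that lets neighbouring threads walk the stride-1 dimension touches far fewer segments, which is the kind of quantity a coalescing-based score rewards.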

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153751
Approved by: https://github.com/jansel
ghstack dependencies: #153723, #153730, #153748
2025-06-04 00:22:57 +00:00
Animesh Jain
cc96febb97 [dynamo] Mark a vt unspecialized nn module variable source earlier (#154780)
I am working on providing some skip guard helper functions to allow users to reduce guard overhead. This is a refactor to allow that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154780
Approved by: https://github.com/StrongerXi, https://github.com/jansel
2025-06-03 19:19:47 +00:00
bobrenjc93
28f27886eb Vary batch size when running dynamic shapes benchmarks (#154805)
This better measures the actual runtime performance of dynamic shapes
where we aren't guaranteed to have similar shapes as the original hint.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154805
Approved by: https://github.com/Skylion007
ghstack dependencies: #154802, #154826, #154822, #154823
2025-06-02 18:56:18 +00:00
bobrenjc93
b90fc2ec27 [ez] delete code that died a long time ago (#154802)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154802
Approved by: https://github.com/Skylion007
2025-06-01 14:57:03 +00:00
Pearu Peterson
6a781619bf Temporarily disable sparse tensor validation when loading from external storage. (#154758)
As in the title per https://github.com/pytorch/pytorch/issues/153143#issuecomment-2917793067 .

The plan is to work out a solution that will allow (1) disabling pinned memory check to fix the original issue and (2) switching off the sparse tensor validation for maximal performance in loading sparse tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154758
Approved by: https://github.com/amjames, https://github.com/ngimel
2025-05-31 19:45:44 +00:00
Aaron Orenstein
fc0135ca11 Re-enable FakeTensor caching for SymInts (#152662)
Summary:

This backs out D60320595 which itself turned off FakeTensor caching when a SymInt was present.

There have been a lot of dynamic shape fixes this year and tests pass, so I'm assuming some of that work fixed what was breaking previously.

Test Plan: Reran the tests listed in T196779132 and they pass.

## Perf
### Instruction Counter Benchmark:
- 26% win on add_loop_eager_dynamic
- 13% win on add_loop_inductor_dynamic_gpu
### Perf Dashboard
Compilation Latency wins across the board but especially strong on the dynamic tests (like cudagraphs_dynamic) - for example MobileBertForMaskedLM went from 66s -> 50s.

Differential Revision: [D75467694](https://our.internmc.facebook.com/intern/diff/D75467694)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152662
Approved by: https://github.com/anijain2305
2025-05-30 17:23:36 +00:00
Isuru Fernando
9ba67e99bb [dynamo] keep C++ symbolic shape guards disabled for benchmarks (#151225)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151225
Approved by: https://github.com/anijain2305
2025-05-29 23:29:39 +00:00
Laith Sakka
39df901b2a introduce definitely_contiguous and use it for reshape and tensor meta data computation. (#153432)
When a tensor has unbacked symbols, it can be general enough to represent both contiguous and non-contiguous tensors.
In that case we can't really evaluate `is_contiguous`. In many places in the code base, we check `is_contiguous` to take a fast path, but the general path usually works for both contiguous and non-contiguous tensors; in that case we probably want
to use the `definitely_contiguous` API.

This is applied to reshape in this PR and also to tensor metadata computation. The metadata will now have an attribute that says the tensor is contiguous when it is always contiguous. We store that only if `definitely_contiguous` is true.
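
The idea of a check that answers "yes" only when contiguity is provable can be sketched in plain Python. This is a simplified model, not the PyTorch implementation: unknown (unbacked) sizes and strides are represented as `None`, which discards symbol identity and makes the sketch more conservative than a real symbolic check.

```python
def definitely_contiguous(sizes, strides):
    """Return True only when row-major contiguity can be proven.

    Unknown values are modelled as None (an assumption of this sketch).
    """
    if any(s == 0 for s in sizes if s is not None):
        return True  # a tensor with no elements is trivially contiguous
    expected = 1
    for size, stride in zip(reversed(sizes), reversed(strides)):
        if size == 1:
            continue  # the stride of a size-1 dimension is irrelevant
        if stride is None or expected is None or stride != expected:
            return False  # mismatch, or not enough information to decide
        expected = expected * size if size is not None else None
    return True


print(definitely_contiguous((2, 3), (3, 1)))        # fully known, row-major
print(definitely_contiguous((None, 4), (4, 1)))     # unknown outer size
print(definitely_contiguous((2, None), (None, 1)))  # cannot be proven here
```

A caller would take the fast path only when this returns True and fall back to the general path otherwise, instead of forcing a guard on the unknown values.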

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153432
Approved by: https://github.com/bobrenjc93
2025-05-28 03:41:26 +00:00
Boyuan Feng
514409d032 update torchvision pin (#154255)
Fixes #153985

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154255
Approved by: https://github.com/desertfire
2025-05-27 16:15:25 +00:00
PyTorch MergeBot
3f64502c98 Revert "Re-enable FakeTensor caching for SymInts (#152662)"
This reverts commit 7d11c61c26.

Reverted https://github.com/pytorch/pytorch/pull/152662 on behalf of https://github.com/malfet due to Looks like it broke bunch of inductor tests, see 187d38185e/1 ([comment](https://github.com/pytorch/pytorch/pull/152662#issuecomment-2910293593))
2025-05-26 17:13:22 +00:00
Aaron Orenstein
7d11c61c26 Re-enable FakeTensor caching for SymInts (#152662)
Summary:

This backs out D60320595 which itself turned off FakeTensor caching when a SymInt was present.

There have been a lot of dynamic shape fixes this year and tests pass, so I'm assuming some of that work fixed what was breaking previously.

Test Plan: Reran the tests listed in T196779132 and they pass.

## Perf
### Instruction Counter Benchmark:
- 26% win on add_loop_eager_dynamic
- 13% win on add_loop_inductor_dynamic_gpu
### Perf Dashboard
Compilation Latency wins across the board but especially strong on the dynamic tests (like cudagraphs_dynamic) - for example MobileBertForMaskedLM went from 66s -> 50s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152662
Approved by: https://github.com/anijain2305
2025-05-26 04:17:56 +00:00
Eddie Yan
76ed9db468 [cuBLAS][cuBLASLt] Use cuBLAS default workspace size in Lt (#153556)
Also enables unified workspaces by default for non-FBCODE use cases.
Default Lt workspace size is also updated to match cuBLAS logic for default, including for Blackwell (SM 10.0) and GeForce Blackwell (SM 12.0).

Recommended defaults are documented here:
https://docs.nvidia.com/cuda/cublas/#cublassetworkspace

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153556
Approved by: https://github.com/Skylion007, https://github.com/ngimel
2025-05-24 03:43:35 +00:00
Laith Sakka
9e089bb5b6 change guard_or impl for better perf and simplicity (#153674)
PR time benchmarks have been showing regressions as we move to guard_or_false; the reason is that the previous implementation does not cache.
This new approach propagates the fallback value to eval and returns it, allowing eval to cache and reducing log spam and complexity.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153674
Approved by: https://github.com/bobrenjc93
2025-05-23 15:24:28 +00:00
Huy Do
7509b150af Don't upload compiler benchmark debug info to the benchmark database (#153769)
During our debug session, @wdvr and I found out that the benchmark database is growing much faster than we expected.  After taking a closer look, the majority of the records come from the TorchInductor benchmark, and the top 3 are all debug information not used by any dashboard atm.  In a period of 7 days, there are close to 6 million records ([query](https://paste.sh/GUVCBa0v#UzszFCZaWQxh7oSVsZtfZdVE))

```
Benchmark,Metric,Count
"TorchInductor","user_stack","1926014"
"TorchInductor","reason","1926014"
"TorchInductor","model","1926014"
```

Let's skip uploading them to avoid bloating the database.
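
A minimal sketch of such a filter, assuming each record is a dict with a `"metric"` key; the record layout and the helper name are hypothetical, and only the three metric names come from the query above.

```python
# Metric names taken from the query result; everything else is illustrative.
DEBUG_ONLY_METRICS = {"user_stack", "reason", "model"}


def records_to_upload(records):
    """Drop debug-only records before uploading to the benchmark database."""
    return [r for r in records if r["metric"] not in DEBUG_ONLY_METRICS]


sample = [
    {"benchmark": "TorchInductor", "metric": "user_stack", "value": "..."},
    {"benchmark": "TorchInductor", "metric": "speedup", "value": 1.3},
]
print(records_to_upload(sample))
```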

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153769
Approved by: https://github.com/malfet
2025-05-23 01:18:26 +00:00
Benjamin Glass
768cb734ec cpp_wrapper: build non-performance-sensitive code at O1 (#148773)
Builds on #148212, applying the same improvements to `cpp_wrapper` mode.

Benchmark results:

* [A100 Benchmarks](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Wed%2C%2014%20May%202025%2015%3A10%3A05%20GMT&stopTime=Wed%2C%2021%20May%202025%2015%3A10%3A05%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(a100)&lBranch=gh/benjaminglass1/77/orig&lCommit=ca7d0a3f16e3c511534d2cd03d695be8524570d3&rBranch=main&rCommit=1075bb37d34e483763a09c7810790d5491441e13)
* [x86 Benchmarks](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Wed%2C%2014%20May%202025%2015%3A10%3A05%20GMT&stopTime=Wed%2C%2021%20May%202025%2015%3A10%3A05%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cpu%20(x86)&lBranch=gh/benjaminglass1/77/orig&lCommit=ca7d0a3f16e3c511534d2cd03d695be8524570d3&rBranch=main&rCommit=1075bb37d34e483763a09c7810790d5491441e13)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148773
Approved by: https://github.com/desertfire
2025-05-23 00:51:20 +00:00
Huy Do
6cd9d66b7f Allow higher fp16 tolerance for phlippe_resnet on CUDA 12.8 (#154109)
After https://github.com/pytorch/pytorch/pull/154004, one of the models, `phlippe_resnet`, needs higher tolerance for fp16 on CUDA 12.8.  I can reproduce it locally with:

```
python benchmarks/dynamo/torchbench.py --accuracy --timing --explain --print-compilation-time --inductor --device cuda --training --amp --only phlippe_resnet

E0522 02:47:12.392000 2130213 site-packages/torch/_dynamo/utils.py:2949] RMSE (res-fp64): 0.00144, (ref-fp64): 0.00036 and shape=torch.Size([]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000, use_larger_multiplier_for_smaller_tensor: 0
```

I'm not sure what exactly happens behind the scenes, but this should help fix the CI failure.
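
The numbers in the log line can be plugged into a small check to see why the run fails. This sketch assumes the comparison has the form `res_error <= multiplier * ref_error + tol / 10`, which is not stated in this message and should be treated as an approximation; the raised tolerance of `0.01` is a hypothetical value, not the one chosen by the PR.

```python
def passes_rmse_check(res_error, ref_error, multiplier, tol):
    # Assumed form of the accuracy check; the real logic may differ.
    return res_error <= multiplier * ref_error + tol / 10.0


# Values copied from the log line above.
print(passes_rmse_check(0.00144, 0.00036, multiplier=3.0, tol=0.001))
# Same errors with a hypothetical higher tolerance.
print(passes_rmse_check(0.00144, 0.00036, multiplier=3.0, tol=0.01))
```

Under that assumption the threshold with `tol=0.001` is about 0.00118, just below the observed 0.00144, so a modest increase in tolerance is enough to pass.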

Also remove some leftover expected accuracy results for CUDA 12.4, which we no longer use on CI for benchmark jobs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154109
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-05-22 14:25:12 +00:00
Gabriel Ferns
254293b777 Add flag _metrics_log_runtime to disable runtime metric logging by default (#153506)
https://github.com/pytorch/pytorch/pull/152708 expanded support of `get_estimated_runtime` to many more types of `SchedulerNodes`. This caused an increase in compile time because we're always calling `get_estimated_runtime` to populate the metrics table. This PR adds a flag for this logging, which reduces the instruction count by 8%. Long term, we should probably merge metrics.py with TORCH_LOGS/tlparse (suggestion from @xmfan).

Update: added support for TORCH_LOGS for the metrics logging.

Test Plan:
mm_loop.py and many existing tests cover.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153506
Approved by: https://github.com/eellison
2025-05-22 01:02:11 +00:00
PyTorch MergeBot
3443627e07 Revert "[BE]: Enable RUFF TRY400 rule - log.exception (#153473)"
This reverts commit 4f4ecc583e.

Reverted https://github.com/pytorch/pytorch/pull/153473 on behalf of https://github.com/jeanschmidt due to seems to have broken internal signals, @albanD may I count on you to help the author merge his PR? D74837988 ([comment](https://github.com/pytorch/pytorch/pull/153473#issuecomment-2886017075))
2025-05-16 08:29:26 +00:00
PyTorch MergeBot
4d073af58c Revert "[inductor][dynamo] Include operator name in size/stride/alignment assertion (#152353)"
This reverts commit 725bbb6b5f.

Reverted https://github.com/pytorch/pytorch/pull/152353 on behalf of https://github.com/jeanschmidt due to seems to have broken a few internal tests, @jansel may you help the author get his PR merged? ([comment](https://github.com/pytorch/pytorch/pull/152353#issuecomment-2885997862))
2025-05-16 08:20:39 +00:00
Nikita Shulga
754b758ea1 [BE] Extend empty_gpu_cache to mps (#153657)
And replace `if: elif:` with `getattr()`
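
A rough sketch of the `getattr()` dispatch pattern, written so it runs without PyTorch: `backend_root` stands in for the `torch` module, and the helper body is an assumption about the shape of the change rather than the code in the PR.

```python
from types import SimpleNamespace


def empty_gpu_cache(backend_root, device):
    """Call `<backend_root>.<device>.empty_cache()` if that backend exists.

    With PyTorch, `backend_root` would be the `torch` module, where
    `torch.cuda.empty_cache()` and `torch.mps.empty_cache()` both exist.
    """
    backend = getattr(backend_root, device, None)
    if backend is None or not hasattr(backend, "empty_cache"):
        raise ValueError(f"no empty_cache() for device {device!r}")
    backend.empty_cache()


# Stand-in for the torch module, recording which backend was cleared.
calls = []
fake_torch = SimpleNamespace(
    cuda=SimpleNamespace(empty_cache=lambda: calls.append("cuda")),
    mps=SimpleNamespace(empty_cache=lambda: calls.append("mps")),
)
empty_gpu_cache(fake_torch, "mps")
print(calls)
```

Looking the backend up by name means a new device type needs no extra branch, as long as it exposes `empty_cache()`.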

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153657
Approved by: https://github.com/atalman, https://github.com/wdvr, https://github.com/ZainRizvi
2025-05-16 01:08:54 +00:00
Aaron Gokaslan
4f4ecc583e [BE]: Enable RUFF TRY400 rule - log.exception (#153473)
Change logging.error to logging.exception to log additional information when relevant.  A few places have slipped in logging.error calls in try/except blocks since I last did a cleanup here, and the rule is stabilized, so I am enabling it codebase-wide. I have NOQA'd much of our custom exception stack trace handling for RPC calls and distributed, and tried to fix a few errors based on whether we immediately reraised it or if we didn't print any exception handling where it could be useful.
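
The difference the rule targets can be shown with the standard `logging` module: inside an `except` block, `logger.exception(...)` logs at ERROR level and appends the traceback, while `logger.error(...)` logs only the message. The helper names below are illustrative.

```python
import io
import logging


def run_and_log(logger, use_exception):
    try:
        1 / 0
    except ZeroDivisionError:
        if use_exception:
            logger.exception("division failed")  # ERROR level plus traceback
        else:
            logger.error("division failed")  # ERROR level, message only


def captured(use_exception):
    """Return what a fresh logger writes for one failing call."""
    stream = io.StringIO()
    logger = logging.getLogger(f"try400_demo_{use_exception}")
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    logger.addHandler(handler)
    run_and_log(logger, use_exception)
    logger.removeHandler(handler)
    return stream.getvalue()


print(captured(True))
print(captured(False))
```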

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153473
Approved by: https://github.com/albanD, https://github.com/cyyever
2025-05-15 13:36:59 +00:00
karthickai
725bbb6b5f [inductor][dynamo] Include operator name in size/stride/alignment assertion (#152353)
Fixes #151930

This PR updates the `assert_size_stride` and `assert_alignment` functions in [guards.cpp](https://github.com/pytorch/pytorch/blob/main/torch/csrc/dynamo/guards.cpp) to accept an optional `op_name` argument and includes it in the error messages.

The corresponding type stubs in [guards.pyi](https://github.com/pytorch/pytorch/blob/main/torch/_C/_dynamo/guards.pyi) are updated to match the new function arg.

[inductor/ir.py](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/ir.py) now extracts the operator name from the FX graph and passes it into the `codegen_size_asserts` and `codegen_alignment_asserts` functions, so that generated assertions in Triton code include the op name for better debugging.

Added unit tests inside [test_torchinductor.py](https://github.com/pytorch/pytorch/blob/main/test/inductor/test_torchinductor.py).
- Verified both successful and failing assertion cases include the operator name.
- Verified that generated Triton code contains the op name inside the asserts.
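
As a rough illustration of the effect, here is a pure-Python stand-in for an assertion helper with an optional `op_name`. The function name, argument layout, and message format are hypothetical; the real helpers live in C++ and may differ on all three.

```python
def assert_size_stride_sketch(actual_size, actual_stride, size, stride, op_name=None):
    """Raise AssertionError on mismatch, naming the operator when one is given."""
    if tuple(actual_size) != tuple(size) or tuple(actual_stride) != tuple(stride):
        prefix = f"[{op_name}] " if op_name else ""  # hypothetical message layout
        raise AssertionError(
            f"{prefix}expected size {tuple(size)}, stride {tuple(stride)}; "
            f"got size {tuple(actual_size)}, stride {tuple(actual_stride)}"
        )


try:
    assert_size_stride_sketch((2, 3), (3, 1), (2, 3), (1, 2), op_name="aten.add")
except AssertionError as exc:
    print(exc)
```

With the operator name in the message, a failing assert in generated code points at the op that produced the unexpected layout.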

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152353
Approved by: https://github.com/jansel
2025-05-15 02:33:57 +00:00
Animesh Jain
03d01860fd [dynamo][compile-time] Compute logging related flags once (#153426)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153426
Approved by: https://github.com/jansel
2025-05-14 21:19:06 +00:00
Animesh Jain
8f3d7972ad [dynamo][compile-time] Cache the function signature to speedup inlining (#153396)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153396
Approved by: https://github.com/jansel, https://github.com/StrongerXi
ghstack dependencies: #153333
2025-05-14 14:01:46 +00:00
Animesh Jain
864a5f4434 [dynamo][compile-time] Cache the cleaned instructions while inlining (#153333)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153333
Approved by: https://github.com/StrongerXi, https://github.com/jansel, https://github.com/williamwen42
2025-05-14 09:26:26 +00:00
Animesh Jain
11c64b7cf8 [dynamo][compile-time] Cache whether a function is inlineable (#153192)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153192
Approved by: https://github.com/StrongerXi, https://github.com/jansel, https://github.com/williamwen42
ghstack dependencies: #153458
2025-05-14 05:40:25 +00:00
Benjamin Glass
e8596c291b Fix misleadingly high AOT Inductor dashboard performance (#153060)
Fixes misleadingly high AOTInductor performance benchmark numbers in scenarios where a model updates internal parameters during `torch.export.export`. Since `FakeTensorMode` is enabled during export, all such parameters become `FakeTensor`s, slowing down future eager-mode runs using that model substantially. This, in turn, causes misleading performance stats, where the slowness of eager-mode makes `AOTInductor` look _very_ good.

An [example benchmark](https://hud.pytorch.org/benchmark/timm_models/inductor_aot_inductor?dashboard=torchinductor&startTime=Wed%2C%2030%20Apr%202025%2015%3A54%3A04%20GMT&stopTime=Wed%2C%2007%20May%202025%2015%3A54%3A04%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=main&lCommit=1dd36ad2d440a4f3faf724b3a8e13925e3180c24&rBranch=main&rCommit=cc7346bf19c019255dcb4484694a75850ed74d5a&model=convit_base) with this issue. The equivalent `cpp_wrapper` benchmark run shows a 2x performance gain, not 20x.

Only two benchmarks we regularly run are affected by this, both in the TIMM set.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153060
Approved by: https://github.com/desertfire
2025-05-13 20:59:59 +00:00
Michael Lazos
ff039d39ec [Dynamo] Optimize dedupe region ancestor tracking (#152589)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152589
Approved by: https://github.com/anijain2305
ghstack dependencies: #152389, #152505, #152410, #152506, #152570, #152572
2025-05-13 12:17:59 +00:00
Laith Sakka
c4fb0b6f33 refresh expected results (#150166)
@huydhn when do you think we will have the APIs to access results on oss storage available so we do not
have to worry about this racing again?
Also is there a way to accelerate unstability in this after we land it?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150166
Approved by: https://github.com/bobrenjc93, https://github.com/eellison, https://github.com/anijain2305
2025-05-13 04:04:42 +00:00
Aaron Gokaslan
3555ebb63d [BE]: Update ruff to 0.11.8 (#153249)
Fixes a ton of false negatives throughout the codebase. RUFF also properly validates NOQA comments now and most of the changes are fixing typos there or removing filewide flake8 suppressions that were also silencing ruff issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153249
Approved by: https://github.com/cyyever, https://github.com/albanD, https://github.com/seemethere
2025-05-12 18:30:52 +00:00
PyTorch MergeBot
aa7fe6af41 Revert "[Dynamo] Optimize dedupe region ancestor tracking (#152589)"
This reverts commit b5f1345f72.

Reverted https://github.com/pytorch/pytorch/pull/152589 on behalf of https://github.com/jeanschmidt due to Breaking internal signal citadel-fbcode-test-mode-opt-for-pt2_stack_for_internal-linux-0 please see diff [D74531503](https://www.internalfb.com/diff/D74531503) for more details ([comment](https://github.com/pytorch/pytorch/pull/152410#issuecomment-2871168679))
2025-05-12 07:15:09 +00:00
Michael Lazos
b5f1345f72 [Dynamo] Optimize dedupe region ancestor tracking (#152589)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152589
Approved by: https://github.com/anijain2305
ghstack dependencies: #152389, #152505, #152410, #152506, #152570, #152572
2025-05-10 08:27:56 +00:00
Pian Pawakapan
d808a3e203 [dynamic shapes] guard_or_false for computeStorageNbytes (#150483)
removes fast path for computing storage, fixes some adjacent tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150483
Approved by: https://github.com/laithsakka
2025-05-09 19:31:19 +00:00
Pian Pawakapan
8ea95d2e73 [inductor] dtype promotion error in cat decomp (#152995)
cloning single tensor wasn't following dtype promotion rules
for SAM model: https://github.com/pytorch/pytorch/issues/152606

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152995
Approved by: https://github.com/yushangdi, https://github.com/eellison
2025-05-09 16:58:58 +00:00
Animesh Jain
ab829ec629 [dynamo][pr_time_benchmark] Add dynamo benchmark to stress test inlining (#153159)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153159
Approved by: https://github.com/laithsakka
ghstack dependencies: #152883, #153105
2025-05-09 00:09:19 +00:00
Pian Pawakapan
4166373908 [dynamic shapes] guard_or_false for infer_size (#152146)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152146
Approved by: https://github.com/laithsakka
2025-05-08 21:27:22 +00:00
Animesh Jain
ecd74c953f [dynamo] Recursively realize the stack_values (#152853)
Might also fix - https://github.com/pytorch/pytorch/issues/135696

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152853
Approved by: https://github.com/Lucaskabela, https://github.com/mlazos, https://github.com/jansel
2025-05-07 02:36:44 +00:00
Aaron Gokaslan
07a29dbe81 [BE]: Update cutlass submodule to 3.9.2 (#152779)
A lot of last-minute bugfixes for CUTLASS Blackwell that we should upstream. It's a header-only library and a minor release, so this should strictly improve compiler support and fix some bugs. Needed to update some instruction numbers in torch compile baselines for the new kernels.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152779
Approved by: https://github.com/henrylhtsang
2025-05-06 16:08:24 +00:00
PyTorch MergeBot
fcd5e49138 Revert "[dynamo] Recursively realize the stack_values (#152853)"
This reverts commit 460888f908.

Reverted https://github.com/pytorch/pytorch/pull/152853 on behalf of https://github.com/malfet due to Looks like it broke inductor tests ([comment](https://github.com/pytorch/pytorch/pull/152853#issuecomment-2854897485))
2025-05-06 15:02:57 +00:00
Animesh Jain
460888f908 [dynamo] Recursively realize the stack_values (#152853)
Might also fix - https://github.com/pytorch/pytorch/issues/135696

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152853
Approved by: https://github.com/Lucaskabela, https://github.com/mlazos, https://github.com/jansel
2025-05-06 06:30:31 +00:00
rzou
2b37a726e0 Refactor layout constraint selection logic (#148104)
This PR:

- cleans up some existing comments that don't make sense anymore
- hooks up the "custom_op_default_layout_constraint" back (that seems to
have broken)
- cleans up the "lazy registration path" which seems to never get hit
anymore
- adds dislike_padding to nodes that require exact strides

Test Plan:
- tests + CI

disable padding

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148104
Approved by: https://github.com/shunting314, https://github.com/eellison
2025-05-03 00:02:24 +00:00
rzou
64957db6c9 Fix some inductor periodic benchmarks (#152605)
Some were reporting "pass" consistently on https://hud.pytorch.org/
Those are fine to flip.

I filed a separate issue for the now-regressions for AOTI:
https://github.com/pytorch/pytorch/issues/152606. These should be looked
at.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152605
Approved by: https://github.com/eellison, https://github.com/huydhn
2025-05-01 22:18:30 +00:00
Jason Ansel
15a3f58f91 Return ConstantVariable(None) from WithExitFunctionVariable.exit to prevent NoneType crash inside autocast exception path (#152503)
Copy of #152013 with PR time benchmarks updated (regressions seem unrelated)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152503
Approved by: https://github.com/anijain2305, https://github.com/Skylion007

Co-authored-by: Witold Dziurdz <wdziurdz@habana.ai>
2025-05-01 04:01:24 +00:00
Huy Do
3f10091d3c Clean up conda usage in benchmark scripts (#152552)
Fixes https://github.com/pytorch/pytorch/issues/152123.

* Switch `benchmarks/dynamo/Makefile` to use uv.  Note that these scripts are only used locally, so it's kind of ok to keep conda here IMO.  But switching to uv is probably nicer to most folks.
* Delete some files that are outdated and not used anymore

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152552
Approved by: https://github.com/atalman, https://github.com/albanD
2025-04-30 21:27:29 +00:00
Gabriel Ferns
ce00ec7ecf Enable max autotune for AOTInductor benchmark (#149309)
With this PR, AOTInductor can optionally run in max-autotune mode when benchmarking.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149309
Approved by: https://github.com/desertfire

Co-authored-by: Gabriel Ferns <gabeferns@meta.com>
2025-04-28 06:54:26 +00:00
Anthony Shoumikhin
e2f9759bd0 Fix broken URLs (#152237)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152237
Approved by: https://github.com/huydhn, https://github.com/malfet
2025-04-27 09:56:42 +00:00
Animesh Jain
3c1a17a08b [Dynamo] Use LazyVariableTracker in base VT (#151847)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151847
Approved by: https://github.com/StrongerXi
2025-04-23 18:18:01 +00:00
Eddie Yan
dcc32ff5bf [CUDA][cuBLAS][cuBLASLt] Opt-in unified cuBLAS + cuBLASLt workspaces (#151163)
opt-in version of https://github.com/pytorch/pytorch/pull/145130 as there was a lack of repro for the 70% forward issue
`TORCH_CUBLASLT_UNIFIED_WORKSPACE=1`

@izaitsevfb could you comment if it was repeatable per every forward pass, on startup, or something else?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151163
Approved by: https://github.com/ngimel
2025-04-23 15:24:22 +00:00
Laith Sakka
09e8ff92cc refresh benchmark results (#151622)
updating due to <1.5% increases in https://github.com/pytorch/pytorch/pull/151469
not all benchmarks were updated

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151622
Approved by: https://github.com/oulgen
2025-04-18 02:39:13 +00:00
Oguz Ulgen
ef64beb232 Include post grad gm and fx runnable in cache artifacts for tlparse (#151469)
Fixed #151462

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151469
Approved by: https://github.com/bdhirsh
2025-04-17 17:14:13 +00:00
PyTorch MergeBot
41b82611ee Revert "[Reopen] [Intel GPU] Set higher tolerance for some models only on XPU Device (#144756)"
This reverts commit 300e0ee13c.

Reverted https://github.com/pytorch/pytorch/pull/144756 on behalf of https://github.com/malfet due to Broke rocm torch bench runs with  TypeError: unsupported operand type(s) for |: 'set' and 'list' ([comment](https://github.com/pytorch/pytorch/pull/144756#issuecomment-2812525970))
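
The `TypeError` quoted above is ordinary Python behavior: the `|` operator on a `set` needs another set, while `set.union` accepts any iterable. A minimal sketch of the failure and two possible fixes follows. The model names are hypothetical and this is not the benchmark code itself.

```python
# Hypothetical model collections; the names are illustrative only.
base_models = {"model_a", "model_b"}
extra_models = ["model_c"]

try:
    combined = base_models | extra_models  # set | list is not defined
except TypeError as exc:
    error_message = str(exc)

# Either convert the list first, or use set.union, which takes any iterable.
combined = base_models | set(extra_models)
assert combined == base_models.union(extra_models)
```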
2025-04-17 11:09:01 +00:00
Mao Yunfei
300e0ee13c [Reopen] [Intel GPU] Set higher tolerance for some models only on XPU Device (#144756)
Reopen the previous stale closed PR https://github.com/pytorch/pytorch/pull/134192

We need to increase the tolerance slightly to ensure that certain models pass accuracy check on the XPU device.
This pull request preserves the original tolerance threshold for the CUDA device and introduces a new key higher_fp16_bf16_xpu, which only impacts the XPU device.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144756
Approved by: https://github.com/chuanqi129, https://github.com/EikanWang, https://github.com/desertfire
2025-04-17 00:26:55 +00:00
Brian Hirsh
eea4a7b424 update expected results for comptime benchmark (#151319)
This PR https://github.com/pytorch/pytorch/pull/150594 bumped the benchmark up by ~1%, a bit under our 1.5% "regression" mark.

Modeled this PR after https://github.com/pytorch/pytorch/pull/144274
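
The 1.5% mark can be read as a relative-change gate against a stored expected value. The sketch below assumes that interpretation. The function name and the instruction counts are hypothetical, and the real check script may differ.

```python
def exceeds_regression_mark(expected, measured, threshold=0.015):
    """Return True when `measured` is more than `threshold` above `expected`."""
    return (measured - expected) / expected > threshold


expected_count = 1_000_000_000   # hypothetical stored instruction count
one_percent_up = 1_010_000_000   # ~1% increase: under the mark
two_percent_up = 1_020_000_000   # ~2% increase: over the mark

print(exceeds_regression_mark(expected_count, one_percent_up))
print(exceeds_regression_mark(expected_count, two_percent_up))
```

Several increases that each stay under the mark can still add up, which is one reason to refresh the expected value after an accepted ~1% bump.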

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151319
Approved by: https://github.com/jamesjwu, https://github.com/laithsakka
2025-04-15 19:40:13 +00:00
Shunting Zhang
8f440a8e70 don't return logits for benchmark script (#151075)
PT2 benchmark scripts have a pattern like:
```
    def forward_and_backward_pass(self, mod, inputs, collect_outputs=True):
        cloned_inputs = clone_inputs(inputs)
        self.optimizer_zero_grad(mod)
        with self.autocast(**self.autocast_arg):
            pred = mod(**cloned_inputs)
            loss = self.compute_loss(pred)
        self.grad_scaler.scale(loss).backward()
        self.optimizer_step()
        if collect_outputs:
            return collect_results(mod, pred, loss, cloned_inputs)
        return None
```
for training.

The collect_outputs argument is True only for accuracy testing and it's false for performance testing.

For the HF benchmark suite, a model usually returns a tuple (loss, logits). For performance testing, even though the logits are never used anywhere, dynamo has to keep them due to the control flow.

Keeping the logits here has a few drawbacks:
1. The peak memory is higher, since the logits are large and we cannot release their memory earlier.
2. We cannot apply optimizations like chunking to the logits, because the tensor needs to be returned from the pre-grad graph.

Actually I think it's fine to not return logits at all.
- For training cases, checking loss and gradients for accuracy is good enough. It is unlikely for two runs to have mismatched logits but matching loss/gradients.
- Also, discarding logits as soon as possible for perf benchmarking makes it more fair for us.

On the other hand, it may be interesting to let dynamo support something like dynamo.constexpr (similar to tl.constexpr). A variable annotated as dynamo.constexpr will be specialized at compile time and we can do more optimization (DCE e.g.) at compile time. (A small [repro](https://gist.github.com/shunting314/0912a8947028a904c34f361021b8024d))

Benchmark results here [link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Fri%2C%2004%20Apr%202025%2018%3A03%3A26%20GMT&stopTime=Fri%2C%2011%20Apr%202025%2018%3A03%3A26%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(h100)&lBranch=gh/shunting314/204/head&lCommit=fe25dab3f65e1b0e9db0af03f7664af70fcc9c66&rBranch=main&rCommit=55e62ff74ad5614faf80b060c7bfc551e3b7af5a)
- HF 15% (1.51 -> 1.66 compression ratio) peak memory improvement
- I also see 5% (2.74 -> 2.79x) perf win for HF. It could be true. We may generate more efficient kernels since we don't need to keep the logits and return them from the pre-grad graph. But I'll double check

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151075
Approved by: https://github.com/eellison, https://github.com/jansel
2025-04-15 17:13:00 +00:00
Zhang, Jianyi
a756c50315 [Intel GPU] Avoid using fp32 in sdp math path when benchmark performance. (#150996)
SDP on XPU will fall back to the math path in some cases (i.e. training). In the dynamo benchmark, we prefer to use fp16 for better performance. Although `allow_fp16_bf16_reduction_math_sdp` is under backends.cuda, its implementation applies to all devices.

I didn't add an `if device == xpu` check here; I suppose CUDA devices will not run into the math path anyway.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150996
Approved by: https://github.com/drisspg, https://github.com/EikanWang
2025-04-15 08:08:01 +00:00
Animesh Jain
7b1a2373e8 [dynamo][super variable] Fix bug to use correct source (#151154)
Fixes https://github.com/pytorch/pytorch/issues/150994

We should cherry-pick to 2.7 branch if possible, because this breaks torch.compile on some HF models. Look at the issue referenced here.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151154
Approved by: https://github.com/jansel
2025-04-13 04:48:52 +00:00
PyTorch MergeBot
67d3053d4b Revert "update benchamark result due to <1% regression (#150937)"
This reverts commit 860765d621.

Reverted https://github.com/pytorch/pytorch/pull/150937 on behalf of https://github.com/laithsakka due to regression diff reverted ([comment](https://github.com/pytorch/pytorch/pull/150937#issuecomment-2797611127))
2025-04-11 17:36:47 +00:00
Laith Sakka
91d1826539 Add dynamic version for mm_loop benchmark (#150865)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150865
Approved by: https://github.com/eellison
2025-04-09 23:37:43 +00:00
Laith Sakka
860765d621 update benchamark result due to <1% regression (#150937)
<img width="1503" alt="Screenshot 2025-04-09 at 9 07 13 AM" src="https://github.com/user-attachments/assets/e16f31b0-c5dc-4dd6-8adb-aac11ed988db" />

PR https://hud.pytorch.org/pr/148104
which is acceptable, but we have to update this to avoid flakiness in the future.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150937
Approved by: https://github.com/zou3519
2025-04-09 20:25:48 +00:00
Bin Bao
6a8ab902a2 [AOTI][dashboard] Fix mis-calculated memory compression ratio (#150695)
Summary: https://github.com/pytorch/pytorch/pull/149817 introduced an extra warmup run to compute AOTI memory compression ratio, but since weights are only loaded once in the AOTI run, the peak memory seen in the extra warmup won't include the weight, which causes an artificially high memory compression ratio. This PR removes that extra warmup run, and calls reset_peak_memory_stats in the proper place instead.
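
To see why a missing weight inflates the number, assume the compression ratio is the eager peak memory divided by the compiled peak memory (an assumption about the dashboard metric, not taken from this message). The sizes below are hypothetical.

```python
def compression_ratio(eager_peak_gb, compiled_peak_gb):
    """Assumed definition: eager peak memory over compiled (AOTI) peak memory."""
    return eager_peak_gb / compiled_peak_gb


weights_gb = 4.0       # hypothetical model weights
activations_gb = 2.0   # hypothetical activation peak

eager_peak = weights_gb + activations_gb

# Measurement window that covers weight loading: weights are counted.
correct_ratio = compression_ratio(eager_peak, weights_gb + activations_gb)

# Measurement window that starts after the weights were loaded: weights are missed.
inflated_ratio = compression_ratio(eager_peak, activations_gb)

print(correct_ratio, inflated_ratio)
```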

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150695
Approved by: https://github.com/yushangdi
2025-04-06 19:51:22 +00:00
Laith Sakka
3320efef6b Refresh expected results. (#150264)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150264
Approved by: https://github.com/bobrenjc93
2025-04-05 01:11:19 +00:00
Jason Ansel
d41c22b578 Revert "[fx] Move Node._prepend/Node._remove_from_list to C++ (#148261)" (#150542)
Reverts #148261 due to possible memory leak

This reverts commit 5d4e7d58b4.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150542
Approved by: https://github.com/clee2000
2025-04-03 21:15:38 +00:00
Bin Bao
d4c30b4599 [AOTI][dashboard] Update how peak memory is measured (#150534)
Summary: In the dashboard measurement script, AOTI needs to run Eager first to register the output pytree, so the peak memory compression ratio on the dashboard is always close to 1. Update AOTI run to use an extra warmup run, so the peak memory compression ratio measures the result at the run time instead of the compile time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150534
Approved by: https://github.com/yushangdi
2025-04-03 12:01:43 +00:00
PyTorch MergeBot
203a27e0ce Revert "[cuBLAS][cuBLASLt] Unify cuBLASLt workspaces with cuBLAS workspaces (#145130)"
This reverts commit 8f7fbe3d7d.

Reverted https://github.com/pytorch/pytorch/pull/145130 on behalf of https://github.com/clee2000 due to reverted internally by D72140190 ([comment](https://github.com/pytorch/pytorch/pull/145130#issuecomment-2770874244))
2025-04-01 23:07:28 +00:00
Xuehai Pan
a10b765bf1 [pytree] add APIs to determine a class is a namedtuple or PyStructSequence (#113257)
Changes in this PR:

1. Add `is_structseq` and `is_structseq_class` functions to determine whether an object or a class is a PyStructSequence.
2. Add a generic class `structseq` which can be used as the registration key for PyStructSequence types, like `namedtuple` for Named Tuple types.
3. Change `is_namedtuple` to treat subclasses of a namedtuple as namedtuples. Before this PR, only namedtuple classes directly created by `collections.namedtuple` or `typing.NamedTuple` were namedtuple classes, while their subclasses were not. This PR makes `is_namedtuple` return true for subclasses of a namedtuple class.
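
The behavior change in item 3 can be illustrated with the standard library alone. The two helper functions below are hypothetical stand-ins written for this sketch. They are not the pytree implementation, only one plausible way to express the "direct only" and "subclasses too" checks.

```python
from collections import namedtuple


def is_direct_namedtuple_class(cls):
    """Hypothetical strict check: only classes whose sole base is `tuple`."""
    return (
        isinstance(cls, type)
        and cls.__bases__ == (tuple,)
        and isinstance(getattr(cls, "_fields", None), tuple)
    )


def is_namedtuple_class_sketch(cls):
    """Hypothetical relaxed check: any tuple subclass with the namedtuple API."""
    fields = getattr(cls, "_fields", None)
    return (
        isinstance(cls, type)
        and issubclass(cls, tuple)
        and isinstance(fields, tuple)
        and all(isinstance(name, str) for name in fields)
        and callable(getattr(cls, "_make", None))
        and callable(getattr(cls, "_asdict", None))
    )


Point = namedtuple("Point", ["x", "y"])


class LabeledPoint(Point):
    """Subclass of a namedtuple class; inherits `_fields`, `_make`, `_asdict`."""


print(is_direct_namedtuple_class(LabeledPoint))   # strict check rejects the subclass
print(is_namedtuple_class_sketch(LabeledPoint))   # relaxed check accepts it
```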

Resolves #75982. New tests are included in this PR.

- #75982

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113257
Approved by: https://github.com/zou3519
2025-04-01 10:40:43 +00:00
Zhang, Jianyi
0f12951fc2 [Intel gpu] always set deterministic for xpu accuracy test (#149028)
On Intel Max 1550, models like Super_SloMo can actually pass the accuracy test once deterministic mode is set, because we do not use atomics in the upsampling bilinear backward in some cases when running on XPU. Furthermore, I guess the only reason not to set deterministic on these models is to avoid errors. We should use warn_only = True.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149028
Approved by: https://github.com/guangyey, https://github.com/desertfire

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-04-01 01:00:11 +00:00
LifengWang
51f0403f46 Update the baseline for max_autotune ci workflow (#149107)
Since the issue https://github.com/pytorch/pytorch/issues/148535 is fixed in PR https://github.com/pytorch/pytorch/pull/148923, update the baseline for max_autotune ci workflow.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149107
Approved by: https://github.com/chuanqi129, https://github.com/leslie-fang-intel, https://github.com/desertfire
2025-03-31 09:45:44 +00:00
IvanKobzarev
25309a17f0 [aotd] Config to guess_tangents_stride (#150035)
Differential Revision: [D71907684](https://our.internmc.facebook.com/intern/diff/D71907684)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150035
Approved by: https://github.com/ilyas409, https://github.com/seemethere
2025-03-28 13:54:19 +00:00
Laith Sakka
7379c66344 add loop mm benchmark (#149932)
results:
compile time instruction count for iteration 4 is 67947323682

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149932
Approved by: https://github.com/bobrenjc93, https://github.com/eellison
2025-03-26 11:21:30 +00:00
Laith Sakka
6c9d48b32b refresh results of benchmarks (#149936)
While the test was disabled, I put up a fix, but another win change landed before the test was restored,
so it stayed disabled.
<img width="698" alt="Screenshot 2025-03-24 at 6 26 36 PM" src="https://github.com/user-attachments/assets/2713c685-aee2-4dea-9a6c-cad01ef575cd" />
caused by
https://github.com/pytorch/pytorch/pull/149295

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149936
Approved by: https://github.com/bobrenjc93
2025-03-25 21:01:08 +00:00
Benjamin Glass
23855391f1 Add regression tests for 3 missing PR-time benchmarks (#149423)
Uses values from the latest PR-time benchmark run on viable/strict. See https://github.com/pytorch/pytorch/actions/runs/13898520615/job/38900894469 for a job showing why this is needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149423
Approved by: https://github.com/laithsakka
2025-03-24 23:39:36 +00:00
Simon Fan
86ee3bf3d5 [ca] use torch.compile ca API for benchmarks (#149647)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149647
Approved by: https://github.com/jansel
2025-03-24 19:06:45 +00:00
eqy
8f7fbe3d7d [cuBLAS][cuBLASLt] Unify cuBLASLt workspaces with cuBLAS workspaces (#145130)
As `cuBLAS` workspaces are already per-stream, there shouldn't be kernel execution overlap with `cuBLASLt` kernels.

This PR reuses `cuBLAS` workspaces for `cuBLASLt` for the following benefits:

+ caching (`cuBLAS` workspaces were already cached, so now we get that for `cuBLASLt`)
+ "free" workspace size bump for `cuBLASLt`: `cuBLASLt` workspace sizes were previously smaller than those for `cuBLAS` by default, which potentially hurts performance, and we encountered difficulty in increasing the size due to downstream OOMs; see also #120925
+ fixes broken behavior with the memtracker; https://github.com/pytorch/pytorch/pull/139442 attempted to handle peaky allocation behavior that broke memtracker equivalence tests, but it didn't seem to fully work. Here the cached/reused `cuBLAS` workspace seems to fix it
+ one environment variable to rule them all: `CUBLAS_WORKSPACE_CONFIG` applies directly to `cuBLASLt` without a confusing `CUBLASLT_WORKSPACE_SIZE` that users would also need to consider

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145130
Approved by: https://github.com/ngimel
2025-03-22 05:50:11 +00:00
Pian Pawakapan
e0e8639a10 [torchbench] fix dynamic_shapes spec for moco (#148772)
Fixes https://github.com/pytorch/pytorch/issues/148333

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148772
Approved by: https://github.com/yushangdi, https://github.com/desertfire
2025-03-18 18:16:54 +00:00
Laith Sakka
6055a4f612 refresh benchmarks results. (#149347)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149347
Approved by: https://github.com/jamesjwu
2025-03-18 08:53:49 +00:00
Shunting Zhang
6c7d8419e3 fix two accuracy regression (#149172)
There are 2 accuracy regressions in the 3/12 nightly perf run. I cannot repro them locally, so there is no effective way to bisect. Raise the tolerance to make them pass the accuracy check.

- error log for HF MegatronBertForQuestionAnswering https://gist.github.com/shunting314/25322b66e15e98feed32e0d9a1e43316
- error log for TIMM gluon_inception_v3 https://gist.github.com/shunting314/df64ce22327df27a7057bbbd19ef5164

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149172
Approved by: https://github.com/jansel, https://github.com/eellison
2025-03-17 19:34:00 +00:00
PyTorch MergeBot
f9b4856989 Revert "[pytree] add APIs to determine a class is a namedtuple or PyStructSequence (#113257)"
This reverts commit c95a6b416b.

Reverted https://github.com/pytorch/pytorch/pull/113257 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. @zou3519 can you please help land this internally? See the sigmoid tests in D71198793 for details. To validate the fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/113257#issuecomment-2725982539))
2025-03-14 23:13:34 +00:00
Xuehai Pan
c95a6b416b [pytree] add APIs to determine a class is a namedtuple or PyStructSequence (#113257)
Changes in this PR:

1. Add `is_structseq` and `is_structseq_class` functions to determine whether an object or a class is a PyStructSequence.
2. Add a generic class `structseq` which can be used as the registration key for PyStructSequence types, like `namedtuple` for Named Tuple types.
3. Change `is_namedtuple` to treat subclasses of a namedtuple as namedtuples. Before this PR, only namedtuple classes directly created by `collections.namedtuple` or `typing.NamedTuple` were namedtuple classes, while their subclasses were not. This PR makes `is_namedtuple` return true for subclasses of a namedtuple class.

Resolves #75982. New tests are included in this PR.

- #75982

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113257
Approved by: https://github.com/zou3519
2025-03-14 08:50:30 +00:00
LifengWang
e40a9e602b Add the max_autotune tests in the periodic jobs. (#143560)
To promptly detect issues with max_autotune, such as [#143102](https://github.com/pytorch/pytorch/issues/143102), add the max_autotune tests to the periodic CI to track the accuracy regularly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143560
Approved by: https://github.com/leslie-fang-intel, https://github.com/desertfire
2025-03-12 01:47:46 +00:00
Bin Bao
f69e58e8e8 [CI] Update crossvit_9_240 as pass (#148989)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148989
Approved by: https://github.com/ZainRizvi
2025-03-11 20:54:39 +00:00
Rengan Xu
da4bb72a71 Backout D70075331 (#148824)
Summary:
The AOTI lowering for model 699109736 and other new models worked before D70075331, but failed after with error "RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasLtMatmul with transpose_mat1 1 transpose_mat2 0 m 4096 n 10 k 7936 mat1_ld 7936 mat2_ld 7936 result_ld 4096 abcType 2 computeType 68 scaleType 0"

So we revert D70075331 as a workaround now.

Test Plan: The model could be lowered and published successfully. e.g. 702869739_16

Differential Revision: D70823254

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148824
Approved by: https://github.com/eqy
2025-03-11 12:51:17 +00:00
PyTorch MergeBot
ebd087e4b5 Revert "[pytree] add APIs to determine a class is a namedtuple or PyStructSequence (#113257)"
This reverts commit f08146b67b.

Reverted https://github.com/pytorch/pytorch/pull/113257 on behalf of https://github.com/jovianjaison due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/113257#issuecomment-2711299830))
2025-03-10 17:19:21 +00:00
Jason Ansel
a60b4ed623 [fx] Optimize TracerBase.create_arg and Graph._gen_python_code (#148292)
Before: 19502951 function calls (18702776 primitive calls) in 8.533 seconds
After: 16402551 function calls (15602452 primitive calls) in 7.701 seconds

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148292
Approved by: https://github.com/oulgen
ghstack dependencies: #148243, #148260, #148261, #148288
2025-03-10 16:06:19 +00:00
Jason Ansel
8f858e226b [fx] Optimizations for node name generation (#148288)
Before:
![image](https://github.com/user-attachments/assets/3a9ed22b-ae33-41ec-a0db-01f4f3ca2ffe)

After:
![image](https://github.com/user-attachments/assets/44c6e578-c63e-4a43-b3e0-d11d4bdbb6db)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148288
Approved by: https://github.com/oulgen
ghstack dependencies: #148243, #148260, #148261
2025-03-10 16:06:19 +00:00
Jason Ansel
5d4e7d58b4 [fx] Move Node._prepend/Node._remove_from_list to C++ (#148261)
Microbenchmarking `fx.symbolic_trace(lambda x: functools.reduce(operator.add, [x, *range(100000)]))`, before:
```
24303536 function calls (23503339 primitive calls) in 10.726 seconds
```
after:
```
20003454 function calls (19203257 primitive calls) in 8.936 seconds
```
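
Figures in the "N function calls (M primitive calls) in S seconds" form are what Python's `cProfile`/`pstats` report. A minimal sketch of collecting such a summary line is shown below. It profiles a plain `functools.reduce` stand-in rather than `fx.symbolic_trace`, so it runs without PyTorch, and the `workload` function is an assumption made for illustration.

```python
import cProfile
import functools
import io
import operator
import pstats


def workload(n=100_000):
    # Stand-in for the traced lambda: a long chain of additions.
    return functools.reduce(operator.add, [0, *range(n)])


profiler = cProfile.Profile()
profiler.enable()
result = workload()
profiler.disable()

# Render the report into a string and pull out the summary line.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(5)
summary = next(
    line.strip() for line in buffer.getvalue().splitlines() if "function calls" in line
)
print(summary)
```

To profile the real microbenchmark, the body of `workload` would be replaced with the `fx.symbolic_trace(...)` call quoted above.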

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148261
Approved by: https://github.com/oulgen
ghstack dependencies: #148243, #148260
2025-03-10 16:06:11 +00:00
Jason Ansel
bf752c36da [fx] Move Node._update_args_kwargs to C++ (#148260)
Microbenchmarking `fx.symbolic_trace(lambda x: functools.reduce(operator.add, [x, *range(100000)]))`, before:
```
25203549 function calls (24403352 primitive calls) in 12.090 seconds
```
after:
```
24303536 function calls (23503339 primitive calls) in 10.726 seconds
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148260
Approved by: https://github.com/oulgen
ghstack dependencies: #148243
2025-03-10 16:06:02 +00:00
Jason Ansel
bec7bdad47 [fx] Move map_aggregate to C++ (#148243)
Microbenchmarking `fx.symbolic_trace(lambda x: functools.reduce(operator.add, [x, *range(100000)]))`, before:
```
30603618 function calls (29403419 primitive calls) in 13.744 seconds
```
after:
```
25203549 function calls (24403352 primitive calls) in 12.090 seconds
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148243
Approved by: https://github.com/oulgen
2025-03-10 16:05:53 +00:00
atalman
2068235c0a Add timm_efficientnet to flaky models after cuda 12.6 update in CI/CD (#148788)
After https://github.com/pytorch/pytorch/pull/148612
This model has become flaky

Tracking this regression in an issue: https://github.com/pytorch/pytorch/issues/148699

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148788
Approved by: https://github.com/izaitsevfb, https://github.com/malfet
2025-03-10 13:40:41 +00:00
Jason Ansel
9a1a2e1516 Better log message to update pr_time_benchmarks/expected_results.csv (#148303)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148303
Approved by: https://github.com/Skylion007
2025-03-09 17:12:47 +00:00
Ting Lu
9769618d35 [CI] [inductor] Add cu126 inductor jobs and move away cu124 (#148612)
https://github.com/pytorch/pytorch/issues/145570

breaking https://github.com/pytorch/pytorch/pull/140793 into eager and inductor benchmarks to unblock

It seems many inductor yml files were added after the initial change was prepared.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148612
Approved by: https://github.com/nWEIdia, https://github.com/atalman

Co-authored-by: atalman <atalman@fb.com>
2025-03-07 18:30:14 +00:00
Shunting Zhang
262411e48b [inductor] online softmax (#127011)
Softmax needs to do some preparation work that accesses the input tensor in two passes:
- compute amax of each row
- compute (x - amax).exp.sum for each row

When the row size is large, the cache cannot hold all the active data, and accessing the input in multiple passes increases execution time since the kernel is memory-bandwidth bound.

Online softmax uses a customized reduction to compute max and sum at the same time by accessing the data in one pass. Check this paper for more details ( https://arxiv.org/abs/1805.02867 ).
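
The single-pass reduction can be sketched in plain Python. This is an illustration of the technique from the referenced paper, not the kernel that inductor generates. The function names are chosen for this example.

```python
import math


def two_pass_softmax_stats(row):
    """Reference: one pass for the max, a second pass for the sum."""
    row_max = max(row)
    row_sum = sum(math.exp(x - row_max) for x in row)
    return row_max, row_sum


def online_softmax_stats(row):
    """Single pass: keep a running max and rescale the running sum when it grows."""
    running_max = float("-inf")
    running_sum = 0.0
    for x in row:
        new_max = max(running_max, x)
        # Rescale the old sum to the new max, then add the current element.
        running_sum = running_sum * math.exp(running_max - new_max) + math.exp(x - new_max)
        running_max = new_max
    return running_max, running_sum


row = [1.0, 3.0, -2.0, 0.5]
print(two_pass_softmax_stats(row))
print(online_softmax_stats(row))
```

Both functions return the same (max, sum) pair up to floating-point rounding, but the online version reads each element only once.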

Also here is an online softmax kernel generated by inductor as a reference: https://gist.github.com/shunting314/67ae4fffd45d4f2753c781780332fa54

## Microbenchmark

- `TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 TORCHINDUCTOR_ONLINE_SOFTMAX=0 DO_PERF_TEST=1 python test/inductor/test_online_softmax.py -k test_softmax` : without online softmax
  - eager_ms=6.671296119689941
  - opt_ms=8.06931209564209
- `TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 TORCHINDUCTOR_ONLINE_SOFTMAX=1 DO_PERF_TEST=1 python test/inductor/test_online_softmax.py -k test_softmax`: with online softmax
  - eager_ms=6.634047985076904
  - opt_ms=6.230591773986816

Ideally, online softmax should save about 2ms here. We save about 1.84ms in practice.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127011
Approved by: https://github.com/jansel
2025-03-06 21:07:18 +00:00
Xuehai Pan
f08146b67b [pytree] add APIs to determine a class is a namedtuple or PyStructSequence (#113257)
Changes in this PR:

1. Add `is_structseq` and `is_structseq_class` functions to determine whether an object or a class is a PyStructSequence.
2. Add a generic class `structseq` which can be used as the registration key for PyStructSequence types, like `namedtuple` for Named Tuple types.
3. Change `is_namedtuple` to treat subclasses of a namedtuple as namedtuples. Before this PR, only namedtuple classes directly created by `collections.namedtuple` or `typing.NamedTuple` were namedtuple classes, while their subclasses were not. This PR makes `is_namedtuple` return true for subclasses of a namedtuple class.

Resolves #75982. New tests are included in this PR.

- #75982

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113257
Approved by: https://github.com/zou3519
2025-03-06 18:59:02 +00:00
Bin Bao
d10bacd4ce [AOTI][dashboard] Skip torchbench models not supported by export (#148359)
Summary: Certain models fail in export because of data-dependent ops. Skip them so that oncall can better track the AOTInductor dashboard.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148359
Approved by: https://github.com/angelayi, https://github.com/ysiraichi
2025-03-06 18:08:17 +00:00
Laith Sakka
913356fb41 Fix recent regression in evaluate_expr that effect cache lookups (#147836)
PR https://github.com/pytorch/pytorch/pull/146939/ added an argument to evaluate_expr for the purpose of logging.
This caused a regression that we thought was due to calling id on symnode.

I dug deeper and found that adding that argument, although it does not affect the results of evaluate_expr, messes up the cache
lookups.
I refactored the code to avoid using expr_sym_node_id in the cache lookup. I also introduced evaluate_sym_node and simplified the calls to evaluate_expr.
#suppress-bc-linter
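
The effect on cache lookups can be reproduced with `functools.lru_cache`: an extra argument becomes part of the cache key even when it does not change the result. The sketch below uses hypothetical function names and a placeholder computation. It is not the `evaluate_expr` code.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def evaluate_keyed_on_expr(expr):
    return len(expr)  # placeholder for an expensive evaluation


@lru_cache(maxsize=None)
def evaluate_keyed_on_expr_and_id(expr, node_id):
    # `node_id` is only meant for logging, but it still enters the cache key.
    return len(expr)


for node_id in range(3):
    evaluate_keyed_on_expr("s0 + 1")
    evaluate_keyed_on_expr_and_id("s0 + 1", node_id)

hits_without_id = evaluate_keyed_on_expr.cache_info().hits
hits_with_id = evaluate_keyed_on_expr_and_id.cache_info().hits
print(hits_without_id, hits_with_id)
```

Keeping the logging-only value out of the cached function's arguments restores the hits, which matches the refactoring direction described above.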

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147836
Approved by: https://github.com/oulgen
2025-03-05 04:11:41 +00:00
PyTorch MergeBot
92beda54c8 Revert "[fx] Move map_aggregate to C++ (#148243)"
This reverts commit edaff88f69.

Reverted https://github.com/pytorch/pytorch/pull/148243 on behalf of https://github.com/jovianjaison due to breaking internal builds [T216910920] ([comment](https://github.com/pytorch/pytorch/pull/148243#issuecomment-2698724058))
2025-03-04 19:40:21 +00:00
PyTorch MergeBot
17d003fe75 Revert "[fx] Move Node._update_args_kwargs to C++ (#148260)"
This reverts commit 0135f57f4a.

Reverted https://github.com/pytorch/pytorch/pull/148260 on behalf of https://github.com/jovianjaison due to breaking internal builds [T216910920] ([comment](https://github.com/pytorch/pytorch/pull/148243#issuecomment-2698724058))
2025-03-04 19:40:21 +00:00
PyTorch MergeBot
97b9e68bc6 Revert "[fx] Move Node._prepend/Node._remove_from_list to C++ (#148261)"
This reverts commit 29c2de9ae1.

Reverted https://github.com/pytorch/pytorch/pull/148261 on behalf of https://github.com/jovianjaison due to breaking internal builds [T216910920] ([comment](https://github.com/pytorch/pytorch/pull/148243#issuecomment-2698724058))
2025-03-04 19:40:21 +00:00
PyTorch MergeBot
6fb18ff685 Revert "Better log message to update pr_time_benchmarks/expected_results.csv (#148303)"
This reverts commit a3d69e6e1a.

Reverted https://github.com/pytorch/pytorch/pull/148303 on behalf of https://github.com/jovianjaison due to breaking internal builds [T216910920] ([comment](https://github.com/pytorch/pytorch/pull/148243#issuecomment-2698724058))
2025-03-04 19:40:21 +00:00
PyTorch MergeBot
611b0e9bc4 Revert "[fx] Optimizations for node name generation (#148288)"
This reverts commit 5eb0337cfd.

Reverted https://github.com/pytorch/pytorch/pull/148288 on behalf of https://github.com/clee2000 due to something in this stack broke some dynamo and higher order ops tests like higher_order_ops/test_invoke_subgraph.py::TestInvokeSubgraphCompile::test_dedupe [GH job link](https://github.com/pytorch/pytorch/actions/runs/13645082540/job/38149882002) [HUD commit link](8531d247ba).   dynamo/test_graph_deduplication did run on the PR but the higher_order_ops one didn't, probably combo of landrace and bad TD ([comment](https://github.com/pytorch/pytorch/pull/148288#issuecomment-2698365172))
2025-03-04 17:10:12 +00:00
PyTorch MergeBot
ed9055c303 Revert "[fx] Optimize TracerBase.create_arg and Graph._gen_python_code (#148292)"
This reverts commit 8531d247ba.

Reverted https://github.com/pytorch/pytorch/pull/148292 on behalf of https://github.com/clee2000 due to something in this stack broke some dynamo and higher order ops tests like higher_order_ops/test_invoke_subgraph.py::TestInvokeSubgraphCompile::test_dedupe [GH job link](https://github.com/pytorch/pytorch/actions/runs/13645082540/job/38149882002) [HUD commit link](8531d247ba).   dynamo/test_graph_deduplication did run on the PR but the higher_order_ops one didn't, probably combo of landrace and bad TD ([comment](https://github.com/pytorch/pytorch/pull/148288#issuecomment-2698365172))
2025-03-04 17:10:12 +00:00
Jason Ansel
8531d247ba [fx] Optimize TracerBase.create_arg and Graph._gen_python_code (#148292)
Before: 19502951 function calls (18702776 primitive calls) in 8.533 seconds
After: 16402551 function calls (15602452 primitive calls) in 7.701 seconds

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148292
Approved by: https://github.com/oulgen
ghstack dependencies: #148243, #148260, #148261, #148303, #148288
2025-03-04 02:42:23 +00:00
Jason Ansel
5eb0337cfd [fx] Optimizations for node name generation (#148288)
Before:
![image](https://github.com/user-attachments/assets/3a9ed22b-ae33-41ec-a0db-01f4f3ca2ffe)

After:
![image](https://github.com/user-attachments/assets/44c6e578-c63e-4a43-b3e0-d11d4bdbb6db)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148288
Approved by: https://github.com/oulgen
ghstack dependencies: #148243, #148260, #148261, #148303
2025-03-04 02:42:23 +00:00
Jason Ansel
a3d69e6e1a Better log message to update pr_time_benchmarks/expected_results.csv (#148303)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148303
Approved by: https://github.com/Skylion007
ghstack dependencies: #148243, #148260, #148261
2025-03-04 02:42:23 +00:00
Jason Ansel
29c2de9ae1 [fx] Move Node._prepend/Node._remove_from_list to C++ (#148261)
Microbenchmarking `fx.symbolic_trace(lambda x: functools.reduce(operator.add, [x, *range(100000)]))`, before:
```
24303536 function calls (23503339 primitive calls) in 10.726 seconds
```
after:
```
20003454 function calls (19203257 primitive calls) in 8.936 seconds
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148261
Approved by: https://github.com/oulgen
ghstack dependencies: #148243, #148260
2025-03-02 22:42:31 +00:00
Jason Ansel
0135f57f4a [fx] Move Node._update_args_kwargs to C++ (#148260)
Microbenchmarking `fx.symbolic_trace(lambda x: functools.reduce(operator.add, [x, *range(100000)]))`, before:
```
25203549 function calls (24403352 primitive calls) in 12.090 seconds
```
after:
```
24303536 function calls (23503339 primitive calls) in 10.726 seconds
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148260
Approved by: https://github.com/oulgen
ghstack dependencies: #148243
2025-03-02 22:42:31 +00:00
Jason Ansel
edaff88f69 [fx] Move map_aggregate to C++ (#148243)
Microbenchmarking `fx.symbolic_trace(lambda x: functools.reduce(operator.add, [x, *range(100000)]))`, before:
```
30603618 function calls (29403419 primitive calls) in 13.744 seconds
```
after:
```
25203549 function calls (24403352 primitive calls) in 12.090 seconds
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148243
Approved by: https://github.com/oulgen
2025-03-02 22:42:31 +00:00
Boyuan Feng
6e10471966 [ci] disable cudagraph for tts_angular on dashboard (#148221)
tts_angular with cudagraph is flaky. Its speedup varies from .05 to 1.01. This PR disables cudagraph for tts_angular to avoid the noise. Since tts_angular shows ~1x speedup while other torchbench models show ~2x speedup, skipping tts_angular would wrongly bump the cudagraph speedup. So this PR only disables cudagraph for tts_angular instead of skipping tts_angular.
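
The "wrongly bump" argument is easy to check numerically. The sketch below assumes the dashboard aggregates per-model speedups with a geometric mean (an assumption made here, not stated in this message) and uses hypothetical speedup values.

```python
from statistics import geometric_mean

other_speedups = [2.0, 2.1, 1.9]   # hypothetical ~2x torchbench models
tts_angular_speedup = 1.0          # ~1x, as described above

with_model = geometric_mean(other_speedups + [tts_angular_speedup])
without_model = geometric_mean(other_speedups)

print(f"with tts_angular:    {with_model:.3f}")
print(f"without tts_angular: {without_model:.3f}")
```

Dropping the ~1x model raises the aggregate, so keeping the model in the suite and only turning off cudagraph for it leaves the summary number honest.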

[Dashboard](https://github.com/pytorch/pytorch/actions/runs/13597394087)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148221
Approved by: https://github.com/eellison
2025-03-02 03:31:19 +00:00
Xuehai Pan
c73a92fbf5 [BE][CI] bump ruff to 0.9.2: multiline assert statements (#144546)
Reference: https://docs.astral.sh/ruff/formatter/black/#assert-statements

> Unlike Black, Ruff prefers breaking the message over breaking the assertion, similar to how both Ruff and Black prefer breaking the assignment value over breaking the assignment target:
>
> ```python
> # Input
> assert (
>     len(policy_types) >= priority + num_duplicates
> ), f"This tests needs at least {priority+num_duplicates} many types."
>
>
> # Black
> assert (
>     len(policy_types) >= priority + num_duplicates
> ), f"This tests needs at least {priority+num_duplicates} many types."
>
> # Ruff
> assert len(policy_types) >= priority + num_duplicates, (
>     f"This tests needs at least {priority + num_duplicates} many types."
> )
> ```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144546
Approved by: https://github.com/malfet
2025-02-27 20:46:16 +00:00
Katarzyna Fojcik
edaf9ddeb5 Add basic Gaudi support to benchmarks/dynamo (#145920)
This PR adds basic Gaudi support to benchmarks/dynamo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145920
Approved by: https://github.com/eellison
2025-02-26 14:50:22 +00:00
Oguz Ulgen
bb7e8fbd66 [CacheBench] Add hf_T5 llama moco to cachebench (#147783)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147783
Approved by: https://github.com/huydhn
ghstack dependencies: #147688, #147780, #147781, #147782
2025-02-25 04:34:45 +00:00
Oguz Ulgen
895564d6b6 [CacheBench] Add huggingface (#147782)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147782
Approved by: https://github.com/huydhn
ghstack dependencies: #147688, #147780, #147781
2025-02-25 04:34:45 +00:00
Oguz Ulgen
c4fb6ae55d [CacheBench] Separate dynamic into its own option (#147781)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147781
Approved by: https://github.com/huydhn
ghstack dependencies: #147688, #147780
2025-02-25 04:34:34 +00:00
Oguz Ulgen
60d4cbfc06 [CacheBench] Add repeat option so that we can have more accurate cache results (#147780)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147780
Approved by: https://github.com/huydhn
ghstack dependencies: #147688
2025-02-25 04:34:25 +00:00
Oguz Ulgen
ab3b814af3 [CacheBench] Add ciflow/trunk test (#147688)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147688
Approved by: https://github.com/huydhn
2025-02-25 04:34:16 +00:00
Xuehai Pan
754fb834db [BE][CI] bump ruff to 0.9.0: string quote styles (#144569)
Reference: https://docs.astral.sh/ruff/formatter/#f-string-formatting

- Change the outer quotes to double quotes for nested f-strings

```diff
- f'{", ".join(args)}'
+ f"{', '.join(args)}"
```

- Change the inner quotes to double quotes for triple f-strings

```diff
  string = """
-     {', '.join(args)}
+     {", ".join(args)}
  """
```

- Join implicitly concatenated strings

```diff
- string = "short string " "short string " f"{var}"
+ string = f"short string short string {var}"
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144569
Approved by: https://github.com/Skylion007
ghstack dependencies: #146509
2025-02-24 19:56:09 +00:00
eqy
718cf68aee [cuBLAS][cuBLASLt] Unify cuBLASLt workspaces with cuBLAS workspaces (#145130)
As `cuBLAS` workspaces are already per-stream, there shouldn't be kernel execution overlap with `cuBLASLt` kernels.

This PR reuses `cuBLAS` workspaces for `cuBLASLt` for the following benefits:

+ caching (`cuBLAS` workspaces were already cached, so now we get that for `cuBLASLt`)
+ "free" workspace size bump for `cuBLASLt`: `cuBLASLt` workspace sizes were previously smaller than those for `cuBLAS` by default, which potentially hurts performance, and we encountered difficulty in increasing the size due to downstream OOMs; see also #120925
+ fixes broken behavior with the memtracker: https://github.com/pytorch/pytorch/pull/139442 attempted to handle peaky allocation behavior that broke memtracker equivalence tests, but it didn't seem to fully work; here the cached/reused `cuBLAS` workspace seems to fix it
+ one environment variable to rule them all: `CUBLAS_WORKSPACE_CONFIG` applies directly to `cuBLASLt` without a confusing `CUBLASLT_WORKSPACE_SIZE` that users would also need to consider

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145130
Approved by: https://github.com/ngimel
2025-02-23 22:01:39 +00:00
Aaron Orenstein
086d146f6f Update ruff linter for PEP585 (#147540)
This turns on PEP585 enforcement in RUFF.

- Updates the target python version
- Stops ignoring UP006 warnings (PEP585)
- Fixes a few issues which crept into the tree in the last day
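
As a small illustration of what PEP 585 enforcement asks for (a sketch only; `group_by_length` is a made-up function, not code from this PR), annotations use the builtin generics `list[...]` and `dict[...]` instead of `typing.List` and `typing.Dict`:

```python
# Before (flagged by UP006 once enforcement is on):
#     from typing import Dict, List
#     def group_by_length(words: List[str]) -> Dict[int, List[str]]: ...

# After: builtin collection types are subscripted directly (Python 3.9+).
def group_by_length(words: list[str]) -> dict[int, list[str]]:
    groups: dict[int, list[str]] = {}
    for word in words:
        groups.setdefault(len(word), []).append(word)
    return groups


print(group_by_length(["a", "bb", "cc"]))
```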

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147540
Approved by: https://github.com/justinchuby, https://github.com/Skylion007
2025-02-22 04:45:17 +00:00
Oguz Ulgen
1c334893dc [CacheBench] Refactor code to prepare for mode benchmarks (#147641)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147641
Approved by: https://github.com/huydhn
2025-02-22 00:20:54 +00:00
Oguz Ulgen
a8ce4d1846 Add cachebench (#147537)
This PR adds a new benchmark called cachebench in order to measure/demonstrate the prowess of PT2 caching.
```
python benchmarks/dynamo/cachebench.py --output="result.json"
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147537
Approved by: https://github.com/jamesjwu
2025-02-21 17:06:45 +00:00
Animesh Jain
76ad19a549 [dynamo][codegen] Implement CSE for pre-graph graph-arg bytecode reconstruction (#147425)
This reduces fixed overhead seen in a few internal models.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147425
Approved by: https://github.com/jansel, https://github.com/StrongerXi
2025-02-20 05:42:52 +00:00
angelayi
84abeaad5c [export] Log evaluate_expr (#146939)
We want to log each symnode created so that we can do provenance tracking in the tlparse report generated for draft export. To do this, we want to assign a unique id to every symnode, which Python's `id` function already does. Then, for every expression created, we can find the provenance by tracing back through the ids of its arguments. This logging only happens when dtrace_structured is enabled, which is only the case when running draft export.
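
The idea of tracing provenance through argument ids can be sketched in plain Python. This is an illustration under stated assumptions, not the logging code from the PR: `Sym`, `record`, and `provenance` are hypothetical names, and the objects are kept alive for the whole run so that their `id` values stay unique.

```python
class Sym:
    """Hypothetical stand-in for a symbolic node with a name and arguments."""

    def __init__(self, name, args=()):
        self.name = name
        self.args = tuple(args)


def record(node, log):
    # Log the node's own id together with the ids of its arguments.
    log.append(
        {"id": id(node), "name": node.name, "arg_ids": [id(a) for a in node.args]}
    )


def provenance(node_id, log):
    """Walk back through argument ids and collect the names that were reached."""
    by_id = {entry["id"]: entry for entry in log}
    seen_ids, names, stack = set(), [], [node_id]
    while stack:
        current = stack.pop()
        if current in seen_ids or current not in by_id:
            continue
        seen_ids.add(current)
        names.append(by_id[current]["name"])
        stack.extend(by_id[current]["arg_ids"])
    return names


log = []
s0, s1 = Sym("s0"), Sym("s1")
total = Sym("s0 + s1", [s0, s1])
for node in (s0, s1, total):
    record(node, log)

print(provenance(id(total), log))
```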

An example output is as follows:

<img width="799" alt="image" src="https://github.com/user-attachments/assets/88bb31b4-8c31-43fb-aa88-08b573b9f71d" />

For the increase in the compile_time_instruction_count benchmark, this seems unavoidable because I need to call `id` to get the unique identifier for each symnode. But I believe `id` is an inexpensive operation, so hopefully it should be ok? I tried doing the following:
* Originally I was passing around `self`, which is a SymNode, which caused the compile time to be ~6.36M
* I changed it to pass around `id(self)` instead, which reduced the compile time to ~6.33M
* Then I changed it to be passed as a positional arg instead of a kwarg, which reduced the compile time to ~6.22M, but this doesn't seem to be a super worthwhile fix?

#suppress-bc-linter

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146939
Approved by: https://github.com/oulgen
2025-02-18 18:49:51 +00:00
Animesh Jain
71484a2106 [pt2-benchmarks] Compiler reset on every run (#147313)
Internal benchmarks call `run` in a loop. Resetting the compiler on every run gives each iteration a clean environment.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147313
Approved by: https://github.com/jansel
2025-02-18 02:09:19 +00:00
Aaron Gokaslan
e738f7ba23 [BE]: Enable ruff rule SIM113 (#147290)
Lint rule that tells the user to avoid keeping track of their own counter and to use the builtin `enumerate` when possible.
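
A minimal before/after sketch of the pattern this rule targets (the function names are illustrative, not taken from the PR):

```python
def number_items_manual(items):
    # Pattern flagged by SIM113: a hand-maintained loop counter.
    numbered = []
    idx = 0
    for item in items:
        numbered.append((idx, item))
        idx += 1
    return numbered


def number_items_enumerate(items):
    # Preferred form: let enumerate() supply the counter.
    numbered = []
    for idx, item in enumerate(items):
        numbered.append((idx, item))
    return numbered


print(number_items_enumerate(["a", "b", "c"]))
```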

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147290
Approved by: https://github.com/jansel
2025-02-16 22:41:16 +00:00
Tom Ritchford
272ead7b5e Make fx.node.map_arg() and .map_aggregate() generic (#146248)
## What's the problem?

The popular `fx.node.map_arg()` and `fx.node.map_aggregate()` apply operations recursively on `dict`s, `tuple`s, `list`s, etc., and return a new collection of the same type.

Unfortunately, their base input type is `Argument`, which is [very unspecific indeed](5d55a6585d/torch/fx/node.py (L48-L58)): most type information is just thrown away at the call site of either of these functions, as far as the type checker goes.

As `torch` moves to a more typed code base, this would force innocent, unsuspecting developers to add logically unnecessary casts or `# type: ignore` statements.

## What's the solution?

Making these two `node.map_*` functions generic on the first argument and return type means that type information is preserved for the type checker. (The signature of the other parameter, the function that visits the nodes and subnodes, has not changed, nor should it.)
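
The general shape of such a signature can be sketched with a `TypeVar`. This is a simplified illustration, not the actual `torch.fx` implementation: `map_aggregate_sketch` is a hypothetical name, and it handles only tuples, lists, dicts, and leaves.

```python
from typing import Any, Callable, TypeVar, cast

T = TypeVar("T")


def map_aggregate_sketch(a: T, fn: Callable[[Any], Any]) -> T:
    """Apply fn to every leaf of a nested structure, preserving container types."""
    if isinstance(a, tuple):
        return cast(T, tuple(map_aggregate_sketch(elem, fn) for elem in a))
    if isinstance(a, list):
        return cast(T, [map_aggregate_sketch(elem, fn) for elem in a])
    if isinstance(a, dict):
        return cast(T, {k: map_aggregate_sketch(v, fn) for k, v in a.items()})
    # Leaf: the visitor's own signature stays untyped, as in the description above.
    return cast(T, fn(a))


nested = {"a": [1, (2, 3)]}
scaled = map_aggregate_sketch(nested, lambda x: x * 10)
print(scaled)
```

Because the return type is tied to the type of the first argument, a type checker sees `scaled` as a `dict` here rather than as an unspecific union, so no cast is needed at the call site.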

## Won't it break everything?

It doesn't break the type checker - one place needed an extra hint.

There have been code breakages: one resolved, at least one new one... we'll see!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146248
Approved by: https://github.com/XuehaiPan, https://github.com/Skylion007
2025-02-14 19:25:32 +00:00
Aaron Gokaslan
1f41ceb713 [BE][Ez]: Enable ruff rule banning print in assert (#146615)
Enables a few ruff rules
* Ban print statements within asserts (likely bugs)
* ~Use string for Decimal literal to prevent loss of precision~
* ~Do not use default args for `__post_init__` in dataclasses, they likely were meant to go into the factory method, the `__init__`, or somewhere else. The default values are useless here.~

Wait until ruff upgrade for the last 2
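
To illustrate why a `print` inside an `assert` is likely a bug (a sketch with made-up function names): `print()` returns `None`, so the assertion message ends up being `None` and the text only goes to stdout as a side effect.

```python
def check_positive_buggy(value):
    # print() returns None, so the AssertionError carries None as its message.
    assert value > 0, print("value must be positive")


def check_positive_fixed(value):
    # Pass the string itself as the assertion message.
    assert value > 0, f"value must be positive, got {value}"


def assertion_message(check, value):
    """Return the message attached to the AssertionError raised by check(value)."""
    try:
        check(value)
    except AssertionError as exc:
        return exc.args[0] if exc.args else None
    return "no error"


print(repr(assertion_message(check_positive_buggy, -1)))
print(repr(assertion_message(check_positive_fixed, -1)))
```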

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146615
Approved by: https://github.com/jansel
2025-02-13 21:14:00 +00:00
Huy Do
04011304e5 Update dynamo expected 20250210 (#146856)
Update all the ci accuracy expect values to make trunk green.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146856
Approved by: https://github.com/yanboliang
2025-02-12 18:01:20 +00:00
Animesh Jain
d6513f3246 [dynamo] Support list subclasses and fix dict subclasses mutation bugs (#146819)
This PR adds support for list subclasses. Among other things, it includes:

1) Tracking the mutations on internal vts like `_dict_vt` and `_list_vt` using sources. This helps identify if there was a mutation in the underlying data structures, and we need to reconstruct it.
2) `UserDefinedObjectVariable` now has a new method - `is_modified` which `side_effect` infra relies upon to check mutations in the underlying vts (like `_dict_vt`).
3) `reconstruction` logic ensures that we use `dict.__getitem__` and `list.__getitem__` methods. This is super important because we don't want to call the overridden `__getitem__` methods.
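
Point 3 can be illustrated in plain Python, outside of Dynamo. In this sketch (`ShiftedList` is a made-up class), calling `list.__getitem__` directly reads the underlying storage, while normal indexing goes through the subclass override:

```python
class ShiftedList(list):
    """List subclass whose __getitem__ does not return the stored value."""

    def __getitem__(self, index):
        return super().__getitem__(index) + 100


items = ShiftedList([1, 2, 3])

overridden = items[0]             # goes through ShiftedList.__getitem__
raw = list.__getitem__(items, 0)  # bypasses the override

print(overridden, raw)
```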

If this PR is hard to review, please let me know. I can break it into several small PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146819
Approved by: https://github.com/StrongerXi, https://github.com/jansel
2025-02-12 17:46:02 +00:00
blorange-amd
5fd15a04b7 [ROCm] Enable inductor-periodic testing for MI300 (#144594)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144594
Approved by: https://github.com/malfet, https://github.com/huydhn

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-02-10 17:42:09 +00:00
Jason Ansel
d35f6b2339 [inductor] Minor compile time optimizations in DefaultHandler (#146282)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146282
Approved by: https://github.com/shunting314
ghstack dependencies: #146252, #146254, #146255, #146257
2025-02-08 18:00:40 +00:00
Jason Ansel
403db2faee [inductor] Refactor op handlers part 4 (#146255)
This replaces the `__getattr__()` pattern used in remaining OpHandlers with a `DefaultHandler` class defined in part 2.
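
The difference between the two patterns can be sketched roughly as follows. The real `DefaultHandler` from part 2 is not shown in this log, so the classes below are assumptions about its general shape with hypothetical names: explicit methods that forward to a single `_default` hook, instead of resolving every op through `__getattr__` at call time.

```python
class GetattrHandler:
    """Old pattern: every op name is resolved dynamically via __getattr__."""

    def __getattr__(self, name):
        def inner(*args, **kwargs):
            return f"{name}({', '.join(map(str, args))})"

        return inner


class DefaultHandlerSketch:
    """Assumed new pattern: real methods that all forward to one fallback hook."""

    def _default(self, name, args, kwargs):
        raise NotImplementedError

    def add(self, a, b):
        return self._default("add", (a, b), {})

    def mul(self, a, b):
        return self._default("mul", (a, b), {})


class FormatHandler(DefaultHandlerSketch):
    def _default(self, name, args, kwargs):
        return f"{name}({', '.join(map(str, args))})"


print(GetattrHandler().add(1, 2))
print(FormatHandler().mul("x", 3))
```

With explicit methods, attribute lookup succeeds through the normal class mechanism and does not build a new closure on every call, which is one plausible source of the compile-time wins reported for this change.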

Some compile time wins from this as well:
```
2025-02-02T19:46:32.2033010Z
2025-02-02T19:46:32.2036607Z WIN: benchmark ('add_loop_inductor', 'compile_time_instruction_count') failed, actual result 29633182927 is -1.71% lower than expected 30150000000 ±1.50% please update the expected results.
2025-02-02T19:46:32.2037575Z
2025-02-02T19:46:32.2037907Z please update all results that changed significantly, and not only the failed ones
2025-02-02T19:46:32.2039291Z PASS: benchmark ('add_loop_inductor_dynamic_gpu', 'compile_time_instruction_count') pass, actual result 43986879172 -1.02% is within expected 44440000000 ±2.50%
2025-02-02T19:46:32.2040131Z
2025-02-02T19:46:32.2041180Z WIN: benchmark ('add_loop_inductor_gpu', 'compile_time_instruction_count') failed, actual result 26246225695 is -1.85% lower than expected 26740000000 ±1.50% please update the expected results.
2025-02-02T19:46:32.2042188Z
```
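
The percentages in the log above can be checked by hand as relative change against the expected value, compared with the tolerance. The sketch below is an assumption about how such a check might look, not the benchmark's actual code; `classify` is a hypothetical helper, and the `"REGRESSION"` label is a guess, since only `WIN` and `PASS` appear in the log.

```python
def classify(actual, expected, tolerance):
    """Return (label, relative_change) for a measured instruction count."""
    change = (actual - expected) / expected
    if change < -tolerance:
        return "WIN", change
    if change > tolerance:
        return "REGRESSION", change  # label is an assumption, not from the log
    return "PASS", change


results = [
    ("add_loop_inductor", 29633182927, 30150000000, 0.015),
    ("add_loop_inductor_dynamic_gpu", 43986879172, 44440000000, 0.025),
    ("add_loop_inductor_gpu", 26246225695, 26740000000, 0.015),
]
for name, actual, expected, tolerance in results:
    label, change = classify(actual, expected, tolerance)
    print(f"{label}: {name} {change * 100:.2f}%")
```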

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146255
Approved by: https://github.com/shunting314
ghstack dependencies: #146252, #146254
2025-02-08 18:00:17 +00:00
PyTorch MergeBot
80a1696679 Revert "[cuBLAS][cuBLASLt] Unify cuBLASLt workspaces with cuBLAS workspaces (#145130)"
This reverts commit 5f0901e573.

Reverted https://github.com/pytorch/pytorch/pull/145130 on behalf of https://github.com/atalman due to Reverted internally ([comment](https://github.com/pytorch/pytorch/pull/145130#issuecomment-2644122846))
2025-02-07 21:04:23 +00:00
Animesh Jain
e2e265e27b [dynamo] Use polyfill to implement comparison operators (#144485)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144485
Approved by: https://github.com/jansel
2025-02-06 17:27:07 +00:00
eqy
5f0901e573 [cuBLAS][cuBLASLt] Unify cuBLASLt workspaces with cuBLAS workspaces (#145130)
As `cuBLAS` workspaces are already per-stream, there shouldn't be kernel execution overlap with `cuBLASLt` kernels.

This PR reuses `cuBLAS` workspaces for `cuBLASLt` for the following benefits:

+ caching (`cuBLAS` workspaces were already cached, so now we get that for `cuBLASLt`)
+ "free" workspace size bump for `cuBLASLt`: `cuBLASLt` workspace sizes were previously smaller than those for `cuBLAS` by default, which potentially hurts performance, and we encountered difficulty in increasing the size due to downstream OOMs; see also #120925
+ fixes broken behavior with the memtracker: https://github.com/pytorch/pytorch/pull/139442 attempted to handle peaky allocation behavior that broke memtracker equivalence tests, but it didn't seem to fully work; here the cached/reused `cuBLAS` workspace seems to fix it
+ one environment variable to rule them all: `CUBLAS_WORKSPACE_CONFIG` applies directly to `cuBLASLt` without a confusing `CUBLASLT_WORKSPACE_SIZE` that users would also need to consider

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145130
Approved by: https://github.com/ngimel
2025-02-06 05:57:33 +00:00
PyTorch MergeBot
68304dba7a Revert "[inductor] Refactor op handlers part 4 (#146255)"
This reverts commit 7aced455c5.

Reverted https://github.com/pytorch/pytorch/pull/146255 on behalf of https://github.com/atalman due to Sorry need to revert https://github.com/pytorch/pytorch/pull/146252 ([comment](https://github.com/pytorch/pytorch/pull/146255#issuecomment-2638258089))
2025-02-05 23:24:20 +00:00
PyTorch MergeBot
93e1e6e07c Revert "[inductor] Minor compile time optimizations in DefaultHandler (#146282)"
This reverts commit b8a529cca1.

Reverted https://github.com/pytorch/pytorch/pull/146282 on behalf of https://github.com/atalman due to Sorry need to revert https://github.com/pytorch/pytorch/pull/146252 ([comment](https://github.com/pytorch/pytorch/pull/146282#issuecomment-2638239575))
2025-02-05 23:13:08 +00:00