Shunting Zhang
6c7d8419e3
fix two accuracy regression ( #149172 )
...
There are two accuracy regressions in the 3/12 nightly perf run. I cannot repro them locally, so there is no effective way to bisect. Raise the tolerance to make them pass the accuracy check.
- error log for HF MegatronBertForQuestionAnswering https://gist.github.com/shunting314/25322b66e15e98feed32e0d9a1e43316
- error log for TIMM gluon_inception_v3 https://gist.github.com/shunting314/df64ce22327df27a7057bbbd19ef5164
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149172
Approved by: https://github.com/jansel , https://github.com/eellison
2025-03-17 19:34:00 +00:00
Aaron Gokaslan
bfee141666
[BE]: Apply ruff PERF403 to use dict comprehensions more often ( #149257 )
...
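For context, a minimal illustration of the pattern PERF403 rewrites (hypothetical code, not taken from this PR):
```python
params = [("weight", (64, 64)), ("bias", (64,))]

# Before: building a dict with an explicit loop (flagged by ruff PERF403)
shapes = {}
for name, shape in params:
    shapes[name] = shape

# After: the equivalent dict comprehension PERF403 suggests
shapes = {name: shape for name, shape in params}
```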
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149257
Approved by: https://github.com/jansel
2025-03-16 23:52:58 +00:00
PyTorch MergeBot
f9b4856989
Revert "[pytree] add APIs to determine a class is a namedtuple or PyStructSequence ( #113257 )"
...
This reverts commit c95a6b416b .
Reverted https://github.com/pytorch/pytorch/pull/113257 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. @zou3519 can you please help land this internally? See the sigmoid tests in D71198793 for details. To validate the fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/113257#issuecomment-2725982539 ))
2025-03-14 23:13:34 +00:00
Xuehai Pan
c95a6b416b
[pytree] add APIs to determine a class is a namedtuple or PyStructSequence ( #113257 )
...
Changes in this PR:
1. Add `is_structseq` and `is_structseq_class` functions to determine whether an object or a class is a PyStructSequence.
2. Add a generic class `structseq` which can be used as the registration key for PyStructSequence types, like `namedtuple` for named tuple types.
3. Change `is_namedtuple` to treat subclasses of namedtuple classes as namedtuples. Before this PR, only namedtuple classes directly created by `collections.namedtuple` or `typing.NamedTuple` were considered namedtuple classes, while their subclasses were not. This PR makes `is_namedtuple` return true for subclasses of namedtuple classes.
Resolves #75982. New tests are included in this PR.
- #75982
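A simplified sketch of the checks these APIs perform (illustrative only, not the actual `torch.utils._pytree` implementation):
```python
import time
from collections import namedtuple

def is_structseq_class(cls):
    # PyStructSequence types are C-defined tuple subclasses that expose
    # an n_fields class attribute (e.g. time.struct_time, os.stat_result).
    return isinstance(cls, type) and issubclass(cls, tuple) and hasattr(cls, "n_fields")

def is_namedtuple_class(cls):
    # Subclasses of a namedtuple class inherit _fields/_make, so this
    # check accepts them too, matching the new behavior described above.
    return (
        isinstance(cls, type)
        and issubclass(cls, tuple)
        and hasattr(cls, "_fields")
        and hasattr(cls, "_make")
    )

Point = namedtuple("Point", ["x", "y"])

class Point3D(Point):  # a subclass of a namedtuple class
    pass

assert is_structseq_class(time.struct_time)       # a PyStructSequence type
assert is_namedtuple_class(Point) and is_namedtuple_class(Point3D)
assert not is_namedtuple_class(time.struct_time)  # structseqs lack _fields
```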
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113257
Approved by: https://github.com/zou3519
2025-03-14 08:50:30 +00:00
henrylhtsang
f2d43d866c
[cutlass backend] switch layout for cutlass backend benchmark ( #149009 )
...
```
python benchmarks/inductor_backends/cutlass.py
```
logs:
```
Experiment group: mm (1024x1024, 1024x1024) torch.float16
+-----------------------+--------------------+----------------------+---------------------+
| name | forward_time (us) | compilation_time (s) | perf_over_aten (%) |
+-----------------------+--------------------+----------------------+---------------------+
| aten | 13.059554621577263 | 1.580178506206721 | NA |
| triton | 10.245470330119133 | 0.04118620231747627 | -21.54808776410064 |
| triton_persistent_tma | 10.388538241386414 | 0.04225084185600281 | -20.45258400908819 |
| cutlass_lvl_default | 12.882896699011326 | 231.14990583620965 | -1.3527101626732294 |
| cutlass_lvl_1111 | 11.362981051206589 | 126.41650272067636 | -12.99105229490415 |
| cutlass_lvl_2222 | 11.107578873634338 | 555.8380545829423 | -14.946725248331441 |
+-----------------------+--------------------+----------------------+---------------------+
Experiment group: mm (1024x1024, 1024x1024) torch.bfloat16
+-----------------------+--------------------+----------------------+---------------------+
| name | forward_time (us) | compilation_time (s) | perf_over_aten (%) |
+-----------------------+--------------------+----------------------+---------------------+
| aten | 14.037585817277431 | 0.21587548777461052 | NA |
| triton | 10.571777820587158 | 78.15654796129093 | -24.68948750735019 |
| triton_persistent_tma | 10.761583223938942 | 1.3195342738181353 | -23.337364672110443 |
| cutlass_lvl_default | 12.872588820755482 | 237.0100042372942 | -8.299126443010406 |
| cutlass_lvl_1111 | 11.08622644096613 | 137.55013868492097 | -21.02469338195443 |
| cutlass_lvl_2222 | 11.044904589653015 | 551.265836935956 | -21.319059178545007 |
+-----------------------+--------------------+----------------------+---------------------+
Experiment group: mm (2048x2048, 2048x2048) torch.float16
+-----------------------+--------------------+----------------------+---------------------+
| name | forward_time (us) | compilation_time (s) | perf_over_aten (%) |
+-----------------------+--------------------+----------------------+---------------------+
| aten | 30.483894050121307 | 0.27990864124149084 | NA |
| triton | 29.567627236247063 | 99.87172158574685 | -3.005740711366232 |
| triton_persistent_tma | 29.66325916349888 | 1.3695051120594144 | -2.692027748401006 |
| cutlass_lvl_default | 29.82821688055992 | 72.61214569816366 | -2.150897022812533 |
| cutlass_lvl_1111 | 29.476772993803024 | 67.7428645719774 | -3.303780857728953 |
| cutlass_lvl_2222 | 30.113255605101585 | 233.84051702311262 | -1.2158500630212203 |
+-----------------------+--------------------+----------------------+---------------------+
Experiment group: mm (2048x2048, 2048x2048) torch.bfloat16
+-----------------------+--------------------+----------------------+---------------------+
| name | forward_time (us) | compilation_time (s) | perf_over_aten (%) |
+-----------------------+--------------------+----------------------+---------------------+
| aten | 30.58255836367607 | 0.058386584743857384 | NA |
| triton | 29.799651354551315 | 100.18178300186992 | -2.559978795150901 |
| triton_persistent_tma | 29.362043365836143 | 1.534341821912676 | -3.990885861562106 |
| cutlass_lvl_default | 29.4346883893013 | 73.68858492700383 | -3.7533484305817093 |
| cutlass_lvl_1111 | 29.164200648665428 | 75.44329373072833 | -4.637799421958348 |
| cutlass_lvl_2222 | 29.13798950612545 | 227.33327346481383 | -4.7235056020244 |
+-----------------------+--------------------+----------------------+---------------------+
Experiment group: mm (8192x8192, 8192x8192) torch.float16
+-----------------------+--------------------+----------------------+--------------------+
| name | forward_time (us) | compilation_time (s) | perf_over_aten (%) |
+-----------------------+--------------------+----------------------+--------------------+
| aten | 1656.6237211227417 | 0.0549461180344224 | NA |
| triton | 1892.8285837173462 | 2.3174119112081826 | 14.258208401997386 |
| triton_persistent_tma | 1665.332317352295 | 2.7922237082384527 | 0.525683419747917 |
| cutlass_lvl_default | 1705.5492401123047 | 108.31571159465238 | 2.9533272019312116 |
| cutlass_lvl_1111 | 1714.9059772491455 | 17.64627545280382 | 3.518134829489478 |
| cutlass_lvl_2222 | 1680.4152727127075 | 306.9972395859659 | 1.4361469829637354 |
+-----------------------+--------------------+----------------------+--------------------+
Experiment group: mm (8192x8192, 8192x8192) torch.bfloat16
+-----------------------+--------------------+----------------------+--------------------+
| name | forward_time (us) | compilation_time (s) | perf_over_aten (%) |
+-----------------------+--------------------+----------------------+--------------------+
| aten | 1621.416687965393 | 0.06300561130046844 | NA |
| triton | 1782.3902368545532 | 2.318530729971826 | 9.927956834535548 |
| triton_persistent_tma | 1586.0934257507324 | 2.7931175641715527 | -2.178543151605614 |
| cutlass_lvl_default | 1657.4617624282837 | 43.31810224894434 | 2.2230605328307784 |
| cutlass_lvl_1111 | 1641.5367126464844 | 17.648567833006382 | 1.2408916739557292 |
| cutlass_lvl_2222 | 1645.8417177200317 | 249.33647010894492 | 1.5064005407078918 |
+-----------------------+--------------------+----------------------+--------------------+
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149009
Approved by: https://github.com/chenyang78 , https://github.com/jingsh
2025-03-13 01:57:47 +00:00
henrylhtsang
66300d3d55
[cutlass backend] try make cutlass backend benchmark more robust ( #149015 )
...
Differential Revision: [D71006269](https://our.internmc.facebook.com/intern/diff/D71006269/ )
I want to make sure the benchmark, even if it fails on some experiments, can still print most of the results.
```
Experiment group: mm (3x3, 3x3) torch.bfloat16
+-----------------------+-------------------+----------------------+---------------------+
| name | forward_time (us) | compilation_time (s) | perf_over_aten (%) |
+-----------------------+-------------------+----------------------+---------------------+
| aten | 6.175220478326082 | 0.5982149520423263 | NA |
| triton | 5.326753947883844 | 3.2067150759976357 | -13.739858089605114 |
| triton_persistent_tma | 5.340870004147291 | 3.279932268196717 | -13.51126615004617 |
| cutlass_lvl_default | inf | inf | inf |
| cutlass_lvl_1111 | inf | inf | inf |
| cutlass_lvl_2222 | inf | inf | inf |
| cutlass_lvl_3333 | inf | inf | inf |
+-----------------------+-------------------+----------------------+---------------------+
```
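A minimal sketch of the failure-isolation pattern behind the table above (hypothetical names; the real benchmark lives in `benchmarks/inductor_backends/cutlass.py`):
```python
import math

def run_experiments(experiments):
    # Record inf for a failed experiment instead of aborting the whole
    # run, so the rows for the remaining backends can still be printed.
    results = []
    for name, fn in experiments:
        try:
            forward_us, compile_s = fn()
        except Exception:
            forward_us = compile_s = math.inf
        results.append((name, forward_us, compile_s))
    return results

# A failing experiment produces an inf row, as in the table above.
rows = run_experiments([
    ("aten", lambda: (6.18, 0.60)),
    ("cutlass_lvl_default", lambda: 1 / 0),  # simulated failure
])
print(rows)
```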
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149015
Approved by: https://github.com/chenyang78 , https://github.com/jingsh
2025-03-12 18:59:49 +00:00
LifengWang
e40a9e602b
Add the max_autotune tests in the periodic jobs. ( #143560 )
...
To promptly detect issues with max_autotune, such as [#143102 ](https://github.com/pytorch/pytorch/issues/143102 ), add the max_autotune tests to the periodic CI to track the accuracy regularly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143560
Approved by: https://github.com/leslie-fang-intel , https://github.com/desertfire
2025-03-12 01:47:46 +00:00
Bin Bao
f69e58e8e8
[CI] Update crossvit_9_240 as pass ( #148989 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148989
Approved by: https://github.com/ZainRizvi
2025-03-11 20:54:39 +00:00
Rengan Xu
da4bb72a71
Backout D70075331 ( #148824 )
...
Summary:
The AOTI lowering for model 699109736 and other new models worked before D70075331, but failed after with error "RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasLtMatmul with transpose_mat1 1 transpose_mat2 0 m 4096 n 10 k 7936 mat1_ld 7936 mat2_ld 7936 result_ld 4096 abcType 2 computeType 68 scaleType 0"
So we revert D70075331 as a workaround now.
Test Plan: The model could be lowered and published successfully. e.g. 702869739_16
Differential Revision: D70823254
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148824
Approved by: https://github.com/eqy
2025-03-11 12:51:17 +00:00
PyTorch MergeBot
ebd087e4b5
Revert "[pytree] add APIs to determine a class is a namedtuple or PyStructSequence ( #113257 )"
...
This reverts commit f08146b67b .
Reverted https://github.com/pytorch/pytorch/pull/113257 on behalf of https://github.com/jovianjaison due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/113257#issuecomment-2711299830 ))
2025-03-10 17:19:21 +00:00
Jason Ansel
a60b4ed623
[fx] Optimize TracerBase.create_arg and Graph._gen_python_code ( #148292 )
...
Before: 19502951 function calls (18702776 primitive calls) in 8.533 seconds
After: 16402551 function calls (15602452 primitive calls) in 7.701 seconds
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148292
Approved by: https://github.com/oulgen
ghstack dependencies: #148243 , #148260 , #148261 , #148288
2025-03-10 16:06:19 +00:00
Jason Ansel
8f858e226b
[fx] Optimizations for node name generation ( #148288 )
...
Before/after profiler output was attached as screenshots in the PR description (not recoverable here).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148288
Approved by: https://github.com/oulgen
ghstack dependencies: #148243 , #148260 , #148261
2025-03-10 16:06:19 +00:00
Jason Ansel
5d4e7d58b4
[fx] Move Node._prepend/Node._remove_from_list to C++ ( #148261 )
...
Microbenchmarking `fx.symbolic_trace(lambda x: functools.reduce(operator.add, [x, *range(100000)]))`, before:
```
24303536 function calls (23503339 primitive calls) in 10.726 seconds
```
after:
```
20003454 function calls (19203257 primitive calls) in 8.936 seconds
```
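For reference, a sketch of how such call counts can be collected with cProfile (assuming torch is installed; this mirrors the microbenchmark quoted above):
```python
import cProfile
import functools
import operator
import pstats

import torch.fx as fx

def f(x):
    # Builds a graph with ~100k add nodes when traced.
    return functools.reduce(operator.add, [x, *range(100000)])

prof = cProfile.Profile()
prof.enable()
fx.symbolic_trace(f)
prof.disable()

# Prints "N function calls (M primitive calls) in S seconds" plus hot spots.
pstats.Stats(prof).sort_stats("tottime").print_stats(5)
```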
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148261
Approved by: https://github.com/oulgen
ghstack dependencies: #148243 , #148260
2025-03-10 16:06:11 +00:00
Jason Ansel
bf752c36da
[fx] Move Node._update_args_kwargs to C++ ( #148260 )
...
Microbenchmarking `fx.symbolic_trace(lambda x: functools.reduce(operator.add, [x, *range(100000)]))`, before:
```
25203549 function calls (24403352 primitive calls) in 12.090 seconds
```
after:
```
24303536 function calls (23503339 primitive calls) in 10.726 seconds
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148260
Approved by: https://github.com/oulgen
ghstack dependencies: #148243
2025-03-10 16:06:02 +00:00
Jason Ansel
bec7bdad47
[fx] Move map_aggregate to C++ ( #148243 )
...
Microbenchmarking `fx.symbolic_trace(lambda x: functools.reduce(operator.add, [x, *range(100000)]))`, before:
```
30603618 function calls (29403419 primitive calls) in 13.744 seconds
```
after:
```
25203549 function calls (24403352 primitive calls) in 12.090 seconds
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148243
Approved by: https://github.com/oulgen
2025-03-10 16:05:53 +00:00
atalman
2068235c0a
Add timm_efficientnet to flaky models after cuda 12.6 update in CI/CD ( #148788 )
...
After https://github.com/pytorch/pytorch/pull/148612 landed,
this model has become flaky.
This regression is tracked in an issue: https://github.com/pytorch/pytorch/issues/148699
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148788
Approved by: https://github.com/izaitsevfb , https://github.com/malfet
2025-03-10 13:40:41 +00:00
Jason Ansel
9a1a2e1516
Better log message to update pr_time_benchmarks/expected_results.csv ( #148303 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148303
Approved by: https://github.com/Skylion007
2025-03-09 17:12:47 +00:00
Ting Lu
9769618d35
[CI] [inductor] Add cu126 inductor jobs and move away cu124 ( #148612 )
...
https://github.com/pytorch/pytorch/issues/145570
Breaking https://github.com/pytorch/pytorch/pull/140793 into separate eager and inductor benchmark changes to unblock.
It seems many inductor YAML files were added after the initial change was prepared.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148612
Approved by: https://github.com/nWEIdia , https://github.com/atalman
Co-authored-by: atalman <atalman@fb.com>
2025-03-07 18:30:14 +00:00
drisspg
127bd5a02d
Add sparsity ( #148513 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148513
Approved by: https://github.com/danielvegamyhre
2025-03-07 01:47:52 +00:00
Shunting Zhang
262411e48b
[inductor] online softmax ( #127011 )
...
Softmax needs to do some preparation work that accesses the input tensor in two passes:
- compute the amax of each row
- compute (x - amax).exp().sum() for each row
When the row size is large, the cache cannot hold all the active data, and accessing the input in multiple passes increases execution time since the kernel is memory-bandwidth bound.
Online softmax uses a customized reduction to compute the max and the sum at the same time, accessing the data in a single pass. Check this paper for more details ( https://arxiv.org/abs/1805.02867 ).
Also here is an online softmax kernel generated by inductor as a reference: https://gist.github.com/shunting314/67ae4fffd45d4f2753c781780332fa54
## Microbenchmark
- `TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 TORCHINDUCTOR_ONLINE_SOFTMAX=0 DO_PERF_TEST=1 python test/inductor/test_online_softmax.py -k test_softmax` : without online softmax
- eager_ms=6.671296119689941
- opt_ms=8.06931209564209
- `TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 TORCHINDUCTOR_ONLINE_SOFTMAX=1 DO_PERF_TEST=1 python test/inductor/test_online_softmax.py -k test_softmax`: with online softmax
- eager_ms=6.634047985076904
- opt_ms=6.230591773986816
Ideally, online softmax should save about 2ms here. We save about 1.84ms in practice.
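For illustration, a minimal NumPy sketch of the one-pass max-and-sum reduction from the paper (not the inductor-generated kernel):
```python
import numpy as np

def online_softmax_row(x):
    # Track the running max m and the running sum d of exp(x_i - m),
    # rescaling d whenever m increases (https://arxiv.org/abs/1805.02867).
    m, d = -np.inf, 0.0
    for v in x:
        m_new = max(m, float(v))
        d = d * np.exp(m - m_new) + np.exp(float(v) - m_new)
        m = m_new
    # The normalization still reads x once more, but max and sum were
    # computed together in a single pass instead of two.
    return np.exp(x - m) / d

x = np.random.randn(1024).astype(np.float32)
ref = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
assert np.allclose(online_softmax_row(x), ref, atol=1e-6)
```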
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127011
Approved by: https://github.com/jansel
2025-03-06 21:07:18 +00:00
Xuehai Pan
f08146b67b
[pytree] add APIs to determine a class is a namedtuple or PyStructSequence ( #113257 )
...
Changes in this PR:
1. Add `is_structseq` and `is_structseq_class` functions to determine whether an object or a class is a PyStructSequence.
2. Add a generic class `structseq` which can be used as the registration key for PyStructSequence types, like `namedtuple` for named tuple types.
3. Change `is_namedtuple` to treat subclasses of namedtuple classes as namedtuples. Before this PR, only namedtuple classes directly created by `collections.namedtuple` or `typing.NamedTuple` were considered namedtuple classes, while their subclasses were not. This PR makes `is_namedtuple` return true for subclasses of namedtuple classes.
Resolves #75982. New tests are included in this PR.
- #75982
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113257
Approved by: https://github.com/zou3519
2025-03-06 18:59:02 +00:00
Bin Bao
d10bacd4ce
[AOTI][dashboard] Skip torchbench models not supported by export ( #148359 )
...
Summary: Certain models fail in export because of data-dependent ops. Skip them so that oncall can better track the AOTInductor dashboard.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148359
Approved by: https://github.com/angelayi , https://github.com/ysiraichi
2025-03-06 18:08:17 +00:00
Laith Sakka
913356fb41
Fix recent regression in evaluate_expr that effect cache lookups ( #147836 )
...
PR https://github.com/pytorch/pytorch/pull/146939/ added an argument to evaluate_expr for the purpose of logging.
This caused a regression that we initially thought was due to calling id() on the symnode.
I dug deeper and found that, although the added argument does not affect the results of evaluate_expr, it messes up cache lookups.
I refactored the code to avoid using expr_sym_node_id in the cache lookup; I also introduced evaluate_sym_node and simplified the calls to evaluate_expr.
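To make the caching pitfall concrete, a hypothetical sketch (made-up names; not the actual ShapeEnv code):
```python
from functools import lru_cache

@lru_cache
def evaluate_cached(expr):
    return hash(expr)  # stand-in for the real evaluation

def evaluate_expr(expr, expr_sym_node_id=None):
    # Keep the logging-only argument out of the cached call: if it were
    # part of the cache key, identical exprs with different ids would
    # all miss, which is the regression described above.
    result = evaluate_cached(expr)
    if expr_sym_node_id is not None:
        pass  # log using expr_sym_node_id without affecting the cache
    return result

evaluate_expr("s0 + 1", expr_sym_node_id=1)
evaluate_expr("s0 + 1", expr_sym_node_id=2)
print(evaluate_cached.cache_info().hits)  # 1 -- the second call hits
```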
#suppress-bc-linter
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147836
Approved by: https://github.com/oulgen
2025-03-05 04:11:41 +00:00
PyTorch MergeBot
92beda54c8
Revert "[fx] Move map_aggregate to C++ ( #148243 )"
...
This reverts commit edaff88f69 .
Reverted https://github.com/pytorch/pytorch/pull/148243 on behalf of https://github.com/jovianjaison due to breaking internal builds [T216910920] ([comment](https://github.com/pytorch/pytorch/pull/148243#issuecomment-2698724058 ))
2025-03-04 19:40:21 +00:00
PyTorch MergeBot
17d003fe75
Revert "[fx] Move Node._update_args_kwargs to C++ ( #148260 )"
...
This reverts commit 0135f57f4a .
Reverted https://github.com/pytorch/pytorch/pull/148260 on behalf of https://github.com/jovianjaison due to breaking internal builds [T216910920] ([comment](https://github.com/pytorch/pytorch/pull/148243#issuecomment-2698724058 ))
2025-03-04 19:40:21 +00:00
PyTorch MergeBot
97b9e68bc6
Revert "[fx] Move Node._prepend/Node._remove_from_list to C++ ( #148261 )"
...
This reverts commit 29c2de9ae1 .
Reverted https://github.com/pytorch/pytorch/pull/148261 on behalf of https://github.com/jovianjaison due to breaking internal builds [T216910920] ([comment](https://github.com/pytorch/pytorch/pull/148243#issuecomment-2698724058 ))
2025-03-04 19:40:21 +00:00
PyTorch MergeBot
6fb18ff685
Revert "Better log message to update pr_time_benchmarks/expected_results.csv ( #148303 )"
...
This reverts commit a3d69e6e1a .
Reverted https://github.com/pytorch/pytorch/pull/148303 on behalf of https://github.com/jovianjaison due to breaking internal builds [T216910920] ([comment](https://github.com/pytorch/pytorch/pull/148243#issuecomment-2698724058 ))
2025-03-04 19:40:21 +00:00
PyTorch MergeBot
611b0e9bc4
Revert "[fx] Optimizations for node name generation ( #148288 )"
...
This reverts commit 5eb0337cfd .
Reverted https://github.com/pytorch/pytorch/pull/148288 on behalf of https://github.com/clee2000 due to something in this stack broke some dynamo and higher order ops tests like higher_order_ops/test_invoke_subgraph.py::TestInvokeSubgraphCompile::test_dedupe [GH job link](https://github.com/pytorch/pytorch/actions/runs/13645082540/job/38149882002 ) [HUD commit link](8531d247ba ). dynamo/test_graph_deduplication did run on the PR but the higher_order_ops one didn't, probably combo of landrace and bad TD ([comment](https://github.com/pytorch/pytorch/pull/148288#issuecomment-2698365172 ))
2025-03-04 17:10:12 +00:00
PyTorch MergeBot
ed9055c303
Revert "[fx] Optimize TracerBase.create_arg and Graph._gen_python_code ( #148292 )"
...
This reverts commit 8531d247ba .
Reverted https://github.com/pytorch/pytorch/pull/148292 on behalf of https://github.com/clee2000 due to something in this stack broke some dynamo and higher order ops tests like higher_order_ops/test_invoke_subgraph.py::TestInvokeSubgraphCompile::test_dedupe [GH job link](https://github.com/pytorch/pytorch/actions/runs/13645082540/job/38149882002 ) [HUD commit link](8531d247ba ). dynamo/test_graph_deduplication did run on the PR but the higher_order_ops one didn't, probably combo of landrace and bad TD ([comment](https://github.com/pytorch/pytorch/pull/148288#issuecomment-2698365172 ))
2025-03-04 17:10:12 +00:00
drisspg
e0f0db0105
updates to benchmarks ( #144831 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144831
Approved by: https://github.com/danielvegamyhre
2025-03-04 06:21:12 +00:00
Jason Ansel
8531d247ba
[fx] Optimize TracerBase.create_arg and Graph._gen_python_code ( #148292 )
...
Before: 19502951 function calls (18702776 primitive calls) in 8.533 seconds
After: 16402551 function calls (15602452 primitive calls) in 7.701 seconds
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148292
Approved by: https://github.com/oulgen
ghstack dependencies: #148243 , #148260 , #148261 , #148303 , #148288
2025-03-04 02:42:23 +00:00
Jason Ansel
5eb0337cfd
[fx] Optimizations for node name generation ( #148288 )
...
Before/after profiler output was attached as screenshots in the PR description (not recoverable here).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148288
Approved by: https://github.com/oulgen
ghstack dependencies: #148243 , #148260 , #148261 , #148303
2025-03-04 02:42:23 +00:00
Jason Ansel
a3d69e6e1a
Better log message to update pr_time_benchmarks/expected_results.csv ( #148303 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148303
Approved by: https://github.com/Skylion007
ghstack dependencies: #148243 , #148260 , #148261
2025-03-04 02:42:23 +00:00
Henry Tsang
17518007b2
[cutlass backend] Benchmark compared to aten and triton ( #148347 )
...
Benchmark for cutlass backend.
```
python benchmarks/inductor_backends/cutlass.py
```
Test Plan:
```
Experiment group: mm (1024x1024, 1024x1024) torch.float16
+-----------------------+--------------------+----------------------+---------------------+
| name | forward_time (us) | compilation_time (s) | perf_over_aten (%) |
+-----------------------+--------------------+----------------------+---------------------+
| aten | 12.759539298713207 | 2.7271360370796174 | NA |
| triton | 10.573655366897583 | 1.8661278090439737 | -17.131370346859384 |
| triton_persistent_tma | 10.884030722081661 | 0.5315794269554317 | -14.698873781600327 |
| cutlass_lvl_default | 13.09632882475853 | 0.5520401500398293 | 2.6395116481931873 |
| cutlass_lvl_1111 | 11.05172373354435 | 0.569593315012753 | -13.384617776451302 |
| cutlass_lvl_2222 | 11.371277272701263 | 133.58984916994814 | -10.880189272601317 |
+-----------------------+--------------------+----------------------+---------------------+
Experiment group: mm (1024x1024, 1024x1024) torch.bfloat16
+-----------------------+--------------------+----------------------+---------------------+
| name | forward_time (us) | compilation_time (s) | perf_over_aten (%) |
+-----------------------+--------------------+----------------------+---------------------+
| aten | 14.472318813204765 | 1.5445372510002926 | NA |
| triton | 10.568295605480671 | 16.583424195996486 | -26.975796056689987 |
| triton_persistent_tma | 10.45411266386509 | 5.830657540936954 | -27.764770809729562 |
| cutlass_lvl_default | 12.742593884468079 | 28.994930602959357 | -11.951954286402668 |
| cutlass_lvl_1111 | 11.522261425852776 | 79.85037935699802 | -20.38413764531163 |
| cutlass_lvl_2222 | 10.993581265211105 | 132.86601971101481 | -24.037181552548486 |
+-----------------------+--------------------+----------------------+---------------------+
Experiment group: mm (2048x2048, 2048x2048) torch.float16
+-----------------------+--------------------+----------------------+---------------------+
| name | forward_time (us) | compilation_time (s) | perf_over_aten (%) |
+-----------------------+--------------------+----------------------+---------------------+
| aten | 30.700622126460075 | 2.225986961973831 | NA |
| triton | 29.17378954589367 | 38.571991189033724 | -4.97329524553989 |
| triton_persistent_tma | 29.642896726727486 | 7.2848734309664 | -3.4452897904663744 |
| cutlass_lvl_default | 29.514770954847336 | 29.819900761009194 | -3.8626291243482167 |
| cutlass_lvl_1111 | 29.411429539322853 | 23.82907024596352 | -4.19923929172139 |
| cutlass_lvl_2222 | 29.57325428724289 | 134.31008586101234 | -3.672133530628152 |
+-----------------------+--------------------+----------------------+---------------------+
Experiment group: mm (2048x2048, 2048x2048) torch.bfloat16
+-----------------------+--------------------+----------------------+--------------------+
| name | forward_time (us) | compilation_time (s) | perf_over_aten (%) |
+-----------------------+--------------------+----------------------+--------------------+
| aten | 30.858177691698074 | 1.181898436974734 | NA |
| triton | 28.630023822188377 | 39.24473957403097 | -7.220626868414034 |
| triton_persistent_tma | 28.641965240240097 | 5.275042273919098 | -7.181929126210897 |
| cutlass_lvl_default | 29.16003204882145 | 29.934022572939284 | -5.503065216107967 |
| cutlass_lvl_1111 | 28.79570797085762 | 23.948012012057006 | -6.683705504085324 |
| cutlass_lvl_2222 | 29.02756631374359 | 136.25560767308343 | -5.932337924306467 |
+-----------------------+--------------------+----------------------+--------------------+
Experiment group: mm (8192x8192, 8192x8192) torch.float16
+-----------------------+--------------------+----------------------+--------------------+
| name | forward_time (us) | compilation_time (s) | perf_over_aten (%) |
+-----------------------+--------------------+----------------------+--------------------+
| aten | 1456.143856048584 | 1.020197194069624 | NA |
| triton | 1708.2737684249878 | 5.766509635956027 | 17.31490410985819 |
| triton_persistent_tma | 1476.485013961792 | 7.455113030038774 | 1.3969195302177155 |
| cutlass_lvl_default | 1583.3594799041748 | 50.408804678940214 | 8.736473620182366 |
| cutlass_lvl_1111 | 1636.4418268203735 | 82.82403108896688 | 12.381879030898025 |
| cutlass_lvl_2222 | 1507.5665712356567 | 260.03901409788523 | 3.531430975962381 |
+-----------------------+--------------------+----------------------+--------------------+
Experiment group: mm (8192x8192, 8192x8192) torch.bfloat16
+-----------------------+--------------------+----------------------+--------------------+
| name | forward_time (us) | compilation_time (s) | perf_over_aten (%) |
+-----------------------+--------------------+----------------------+--------------------+
| aten | 1382.230520248413 | 1.2586536260787398 | NA |
| triton | 1646.9683647155762 | 5.442052865982987 | 19.15294450447995 |
| triton_persistent_tma | 1423.9195585250854 | 6.515797697938979 | 3.016069871556595 |
| cutlass_lvl_default | 1500.9030103683472 | 51.36402789200656 | 8.58557877152115 |
| cutlass_lvl_1111 | 1446.9740390777588 | 30.65435610699933 | 4.683988515729638 |
| cutlass_lvl_2222 | 1419.661521911621 | 205.1948991640238 | 2.7080144096717635 |
+-----------------------+--------------------+----------------------+--------------------+
```
Differential Revision: D70147589
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148347
Approved by: https://github.com/drisspg , https://github.com/chenyang78
2025-03-04 01:45:36 +00:00
Jason Ansel
29c2de9ae1
[fx] Move Node._prepend/Node._remove_from_list to C++ ( #148261 )
...
Microbenchmarking `fx.symbolic_trace(lambda x: functools.reduce(operator.add, [x, *range(100000)]))`, before:
```
24303536 function calls (23503339 primitive calls) in 10.726 seconds
```
after:
```
20003454 function calls (19203257 primitive calls) in 8.936 seconds
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148261
Approved by: https://github.com/oulgen
ghstack dependencies: #148243 , #148260
2025-03-02 22:42:31 +00:00
Jason Ansel
0135f57f4a
[fx] Move Node._update_args_kwargs to C++ ( #148260 )
...
Microbenchmarking `fx.symbolic_trace(lambda x: functools.reduce(operator.add, [x, *range(100000)]))`, before:
```
25203549 function calls (24403352 primitive calls) in 12.090 seconds
```
after:
```
24303536 function calls (23503339 primitive calls) in 10.726 seconds
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148260
Approved by: https://github.com/oulgen
ghstack dependencies: #148243
2025-03-02 22:42:31 +00:00
Jason Ansel
edaff88f69
[fx] Move map_aggregate to C++ ( #148243 )
...
Microbenchmarking `fx.symbolic_trace(lambda x: functools.reduce(operator.add, [x, *range(100000)]))`, before:
```
30603618 function calls (29403419 primitive calls) in 13.744 seconds
```
after:
```
25203549 function calls (24403352 primitive calls) in 12.090 seconds
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148243
Approved by: https://github.com/oulgen
2025-03-02 22:42:31 +00:00
Boyuan Feng
6e10471966
[ci] disable cudagraph for tts_angular on dashboard ( #148221 )
...
tts_angular with cudagraph is flaky: its speedup varies from 0.05 to 1.01. This PR disables cudagraph for tts_angular to avoid the noise. Since tts_angular shows ~1x speedup while other torchbench models show ~2x speedup, skipping tts_angular entirely would wrongly bump the average cudagraph speedup, so this PR only disables cudagraph for tts_angular instead of skipping the model.
[Dashboard ](https://github.com/pytorch/pytorch/actions/runs/13597394087 )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148221
Approved by: https://github.com/eellison
2025-03-02 03:31:19 +00:00
Xuehai Pan
c73a92fbf5
[BE][CI] bump ruff to 0.9.2: multiline assert statements ( #144546 )
...
Reference: https://docs.astral.sh/ruff/formatter/black/#assert-statements
> Unlike Black, Ruff prefers breaking the message over breaking the assertion, similar to how both Ruff and Black prefer breaking the assignment value over breaking the assignment target:
>
> ```python
> # Input
> assert (
> len(policy_types) >= priority + num_duplicates
> ), f"This tests needs at least {priority+num_duplicates} many types."
>
>
> # Black
> assert (
> len(policy_types) >= priority + num_duplicates
> ), f"This tests needs at least {priority+num_duplicates} many types."
>
> # Ruff
> assert len(policy_types) >= priority + num_duplicates, (
> f"This tests needs at least {priority + num_duplicates} many types."
> )
> ```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144546
Approved by: https://github.com/malfet
2025-02-27 20:46:16 +00:00
Katarzyna Fojcik
edaf9ddeb5
Add basic Gaudi support to benchmarks/dynamo ( #145920 )
...
This PR adds basic Gaudi support to benchmarks/dynamo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145920
Approved by: https://github.com/eellison
2025-02-26 14:50:22 +00:00
Zesheng Zong
580f1183b4
Enable ruff rule S324 ( #147665 )
...
Fixes #147627
- Add `S324` in `pyproject.toml`
- Run the check and clean up warnings
```bash
lintrunner --take RUFF --all-files
```
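For context, what S324 flags and a typical fix (illustrative snippet; the md5 here is only a cache fingerprint):
```python
import hashlib

# Flagged by S324: md5/sha1 are insecure for cryptographic purposes.
digest = hashlib.md5(b"cache key").hexdigest()

# Typical fix when the hash is not security-sensitive (Python >= 3.9):
digest = hashlib.md5(b"cache key", usedforsecurity=False).hexdigest()
print(digest)
```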
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147665
Approved by: https://github.com/Skylion007
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-02-25 18:27:34 +00:00
Oguz Ulgen
bb7e8fbd66
[CacheBench] Add hf_T5 llama moco to cachebench ( #147783 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147783
Approved by: https://github.com/huydhn
ghstack dependencies: #147688 , #147780 , #147781 , #147782
2025-02-25 04:34:45 +00:00
Oguz Ulgen
895564d6b6
[CacheBench] Add huggingface ( #147782 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147782
Approved by: https://github.com/huydhn
ghstack dependencies: #147688 , #147780 , #147781
2025-02-25 04:34:45 +00:00
Oguz Ulgen
c4fb6ae55d
[CacheBench] Separate dynamic into its own option ( #147781 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147781
Approved by: https://github.com/huydhn
ghstack dependencies: #147688 , #147780
2025-02-25 04:34:34 +00:00
Oguz Ulgen
60d4cbfc06
[CacheBench] Add repeat option so that we can have more accurate cache results ( #147780 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147780
Approved by: https://github.com/huydhn
ghstack dependencies: #147688
2025-02-25 04:34:25 +00:00
Oguz Ulgen
ab3b814af3
[CacheBench] Add ciflow/trunk test ( #147688 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147688
Approved by: https://github.com/huydhn
2025-02-25 04:34:16 +00:00
Xuehai Pan
754fb834db
[BE][CI] bump ruff to 0.9.0: string quote styles ( #144569 )
...
Reference: https://docs.astral.sh/ruff/formatter/#f-string-formatting
- Change the outer quotes to double quotes for nested f-strings
```diff
- f'{", ".join(args)}'
+ f"{', '.join(args)}"
```
- Change the inner quotes to double quotes for triple f-strings
```diff
string = """
- {', '.join(args)}
+ {", ".join(args)}
"""
```
- Join implicitly concatenated strings
```diff
- string = "short string " "short string " f"{var}"
+ string = f"short string short string {var}"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144569
Approved by: https://github.com/Skylion007
ghstack dependencies: #146509
2025-02-24 19:56:09 +00:00
eqy
718cf68aee
[cuBLAS][cuBLASLt] Unify cuBLASLt workspaces with cuBLAS workspaces ( #145130 )
...
As `cuBLAS` workspaces are already per-stream, there shouldn't be kernel execution overlap with `cuBLASLt` kernels.
This PR reuses `cuBLAS` workspaces for `cuBLASLt` for the following benefits:
+ caching (`cuBLAS` workspaces were already cached, so now we get that for `cuBLASLt` too)
+ "free" workspace size bump for `cuBLASLt`: `cuBLASLt` workspace sizes were previously smaller than those for `cuBLAS` by default, which potentially hurts performance, and we encountered difficulty in increasing the size due to downstream OOMs; see also #120925
+ fixes broken behavior with the memtracker; https://github.com/pytorch/pytorch/pull/139442 attempted to handle peaky allocation behavior that broke memtracker equivalence tests, but it didn't seem to fully work; here the cached/reused `cuBLAS` workspace seems to fix it
+ one environment variable to rule them all: `CUBLAS_WORKSPACE_CONFIG` applies directly to `cuBLASLt`, without a confusing `CUBLASLT_WORKSPACE_SIZE` that users would also need to consider
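For reference, a sketch of setting the single knob; the `:4096:8` value (size-in-KiB:count, i.e. eight 4 MiB chunks) is the example from the PyTorch reproducibility docs:
```python
import os

# After this PR, one config covers both cuBLAS and cuBLASLt workspaces.
# Set it before any CUDA/cuBLAS work so the handles pick it up.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

import torch

print(torch.cuda.is_available())
```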
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145130
Approved by: https://github.com/ngimel
2025-02-23 22:01:39 +00:00
Aaron Orenstein
086d146f6f
Update ruff linter for PEP585 ( #147540 )
...
This turns on PEP585 enforcement in RUFF.
- Updates the target python version
- Stops ignoring UP006 warnings (PEP585)
- Fixes a few issues which crept into the tree in the last day
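A minimal before/after of the UP006 warning this enables (hypothetical code):
```python
# Pre-PEP 585 style, flagged by UP006 once the target version allows it:
from typing import Dict, List

def old_style(xs: List[int]) -> Dict[str, int]:
    return {str(x): x for x in xs}

# PEP 585 style: builtin generics, no typing import needed (Python >= 3.9).
def new_style(xs: list[int]) -> dict[str, int]:
    return {str(x): x for x in xs}
```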
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147540
Approved by: https://github.com/justinchuby , https://github.com/Skylion007
2025-02-22 04:45:17 +00:00
Oguz Ulgen
1c334893dc
[CacheBench] Refactor code to prepare for mode benchmarks ( #147641 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147641
Approved by: https://github.com/huydhn
2025-02-22 00:20:54 +00:00