Mirror of https://github.com/zebrajr/pytorch.git
Synced 2025-12-06 00:20:18 +01:00
Latest commit: b2953f5643

10 Commits

b2953f5643: [9/N] Apply ruff UP035 rule (#165515)
This is a follow-up of #165214, continuing to apply the ruff UP035 rule to the code base.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165515
Approved by: https://github.com/Lucaskabela
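For context, UP035 is ruff's deprecated-import rule: among other things it flags `typing` aliases such as `typing.List` that have builtin replacements since Python 3.9. An illustrative before/after (the function is a made-up example, not code from this PR):

```python
# Before (flagged by UP035):
#     from typing import List
#     def first_word(lines: List[str]) -> str: ...
# After: use the builtin generic directly, no import needed.

def first_word(lines: list[str]) -> str:
    """Return the first whitespace-separated token of the first line."""
    return lines[0].split()[0]

print(first_word(["ruff UP035 example"]))  # → ruff
```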

a2a75be0f8: Rename inductor cache (#156128)
Requested by Simon on a different PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156128
Approved by: https://github.com/xmfan

b878ca0c91: [cutlass backend] add fp8 to cutlass benchmark script (#155507)
Summary: Add fp8. Right now FP8 only allows fast_accum.

Test Plan:

```
Experiment group: _scaled_mm (8192x8192, 8192x8192) torch.float8_e4m3fn
+-----------------------+--------------------+--------------------+----------------------+--------------------+
| name                  | forward_time (us)  | teraflops (TFLOPS) | compilation_time (s) | perf_over_aten (%) |
+-----------------------+--------------------+--------------------+----------------------+--------------------+
| aten                  | 967.1226739883423  | 1136.8895149998868 | 1.219131228979677    | NA                 |
| triton                | 1764.6185159683228 | 623.08743664783    | 20.373826419003308   | 82.46067054670186  |
| triton_persistent_tma | 1769.0335512161255 | 621.5323768280928  | 20.48663099599071    | 82.91718297956578  |
| cutlass_lvl_default   | 790.5075550079346  | 1390.8932568835019 | 13.788519630907103   | -18.26191482535096 |
| cutlass_lvl_3332      | 803.7384748458862  | 1367.996757884245  | 226.81587297911756   | -16.89384434227684 |
+-----------------------+--------------------+--------------------+----------------------+--------------------+
```

Differential Revision: D76310809
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155507
Approved by: https://github.com/ColinPeppler
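The teraflops column can be cross-checked from the shape and the forward time: an (m x k) @ (k x n) matmul performs 2*m*n*k floating-point operations. A minimal sketch of that conversion (the helper name is mine, not from the benchmark script):

```python
def teraflops(m: int, n: int, k: int, forward_time_us: float) -> float:
    """Convert a matmul's measured forward time to TFLOPS.

    An (m x k) @ (k x n) matmul does 2*m*n*k floating-point ops.
    """
    flops = 2 * m * n * k
    seconds = forward_time_us * 1e-6
    return flops / seconds / 1e12

# The aten row above: 8192x8192 squares at 967.1226739883423 us
print(round(teraflops(8192, 8192, 8192, 967.1226739883423), 4))  # ≈ 1136.8895, matching the table
```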

2481c4b2ea: [cutlass backend] add teraflops and increase rep for benchmark script (#154944)
Differential Revision: [D75840023](https://our.internmc.facebook.com/intern/diff/D75840023/)

I think I will continue to use do_bench for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154944
Approved by: https://github.com/mlazos

cb56df55dc: [Inductor] Cleanup autotune_fallback_to_aten post-deprecation (#154331)
Fixes #153298. This PR is the third and final step of #147479. All references to autotune_fallback_to_aten have been removed, and the feature is now deprecated. All calls to should_fallback_to_aten() were also removed, as they were deemed unnecessary. cc [henrylhtsang](https://github.com/henrylhtsang)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154331
Approved by: https://github.com/henrylhtsang

00ebbbb701: [cutlass backend] add addmm and bmm for cutlass backend benchmark (#152163)
Copying what @kadeng did.

```
FINAL results...

Experiment group: bmm (BS: 8, 1024x1024, 1024x1024) torch.float16
+-----------------------+--------------------+----------------------+---------------------+
| name                  | forward_time (us)  | compilation_time (s) | perf_over_aten (%)  |
+-----------------------+--------------------+----------------------+---------------------+
| aten                  | 44.454172253608704 | 3.0991086587309837   | NA                  |
| triton                | 44.06978189945221  | 0.07496077567338943  | -0.8646890374284049 |
| triton_persistent_tma | 43.598245829343796 | 0.06154991965740919  | -1.9254130284597197 |
| cutlass_lvl_default   | 39.91834074258804  | 0.056073310784995556 | -10.20338762612423  |
+-----------------------+--------------------+----------------------+---------------------+

Experiment group: bmm (BS: 8, 1024x1024, 1024x1024) torch.bfloat16
+-----------------------+-------------------+----------------------+---------------------+
| name                  | forward_time (us) | compilation_time (s) | perf_over_aten (%)  |
+-----------------------+-------------------+----------------------+---------------------+
| aten                  | 49.05610531568527 | 0.160279156640172    | NA                  |
| triton                | 43.97720843553543 | 0.0660805031657219   | -10.353241145961718 |
| triton_persistent_tma | 43.94153505563736 | 0.061738294549286366 | -10.425960697724962 |
| cutlass_lvl_default   | 40.2066633105278  | 0.034127906896173954 | -18.039430460713596 |
+-----------------------+-------------------+----------------------+---------------------+

Average edge over aten (max(-edge, 0), higher is better):
triton: 5.608965091695062 (from 2 valid values)
triton_persistent_tma: 6.175686863092341 (from 2 valid values)
cutlass_lvl_default: 14.121409043418913 (from 2 valid values)
```

Differential Revision: [D73625766](https://our.internmc.facebook.com/intern/diff/D73625766/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152163
Approved by: https://github.com/jingsh

5a51de5ab1: [cutlass backend] Add more logs for cutlass backend benchmark (#150639)
Goal is to have a way to compare whether a change makes things better or worse.

```
Average edge over aten (max(-edge, 0), higher is better):
triton: 8.596507086950552 (from 6 valid values)
triton_persistent_tma: 9.517193693923307 (from 6 valid values)
cutlass_lvl_default: 3.3234737908691785 (from 6 valid values)
cutlass_lvl_1111: 7.088173348313991 (from 6 valid values)
cutlass_lvl_2222: 7.291869722320318 (from 6 valid values)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150639
Approved by: https://github.com/ColinPeppler
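As the log's own label says, the "average edge over aten" is the mean of max(-perf_over_aten, 0) across the experiments where a backend produced a valid (finite) result. A minimal sketch of that summary (function name and sample values are illustrative, not lifted from the script):

```python
import math

def average_edge(perf_over_aten: list[float]) -> float:
    """Mean of max(-edge, 0) over valid (finite) values; higher is better."""
    valid = [e for e in perf_over_aten if math.isfinite(e)]
    return sum(max(-e, 0.0) for e in valid) / len(valid)

# Example: a backend was 12% faster than aten in one run, 4% slower in
# another, and failed (inf) in a third; the failed run is excluded.
print(average_edge([-12.0, 4.0, math.inf]))  # → 6.0
```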

f2d43d866c: [cutlass backend] switch layout for cutlass backend benchmark (#149009)
```
python benchmarks/inductor_backends/cutlass.py
```

logs:

```
Experiment group: mm (1024x1024, 1024x1024) torch.float16
+-----------------------+--------------------+----------------------+---------------------+
| name                  | forward_time (us)  | compilation_time (s) | perf_over_aten (%)  |
+-----------------------+--------------------+----------------------+---------------------+
| aten                  | 13.059554621577263 | 1.580178506206721    | NA                  |
| triton                | 10.245470330119133 | 0.04118620231747627  | -21.54808776410064  |
| triton_persistent_tma | 10.388538241386414 | 0.04225084185600281  | -20.45258400908819  |
| cutlass_lvl_default   | 12.882896699011326 | 231.14990583620965   | -1.3527101626732294 |
| cutlass_lvl_1111      | 11.362981051206589 | 126.41650272067636   | -12.99105229490415  |
| cutlass_lvl_2222      | 11.107578873634338 | 555.8380545829423    | -14.946725248331441 |
+-----------------------+--------------------+----------------------+---------------------+

Experiment group: mm (1024x1024, 1024x1024) torch.bfloat16
+-----------------------+--------------------+----------------------+---------------------+
| name                  | forward_time (us)  | compilation_time (s) | perf_over_aten (%)  |
+-----------------------+--------------------+----------------------+---------------------+
| aten                  | 14.037585817277431 | 0.21587548777461052  | NA                  |
| triton                | 10.571777820587158 | 78.15654796129093    | -24.68948750735019  |
| triton_persistent_tma | 10.761583223938942 | 1.3195342738181353   | -23.337364672110443 |
| cutlass_lvl_default   | 12.872588820755482 | 237.0100042372942    | -8.299126443010406  |
| cutlass_lvl_1111      | 11.08622644096613  | 137.55013868492097   | -21.02469338195443  |
| cutlass_lvl_2222      | 11.044904589653015 | 551.265836935956     | -21.319059178545007 |
+-----------------------+--------------------+----------------------+---------------------+

Experiment group: mm (2048x2048, 2048x2048) torch.float16
+-----------------------+--------------------+----------------------+---------------------+
| name                  | forward_time (us)  | compilation_time (s) | perf_over_aten (%)  |
+-----------------------+--------------------+----------------------+---------------------+
| aten                  | 30.483894050121307 | 0.27990864124149084  | NA                  |
| triton                | 29.567627236247063 | 99.87172158574685    | -3.005740711366232  |
| triton_persistent_tma | 29.66325916349888  | 1.3695051120594144   | -2.692027748401006  |
| cutlass_lvl_default   | 29.82821688055992  | 72.61214569816366    | -2.150897022812533  |
| cutlass_lvl_1111      | 29.476772993803024 | 67.7428645719774     | -3.303780857728953  |
| cutlass_lvl_2222      | 30.113255605101585 | 233.84051702311262   | -1.2158500630212203 |
+-----------------------+--------------------+----------------------+---------------------+

Experiment group: mm (2048x2048, 2048x2048) torch.bfloat16
+-----------------------+--------------------+----------------------+---------------------+
| name                  | forward_time (us)  | compilation_time (s) | perf_over_aten (%)  |
+-----------------------+--------------------+----------------------+---------------------+
| aten                  | 30.58255836367607  | 0.058386584743857384 | NA                  |
| triton                | 29.799651354551315 | 100.18178300186992   | -2.559978795150901  |
| triton_persistent_tma | 29.362043365836143 | 1.534341821912676    | -3.990885861562106  |
| cutlass_lvl_default   | 29.4346883893013   | 73.68858492700383    | -3.7533484305817093 |
| cutlass_lvl_1111      | 29.164200648665428 | 75.44329373072833    | -4.637799421958348  |
| cutlass_lvl_2222      | 29.13798950612545  | 227.33327346481383   | -4.7235056020244    |
+-----------------------+--------------------+----------------------+---------------------+

Experiment group: mm (8192x8192, 8192x8192) torch.float16
+-----------------------+--------------------+----------------------+--------------------+
| name                  | forward_time (us)  | compilation_time (s) | perf_over_aten (%) |
+-----------------------+--------------------+----------------------+--------------------+
| aten                  | 1656.6237211227417 | 0.0549461180344224   | NA                 |
| triton                | 1892.8285837173462 | 2.3174119112081826   | 14.258208401997386 |
| triton_persistent_tma | 1665.332317352295  | 2.7922237082384527   | 0.525683419747917  |
| cutlass_lvl_default   | 1705.5492401123047 | 108.31571159465238   | 2.9533272019312116 |
| cutlass_lvl_1111      | 1714.9059772491455 | 17.64627545280382    | 3.518134829489478  |
| cutlass_lvl_2222      | 1680.4152727127075 | 306.9972395859659    | 1.4361469829637354 |
+-----------------------+--------------------+----------------------+--------------------+

Experiment group: mm (8192x8192, 8192x8192) torch.bfloat16
+-----------------------+--------------------+----------------------+--------------------+
| name                  | forward_time (us)  | compilation_time (s) | perf_over_aten (%) |
+-----------------------+--------------------+----------------------+--------------------+
| aten                  | 1621.416687965393  | 0.06300561130046844  | NA                 |
| triton                | 1782.3902368545532 | 2.318530729971826    | 9.927956834535548  |
| triton_persistent_tma | 1586.0934257507324 | 2.7931175641715527   | -2.178543151605614 |
| cutlass_lvl_default   | 1657.4617624282837 | 43.31810224894434    | 2.2230605328307784 |
| cutlass_lvl_1111      | 1641.5367126464844 | 17.648567833006382   | 1.2408916739557292 |
| cutlass_lvl_2222      | 1645.8417177200317 | 249.33647010894492   | 1.5064005407078918 |
+-----------------------+--------------------+----------------------+--------------------+
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149009
Approved by: https://github.com/chenyang78, https://github.com/jingsh
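The perf_over_aten column in these tables is consistent with a simple relative difference, (backend_time - aten_time) / aten_time * 100, so negative numbers mean faster than aten. A minimal sketch (the helper name is mine, not from the benchmark script):

```python
def perf_over_aten(backend_time_us: float, aten_time_us: float) -> float:
    """Percent slower (+) or faster (-) than the aten baseline."""
    return (backend_time_us - aten_time_us) / aten_time_us * 100.0

# triton vs aten for mm (1024x1024, 1024x1024) torch.float16, from the table above:
print(perf_over_aten(10.245470330119133, 13.059554621577263))  # ≈ -21.548, matching the table
```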

66300d3d55: [cutlass backend] try make cutlass backend benchmark more robust (#149015)
Differential Revision: [D71006269](https://our.internmc.facebook.com/intern/diff/D71006269/)

I want to make sure that even if the benchmark fails on some experiment, it can still print most of the results.

```
Experiment group: mm (3x3, 3x3) torch.bfloat16
+-----------------------+-------------------+----------------------+---------------------+
| name                  | forward_time (us) | compilation_time (s) | perf_over_aten (%)  |
+-----------------------+-------------------+----------------------+---------------------+
| aten                  | 6.175220478326082 | 0.5982149520423263   | NA                  |
| triton                | 5.326753947883844 | 3.2067150759976357   | -13.739858089605114 |
| triton_persistent_tma | 5.340870004147291 | 3.279932268196717    | -13.51126615004617  |
| cutlass_lvl_default   | inf               | inf                  | inf                 |
| cutlass_lvl_1111      | inf               | inf                  | inf                 |
| cutlass_lvl_2222      | inf               | inf                  | inf                 |
| cutlass_lvl_3333      | inf               | inf                  | inf                 |
+-----------------------+-------------------+----------------------+---------------------+
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149015
Approved by: https://github.com/chenyang78, https://github.com/jingsh
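One way to get the behavior described here (a failing experiment reports inf instead of aborting the whole run) is to catch per-experiment exceptions and record a sentinel. A hedged sketch of that pattern, not the actual benchmark code:

```python
import math

def run_experiments(experiments):
    """Run each (name, fn) pair; on failure, record inf so the table still prints."""
    results = {}
    for name, fn in experiments:
        try:
            results[name] = fn()
        except Exception:
            results[name] = math.inf  # keep going; this row shows up as inf
    return results

def failing():
    raise RuntimeError("no valid CUTLASS kernel")  # simulated failure

results = run_experiments([("aten", lambda: 6.175), ("cutlass_lvl_default", failing)])
print(results)  # {'aten': 6.175, 'cutlass_lvl_default': inf}
```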

17518007b2: [cutlass backend] Benchmark compared to aten and triton (#148347)
Benchmark for cutlass backend.

```
python benchmarks/inductor_backends/cutlass.py
```

Test Plan:

```
Experiment group: mm (1024x1024, 1024x1024) torch.float16
+-----------------------+--------------------+----------------------+---------------------+
| name                  | forward_time (us)  | compilation_time (s) | perf_over_aten (%)  |
+-----------------------+--------------------+----------------------+---------------------+
| aten                  | 12.759539298713207 | 2.7271360370796174   | NA                  |
| triton                | 10.573655366897583 | 1.8661278090439737   | -17.131370346859384 |
| triton_persistent_tma | 10.884030722081661 | 0.5315794269554317   | -14.698873781600327 |
| cutlass_lvl_default   | 13.09632882475853  | 0.5520401500398293   | 2.6395116481931873  |
| cutlass_lvl_1111      | 11.05172373354435  | 0.569593315012753    | -13.384617776451302 |
| cutlass_lvl_2222      | 11.371277272701263 | 133.58984916994814   | -10.880189272601317 |
+-----------------------+--------------------+----------------------+---------------------+

Experiment group: mm (1024x1024, 1024x1024) torch.bfloat16
+-----------------------+--------------------+----------------------+---------------------+
| name                  | forward_time (us)  | compilation_time (s) | perf_over_aten (%)  |
+-----------------------+--------------------+----------------------+---------------------+
| aten                  | 14.472318813204765 | 1.5445372510002926   | NA                  |
| triton                | 10.568295605480671 | 16.583424195996486   | -26.975796056689987 |
| triton_persistent_tma | 10.45411266386509  | 5.830657540936954    | -27.764770809729562 |
| cutlass_lvl_default   | 12.742593884468079 | 28.994930602959357   | -11.951954286402668 |
| cutlass_lvl_1111      | 11.522261425852776 | 79.85037935699802    | -20.38413764531163  |
| cutlass_lvl_2222      | 10.993581265211105 | 132.86601971101481   | -24.037181552548486 |
+-----------------------+--------------------+----------------------+---------------------+

Experiment group: mm (2048x2048, 2048x2048) torch.float16
+-----------------------+--------------------+----------------------+---------------------+
| name                  | forward_time (us)  | compilation_time (s) | perf_over_aten (%)  |
+-----------------------+--------------------+----------------------+---------------------+
| aten                  | 30.700622126460075 | 2.225986961973831    | NA                  |
| triton                | 29.17378954589367  | 38.571991189033724   | -4.97329524553989   |
| triton_persistent_tma | 29.642896726727486 | 7.2848734309664      | -3.4452897904663744 |
| cutlass_lvl_default   | 29.514770954847336 | 29.819900761009194   | -3.8626291243482167 |
| cutlass_lvl_1111      | 29.411429539322853 | 23.82907024596352    | -4.19923929172139   |
| cutlass_lvl_2222      | 29.57325428724289  | 134.31008586101234   | -3.672133530628152  |
+-----------------------+--------------------+----------------------+---------------------+

Experiment group: mm (2048x2048, 2048x2048) torch.bfloat16
+-----------------------+--------------------+----------------------+--------------------+
| name                  | forward_time (us)  | compilation_time (s) | perf_over_aten (%) |
+-----------------------+--------------------+----------------------+--------------------+
| aten                  | 30.858177691698074 | 1.181898436974734    | NA                 |
| triton                | 28.630023822188377 | 39.24473957403097    | -7.220626868414034 |
| triton_persistent_tma | 28.641965240240097 | 5.275042273919098    | -7.181929126210897 |
| cutlass_lvl_default   | 29.16003204882145  | 29.934022572939284   | -5.503065216107967 |
| cutlass_lvl_1111      | 28.79570797085762  | 23.948012012057006   | -6.683705504085324 |
| cutlass_lvl_2222      | 29.02756631374359  | 136.25560767308343   | -5.932337924306467 |
+-----------------------+--------------------+----------------------+--------------------+

Experiment group: mm (8192x8192, 8192x8192) torch.float16
+-----------------------+--------------------+----------------------+--------------------+
| name                  | forward_time (us)  | compilation_time (s) | perf_over_aten (%) |
+-----------------------+--------------------+----------------------+--------------------+
| aten                  | 1456.143856048584  | 1.020197194069624    | NA                 |
| triton                | 1708.2737684249878 | 5.766509635956027    | 17.31490410985819  |
| triton_persistent_tma | 1476.485013961792  | 7.455113030038774    | 1.3969195302177155 |
| cutlass_lvl_default   | 1583.3594799041748 | 50.408804678940214   | 8.736473620182366  |
| cutlass_lvl_1111      | 1636.4418268203735 | 82.82403108896688    | 12.381879030898025 |
| cutlass_lvl_2222      | 1507.5665712356567 | 260.03901409788523   | 3.531430975962381  |
+-----------------------+--------------------+----------------------+--------------------+

Experiment group: mm (8192x8192, 8192x8192) torch.bfloat16
+-----------------------+--------------------+----------------------+--------------------+
| name                  | forward_time (us)  | compilation_time (s) | perf_over_aten (%) |
+-----------------------+--------------------+----------------------+--------------------+
| aten                  | 1382.230520248413  | 1.2586536260787398   | NA                 |
| triton                | 1646.9683647155762 | 5.442052865982987    | 19.15294450447995  |
| triton_persistent_tma | 1423.9195585250854 | 6.515797697938979    | 3.016069871556595  |
| cutlass_lvl_default   | 1500.9030103683472 | 51.36402789200656    | 8.58557877152115   |
| cutlass_lvl_1111      | 1446.9740390777588 | 30.65435610699933    | 4.683988515729638  |
| cutlass_lvl_2222      | 1419.661521911621  | 205.1948991640238    | 2.7080144096717635 |
+-----------------------+--------------------+----------------------+--------------------+
```

Differential Revision: D70147589
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148347
Approved by: https://github.com/drisspg, https://github.com/chenyang78