pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-06 00:20:18 +01:00

Author	SHA1	Message	Date
Gabriel Ferns	ce00ec7ecf	Enable max autotune for AOTInductor benchmark (#149309 ) With this PR, AOTInductor can optionally run in max-autotune mode when benchmarking. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149309 Approved by: https://github.com/desertfire Co-authored-by: Gabriel Ferns <gabeferns@meta.com>	2025-04-28 06:54:26 +00:00
Anthony Shoumikhin	e2f9759bd0	Fix broken URLs (#152237 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/152237 Approved by: https://github.com/huydhn, https://github.com/malfet	2025-04-27 09:56:42 +00:00
Animesh Jain	3c1a17a08b	[Dynamo] Use LazyVariableTracker in base VT (#151847 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/151847 Approved by: https://github.com/StrongerXi	2025-04-23 18:18:01 +00:00
Eddie Yan	dcc32ff5bf	[CUDA][cuBLAS][cuBLASLt] Opt-in unified cuBLAS + cuBLASLt workspaces (#151163 ) opt-in version of https://github.com/pytorch/pytorch/pull/145130 as there was a lack of repro for the 70% forward issue `TORCH_CUBLASLT_UNIFIED_WORKSPACE=1` @izaitsevfb could you comment if it was repeatable per every forward pass, on startup, or something else? Pull Request resolved: https://github.com/pytorch/pytorch/pull/151163 Approved by: https://github.com/ngimel	2025-04-23 15:24:22 +00:00
Laith Sakka	09e8ff92cc	refresh benchmark results (#151622 ) updating due to <1.5% increases in https://github.com/pytorch/pytorch/pull/151469 not all benchmarks were updated Pull Request resolved: https://github.com/pytorch/pytorch/pull/151622 Approved by: https://github.com/oulgen	2025-04-18 02:39:13 +00:00
Oguz Ulgen	ef64beb232	Include post grad gm and fx runnable in cache artifacts for tlparse (#151469 ) Fixed #151462 Pull Request resolved: https://github.com/pytorch/pytorch/pull/151469 Approved by: https://github.com/bdhirsh	2025-04-17 17:14:13 +00:00
PyTorch MergeBot	41b82611ee	Revert "[Reopen] [Intel GPU] Set higher tolerance for some models only on XPU Device (#144756 )" This reverts commit `300e0ee13c`. Reverted https://github.com/pytorch/pytorch/pull/144756 on behalf of https://github.com/malfet due to Broke rocm torch bench runs with TypeError: unsupported operand type(s) for \|: 'set' and 'list' ([comment](https://github.com/pytorch/pytorch/pull/144756#issuecomment-2812525970))	2025-04-17 11:09:01 +00:00
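The `TypeError` quoted in the revert message above is what Python raises when the `|` operator is given a `set` on the left and a `list` on the right. A minimal sketch of the failure and two ways around it follows; the model names are hypothetical placeholders, not taken from the reverted PR.

```python
# Hypothetical model names, for illustration only (not from the reverted PR).
base_models = {"model_a", "model_b"}
xpu_only_models = ["model_c"]

try:
    merged = base_models | xpu_only_models  # set | list is not defined
except TypeError as exc:
    error_message = str(exc)

# Option 1: convert the list to a set before using the operator.
merged_via_operator = base_models | set(xpu_only_models)

# Option 2: set.union accepts any iterable, so no conversion is needed.
merged_via_method = base_models.union(xpu_only_models)
```

The operator form only works between set types, while the method form accepts arbitrary iterables, which is why mixing a `set` constant with a `list` constant can fail only on the code path that combines them.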
Mao Yunfei	300e0ee13c	[Reopen] [Intel GPU] Set higher tolerance for some models only on XPU Device (#144756 ) Reopen the previous stale closed PR https://github.com/pytorch/pytorch/pull/134192 We need to increase the tolerance slightly to ensure that certain models pass accuracy check on the XPU device. This pull request preserves the original tolerance threshold for the CUDA device and introduces a new key higher_fp16_bf16_xpu, which only impacts the XPU device. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144756 Approved by: https://github.com/chuanqi129, https://github.com/EikanWang, https://github.com/desertfire	2025-04-17 00:26:55 +00:00
Brian Hirsh	eea4a7b424	update expected results for comptime benchmark (#151319 ) This PR https://github.com/pytorch/pytorch/pull/150594 bumped the benchmark up by ~1%, a bit under our 1.5% "regression" mark. Modeled this PR after https://github.com/pytorch/pytorch/pull/144274 Pull Request resolved: https://github.com/pytorch/pytorch/pull/151319 Approved by: https://github.com/jamesjwu, https://github.com/laithsakka	2025-04-15 19:40:13 +00:00
Shunting Zhang	8f440a8e70	don't return logits for benchmark script (#151075 ) PT2 benchmark scripts have a pattern like:
```
def forward_and_backward_pass(self, mod, inputs, collect_outputs=True):
    cloned_inputs = clone_inputs(inputs)
    self.optimizer_zero_grad(mod)
    with self.autocast(self.autocast_arg):
        pred = mod(cloned_inputs)
        loss = self.compute_loss(pred)
    self.grad_scaler.scale(loss).backward()
    self.optimizer_step()
    if collect_outputs:
        return collect_results(mod, pred, loss, cloned_inputs)
    return None
```
for training. The collect_outputs argument is True only for accuracy testing and False for performance testing. For the HF benchmark suite, a model usually returns a tuple (loss, logits). For performance testing, even though the logits are never used anywhere, dynamo has to keep them due to the control flow. A few bad things happen if we keep the logits here:
1. the peak memory will be higher, since the logits are large and we cannot release their memory earlier.
2. we cannot do optimizations like chunking for the logits, because the tensor needs to be returned from the pre-grad graph.
Actually I think it's fine to not return logits at all.
- For training cases, checking loss and gradients for accuracy is good enough. It's hard to imagine two runs having mismatched logits but matching loss/gradients.
- Also, discarding logits as soon as possible for perf benchmarking makes it more fair for us.
On the other hand, it may be interesting to let dynamo support something like dynamo.constexpr (similar to tl.constexpr). A variable annotated as dynamo.constexpr will be specialized at compile time, and we can do more optimization (e.g. DCE) at compile time. (A small [repro](https://gist.github.com/shunting314/0912a8947028a904c34f361021b8024d)) Benchmark results here [link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Fri%2C%2004%20Apr%202025%2018%3A03%3A26%20GMT&stopTime=Fri%2C%2011%20Apr%202025%2018%3A03%3A26%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(h100)&lBranch=gh/shunting314/204/head&lCommit=fe25dab3f65e1b0e9db0af03f7664af70fcc9c66&rBranch=main&rCommit=55e62ff74ad5614faf80b060c7bfc551e3b7af5a)
- HF 15% (1.51 -> 1.66 compression ratio) peak memory improvement
- I also see 5% (2.74 -> 2.79x) perf win for HF. It could be true. We may generate more efficient kernels since we don't need to keep the logits and return them from the pre-grad graph. But I'll double check
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151075 Approved by: https://github.com/eellison, https://github.com/jansel	2025-04-15 17:13:00 +00:00
Zhang, Jianyi	a756c50315	[Intel GPU] Avoid using fp32 in sdp math path when benchmark performance. (#150996 ) sdp on xpu will fall back to the math path in some cases (i.e. training). In the dynamo benchmark, we prefer to use fp16 for better performance. Although `allow_fp16_bf16_reduction_math_sdp` is under backends.cuda, its implementation applies to all devices. I didn't add `if device == xpu` here; I suppose cuda devices will not run into the math path anyway. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150996 Approved by: https://github.com/drisspg, https://github.com/EikanWang	2025-04-15 08:08:01 +00:00
henrylhtsang	5a51de5ab1	[cutlass backend] Add more logs for cutlass backend benchmark (#150639 ) The goal is to have a way to tell whether a change makes things better or worse.
```
Average edge over aten (max(-edge, 0), higher is better):
triton: 8.596507086950552 (from 6 valid values)
triton_persistent_tma: 9.517193693923307 (from 6 valid values)
cutlass_lvl_default: 3.3234737908691785 (from 6 valid values)
cutlass_lvl_1111: 7.088173348313991 (from 6 valid values)
cutlass_lvl_2222: 7.291869722320318 (from 6 valid values)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150639 Approved by: https://github.com/ColinPeppler	2025-04-15 04:19:51 +00:00
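The log header above describes the summary as `max(-edge, 0)` averaged over the valid values. A minimal sketch of that aggregation follows, under two assumptions that the commit message does not confirm: that "edge" is the `perf_over_aten (%)` figure from the benchmark tables, and that non-finite entries (such as `inf` for a failed experiment) are the ones excluded as invalid. The function name `average_edge` is hypothetical.

```python
import math
from typing import Iterable, Tuple


def average_edge(perf_over_aten: Iterable[float]) -> Tuple[float, int]:
    """Return (mean of max(-edge, 0), number of valid values).

    Assumption: an edge is a perf_over_aten percentage, where a negative
    value means the backend was faster than aten. Non-finite values are
    treated as invalid and skipped.
    """
    valid = [e for e in perf_over_aten if math.isfinite(e)]
    if not valid:
        return float("nan"), 0
    # A slowdown (positive edge) contributes 0 rather than a negative win.
    wins = [max(-e, 0.0) for e in valid]
    return sum(wins) / len(wins), len(valid)


# One 10% speedup, one 5% slowdown, one failed experiment.
avg, n_valid = average_edge([-10.0, 5.0, float("inf")])
```

Clamping at zero means a backend is never penalized for being slower than aten in this metric; it only measures how much it wins where it wins.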
Animesh Jain	7b1a2373e8	[dynamo][super variable] Fix bug to use correct source (#151154 ) Fixes https://github.com/pytorch/pytorch/issues/150994 We should cherry-pick to 2.7 branch if possible, because this breaks torch.compile on some HF models. Look at the issue referenced here. Pull Request resolved: https://github.com/pytorch/pytorch/pull/151154 Approved by: https://github.com/jansel	2025-04-13 04:48:52 +00:00
PyTorch MergeBot	67d3053d4b	Revert "update benchamark result due to <1% regression (#150937 )" This reverts commit `860765d621`. Reverted https://github.com/pytorch/pytorch/pull/150937 on behalf of https://github.com/laithsakka due to regression diff reverted ([comment](https://github.com/pytorch/pytorch/pull/150937#issuecomment-2797611127))	2025-04-11 17:36:47 +00:00
Laith Sakka	91d1826539	Add dynamic version for mm_loop benchmark (#150865 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/150865 Approved by: https://github.com/eellison	2025-04-09 23:37:43 +00:00
Laith Sakka	860765d621	update benchamark result due to <1% regression (#150937 ) The regression comes from PR https://hud.pytorch.org/pr/148104, which is acceptable, but we have to update the expected results to avoid flakiness in the future. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150937 Approved by: https://github.com/zou3519	2025-04-09 20:25:48 +00:00
Bin Bao	6a8ab902a2	[AOTI][dashboard] Fix mis-calculated memory compression ratio (#150695 ) Summary: https://github.com/pytorch/pytorch/pull/149817 introduced an extra warmup run to compute the AOTI memory compression ratio, but since weights are only loaded once in the AOTI run, the peak memory seen in the extra warmup won't include the weights, which causes an artificially high memory compression ratio. This PR removes that extra warmup run, and calls reset_peak_memory_stats in the proper place instead. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150695 Approved by: https://github.com/yushangdi	2025-04-06 19:51:22 +00:00
Laith Sakka	3320efef6b	Refresh expected results. (#150264 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/150264 Approved by: https://github.com/bobrenjc93	2025-04-05 01:11:19 +00:00
Jason Ansel	d41c22b578	Revert "[fx] Move Node._prepend/Node._remove_from_list to C++ (#148261 )" (#150542 ) Reverts #148261 due to possible memory leak This reverts commit `5d4e7d58b4`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150542 Approved by: https://github.com/clee2000	2025-04-03 21:15:38 +00:00
Bin Bao	d4c30b4599	[AOTI][dashboard] Update how peak memory is measured (#150534 ) Summary: In the dashboard measurement script, AOTI needs to run Eager first to register the output pytree, so the peak memory compression ratio on the dashboard is always close to 1. Update AOTI run to use an extra warmup run, so the peak memory compression ratio measures the result at the run time instead of the compile time. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150534 Approved by: https://github.com/yushangdi	2025-04-03 12:01:43 +00:00
PyTorch MergeBot	203a27e0ce	Revert "[cuBLAS][cuBLASLt] Unify `cuBLASLt` workspaces with `cuBLAS` workspaces (#145130 )" This reverts commit `8f7fbe3d7d`. Reverted https://github.com/pytorch/pytorch/pull/145130 on behalf of https://github.com/clee2000 due to reverted internally by D72140190 ([comment](https://github.com/pytorch/pytorch/pull/145130#issuecomment-2770874244))	2025-04-01 23:07:28 +00:00
Xuehai Pan	a10b765bf1	[pytree] add APIs to determine a class is a namedtuple or PyStructSequence (#113257 ) Changes in this PR: 1. Add `is_structseq` and `is_structseq_class` functions to determine whether an object or a class is a PyStructSequence. 2. Add a generic class `structseq` which can be used as the registration key for PyStructSequence types, like `namedtuple` for Named Tuple types. 3. Change `is_namedtuple` to treat subclasses of a namedtuple as namedtuples. Before this PR, only namedtuple classes directly created by `collections.namedtuple` or `typing.NamedTuple` were namedtuple classes, while their subclasses were not. This PR makes `is_namedtuple` return true for subclasses of a namedtuple class. Resolves #75982. New tests are included in this PR. - #75982 Pull Request resolved: https://github.com/pytorch/pytorch/pull/113257 Approved by: https://github.com/zou3519	2025-04-01 10:40:43 +00:00
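The third change above (accepting subclasses of a namedtuple class) can be illustrated with a small duck-typing check. This is a sketch only: `is_namedtuple_class` here is a hypothetical stand-alone helper written for this example, not the implementation from the PR, and the heuristic it uses (a `tuple` subclass with a `_fields` tuple of strings plus `_make` and `_asdict`) is an assumption.

```python
import time
from collections import namedtuple


def is_namedtuple_class(cls: object) -> bool:
    """Heuristic check that `cls` is a namedtuple class or a subclass of one."""
    if not (isinstance(cls, type) and issubclass(cls, tuple)):
        return False
    fields = getattr(cls, "_fields", None)
    return (
        isinstance(fields, tuple)
        and all(isinstance(f, str) for f in fields)
        and callable(getattr(cls, "_make", None))
        and callable(getattr(cls, "_asdict", None))
    )


Point = namedtuple("Point", ["x", "y"])


class LabeledPoint(Point):
    """Subclass of a namedtuple class; it inherits `_fields` from Point."""

    def label(self) -> str:
        return f"({self.x}, {self.y})"
```

Because the check looks at inherited attributes rather than at the class's direct bases, `LabeledPoint` is accepted along with `Point`, while plain `tuple` and PyStructSequence types such as `time.struct_time` (which has no `_fields`) are not.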
Zhang, Jianyi	0f12951fc2	[Intel gpu] always set deterministic for xpu accuracy test (#149028 ) On Intel Max 1550, models like Super_SloMo can actually pass the accuracy test after setting deterministic, because we do not use atomic in upsampling bilinear backward in some cases when running on XPU. Furthermore, I guess the only reason not to set deterministic on these models is to avoid errors. We should use warn_only = True. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149028 Approved by: https://github.com/guangyey, https://github.com/desertfire Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>	2025-04-01 01:00:11 +00:00
LifengWang	51f0403f46	Update the baseline for max_autotune ci workflow (#149107 ) Since the issue https://github.com/pytorch/pytorch/issues/148535 is fixed in PR https://github.com/pytorch/pytorch/pull/148923, update the baseline for max_autotune ci workflow. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149107 Approved by: https://github.com/chuanqi129, https://github.com/leslie-fang-intel, https://github.com/desertfire	2025-03-31 09:45:44 +00:00
Jane Xu	2c9e07ecd2	[BE] Remove outdated RPC benchmark (#146716 ) We have lots of outdated unused + uncalled code in our codebase, namely in our benchmarks and examples folders among others. The last change to this directory was 4 years ago and this code looks dead. cc @albanD @H-Huang for feedback Pull Request resolved: https://github.com/pytorch/pytorch/pull/146716 Approved by: https://github.com/Skylion007, https://github.com/H-Huang	2025-03-29 04:44:36 +00:00
IvanKobzarev	25309a17f0	[aotd] Config to guess_tangents_stride (#150035 ) Differential Revision: [D71907684](https://our.internmc.facebook.com/intern/diff/D71907684) Pull Request resolved: https://github.com/pytorch/pytorch/pull/150035 Approved by: https://github.com/ilyas409, https://github.com/seemethere	2025-03-28 13:54:19 +00:00
Laith Sakka	7379c66344	add loop mm benchmark (#149932 ) results: compile time instruction count for iteration 4 is 67947323682 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149932 Approved by: https://github.com/bobrenjc93, https://github.com/eellison	2025-03-26 11:21:30 +00:00
Laith Sakka	6c9d48b32b	refresh results of benchmarks (#149936 ) While the test was disabled, I put up a fix, but another win change landed before the test was restored, so it stayed disabled. Caused by https://github.com/pytorch/pytorch/pull/149295 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149936 Approved by: https://github.com/bobrenjc93	2025-03-25 21:01:08 +00:00
Benjamin Glass	23855391f1	Add regression tests for 3 missing PR-time benchmarks (#149423 ) Uses values from the latest PR-time benchmark run on viable/strict. See https://github.com/pytorch/pytorch/actions/runs/13898520615/job/38900894469 for a job showing why this is needed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149423 Approved by: https://github.com/laithsakka	2025-03-24 23:39:36 +00:00
Simon Fan	86ee3bf3d5	[ca] use torch.compile ca API for benchmarks (#149647 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149647 Approved by: https://github.com/jansel	2025-03-24 19:06:45 +00:00
eqy	8f7fbe3d7d	[cuBLAS][cuBLASLt] Unify `cuBLASLt` workspaces with `cuBLAS` workspaces (#145130 ) As `cuBLAS` workspaces are already per-stream, there shouldn't be kernel execution overlap with `cuBLASLt` kernels. This PR reuses `cuBLAS` workspaces for `cuBLASLt` for the following benefits: + caching (`cuBLAS` workspaces were already cached, so now we get that for `cuBLASLt`) + "free" workspace size bump for `cuBLASLt`: `cuBLASLt` workspace sizes were previously smaller than those for `cuBLAS` by default, which potentially hurts performance, and we encountered difficulty in increasing the size due to downstream OOMs, see also #120925 + fixes broken behavior with the memtracker; https://github.com/pytorch/pytorch/pull/139442 attempted to handle peaky allocation behavior that broke memtracker equivalence tests but it didn't seem to fully work, here the cached/reused `cuBLAS` workspace seems to fix it + one environment variable to rule them all: `CUBLAS_WORKSPACE_CONFIG` applies directly to `cuBLASLt` without a confusing `CUBLASLT_WORKSPACE_SIZE` that users would also need to consider Pull Request resolved: https://github.com/pytorch/pytorch/pull/145130 Approved by: https://github.com/ngimel	2025-03-22 05:50:11 +00:00
LifengWang	fa5f556f88	[CI] enable operator benchmark on CPU (#143733 ) This is to enable operator benchmark for CPU to track op level performance. This PR is motivated by PR: https://github.com/pytorch/pytorch/issues/120982 and investigate feasibility in https://github.com/pytorch/pytorch/pull/127216 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143733 Approved by: https://github.com/leslie-fang-intel, https://github.com/atalman, https://github.com/huydhn, https://github.com/malfet Co-authored-by: diwei sun <diwei.sun@intel.com> Co-authored-by: chuanqiw <chuanqi.wang@intel.com>	2025-03-21 16:46:03 +00:00
Pian Pawakapan	e0e8639a10	[torchbench] fix dynamic_shapes spec for moco (#148772 ) Fixes https://github.com/pytorch/pytorch/issues/148333 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148772 Approved by: https://github.com/yushangdi, https://github.com/desertfire	2025-03-18 18:16:54 +00:00
Laith Sakka	6055a4f612	refresh benchmarks results. (#149347 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149347 Approved by: https://github.com/jamesjwu	2025-03-18 08:53:49 +00:00
Aaron Gokaslan	a0ac63cbd9	[BE]: Apply ruff PERF403 to use dict comprehensions more often (#149257 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/149257 Approved by: https://github.com/jansel	2025-03-18 00:46:07 +00:00
PyTorch MergeBot	24cfeec2c7	Revert "[BE]: Apply ruff PERF403 to use dict comprehensions more often (#149257 )" This reverts commit `bfee141666`. Reverted https://github.com/pytorch/pytorch/pull/149257 on behalf of https://github.com/malfet due to Let's see if it helps restore compiler benchmark sanity, see `8bc7bd94a5/1` ([comment](https://github.com/pytorch/pytorch/pull/149257#issuecomment-2731133812))	2025-03-17 22:57:00 +00:00
Shunting Zhang	6c7d8419e3	fix two accuracy regression (#149172 ) There are 2 accuracy regressions in the 3/12 nightly perf run. I cannot repro them locally, so there is no effective way to bisect. Raise the tolerance to make them pass the accuracy check. - error log for HF MegatronBertForQuestionAnswering https://gist.github.com/shunting314/25322b66e15e98feed32e0d9a1e43316 - error log for TIMM gluon_inception_v3 https://gist.github.com/shunting314/df64ce22327df27a7057bbbd19ef5164 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149172 Approved by: https://github.com/jansel, https://github.com/eellison	2025-03-17 19:34:00 +00:00
Aaron Gokaslan	bfee141666	[BE]: Apply ruff PERF403 to use dict comprehensions more often (#149257 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/149257 Approved by: https://github.com/jansel	2025-03-16 23:52:58 +00:00
PyTorch MergeBot	f9b4856989	Revert "[pytree] add APIs to determine a class is a namedtuple or PyStructSequence (#113257 )" This reverts commit `c95a6b416b`. Reverted https://github.com/pytorch/pytorch/pull/113257 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. @zou3519 can you please help land this internally? See the sigmoid tests in D71198793 for details. To validate the fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/113257#issuecomment-2725982539))	2025-03-14 23:13:34 +00:00
Xuehai Pan	c95a6b416b	[pytree] add APIs to determine a class is a namedtuple or PyStructSequence (#113257 ) Changes in this PR: 1. Add `is_structseq` and `is_structseq_class` functions to determine whether an object or a class is a PyStructSequence. 2. Add a generic class `structseq` which can be used as the registration key for PyStructSequence types, like `namedtuple` for Named Tuple types. 3. Change `is_namedtuple` to treat subclasses of a namedtuple as namedtuples. Before this PR, only namedtuple classes directly created by `collections.namedtuple` or `typing.NamedTuple` were namedtuple classes, while their subclasses were not. This PR makes `is_namedtuple` return true for subclasses of a namedtuple class. Resolves #75982. New tests are included in this PR. - #75982 Pull Request resolved: https://github.com/pytorch/pytorch/pull/113257 Approved by: https://github.com/zou3519	2025-03-14 08:50:30 +00:00
henrylhtsang	f2d43d866c	[cutlass backend] switch layout for cutlass backend benchmark (#149009 )
```
python benchmarks/inductor_backends/cutlass.py
```
logs:
```
Experiment group: mm (1024x1024, 1024x1024) torch.float16
+-----------------------+--------------------+----------------------+---------------------+
| name                  | forward_time (us)  | compilation_time (s) | perf_over_aten (%)  |
+-----------------------+--------------------+----------------------+---------------------+
| aten                  | 13.059554621577263 | 1.580178506206721    | NA                  |
| triton                | 10.245470330119133 | 0.04118620231747627  | -21.54808776410064  |
| triton_persistent_tma | 10.388538241386414 | 0.04225084185600281  | -20.45258400908819  |
| cutlass_lvl_default   | 12.882896699011326 | 231.14990583620965   | -1.3527101626732294 |
| cutlass_lvl_1111      | 11.362981051206589 | 126.41650272067636   | -12.99105229490415  |
| cutlass_lvl_2222      | 11.107578873634338 | 555.8380545829423    | -14.946725248331441 |
+-----------------------+--------------------+----------------------+---------------------+
Experiment group: mm (1024x1024, 1024x1024) torch.bfloat16
+-----------------------+--------------------+----------------------+---------------------+
| name                  | forward_time (us)  | compilation_time (s) | perf_over_aten (%)  |
+-----------------------+--------------------+----------------------+---------------------+
| aten                  | 14.037585817277431 | 0.21587548777461052  | NA                  |
| triton                | 10.571777820587158 | 78.15654796129093    | -24.68948750735019  |
| triton_persistent_tma | 10.761583223938942 | 1.3195342738181353   | -23.337364672110443 |
| cutlass_lvl_default   | 12.872588820755482 | 237.0100042372942    | -8.299126443010406  |
| cutlass_lvl_1111      | 11.08622644096613  | 137.55013868492097   | -21.02469338195443  |
| cutlass_lvl_2222      | 11.044904589653015 | 551.265836935956     | -21.319059178545007 |
+-----------------------+--------------------+----------------------+---------------------+
Experiment group: mm (2048x2048, 2048x2048) torch.float16
+-----------------------+--------------------+----------------------+---------------------+
| name                  | forward_time (us)  | compilation_time (s) | perf_over_aten (%)  |
+-----------------------+--------------------+----------------------+---------------------+
| aten                  | 30.483894050121307 | 0.27990864124149084  | NA                  |
| triton                | 29.567627236247063 | 99.87172158574685    | -3.005740711366232  |
| triton_persistent_tma | 29.66325916349888  | 1.3695051120594144   | -2.692027748401006  |
| cutlass_lvl_default   | 29.82821688055992  | 72.61214569816366    | -2.150897022812533  |
| cutlass_lvl_1111      | 29.476772993803024 | 67.7428645719774     | -3.303780857728953  |
| cutlass_lvl_2222      | 30.113255605101585 | 233.84051702311262   | -1.2158500630212203 |
+-----------------------+--------------------+----------------------+---------------------+
Experiment group: mm (2048x2048, 2048x2048) torch.bfloat16
+-----------------------+--------------------+----------------------+---------------------+
| name                  | forward_time (us)  | compilation_time (s) | perf_over_aten (%)  |
+-----------------------+--------------------+----------------------+---------------------+
| aten                  | 30.58255836367607  | 0.058386584743857384 | NA                  |
| triton                | 29.799651354551315 | 100.18178300186992   | -2.559978795150901  |
| triton_persistent_tma | 29.362043365836143 | 1.534341821912676    | -3.990885861562106  |
| cutlass_lvl_default   | 29.4346883893013   | 73.68858492700383    | -3.7533484305817093 |
| cutlass_lvl_1111      | 29.164200648665428 | 75.44329373072833    | -4.637799421958348  |
| cutlass_lvl_2222      | 29.13798950612545  | 227.33327346481383   | -4.7235056020244    |
+-----------------------+--------------------+----------------------+---------------------+
Experiment group: mm (8192x8192, 8192x8192) torch.float16
+-----------------------+--------------------+----------------------+--------------------+
| name                  | forward_time (us)  | compilation_time (s) | perf_over_aten (%) |
+-----------------------+--------------------+----------------------+--------------------+
| aten                  | 1656.6237211227417 | 0.0549461180344224   | NA                 |
| triton                | 1892.8285837173462 | 2.3174119112081826   | 14.258208401997386 |
| triton_persistent_tma | 1665.332317352295  | 2.7922237082384527   | 0.525683419747917  |
| cutlass_lvl_default   | 1705.5492401123047 | 108.31571159465238   | 2.9533272019312116 |
| cutlass_lvl_1111      | 1714.9059772491455 | 17.64627545280382    | 3.518134829489478  |
| cutlass_lvl_2222      | 1680.4152727127075 | 306.9972395859659    | 1.4361469829637354 |
+-----------------------+--------------------+----------------------+--------------------+
Experiment group: mm (8192x8192, 8192x8192) torch.bfloat16
+-----------------------+--------------------+----------------------+--------------------+
| name                  | forward_time (us)  | compilation_time (s) | perf_over_aten (%) |
+-----------------------+--------------------+----------------------+--------------------+
| aten                  | 1621.416687965393  | 0.06300561130046844  | NA                 |
| triton                | 1782.3902368545532 | 2.318530729971826    | 9.927956834535548  |
| triton_persistent_tma | 1586.0934257507324 | 2.7931175641715527   | -2.178543151605614 |
| cutlass_lvl_default   | 1657.4617624282837 | 43.31810224894434    | 2.2230605328307784 |
| cutlass_lvl_1111      | 1641.5367126464844 | 17.648567833006382   | 1.2408916739557292 |
| cutlass_lvl_2222      | 1645.8417177200317 | 249.33647010894492   | 1.5064005407078918 |
+-----------------------+--------------------+----------------------+--------------------+
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149009 Approved by: https://github.com/chenyang78, https://github.com/jingsh	2025-03-13 01:57:47 +00:00
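The `perf_over_aten (%)` column in the log above appears to be the relative change in forward time versus the `aten` row, so negative values mean the backend is faster. The formula below is inferred from the logged numbers rather than stated in the commit, and the helper name is hypothetical; it is a sketch for reading the tables, not the benchmark script's code.

```python
def perf_over_aten(forward_time_us: float, aten_time_us: float) -> float:
    """Relative change in forward time versus aten, in percent.

    Negative means faster than aten, positive means slower.
    (Formula inferred from the logged values; an assumption.)
    """
    return (forward_time_us - aten_time_us) / aten_time_us * 100.0


# Forward times copied from the float16 1024x1024 experiment group.
aten_us = 13.059554621577263
triton_us = 10.245470330119133
triton_pct = perf_over_aten(triton_us, aten_us)

# Forward times copied from the float16 8192x8192 experiment group,
# where triton is slower than aten and the sign flips.
large_aten_us = 1656.6237211227417
large_triton_us = 1892.8285837173462
large_triton_pct = perf_over_aten(large_triton_us, large_aten_us)
```

Read this way, the tables show the non-aten backends winning at 1024x1024 and 2048x2048, while at 8192x8192 most of them are slightly slower than aten.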
henrylhtsang	66300d3d55	[cutlass backend] try make cutlass backend benchmark more robust (#149015 ) Differential Revision: [D71006269](https://our.internmc.facebook.com/intern/diff/D71006269/) I want to make sure that the benchmark can still print most of the results even if some experiments fail.
```
Experiment group: mm (3x3, 3x3) torch.bfloat16
+-----------------------+-------------------+----------------------+---------------------+
| name                  | forward_time (us) | compilation_time (s) | perf_over_aten (%)  |
+-----------------------+-------------------+----------------------+---------------------+
| aten                  | 6.175220478326082 | 0.5982149520423263   | NA                  |
| triton                | 5.326753947883844 | 3.2067150759976357   | -13.739858089605114 |
| triton_persistent_tma | 5.340870004147291 | 3.279932268196717    | -13.51126615004617  |
| cutlass_lvl_default   | inf               | inf                  | inf                 |
| cutlass_lvl_1111      | inf               | inf                  | inf                 |
| cutlass_lvl_2222      | inf               | inf                  | inf                 |
| cutlass_lvl_3333      | inf               | inf                  | inf                 |
+-----------------------+-------------------+----------------------+---------------------+
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149015 Approved by: https://github.com/chenyang78, https://github.com/jingsh	2025-03-12 18:59:49 +00:00
LifengWang	e40a9e602b	Add the max_autotune tests in the periodic jobs. (#143560 ) To promptly detect issues with max_autotune, such as [#143102](https://github.com/pytorch/pytorch/issues/143102), add the max_autotune tests to the periodic CI to track the accuracy regularly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143560 Approved by: https://github.com/leslie-fang-intel, https://github.com/desertfire	2025-03-12 01:47:46 +00:00
Bin Bao	f69e58e8e8	[CI] Update crossvit_9_240 as pass (#148989 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148989 Approved by: https://github.com/ZainRizvi	2025-03-11 20:54:39 +00:00
Rengan Xu	da4bb72a71	Backout D70075331 (#148824 ) Summary: The AOTI lowering for model 699109736 and other new models worked before D70075331, but failed after with error "RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasLtMatmul with transpose_mat1 1 transpose_mat2 0 m 4096 n 10 k 7936 mat1_ld 7936 mat2_ld 7936 result_ld 4096 abcType 2 computeType 68 scaleType 0" So we revert D70075331 as a workaround now. Test Plan: The model could be lowered and published successfully. e.g. 702869739_16 Differential Revision: D70823254 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148824 Approved by: https://github.com/eqy	2025-03-11 12:51:17 +00:00
PyTorch MergeBot	ebd087e4b5	Revert "[pytree] add APIs to determine a class is a namedtuple or PyStructSequence (#113257 )" This reverts commit `f08146b67b`. Reverted https://github.com/pytorch/pytorch/pull/113257 on behalf of https://github.com/jovianjaison due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/113257#issuecomment-2711299830))	2025-03-10 17:19:21 +00:00
Jason Ansel	a60b4ed623	[fx] Optimize TracerBase.create_arg and Graph._gen_python_code (#148292 ) Before: 19502951 function calls (18702776 primitive calls) in 8.533 seconds After: 16402551 function calls (15602452 primitive calls) in 7.701 seconds Pull Request resolved: https://github.com/pytorch/pytorch/pull/148292 Approved by: https://github.com/oulgen ghstack dependencies: #148243, #148260, #148261, #148288	2025-03-10 16:06:19 +00:00
Jason Ansel	8f858e226b	[fx] Optimizations for node name generation (#148288 ) Before: ![image](https://github.com/user-attachments/assets/3a9ed22b-ae33-41ec-a0db-01f4f3ca2ffe) After: ![image](https://github.com/user-attachments/assets/44c6e578-c63e-4a43-b3e0-d11d4bdbb6db) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148288 Approved by: https://github.com/oulgen ghstack dependencies: #148243, #148260, #148261	2025-03-10 16:06:19 +00:00
Jason Ansel	5d4e7d58b4	[fx] Move Node._prepend/Node._remove_from_list to C++ (#148261 ) Microbenchmarking `fx.symbolic_trace(lambda x: functools.reduce(operator.add, [x, *range(100000)]))`, before: ``` 24303536 function calls (23503339 primitive calls) in 10.726 seconds ``` after: ``` 20003454 function calls (19203257 primitive calls) in 8.936 seconds ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148261 Approved by: https://github.com/oulgen ghstack dependencies: #148243, #148260	2025-03-10 16:06:11 +00:00
Jason Ansel	bf752c36da	[fx] Move Node._update_args_kwargs to C++ (#148260 ) Microbenchmarking `fx.symbolic_trace(lambda x: functools.reduce(operator.add, [x, *range(100000)]))`, before: ``` 25203549 function calls (24403352 primitive calls) in 12.090 seconds ``` after: ``` 24303536 function calls (23503339 primitive calls) in 10.726 seconds ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/148260 Approved by: https://github.com/oulgen ghstack dependencies: #148243	2025-03-10 16:06:02 +00:00


2009 Commits