pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-06 12:20:52 +01:00

Author	SHA1	Message	Date
Nikita Shulga	975bbc63db	[MPS][BE] Move fmod/remainder to Metal ops (#154280 ) This accomplishes the following: - Fixes a correctness problem with large integer types (though it probably makes the op slower, which could not be avoided if one wants to compute an accurate answer) - Makes the op faster for floating point types (as Metal kernel invocation is faster than creating an MPSGraph) - Eliminates the need for several correctness workarounds Fixes https://github.com/pytorch/pytorch/issues/154171 Pull Request resolved: https://github.com/pytorch/pytorch/pull/154280 Approved by: https://github.com/dcci ghstack dependencies: #154275, #154290	2025-05-24 01:45:33 +00:00
Nikita Shulga	8f08bdb7f2	[MPS][BE] Code dedup (#154290 ) Eliminate some copy-pasta by introducing `REGISTER_FLOAT_BINARY_OP` and `REGISTER_INTEGER_BINARY_OP` macros Use `_METAL_310_PLUS` to guard bfloat dtype use Pull Request resolved: https://github.com/pytorch/pytorch/pull/154290 Approved by: https://github.com/yangw-dev, https://github.com/wdvr ghstack dependencies: #154275	2025-05-24 01:41:31 +00:00
Nikita Shulga	e5f63f4f66	[CI] Move Mac testing to 3.12 (#154177 ) Prep step to completely move away from Conda during the builds. Pull Request resolved: https://github.com/pytorch/pytorch/pull/154177 Approved by: https://github.com/huydhn, https://github.com/cyyever, https://github.com/atalman ghstack dependencies: #154237, #154268, #154271, #154269, #154270	2025-05-24 01:41:20 +00:00
Catherine Lee	11a490f32f	[CI] Reuse old whl on more workflows (#154285 ) Still only on main branch, not PRs, so that we can monitor Pull Request resolved: https://github.com/pytorch/pytorch/pull/154285 Approved by: https://github.com/malfet	2025-05-24 01:25:35 +00:00
Zhengxu Chen	308beeeb56	[dynamo] Use UUID for compiled function variable names. (#154148 ) Summary: We previously assign each compiled function variable a name based on in-process global counter. This works fine within the same process but when we're trying to serialize the states with precompile, we need a way to load back these compiled functions without causing collision to the existing global scope. Changing the counter to a true global uuid seems to resolve this issue. For example, the new variable name will look like: ``` __compiled_fn_0_7ce7d872_4fe8_4174_b8fd_2496b09b8b43 ``` Test Plan: CI Differential Revision: D75244901 Pull Request resolved: https://github.com/pytorch/pytorch/pull/154148 Approved by: https://github.com/jansel	2025-05-24 01:08:42 +00:00
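The naming scheme described in the commit above can be sketched as follows. This is a minimal illustration, not the Dynamo implementation: the helper `compiled_fn_name` and the module-level counter are hypothetical names, and the only detail taken from the commit message is the shape of the resulting identifier (counter followed by a UUID with underscores).

```python
import itertools
import uuid

# Hypothetical in-process counter; on its own it can collide once names
# are serialized in one process and loaded back in another.
_counter = itertools.count()


def compiled_fn_name() -> str:
    # Hyphens are not valid in Python identifiers, so the UUID's hyphens
    # are replaced with underscores.
    suffix = str(uuid.uuid4()).replace("-", "_")
    return f"__compiled_fn_{next(_counter)}_{suffix}"


name_a = compiled_fn_name()
name_b = compiled_fn_name()
print(name_a)
print(name_b)
```

Because the UUID part is generated independently of the counter, two processes that both start counting at 0 still produce distinct names.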
leslie-fang-intel	7ba6fb69e6	[Inductor][CPP] Enable vectorized fp8 E5M2 quant dequant (#153365 ) Summary This PR enables the vectorization codegen with Inductor CPP backend for `FP8_E5M2` `quant` from `float32` and `dequant` to `float32`. Test Plan ``` python test/inductor/test_cpu_repro.py -k test_dequant_quant_lowering_fp8_e5m2 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/153365 Approved by: https://github.com/jansel, https://github.com/jgong5 ghstack dependencies: #152417, #152418, #153364	2025-05-23 23:20:02 +00:00
leslie-fang-intel	84b657d0b5	Add Vectorized FP8 E5M2 (#153364 ) Summary This PR mainly adds the `Vectorized<Float8_e5m2>` class to support the vectorization of `FP8 E5M2` with methods: - Convert to/from `Vectorized<float>` - Common vectorized methods such as `mul`, `abs`, `eq`, etc. Test Plan ``` ./build/bin/vec_test_all_types_AVX512 --gtest_filter=FP8E5M2Test.* ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/153364 Approved by: https://github.com/jgong5, https://github.com/CaoE, https://github.com/vkuzo ghstack dependencies: #152417, #152418	2025-05-23 23:11:25 +00:00
leslie-fang-intel	b77a6504fa	[Inductor][CPP] Enable vectorized fp8 quant dequant (#152418 ) Summary This PR enables the vectorization codegen with Inductor CPP backend for `FP8_E4M3` `quant` from `float32` and `dequant` to `float32`. Test Plan ``` python test/inductor/test_cpu_repro.py -k test_dequant_quant_lowering_fp8_e4m3 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/152418 Approved by: https://github.com/jansel, https://github.com/jgong5, https://github.com/CaoE ghstack dependencies: #152417	2025-05-23 23:05:17 +00:00
leslie-fang-intel	080b74ce67	Add Vectorized FP8 E4M3 (#152417 ) Summary This PR mainly adds the `Vectorized<Float8_e4m3fn>` class to support the vectorization of `FP8 E4M3` with methods: - Convert to/from `Vectorized<float>` - Common vectorized methods such as `mul`, `abs`, `eq`, etc. Test Plan ``` ./build/bin/vec_test_all_types_AVX512 --gtest_filter=FP8E4M3Test.* ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/152417 Approved by: https://github.com/mingfeima, https://github.com/CaoE, https://github.com/yanbing-j, https://github.com/jgong5, https://github.com/vkuzo	2025-05-23 22:56:56 +00:00
Ting Lu	bab59d3c28	Upgrade to CUDA 12.8.1 for nightly binaries (#152923 ) Upgrade current CUDA 12.8 builds to 12.8.1 Pull Request resolved: https://github.com/pytorch/pytorch/pull/152923 Approved by: https://github.com/atalman	2025-05-23 22:37:05 +00:00
Scott Wolchok	f0b2706914	remove sleef_arm target (#154166 ) Summary: X-link: https://github.com/pytorch/executorch/pull/11082 We shouldn't need an ARM-specific variant; we have select() where we should need it. Test Plan: CI Reviewed By: nlutsenko Differential Revision: D74356413 Pull Request resolved: https://github.com/pytorch/pytorch/pull/154166 Approved by: https://github.com/kimishpatel, https://github.com/malfet, https://github.com/Skylion007	2025-05-23 22:16:01 +00:00
Zain Rizvi	86a160353e	[BE] Don't run windows builds in pull.yml (#154264 ) We already run windows builds and tests [during trunk.yml](`c13eeaa718/.github/workflows/trunk.yml (L115-L130)`). Spot checking for failures of this job in pull.yml shows that most of the time this job fails, the failure correlates with other build jobs failing as well, so it's not offering much unique signal. Given that we'll run this job before merging the PR as part of trunk.yml anyway, the trade-off of getting a windows build signal a little earlier doesn't seem worth the infra investment. Pull Request resolved: https://github.com/pytorch/pytorch/pull/154264 Approved by: https://github.com/malfet	2025-05-23 22:03:19 +00:00
Catherine Lee	65f0cf3df5	[mergebot] Do not block on autoformat workflow (#154236 ) Helps with https://github.com/pytorch/pytorch/issues/154084 Merge sometimes fails due to autoformat failing. I believe it's because the author doesn't have write perms/workflow running perms -> needs approval for workflows. On merge, the bot adds the merge label -> triggers autoformat workflow -> needs approval (even though it will end up getting skipped because the label doesn't match) -> merge sees this and fails. So I put an ugly exception for the workflow in mergebot. Some restrictions to keep in mind: * Need to check out the PR's code changes to run lint/format on them -> possible security issue if someone modifies a linter/formatter * The (third party) reusable action used in the autoformat workflow requires the trigger to be pull_request Pull Request resolved: https://github.com/pytorch/pytorch/pull/154236 Approved by: https://github.com/malfet	2025-05-23 22:00:34 +00:00
James Wu	bb17f9c98b	[AOTAutogradCache] Fix CHROMIUM_EVENT_LOG being none (#154258 ) It turns out if you import something that's None at import time in python, and later update the value, the one you imported stays None: ``` import torch from torch._dynamo.utils import CHROMIUM_EVENT_LOG class Foo: pass torch._dynamo.utils.CHROMIUM_EVENT_LOG = Foo() print(CHROMIUM_EVENT_LOG) # None ``` This fixes the bug so we get AOTAutogradCache instant events again Differential Revision: [D75305770](https://our.internmc.facebook.com/intern/diff/D75305770/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/154258 Approved by: https://github.com/oulgen	2025-05-23 21:53:31 +00:00
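The binding behaviour behind the commit above can be reproduced without torch. The sketch below uses a throwaway module named `fake_utils` with an attribute `EVENT_LOG`; both names are hypothetical stand-ins, and the "read through the module" pattern shown is one common way to avoid the stale value, not necessarily the exact change made in the PR.

```python
import sys
import types

# Hypothetical stand-in for a module whose global starts out as None.
utils = types.ModuleType("fake_utils")
utils.EVENT_LOG = None
sys.modules["fake_utils"] = utils

# `from ... import name` copies the current binding (None) into this namespace.
from fake_utils import EVENT_LOG
import fake_utils

# Later, the module attribute is rebound to a real object.
fake_utils.EVENT_LOG = object()

stale = EVENT_LOG             # the earlier copy, still None
fresh = fake_utils.EVENT_LOG  # attribute lookup at use time sees the update

print(stale is None, fresh is None)
```

Looking the attribute up on the module at the point of use, rather than importing the name once, is what keeps the reader in sync with later assignments.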
Nikita Shulga	0e4f1b8a06	[CI] Update MacOS conda requirmenets (#154270 ) Pick package versions which are compatible with both 3.9 and 3.12 Pull Request resolved: https://github.com/pytorch/pytorch/pull/154270 Approved by: https://github.com/clee2000, https://github.com/atalman ghstack dependencies: #154237, #154268, #154271, #154269	2025-05-23 21:44:50 +00:00
Nikita Shulga	5db1503846	[CI] Update MacOS numba and scipy versions (#154269 ) Pick versions that are supported by both 3.9 and 3.12 Pull Request resolved: https://github.com/pytorch/pytorch/pull/154269 Approved by: https://github.com/clee2000, https://github.com/atalman ghstack dependencies: #154237, #154268, #154271	2025-05-23 21:44:49 +00:00
Howard Huang	aa3eab2ce6	Fix tcp init when using port 0 (#154156 ) I hit this in tests when calling `init_process_group(init_method="tcp://localhost:0", ...)`. You can't use port 0 due to the bug in the conditional and will get error `ValueError: Error initializing torch.distributed using tcp:// rendezvous: port number missing` Pull Request resolved: https://github.com/pytorch/pytorch/pull/154156 Approved by: https://github.com/d4l3k, https://github.com/Skylion007	2025-05-23 21:41:58 +00:00
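The commit above only says the bug was "in the conditional". A plausible reading, shown here purely as an assumption, is a truthiness check that treats port `0` like a missing port. The functions `port_buggy` and `port_fixed` are hypothetical and are not the PyTorch rendezvous code; only `urllib.parse.urlparse` is a real API.

```python
from urllib.parse import urlparse


def port_buggy(url: str) -> int:
    port = urlparse(url).port
    # Assumed bug: 0 is falsy, so an explicit port 0 looks like "no port".
    if not port:
        raise ValueError("port number missing")
    return port


def port_fixed(url: str) -> int:
    port = urlparse(url).port
    # Only a genuinely absent port (None) is an error; 0 is allowed.
    if port is None:
        raise ValueError("port number missing")
    return port


print(port_fixed("tcp://localhost:0"))
```

Port 0 is useful in tests because it asks the operating system to pick any free port, which is why rejecting it is a real limitation.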
Anthony Shoumikhin	3c0b93afc5	Re-enable link linter (#153280 ) And make URL linter always succeed for now. I'll monitor the logs manually and experiment with it further. Pull Request resolved: https://github.com/pytorch/pytorch/pull/153280 Approved by: https://github.com/albanD	2025-05-23 20:56:25 +00:00
Nikita Shulga	6f34d141ab	[MPS][BE] Delete `complex_div` (#154275 ) An absolute no-op: delete `complex_div` from `UnaryKernel.metal` and use identical one from `c10/metal/utils.h` Pull Request resolved: https://github.com/pytorch/pytorch/pull/154275 Approved by: https://github.com/dcci	2025-05-23 20:53:50 +00:00
Nikita Shulga	dec6a47996	[BE] Delete unused pip-requirements-iOS.txt (#154271 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/154271 Approved by: https://github.com/clee2000 ghstack dependencies: #154237, #154268	2025-05-23 20:08:19 +00:00
Nikita Shulga	acd0873d3b	[CI] Fix `TestDynamoTimed.test_ir_count` for 3.12 (#154268 ) Python-3.12 emits the same bytecode as 3.13 for code in question Pull Request resolved: https://github.com/pytorch/pytorch/pull/154268 Approved by: https://github.com/clee2000, https://github.com/atalman ghstack dependencies: #154237	2025-05-23 20:08:19 +00:00
PyTorch MergeBot	28af44285b	Revert "[c10d] Add support for testing SIGABRT return (#153167 )" This reverts commit `499a76b844`. Reverted https://github.com/pytorch/pytorch/pull/153167 on behalf of https://github.com/malfet due to Broke lint, see `fe784c5a2c/1` ([comment](https://github.com/pytorch/pytorch/pull/153167#issuecomment-2905623868))	2025-05-23 19:44:08 +00:00
Shangdi Yu	fe784c5a2c	Fix torchbind path in AOTI package loader (#154265 ) Summary: as title, fix the path in package loader and fix the test to take the additional dir into consideration. Test Plan: ``` buck run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:torchbind ``` Reviewed By: angelayi Differential Revision: D75308904 Pull Request resolved: https://github.com/pytorch/pytorch/pull/154265 Approved by: https://github.com/clee2000, https://github.com/malfet	2025-05-23 19:32:53 +00:00
PyTorch MergeBot	90855835ff	Revert "[AOTI][cutlass backend] Do not remove the cutlass kernel .o file after packaging (#154155 )" This reverts commit `269fa8028f`. Reverted https://github.com/pytorch/pytorch/pull/154155 on behalf of https://github.com/henrylhtsang due to mistake in PR ([comment](https://github.com/pytorch/pytorch/pull/154155#issuecomment-2905514934))	2025-05-23 19:08:40 +00:00
Angela Yi	3b21d79225	[export] Move PT2ArchiveWriter/Reader to torch/export (#153795 ) Summary: Before: `from sigmoid.core.package.pt2_archive import PT2ArchiveWriter, PT2ArchiveReader, is_sigmoid_package` After: `from torch.export.pt2_archive import PT2ArchiveWriter, PT2ArchiveReader, is_pt2_package` By merging the two PT2ArchiveReader/Writers, into using the native PytorchFileReader/Writer, the open source PT2 archive also changed to have an additional folder. However this PR still maintains support for loading an old PT2 archive which does not have the additional folder. Before: ``` ├── archive_format ├── byteorder ├── .data │ ├── serialization_id │ └── version ├── data │ ├── aotinductor ``` After: ``` ├── tmp │ ├── archive_format │ ├── byteorder │ ├── .data │ │ ├── serialization_id │ │ └── version │ ├── data │ │ ├── aotinductor ``` Test Plan: `buck2 test //sigmoid/...` https://www.internalfb.com/intern/testinfra/testrun/5348024839248187 Pull Request resolved: https://github.com/pytorch/pytorch/pull/153795 Approved by: https://github.com/zhxchen17	2025-05-23 19:04:36 +00:00
Ke Wen	499a76b844	[c10d] Add support for testing SIGABRT return (#153167 ) `SIGABRT` is a common return by negative distributed tests, which checks for effectiveness of NaN assert, watchdog throw, etc. These errors are not detectable by traditional statements like `with self.assertRaises(RuntimeError)`. Instead, we'd need to check for the process's return code, e.g. `SIGABRT(6)` would have a return code of -6. Pull Request resolved: https://github.com/pytorch/pytorch/pull/153167 Approved by: https://github.com/fduwjj	2025-05-23 19:04:28 +00:00
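The return-code convention mentioned in the commit above can be checked with the standard library alone. This is a POSIX-only sketch and does not use the c10d test helpers the PR adds: it aborts a child interpreter with `os.abort()` and inspects `returncode`, which `subprocess` reports as the negative signal number when a child is killed by a signal.

```python
import signal
import subprocess
import sys

# Run a child interpreter that aborts itself; stderr is discarded to keep
# the abort message out of the output.
proc = subprocess.run(
    [sys.executable, "-c", "import os; os.abort()"],
    stderr=subprocess.DEVNULL,
)

# On POSIX, death by signal N is reported as returncode == -N.
expected = -signal.SIGABRT
print(proc.returncode, expected)
```

An exception-based assertion such as `assertRaises` cannot observe this, because the failure happens in another process and never surfaces as a Python exception in the parent.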
PyTorch MergeBot	561a11aa68	Revert "Patch the _is_conv_node function (#153749 )" This reverts commit `c985cec5b2`. Reverted https://github.com/pytorch/pytorch/pull/153749 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/153749#issuecomment-2905504697))	2025-05-23 19:04:20 +00:00
PyTorch MergeBot	4ff19ecf66	Revert "[export] Move PT2ArchiveWriter/Reader to torch/export (#153795 )" This reverts commit `7e80f23516`. Reverted https://github.com/pytorch/pytorch/pull/153795 on behalf of https://github.com/malfet due to Looks like it broke lots of tests, see `ec368a1903/1` ([comment](https://github.com/pytorch/pytorch/pull/153795#issuecomment-2905415496))	2025-05-23 18:29:08 +00:00
Svetlana Karslioglu	ec368a1903	Add sitemap (#154158 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/154158 Approved by: https://github.com/albanD	2025-05-23 18:01:00 +00:00
Andy (An) Wang	0d62fd5c3c	[MTIA Aten Backend][2/n] Migrate clamp ops(clamp.out/clamp_min.out/clamp_max.out) from out-of-tree to in-tree (#154015 ) Summary: # Context See the first PR https://github.com/pytorch/pytorch/pull/153670 # This PR 1. Migrate 3 clamp ops from out-of-tree to in-tree(had to migrate the 3 ops altogether, because clamp.out calls all 3 stubs, which are also called by the other 2 ops): - clamp.out - clamp_min.out - clamp_max.out 2. Also enabled structured kernel codegen for MTIA, which is needed by clamp 3. Also introduced the `--mtia` flag to torchgen to prevent OSS from gencoding MTIA code.(Otherwise we got such link error `lib/libtorch_cpu.so: undefined reference to at::detail::empty_mtia`) Differential Revision: D74674418 Pull Request resolved: https://github.com/pytorch/pytorch/pull/154015 Approved by: https://github.com/albanD, https://github.com/nautsimon	2025-05-23 17:59:47 +00:00
Nikita Shulga	bcb2125f0a	[BE][CI] Update expecttest version to 0.3.0 (#154237 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/154237 Approved by: https://github.com/Skylion007, https://github.com/albanD, https://github.com/atalman	2025-05-23 17:27:41 +00:00
Tsung-Hsien Lee	cae25ef4e5	[c10d] Enhance Error Logging in `new_subgroups()` for Non-Divisible World Sizes (#154124 ) Summary: The error caused by the world size not being divisible by `group_size` is a common issue encountered by end-users when utilizing applications built on top of `new_subgroups()`. However, these applications may employ different variable names, such as `num_trainers_per_group`, which can make the current error messages less effective despite being correct. To address this, we have improved the error messages to display the actual numbers involved, thereby enhancing their clarity and usefulness. Test Plan: contbuild & OSS CI Differential Revision: D75226925 Pull Request resolved: https://github.com/pytorch/pytorch/pull/154124 Approved by: https://github.com/wz337	2025-05-23 17:12:43 +00:00
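The idea in the commit above, reporting the actual numbers rather than only the parameter name, can be sketched as below. The function `check_group_size` and the message wording are hypothetical; the real text raised by `new_subgroups()` is not quoted in the commit message.

```python
def check_group_size(world_size: int, group_size: int) -> int:
    """Return the number of subgroups, or raise with the concrete values."""
    if world_size % group_size != 0:
        # Including both numbers helps callers whose own variable is named
        # differently (for example num_trainers_per_group).
        raise ValueError(
            f"The world size ({world_size}) must be divisible by "
            f"group_size ({group_size})"
        )
    return world_size // group_size


num_groups = check_group_size(8, 4)
print(num_groups)
```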
henrylhtsang	e927ba6dbd	[inductor][cutlass backend] Add 2 stage autotuning aka prescreening (#153335 ) Motivation: By default, we are tuning the cutlass backend kernels on 3 swizzles. There are runtime params, so they share the same underlying kernel, which saves a lot of compilation time. However, autotuning all combinations of {configs} x {swizzles} is still expensive. Observations: Winner of the {configs} x {swizzles} autotuning is the same as if we do a greedy search: first find the top X winners of {configs} with swizzle 2 (hardcoded), then autotune on the {top X winner configs} x {swizzles}. In other words, we can use a Greedy algorithm to reduce autotuning time. I attach the logs below. This somewhat depends on what X is, but a number like 5-10 works pretty well from empirical observations. Logs: Baseline: https://gist.github.com/henrylhtsang/9a604f150a270dc19524f72a5d4dfac2 ``` AUTOTUNE mm(2048x2048, 2048x2048) strides: [2048, 1], [1, 2048] dtypes: torch.bfloat16, torch.bfloat16 cuda_cutlass_gemm_1776 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1 cuda_cutlass_gemm_1777 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2 cuda_cutlass_gemm_1778 0.0291 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4 cuda_cutlass_gemm_1800 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1 cuda_cutlass_gemm_1801 0.0293 ms 99.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2 cuda_cutlass_gemm_1802 0.0293 ms 99.2% 
cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4 cuda_cutlass_gemm_9012 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1 cuda_cutlass_gemm_9013 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2 cuda_cutlass_gemm_9014 0.0294 ms 98.9% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4 cuda_cutlass_gemm_8940 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1 cuda_cutlass_gemm_8941 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2 cuda_cutlass_gemm_8942 0.0296 ms 98.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4 cuda_cutlass_gemm_8934 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1 cuda_cutlass_gemm_8935 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2 cuda_cutlass_gemm_8936 0.0297 ms 98.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4 cuda_cutlass_gemm_2001 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1 cuda_cutlass_gemm_2002 0.0297 ms 97.8% 
cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2 cuda_cutlass_gemm_2003 0.0297 ms 97.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4 cuda_cutlass_gemm_1848 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1 cuda_cutlass_gemm_1849 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2 cuda_cutlass_gemm_1850 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4 cuda_cutlass_gemm_8964 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1 cuda_cutlass_gemm_8965 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2 cuda_cutlass_gemm_8966 0.0298 ms 97.6% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4 cuda_cutlass_gemm_8958 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1 cuda_cutlass_gemm_8959 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2 cuda_cutlass_gemm_8960 0.0298 ms 97.5% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4 cuda_cutlass_gemm_1929 0.0302 ms 96.4% 
cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1 cuda_cutlass_gemm_1930 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2 cuda_cutlass_gemm_1931 0.0302 ms 96.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4 cuda_cutlass_gemm_1770 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1 cuda_cutlass_gemm_1771 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2 cuda_cutlass_gemm_1772 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4 cuda_cutlass_gemm_1953 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=1 cuda_cutlass_gemm_1954 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=2 cuda_cutlass_gemm_1955 0.0302 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_tnn_align8_warpspecialized_cooperative_epi_tma swizzle=4 cuda_cutlass_gemm_1995 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1 cuda_cutlass_gemm_1996 0.0303 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2 cuda_cutlass_gemm_1997 0.0303 ms 
96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4 cuda_cutlass_gemm_1794 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1 cuda_cutlass_gemm_1795 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2 cuda_cutlass_gemm_1796 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4 cuda_cutlass_gemm_1842 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1 cuda_cutlass_gemm_1843 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2 cuda_cutlass_gemm_1844 0.0303 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4 cuda_cutlass_gemm_9006 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1 cuda_cutlass_gemm_9007 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2 cuda_cutlass_gemm_9008 0.0304 ms 95.7% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4 cuda_cutlass_gemm_1923 0.0306 ms 95.2% 
cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1 ``` with prescreening: ``` AUTOTUNE mm(147456x6144, 6144x2048) strides: [6144, 1], [2048, 1] dtypes: torch.bfloat16, torch.bfloat16 cutlass_1a5e81af 4.5469 ms 100.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1 cutlass_aa6f899c 4.6328 ms 98.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1 cutlass_aa6f899c 4.6836 ms 97.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4 cutlass_161b8b81 4.7224 ms 96.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1 cutlass_161b8b81 4.7234 ms 96.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2 cutlass_161b8b81 4.7274 ms 96.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4 cutlass_853b6347 4.7369 ms 96.0% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1 cutlass_aa6f899c 4.7404 ms 95.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2 cutlass_161b8b81 4.7711 ms 95.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8 cutlass_8bc6fbda 4.8148 ms 94.4% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma 
swizzle=8 cutlass_8bc6fbda 4.8159 ms 94.4% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1 cutlass_8bc6fbda 4.8214 ms 94.3% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4 cutlass_8bc6fbda 4.8302 ms 94.1% cutlass3x_sm90_tensorop_s64x256x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2 cutlass_0a1c55af 4.8487 ms 93.8% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8 cutlass_0a1c55af 4.8527 ms 93.7% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2 cutlass_02780d72 4.8617 ms 93.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4 cutlass_0a1c55af 4.8737 ms 93.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1 cutlass_0a1c55af 4.8738 ms 93.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4 cutlass_02780d72 4.9348 ms 92.1% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=1 cutlass_02780d72 4.9763 ms 91.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2 cutlass_853b6347 4.9805 ms 91.3% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=2 cutlass_1a5e81af 5.0225 ms 90.5% 
cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8 cutlass_853b6347 5.0271 ms 90.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8 cutlass_02780d72 5.0595 ms 89.9% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8 cutlass_853b6347 5.1434 ms 88.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=4 cutlass_c1ffa14b 5.1574 ms 88.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8 cutlass_1a5e81af 5.1916 ms 87.6% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4 cutlass_c1ffa14b 5.2018 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=4 cutlass_c1ffa14b 5.2019 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=1 cutlass_c1ffa14b 5.2037 ms 87.4% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_256x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2 cutlass_1a5e81af 5.5329 ms 82.2% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=2 cutlass_aa6f899c 11.5046 ms 39.5% cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x256x64_1x2x1_0_ttn_align8_warpspecialized_cooperative_epi_tma swizzle=8 SingleProcess AUTOTUNE benchmarking takes 1.9526 seconds and 0.0352 seconds 
precompiling for 32 choices ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/153335 Approved by: https://github.com/eellison	2025-05-23 17:12:25 +00:00
Shangdi Yu	04a6fe7914	Update provenance tracking doc (#154062 ) Summary: Update the doc to reflect the changes in https://github.com/pytorch/pytorch/pull/153584/files#diff-e0cdb58c0f84f56f20c5433339b6d83c470dcde47847e2328effea6bedd4cd27 and https://github.com/pytorch/tlparse/pull/110 Test Plan: CI Differential Revision: D75155981 Pull Request resolved: https://github.com/pytorch/pytorch/pull/154062 Approved by: https://github.com/svekars, https://github.com/desertfire	2025-05-23 17:09:52 +00:00
Aleksei Nikiforov	7d8ea5db69	Disable cache and utilization stats uploading steps on s390x (#150297 ) There are no AWS credentials available on s390x runners. These steps are failing anyway due to that. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150297 Approved by: https://github.com/seemethere	2025-05-23 16:49:38 +00:00
Angela Yi	7e80f23516	[export] Move PT2ArchiveWriter/Reader to torch/export (#153795 ) Summary: Before: `from sigmoid.core.package.pt2_archive import PT2ArchiveWriter, PT2ArchiveReader, is_sigmoid_package` After: `from torch.export.pt2_archive import PT2ArchiveWriter, PT2ArchiveReader, is_pt2_package` By merging the two PT2ArchiveReader/Writers, into using the native PytorchFileReader/Writer, the open source PT2 archive also changed to have an additional folder. However this PR still maintains support for loading an old PT2 archive which does not have the additional folder. Before: ``` ├── archive_format ├── byteorder ├── .data │ ├── serialization_id │ └── version ├── data │ ├── aotinductor ``` After: ``` ├── tmp │ ├── archive_format │ ├── byteorder │ ├── .data │ │ ├── serialization_id │ │ └── version │ ├── data │ │ ├── aotinductor ``` Test Plan: `buck2 test //sigmoid/...` https://www.internalfb.com/intern/testinfra/testrun/5348024839248187 Differential Revision: D74616598 Pull Request resolved: https://github.com/pytorch/pytorch/pull/153795 Approved by: https://github.com/zhxchen17	2025-05-23 15:40:25 +00:00
Nikita Shulga	214e4cef9f	Fix RMSNorm doc rendering (#154205 ) By removing `::func::` decorator which adds unneeded parenthesis Test plan: Check https://docs-preview.pytorch.org/pytorch/pytorch/154205/generated/torch.nn.RMSNorm.html#rmsnorm that now renders as <img width="704" alt="image" src="https://github.com/user-attachments/assets/443f605d-75a6-41ef-8971-21e7dc8ef9f6" /> Fixes https://github.com/pytorch/pytorch/issues/154184 Pull Request resolved: https://github.com/pytorch/pytorch/pull/154205 Approved by: https://github.com/mikaylagawarecki	2025-05-23 15:39:29 +00:00
Laith Sakka	9e089bb5b6	change guard_or impl for better perf and simplicity (#153674 ) PR time benchmarks have been showing regressions as we move to guard_or_false; the reason is that the previous implementation does not cache. This new approach will propagate the fallback value to eval and return it, allowing eval to cache and reducing log spam and complexity. Pull Request resolved: https://github.com/pytorch/pytorch/pull/153674 Approved by: https://github.com/bobrenjc93	2025-05-23 15:24:28 +00:00
Aaron Orenstein	4b7abce6a4	Fix fake tensor caching when output has unbacked (#153034 ) We handle fake tensor caching in two ways: 1. If the inputs have no symbols (SymInt, etc) then we cache on the FakeTensorMode. 2. If the inputs have symbols then we cache on the ShapeEnv. This way the symbols in the inputs and outputs are associated with the guards in place at the time of the call. However - it's possible to have an op where there are no symbols in the inputs but there is an unbacked symbol in the output. In this case we shouldn't cache at all because what would that really mean? So this PR changes the caching behavior so that if there's a symbol in the output which doesn't come in some way from the input then we refuse to cache that op. Added a test which checks for this case. While in there I also did a couple other related changes: 1. Added negative caching - if we see that an (op, args) failed to cache previously we don't even bother trying to cache it again. 2. Reworked the inner behavior of _cached_dispatch_impl a little to make it more clear which bits we expect to be able to throw _BypassDispatchCache and add some comments. The latest version of this also: 1. Addresses the problem that caused #153891. The issue was that with caching ops are required to support `__eq__`. Unfortunately _RecordFunction is minimalistic and doesn't support that - so in the off-chance that two keys hash to the same value the `__eq__` check would raise an exception. Apparently this was much more common on MacOS where memory patterns end up with more reuse (so the object IDs are the same and give you the same hash value for objects that use pointer hash). Tested locally on MacOS where running ``` python test/inductor/test_torchinductor.py GPUTests ``` was pretty much guaranteed to fail (at least for me) somewhere around test 100-200 and passed all 800 tests after this change. 
Another way to test this is to run the inductor tests with `torch._subclasses.fake_tensor._DispatchCacheKey.__hash__` monkey-patched to return a constant (causing all values to hash-collide) but this can't really be checked-in since it causes the cache lookup to turn into an O(n) lookup which takes a crazy long time to run through all the tests... 2. Folds in #153780 to ensure that exceptions raised from the op don't include the context from the cache key bypass. Pull Request resolved: https://github.com/pytorch/pytorch/pull/153034 Approved by: https://github.com/masnesral, https://github.com/tugsbayasgalan	2025-05-23 15:03:31 +00:00
PyTorch MergeBot	866142ff16	Revert "Update the heuristic for AArch64 bmm/baddbmm (#149122 )" This reverts commit `d759a517af`. Reverted https://github.com/pytorch/pytorch/pull/149122 on behalf of https://github.com/jeanschmidt due to breaking internal models, @malfet may you help merge this? ([comment](https://github.com/pytorch/pytorch/pull/149122#issuecomment-2904703075))	2025-05-23 14:54:54 +00:00
Nikita Shulga	5859582ee4	[BE][MPS] Delete unused `complex_mul_out` (#154175 ) It's no longer called, after `mul` has been migrated to binary op Pull Request resolved: https://github.com/pytorch/pytorch/pull/154175 Approved by: https://github.com/dcci, https://github.com/Skylion007	2025-05-23 13:44:24 +00:00
Jonathan Deakin	2225231a14	Enable AArch64 CI scripts to be used for local dev (#143190 ) - Allow user to specify custom ComputeLibrary directory, which is then built rather than checking out a clean copy - Remove `setup.py clean` in build. The CI environment should be clean already, removing this enables incremental rebuilds - Use all cores for building ComputeLibrary Mostly a port of https://github.com/pytorch/builder/pull/2028 with the conda part removed, because aarch64_ci_setup.sh has changed and can now handle being called twice. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143190 Approved by: https://github.com/aditew01, https://github.com/fadara01, https://github.com/malfet Co-authored-by: David Svantesson-Yeung <David.Svantesson-Yeung@arm.com>	2025-05-23 12:09:59 +00:00
Ke Wen	25149cd173	[c10d] Add more tests to prevent extra context (#154174 ) Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): Loop a bunch of sync ops and see if any of them creates extra context. Requires nvml to check number of processes resident on a device. Pull Request resolved: https://github.com/pytorch/pytorch/pull/154174 Approved by: https://github.com/atalman	2025-05-23 09:54:01 +00:00
wengshiy	ba5d45d22e	Add assertion to align with cuda (#153233 ) Fixes #153137 Aligned batch_norm_cpu_out assertion to [batch_norm_cuda_out](`a7ea115494/aten/src/ATen/native/cuda/Normalization.cu (L436)`). Pull Request resolved: https://github.com/pytorch/pytorch/pull/153233 Approved by: https://github.com/malfet	2025-05-23 07:32:43 +00:00
Autin Mitra	5623d30228	[Minimizer] Gracefully exit when there is no discrepancy in block mode (#154076 ) Summary: Previously, when there is no discrepancy in results for block mode, net_min_base will throw an OOB error. This occurs due to _block_traverse_impl returning an OOB after exhausting subgraphs all the way down to a single node. There is also an issue where we may get an unsound subgraph (i.e., mark an earlier node as the "end" even if the correct end is later). This is due to an incorrect check (start_idx == mid) where there can possibly be two values left before the program prematurely returns. Test Plan: Buck UI: https://www.internalfb.com/buck2/52524c26-ace5-4593-8a4b-843a54eb206a Test UI: https://www.internalfb.com/intern/testinfra/testrun/3096224973363310 Network: Up: 0B Down: 15MiB (reSessionID-cd404e97-395f-49fc-8381-373e90a1378f) Executing actions. Remaining 0/1 Command: test. Time elapsed: 53.7s Tests finished: Pass 7. Fail 0. Fatal 0. Skip 0. Build failure 0 Differential Revision: D75143242 Pull Request resolved: https://github.com/pytorch/pytorch/pull/154076 Approved by: https://github.com/jfix71	2025-05-23 06:42:07 +00:00
Filip Jankovic	8342b9371e	[ROCm] Prefer hipblaslt for gfx1200, gfx1201 (#153610 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/153610 Approved by: https://github.com/jeffdaily, https://github.com/atalman	2025-05-23 06:01:53 +00:00
angelayi	26471fc203	[aoti] Initial Metal support (#153959 ) An example generated file: P1816629015 Pull Request resolved: https://github.com/pytorch/pytorch/pull/153959 Approved by: https://github.com/malfet, https://github.com/desertfire ghstack dependencies: #153964	2025-05-23 05:45:35 +00:00
angelayi	b33b7d5c8c	[aoti] Add MPS runner and shim (#153964 ) Added AOTIModelContainerRunnerMps and a shim for MPS fallback ops. I also added an MPS-specific shim which contains one operator, which will be used to set arguments being passed to the Metal kernel: ``` AOTI_TORCH_EXPORT AOTITorchError aoti_torch_mps_set_arg( AOTIMetalKernelFunctionHandle func, unsigned idx, AtenTensorHandle tensor); ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/153964 Approved by: https://github.com/malfet, https://github.com/desertfire	2025-05-23 05:45:35 +00:00
henrylhtsang	269fa8028f	[AOTI][cutlass backend] Do not remove the cutlass kernel .o file after packaging (#154155 ) Differential Revision: [D75253009](https://our.internmc.facebook.com/intern/diff/D75253009/) In general, we want to cache the cutlass kernels. Also saw an error saying .o not found. Pull Request resolved: https://github.com/pytorch/pytorch/pull/154155 Approved by: https://github.com/chenyang78	2025-05-23 04:51:36 +00:00
William Wen	5bb156a7fd	[dynamo] raise observed exception for module attribute errors (#153659 ) Fixes https://github.com/pytorch/pytorch/issues/153605 Pull Request resolved: https://github.com/pytorch/pytorch/pull/153659 Approved by: https://github.com/StrongerXi	2025-05-23 03:56:26 +00:00

88238 Commits