Summary:
Support DeepSeek-style scaling in Inductor Triton for FP8 GEMMs. DeepSeek-style scaling is a colloquial term for the fine-grained mixed-precision framework using FP8 to train [DeepSeek-V3](https://arxiv.org/pdf/2412.19437), DeepSeek AI's recent MoE (Mixture of Experts) model. DeepSeek-style scaling effectively extends the dynamic range of FP8 by mitigating dequantization overhead under increased-precision accumulation, which is key to achieving more accurate FP8 GEMM results.
DeepSeek-style scaling on matmul `A @ B` leverages two different types of scaling strategies to preserve a balance between numerical stability and training efficiency:
- Activations (input tensor `A`): tile-wise (1x128 across shape `(M, K)`)
- Weights (input tensor `B`): block-wise (128x128 across shape `(N, K)`)
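A minimal sketch of how these two scale tensors could be computed in eager PyTorch (illustrative only; the tile/block sizes follow the description above, the helper names are hypothetical, and this is not the Inductor template code):
```
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0

def tilewise_scales(a: torch.Tensor, tile: int = 128) -> torch.Tensor:
    # Activations: one scale per 1x128 tile along K of the (M, K) tensor.
    m, k = a.shape
    amax = a.reshape(m, k // tile, tile).abs().amax(dim=-1)
    return amax / FP8_MAX  # shape (M, K // 128)

def blockwise_scales(b: torch.Tensor, block: int = 128) -> torch.Tensor:
    # Weights: one scale per 128x128 block of the (N, K) tensor.
    n, k = b.shape
    blocks = b.reshape(n // block, block, k // block, block)
    amax = blocks.abs().amax(dim=(1, 3))
    return amax / FP8_MAX  # shape (N // 128, K // 128)

a = torch.randn(4096, 512)  # activations (M, K)
b = torch.randn(768, 512)   # weights (N, K)
a_fp8 = (a / tilewise_scales(a).repeat_interleave(128, dim=1)).to(torch.float8_e4m3fn)
print(tilewise_scales(a).shape, blockwise_scales(b).shape)  # (4096, 4), (6, 4)
```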
This diff enables Inductor users to replicate past successes with DeepSeek-style scaling and achieve higher numerical stability while increasing training efficiency.
NOTE: Block-wise 128x128 scaling is only supported in CUDA 12.9+; therefore, DeepSeek-style scaling is currently unsupported in `fbcode` (CUDA 12.4). Use OSS PyTorch to run DeepSeek-style scaling.
NOTE: FP8 accuracy is unstable even with high tolerances, which is why the TritonBench accuracy comparison against a `torch` implementation is unlikely to pass.
Test Plan:
In OSS PyTorch, run
```
TORCHINDUCTOR_CACHE_DIR=~/personal/cache_dir_inductor CUDA_LAUNCH_BLOCKING=1 TORCH_USE_CUDA_DSA=1 TRITON_PRINT_AUTOTUNING=1 TRITON_ALWAYS_COMPILE=1 TORCH_LOGS=+inductor TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 ENABLE_PERSISTENT_TMA_MATMUL=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 python run.py --op fp8_gemm --only torch_fp8_gemm,pt2_fp8_gemm --metrics tflops,accuracy --m 4096 --n 768 --k 512 --output="{output_dir}/deepseek_bench.csv" --scaling_deepseek --atol=1e-2 --rtol=0.5 2>&1 | tee ~/personal/deepseek_style/deepseek_bench.log
```
Differential Revision: D83609850
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164404
Approved by: https://github.com/slayton58
This PR enables all PIE rules on ruff. Some rules from this family were already enabled; the newly added rules are
```
PIE796 Enum contains duplicate value: {value}
PIE808 Unnecessary start argument in range
```
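For reference, a minimal illustration of the patterns these two rules flag (not code from the PR):
```
import enum

class Color(enum.Enum):
    RED = 1
    CRIMSON = 1         # PIE796: enum contains duplicate value `1`

for i in range(0, 10):  # PIE808: unnecessary start argument; use range(10)
    print(i)
```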
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165814
Approved by: https://github.com/ezyang
Summary:
Original commit changeset: 06888d7ebff0
Original Phabricator Diff: D82932788
Restricted the test to SM90 for scaled_grouped_mm
Test Plan: TBD (will share the linux CI results)
Differential Revision: D83283991
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163905
Approved by: https://github.com/angelayi
## Summary
- We just landed 2d-2d support for mxfp8 grouped gemm in FBGEMM: https://github.com/pytorch/FBGEMM/pull/4816
- This is needed for the backward pass of mxfp8 MoE training with grouped gemms
- Changes:
  - Add dispatching + input validation for mxfp8 grouped gemm in `torch._scaled_grouped_mm`
  - Add meta registration input validation for mxfp8 grouped gemm, for composability with compile
  - Add unit tests exercising `torch._scaled_grouped_mm` with mxfp8 inputs (see the format sketch after this list)
  - Bump FBGEMM third party submodule to include:
    - https://github.com/pytorch/FBGEMM/pull/4816
    - https://github.com/pytorch/FBGEMM/pull/4820
    - https://github.com/pytorch/FBGEMM/pull/4821
    - https://github.com/pytorch/FBGEMM/pull/4823
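For context, a rough sketch of the mxfp8 format these inputs use: each row is scaled per 32-element block along K with a power-of-two (e8m0-style) scale. This is illustrative only; the exact scale shapes, layouts, and the `torch._scaled_grouped_mm` call signature are defined by the unit tests and the FBGEMM kernels, not by this snippet.
```
import torch

def to_mxfp8(t: torch.Tensor, block: int = 32):
    # Per-block power-of-two scales (the MX spec stores them as e8m0 exponents).
    m, k = t.shape
    blocks = t.reshape(m, k // block, block)
    amax = blocks.abs().amax(dim=-1, keepdim=True).clamp(min=2**-126)
    shared_exp = torch.floor(torch.log2(amax)) - 8  # 8 = exponent of the e4m3 max (448)
    scale = torch.exp2(shared_exp)
    data = (blocks / scale).to(torch.float8_e4m3fn)
    return data.reshape(m, k), scale.reshape(m, k // block)

a = torch.randn(256, 128)
a_fp8, a_scale = to_mxfp8(a)
print(a_fp8.dtype, a_scale.shape)  # torch.float8_e4m3fn (256, 4)
```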
#### How fbgemm dependency was bumped
Documenting this since I haven't found it documented elsewhere:
- `cd ~/pytorch/third_party/fbgemm`
- `git fetch`
- `git checkout <hash>`
- `cd ~/pytorch`
- `git add third_party/fbgemm`
## Test plan
#### Test build
```
USE_FBGEMM_GENAI=1 python -m pip install --no-build-isolation -v -e .
...
Successfully installed torch-2.9.0a0+gitf5070f3
```
[full build log](https://www.internalfb.com/phabricator/paste/view/P1933787581)
#### Unit tests
```
pytest test/test_matmul_cuda.py -k test_mxfp8_scaled_grouped_mm_
...
test/test_matmul_cuda.py ......... [100%]
============================================================== 9 passed, 1668 deselected in 5.34s ===============================================================
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162209
Approved by: https://github.com/ngimel
Summary: Fix https://github.com/pytorch/pytorch/issues/159612
- Fix the meta implementation of `nan_to_num` so that it preserves the stride of the input
- The DeviceCopy IR node should always preserve the input's layout, so we don't end up with a contiguous() call during the device copy
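A small illustration of the meta-kernel behavior the first bullet fixes (a hypothetical repro, not the actual test):
```
import torch

# A non-contiguous input: shape (8, 16), strides (1, 8) after the transpose.
x = torch.randn(16, 8, device="meta").t()
y = torch.nan_to_num(x)
# With the fix, the meta result keeps the input's strides instead of
# being materialized as a contiguous tensor.
print(x.stride(), y.stride())
```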
Test Plan:
```
buck2 run @mode/dev-nosan fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_d2h_copy
```
Differential Revision: D79411407
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159615
Approved by: https://github.com/eellison
The Python dispatcher is not always enabled under fake tensors and has to be enabled explicitly. While it arguably should be on by default, making that change requires some work to get all tests passing.
I have run into several issues (e.g. XLA, Helom, etc.) where I had to add `enable_python_dispatcher` to work around this; for the view op specifically, I moved it into the fake tensor implementation.
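A minimal sketch of the workaround being folded into the fake-tensor view implementation (`enable_python_dispatcher` is the existing context manager in `torch._dispatch.python`; the surrounding code is illustrative):
```
import torch
from torch._dispatch.python import enable_python_dispatcher
from torch._subclasses.fake_tensor import FakeTensorMode

with FakeTensorMode():
    x = torch.empty(4, 8)
    # Previously callers had to wrap view calls like this themselves;
    # this PR moves the enabling into the fake tensor view implementation.
    with enable_python_dispatcher():
        y = x.view(8, 4)
print(y.shape)  # torch.Size([8, 4])
```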
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158406
Approved by: https://github.com/bobrenjc93
When `select` has a data-dependent index, we can't tell whether the effective index will be `index + size` or `index` (i.e. whether the index is negative). To avoid throwing a data-dependent error (DDE), we allocate a new unbacked symbol to represent the storage offset of the output view and compute its value dynamically at runtime when Inductor lowers the graph.
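A sketch of the kind of data-dependent `select` this enables under `torch.compile` (illustrative only; the index comes from `.item()`, so its sign, and hence the storage offset, is unknown at trace time):
```
import torch
import torch._dynamo

torch._dynamo.config.capture_scalar_outputs = True

@torch.compile(fullgraph=True)
def f(x, idx):
    i = idx.item()                 # unbacked SymInt, sign unknown at trace time
    torch._check(i >= -x.size(0))
    torch._check(i < x.size(0))
    # The storage offset of this view depends on whether i is negative;
    # it is represented by a fresh unbacked symbol and computed at runtime.
    return x.select(0, i) * 2

x = torch.randn(8, 4)
print(f(x, torch.tensor(3)))
print(f(x, torch.tensor(-3)))
```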
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157605
Approved by: https://github.com/ColinPeppler
**Summary**
Enable fp8 qconv on CPU. This is part of the plan to enable fp8 static quantization on CPU. This PR only adds FP8 support to the existing int8 qconv op; it does not add a new op, nor does it affect the frontend or quantization flow, and the schema of the qconv op is unchanged.
The FP8 qconv therefore shares the same op as the INT8 qconv; the only difference is that the src/wei dtype is fp8 instead of int8. The output dtype can be fp8/float32/bfloat16. The implementation uses the oneDNN library.
Note:
oneDNN does not support quantized fp8 convolution until v3.9, but the version used in PyTorch is v3.7.2, so the op goes to the reference kernel for now. We have also updated the oneDNN path so that it's compatible with the fp8 dtype; once oneDNN is upgraded to v3.9 or newer, only minimal changes are needed to enable the oneDNN path. We have verified that the behavior of the reference kernel matches the new oneDNN implementation.
- oneDNN version < 3.9 (now)
  - Always go to the reference kernel
- oneDNN version >= 3.9 (future)
  - Go to the reference kernel on old platforms (without AMX)
  - Use oneDNN on new platforms (with AMX)
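For intuition, a sketch of what the reference-kernel path computes (per-tensor scales shown for brevity; this is not the actual qconv op or its schema, just the quantize → compute → output dataflow):
```
import torch
import torch.nn.functional as F

def fp8_conv2d_ref(x, w, x_scale, w_scale, bias=None, out_dtype=torch.float32):
    # Quantize src/wei to fp8 (e4m3), then dequantize and convolve in fp32,
    # mirroring what a reference kernel does before oneDNN v3.9 support lands.
    x_fp8 = (x / x_scale).to(torch.float8_e4m3fn)
    w_fp8 = (w / w_scale).to(torch.float8_e4m3fn)
    out = F.conv2d(x_fp8.float() * x_scale, w_fp8.float() * w_scale, bias)
    return out.to(out_dtype)  # output dtype can be fp8/float32/bfloat16

x = torch.randn(1, 3, 8, 8)
w = torch.randn(4, 3, 3, 3)
print(fp8_conv2d_ref(x, w, x_scale=0.1, w_scale=0.05, out_dtype=torch.bfloat16).shape)
```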
**Test plan**
```
pytest test/quantization/core/test_quantized_op.py -k "qconv and fp8"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157076
Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168