pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-06 00:20:18 +01:00

Author	SHA1	Message	Date
atalman	a25818cf7e	Fix image display on pypi project description section (#166404 ) Fixes https://github.com/pytorch/pytorch/issues/165559 Pull Request resolved: https://github.com/pytorch/pytorch/pull/166404 Approved by: https://github.com/malfet, https://github.com/Skylion007, https://github.com/Camyll	2025-10-28 18:58:24 +00:00
Elana	e3e93c7107	[MPS] Fix random in-place ops on non-contiguous tensors (#165267 ) Random in-place operations (normal_, uniform_, exponential_, bernoulli_, random_) were silently failing on non-contiguous tensors on macOS < 15.0. * Added needsGather check and scatter-back logic to handle non-contiguous output tensors, following the pattern used in PointwiseOps. * Adds test to confirm these now work * Remove pre-macOS15 xfail for test_Dropout Fixes #165257 and #124029 Pull Request resolved: https://github.com/pytorch/pytorch/pull/165267 Approved by: https://github.com/kulinseth, https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-10-28 18:43:22 +00:00
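The gather and scatter-back pattern mentioned in the commit above can be illustrated without any GPU code. The following is a minimal pure-Python sketch, not the MPS implementation: `strided_offsets` and `fill_uniform_inplace` are hypothetical helper names, and a flat list stands in for tensor storage.

```python
import random


def strided_offsets(start, stride, length):
    """Storage offsets touched by a 1-D strided (non-contiguous) view."""
    return [start + i * stride for i in range(length)]


def fill_uniform_inplace(storage, offsets, rng):
    """Fill a non-contiguous view with uniform samples via gather/scatter."""
    # Gather: copy the view's elements into a contiguous temporary.
    tmp = [storage[o] for o in offsets]
    # Run the "kernel" on the contiguous temporary only.
    for i in range(len(tmp)):
        tmp[i] = rng.random()
    # Scatter back: write the results to the original strided locations.
    for o, value in zip(offsets, tmp):
        storage[o] = value


storage = [-1.0] * 6                      # -1.0 marks untouched elements
view = strided_offsets(start=0, stride=2, length=3)
fill_uniform_inplace(storage, view, random.Random(0))
```

After the call, only offsets 0, 2 and 4 hold samples; the elements in between keep their sentinel value, which is the behaviour a non-contiguous in-place op is expected to have.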
Nikita Shulga	1abfa5f70b	[EZ][MPS] Improve distribution error checking (#166425 ) Essentially, disallow ops on self-overlapping outputs by adding the `at::assert_no_internal_overlap(self);` check that is already used in CPU and CUDA builds, see `895795f07c/aten/src/ATen/native/DistributionTemplates.h (L366)` This fixes `test_error_inputs_bernoulli_mps`. Should be landed ahead of https://github.com/pytorch/pytorch/pull/165267 Pull Request resolved: https://github.com/pytorch/pytorch/pull/166425 Approved by: https://github.com/Skylion007, https://github.com/seemethere	2025-10-28 18:42:12 +00:00
Bin Bao	687c15c0b3	[AOTI][BE] Change test_aoti_inference to one-pass build (#164277 ) Summary: To fix https://github.com/pytorch/pytorch/issues/159400. Currently, test_aoti_abi_check and test_aoti_inference need to be built in two passes: first build pytorch using the regular `python setup.py develop` and then build with `CMAKE_FRESH=1 BUILD_AOT_INDUCTOR_TEST=1 python setup.py develop`. This is cumbersome. Fix by rewriting CMakeLists.txt for test_aoti_inference to a one-pass build which runs AOTI to compile models at test time. Also update the CI test script to get rid of the two-pass build. test_aoti_abi_check is not AOTI specific, so we no longer guard it by BUILD_AOT_INDUCTOR_TEST. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164277 Approved by: https://github.com/janeyx99	2025-10-28 17:43:22 +00:00
Jeff Daily	895795f07c	[ROCm][CI] forward fix kineto submodule bump (#166421 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/166421 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-10-28 17:40:23 +00:00
Aaron Orenstein	2dc56456cb	refactor: pull _replace_node common functionality out of Scheduler.finalize_multi_template_buffers (#163368 ) Pull replace_node function out of Scheduler.finalize_multi_template_buffers(). This is needed by the next PR (#163369). As part of this also pull the _replace_operation_buffer() up to top-level since it needed no self references. Pull Request resolved: https://github.com/pytorch/pytorch/pull/163368 Approved by: https://github.com/PaulZhang12	2025-10-28 17:21:52 +00:00
Edward Yang	8110ce02a2	Add a skill for writing skills (#166266 ) Apparently, if you just ask Claude to write a skill it doesn't follow the correct rules. So this one is just the official docs for skills. Signed-off-by: Edward Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/166266 Approved by: https://github.com/Skylion007 ghstack dependencies: #166265	2025-10-28 16:49:27 +00:00
Edward Yang	43c30f607e	Use correct layout convention for skills (#166265 ) Signed-off-by: Edward Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/166265 Approved by: https://github.com/Skylion007	2025-10-28 16:49:27 +00:00
Simon Layton	5ebf74a655	[2/2] Move scaled_mm routines to their own file (#166314 ) Summary: * Further simplify `ATen/native/cuda/Blas.cpp` by moving `_scaled_mm`, `_scaled_mm_v2` and supporting methods to a new file, `ATen/native/cuda/ScaledBlas.cpp` Test Plan: ``` pytest -svv test/test_matmul_cuda.py pytest -svv test/test_scaled_matmul_cuda.py ``` Reviewers: Subscribers: Tasks: Tags: Signed-off-by: Simon Layton <simonlayton@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/166314 Approved by: https://github.com/eqy ghstack dependencies: #166313	2025-10-28 16:35:32 +00:00
Simon Layton	acd936cc1a	[1/2] Split `cublasCommonArgs` into its own file (#166313 ) Summary: * Factor out `cublasCommonArgs` struct * Necessary for factoring out scaled mm routines Test Plan: ``` pytest -svv test/test_matmul_cuda.py pytest -svv test/test_scaled_matmul_cuda.py ``` Reviewers: Subscribers: Tasks: Tags: Signed-off-by: Simon Layton <simonlayton@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/166313 Approved by: https://github.com/eqy, https://github.com/Skylion007	2025-10-28 16:35:32 +00:00
PyTorch MergeBot	a4a0378e6b	Revert "[cuDNN] Smoke-test runtime cuDNN version matches compile time version in CI (#165922 )" This reverts commit `2a5f87decf`. Reverted https://github.com/pytorch/pytorch/pull/165922 on behalf of https://github.com/atalman due to cudnn update started to fail, see https://github.com/pytorch/pytorch/pull/165913#issuecomment-3457293475 ([comment](https://github.com/pytorch/pytorch/pull/165922#issuecomment-3457389406))	2025-10-28 16:29:29 +00:00
Prachi Gupta	ac841267a1	[ROCm] skip AsyncTP test class as AsyncTP is not supported on ROCm (#166316 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/166316 Approved by: https://github.com/jeffdaily	2025-10-28 16:23:46 +00:00
PyTorch MergeBot	0eacd934bc	Revert "Update cuDNN 9.10.2 in Manylinux 2.28 Docker files (#165913 )" This reverts commit `840d63c12d`. Reverted https://github.com/pytorch/pytorch/pull/165913 on behalf of https://github.com/clee2000 due to I think something here is causing CI tests to segfault at exit on cuda, ex [GH job link](https://github.com/pytorch/pytorch/actions/runs/18857880394/job/53811917713) [HUD commit link](`9a91486e45`) says no tests failed but it segfaulted afterwards. I can't tell if it's because of this change, or an unpinned dependency in docker that got triggered by this. Note to self, would have been bad TD except trunk didn't run either ([comment](https://github.com/pytorch/pytorch/pull/165913#issuecomment-3457293475))	2025-10-28 16:11:07 +00:00
drisspg	5016e7b2eb	[FlexAttention] Add mechanism to get optimal autotune decision (#165817 ) Script: https://github.com/meta-pytorch/attention-gym/pull/169 Feels directionally okay but there is some bike shedding / this could be quite prone to collision of keys depending on mask mod and score mod changes and simple cache key. Usecase: https://github.com/meta-pytorch/attention-gym/pull/169 Pull Request resolved: https://github.com/pytorch/pytorch/pull/165817 Approved by: https://github.com/Chillee	2025-10-28 15:50:12 +00:00
Ting Lu	544b443ea1	[CD] Upgrade to CUDA 13.0.2 for nightly binaries (#165470 ) 13.0.U2 is posted, adding to nightlies Why we want to upgrade: CUDA 13.0.U2 included a new release from cuBLAS that 1. Enabled opt-in fixed-point emulation for FP64 matmuls (D/ZGEMM) which improves performance and power-efficiency. 2. Improved performance on NVIDIA [DGX Spark](https://www.nvidia.com/en-us/products/workstations/dgx-spark/) for FP16/BF16 and FP8 GEMMs. 3. adds BF16x9 FP32 emulation support for SYRK and HERK routines. Reference: https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#cublas-release-13-0-update-2 Pull Request resolved: https://github.com/pytorch/pytorch/pull/165470 Approved by: https://github.com/atalman	2025-10-28 15:14:43 +00:00
johannes	3041ede082	Improve eig tests in preparation for new eig backends (#166322 ) ### Summary Improves validation of `torch.linalg.eig` results by verifying the eigen decomposition identity A v − v λ = 0. ### Motivation Eigenvectors are not unique, and numerical differences between backends (cuSOLVER, MAGMA, CPU) can cause false test failures. This PR replaces direct elementwise comparisons with a mathematical identity check, improving robustness across devices. ### Details - Introduces `fulfills_eigen_decomposition_identity()` in `test_eig_compare_backends()` to validate the eigen equation. - Uses CPU matmul for high-precision verification. - Handles zero-sized matrices explicitly. - Tolerances derived from numerical comparisons between cuSOLVER and NumPy. See discussion: [dev-discuss.pytorch.org link](https://dev-discuss.pytorch.org/t/cusolver-dnxgeev-faster-cuda-eigenvalue-calculations/3248/6) ### Impact - Improves test stability and correctness across eig backends. - No change to public API. - All tests pass; lintrunner reports no issues. - Enables introduction of new eig backends without false test failures. Pull Request resolved: https://github.com/pytorch/pytorch/pull/166322 Approved by: https://github.com/lezcano	2025-10-28 14:42:47 +00:00
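The identity check described in the commit above (A v − v λ = 0) can be sketched in a few lines. This is an illustrative pure-Python version, not the `fulfills_eigen_decomposition_identity()` helper from the PR; the function names, the list-of-rows matrix layout and the tolerance are assumptions made for the example.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def satisfies_eig_identity(a, eigenvalues, eigenvectors, tol=1e-9):
    """Check A V - V diag(lambda) == 0 within tol; columns of V are eigenvectors."""
    n = len(a)
    if n == 0:  # zero-sized matrices satisfy the identity trivially
        return True
    av = matmul(a, eigenvectors)
    residual = max(
        abs(av[i][j] - eigenvectors[i][j] * eigenvalues[j])
        for i in range(n)
        for j in range(n)
    )
    return residual <= tol


A = [[2.0, 1.0], [1.0, 2.0]]
lam = [3.0, 1.0]
V1 = [[1.0, 1.0], [1.0, -1.0]]        # one valid set of eigenvectors
V2 = [[-2.0, 0.5], [-2.0, -0.5]]      # same eigenvectors, rescaled per column

ok_v1 = satisfies_eig_identity(A, lam, V1)
ok_v2 = satisfies_eig_identity(A, lam, V2)
bad = satisfies_eig_identity(A, [3.0, 2.0], V1)  # wrong second eigenvalue
```

`V1` and `V2` differ elementwise, so a direct comparison would report a mismatch, yet both satisfy the identity. That is the reason an identity check is more robust across backends than comparing eigenvectors entry by entry.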
Sherlock Huang	34d6ef7022	Update gm.print_readable to include Annotation (#165397 ) Sample output ``` [rank0]: # Annotation: {'compile_with_inductor': 'flex_attention'} File: /data/users/bahuang/pytorch/torch/nn/attention/flex_attention.py:1490 in flex_attention, code: out, lse, max_scores = flex_attention_hop( [rank0]: score_mod_2 = self.score_mod_2 [rank0]: mask_fn_2 = self.mask_fn_2 [rank0]: flex_attention_1 = torch.ops.higher_order.flex_attention(xq_5, xk_5, xv_3, score_mod_2, (2048, 2048, g____import_torchtitan_dot_models_dot_attention___flex_attention_block_masks___block_causal___none___kv_num_blocks, g____import_torchtitan_dot_models_dot_attention___flex_attention_block_masks___block_causal___none___kv_indices, g____import_torchtitan_dot_models_dot_attention___flex_attention_block_masks___block_causal___none___full_kv_num_blocks, g____import_torchtitan_dot_models_dot_attention___flex_attention_block_masks___block_causal___none___full_kv_indices, g____import_torchtitan_dot_models_dot_attention___flex_attention_block_masks___block_causal___none___q_num_blocks, g____import_torchtitan_dot_models_dot_attention___flex_attention_block_masks___block_causal___none___q_indices, g____import_torchtitan_dot_models_dot_attention___flex_attention_block_masks___block_causal___none___full_q_num_blocks, g____import_torchtitan_dot_models_dot_attention___flex_attention_block_masks___block_causal___none___full_q_indices, 128, 128, mask_fn_2), 0.25, {'PRESCALE_QK': False, 'ROWS_GUARANTEED_SAFE': False, 'BLOCKS_ARE_CONTIGUOUS': False, 'WRITE_DQ': True, 'OUTPUT_LOGSUMEXP': True, 'OUTPUT_MAX': False}, (), (g____import_torchtitan_dot_models_dot_attention___flex_attention_block_masks___block_causal___none___mask_mod___closure___0_cell_contents,)); xq_5 = xk_5 = xv_3 = score_mod_2 = mask_fn_2 = None [rank0]: out_2: "bf16[8, 4, 2048, 16]" = flex_attention_1[0]; flex_attention_1 = None ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/165397 Approved by: 
https://github.com/yushangdi, https://github.com/anijain2305, https://github.com/mlazos	2025-10-28 13:54:38 +00:00
PyTorch MergeBot	110efe4df4	Revert "[inductor][choices] lookup table choices 1/3 (#164978 )" This reverts commit `b44423bbb4`. Reverted https://github.com/pytorch/pytorch/pull/164978 on behalf of https://github.com/atalman due to failing internal test on newly added tests: Test when there's no lookup table entry with different autotune modes ([comment](https://github.com/pytorch/pytorch/pull/164978#issuecomment-3456400126))	2025-10-28 13:12:55 +00:00
Roman Krasavtsev	e137cd0a10	docs: fix typos (#164879 ) Correct typos in the comments Pull Request resolved: https://github.com/pytorch/pytorch/pull/164879 Approved by: https://github.com/Lucaskabela, https://github.com/mlazos, https://github.com/cyyever	2025-10-28 12:00:36 +00:00
Shivam Raikundalia	be28329710	[Pytorch] Update Kineto Submodule (#166317 ) Summary: Update Submodule Test Plan: CI Differential Revision: D85579130 Pull Request resolved: https://github.com/pytorch/pytorch/pull/166317 Approved by: https://github.com/Skylion007	2025-10-28 10:41:17 +00:00
Minjang Kim	85a7c745aa	[triton][nativert] Add num_cpu_threads for triton-cpu (#166255 ) Summary: The new triton-cpu has `num_cpu_threads` like `num_warps`, which are auto-tunable. This diff adds `num_cpu_threads` to NativeRT. Differential Revision: D85515240 Pull Request resolved: https://github.com/pytorch/pytorch/pull/166255 Approved by: https://github.com/XueningXu	2025-10-28 08:40:04 +00:00
William Wen	32fe4f681e	[dynamo] fix keyerror in resume_execution (again) (#166040 ) Fixes https://github.com/pytorch/pytorch/issues/166176 The error I attempted to fix in https://github.com/pytorch/pytorch/pull/162318 was still appearing internally. Surprised that this wasn't caught anywhere 😰 Pull Request resolved: https://github.com/pytorch/pytorch/pull/166040 Approved by: https://github.com/Lucaskabela ghstack dependencies: #166036	2025-10-28 07:04:29 +00:00
William Wen	ebb2b2e894	[dynamo] fix store attr graph break in with block (#166036 ) Fixes https://github.com/pytorch/pytorch/issues/166033 Differential Revision: [D85198055](https://our.internmc.facebook.com/intern/diff/D85198055) Pull Request resolved: https://github.com/pytorch/pytorch/pull/166036 Approved by: https://github.com/Lucaskabela	2025-10-28 07:04:29 +00:00
KarhouTam	13413b3b07	[AMP][Refactor] Autocast dtype handling to simplify device-specific c… (#165221 ) This PR refactors the autocast context manager in autocast_mode.py to simplify and centralize the logic for checking supported dtypes for each device. The previous implementation repeated similar checks for multiple device types. Now, a single mapping device_supported_dtypes is used to associate device types with their supported dtypes, and the validation logic is unified. The former PR #163446 was merged but reverted due to a failed CI test in the `openreg` related tests. This PR additionally modifies some test assertions slightly so that the CI tests pass. CI failed because an assertion required exactly the same error message. For example: ``` File "/var/lib/jenkins/workspace/test/cpp_extensions/open_registration_extension/torch_openreg/tests/test_autocast.py", line 9, in test_autocast_with_unsupported_type with self.assertWarnsRegex( AssertionError: "In openreg autocast, but the target dtype torch.float32 is not supported." does not match "In openreg autocast, but the target dtype is not supported. Disabling autocast." ``` Sorry for the inconvenience again. Pull Request resolved: https://github.com/pytorch/pytorch/pull/165221 Approved by: https://github.com/albanD	2025-10-28 06:21:29 +00:00
Shunting Zhang	5d0b3e28dc	[inductor] generate fused rms/layer norm bwd (#165370 ) RMS/Layer norm backward generates 2 kinds of reductions: - the reduction computing dx, which reduces across the hidden dimension (in the context of a transformer) - the reduction computing dw/db, which reduces across the BxT (batch size, sequence length) dimension. These 2 sets of reductions have common input buffers, but Inductor cannot fuse them because of their different loop orders. There are multiple sources of custom kernels that implement a fused version of such a kernel (Liger-Kernel, quack, Paul Zhang's internal post). This PR enables Inductor to generate such kernels automatically. The generated kernel is very similar to `33924d20b6/src/liger_kernel/ops/rms_norm.py (L114)` . To keep the implementation simple and performant, we enable such fusion only if the inner reduction (computing dx) is a persistent reduction. This should be true for representative inputs. Persistent reduction is critical for the perf here, to make sure a loaded tensor does not need to be reloaded. To make sure the inner reduction (computing dx) and the outer reductions (computing dw/db) are fusible, the PR does the following: 1. convert the outer reductions to pointwise by replacing the 'reduction' & 'store_reduction' nodes with a new type of node, 'partial_accumulate'. The new node collects the reduction type, buffer name, input of the reduction etc., which is essential for proper codegen. 2. do loop reordering (relying on the earlier loop-ordering-after-fusion work) to reorder the loops of the converted pointwise so it can be fused with the inner reduction 3. there can be epilogues that need to be added at the end. E.g. the outer reduction may be followed by a division for mean, or followed by a downcast if dw/db is in low precision (fp16/bf16). Some early benchmarking on H100 shows about 2X speedup for both RMSNorm and LayerNorm backward for shape (1152 * 500, 384) used in some internal model.
Note that, I manually disable split reduction in this benchmarking since otherwise the fusion will be skipped right now. The next PR will make the mix-order-reduction compose better with split reduction Pull Request resolved: https://github.com/pytorch/pytorch/pull/165370 Approved by: https://github.com/jansel ghstack dependencies: #166204	2025-10-28 05:53:52 +00:00
Banit Agrawal	9139368b64	[PyTorch] Use events from pool in copy_device_to_device (#165647 ) Summary: In this diff, we add an event pool so that we don't have to create/destroy events all the time and can instead re-use events from the pool. Test Plan: contbuild Differential Revision: D84685495 Pull Request resolved: https://github.com/pytorch/pytorch/pull/165647 Approved by: https://github.com/bbus	2025-10-28 05:19:05 +00:00
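The idea of re-using events from a pool instead of creating and destroying them on every copy can be shown with a generic object pool. This is a hedged sketch only: `EventPool`, `acquire` and `release` are hypothetical names, and plain Python objects stand in for device events.

```python
class EventPool:
    """Tiny free-list pool; `factory` creates a new event when none is free."""

    def __init__(self, factory):
        self._factory = factory
        self._free = []
        self.created = 0  # how many events were actually constructed

    def acquire(self):
        # Prefer a previously released event over constructing a new one.
        if self._free:
            return self._free.pop()
        self.created += 1
        return self._factory()

    def release(self, event):
        # Return the event to the pool instead of destroying it.
        self._free.append(event)


pool = EventPool(object)   # `object` stands in for an event constructor
first = pool.acquire()
pool.release(first)
second = pool.acquire()    # re-uses `first`, no new construction
```

A real pool for device events would also need to consider thread safety and per-device free lists; those concerns are left out of this sketch.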
Animesh Jain	02095cc09d	[dynamo] Dont guard on getset descriptors for torch_function (#166346 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/166346 Approved by: https://github.com/mlazos ghstack dependencies: #166329	2025-10-28 04:33:56 +00:00
Animesh Jain	65868156c6	[dynamo] Guard selectively on the torch APIs (#166329 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/166329 Approved by: https://github.com/Lucaskabela	2025-10-28 04:33:56 +00:00
Zhengxu Chen	f93ea7dab1	[export] Update dynamo_graph_capture_for_export to return GraphModule. (#166091 ) Make dynamo_graph_capture_for_export return a more compatible GraphModule object which is closer to the original behavior of dynamo Pull Request resolved: https://github.com/pytorch/pytorch/pull/166091 Approved by: https://github.com/tugsbayasgalan	2025-10-28 04:23:28 +00:00
Nichols A. Romero	a77f5d9a00	[ROCm] Use a ROCm version string without hash. (#166336 ) Fixes #166068 Use the ROCm version string that does not contain a hash. The string is set in LoadHIP.cmake. Tested on repro provided by reporter. For a ROCm 7.0 docker container, we get `7.0.0`. For a ROCm 7.0.2 docker container, we get `7.0.2`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/166336 Approved by: https://github.com/jeffdaily	2025-10-28 03:53:55 +00:00
Janani Sriram	ff46d5a79b	[Inductor][Triton][FP8] Support deepseek-style scaling in Inductor (#164404 ) Summary: Support deepseek-style scaling in Inductor Triton for FP8 GEMMs. DeepSeek-style scaling is a colloquial term for a fine-grained mixed precision framework using FP8 to train [Deepseek-V3](https://arxiv.org/pdf/2412.19437), DeepSeek AI's recent MoE (Mixture of Experts) model. DeepSeek-style scaling effectively extends the dynamic range of FP8 by mitigating dequantization overhead under increased-precision accumulation, which is key to achieving more accurate FP8 GEMM results. DeepSeek-style scaling on matmul `A @ B` leverages two different types of scaling strategies to preserve a balance between numerical stability and training efficiency: - Activations (input tensor `A`): tile-wise (1x128 across shape `(M, K)`) - Weights (input tensor `B`): block-wise (128x128 across shape `(N, K)`) This diff enables Inductor users to replicate past successes with deepseek-style scaling and achieve higher numerical stability while increasing training efficiency. NOTE: Block-wise 128x128 scaling is only supported in CUDA 12.9+; therefore, deepseek-style scaling is currently unsupported in `fbcode` (CUDA 12.4). Use OSS PyTorch to run deepseek-style scaling. NOTE: Accuracy for FP8 is unstable, even with high tolerances, which is why TritonBench benchmarks are unlikely to be accurate against a `torch` implementation. 
Test Plan: In OSS PyTorch, run ``` TORCHINDUCTOR_CACHE_DIR=~/personal/cache_dir_inductor CUDA_LAUNCH_BLOCKING=1 TORCH_USE_CUDA_DSA=1 TRITON_PRINT_AUTOTUNING=1 TRITON_ALWAYS_COMPILE=1 TORCH_LOGS=+inductor TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 ENABLE_PERSISTENT_TMA_MATMUL=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 python run.py --op fp8_gemm --only torch_fp8_gemm,pt2_fp8_gemm --metrics tflops,accuracy --m 4096 --n 768 --k 512 --output="{output_dir}/deepseek_bench.csv" --scaling_deepseek --atol=1e-2 --rtol=0.5 2>&1 \| tee ~/personal/deepseek_style/deepseek_bench.log ``` Differential Revision: D83609850 Pull Request resolved: https://github.com/pytorch/pytorch/pull/164404 Approved by: https://github.com/slayton58	2025-10-28 03:38:54 +00:00
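To make the two scaling granularities in the commit above concrete, the sketch below computes how many scale factors each operand needs and derives per-tile scales for one row. It is an illustration under stated assumptions, not Inductor's kernel: the helper names are hypothetical, the FP8 format is assumed to be e4m3 (largest finite value 448), and `amax / FP8_E4M3_MAX` is just one common scale convention.

```python
import math

FP8_E4M3_MAX = 448.0  # assumption: e4m3 format, largest finite value


def scale_grid_shape(rows, cols, block_rows, block_cols):
    """Number of scale factors needed per dimension for a given block size."""
    return (math.ceil(rows / block_rows), math.ceil(cols / block_cols))


def tile_scales(row, tile=128):
    """One scale per `tile` consecutive elements of a row (1 x tile scaling)."""
    scales = []
    for start in range(0, len(row), tile):
        amax = max(abs(x) for x in row[start:start + tile])
        scales.append(amax / FP8_E4M3_MAX)
    return scales


# Shapes taken from the benchmark command in the test plan.
M, N, K = 4096, 768, 512
a_scale_shape = scale_grid_shape(M, K, 1, 128)     # activations A: (M, K), 1x128 tiles
b_scale_shape = scale_grid_shape(N, K, 128, 128)   # weights B: (N, K), 128x128 blocks

demo_scales = tile_scales([448.0, -896.0, 1.0, 2.0], tile=2)
```

For these shapes, `A` carries one scale per row per 128 columns while `B` carries one scale per 128x128 block, so the weight scales are far fewer than the activation scales.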
William Wen	f452edd782	[dynamo, 3.14] fix misc. bugs to get most dynamo unittests passing locally in 3.14 (#164631 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/164631 Approved by: https://github.com/Lucaskabela, https://github.com/mlazos	2025-10-28 03:24:22 +00:00
William Wen	ea698e8bfc	[dynamo, nested graph breaks] disallow nested graph breaks in HOPs (#166016 ) As discussed offline with @ydwu4, we should not allow nested graph breaks in HOPs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/166016 Approved by: https://github.com/Lucaskabela ghstack dependencies: #166013, #166015, #165808, #165809	2025-10-28 03:03:38 +00:00
William Wen	7f7a28046b	[dynamo, nested graph breaks] disable nested graph breaks in generators; enable nested_graph_breaks in test_ctx_manager.py and test_generator.py (#165809 ) Generators should not support nested graph breaks. Pull Request resolved: https://github.com/pytorch/pytorch/pull/165809 Approved by: https://github.com/Lucaskabela, https://github.com/guilhermeleobas ghstack dependencies: #166013, #166015, #165808	2025-10-28 03:03:37 +00:00
William Wen	d8283a317a	[dynamo, nested graph breaks] fix RETURN_VALUE tx skipping in nested graph breaks (#165808 ) Previously, we would completely skip building and calling any resume function if the leaf frame's resume instruction was RETURN_VALUE/RETURN_CONST. Now, we only skip building/calling resume functions for frames that are resuming on RETURN_VALUE. Pull Request resolved: https://github.com/pytorch/pytorch/pull/165808 Approved by: https://github.com/Lucaskabela ghstack dependencies: #166013, #166015	2025-10-28 03:03:37 +00:00
William Wen	e0ca3049c0	[dynamo, nested graph breaks] remove _dynamo.utils.counter patch on inlined tx'es (#166015 ) This `patch.dict(counters, ...` appears to be ancient code that doesn't really seem to be doing anything? It causes issues in nested graph breaks because the patch cleanup clears out the record of the nested graph break. Removing the patch to see if it's even needed in the first place. Pull Request resolved: https://github.com/pytorch/pytorch/pull/166015 Approved by: https://github.com/Lucaskabela ghstack dependencies: #166013	2025-10-28 03:03:37 +00:00
William Wen	8417981c96	[dynamo, nested graph breaks] add TestCaseWithNestedGraphBreaks subclass (#166013 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/166013 Approved by: https://github.com/Lucaskabela	2025-10-28 03:03:37 +00:00
Simon Fan	06e71c8558	[hop] local_map MoE: fix unbacked symints during tracing and symint activations order in the wrapper (#165551 ) This PR fixes 2 issues with local_mapping token-choice moe. Splits from the fw token dispatch result in tensors with unbacked shapes and these unbacked shapes are fully contained in the a2as, and should not leak outside of the joint graph. The HOP body fw and bw are expected to coerce back to static shapes (due to adding it with shared experts output) before returning. ```python routed_output: "bf16[u0 + u1 + u10 + u11 + u12 + u13 + u14 + u15 + u16 + u17 + u18 + u19 + u2 + u20 + u21 + u22 + u23 + u24 + u25 + u26 + u27 + u28 + u29 + u3 + u30 + u31 + u32 + u33 + u34 + u35 + u36 + u37 + u38 + u39 + u4 + u40 + u41 + u42 + u43 + u44 + u45 + u46 + u47 + u48 + u49 + u5 + u50 + u51 + u52 + u53 + u54 + u55 + u56 + u57 + u58 + u59 + u6 + u60 + u61 + u62 + u63 + u7 + u8 + u9, 2048]" = torch.ops.higher_order.autograd_function_apply(fwd_body_1, bwd_body_1, out_1, item, item_1, item_2, item_3, item_4, item_5, item_6, item_7, item_8, item_9, item_10, item_11, item_12, item_13, item_14, item_15, item_16, item_17, item_18, item_19, item_20, item_21, item_22, item_23, item_24, item_25, item_26, item_27, item_28, item_29, item_30, item_31, item_32, item_33, item_34, item_35, item_36, item_37, item_38, item_39, item_40, item_41, item_42, item_43, item_44, item_45, item_46, item_47, item_48, item_49, item_50, item_51, item_52, item_53, item_54, item_55, item_56, item_57, item_58, item_59, item_60, item_61, item_62, item_63, item_64, item_65, item_66, item_67, item_68, item_69, item_70, item_71, item_72, item_73, item_74, item_75, item_76, item_77, item_78, item_79, item_80, item_81, item_82, item_83, item_84, item_85, item_86, item_87, item_88, item_89, item_90, item_91, item_92, item_93, item_94, item_95, item_96, item_97, item_98, item_99, item_100, item_101, item_102, item_103, item_104, item_105, item_106, item_107, item_108, item_109, item_110, 
item_111, item_112, item_113, item_114, item_115, item_116, item_117, item_118, item_119, item_120, item_121, item_122, item_123, item_124, item_125, item_126, item_127, args_tensor_mask = [True, False, False, False], non_differentiable_idx = []); fwd_body_1 = bwd_body_1 = out_1 = item = item_1 = item_2 = item_3 = item_4 = item_5 = item_6 = item_7 = item_8 = item_9 = item_10 = item_11 = item_12 = item_13 = item_14 = item_15 = item_16 = item_17 = item_18 = item_19 = item_20 = item_21 = item_22 = item_23 = item_24 = item_25 = item_26 = item_27 = item_28 = item_29 = item_30 = item_31 = item_32 = item_33 = item_34 = item_35 = item_36 = item_37 = item_38 = item_39 = item_40 = item_41 = item_42 = item_43 = item_44 = item_45 = item_46 = item_47 = item_48 = item_49 = item_50 = item_51 = item_52 = item_53 = item_54 = item_55 = item_56 = item_57 = item_58 = item_59 = item_60 = item_61 = item_62 = item_63 = item_64 = item_65 = item_66 = item_67 = item_68 = item_69 = item_70 = item_71 = item_72 = item_73 = item_74 = item_75 = item_76 = item_77 = item_78 = item_79 = item_80 = item_81 = item_82 = item_83 = item_84 = item_85 = item_86 = item_87 = item_88 = item_89 = item_90 = item_91 = item_92 = item_93 = item_94 = item_95 = item_96 = item_97 = item_98 = item_99 = item_100 = item_101 = item_102 = item_103 = item_104 = item_105 = item_106 = item_107 = item_108 = item_109 = item_110 = item_111 = item_112 = item_113 = item_114 = item_115 = item_116 = item_117 = item_118 = item_119 = item_120 = item_121 = item_122 = item_123 = item_124 = item_125 = item_126 = item_127 = None # File: /home/xmfan/core/a/autoparallel/examples/example_ds3_local_map.py:777 in local_mapped_region, code: torch._check(routed_output.shape[0] == shape[0] * shape[1]) size_3 = routed_output.size() getitem_139 = size_3[1]; size_3 = getitem_139 = None # File: /home/xmfan/core/a/autoparallel/examples/example_ds3_local_map.py:779 in local_mapped_region, code: routed_output = routed_output.view(shape) 
routed_output_1: "bf16[4, 6144, 2048]" = routed_output.view((4, 6144, 2048)); routed_output = None # File: /home/xmfan/core/a/autoparallel/examples/example_ds3_local_map.py:781 in local_mapped_region, code: out = out.scatter_add(dim=1, index=token_indices_experts_sorted, src=routed_output) out_3: "bf16[4, 1024, 2048]" = out_2.scatter_add(dim = 1, index = token_indices_experts_sorted_2, src = routed_output_1); out_2 = token_indices_experts_sorted_2 = routed_output_1 = None ``` ## 1. Unbacked symints contained within the HOP body Based on `9b2974e812` and `36030e0315`. We disable proxy mode so that unbacked symints that are contained within the HOP subgraph aren't proxied: ```python [rank0]: RuntimeError: u576 + u577 + u578 + u579 + u580 + u581 + u582 + u583 + u584 + u585 + u586 + u587 + u588 + u589 + u590 + u591 + u592 + u593 + u594 + u595 + u596 + u597 + u598 + u599 + u600 + u601 + u602 + u603 + u604 + u605 + u606 + u607 + u608 + u609 + u610 + u611 + u612 + u613 + u614 + u615 + u616 + u617 + u618 + u619 + u620 + u621 + u622 + u623 + u624 + u625 + u626 + u627 + u628 + u629 + u630 + u631 + u632 + u633 + u634 + u635 + u636 + u637 + u638 + u639 + 1 (140667108386064)is not tracked with proxy for <torch.fx.experimental.proxy_tensor.PythonKeyTracer object at 0x7fef9d44f950> ``` And we ensure that no unbacked symints leak outside of the region. ## 2. Saved symint activations local_map is using the partitioned backward, and needs to follow the partitioner's desired ordering, this is the same order as AOTAutograd runtime wrapper uses in `_backward_prologue_functional` where we pass symints first: `d2c82bafb7/torch/_functorch/_aot_autograd/runtime_wrappers.py (L1702-L1704)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/165551 Approved by: https://github.com/bobrenjc93, https://github.com/bdhirsh ghstack dependencies: #164780	2025-10-28 02:52:41 +00:00
Simon Fan	a76b59cc45	[dynamo] local_map error message for reordered inputs (#164780 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/164780 Approved by: https://github.com/mlazos	2025-10-28 02:52:41 +00:00
PyTorch MergeBot	74336f8c77	Revert "[CD] Upgrade to CUDA 13.0.2 for nightly binaries (#165470 )" This reverts commit `5e769ff867`. Reverted https://github.com/pytorch/pytorch/pull/165470 on behalf of https://github.com/atalman due to Sorry reverting for now, to restore trunk health ([comment](https://github.com/pytorch/pytorch/pull/165470#issuecomment-3454166879))	2025-10-28 02:21:48 +00:00
Shangdi Yu	236ce736a1	[reland] Add provenance to inductor IR nodes created after graph.run (#164255 ) (#164746 ) Summary: as title - Some IR nodes are created during `finalize_multi_template_buffers()` in Scheduler. This PR adds provenance (`origin_node` and `origins`) for those nodes. - Extract `assign_origin_node` function Test Plan: ``` buck run mode/opt fbcode//caffe2/test/inductor:provenance_tracing -- -r test_deferred_triton_kernels ``` Differential Revision: D83979975 Pull Request resolved: https://github.com/pytorch/pytorch/pull/164746 Approved by: https://github.com/mlazos	2025-10-28 02:20:20 +00:00
Yingji Zhang	17bdb232e1	[GR v0] AOTI Enablement - Fix GR model AOTI inplace update by skipping empty named (#165970 ) (#166037 ) Summary: Add a gflag to allow us skip empty constant named parameter during dense loading. In [vm_parameters.py](https://fburl.com/code/7xr9ihwy), there is a constant _empty_tensor parameter used for the model. This constant parameter is skipped in XL weights during model publish because it is empty. This will break model inplace update later because it will be reported by the AOTI container but cannot be found from the model merge weights. This diff will allow us to solve the problem. Test Plan: Verified inplace update in job https://www.internalfb.com/vanguard/serving_test_cases/1165842932095688 Reviewed By: muchulee8, joannec3634 Differential Revision: D85082330 Pull Request resolved: https://github.com/pytorch/pytorch/pull/166037 Approved by: https://github.com/muchulee8, https://github.com/jcwchen	2025-10-28 01:50:36 +00:00
Nikita Shulga	add37bacda	[MPS] Better error checking for FFT ops (#166272 ) Namely, error out rather than crash when out dtype is of an unexpected type Resize output tensor to the expected size in `_out` operation, to prevent crash when tensor of an unexpected size is passed. Preserve symbolic shapes whenever possible Test plan: Run `python test_ops.py -v -k test_out_warning_fft_hfft_mps` for MPS device, without this change it crashes with `Error: Invalid KernelDAG, equalShape for destination failed'`, run `python ../test/test_ops.py -v -k test_dtypes_stft_mps`, without this change it crashes with `A complex mlir::Type does not have a corresponding complex MPSDataType"`, when input dtype is bfloat16 Pull Request resolved: https://github.com/pytorch/pytorch/pull/166272 Approved by: https://github.com/kulinseth	2025-10-28 01:31:47 +00:00
karthickai	1425b40f29	[inductor] Fix argmin/argmax returning incorrect indices for non-contiguous tensor (#165983 ) Fixes #163929 Fixes argmin/argmax operations to return correct logical indices instead of physical memory offsets when applied to transposed/permuted tensors. When `argmin()` or `argmax()` is called on a transposed tensor, Inductor was returning physical memory indices instead of logical row-major indices. This caused incorrect results that don't match eager mode behavior. Pull Request resolved: https://github.com/pytorch/pytorch/pull/165983 Approved by: https://github.com/shunting314	2025-10-28 01:23:24 +00:00
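The difference between a logical index and a physical memory offset in the commit above is easy to reproduce with a hand-rolled strided view. The sketch below is pure Python and does not use Inductor or torch; `logical_argmax` and `physical_argmax` are hypothetical names chosen for the example.

```python
from itertools import product


def logical_argmax(storage, shape, strides):
    """Argmax as a flat row-major index over the logical (viewed) tensor."""
    best_index, best_value = -1, None
    # product() enumerates coordinates in row-major (logical) order.
    for flat, coords in enumerate(product(*(range(s) for s in shape))):
        value = storage[sum(c * st for c, st in zip(coords, strides))]
        if best_value is None or value > best_value:
            best_index, best_value = flat, value
    return best_index


def physical_argmax(storage):
    """Argmax as an offset into the underlying storage (the buggy answer)."""
    return storage.index(max(storage))


# A 2x3 row-major tensor [[1, 5, 2], [9, 3, 4]] ...
storage = [1, 5, 2, 9, 3, 4]
# ... viewed as its 3x2 transpose [[1, 9], [5, 3], [2, 4]] via swapped strides.
shape, strides = (3, 2), (1, 3)

logical = logical_argmax(storage, shape, strides)
physical = physical_argmax(storage)
```

The maximum, 9, sits at flat position 1 of the transposed view but at offset 3 in storage. Eager mode reports the logical position, which is what the fix restores for compiled code.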
bobrenjc93	8af9ed0824	[torchfuzz] split, chunk, stack, cat, expand, gather, cumsum, clamp, index_select, split (#166221 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/166221 Approved by: https://github.com/pianpwk ghstack dependencies: #166187, #166188, #166220, #166189, #166190	2025-10-28 01:21:07 +00:00
bobrenjc93	7045aab143	[torchfuzz] add mhaf operator (#166190 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/166190 Approved by: https://github.com/pianpwk ghstack dependencies: #166187, #166188, #166220, #166189	2025-10-28 01:21:07 +00:00
bobrenjc93	7ae8aaf4c0	[torchfuzz] add sdpa operator (#166189 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/166189 Approved by: https://github.com/pianpwk ghstack dependencies: #166187, #166188, #166220	2025-10-28 01:20:58 +00:00
bobrenjc93	f2450798cd	[torchfuzz] make pointwise subclasses defined torch_op_name (#166220 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/166220 Approved by: https://github.com/pianpwk ghstack dependencies: #166187, #166188	2025-10-28 01:08:34 +00:00
fduwjj	46d17e8871	[Symm mem] Add a unit test for mempool tensor with dist collective (#166206 ) We haven't tried to see if tensors on nvshmem calling c10d collectives work or not. This PR is adding a show case for it inside UT. Pull Request resolved: https://github.com/pytorch/pytorch/pull/166206 Approved by: https://github.com/ngimel	2025-10-28 00:41:47 +00:00
Shunting Zhang	dc011d3203	[inductor][ez] add overridable env var for disabling fx graph cache (#166138 ) I set TORCHINDUCTOR_FX_GRAPH_CACHE=0 a lot to make sure that compilation actually happens, by disabling fx graph caching. I even put this in my .bashrc. But this causes a simple vllm script to fail: https://gist.github.com/shunting314/4253b2b5ab5e7d1b0fc9516c84054904 Error log: https://gist.github.com/shunting314/1d04bbeb58bc486f975684f56d65615d The root cause is: 1. vllm patches inductor_config.fx_graph_cache to True here: `e255d92990/vllm/compilation/compiler_interface.py (L308)` The code in vllm relies on the fx graph cache being on (unless VLLM_DISABLE_COMPILE_CACHE is overridden to false) 2. setting TORCHINDUCTOR_FX_GRAPH_CACHE=0 makes inductor_config.fx_graph_cache not overridable. I add TORCHINDUCTOR_FX_GRAPH_CACHE_DEFAULT so that we can still use it to skip the fx graph cache while still allowing projects like vllm to override it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/166138 Approved by: https://github.com/eellison	2025-10-28 00:27:19 +00:00
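The distinction the commit above draws between a forcing environment variable and a default-only one can be sketched as follows. This is not Inductor's config code: `ConfigEntry` is a hypothetical class, the environment is passed in as a plain dict, and the precedence rules are an assumption based on the commit description.

```python
class ConfigEntry:
    """Boolean option with a forcing env var and a default-only env var."""

    def __init__(self, default, env_force, env_default, environ):
        # A forcing variable pins the value; later set() calls have no effect.
        self._forced = None
        if env_force in environ:
            self._forced = environ[env_force] == "1"
        # A default variable only changes the starting value.
        if env_default in environ:
            default = environ[env_default] == "1"
        self._value = default

    def set(self, value):
        self._value = value

    def get(self):
        return self._forced if self._forced is not None else self._value


FORCE = "TORCHINDUCTOR_FX_GRAPH_CACHE"
DEFAULT = "TORCHINDUCTOR_FX_GRAPH_CACHE_DEFAULT"

# Forcing variable: a library's later override is ignored.
forced = ConfigEntry(True, FORCE, DEFAULT, {FORCE: "0"})
forced.set(True)

# Default variable: cache starts off, but a library can still turn it on.
soft = ConfigEntry(True, FORCE, DEFAULT, {DEFAULT: "0"})
soft_initial = soft.get()
soft.set(True)
```

With the forcing variable the option stays off even after `set(True)`, which mirrors the vllm failure described in the commit; with the default variable the same `set(True)` takes effect.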

95048 Commits