Commit Graph

95048 Commits

Author SHA1 Message Date
atalman
a25818cf7e Fix image display on pypi project description section (#166404)
Fixes https://github.com/pytorch/pytorch/issues/165559

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166404
Approved by: https://github.com/malfet, https://github.com/Skylion007, https://github.com/Camyll
2025-10-28 18:58:24 +00:00
Elana
e3e93c7107 [MPS] Fix random in-place ops on non-contiguous tensors (#165267)
Random in-place operations (normal_, uniform_, exponential_, bernoulli_, random_) were silently failing on non-contiguous tensors on macOS < 15.0.

* Added needsGather check and scatter-back logic to handle non-contiguous output tensors, following the pattern used in PointwiseOps.

* Added a test to confirm these now work
* Removed the pre-macOS 15 xfail for test_Dropout

Fixes #165257 and #124029
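
For reference, a minimal repro sketch of the failure mode (a sketch assuming an MPS-capable machine; the actual test added in the PR may differ):

```
import torch

if torch.backends.mps.is_available():
    base = torch.zeros(4, 4, device="mps")
    col = base[:, 0]        # non-contiguous view (stride 4) into the MPS tensor
    col.uniform_(1.0, 2.0)  # previously a silent no-op on macOS < 15.0
    assert (base[:, 0] >= 1.0).all()  # values are now scattered back into base
```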

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165267
Approved by: https://github.com/kulinseth, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-10-28 18:43:22 +00:00
Nikita Shulga
1abfa5f70b [EZ][MPS] Improve distribution error checking (#166425)
Essentially, disallow ops on self-overlapping outputs by adding the
`at::assert_no_internal_overlap(self);` check that is already used in CPU
and CUDA builds; see
895795f07c/aten/src/ATen/native/DistributionTemplates.h (L366)

This fixes `test_error_inputs_bernoulli_mps`
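
Roughly, the newly rejected case looks like this (an illustrative sketch, not the exact test body):

```
import torch

if torch.backends.mps.is_available():
    out = torch.zeros(1, device="mps").expand(10)  # internally overlapping memory
    try:
        out.bernoulli_(0.5)  # should now raise, matching CPU/CUDA behavior
    except RuntimeError as e:
        print(e)
```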

Should be landed ahead of https://github.com/pytorch/pytorch/pull/165267
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166425
Approved by: https://github.com/Skylion007, https://github.com/seemethere
2025-10-28 18:42:12 +00:00
Bin Bao
687c15c0b3 [AOTI][BE] Change test_aoti_inference to one-pass build (#164277)
Summary: To fix https://github.com/pytorch/pytorch/issues/159400. Currently, test_aoti_abi_check and test_aoti_inference need to be built in two passes: first build pytorch using the regular `python setup.py develop`, and then build with `CMAKE_FRESH=1 BUILD_AOT_INDUCTOR_TEST=1 python setup.py develop`. This is cumbersome. Fix by rewriting CMakeLists.txt for test_aoti_inference as a one-pass build which runs AOTI to compile models at test time. Also update the CI test script to get rid of the two-pass build. For test_aoti_abi_check, it is not AOTI specific, so we make it not guarded by BUILD_AOT_INDUCTOR_TEST.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164277
Approved by: https://github.com/janeyx99
2025-10-28 17:43:22 +00:00
Jeff Daily
895795f07c [ROCm][CI] forward fix kineto submodule bump (#166421)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166421
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-10-28 17:40:23 +00:00
Aaron Orenstein
2dc56456cb refactor: pull _replace_node common functionality out of Scheduler.finalize_multi_template_buffers (#163368)
Pull the _replace_node function out of Scheduler.finalize_multi_template_buffers(). This is needed by the next PR (#163369). As part of this, also pull _replace_operation_buffer() up to top level since it needs no self references.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163368
Approved by: https://github.com/PaulZhang12
2025-10-28 17:21:52 +00:00
Edward Yang
8110ce02a2 Add a skill for writing skills (#166266)
Apparently, if you just ask Claude to write a skill it doesn't follow the
correct rules.  So this one is just the official docs for skills.

Signed-off-by: Edward Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166266
Approved by: https://github.com/Skylion007
ghstack dependencies: #166265
2025-10-28 16:49:27 +00:00
Edward Yang
43c30f607e Use correct layout convention for skills (#166265)
Signed-off-by: Edward Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166265
Approved by: https://github.com/Skylion007
2025-10-28 16:49:27 +00:00
Simon Layton
5ebf74a655 [2/2] Move scaled_mm routines to their own file (#166314)
Summary:

* Further simplify `ATen/native/cuda/Blas.cpp` by moving `_scaled_mm`,
  `_scaled_mm_v2` and supporting methods to a new file,
  `ATen/native/cuda/ScaledBlas.cpp`

Test Plan:

```
pytest -svv test/test_matmul_cuda.py
pytest -svv test/test_scaled_matmul_cuda.py
```

Signed-off-by: Simon Layton <simonlayton@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166314
Approved by: https://github.com/eqy
ghstack dependencies: #166313
2025-10-28 16:35:32 +00:00
Simon Layton
acd936cc1a [1/2] Split cublasCommonArgs into its own file (#166313)
Summary:

* Factor out `cublasCommonArgs` struct
* Necessary for factoring out scaled mm routines

Test Plan:

```
pytest -svv test/test_matmul_cuda.py
pytest -svv test/test_scaled_matmul_cuda.py
```

Signed-off-by: Simon Layton <simonlayton@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166313
Approved by: https://github.com/eqy, https://github.com/Skylion007
2025-10-28 16:35:32 +00:00
PyTorch MergeBot
a4a0378e6b Revert "[cuDNN] Smoke-test runtime cuDNN version matches compile time version in CI (#165922)"
This reverts commit 2a5f87decf.

Reverted https://github.com/pytorch/pytorch/pull/165922 on behalf of https://github.com/atalman due to cudnn update started to fail, see https://github.com/pytorch/pytorch/pull/165913#issuecomment-3457293475 ([comment](https://github.com/pytorch/pytorch/pull/165922#issuecomment-3457389406))
2025-10-28 16:29:29 +00:00
Prachi Gupta
ac841267a1 [ROCm] skip AsyncTP test class as AsyncTP is not supported on ROCm (#166316)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166316
Approved by: https://github.com/jeffdaily
2025-10-28 16:23:46 +00:00
PyTorch MergeBot
0eacd934bc Revert "Update cuDNN 9.10.2 in Manylinux 2.28 Docker files (#165913)"
This reverts commit 840d63c12d.

Reverted https://github.com/pytorch/pytorch/pull/165913 on behalf of https://github.com/clee2000 due to I think something here is causing CI tests to segfault at exit on cuda, ex [GH job link](https://github.com/pytorch/pytorch/actions/runs/18857880394/job/53811917713) [HUD commit link](9a91486e45) says no tests failed but it segfaulted afterwards.  I can't tell if it's because of this change, or an unpinned dependency in docker that got triggered by this.  Note to self, would have been bad TD except trunk didn't run either ([comment](https://github.com/pytorch/pytorch/pull/165913#issuecomment-3457293475))
2025-10-28 16:11:07 +00:00
drisspg
5016e7b2eb [FlexAttention] Add mechanism to get optimal autotune decision (#165817)
Script: https://github.com/meta-pytorch/attention-gym/pull/169

Feels directionally okay, but there is some bikeshedding: this could be quite prone to key collisions depending on mask-mod and score-mod changes and the simple cache key.

Usecase: https://github.com/meta-pytorch/attention-gym/pull/169

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165817
Approved by: https://github.com/Chillee
2025-10-28 15:50:12 +00:00
Ting Lu
544b443ea1 [CD] Upgrade to CUDA 13.0.2 for nightly binaries (#165470)
CUDA 13.0 Update 2 is posted; adding it to nightlies.
Why we want to upgrade: CUDA 13.0 U2 includes a new cuBLAS release that:
1. Enables opt-in fixed-point emulation for FP64 matmuls (D/ZGEMM), which improves performance and power efficiency.
2. Improves performance on NVIDIA [DGX Spark](https://www.nvidia.com/en-us/products/workstations/dgx-spark/) for FP16/BF16 and FP8 GEMMs.
3. Adds BF16x9 FP32 emulation support for SYRK and HERK routines.
Reference: https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#cublas-release-13-0-update-2

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165470
Approved by: https://github.com/atalman
2025-10-28 15:14:43 +00:00
johannes
3041ede082 Improve eig tests in preparation for new eig backends (#166322)
### Summary
Improves validation of `torch.linalg.eig` results by verifying the eigen decomposition identity **A v − v λ = 0**.

### Motivation
Eigenvectors are not unique, and numerical differences between backends (cuSOLVER, MAGMA, CPU)
can cause false test failures. This PR replaces direct elementwise comparisons with a mathematical
identity check, improving robustness across devices.

### Details
- Introduces `fulfills_eigen_decomposition_identity()` in `test_eig_compare_backends()` to validate the eigen equation.
- Uses CPU matmul for high-precision verification.
- Handles zero-sized matrices explicitly.
- Tolerances derived from numerical comparisons between cuSOLVER and NumPy.
  See discussion: [dev-discuss.pytorch.org link](https://dev-discuss.pytorch.org/t/cusolver-dnxgeev-faster-cuda-eigenvalue-calculations/3248/6)
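
A minimal sketch of the identity check (illustrative; the helper name, precision handling, and tolerances of the actual `fulfills_eigen_decomposition_identity()` may differ):

```
import torch

def check_eig_identity(A, atol=1e-5):
    L, V = torch.linalg.eig(A)
    # Verify A v - v diag(lambda) ~= 0, using CPU matmul in complex128.
    A_c = A.cpu().to(torch.complex128)
    V_c = V.cpu().to(torch.complex128)
    L_c = L.cpu().to(torch.complex128)
    residual = A_c @ V_c - V_c @ torch.diag_embed(L_c)
    return torch.allclose(residual, torch.zeros_like(residual), atol=atol)

assert check_eig_identity(torch.randn(8, 8))
```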

### Impact
- Improves test stability and correctness across eig backends.
- No change to public API.
- All tests pass; lintrunner reports no issues.
- Enables introduction of new eig backends without false test failures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166322
Approved by: https://github.com/lezcano
2025-10-28 14:42:47 +00:00
Sherlock Huang
34d6ef7022 Update gm.print_readable to include Annotation (#165397)
Sample output
```
[rank0]:        # Annotation: {'compile_with_inductor': 'flex_attention'} File: /data/users/bahuang/pytorch/torch/nn/attention/flex_attention.py:1490 in flex_attention, code: out, lse, max_scores = flex_attention_hop(
[rank0]:        score_mod_2 = self.score_mod_2
[rank0]:        mask_fn_2 = self.mask_fn_2
[rank0]:        flex_attention_1 = torch.ops.higher_order.flex_attention(xq_5, xk_5, xv_3, score_mod_2, (2048, 2048, g____import_torchtitan_dot_models_dot_attention___flex_attention_block_masks___block_causal___none___kv_num_blocks, g____import_torchtitan_dot_models_dot_attention___flex_attention_block_masks___block_causal___none___kv_indices, g____import_torchtitan_dot_models_dot_attention___flex_attention_block_masks___block_causal___none___full_kv_num_blocks, g____import_torchtitan_dot_models_dot_attention___flex_attention_block_masks___block_causal___none___full_kv_indices, g____import_torchtitan_dot_models_dot_attention___flex_attention_block_masks___block_causal___none___q_num_blocks, g____import_torchtitan_dot_models_dot_attention___flex_attention_block_masks___block_causal___none___q_indices, g____import_torchtitan_dot_models_dot_attention___flex_attention_block_masks___block_causal___none___full_q_num_blocks, g____import_torchtitan_dot_models_dot_attention___flex_attention_block_masks___block_causal___none___full_q_indices, 128, 128, mask_fn_2), 0.25, {'PRESCALE_QK': False, 'ROWS_GUARANTEED_SAFE': False, 'BLOCKS_ARE_CONTIGUOUS': False, 'WRITE_DQ': True, 'OUTPUT_LOGSUMEXP': True, 'OUTPUT_MAX': False}, (), (g____import_torchtitan_dot_models_dot_attention___flex_attention_block_masks___block_causal___none___mask_mod___closure___0_cell_contents,));  xq_5 = xk_5 = xv_3 = score_mod_2 = mask_fn_2 = None
[rank0]:        out_2: "bf16[8, 4, 2048, 16]" = flex_attention_1[0];  flex_attention_1 = None
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165397
Approved by: https://github.com/yushangdi, https://github.com/anijain2305, https://github.com/mlazos
2025-10-28 13:54:38 +00:00
PyTorch MergeBot
110efe4df4 Revert "[inductor][choices] lookup table choices 1/3 (#164978)"
This reverts commit b44423bbb4.

Reverted https://github.com/pytorch/pytorch/pull/164978 on behalf of https://github.com/atalman due to failing internal test on newly added tests: Test when there's no lookup table entry with different autotune modes ([comment](https://github.com/pytorch/pytorch/pull/164978#issuecomment-3456400126))
2025-10-28 13:12:55 +00:00
Roman Krasavtsev
e137cd0a10 docs: fix typos (#164879)
Correct typos in the comments

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164879
Approved by: https://github.com/Lucaskabela, https://github.com/mlazos, https://github.com/cyyever
2025-10-28 12:00:36 +00:00
Shivam Raikundalia
be28329710 [Pytorch] Update Kineto Submodule (#166317)
Summary: Update Submodule

Test Plan: CI

Differential Revision: D85579130

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166317
Approved by: https://github.com/Skylion007
2025-10-28 10:41:17 +00:00
Minjang Kim
85a7c745aa [triton][nativert] Add num_cpu_threads for triton-cpu (#166255)
Summary:
The new triton-cpu has `num_cpu_threads` like `num_warps`, which are auto-tunable. This diff adds `num_cpu_threads` to NativeRT.

Differential Revision: D85515240

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166255
Approved by: https://github.com/XueningXu
2025-10-28 08:40:04 +00:00
William Wen
32fe4f681e [dynamo] fix keyerror in resume_execution (again) (#166040)
Fixes https://github.com/pytorch/pytorch/issues/166176

The error I attempted to fix in https://github.com/pytorch/pytorch/pull/162318 was still appearing internally.

Surprised that this wasn't caught anywhere 😰

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166040
Approved by: https://github.com/Lucaskabela
ghstack dependencies: #166036
2025-10-28 07:04:29 +00:00
William Wen
ebb2b2e894 [dynamo] fix store attr graph break in with block (#166036)
Fixes https://github.com/pytorch/pytorch/issues/166033

Differential Revision: [D85198055](https://our.internmc.facebook.com/intern/diff/D85198055)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166036
Approved by: https://github.com/Lucaskabela
2025-10-28 07:04:29 +00:00
KarhouTam
13413b3b07 [AMP][Refactor] Autocast dtype handling to simplify device-specific c… (#165221)
This PR refactors the autocast context manager in autocast_mode.py to simplify and centralize the logic for checking supported dtypes for each device. The previous implementation repeated similar checks for multiple device types. Now, a single mapping device_supported_dtypes is used to associate device types with their supported dtypes, and the validation logic is unified.
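
Conceptually, the centralized check looks something like the following (an illustrative sketch; the actual mapping contents and validation code in autocast_mode.py may differ):

```
import warnings
import torch

device_supported_dtypes = {
    "cpu": (torch.bfloat16, torch.float16),
    "cuda": (torch.bfloat16, torch.float16),
}

def _check_dtype(device_type, dtype):
    supported = device_supported_dtypes.get(device_type)
    if supported is not None and dtype not in supported:
        warnings.warn(
            f"In {device_type} autocast, but the target dtype is not supported. "
            "Disabling autocast."
        )
        return False
    return True
```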

**The former PR #163446 was merged but reverted due to failed CI test on `openreg` related tests.**

This PR additionally makes slight modifications to some test assertions so that the CI tests pass. CI failed due to asserting on the exact error message. For example:
```
File "/var/lib/jenkins/workspace/test/cpp_extensions/open_registration_extension/torch_openreg/tests/test_autocast.py", line 9, in test_autocast_with_unsupported_type
    with self.assertWarnsRegex(
        AssertionError: "In openreg autocast, but the target dtype torch.float32 is not supported." does not match "In openreg autocast, but the target dtype is not supported. Disabling autocast."
```

Sorry for the inconvenience again.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165221
Approved by: https://github.com/albanD
2025-10-28 06:21:29 +00:00
Shunting Zhang
5d0b3e28dc [inductor] generate fused rms/layer norm bwd (#165370)
RMS/layer norm backward generates two kinds of reductions:
- the reduction computing dx, which reduces across the hidden dimension (in the context of a transformer)
- the reduction computing dw/db, which reduces across the BxT (batch size, sequence length) dimension.

These two sets of reductions have common input buffers, but Inductor cannot fuse them because of different loop orders.

There are multiple sources of custom kernels that implement fused versions of such kernels (Liger-Kernel, quack, Paul Zhang's internal post). This PR enables Inductor to generate such kernels automatically.

The generated kernel is very similar to 33924d20b6/src/liger_kernel/ops/rms_norm.py (L114).

To keep the implementation simple and performant, we enable such fusion only if the inner reduction (computing dx) is a persistent reduction. This should be true for representative inputs. Persistent reduction is critical for the perf here to make sure a loaded tensor does not need to be reloaded.

To make sure the inner reduction (computing dx) and the outer reductions (computing dw/db) are fusible, the PR does the following:
1. Convert the outer reductions to pointwise by replacing the 'reduction' & 'store_reduction' nodes with a new type of node, 'partial_accumulate'. The new node collects the reduction type, buffer name, input of the reduction, etc., which is essential for proper codegen.
2. Do loop reordering (relying on the earlier loop-ordering-after-fusion work) to reorder the loops of the converted pointwise so it can be fused with the inner reduction.
3. Add epilogues at the end where needed, e.g. the outer reduction may be followed by a division for the mean, or by a downcast if dw/db is in low precision (fp16/bf16).

Some early benchmarking on H100 shows about a 2X speedup for both RMSNorm and LayerNorm backward for shape (1152 * 500, 384) used in an internal model. Note that I manually disabled split reduction in this benchmarking, since otherwise the fusion would currently be skipped. The next PR will make the mix-order reduction compose better with split reduction.
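
As a rough way to exercise the targeted pattern (a sketch assuming a CUDA device; not the exact benchmark script):

```
import torch

if torch.cuda.is_available():
    M, N = 1152 * 500, 384  # shape from the benchmark above
    x = torch.randn(M, N, device="cuda", requires_grad=True)
    w = torch.randn(N, device="cuda", requires_grad=True)

    def rms_norm(x, w, eps=1e-6):
        return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps) * w

    out = torch.compile(rms_norm)(x, w)
    # dx reduces over N; dw reduces over M -- the two reductions this PR fuses
    out.sum().backward()
```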

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165370
Approved by: https://github.com/jansel
ghstack dependencies: #166204
2025-10-28 05:53:52 +00:00
Banit Agrawal
9139368b64 [PyTorch] Use events from pool in copy_device_to_device (#165647)
Summary: In this diff, we add an event pool so that we don't have to create/destroy events all the time; instead, we re-use events from the pool.
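
The idea, sketched in Python for illustration only (the actual change lives in the C++ copy_device_to_device path):

```
import torch

class EventPool:
    # Toy pool: hand out idle events instead of constructing new ones each copy.
    def __init__(self):
        self._free = []

    def get(self):
        return self._free.pop() if self._free else torch.cuda.Event()

    def put(self, event):
        self._free.append(event)

if torch.cuda.is_available():
    pool = EventPool()
    ev = pool.get()
    ev.record()       # record on the current stream
    ev.synchronize()
    pool.put(ev)      # return the event for reuse by later copies
```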

Test Plan: contbuild

Differential Revision: D84685495

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165647
Approved by: https://github.com/bbus
2025-10-28 05:19:05 +00:00
Animesh Jain
02095cc09d [dynamo] Dont guard on getset descriptors for torch_function (#166346)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166346
Approved by: https://github.com/mlazos
ghstack dependencies: #166329
2025-10-28 04:33:56 +00:00
Animesh Jain
65868156c6 [dynamo] Guard selectively on the torch APIs (#166329)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166329
Approved by: https://github.com/Lucaskabela
2025-10-28 04:33:56 +00:00
Zhengxu Chen
f93ea7dab1 [export] Update dynamo_graph_capture_for_export to return GraphModule. (#166091)
Make dynamo_graph_capture_for_export return a more compatible GraphModule object which is closer to the original behavior of dynamo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166091
Approved by: https://github.com/tugsbayasgalan
2025-10-28 04:23:28 +00:00
Nichols A. Romero
a77f5d9a00 [ROCm] Use a ROCm version string without hash. (#166336)
Fixes #166068

Use the ROCm version string that does not contain a hash. The string is set in LoadHIP.cmake.

Tested on repro provided by reporter.

For a ROCm 7.0 docker container, we get `7.0.0`.

For a ROCm 7.0.2 docker container, we get `7.0.2`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166336
Approved by: https://github.com/jeffdaily
2025-10-28 03:53:55 +00:00
Janani Sriram
ff46d5a79b [Inductor][Triton][FP8] Support deepseek-style scaling in Inductor (#164404)
Summary:
Support deepseek-style scaling in Inductor Triton for FP8 GEMMs. DeepSeek-style scaling is a colloquial term for a fine-grained mixed precision framework using FP8 to train [Deepseek-V3](https://arxiv.org/pdf/2412.19437), DeepSeek AI's recent MoE (Mixture of Experts) model. DeepSeek-style scaling effectively extends the dynamic range of FP8 by mitigating dequantization overhead under increased-precision accumulation, which is key to achieving more accurate FP8 GEMM results.

DeepSeek-style scaling on matmul `A @ B` leverages two different types of scaling strategies to preserve a balance between numerical stability and training efficiency:
- Activations (input tensor `A`): tile-wise (1x128 across shape `(M, K)`)
- Weights (input tensor `B`): block-wise (128x128 across shape `(N, K)`)

This diff enables Inductor users to replicate past successes with deepseek-style scaling and achieve higher numerical stability while increasing training efficiency.

NOTE: Block-wise 128x128 scaling is only supported in CUDA 12.9+; therefore, deepseek-style scaling is currently unsupported in `fbcode` (CUDA 12.4). Use OSS PyTorch to run deepseek-style scaling.

NOTE: Accuracy for FP8 is unstable, even with high tolerances, which is why TritonBench benchmarks are unlikely to be accurate against a `torch` implementation.
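
For intuition, the scaling granularity described above amounts to something like the following (an illustrative sketch of how the scales are shaped, assuming e4m3; this is not the Inductor template itself):

```
import torch

M, N, K = 4096, 768, 512   # shapes from the test plan below
A = torch.randn(M, K)      # activations: tile-wise 1x128 scales
B = torch.randn(N, K)      # weights: block-wise 128x128 scales

fp8_max = torch.finfo(torch.float8_e4m3fn).max
scale_a = A.abs().reshape(M, K // 128, 128).amax(dim=-1) / fp8_max                   # (M, K//128)
scale_b = B.abs().reshape(N // 128, 128, K // 128, 128).amax(dim=(1, 3)) / fp8_max   # (N//128, K//128)
```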

Test Plan:
In OSS PyTorch, run
```
TORCHINDUCTOR_CACHE_DIR=~/personal/cache_dir_inductor CUDA_LAUNCH_BLOCKING=1 TORCH_USE_CUDA_DSA=1 TRITON_PRINT_AUTOTUNING=1 TRITON_ALWAYS_COMPILE=1 TORCH_LOGS=+inductor TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 ENABLE_PERSISTENT_TMA_MATMUL=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 python run.py --op fp8_gemm --only torch_fp8_gemm,pt2_fp8_gemm --metrics tflops,accuracy --m 4096 --n 768 --k 512 --output="{output_dir}/deepseek_bench.csv" --scaling_deepseek --atol=1e-2 --rtol=0.5 2>&1 | tee ~/personal/deepseek_style/deepseek_bench.log
```

Differential Revision: D83609850

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164404
Approved by: https://github.com/slayton58
2025-10-28 03:38:54 +00:00
William Wen
f452edd782 [dynamo, 3.14] fix misc. bugs to get most dynamo unittests passing locally in 3.14 (#164631)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164631
Approved by: https://github.com/Lucaskabela, https://github.com/mlazos
2025-10-28 03:24:22 +00:00
William Wen
ea698e8bfc [dynamo, nested graph breaks] disallow nested graph breaks in HOPs (#166016)
As discussed offline with @ydwu4, we should not allow nested graph breaks in HOPs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166016
Approved by: https://github.com/Lucaskabela
ghstack dependencies: #166013, #166015, #165808, #165809
2025-10-28 03:03:38 +00:00
William Wen
7f7a28046b [dynamo, nested graph breaks] disable nested graph breaks in generators; enable nested_graph_breaks in test_ctx_manager.py and test_generator.py (#165809)
Generators should not support nested graph breaks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165809
Approved by: https://github.com/Lucaskabela, https://github.com/guilhermeleobas
ghstack dependencies: #166013, #166015, #165808
2025-10-28 03:03:37 +00:00
William Wen
d8283a317a [dynamo, nested graph breaks] fix RETURN_VALUE tx skipping in nested graph breaks (#165808)
Previously, we would completely skip building and calling any resume function if the leaf frame's resume instruction was RETURN_VALUE/RETURN_CONST. Now, we only skip building/calling resume functions for frames that are resuming on RETURN_VALUE.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165808
Approved by: https://github.com/Lucaskabela
ghstack dependencies: #166013, #166015
2025-10-28 03:03:37 +00:00
William Wen
e0ca3049c0 [dynamo, nested graph breaks] remove _dynamo.utils.counter patch on inlined tx'es (#166015)
This `patch.dict(counters, ...)` appears to be ancient code that doesn't really seem to be doing anything. It causes issues in nested graph breaks because the patch cleanup clears out the record of the nested graph break. Removing the patch to see if it's even needed in the first place.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166015
Approved by: https://github.com/Lucaskabela
ghstack dependencies: #166013
2025-10-28 03:03:37 +00:00
William Wen
8417981c96 [dynamo, nested graph breaks] add TestCaseWithNestedGraphBreaks subclass (#166013)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166013
Approved by: https://github.com/Lucaskabela
2025-10-28 03:03:37 +00:00
Simon Fan
06e71c8558 [hop] local_map MoE: fix unbacked symints during tracing and symint activations order in the wrapper (#165551)
This PR fixes 2 issues with local_mapping token-choice MoE. Splits from the fw token dispatch result in tensors with unbacked shapes; these unbacked shapes are fully contained in the a2as and should not leak outside of the joint graph. The HOP body fw and bw are expected to coerce back to static shapes (due to being added with the shared experts output) before returning.
```python
routed_output: "bf16[u0 + u1 + u10 + u11 + u12 + u13 + u14 + u15 + u16 + u17 + u18 + u19 + u2 + u20 + u21 + u22 + u23 + u24 + u25 + u26 + u27 + u28 + u29 + u3 + u30 + u31 + u32 + u33 + u34 + u35 + u36 + u37 + u38 + u39 + u4 + u40 + u41 + u42 + u43 + u44 + u45 + u46 + u47 + u48 + u49 + u5 + u50 + u51 + u52 + u53 + u54 + u55 + u56 + u57 + u58 + u59 + u6 + u60 + u61 + u62 + u63 + u7 + u8 + u9, 2048]" = torch.ops.higher_order.autograd_function_apply(fwd_body_1, bwd_body_1, out_1, item, item_1, item_2, item_3, item_4, item_5, item_6, item_7, item_8, item_9, item_10, item_11, item_12, item_13, item_14, item_15, item_16, item_17, item_18, item_19, item_20, item_21, item_22, item_23, item_24, item_25, item_26, item_27, item_28, item_29, item_30, item_31, item_32, item_33, item_34, item_35, item_36, item_37, item_38, item_39, item_40, item_41, item_42, item_43, item_44, item_45, item_46, item_47, item_48, item_49, item_50, item_51, item_52, item_53, item_54, item_55, item_56, item_57, item_58, item_59, item_60, item_61, item_62, item_63, item_64, item_65, item_66, item_67, item_68, item_69, item_70, item_71, item_72, item_73, item_74, item_75, item_76, item_77, item_78, item_79, item_80, item_81, item_82, item_83, item_84, item_85, item_86, item_87, item_88, item_89, item_90, item_91, item_92, item_93, item_94, item_95, item_96, item_97, item_98, item_99, item_100, item_101, item_102, item_103, item_104, item_105, item_106, item_107, item_108, item_109, item_110, item_111, item_112, item_113, item_114, item_115, item_116, item_117, item_118, item_119, item_120, item_121, item_122, item_123, item_124, item_125, item_126, item_127, args_tensor_mask = [True, False, False, False], non_differentiable_idx = []);  fwd_body_1 = bwd_body_1 = out_1 = item = item_1 = item_2 = item_3 = item_4 = item_5 = item_6 = item_7 = item_8 = item_9 = item_10 = item_11 = item_12 = item_13 = item_14 = item_15 = item_16 = item_17 = item_18 = item_19 = item_20 = item_21 = item_22 = item_23 = item_24 = item_25 = item_26 = item_27 = item_28 = item_29 = item_30 = item_31 = item_32 = item_33 = item_34 = item_35 = item_36 = item_37 = item_38 = item_39 = item_40 = item_41 = item_42 = item_43 = item_44 = item_45 = item_46 = item_47 = item_48 = item_49 = item_50 = item_51 = item_52 = item_53 = item_54 = item_55 = item_56 = item_57 = item_58 = item_59 = item_60 = item_61 = item_62 = item_63 = item_64 = item_65 = item_66 = item_67 = item_68 = item_69 = item_70 = item_71 = item_72 = item_73 = item_74 = item_75 = item_76 = item_77 = item_78 = item_79 = item_80 = item_81 = item_82 = item_83 = item_84 = item_85 = item_86 = item_87 = item_88 = item_89 = item_90 = item_91 = item_92 = item_93 = item_94 = item_95 = item_96 = item_97 = item_98 = item_99 = item_100 = item_101 = item_102 = item_103 = item_104 = item_105 = item_106 = item_107 = item_108 = item_109 = item_110 = item_111 = item_112 = item_113 = item_114 = item_115 = item_116 = item_117 = item_118 = item_119 = item_120 = item_121 = item_122 = item_123 = item_124 = item_125 = item_126 = item_127 = None

# File: /home/xmfan/core/a/autoparallel/examples/example_ds3_local_map.py:777 in local_mapped_region, code: torch._check(routed_output.shape[0] == shape[0] * shape[1])
size_3 = routed_output.size()
getitem_139 = size_3[1];  size_3 = getitem_139 = None

# File: /home/xmfan/core/a/autoparallel/examples/example_ds3_local_map.py:779 in local_mapped_region, code: routed_output = routed_output.view(shape)
routed_output_1: "bf16[4, 6144, 2048]" = routed_output.view((4, 6144, 2048));  routed_output = None

# File: /home/xmfan/core/a/autoparallel/examples/example_ds3_local_map.py:781 in local_mapped_region, code: out = out.scatter_add(dim=1, index=token_indices_experts_sorted, src=routed_output)
out_3: "bf16[4, 1024, 2048]" = out_2.scatter_add(dim = 1, index = token_indices_experts_sorted_2, src = routed_output_1);  out_2 = token_indices_experts_sorted_2 = routed_output_1 = None
```

## 1. Unbacked symints contained within the HOP body

Based on 9b2974e812 and 36030e0315.

We disable proxy mode so that unbacked symints that are contained within the HOP subgraph aren't proxied:
```python
[rank0]: RuntimeError: u576 + u577 + u578 + u579 + u580 + u581 + u582 + u583 + u584 + u585 + u586 + u587 + u588 + u589 + u590 + u591 + u592 + u593 + u594 + u595 + u596 + u597 + u598 + u599 + u600 + u601 + u602 + u603 + u604 + u605 + u606 + u607 + u608 + u609 + u610 + u611 + u612 + u613 + u614 + u615 + u616 + u617 + u618 + u619 + u620 + u621 + u622 + u623 + u624 + u625 + u626 + u627 + u628 + u629 + u630 + u631 + u632 + u633 + u634 + u635 + u636 + u637 + u638 + u639 + 1 (140667108386064)is not tracked with proxy for <torch.fx.experimental.proxy_tensor.PythonKeyTracer object at 0x7fef9d44f950>
```
And we ensure that no unbacked symints leak outside of the region.

## 2. Saved symint activations

local_map uses the partitioned backward and needs to follow the partitioner's desired ordering; this is the same order the AOTAutograd runtime wrapper uses in `_backward_prologue_functional`, where we pass symints first: d2c82bafb7/torch/_functorch/_aot_autograd/runtime_wrappers.py (L1702-L1704)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165551
Approved by: https://github.com/bobrenjc93, https://github.com/bdhirsh
ghstack dependencies: #164780
2025-10-28 02:52:41 +00:00
Simon Fan
a76b59cc45 [dynamo] local_map error message for reordered inputs (#164780)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164780
Approved by: https://github.com/mlazos
2025-10-28 02:52:41 +00:00
PyTorch MergeBot
74336f8c77 Revert "[CD] Upgrade to CUDA 13.0.2 for nightly binaries (#165470)"
This reverts commit 5e769ff867.

Reverted https://github.com/pytorch/pytorch/pull/165470 on behalf of https://github.com/atalman due to Sorry reverting for now, to restore trunk health ([comment](https://github.com/pytorch/pytorch/pull/165470#issuecomment-3454166879))
2025-10-28 02:21:48 +00:00
Shangdi Yu
236ce736a1 [reland] Add provenance to inductor IR nodes created after graph.run (#164255) (#164746)
Summary:

as title

- Some IR nodes are created during `finalize_multi_template_buffers()` in Scheduler. This PR adds provenance (`origin_node` and `origins`) for those nodes.

- Extract `assign_origin_node` function

Test Plan:
```
buck run mode/opt fbcode//caffe2/test/inductor:provenance_tracing -- -r  test_deferred_triton_kernels
```

Differential Revision: D83979975

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164746
Approved by: https://github.com/mlazos
2025-10-28 02:20:20 +00:00
Yingji Zhang
17bdb232e1 [GR v0] AOTI Enablement - Fix GR model AOTI inplace update by skipping empty named (#165970) (#166037)
Summary:

Add a gflag to allow us to skip empty constant named parameters during
dense loading. In [vm_parameters.py](https://fburl.com/code/7xr9ihwy), there is
a constant _empty_tensor parameter used by the model. This constant parameter
is skipped in XL weights during model publish because it is empty. This breaks
model inplace update later because it is reported by the AOTI container but
cannot be found in the model merge weights. This diff allows us to solve the problem.

Test Plan: Verified inplace update in job https://www.internalfb.com/vanguard/serving_test_cases/1165842932095688

Reviewed By: muchulee8, joannec3634

Differential Revision: D85082330

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166037
Approved by: https://github.com/muchulee8, https://github.com/jcwchen
2025-10-28 01:50:36 +00:00
Nikita Shulga
add37bacda [MPS] Better error checking for FFT ops (#166272)
Namely:
* Error out rather than crash when the out dtype is of an unexpected type.
* Resize the output tensor to the expected size in the `_out` operation, to prevent a crash when a tensor of an unexpected size is passed.
* Preserve symbolic shapes whenever possible.

Test plan: Run `python test_ops.py -v -k test_out_warning_fft_hfft_mps` for the MPS device; without this change it crashes with `Error: Invalid KernelDAG, equalShape for destination failed'`. Run `python ../test/test_ops.py -v -k test_dtypes_stft_mps`; without this change it crashes with `A complex mlir::Type does not have a corresponding complex MPSDataType"` when the input dtype is bfloat16.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166272
Approved by: https://github.com/kulinseth
2025-10-28 01:31:47 +00:00
karthickai
1425b40f29 [inductor] Fix argmin/argmax returning incorrect indices for non-contiguous tensor (#165983)
Fixes #163929

Fixes argmin/argmax operations to return correct logical indices instead of physical memory offsets when applied to transposed/permuted tensors.  When `argmin()` or `argmax()` is called on a transposed tensor, Inductor was returning physical memory indices instead of logical row-major indices. This caused incorrect results that don't match eager mode behavior.
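
A small repro sketch of the mismatch (illustrative; the regression test in the PR may differ):

```
import torch

x = torch.randn(32, 64)
xt = x.t()  # transposed, non-contiguous

eager = xt.argmax()
compiled = torch.compile(lambda t: t.argmax())(xt)
assert eager == compiled  # previously the compiled result could be a physical offset
```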

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165983
Approved by: https://github.com/shunting314
2025-10-28 01:23:24 +00:00
bobrenjc93
8af9ed0824 [torchfuzz] split, chunk, stack, cat, expand, gather, cumsum, clamp, index_select, split (#166221)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166221
Approved by: https://github.com/pianpwk
ghstack dependencies: #166187, #166188, #166220, #166189, #166190
2025-10-28 01:21:07 +00:00
bobrenjc93
7045aab143 [torchfuzz] add mhaf operator (#166190)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166190
Approved by: https://github.com/pianpwk
ghstack dependencies: #166187, #166188, #166220, #166189
2025-10-28 01:21:07 +00:00
bobrenjc93
7ae8aaf4c0 [torchfuzz] add sdpa operator (#166189)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166189
Approved by: https://github.com/pianpwk
ghstack dependencies: #166187, #166188, #166220
2025-10-28 01:20:58 +00:00
bobrenjc93
f2450798cd [torchfuzz] make pointwise subclasses defined torch_op_name (#166220)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166220
Approved by: https://github.com/pianpwk
ghstack dependencies: #166187, #166188
2025-10-28 01:08:34 +00:00
fduwjj
46d17e8871 [Symm mem] Add a unit test for mempool tensor with dist collective (#166206)
We haven't tried c10d collectives on tensors backed by nvshmem to see whether they work. This PR adds a showcase for it in a unit test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166206
Approved by: https://github.com/ngimel
2025-10-28 00:41:47 +00:00
Shunting Zhang
dc011d3203 [inductor][ez] add overridable env var for disabling fx graph cache (#166138)
I set TORCHINDUCTOR_FX_GRAPH_CACHE=0 a lot to make sure compilation
happens by disabling fx graph caching. I even put this in my .bashrc.
But this causes a simple vllm script to fail:
https://gist.github.com/shunting314/4253b2b5ab5e7d1b0fc9516c84054904

Error log:
https://gist.github.com/shunting314/1d04bbeb58bc486f975684f56d65615d

The root cause is:
1. vllm patches inductor_config.fx_graph_cache to True here:
   e255d92990/vllm/compilation/compiler_interface.py (L308)

   The code in vllm relies on the fx graph cache being on (unless
   VLLM_DISABLE_COMPILE_CACHE is overridden to false).
2. Setting TORCHINDUCTOR_FX_GRAPH_CACHE=0 makes
   inductor_config.fx_graph_cache not overridable.

I add TORCHINDUCTOR_FX_GRAPH_CACHE_DEFAULT so that we can still use it to skip the fx
graph cache while still allowing projects like vllm to override it.
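
Roughly, the intended usage (a sketch based on the description above; exact semantics live in Inductor's config handling):

```
import os
os.environ["TORCHINDUCTOR_FX_GRAPH_CACHE_DEFAULT"] = "0"  # set before torch is imported

import torch
from torch._inductor import config as inductor_config

inductor_config.fx_graph_cache = True  # the kind of override vllm does; still takes effect

fn = torch.compile(lambda x: x * 2)
fn(torch.randn(4))
```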

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166138
Approved by: https://github.com/eellison
2025-10-28 00:27:19 +00:00