Commit Graph

94919 Commits

Author SHA1 Message Date
Maggie Moss
eb83c3ca23 Clean up unused Pyrefly suppressions (#166178)
Cleaning up ignores that are no longer needed in the repo and adding select suppressions so the main branch is clean.

test plan:
`lintrunner -a`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166178
Approved by: https://github.com/oulgen
2025-10-25 05:32:21 +00:00
Jane Xu
7924e3aacf Remove likely unnecessary _EXPAND trick for non-windows in HIDDEN_NAMESPACE_BEGIN (#166203)
I've learned that the EXPAND trick is needed mostly for an MSVC quirk to properly expand arguments. I tested on Linux locally and suspect that we don't need the _EXPAND for non-Windows. This PR is BE to minimize what we need and remove what we don't, but I'm also okay not landing this if @malfet tells me that this quirk goes beyond MSVC.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166203
Approved by: https://github.com/malfet
ghstack dependencies: #166076, #166077, #166078, #166079
2025-10-25 04:44:07 +00:00
Jason Ansel
78bcfcf870 [fx] Optimize torch.fx.Node.replace_all_uses_with (#165889)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165889
Approved by: https://github.com/aorenste
2025-10-25 03:44:41 +00:00
Ke Wen
1e2e7cb18b Add doc for Symmetric Memory (#166148)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166148
Approved by: https://github.com/fduwjj
2025-10-25 03:41:15 +00:00
Justin Chu
003601a70d Set prefer_deferred_runtime_asserts_over_guards to True (#165820)
Set prefer_deferred_runtime_asserts_over_guards to True and allow a flag to control the behavior, just in case.

This option has enabled the gemma3 model export with transformers==4.57. I am not sure how best to test it though.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165820
Approved by: https://github.com/titaiwangms
2025-10-25 03:38:19 +00:00
bobrenjc93
1d58d5fe25 [hops] fix unbacked runtime asserts for cond higher order op (#165893)
At a high level after this fix we get the following nice tlparse https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/bobren/54a57665-7dcc-41e0-8ca7-df01393cd4aa/custom/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

As seen in this doc, previously we were simply dropping asserts post-dynamo: https://docs.google.com/document/d/1nRQwvw_gWL0_9T3VKb5Ly3_tNI1fgqG9WtryeD6qaZI/edit?tab=t.0

The fixes are a couple things:

1) Actually run the runtime assertion fx graph pass on subgraphs
2) Reset the fake mode unbacked memos across speculate_subgraph invocations,
   since the memos break the runtime assertion insertions: calls like nonzero
   end up not allocating new unbacked symints and hence not populating
   pending_unbacked, which then results in incorrect unbacked_bindings on fx
   nodes in subgraphs.

This is a first step in hardening runtime asserts across all phases of
the compiler (eager, aot_eager, inductor, etc.). I will continue kicking
the tires and fixing bugs until we get runtime assert generation in a good
place. One obvious next step: the test case added in this PR fails when
compiled with inductor with the following error (NB: it fails before this PR as well):

```
  File "/data/users/bobren/a/pytorch/torch/_inductor/ir.py", line 659, in get_dtype
    return self.dtype
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
LoweringException: AttributeError: 'ShapeAsConstantBuffer' object has no attribute 'dtype'
  target: cond
  args[0]: Eq(Mod(s77, 4), 0)
  args[1]: Subgraph(name='true_graph_0', graph_module=<lambda>(), graph=<torch._inductor.graph.SubgraphLowering object at 0x7fbcbb11e110>)
  args[2]: Subgraph(name='false_graph_0', graph_module=<lambda>(), graph=<torch._inductor.graph.SubgraphLowering object at 0x7fbcbb21cf70>)
  args[3]: (s77, TensorBox(StorageBox(
    ComputedBuffer(name='buf0', layout=FlexibleLayout('cuda:0', torch.float32, size=[s77, s77], stride=[s77, 1]), data=Pointwise(device=device(type='cuda', index=0), dtype=torch.float32, inner_fn=<function make_pointwise.<locals>.inner.<locals>.inner_fn at 0x7fbcbb2f37f0>, ranges=[s77, s77]))
  )))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165893
Approved by: https://github.com/zou3519
2025-10-25 03:25:36 +00:00
Yiming Zhou
de7fdfe41a Export flex attention with kwargs and DTensor (#166045)
Fixes #165948

Adding registration of the MaskBlock makes flex attention with kwargs exportable.

Also modified unittests to accept kwargs

```
python test/distributed/tensor/test_dtensor_export.py -k test_flex_attention_dtensor_export

python test/inductor/test_flex_attention.py -k test_pytree_
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166045
Approved by: https://github.com/drisspg
2025-10-25 03:17:22 +00:00
Nicolas De Carli
b31bad1b8f [Pytorch] Enable autovec on aarch64 for type conversion (#166049)
Summary:
Implementing autovec template for type conversions on aarch64-NEON

Generated code can be seen here: https://godbolt.org/z/1K6T1d9TE

We've seen significant performance improvements for converting to and from bytes, compiling using clang with -march=armv9-a+sve2:

Before
float->uint8->float ===> 683.212us
float->int8->float ===> 687.846us
int32->uint8->int32 ===> 497.121us
int32->int8->int32 ===> 481.889us

After:
float->uint8->float ===> 198.204us  ----> 245% higher throughput
float->int8->float ===> 200.241us ----> 244% higher throughput
int32->uint8->int32 ===> 197.970us ----> 151% higher throughput
int32->int8->int32 ===> 198.206us ----> 143% higher throughput
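
For context, the round trips above can be approximated from Python as a rough CPU timing sketch (not the benchmark that produced the numbers above):

```python
import time
import torch

x = torch.rand(1_000_000) * 255  # float32 CPU tensor

def roundtrip(t, dtype):
    return t.to(dtype).to(t.dtype)  # e.g. float -> uint8 -> float

for _ in range(10):  # warmup
    roundtrip(x, torch.uint8)

iters = 100
start = time.perf_counter()
for _ in range(iters):
    roundtrip(x, torch.uint8)
print(f"float->uint8->float: {(time.perf_counter() - start) / iters * 1e6:.1f} us/iter")
```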

Test Plan:

buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch

Differential Revision: D85213420

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166049
Approved by: https://github.com/ezyang, https://github.com/mcfi, https://github.com/aditew01
2025-10-25 02:55:50 +00:00
Natalia Gimelshein
2efcf3ca98 Reverts #163712 and forces allgather/scatter inputs/outputs to be contiguous (#166181)
Per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166181
Approved by: https://github.com/kwen2501
2025-10-25 02:43:10 +00:00
glen-amd
761f946043 [ROCm] new implementation of upsample_bilinear2d_backward (#164572)
Changed the implementation from an output-based approach to an input-based one to remove `atomicAdd` operations, and it appears to deliver at least a 20× speedup.

The changes are from Yu-Yun <YuYun.Chang@amd.com>.

# Summary: Refactor of the implementation of the `upsample_bilinear2d_backward` operation on MI300X/MI325X
- The original "scatter-add" approach
  - Each thread, representing an output pixel, scattered gradient contributions to four input pixels, using costly atomic operations on MI300X/MI325X GPUs.
- The new "gather-sum" approach
  - Each thread is responsible for a single input pixel and gathers all relevant gradient contributions from a small, calculated region of the output tensor (done by the `compute_output_range` device function).
# Breakdown of the code changes
- Inversion of the parallelization strategy of the kernel function `upsample_bilinear2d_backward_out_frame`
  - Originally, the main kernel loop was parallelized over the number of elements in the output gradient tensor (`const size_t o_numel = nc * width2 * height2;`).
    - Each thread processed one output pixel.
  - The new loop is parallelized over the number of elements in the input gradient tensor (`const size_t i_numel = nc * height1 * width1;`).
    - Each thread is responsible for calculating the final gradient for a single input pixel.
  - The kernel launch changes accordingly in the function `upsample_bilinear2d_backward_out_cuda_template`.
- Added a device function for calculating the range of output pixels that could have possibly used the input pixel (`input_pos`) during the forward-pass interpolation
  - This is essentially the mathematical inverse of the forward pass.
  - This function tries to prune a thread's search space so that it only needs to inspect a small, local window of the output tensor.
- Gradient calculation approach switching from "scatter-add" to "gather-sum"
  - Scatter-add
    - For each output pixel, the thread calculated 4 gradient contributions and used `fastAtomicAdd` 4 times to add these values to 4 different (and potentially highly contended) memory locations in the input gradient tensor.
  - Gather-sum
    - A thread responsible for one input pixel calls `compute_output_range` to determine the small rectangular region of output pixels that influence the input's final gradient value.
    - The thread iterates through this region, and for each output pixel in the region, it re-calculates the interpolation weights to determine the exact contribution to its specific input pixel.
    - All these contributions are accumulated into a private, per-thread register variable (`accscalar_t grad_sum = 0;`).
      - W/o any global memory access, this accumulation is extremely fast.
    - When the loops are done, the thread performs a single, direct write (non-atomic) of the final summed gradient to its designated location in global memory (`idata[index] = static_cast<scalar_t>(grad_sum);`).
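
A simplified 1-D Python sketch of the scatter-add vs. gather-sum duality described above (illustrative only, not the CUDA kernel: the real kernel is 2-D bilinear and prunes the inner loop with `compute_output_range`):

```python
import torch

def linear_upsample_backward_scatter(grad_out, in_size, scale):
    # One "thread" per OUTPUT pixel; the two += on grad_in contend and need atomicAdd on a GPU.
    grad_in = torch.zeros(in_size)
    for o in range(grad_out.numel()):
        pos = o * scale
        i0 = min(int(pos), in_size - 1)
        i1 = min(i0 + 1, in_size - 1)
        w1 = pos - i0
        grad_in[i0] += (1 - w1) * grad_out[o]
        grad_in[i1] += w1 * grad_out[o]
    return grad_in

def linear_upsample_backward_gather(grad_out, in_size, scale):
    # One "thread" per INPUT pixel; contributions are summed locally and written once, no atomics.
    grad_in = torch.zeros(in_size)
    for i in range(in_size):
        acc = 0.0
        for o in range(grad_out.numel()):  # the real kernel restricts this loop via compute_output_range
            pos = o * scale
            i0 = min(int(pos), in_size - 1)
            i1 = min(i0 + 1, in_size - 1)
            w1 = pos - i0
            if i == i0:
                acc += (1 - w1) * grad_out[o]
            if i == i1:
                acc += w1 * grad_out[o]
        grad_in[i] = acc  # single, coalesced, non-atomic write
    return grad_in

g = torch.randn(8)
scale = 3 / 8  # input size / output size
assert torch.allclose(linear_upsample_backward_scatter(g, 3, scale),
                      linear_upsample_backward_gather(g, 3, scale))
```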
# Why performance gets boosted
- Analysis of the root cause of performance drop
  - Ref. (internal only) - https://amd.atlassian.net/wiki/spaces/~glencao2/pages/1140493327/PyTorch__upsample_bilinear2d_backward
- First and foremost, elimination of the contention of atomic operations
  - Many parallel threads called `atomicAdd` frequently attempting to update the exact same memory location in the input gradient tensor at the same time.
    - The GPU's memory controller has to serialize these operations, effectively nullifying the benefit of parallel capability at those contention points.
  - MI300X/MI325X chiplet-based CDNA 3 architecture amplified the issue.
    - When contending threads reside on different XCDs, resolving the atomic operation requires high-latency coherence traffic across the Infinity Fabric interconnect.
  - The implementation change eliminates hardware-level serialization and cross-chiplet coherence traffic caused by many `atomicAdd`.
- Improved memory access pattern and locality
  - Write coalescing
    - The regular, non-atomic writes (`idata[index] = static_cast<scalar_t>(grad_sum);`) can be perfectly coalesced by the GPU.
  - Read locality
    - Even though there are many (potentially repeated) reads from the output tensor (`static_cast<accscalar_t>(odata[output_idx])`), these are highly cache-friendly, meaning the data for one thread is likely to be in the L1 or L2 cache already due to an access from a neighboring thread.
- Trade-off: computation for memory synchronization
  - The recalculation of interpolation weights fits well on high-computational-throughput modern GPUs like MI300X/MI325X.
  - Removal of atomic operations avoids expensive memory synchronization.

---

Optimizations of `grid_sampler_2d_backward` will be addressed in a separate PR.
Doc for reference: (internal only) https://amd.atlassian.net/wiki/spaces/~glencao2/pages/1162750701/PyTorch__grid_sampler_2d_backward

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164572
Approved by: https://github.com/jeffdaily

Co-authored-by: Eli Uriegas <1700823+seemethere@users.noreply.github.com>
2025-10-25 02:39:24 +00:00
Nikita Shulga
8aa465f18e [MPS] Migrate angle to Metal ops (#166210)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166210
Approved by: https://github.com/Skylion007
2025-10-25 01:52:33 +00:00
Animesh Jain
0a5d68d92d [dynamo] Remove unnecessary NAME_MATCH guard (#166112)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166112
Approved by: https://github.com/Lucaskabela
ghstack dependencies: #166155
2025-10-25 01:27:42 +00:00
Animesh Jain
42bd210fff [dynamo] Avoid ID_MATCH on methods - use CLOSURE_MATCH on functions (#166155)
The id of a bound method can change from invocation to invocation. Here we guard on
the `__code__` object, which does not change.
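
For illustration, a plain-CPython sketch (not the PR's guard code) of why guarding on `id()` of a method is fragile while guarding on its `__code__` object is stable:

```python
class M:
    def f(self):
        return 1

m = M()
a, b = m.f, m.f
print(a is b)                    # False: each attribute access creates a fresh bound method,
                                 # so an ID_MATCH guard on m.f can spuriously fail
print(a.__code__ is b.__code__)  # True: the underlying code object is shared and stable,
                                 # which is what guarding on __code__ relies on
```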

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166155
Approved by: https://github.com/jansel
2025-10-25 01:27:42 +00:00
FFFrog
1d13c314b3 [OpenReg] Remove the Unnecessary Fallback Implementation for AutogradPrivate1 (#165316)
As the title stated.

The fallback for AutogradPrivateUse1 is built into PyTorch, so there is no need to register a general implementation for an out-of-tree backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165316
Approved by: https://github.com/ezyang, https://github.com/albanD
ghstack dependencies: #165315
2025-10-25 01:27:27 +00:00
FFFrog
0c9763a5a0 [Autograd] Add Default Autograd Fallback for PrivateUse1 in PyTorch (#165315)
Please refer to this [link](https://github.com/pytorch/pytorch/issues/163979) for more background.

- Allow registering a fallback for AutogradPrivateUse1 multiple times.
- Add an Autograd fallback implementation for AutogradPrivateUse1.

PyTorch can provide a common implementation for AutogradPrivateUse1, and the user can override it based on the needs of a specific accelerator.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165315
Approved by: https://github.com/albanD
2025-10-25 01:27:27 +00:00
nick-kuhn
79a4a9c02e Fix race condition and make CUDA kthvalue deterministic (#165762)
The gatherKthValue kernel had a race condition where multiple threads could write to the same output location without synchronization when duplicate k-th values exist, resulting in non-deterministic output.

Changes:
- aten/src/ATen/native/cuda/Sorting.cu: Use atomicMin with shared memory to deterministically find minimum index. Add early termination and remove redundant inRange checks. (We have to cast the index to `int32_t`, but this is already assumed to fit earlier in the kernel.)
- aten/src/ATen/native/cuda/Sorting.cpp: Remove non-deterministic alert since kthvalue is now deterministic on CUDA.
- torch/__init__.py: Remove kthvalue from non-deterministic operations list and remove kthvalue example from use_deterministic_algorithms() docstring.
- test/test_torch.py: Remove test_nondeterministic_alert_kthvalue since kthvalue no longer raises alerts on CUDA.

Benefits:
- Deterministic: always returns minimum index when duplicates exist
- Potential performance improvement on large arrays with repetitions
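
A small sketch of the behavior this change guarantees (requires CUDA; the minimum-index claim is from the Benefits above):

```python
import torch

x = torch.tensor([3.0, 1.0, 3.0, 3.0, 2.0], device="cuda")
val, idx = torch.kthvalue(x, k=3)  # the 3rd smallest value is 3.0, present at indices 0, 2, 3
print(val.item(), idx.item())      # 3.0 0 -- now always the minimum matching index
# kthvalue no longer needs to raise an alert under torch.use_deterministic_algorithms(True) on CUDA.
```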

Test Results:
- All existing PyTorch tests pass (test_kthvalue)
- Custom determinism tests confirm consistent results
- Custom CUDA vs CPU correctness validated across 50+ scenarios
- Custom performance benchmarks show improvements with no visible regressions

Addresses #165227

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165762
Approved by: https://github.com/ngimel, https://github.com/eqy
2025-10-25 00:45:57 +00:00
Yuanyuan Chen
9d0b77f4cd [10/N] Apply ruff UP035 rule (#165709)
This is a follow-up of #165515. The ruff `UP035` rule is applied to dynamo code to use Py 3.10+ typing.
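
For reference, a typical UP035 fix looks like this (a generic example, not a hunk from this PR):

```python
# Before (flagged by ruff UP035: deprecated typing aliases):
from typing import Dict, List

def count_tokens(lines: List[str]) -> Dict[str, int]:
    return {line: len(line.split()) for line in lines}

# After (built-in generics, no typing import needed on Py 3.9+/3.10+):
def count_tokens(lines: list[str]) -> dict[str, int]:
    return {line: len(line.split()) for line in lines}
```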

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165709
Approved by: https://github.com/ezyang
2025-10-25 00:20:13 +00:00
Jane Xu
d486eee234 Hide APIs in torch::headeronly (#166079)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166079
Approved by: https://github.com/malfet, https://github.com/cyyever
ghstack dependencies: #166076, #166077, #166078
2025-10-25 00:18:26 +00:00
Jane Xu
cddd5f74ab Hide stable Library structs instead of using anon namespace (#166078)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166078
Approved by: https://github.com/malfet
ghstack dependencies: #166076, #166077
2025-10-25 00:18:26 +00:00
Jane Xu
dfdb68e51f Hide all APIs in torch::stable (#166077)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166077
Approved by: https://github.com/malfet
ghstack dependencies: #166076
2025-10-25 00:18:26 +00:00
Jane Xu
98c818320a Add HIDDEN_NAMESPACE_BEGIN and END macros for hiding header APIs (#166076)
Spurred by the conversation started in https://github.com/pytorch/pytorch/issues/163343.

Context:
* Header implementations may be inlined _but_ are not necessarily inlined, even when using the `inline` keyword.
* When someone wants to use multiple extensions in the same runtime, e.g., with FA3 and AO, then 2 `.so`s are loaded that may have been built with different libtorch versions. Thus, if an API is not inlined and is implemented differently in each, one implementation will be arbitrarily picked up and used across the runtime, depending on link order. This is bad!
* Consequently, we need to be very good at guaranteeing that we don't modify header implementations within a namespace. This is easy to mess up by accident, which would be a dire mistake.

Solution:
In essence, we want APIs in torch::headeronly and torch::stable to be visible in each individual extension only, and nowhere else. We want to hide these symbols! Thankfully, pybind already solved this problem (thanks @malfet for bringing that to my attention). This PR is heavily inspired by the code in pybind here: e6984c805e/include/pybind11/detail/pybind11_namespace_macros.h (L73-L82).

In this PR, we introduce the macros for defining hidden namespaces in PyTorch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166076
Approved by: https://github.com/malfet
2025-10-25 00:18:26 +00:00
drisspg
cc20b7ad72 [FlexFlash] update names (#166193)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166193
Approved by: https://github.com/BoyuanFeng
2025-10-25 00:07:11 +00:00
Shunting Zhang
bc11a42b3f [inductor][ez] fix score fusion memory typo (#166029)
Fix https://github.com/pytorch/pytorch/issues/165724.
The typo does not affect the compilation result. It just affects compilation time a little bit.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166029
Approved by: https://github.com/eellison
2025-10-24 23:48:05 +00:00
Catherine Lee
4fc06f2e0a Use std::min for #165927 (#166199)
Summary: Like D85463674 (pr https://github.com/pytorch/pytorch/pull/166195) but for D85357351 (https://github.com/pytorch/pytorch/pull/165927)

Differential Revision: D85464917

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166199
Approved by: https://github.com/Camyll, https://github.com/malfet, https://github.com/Skylion007
2025-10-24 23:19:00 +00:00
Malay Bag
82473c3d59 [torch.export] Add original module type to UnflattenedModule class (#166145)
Summary: Currently all submodules of an UnflattenedModule carry the original type name. This diff adds the original module type to the UnflattenedModule class itself.

Test Plan:
```
buck test mode/opt caffe2/test:test_export
```
https://www.internalfb.com/intern/testinfra/testrun/17732923654320197

Differential Revision: D85373454

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166145
Approved by: https://github.com/angelayi
2025-10-24 22:47:29 +00:00
Richard Zou
b6a4236e5d [label_to_label] minor updates (#166172)
vllm-compile implies "module: vllm" and "oncall: pt2".
The volume of issues in Flex -> HigherOrderOperators is too noisy,
plus we have a different set of folks looking at each, so I'm going to
make that not automatic anymore. We can still manually label flex issues
as higher order operator issues.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166172
Approved by: https://github.com/angelayi
2025-10-24 22:47:23 +00:00
Ti-Tai Wang
b04173be9b [ONNX] Add a test to backed_size_oblivious patch in onnx (#166196)
Follow-up https://github.com/pytorch/pytorch/pull/166151

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166196
Approved by: https://github.com/justinchuby
2025-10-24 22:47:10 +00:00
Catherine Lee
32ac38f85d [lint] workflow consistency linter to look at all files instead of just changed files (#165171)
As in title

If you change only one workflow file, lintrunner (default arg, also the one in CI since it only inputs changed files) won't look at other files in the repo, but the sync-tag might come from those other files

This makes it so that it looks at all workflow files so it will catch those failures

Also change output line so it prints which file + which job it is different from

Pros:
catches errors

Cons:
unusual behavior (getting around what lintrunner says the linter should run on)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165171
Approved by: https://github.com/malfet, https://github.com/izaitsevfb, https://github.com/atalman
2025-10-24 21:43:18 +00:00
Kurt Mohler
c9b49e506e [MPS] Add linalg.householder_product for MPS (#166090)
Fixes #166089
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166090
Approved by: https://github.com/malfet
2025-10-24 21:13:56 +00:00
Xiao
6038e476e8 [Dynamo][Logging]Fix regression on stack adding to latest bytecode by… (#165946)
… adding verbose check (#165926)

[ghstack-poisoned]

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165946
Approved by: https://github.com/williamwen42
2025-10-24 20:36:50 +00:00
Shunting Zhang
2c851c16e5 [FX][ez] fix the split_module tutorial code (#166154)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166154
Approved by: https://github.com/BoyuanFeng
2025-10-24 20:16:04 +00:00
Edward Yang
31584f2d91 Add a Claude skill for writing docstrings. (#166175)
Generated with prompt:

> torch/_tensor_docs.py and torch/nn/functional.py contain the "gold standard" for docstrings in the PyTorch project. Write a skill describing how to write a docstring for a function/method in the PyTorch project. Note that add_docstring is specifically for C binded functions; a native Python function can just be a direct docstring. Sphinx is used to generate docs.

Signed-off-by: Edward Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166175
Approved by: https://github.com/Skylion007
2025-10-24 20:05:44 +00:00
Blaine Burton Rister
0442125362 [Inductor] Restore original dtype for rank-0 CPU tensors (#166118)
# Problem
Inductor implicitly upcasts certain rank-0 kernel arguments from float16 to float32. Currently, this happens only on the `"cpu"` device, which appears to be related to float16 support in CPU Triton. However, it can also affect the behavior of GPU kernels, when a model contains tensors from multiple devices. Upcasting may be undesirable on some platforms, so users can typically disable it with the `config.triton.codegen_upcast_to_fp32` flag. However, this flag was not respected by the rank-0 kernel argument codepath.

Through an improbable series of events, float32 upcasting caused an internal model to fail compilation on MTIA. (Internal reviewers see T242444110.)

# Fix
If `config.triton.codegen_upcast_to_fp32` evaluates to `False`, cast the kernel argument to the original dtype.
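
A hedged usage sketch of the flag this fix respects (mixed CPU/GPU setup mirroring the test plan; requires a CUDA device; flag name taken from the description above):

```python
import torch
import torch._inductor.config as inductor_config

# The fix makes the rank-0 kernel-argument codepath honor this flag.
inductor_config.triton.codegen_upcast_to_fp32 = False

def f(x, scale):
    # x lives on the GPU; scale is a rank-0 float16 CPU tensor passed as a kernel argument.
    return x * scale

compiled = torch.compile(f)
x = torch.randn(1024, device="cuda", dtype=torch.float16)
scale = torch.tensor(0.5, dtype=torch.float16)  # rank-0 CPU tensor
print(compiled(x, scale).dtype)  # torch.float16 (the flag controls the dtype used inside the generated kernel)
```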

# Test plan
Added a new CI test checking for the downcast iff the config flag is false. The test mixes GPU and CPU tensors to generate a GPU kernel with the implicit float32 upcast and explicit float16 downcast.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166118
Approved by: https://github.com/jfix71, https://github.com/jansel, https://github.com/kundaMwiza
2025-10-24 19:59:25 +00:00
Yang Wang
fdcf402d82 vllm test build (#166146)
Fix the vllm test build; it's broken due to the flashinfer dependency.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166146
Approved by: https://github.com/huydhn
2025-10-24 19:18:10 +00:00
Ahmad Sarvmeily
13cda9b89e Allow BlockDescriptorOptions classes to be overridden In TritonKernel (#165899)
By allowing the options classes (`BlockPtrOptions`/`TensorDescriptorOptions`) to be overridden in `TritonKernel`, subclasses with custom behaviour can be used in place of them, which provides greater flexibility.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165899
Approved by: https://github.com/jansel
2025-10-24 18:59:59 +00:00
Isalia20
fa6d911dda [MPS] Sparse mul enable tests and fix on MPS (#166164)
Apparently the mul tests in test_sparse were disabled. The dense representation, i.e. when nnz is not a scalar, was broken on MPS. This PR fixes it and enables the tests in test_sparse.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166164
Approved by: https://github.com/malfet
2025-10-24 18:30:30 +00:00
Shunting Zhang
0db6bcc015 Fix accuracy for layernorm/rmsnorm benchmarking (#166005)
Example command:
    python benchmarks/dynamo/genai_layers/benchmark.py --exit-on-accuracy-failure --tolerance=1e-2 rmsnorm_backward

Fix the accuracy problem for layernorm/rmsnorm fwd/bwd.
Also fix some quack calls (maybe due to quack API change)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166005
Approved by: https://github.com/BoyuanFeng
2025-10-24 18:14:51 +00:00
eqy
60ac039998 [CUDA][Grouped Gemm] remove xFail on Group GEMM tests after fallback was added (#165378)
https://github.com/pytorch/pytorch/pull/162059 means we get unexpected successes now on e.g., SM 12.0

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165378
Approved by: https://github.com/Skylion007
2025-10-24 17:42:40 +00:00
PyTorch MergeBot
380d440d1c Revert "inductor: avoid unrolling argmin/argmax reductions to preserve index … (#164040)"
This reverts commit 9038a30cee.

Reverted https://github.com/pytorch/pytorch/pull/164040 on behalf of https://github.com/karthickai due to Kindly add the test case mentioned in the issue ([comment](https://github.com/pytorch/pytorch/pull/164040#issuecomment-3444137989))
2025-10-24 17:14:45 +00:00
Jupiter-Guy
9038a30cee inductor: avoid unrolling argmin/argmax reductions to preserve index … (#164040)
…semantics on views; add regression test for transposed mutation (fixes #163929)

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164040
Approved by: https://github.com/ngimel, https://github.com/jansel
2025-10-24 16:37:43 +00:00
PyTorch MergeBot
690c8c13b9 Revert "Export should use aot_export_joint_with_descriptors (#165931)"
This reverts commit 882b834082.

Reverted https://github.com/pytorch/pytorch/pull/165931 on behalf of https://github.com/clee2000 due to breaking internal tests D85084301 for test_auto_functionalize?  I checked that they did run on OSS CI so I'm not entirely sure whats going on, I assume its the IS_FBCODE stuff ([comment](https://github.com/pytorch/pytorch/pull/165931#issuecomment-3443887361))
2025-10-24 16:02:20 +00:00
PyTorch MergeBot
28ee6b62ed Revert "[DeviceMesh] Implement a device mesh concatenate api for submesh and SPMD use case (#163358)"
This reverts commit 5a4997dcae.

Reverted https://github.com/pytorch/pytorch/pull/163358 on behalf of https://github.com/clee2000 due to probably need to revert this one  too, its stacked with https://github.com/pytorch/pytorch/pull/166003#issuecomment-3443668389 ([comment](https://github.com/pytorch/pytorch/pull/163358#issuecomment-3443874910))
2025-10-24 15:58:54 +00:00
PyTorch MergeBot
81577bdb3f Revert "[DeviceMesh] Use _flatten_rank_map to replace _flatten_mesh_list so that we don't need to compare root mesh (#166003)"
This reverts commit 8625ffbd45.

Reverted https://github.com/pytorch/pytorch/pull/166003 on behalf of https://github.com/clee2000 due to failing internal tests D85405179 I believe there are uses of _flatten_mesh_list internally that need to be updated ([comment](https://github.com/pytorch/pytorch/pull/166003#issuecomment-3443668389))
2025-10-24 15:14:23 +00:00
Yuanyuan Chen
e67e3d95f3 Simplify the CUPTI CMake check for kineto (#161370)
Simplify the CUPTI check because kineto has used `CUDA::cupti`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161370
Approved by: https://github.com/Skylion007
2025-10-24 08:13:17 +00:00
eellison
27af8480ea Refactor api and configs of overlapping (#166130)
- pass important configs values directly into the class
- migrate those configs from `test_configs` to another class
- add an (off by default) config to enable inside inductor, instead of requiring a custom post pass

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166130
Approved by: https://github.com/bdhirsh
2025-10-24 07:03:54 +00:00
Pian Pawakapan
6494cdc40c [DebugMode] add nn.Module tracking (#165498)
Uses ModTracker to record nn.Module entries, much like CommDebugMode.

Can be switched on with `DebugMode(record_nn_module=True)`:
```
    [nn.Mod] Bar
      [nn.Mod] Bar.abc
        [nn.Mod] Bar.abc.l1
          aten::t(t: f32[4, 4])
          aten::addmm(t: f32[4], t: f32[4, 4], t: f32[4, 4])
        [nn.Mod] Bar.abc.l2
          aten::t(t: f32[4, 4])
          aten::addmm(t: f32[4], t: f32[4, 4], t: f32[4, 4])
      [nn.Mod] Bar.xyz
        aten::t(t: f32[4, 4])
        aten::addmm(t: f32[4], t: f32[4, 4], t: f32[4, 4])"""
```
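
A minimal usage sketch, assuming `DebugMode` is importable from `torch.utils._debug_mode` and exposes `debug_string()` (the module names above come from the PR's test):

```python
import torch
from torch.utils._debug_mode import DebugMode  # assumed import path

model = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.Linear(4, 4))
with DebugMode(record_nn_module=True) as debug_mode:
    model(torch.randn(4, 4))
# Prints an [nn.Mod]-annotated call tree like the one shown above.
print(debug_mode.debug_string())
```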

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165498
Approved by: https://github.com/SherlockNoMad
2025-10-24 05:08:33 +00:00
Shintaro Iwasaki
ac7074efa2 [CUDA][cuBLAS] Fix a compilation issue in #163955 when CUDA_VERSION < 12010 (#166137)
Summary:
This PR fixes a compilation issue when `CUDA_VERSION < 12010`. Even if we might drop old CUDA support, let's correct the code itself.

## Issue
When `CUDA_VERSION` is less than `12010`, the following does not compile.
```
      mat1_sizes[0] > 1 && mat1_sizes[1] > 1 &&
      mat2_sizes[0] > 1 && mat2_sizes[1] > 1
      #if !(defined(CUDA_VERSION) && CUDA_VERSION >= 12010 || defined(USE_ROCM))
      // Here the "&&" is missing
      mat2_sizes[0] < 65535 * 32 && mat2_sizes[1] < 65535 * 32 &&
```
This patch adds the missing "&&".

Test Plan: CI

Differential Revision: D85356831

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166137
Approved by: https://github.com/ngimel, https://github.com/cyyever
2025-10-24 04:06:03 +00:00
Darshan Sanghani
263901cec4 [pytorch/kineto] Update Kineto Submodule (#166150)
Summary: Update to include some race condition fixes.

Test Plan: n/a

Differential Revision: D85390799

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166150
Approved by: https://github.com/sraikund16, https://github.com/cyyever
2025-10-24 04:03:13 +00:00
Ti-Tai Wang
c12293dcbe [ONNX] Cover all FX passes into backed size oblivious (#166151)
Found a bug where, after `run_decompositions()`, the shape could be fixed to 1. It's caused by the fact that all FX graph surgery related to shape inference should happen inside the backed-size-oblivious patch.

```python
import torch
from transformers.models.phi3.modeling_phi3 import Phi3RMSNorm

# Previous to this PR, this will generate a fixed batch size
op = torch.onnx.export(
    Phi3RMSNorm(256).eval(),
    args=(),
    kwargs={"hidden_states": torch.rand((1, 32, 256))},
    dynamic_shapes={"hidden_states": {0: torch.export.Dim.DYNAMIC, 1: torch.export.Dim.DYNAMIC}},
)

# It is dynamic when it's only in torch.export
with torch.fx.experimental._config.patch(backed_size_oblivious=True):
    ep = torch.onnx.export(
        Phi3RMSNorm(256).eval(),
        args=(),
        kwargs={"hidden_states": torch.rand((1, 32, 256))},
        dynamic_shapes={"hidden_states": {0: torch.export.Dim.DYNAMIC, 1: torch.export.Dim.DYNAMIC}},
    )
# But when run_decompositions() is called outside of the patch, it is static.
# ep = ep.run_decompositions()
print(ep)

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166151
Approved by: https://github.com/justinchuby
2025-10-24 03:25:16 +00:00
fduwjj
5a4997dcae [DeviceMesh] Implement a device mesh concatenate api for submesh and SPMD use case (#163358)
Today FSDP needs to slice out the SPMD mesh from the root mesh here: https://github.com/pytorch/pytorch/blob/main/torch/distributed/fsdp/_fully_shard/_fsdp_param.py#L301. But essentially, what users want is to concatenate some submeshes into a big mesh that is used as an SPMD mesh. This PR tentatively implements this API for users.

One thing to note is that all submeshes need to be sliced/flattened or unflattened from the same root mesh; otherwise the indices make no sense when it comes to mesh indexing and device allocation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163358
Approved by: https://github.com/fegin
ghstack dependencies: #166003
2025-10-23 23:31:17 +00:00