Commit Graph

88238 Commits

Author SHA1 Message Date
PyTorch UpdateBot
db1f33147b [audio hash update] update the pinned audio hash (#154001)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154001
Approved by: https://github.com/pytorchbot
2025-05-23 03:51:21 +00:00
Laith Sakka
c1055f41a6 Data dependent free reshape. (#153198)
#### Change 1: if compute_strides fails for a reshape, just clone.

Let's consider the most general case: if torch.compile is asked to reshape [u0, u1][u3, u4] -> [u5, u6], what should it do?
The shape is general enough to represent both contiguous and non-contiguous tensors, including tensors for which a clone-free reshape is possible and others for which it is not. The current algorithm fails with data-dependent errors.

The general idea is that if it is impossible to tell whether the reshape can happen in place (because for some concrete inputs
it can and for others it cannot), then it is OK to take the general path and clone, instead of failing or asking the user to give hints.
**Because the user wants a single graph (a single compilation)**, this is the only way it can be done.
Had this been a view, the user would be explicitly asking for a copy-free reshape, and we would fail and ask for more
information (hints in the form of torch checks).

With this change, reshape works as follows:
1. If we know the input is contiguous, we convert the reshape to a view.
2. If compute_strides succeeds, we use a view. (compute_strides was changed so that it no longer fails when unbacked symbols are present; instead it returns nullptr when it cannot compute the strides, meaning we should use a clone.)
3. If neither 1 nor 2 works, we clone and then use a view.
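
The three steps above can be sketched in plain Python. This is an illustrative model only, not the PyTorch implementation: `plan_reshape`, `known_contiguous`, and `try_compute_strides` are hypothetical names, and `None` stands in for the `nullptr` that compute_strides returns when it cannot decide.

```python
from typing import Callable, Optional, Tuple

Strides = Tuple[int, ...]


def plan_reshape(
    known_contiguous: bool,
    try_compute_strides: Callable[[], Optional[Strides]],
) -> str:
    """Return which path a reshape would take under the three rules above."""
    # 1. Input known to be contiguous: the reshape is just a view.
    if known_contiguous:
        return "view"
    # 2. Strides for a clone-free reshape could be computed: use a view.
    if try_compute_strides() is not None:
        return "view"
    # 3. Undecidable (for example, unbacked sizes): clone, then view the clone.
    return "clone+view"


print(plan_reshape(True, lambda: None))     # view
print(plan_reshape(False, lambda: (4, 1)))  # view
print(plan_reshape(False, lambda: None))    # clone+view
```

The point of the third branch is that an undecidable case is treated as a reason to clone rather than a reason to fail.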

Side note: having a view does not mean that inductor will not clone. Inductor has a pass that converts all views back to reshapes, and it has its own logic for dealing with those.

#### Change 2: skip _reshape_view_helper and fall back to simpler logic if it fails.
We trace _reshape_view_helper during fake tensor tracing, but not during proxy tracing. Hence such tracing won't affect the graph (it only computes the output shapes of several operations). We should not fail there, because it should always be possible to pass it in the case of a reshape.

That is, by the time reshape_symint is called we have either cloned or compute_strides has succeeded, so the view should pass. What I did is the following: we run _reshape_view_helper; if it fails due to unbacked symbols, we call _view_simple, which always succeeds for reshapes. (It might fail for views when the view is impossible; in that case we throw the DDE that was thrown by the original algorithm.)
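
The fallback described here has a try-then-fall-back shape. A minimal sketch, assuming a hypothetical `DataDependentError` exception and stand-in callables in place of `_reshape_view_helper` and `_view_simple` (the real functions operate on tensors and are not reproduced here):

```python
class DataDependentError(RuntimeError):
    """Hypothetical stand-in for the data-dependent error (DDE)."""


def view_with_fallback(general_helper, simple_view, *args):
    """Try the general helper first; on a DDE, try the simpler view logic."""
    try:
        return general_helper(*args)
    except DataDependentError as original_error:
        try:
            return simple_view(*args)
        except Exception:
            # The simple path can fail for true views: surface the original DDE.
            raise original_error from None


def failing_helper(shape):
    # Stand-in for the general helper hitting unbacked sizes.
    raise DataDependentError("cannot decide with unbacked sizes")


def simple(shape):
    # Stand-in for the simpler logic that always succeeds for reshapes.
    return tuple(shape)


print(view_with_fallback(failing_helper, simple, [2, 3]))  # (2, 3)
```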

Ideally I would want to register _view_simple as the meta for view and avoid calling _reshape_view_helper completely, but I am running into some issues with the dispatcher with subclasses and I do not have time to debug them. Namely, one test
would end up calling a C++ view function that does not support symints during meta dispatch when I register a
Python meta decomposition:
`python test/dynamo/test_subclasses.py SubclassTests.test_subclass_views_dynamic_True`
https://github.com/pytorch/pytorch/issues/153303. I will follow up with that change in a separate PR.  cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @bdhirsh

Two other alternatives to registering _view_simple as meta, or to the try-catch approach in this PR, are:
 1. Call _view_simple if any input is dynamic; see #153521.
 2. If we make is_compiling work for framework code tracing (it does not work right now), we can call _view_simple
 if is_compiling.

#### Note:
Reshape can still fail when is_contiguous is called. The next PR will handle that by calling is_known_contiguous.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153198
Approved by: https://github.com/etaf, https://github.com/bobrenjc93
2025-05-23 01:45:16 +00:00
Ruisi Zhang
f74842d665 [DTensor] enable SimpleFSDP's composability with Tensor Parallel (#152286)
This PR adds support for SimpleFSDP's composability with Tensor Parallel + torch.compile.

`_StridedShard` is used in SimpleFSDP/FSDP2 to support correct distributed checkpointing when FSDP+TP is applied. Previously, `_StridedShard` was not guarded by torch.compile. This PR adds `_StridedShard` as an additional placement type to be guarded by torch.compile.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152286
Approved by: https://github.com/bdhirsh
2025-05-23 01:40:38 +00:00
Huy Do
7509b150af Don't upload compiler benchmark debug info to the benchmark database (#153769)
During our debug session, @wdvr and I found out that the benchmark database is growing much faster than we expected.  After taking a closer look, the majority of the records come from the TorchInductor benchmark, and the top 3 are all debug information not used by any dashboard at the moment.  Over a period of 7 days, there are close to 6 million records ([query](https://paste.sh/GUVCBa0v#UzszFCZaWQxh7oSVsZtfZdVE)).

```
Benchmark,Metric,Count
"TorchInductor","user_stack","1926014"
"TorchInductor","reason","1926014"
"TorchInductor","model","1926014"
```

Let's skip uploading them to avoid bloating the database.
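
As a quick check of the scale, the three rows of the query result can be summed. The snippet below only re-adds the counts shown in this message; it does not query the database.

```python
import csv
import io

# Rows copied from the query result in this commit message.
QUERY_RESULT = """Benchmark,Metric,Count
"TorchInductor","user_stack","1926014"
"TorchInductor","reason","1926014"
"TorchInductor","model","1926014"
"""

rows = list(csv.DictReader(io.StringIO(QUERY_RESULT)))
debug_total = sum(int(row["Count"]) for row in rows)
print(debug_total)  # 5778042
```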

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153769
Approved by: https://github.com/malfet
2025-05-23 01:18:26 +00:00
Benjamin Glass
768cb734ec cpp_wrapper: build non-performance-sensitive code at O1 (#148773)
Builds on #148212, applying the same improvements to `cpp_wrapper` mode.

Benchmark results:

* [A100 Benchmarks](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Wed%2C%2014%20May%202025%2015%3A10%3A05%20GMT&stopTime=Wed%2C%2021%20May%202025%2015%3A10%3A05%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(a100)&lBranch=gh/benjaminglass1/77/orig&lCommit=ca7d0a3f16e3c511534d2cd03d695be8524570d3&rBranch=main&rCommit=1075bb37d34e483763a09c7810790d5491441e13)
* [x86 Benchmarks](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Wed%2C%2014%20May%202025%2015%3A10%3A05%20GMT&stopTime=Wed%2C%2021%20May%202025%2015%3A10%3A05%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cpu%20(x86)&lBranch=gh/benjaminglass1/77/orig&lCommit=ca7d0a3f16e3c511534d2cd03d695be8524570d3&rBranch=main&rCommit=1075bb37d34e483763a09c7810790d5491441e13)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148773
Approved by: https://github.com/desertfire
2025-05-23 00:51:20 +00:00
Svetlana Karslioglu
3c0cbf4b44 Update GH action to use the correct label (#154126)
Update GH action to use the correct label for the docathon

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154126
Approved by: https://github.com/AlannaBurke, https://github.com/clee2000
2025-05-23 00:29:43 +00:00
Aaron Gokaslan
31f3ee0966 [BE][Ez]: Enable PT014 check for duplicate parameterize test cases (#154118)
Enables the Ruff rule [PT014](https://docs.astral.sh/ruff/rules/pytest-duplicate-parametrize-test-cases/), which flags cases where a user specifies duplicate test cases in pytest.mark.parametrize; this is likely an error since it tests the same thing twice.
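
To illustrate what counts as a duplicate case, here is a rough plain-Python approximation. The helper `find_duplicate_cases` is hypothetical and is not how Ruff implements the rule, which analyzes the source code of the decorator arguments.

```python
def find_duplicate_cases(cases):
    """Return the cases that repeat an earlier entry, in order of appearance."""
    seen = []
    duplicates = []
    for case in cases:
        if case in seen:
            duplicates.append(case)
        else:
            seen.append(case)
    return duplicates


# The kind of list one might pass to @pytest.mark.parametrize("a, b", cases);
# the last entry repeats the first, so the same test would run twice.
cases = [(1, 2), (3, 4), (1, 2)]
print(find_duplicate_cases(cases))  # [(1, 2)]
```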
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154118
Approved by: https://github.com/malfet
2025-05-23 00:00:53 +00:00
xinan.lin
7b25ff7cf2 [Inductor] Add attention pattern for model DistilBert in transformers==4.44.2. (#154091)
This PR adds an attention fusion pattern that matches the attention of
DistilBert in transformers==4.44.2 at
953196a43d/src/transformers/models/distilbert/modeling_distilbert.py (L212)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154091
Approved by: https://github.com/jansel, https://github.com/eellison
2025-05-22 23:37:03 +00:00
PyTorch MergeBot
59c5fff2aa Revert "[DDP] rebuilt bucket order when find_unused_parameters=true (#153404)"
This reverts commit a79e621c1c.

Reverted https://github.com/pytorch/pytorch/pull/153404 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/153404#issuecomment-2902741300))
2025-05-22 22:26:59 +00:00
Yuxuan Chen
f2cce45657 [libc++ readiness][caffe2] No reason to check for "ext/stdio_filebuf.h" (#154080)
Summary: There should be no reason to check for existence of this GNU C++ header here in this file. It doesn't include it. Removing this condition to make it build under libc++.

Differential Revision: D75179136

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154080
Approved by: https://github.com/soumith
2025-05-22 22:23:39 +00:00
Chen Lai
c985cec5b2 Patch the _is_conv_node function (#153749)
Summary: torch.ops.aten.conv2d.padding is also conv2d node

Differential Revision: D74898941

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153749
Approved by: https://github.com/andrewor14, https://github.com/Skylion007
2025-05-22 22:17:02 +00:00
bobrenjc93
413664b3c5 catch CSE recursion depth errors (#154039)
Fixes #153777

CSE is an optimization and shouldn't block a compile if it hits recursion depth limits. Unfortunately we can't write this iteratively due to a dependency on `ast.unparse`, which necessarily needs to do recursion. This PR catches recursion depth errors and opts out of CSE when they occur.
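
The general pattern can be sketched with the standard library alone. This is an illustration of catching `RecursionError` around `ast.unparse`, not the code from this PR; `unparse_or_none` and `left_nested_sum` are hypothetical helpers.

```python
import ast


def unparse_or_none(node: ast.AST):
    """Return source text for `node`, or None if unparsing recurses too deeply."""
    try:
        return ast.unparse(node)
    except RecursionError:
        # Treat the failure as "skip the optimization", not as a hard error.
        return None


def left_nested_sum(depth: int) -> ast.expr:
    """Build x + x + ... + x iteratively as a left-nested BinOp chain."""
    node: ast.expr = ast.Name(id="x", ctx=ast.Load())
    for _ in range(depth):
        node = ast.BinOp(
            left=node, op=ast.Add(), right=ast.Name(id="x", ctx=ast.Load())
        )
    return node


print(unparse_or_none(left_nested_sum(2)))             # x + x + x
print(unparse_or_none(left_nested_sum(5000)) is None)  # True with the default recursion limit
```

A caller can then treat `None` as a signal to leave the expression untouched and continue compiling.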

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154039
Approved by: https://github.com/Microve
2025-05-22 20:17:19 +00:00
Rachel Guo
cad0727fe1 Rename the provenance tracing artifact name for kernel <-> post_grad nodes mapping (#154046)
Summary:
Context:

Recently we've added support for a couple more kernel types besides inductor-generated triton kernels, such as CPU C++ kernels and extern kernels.

The name that appears in the tlparse chrome link can be confusing to users.

Rename from

`inductor_triton_kernel_to_post_grad_nodes.json`

to `inductor_generated_kernel_to_post_grad_nodes.json`

Test Plan: CI

Differential Revision: D75159042

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154046
Approved by: https://github.com/yushangdi
2025-05-22 19:20:56 +00:00
atalman
4277907d02 [binary builds] Linux aarch64 CUDA builds. Make sure tag is set correctly (#154045)
1. This should set the Manylinux 2.28 tag correctly for CUDA Aarch builds.
I believe we used to have something similar in the old script:
https://github.com/pytorch/pytorch/blob/main/.ci/aarch64_linux/build_aarch64_wheel.py#L811

``Tag: cp311-cp311-linux_aarch64 ``-> ``Tag: cp311-cp311-manylinux_2_28_aarch64``

2. Remove the section for CUDA 12.6, since we are no longer building CUDA 12.6 aarch64 builds
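
The tag change in item 1 amounts to rewriting the platform part of the wheel tag. A minimal sketch of that string transformation, assuming a hypothetical helper `to_manylinux_tag`; it is not the code used by the build scripts:

```python
def to_manylinux_tag(tag: str, glibc: str = "2_28") -> str:
    """Rewrite a plain linux platform tag to its manylinux form (illustration only)."""
    python_tag, abi_tag, platform = tag.split("-", 2)
    prefix = "linux_"
    if platform.startswith(prefix):
        platform = f"manylinux_{glibc}_{platform[len(prefix):]}"
    return f"{python_tag}-{abi_tag}-{platform}"


print(to_manylinux_tag("cp311-cp311-linux_aarch64"))
# cp311-cp311-manylinux_2_28_aarch64
```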

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154045
Approved by: https://github.com/Camyll, https://github.com/malfet
2025-05-22 18:36:13 +00:00
Menglu Yu
788d9cb2d7 [3/n][Optimus][Auto-AC][reland] Support any fp8 quantization type and set scaling as the default" (#154057)
Summary:
This is a reland of D74910193.
We change the dtype to torch.float8_e5m2 in unit test since it is not supported.

Test Plan:
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:quantization
```

Differential Revision: D75169792

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154057
Approved by: https://github.com/Mingming-Ding
2025-05-22 18:26:34 +00:00
skishore
c2660d29a5 [ROCm] Added unit test to test the cuda_pluggable allocator (#154041)
Added a unit test that includes the cuda_pluggable allocator and replicates the apex setup.py to build the nccl_allocator extension.

This test checks whether https://github.com/pytorch/pytorch/pull/152179 helps to build the CUDA pluggable allocator in ROCm/Apex.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154041
Approved by: https://github.com/atalman, https://github.com/jeffdaily

Co-authored-by: Jithun Nair <jithun.nair@amd.com>
2025-05-22 18:22:15 +00:00
Menglu Yu
5b8f422561 [PT2][Optimus] Fix a typo in decompose_mm (#154048)
Summary: As titled

Differential Revision: D75160513

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154048
Approved by: https://github.com/Mingming-Ding
2025-05-22 18:11:40 +00:00
Nikita Shulga
633ed01145 [MPS] Add support for two more isin variants (#154010)
`isin_Tensor_Scalar_out` is just a redispatch to eq/neq
`isin_Scalar_Tensor_out` redispatches back to generic `isin` op, but needs a small tweak to handle float scalars
Make sure that `out` is resized to an expected value in `isin_Tensor_Tensor_out_mps`

Add unittests to validate that, but skip them on MacOS-13, where MPS op just returns garbage

Before this change both of those failed
```python
>>> import torch
>>> t = torch.tensor([0, 1, 2], device='mps')
>>> torch.isin(t, 1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NotImplementedError: The operator 'aten::isin.Tensor_Scalar_out' is not currently implemented for the MPS device. If you want this op to be considered for addition please comment on https://github.com/pytorch/pytorch/issues/141287 and mention use-case, that resulted in missing op as well as commit hash 3b875c25ea6d8802a0c53af9eb961ddf2f058188. As a temporary fix, you can set the environment variable `PYTORCH_ENABLE_MPS_FALLBACK=1` to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.
>>> torch.isin(1, t)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NotImplementedError: The operator 'aten::isin.Scalar_Tensor_out' is not currently implemented for the MPS device. If you want this op to be considered for addition please comment on https://github.com/pytorch/pytorch/issues/141287 and mention use-case, that resulted in missing op as well as commit hash 3b875c25ea6d8802a0c53af9eb961ddf2f058188. As a temporary fix, you can set the environment variable `PYTORCH_ENABLE_MPS_FALLBACK=1` to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154010
Approved by: https://github.com/Skylion007, https://github.com/dcci, https://github.com/manuelcandales
ghstack dependencies: #153970, #153971, #153997
2025-05-22 17:59:35 +00:00
Xu Han
7421c21b5e remove unused code. (#153979)
Remove the unused cmake code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153979
Approved by: https://github.com/albanD
2025-05-22 17:50:11 +00:00
Yidi Wu
fc859077a0 [export][cond] support merging constant ints as unbacked symint (#152742)
@pianpwk points out that this will be helpful to address several data dependent issues in huggingface [models](e23705e557/src/diffusers/schedulers/scheduling_euler_ancestral_discrete.py (L332)) with the following pattern:
```python
idx = 0 if u0 else 1  # u0 is a data-dependent (unbacked) value
return x[idx]
```
We could preserve the conditional with a cond.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152742
Approved by: https://github.com/zou3519
2025-05-22 17:25:38 +00:00
PyTorch MergeBot
025c5cc048 Revert "[inductor][cutlass backend] Add 2 stage autotuning aka prescreening (#153335)"
This reverts commit d23762974e.

Reverted https://github.com/pytorch/pytorch/pull/153335 on behalf of https://github.com/yangw-dev due to sorry the pr is failed internally [D75155648](https://www.internalfb.com/diff/D75155648) ([comment](https://github.com/pytorch/pytorch/pull/153335#issuecomment-2901916364))
2025-05-22 16:52:04 +00:00
PyTorch MergeBot
7d3dab6b90 Revert "[BE]: Type previously untyped decorators (#153726)"
This reverts commit b7d08defe9.

Reverted https://github.com/pytorch/pytorch/pull/153726 on behalf of https://github.com/yangw-dev due to sorry, it seems like your pr failed typecheck error internally, [D75155486](https://www.internalfb.com/diff/D75155486) ([comment](https://github.com/pytorch/pytorch/pull/153726#issuecomment-2901911114))
2025-05-22 16:49:08 +00:00
Michael Lazos
a15550b776 [Cutlass] Use env var for EVT flag (#154099)
Swaps out hard flag for environment variable in inductor config.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154099
Approved by: https://github.com/eellison
2025-05-22 16:36:57 +00:00
PyTorch MergeBot
a82c8891d5 Revert "[aoti] Add MPS runner and shim (#153964)"
This reverts commit 918ae5d361.

Reverted https://github.com/pytorch/pytorch/pull/153964 on behalf of https://github.com/angelayi due to broke frl build ([comment](https://github.com/pytorch/pytorch/pull/153964#issuecomment-2901876832))
2025-05-22 16:35:59 +00:00
PyTorch MergeBot
47a01f3efb Revert "[aoti] Initial Metal support (#153959)"
This reverts commit 28bcd9eb30.

Reverted https://github.com/pytorch/pytorch/pull/153959 on behalf of https://github.com/angelayi due to previous PR broke frl build ([comment](https://github.com/pytorch/pytorch/pull/153959#issuecomment-2901825315))
2025-05-22 16:17:07 +00:00
Isuru Fernando
f419373dd3 [inductor] lowering for fractional_max_pool3d (#148630)
also a lowering with a reduction for large window_sizes for
fractional_max_pool2d

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148630
Approved by: https://github.com/eellison
2025-05-22 16:06:29 +00:00
Tom Ritchford
9a8c42ff94 Get rid of unused code in linters (#154043)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154043
Approved by: https://github.com/XuehaiPan, https://github.com/Skylion007
2025-05-22 15:24:54 +00:00
eellison
35ddad284d update mutation renames (#153895)
Thanks to @PaulZhang12 for original find. When we finalize a multi template buffer, we need to reflect mutation renaming in dependencies.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153895
Approved by: https://github.com/PaulZhang12
2025-05-22 14:54:39 +00:00
Huy Do
6cd9d66b7f Allow higher fp16 tolerance for phlippe_resnet on CUDA 12.8 (#154109)
After https://github.com/pytorch/pytorch/pull/154004, one of the models, `phlippe_resnet`, needs a higher tolerance for fp16 on CUDA 12.8.  I can reproduce it locally with:

```
python benchmarks/dynamo/torchbench.py --accuracy --timing --explain --print-compilation-time --inductor --device cuda --training --amp --only phlippe_resnet

E0522 02:47:12.392000 2130213 site-packages/torch/_dynamo/utils.py:2949] RMSE (res-fp64): 0.00144, (ref-fp64): 0.00036 and shape=torch.Size([]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000, use_larger_multiplier_for_smaller_tensor: 0
```

I'm not sure what exactly happens behind the scenes, but this should help fix the CI failure.

Also remove some leftover expected accuracy results for CUDA 12.4, which we no longer use on CI for benchmark jobs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154109
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-05-22 14:25:12 +00:00
IvanKobzarev
4439255148 [aotd] Support saved tensors hooks in aot_autograd (#150032)
https://github.com/pytorch/pytorch/issues/148222

Goal:

At the moment, autograd saved tensors hooks are run in eager mode after the compiled forward.
They are executed at the same time for all saved tensors.
Hooks can be used to reduce the amount of memory used for saved tensors, by doing quantization or offloading to CPU.
This is suboptimal for optimizing peak memory.
A better solution is to put the hooks in the graph, as close as possible to the last usage of the tensor.

The goal is to get user-specified autograd saved tensors hooks into the graph.

Logic:

UX:
If the user specifies `with torch.autograd.graph.saved_tensors_hooks(pack_gm, unpack_gm)`,
where pack_gm and unpack_gm are torch.fx.GraphModule instances,
then AotAutograd retraces those graph modules, doing decompositions and functionalization in aot_autograd, and inlines the resulting graphs in the forward epilogue and backward prologue.

The user may want to use control logic in the hooks, for example applying quantization only for specific dtypes and sizes.

This is also possible: the user can put the logic into a torch.fx.wrap function and use symbolic trace to make a GraphModule.

In that case AotAutograd caching works only when the user explicitly sets "user_cache_hash" metadata on the torch.fx.wrap call_function node.

If this metadata is set, the aot_autograd cache can use the saved cache artifact.
If the metadata is not set, the cache is bypassed.

Dynamo:
Dynamo traces the pack and unpack hooks, installs them as subgraphs, and explicitly adds them to the output_graph (as those subgraphs are not used, they would not be copied into the result by default).

The complexity here is that at this moment we do not have example inputs for the hooks.
We trace pack_hook with some Tensor from the inputs.
The resulting subgraphs are added to the hashing of the AotAutograd cache.

In AotAutograd we retrace the graph with the true saved tensors coming from the partitioner.

Backwards Compatibility:
As current hooks are executed in eager mode and not all of them will be traceable, we only try to put into the graph the hooks explicitly marked by the user with an annotation (@_inlineable_saved_tensors_hooks).
For other hooks, or if compiled autograd is enabled, we keep the same logic.

Recompilations:
Hooks are guarded with a lambda guard matching the function id, to cause recompilation if the user reruns the compiled function.

Aot_autograd:
After the partitioner has prepared the forward and backward modules, we trace the graphs prepared in Dynamo for the pack and unpack hooks and inline them in the epilogue of forward and the prologue of backward. Forward outputs and backward inputs are changed, transparently for the user.

We do not try to place them close to the last usage etc., relying on inductor to do this optimization.

```
INFO: TRACED GRAPH
 ===== Forward graph pre saved_tensors_hooks inlining 3 =====
 /data/users/ivankobzarev/a/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "Sym(s0)", primals_2: "Sym(s1)", primals_3: "f32[s0, s1][s1, 1]cuda:0"):
         # File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6660 in simple_fn, code: x = x + 1
        add: "f32[s0, s1][s1, 1]cuda:0" = torch.ops.aten.add.Tensor(primals_3, 1);  primals_3 = None

         # File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6661 in simple_fn, code: x = SAF.apply(x)
        view: "f32[s0, s1][s1, 1]cuda:0" = torch.ops.aten.view.default(add, [primals_1, primals_2])
        return (view, add, primals_1, primals_2)

INFO: TRACED GRAPH
 ===== Backward graph pre saved_tensors_hooks inlining 3 =====
 /data/users/ivankobzarev/a/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "Sym(s0)", primals_2: "Sym(s1)", primals_3: "f32[s0, s1][s1, 1]cuda:0"):
         # File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6660 in simple_fn, code: x = x + 1
        add: "f32[s0, s1][s1, 1]cuda:0" = torch.ops.aten.add.Tensor(primals_3, 1);  primals_3 = None

         # File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6661 in simple_fn, code: x = SAF.apply(x)
        view: "f32[s0, s1][s1, 1]cuda:0" = torch.ops.aten.view.default(add, [primals_1, primals_2])
        return (view, add, primals_1, primals_2)

INFO: TRACED GRAPH
 ===== saved_tensors_pack_hook add 3 =====
 /data/users/ivankobzarev/a/pytorch/torch/fx/_lazy_graph_module.py class pack_float8(torch.nn.Module):
    def forward(self, x_1: "f32[s0, s1][s1, 1]cuda:0"):
        # No stacktrace found for following nodes
        _to_copy: "f8e4m3fn[s0, s1][s1, 1]cuda:0" = torch.ops.aten._to_copy.default(x_1, dtype = torch.float8_e4m3fn);  x_1 = None
        return (torch.float32, _to_copy)

INFO: TRACED GRAPH
 ===== saved_tensors_unpack_hook add 3 =====
 <eval_with_key>.22 from /data/users/ivankobzarev/a/pytorch/torch/fx/experimental/proxy_tensor.py:1225 in wrapped class pack_float8(torch.nn.Module):
    def forward(self, x_1: "f32[s0, s1][s1, 1]cuda:0"):
        # No stacktrace found for following nodes
        _to_copy: "f8e4m3fn[s0, s1][s1, 1]cuda:0" = torch.ops.aten._to_copy.default(x_1, dtype = torch.float8_e4m3fn);  x_1 = None
        return (torch.float32, _to_copy)

INFO: TRACED GRAPH
 ===== Forward graph 3 =====
 /data/users/ivankobzarev/a/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "Sym(s0)", primals_2: "Sym(s1)", primals_3: "f32[s0, s1][s1, 1]cuda:0"):
         # File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6660 in simple_fn, code: x = x + 1
        add: "f32[s0, s1][s1, 1]cuda:0" = torch.ops.aten.add.Tensor(primals_3, 1);  primals_3 = None

        # No stacktrace found for following nodes
        _to_copy: "f8e4m3fn[s0, s1][s1, 1]cuda:0" = torch.ops.aten._to_copy.default(add, dtype = torch.float8_e4m3fn)

         # File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6661 in simple_fn, code: x = SAF.apply(x)
        view: "f32[s0, s1][s1, 1]cuda:0" = torch.ops.aten.view.default(add, [primals_1, primals_2]);  add = None
        return (view, _to_copy, primals_1, primals_2)

INFO: TRACED GRAPH
 ===== Backward graph 3 =====
 <eval_with_key>.21 class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "Sym(s0)", primals_2: "Sym(s1)", add_packed_2: "f8e4m3fn[s0, s1][s1, 1]cuda:0", tangents_1: "f32[s0, s1][s1, 1]cuda:0"):
        # No stacktrace found for following nodes
        _to_copy: "f32[s0, s1][s1, 1]cuda:0" = torch.ops.aten._to_copy.default(add_packed_2, dtype = torch.float32);  add_packed_2 = None

         # File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6661 in simple_fn, code: x = SAF.apply(x)
        add_7: "f32[s0, s1][s1, 1]cuda:0" = torch.ops.aten.add.Tensor(tangents_1, _to_copy);  tangents_1 = _to_copy = None
        return (None, None, add_7)

```

Differential Revision: [D72187044](https://our.internmc.facebook.com/intern/diff/D72187044)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150032
Approved by: https://github.com/bdhirsh
2025-05-22 14:09:38 +00:00
zeshengzong
f12d8d60b1 Add hint message when parameters is empty in clip_grad_norm_ (#151529)
Fixes #148259

## Changes

- Print a warning message when the `parameters` generator is exhausted

## Test Result
### print warning
```python

import torch
import torch.nn as nn
import torch.optim as optim

class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.fc = nn.Linear(10, 1)

    def forward(self, x):
        return self.fc(x)

model = SimpleModel()
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

inputs = torch.randn(16, 10)
targets = torch.randn(16, 1)

outputs = model(inputs)
loss = criterion(outputs, targets)
optimizer.zero_grad()
loss.backward()

params_to_clip = model.parameters()

for p in params_to_clip:
    print(p.shape)

max_norm = 1.0
norm_type = 2.0
total_norm = nn.utils.clip_grad_norm_(params_to_clip, max_norm, norm_type)
print(f"total_norm: {total_norm}")
```

```bash
/home/zong/code/pytorch/torch/nn/utils/clip_grad.py:222: UserWarning: `parameters` is an empty generator, no gradient clipping will occur.
  warnings.warn(
total_norm: 0.0
```
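
The warning fires in the script above because `model.parameters()` returns a generator, and the `for` loop consumes it before `clip_grad_norm_` sees it. The same behavior can be reproduced without torch; in this sketch `make_params` and `total_of` are hypothetical stand-ins for `model.parameters()` and the clipping function:

```python
import warnings


def total_of(values):
    """Sum `values`, warning when there is nothing left to sum."""
    values = list(values)
    if not values:
        warnings.warn("`values` is empty, nothing to sum.", stacklevel=2)
        return 0.0
    return float(sum(values))


def make_params():
    # Stand-in for model.parameters(): yields each item exactly once.
    yield from (1.0, 2.0, 3.0)


params = make_params()
for p in params:  # the first pass consumes the generator
    pass

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    exhausted_total = total_of(params)     # generator is already empty
    fresh_total = total_of(make_params())  # a fresh generator works

print(exhausted_total, fresh_total, len(caught))  # 0.0 6.0 1
```

Materializing the parameters once, for example with `list(model.parameters())`, avoids the problem when they need to be iterated more than once.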

### UT

```bash
pytest test/test_nn.py -k test_clip_grad_norm
```

![image](https://github.com/user-attachments/assets/0aa0f06c-e0a5-43cf-9a97-d7c2747c9180)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151529
Approved by: https://github.com/jbschlosser
2025-05-22 11:23:39 +00:00
leslie-fang-intel
40e6ca24ef Update CPU Inductor merge rules by adding more CPP Template (#152086)
**Summary**
Add more CPP Template into the CPU Inductor merge rules.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152086
Approved by: https://github.com/atalman
2025-05-22 09:46:26 +00:00
Aleksei Nikiforov
2f57ee579d S390x update docker image (#153619)
Add ninja-build for pytorch tests.
Switch to gcc 14 due to fix for precompiled headers and s390x vectorization interaction.
Disable -Werror when building onnxruntime.
Pin onnx version.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153619
Approved by: https://github.com/huydhn
2025-05-22 09:34:46 +00:00
zeshengzong
d7a83ab67b Fix lr_scheduler unexpectedly calls step() when init argument last_epoch is larger than -1 (#149312)
Fixes #102261

## Changes

- Use a flag `_is_initial` instead of the `self.last_epoch == 0` condition to decide whether `lr` should be the initial value
- Add a test for the `ExponentialLR` checkpoint use case

## Test Result

```bash
pytest -s test/optim/test_lrscheduler.py  -vv
```

![image](https://github.com/user-attachments/assets/6fd32bcc-b4fb-4421-b891-620bd4900dc1)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149312
Approved by: https://github.com/janeyx99

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-05-22 08:42:37 +00:00
Michael Lazos
423fc671e9 [Cutlass] Support float8_e4m3fn GEMM (#153890)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153890
Approved by: https://github.com/drisspg, https://github.com/eellison
2025-05-22 08:37:33 +00:00
Sidharth
c1b7dbc52a [dynamo] unimplemented -> unimplemented_v2 in variables/dict.py (#154040)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154040
Approved by: https://github.com/williamwen42, https://github.com/StrongerXi
2025-05-22 06:46:10 +00:00
Yu, Guangye
a664cfdf95 Add C10_NODEPRECATED check for xpu (#153935)
# Motivation
Add `C10_NODEPRECATED` check for XPU. This doesn't allow xpu codebase to use `c10::optional`.

What does the torch-xpu-ops commit update change?
It deprecates `c10::optional`, `c10::nullopt`, and `c10::make_optional` in favor of their counterparts in std.

# Additional Context
This PR depends on
https://github.com/intel/torch-xpu-ops/pull/1683
https://github.com/intel/torch-xpu-ops/pull/1690

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153935
Approved by: https://github.com/Skylion007, https://github.com/cyyever
2025-05-22 06:44:04 +00:00
Malaika
482e5b6660 [inductor] Added precompilation_timeout_seconds into a config instead of hardcoded (#153788)
Fixes #153392

- Updated config.py to add the timeout as a config var to be tuned dynamically (default is 3600s).
- Passed the var as a kwarg during call on instance.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153788
Approved by: https://github.com/henrylhtsang
2025-05-22 06:44:02 +00:00
Wei Wang
7128b50a65 [CI][CUDA][Distributed] Move cuda 11.8 distributed pull jobs to cuda 12.6 (#151594)
This PR moves distributed cuda CI job from cuda 11.8 to cuda 12.6.
In doing so, a few unit test failures were exposed, some if not all of which would take a while to root-cause and fix, so temporarily skip them after creating the issues.

https://github.com/pytorch/pytorch/issues/153479 test_nan_assert tricky behavior (e.g. skip_but_pass_in_sandcastle, ubuntu 20.04 does not work, ubuntu 22.04 works, Amazon Linux 2023 skip - what is Sandcastle OS?)
https://github.com/pytorch/pytorch/issues/153122 CUDA context related
https://github.com/pytorch/pytorch/issues/153517  NCCL regression, future NCCL may fix it
https://github.com/pytorch/pytorch/issues/154073 skip test_symmetric_memory for cuda 12.6 before it is fixed

See: https://github.com/pytorch/pytorch/issues/147383

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151594
Approved by: https://github.com/eqy, https://github.com/atalman, https://github.com/cyyever, https://github.com/huydhn, https://github.com/kwen2501
2025-05-22 06:33:29 +00:00
Laith Sakka
4bcff4af99 Move prologue_supported_inputs computations to def_kernal (#150869)
This avoids replaying load_input on a cache hit in the generate_code_cache.
The idea is that if a template has prologue_loads_all_inputs = True, it means that
all inputs are loaded and hence there is no need to replay.

Effect on the current benchmark on a local run on dev server.
18549985383 -> 15072230073
25697270062 -> 20738613297

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150869
Approved by: https://github.com/eellison
2025-05-22 06:24:44 +00:00
Colin L Reliability Rice
4421aee558 torch.compile: Suppress stdout / stderr output from subprocesses when local (#153837)
Summary:
This output is extremely noisy: e.g., on a 96-core machine with 8 ranks, you
can get ~700 duplicate sets of logs from each worker.
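
As a rough illustration of the general technique (not the code in this PR), a child process's stdout and stderr can be discarded with the standard library's `subprocess.DEVNULL`. The helper name `run_worker` and its `quiet` parameter are hypothetical.

```python
import subprocess
import sys


def run_worker(cmd, quiet=True):
    """Run a worker command, discarding its stdout/stderr when `quiet` is set."""
    sink = subprocess.DEVNULL if quiet else None  # None inherits the parent's streams
    return subprocess.run(cmd, stdout=sink, stderr=sink, check=False).returncode


# The child prints a log line, but nothing reaches the parent's terminal.
code = run_worker([sys.executable, "-c", "print('noisy worker log')"])
```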

Differential Revision: D74907920

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153837
Approved by: https://github.com/aorenste, https://github.com/masnesral
2025-05-22 05:49:43 +00:00
soulitzer
f2af30fee5 Add a HOP to bypass tracing of a wrapper function while tracing the wrapped function (#153487)
Usage:
```python
import functools

from torch._higher_order_ops.wrap import dynamo_bypassing_wrapper

# Your ordinary function wrapper
def my_hop_fn_impl(fn, *args, k=1, **kwargs):
    def wrapper(*args, **kwargs):
        out = fn(*args, **kwargs)
        if isinstance(out, tuple):
            return (out[0] + k,)
        return out + k

    return wrapper

# Calling `my_hop_fn` instead of the impl directly captures a HOP into the dynamo graph
def my_hop_fn(fn, *args, k=1, **kwargs):
    return dynamo_bypassing_wrapper(
        functools.partial(my_hop_fn_impl, k=k), fn, *args, **kwargs
    )
```

Notes:
- The dynamo captured graph now stashes arbitrary callable objects (the wrapper_fn) - this is equivalent to what SAC does today with policy_fn.
- The `wrapper_fn` passed to `dynamo_bypassing_wrapper` should have signature `Callable -> Callable`.
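
To make the `Callable -> Callable` contract concrete, here is a pure-Python sketch that runs without torch. It assumes that, outside of tracing, the HOP behaves like `wrapper_fn(fn)(*args, **kwargs)`; `bypassing_wrapper_eager` and `add_k` are hypothetical stand-ins, not part of the torch API.

```python
import functools


def bypassing_wrapper_eager(wrapper_fn, fn, *args, **kwargs):
    # Hypothetical stand-in for the HOP. Assumption: in eager mode the call
    # is equivalent to wrapping `fn` first and then invoking the result.
    return wrapper_fn(fn)(*args, **kwargs)


def add_k(fn, k=1):
    # With `k` bound through functools.partial, this is Callable -> Callable.
    def wrapper(*args, **kwargs):
        return fn(*args, **kwargs) + k

    return wrapper


def double(x):
    return x * 2


result = bypassing_wrapper_eager(functools.partial(add_k, k=10), double, 3)
```

Binding `k` with `functools.partial` keeps the wrapper's configuration out of the positional arguments that are forwarded to `fn`.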

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153487
Approved by: https://github.com/ydwu4
2025-05-22 04:24:38 +00:00
Boyuan Feng
669b176d4c [Graph Partition] support removed arguments, NoneLayout, and mutation (#153899)
Graph partition relies on `read_writes` to collect partition inputs and outputs. There are three edge cases:

1. `NoneLayout` is not allocated so it cannot become a partition input or output.
2. Codegen may decide a buffer to be internal to a kernel (e.g., triton kernel). One example is some buffers internal to a FusedSchedulerNode. These buffers are never actually allocated as `buf_id`.
3. We should use mutation_real_name for graph partition inputs and outputs to match the behavior of other codegen.

This PR supports these 3 cases.
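
The three cases can be sketched as a single filtering pass. This is a toy model under stated assumptions: `Buf`, `partition_io_names`, `removed_buffers`, and the dictionary form of `mutation_real_name` are hypothetical simplifications, not Inductor's actual data structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Buf:
    name: str
    has_none_layout: bool = False  # toy flag standing in for a NoneLayout check


def partition_io_names(bufs, removed_buffers, mutation_real_name):
    """Return the buffer names that may appear as partition inputs/outputs."""
    names = []
    for buf in bufs:
        if buf.has_none_layout:
            continue  # case 1: never allocated, so it cannot cross a partition
        if buf.name in removed_buffers:
            continue  # case 2: internal to a kernel, never allocated as buf_id
        # case 3: report a mutated buffer under its real name
        names.append(mutation_real_name.get(buf.name, buf.name))
    return names


bufs = [Buf("buf0"), Buf("buf1", has_none_layout=True), Buf("buf2"), Buf("buf3")]
names = partition_io_names(
    bufs,
    removed_buffers={"buf2"},
    mutation_real_name={"buf3": "arg0_1"},
)
```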

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153899
Approved by: https://github.com/eellison
2025-05-22 04:24:31 +00:00
Yidi Wu
d1fe198df6 [cond] support output the same unbacked symbol from two branches (#148206)
Previously, we didn't track the unbacked symbols leaked out of true_branch and false_branch if they had the same shape expr. As a result, the fake output of the cond operator itself didn't set up its unbacked_bindings meta properly (because those symbols were ignored).

In this PR, we also check whether there are leaked unbacked symbols; if so, we create new unbacked symbols for them and track them as outputs of cond.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148206
Approved by: https://github.com/zou3519
2025-05-22 03:39:43 +00:00
Colin Peppler
fe285b9560 [aoti] fix corner case in unbacked replacements for atomically_apply_size_hint (#153768)
## PR
There are a few cases that my previous PR (#153220) didn't cover.
1. The LHS/RHS matters. Today, if you do `torch._check(lhs == rhs)` then it will show up as a deferred runtime assert with `Eq(lhs, rhs)`.
2. There can be transitive replacements. For example, expr1 -> expr2 -> u0. `test_size_with_unbacked_add_expr_transitive` tests for this.
3. An unbacked symint expr may not have a replacement that's purely a symbol; for instance, the replacement could be another expression. `test_size_with_unbacked_add_and_mul_expr` tests for this.
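
Cases 2 and 3 can be illustrated with a toy resolver that follows replacement links until none remain. This is only a sketch: it uses plain strings where the real code works with symbolic expressions, and `resolve` and the `replacements` table are hypothetical.

```python
def resolve(expr, replacements):
    """Follow replacement links until reaching an expression with no entry."""
    seen = set()
    while expr in replacements:
        if expr in seen:
            raise ValueError(f"cyclic replacement involving {expr!r}")
        seen.add(expr)
        expr = replacements[expr]
    return expr


replacements = {
    "u1 + s0": "u2",  # transitive chain: u1 + s0 -> u2 -> u0 (case 2)
    "u2": "u0",
    "u3": "2*u4",     # the replacement is itself an expression (case 3)
}

final = resolve("u1 + s0", replacements)
```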

## Device assertion msg

```
/tmp/tmp07mu50tx/6y/c6ym2jzadwfigu3yexredb7qofviusz3p7ozcdjywvayhxgcqxkp.py:40: unknown: block: [8681,0,0], thread: [4,0,0] Assertion `index out of bounds: 0 <= tl.broadcast_to(tmp13, [XBLOCK]) < ks0` failed.
...
/tmp/tmp07mu50tx/6y/c6ym2jzadwfigu3yexredb7qofviusz3p7ozcdjywvayhxgcqxkp.py:40: unknown: block: [8681,0,0], thread: [6,0,0] Assertion `index out of bounds: 0 <= tl.broadcast_to(tmp13, [XBLOCK]) < ks0` failed.
```

## Autotuning code setup
This is the autotuning code for a concat kernel which takes input tensors (`in_buf`) and writes them to the output (`out_buf`).

It's important to note that the sizes of `in_buf0` and `in_buf1` don't match along dim=0. This is bad because all concat inputs must share the same size for each dim except for the concat dim (here that's dim=1).
```
in_buf0 = generate_example_value(size=(u1 + s0, 256))   # concrete size is (17900, 256)
in_buf1 = generate_example_value(size=(u0, 10))         # concrete size is (8192, 10)
...
out_buf = generate_example_value(size=(u1 + s0, 266))   # concrete size is (17900, 256+10)
triton_poi_fused_cat_1.run(in_buf0, in_buf1, ..., out_buf, xnumel=(u1 + s0) * 266 ...)
```

If we look into the kernel code, you'll see that `tmp9` loads `in_buf1` (our incorrectly shaped input tensor). There is also a mask to prevent OOB loads.
- `tmp6`  makes sure we're only loading with the `xindex` from 256 to 264.
- `xmask` makes sure we're only loading with the `xindex` within `xnumel`.
- `tmp6 & xmask` together is essentially checking `0 ≤ x0 < u1 + s0` and `256 ≤ x1 < 264`.

The mask logic is correct. However, `in_buf1` has the shape `[8192, 10]`, which means any load where `8192 ≤ x0 < u1 + s0` will be an OOB load.
```
def triton_poi_fused_cat_1(in_buf0, in_buf1, ... out_buf, xnumel, XBLOCK):
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)
    xmask = xindex < xnumel
    x0 = (xindex % 264)
    x1 = xindex // 264
    ...
    tmp6 = x0 >= tl.full([1], value=256)
    tmp9 = tl.load(in_buf1 + (x1), tmp6 & xmask)
    # device assertion is thrown here
    tl.device_assert(((0 <= tl.broadcast_to(tmp13, [XBLOCK])) & (tl.broadcast_to(tmp13, [XBLOCK]) < ks0)) | ~(xmask & tmp6), "index out of bounds: 0 <= tl.broadcast_to(tmp13, [XBLOCK]) < ks0")
```
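
A small arithmetic check of the mismatch, using the concrete sizes quoted in the autotuning setup (17900 rows allowed by the mask, 8192 rows actually present in `in_buf1`). The helper `oob_rows` is hypothetical and only counts indices; it does not model the Triton kernel.

```python
def oob_rows(mask_rows, buffer_rows):
    """Rows the mask lets through that fall outside a buffer with `buffer_rows` rows."""
    return range(min(buffer_rows, mask_rows), mask_rows)


# Concrete sizes from the example: u1 + s0 = 17900, u0 = 8192.
rows = oob_rows(mask_rows=17900, buffer_rows=8192)
num_oob = len(rows)  # 17900 - 8192 rows pass the mask but miss the buffer
```

If the example inputs had matching sizes along dim=0, the range would be empty and the mask alone would be sufficient.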

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153768
Approved by: https://github.com/jingsh
2025-05-22 02:05:37 +00:00
Jiang, Yanbing
a264af8c71 Support fp8 output of _scaled_mm for CPU (#153600)
This PR is to support fp8 output of torch._scaled_mm for CPU, and create related UTs with fp8 and bf16/fp16/fp32 output.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153600
Approved by: https://github.com/leslie-fang-intel, https://github.com/mingfeima, https://github.com/jansel
2025-05-22 01:15:39 +00:00
Gabriel Ferns
254293b777 Add flag _metrics_log_runtime to disable runtime metric logging by default (#153506)
https://github.com/pytorch/pytorch/pull/152708 expanded support of `get_estimated_runtime` to many more types of `SchedulerNodes`. This caused an increase in compile time because we're always calling `get_estimated_runtime` to populate the metrics table. This PR adds a flag for this logging, which reduces the instruction count by 8%. Long term, we should probably merge metrics.py with TORCH_LOGS/tlparse (suggestion from @xmfan).

Update: added support for TORCH_LOGS for the metrics logging.
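
The idea of gating an expensive estimate behind a flag can be sketched as follows. This is a toy model, not the code in metrics.py: the `Metrics` class, its `log_runtime` argument, and the cost model are all hypothetical.

```python
class Metrics:
    def __init__(self, log_runtime=False):
        self.log_runtime = log_runtime  # off by default, mirroring the new flag
        self.rows = []
        self.estimate_calls = 0

    def _estimate_runtime(self, node):
        # Stand-in for an expensive per-node estimate.
        self.estimate_calls += 1
        return sum(node) * 1e-6

    def log(self, name, node):
        row = {"name": name}
        if self.log_runtime:
            # Only pay for the estimate when runtime logging is requested.
            row["runtime"] = self._estimate_runtime(node)
        self.rows.append(row)


default_metrics = Metrics()
default_metrics.log("node0", [1, 2, 3])

verbose_metrics = Metrics(log_runtime=True)
verbose_metrics.log("node0", [1, 2, 3])
```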

Test Plan:
mm_loop.py and many existing tests cover this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153506
Approved by: https://github.com/eellison
2025-05-22 01:02:11 +00:00
PyTorch MergeBot
261897734a Revert "cpp_wrapper: build non-performance-sensitive code at O1 (#148773)"
This reverts commit 3c89cfd460.

Reverted https://github.com/pytorch/pytorch/pull/148773 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems that pr_time_benchmark is regressed after this land ([comment](https://github.com/pytorch/pytorch/pull/148773#issuecomment-2899545140))
2025-05-22 00:11:14 +00:00
Max Podkorytov
7ef2c62fd3 [ROCm][Inductor][CK] Add ck-tile based universal gemm kernels to torch.mm autotune choices (#152341)
This PR adds code generation for CK-tile based universal gemm kernels to the CK backend for Inductor, and adds these kernels to autotune choices.

Unlike legacy-CK based kernels (which are generated by parsing the CK instances from CK library), we generate the set of instances by manually specifying the tuning parameters.

This PR introduces a new template for code generation, and compilation/autotuning is handled by the existing infrastructure.

Points of discussion:

* For simplicity and reduced coupling with CK, the instance filter checks only data type and layout, and doesn't check the alignment requirement - meaning that more instances will be compiled than necessary - while keeping the code generation independent from internal CK logic which checks the alignment validity at runtime
* CK-tile instances are enabled whenever legacy-CK instances are enabled. A config knob could be introduced to differentiate between the instance types if that's needed
* Whether gemm problem size K is ever dynamic, since whenever it's not a compile-time constant, we need to perform a runtime dispatch between several kernels
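
The first point can be illustrated with a toy filter that matches on data type and layout only, so instances with different alignment requirements all survive. `GemmInstance`, `filter_instances`, the `"RCR"` layout label, and the `k_alignment` field are hypothetical and do not reflect the real CK-tile instance description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GemmInstance:
    dtype: str
    layout: str
    k_alignment: int  # deliberately ignored by the filter below


def filter_instances(instances, dtype, layout):
    """Keep instances whose dtype and layout match; alignment is not checked."""
    return [i for i in instances if i.dtype == dtype and i.layout == layout]


candidates = [
    GemmInstance("fp16", "RCR", k_alignment=8),
    GemmInstance("fp16", "RCR", k_alignment=1),
    GemmInstance("bf16", "RCR", k_alignment=8),
    GemmInstance("fp16", "RRR", k_alignment=8),
]
selected = filter_instances(candidates, dtype="fp16", layout="RCR")
```

Both alignment variants are kept, which is the trade-off described above: more instances are compiled, but the filter stays independent of the runtime validity check.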

**Testing**

Use the existing tests in `test/inductor/test_ck_backend.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152341
Approved by: https://github.com/chenyang78
2025-05-21 23:59:16 +00:00
Ke Wen
87fc5af1f6 [c10d] Turn off default non-blocking API mode to work around hang in NCCL 2.26 (#154055)
Work around issues like #153960, #152623

NCCL 2.26 seems to introduce random hangs in non-blocking API mode. This PR opts out of non-blocking mode to work around it. Previously torch turned it on by default in eager init (i.e. `device_id` passed) to avoid init overhead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154055
Approved by: https://github.com/atalman
2025-05-21 23:46:52 +00:00