Commit Graph

91573 Commits

Zain Rizvi
0d71ca2c46 [EZ] Replace pytorch-labs with meta-pytorch (#160459)
This PR replaces all instances of 'pytorch-labs' with 'meta-pytorch' in this repository, now that the 'pytorch-labs' org has been renamed to 'meta-pytorch'.

## Changes Made
- Replaced all occurrences of 'pytorch-labs' with 'meta-pytorch'
- Only modified files with extensions: .py, .md, .sh, .rst, .cpp, .h, .txt, .yml
- Skipped binary files and files larger than 1 MB, due to GitHub API payload limits in the script (which covers all repos in this org); a more manual second pass will follow later to cover any larger files

## Files Modified
This PR updates files that contained the target text.

Generated by automated script on 2025-08-12T20:41:29.888681+00:00
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160459
Approved by: https://github.com/huydhn, https://github.com/clee2000, https://github.com/atalman, https://github.com/malfet
2025-08-12 22:44:25 +00:00
deedongala
5737372862 [CI] Switch ROCm MI300 GitHub Actions workflows from 2-GPU to 1-GPU runners (#158882)
Updated .github/actionlint.yaml to replace linux.rocm.gpu.mi300.2 with linux.rocm.gpu.mi300.1 in the supported runner list

Modified all affected workflows (inductor-perf-test-nightly-rocm.yml, inductor-periodic.yml, inductor-rocm-mi300.yml, and rocm-mi300.yml) to run jobs on 1-GPU MI300 runners instead of 2-GPU runners

This should help increase the number of available runners even with the same number of CI nodes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158882
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-08-12 22:42:40 +00:00
Isalia20
2e4e5ab4be [MPS] Add mps keys to indices and values ops (#160223)
Enables the indices and values ops on sparse MPS tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160223
Approved by: https://github.com/malfet
2025-08-12 22:08:44 +00:00
Zhengxu Chen
16d15445f8 Fullgraph graph capture with dynamo. (#159749)
Summary:
Following up on Avik's doc https://docs.google.com/document/d/11RW0Bbkp1QwFbEu8rCNW5d7wUFaEkxbL0uLyqcc2jTk/edit?tab=t.0

We are experimenting with a new API that uses torch.compile(fullgraph=True), and we intend to use it to replace the old dynamo.export() API.

This PR adds a prototype for the API described in the doc.

Test Plan:
test_misc -- -k test_aot_capture

Rollback Plan:

Differential Revision: D79534608

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159749
Approved by: https://github.com/tugsbayasgalan
2025-08-12 22:06:18 +00:00
henrylhtsang
101276f81b [BE] Save attributes for CppCompileError for pickling (#160294)
Differential Revision: [D79977408](https://our.internmc.facebook.com/intern/diff/D79977408/)

Context:
When testing the cutlass backend with autotuning in a subprocess, sometimes I would see a C++ compilation error (expected) followed by
```
Traceback (most recent call last):
  File "/torch/_inductor/autotune_process.py", line 175, in get
    result = TuningProcess.recv(self.read_pipe)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/torch/_inductor/autotune_process.py", line 99, in recv
    return pickle.load(read_pipe)
           ^^^^^^^^^^^^^^^^^^^^^^
TypeError: CppCompileError.__init__() missing 1 required positional argument: 'output'
```
which is unexpected. After asking Claude, it seems:

> Now I can see the issue. The `CppCompileError` class requires two arguments: `cmd` (a list of strings) and `output` (a string). However, when exceptions are being pickled and unpickled across process boundaries, the pickling process might not be preserving the constructor arguments correctly.
>
> The problem is likely that when a `CppCompileError` is raised in the subprocess and then pickled/unpickled through the `recv` function, the unpickling process is trying to reconstruct the exception but doesn't have the required constructor arguments.
>
> The issue is clear now. The `CppCompileError` class doesn't have custom pickle methods (`__reduce__`, `__getstate__`, `__setstate__`), so when it's pickled and unpickled across process boundaries, Python's default pickling mechanism tries to reconstruct it but fails because it doesn't preserve the constructor arguments properly.
>
> The solution is to add a `__reduce__` method to the `CppCompileError` class to ensure it can be properly pickled and unpickled. Let me implement this fix:

Adding these seems to help.
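
The failure mode and the `__reduce__`-based fix can be sketched with a minimal stand-in class (illustration only, not the actual `torch._inductor` implementation):

```python
import pickle

class CppCompileError(Exception):
    """Minimal stand-in for the inductor exception (illustration only)."""

    def __init__(self, cmd: list, output: str) -> None:
        super().__init__(f"C++ compile error\n\nCommand:\n{' '.join(cmd)}\n\nOutput:\n{output}")
        self.cmd = cmd
        self.output = output

    def __reduce__(self):
        # Default pickling would re-call __init__ with only the formatted
        # message (self.args), losing the required 'output' argument and
        # raising the TypeError above. Returning the constructor arguments
        # explicitly preserves them across process boundaries.
        return (type(self), (self.cmd, self.output))

err = CppCompileError(["clang-17", "kernel.cpp"], "no such file or directory")
restored = pickle.loads(pickle.dumps(err))
assert restored.cmd == ["clang-17", "kernel.cpp"]
```

Without the `__reduce__` override, the round-trip reproduces the `missing 1 required positional argument: 'output'` error; with it, the exception survives the subprocess pipe intact.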

fbcode repro: [D79977541](https://www.internalfb.com/diff/D79977541)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160294
Approved by: https://github.com/masnesral
2025-08-12 22:03:36 +00:00
drisspg
cbffde7745 Factor out the strings to templates for better editor integration (#160357)
# Summary

More code motion. TL;DR: install 'Better Jinja' in VS Code and you now get syntax highlighting.

Before
<img width="776" height="926" alt="Screenshot 2025-08-11 at 2 41 08 PM" src="https://github.com/user-attachments/assets/10868b31-f8ac-4cf5-99fe-19b8789ce06b" />

After:
<img width="1184" height="1299" alt="Screenshot 2025-08-11 at 2 40 27 PM" src="https://github.com/user-attachments/assets/45203765-589e-4d76-8196-d895a2f2fbf6" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160357
Approved by: https://github.com/eellison
2025-08-12 21:59:54 +00:00
David Berard
78a2fe1d42 [TorchScript] thread-safe ErrorReport::CallStack (#160386)
Context: During jit.script, the TorchScript frontend maintains a callstack of Python frames, which is used to present the corresponding user code when TorchScript errors. The callstack is maintained via ErrorReport::CallStack RAII guards. Before recursing into a function, an ErrorReport::CallStack guard is created, which pushes the frame information onto a thread_local callstack (a list of calls); after exiting, the frame information is popped off the callstack. Note that the CallStack guards are also sometimes used from Python via pybindings.

The problem is that sometimes another thread can obtain a reference to the CallStack guard (if it's a Python CallStack guard). **This means that the destructor for a CallStack guard can be called from a different thread than the constructor was called**. When this happens, it causes a segfault.

This PR makes the callstack vector thread-safe to access, and each CallStack guard will store a reference to the callstack vector onto which it pushed. When the CallStack guard is destructed, it pops off the appropriate callstack vector. Although this could potentially lead to mangled callstacks, it should prevent segfaults.
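
A Python model of the fix (the real guard is a C++ RAII class; the names here are illustrative): each guard captures a reference to the thread-local stack it pushed onto, so tearing it down from another thread still pops the originating thread's stack instead of faulting.

```python
import threading

class CallStackGuard:
    """Illustrative model of ErrorReport::CallStack, not the real C++ guard."""

    _lock = threading.Lock()   # serializes all push/pop operations
    _tls = threading.local()   # per-thread callstack storage

    def __init__(self, frame_info):
        with CallStackGuard._lock:
            if not hasattr(CallStackGuard._tls, "stack"):
                CallStackGuard._tls.stack = []
            # Remember *which* stack we pushed onto, so destruction from a
            # different thread still pops the right one.
            self._stack = CallStackGuard._tls.stack
            self._stack.append(frame_info)

    def close(self):
        with CallStackGuard._lock:
            self._stack.pop()

# A guard created on a worker thread ...
holder = {}
def worker():
    holder["guard"] = CallStackGuard("forward() at model.py:10")
    holder["stack"] = holder["guard"]._stack
t = threading.Thread(target=worker)
t.start(); t.join()

# ... can be safely closed from the main thread.
holder["guard"].close()
assert holder["stack"] == []
```

As the PR notes, a cross-thread pop can mangle the callstack ordering, but it trades that cosmetic risk for eliminating the segfault.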

Added a test `test_thread_safe_error_stacks` which segfaults prior to these changes, and no longer segfaults.

Differential Revision: [D80054972](https://our.internmc.facebook.com/intern/diff/D80054972)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160386
Approved by: https://github.com/eellison
2025-08-12 21:59:04 +00:00
Ivan Zaitsev
f8f0414a59 fix cpp builder to avoid missing-source compile error (#160354)
Summary:
the condition
```
if config.is_fbcode() and (not self._aot_mode or self._use_relative_path):
    sources = [os.path.basename(i) for i in sources]
```
unintentionally (?) stripped paths even when use_relative_path was False (as long as aot_mode was False), breaking local tests that rely on absolute temp-file paths.

Fixes internal issue:
```

FAILED (errors=1)

CppCompileError: C++ compile error

Command:
/mnt/gvfs/third-party2/llvm-fb/0f1f083aa5508772f3db24bf4f697bc118ba0958/17/platform010/72a2ff8/bin/clang-17 czyi3nhzin5b3mc3376vmfnlbjobvjcghbvv4tatuazs3syqubay.cpp -shared -fPIC -O3 -DNDEBUG -fno-trapping-math -funsafe-math-optimizations -ffinite-math-only -fno-signed-zeros -fno-math-errno -fno-finite-math-only -fno-unsafe-math-optimizations -ffp-contract=off -Wall -std=c++17 -Wno-unused-variable -Wno-unknown-pragmas -Werror=ignored-optimization-argument -g -o /re_tmp/tmpsp58ya2h/zy/test_symbol.so

Output:
clang-17: error: no such file or directory: 'czyi3nhzin5b3mc3376vmfnlbjobvjcghbvv4tatuazs3syqubay.cpp'
clang-17: error: no input files
```

Reviewed By: clee2000

Differential Revision: D80025417

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160354
Approved by: https://github.com/benjaminglass1, https://github.com/clee2000
2025-08-12 21:36:22 +00:00
Mikayla Gawarecki
4d419a7461 Add pad and narrow to torch/csrc/stable/ops.h (#159328)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159328
Approved by: https://github.com/janeyx99
ghstack dependencies: #159507
2025-08-12 21:29:49 +00:00
Mikayla Gawarecki
655137b678 Update torch::stable::Tensor() default constructor (#159507)
Allows things like

```cpp
Tensor cu_seqlens_q;
if (...) {
   cu_seqlens_q = ...
}
...
```

Also adds `torch::stable::Tensor.defined()`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159507
Approved by: https://github.com/janeyx99
2025-08-12 21:29:49 +00:00
Gheorghe-Teodor Bercea
f27232a213 [ROCm] Limit number of values per thread for reductions on three dimensions (#159652)
In the current implementation of reductions in three dimensions for AMD GPUs, the number of values per thread is unbounded and can end up being in the hundreds of thousands for certain tensors. This, of course, is bad for performance. This patch fixes the issue by increasing the parallelism and thus lowering the number of values per thread to reasonable limits, i.e. fewer than 2048 values per thread. The performance gains can be between 10x-17x for certain examples where the number of values per thread was originally very high.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159652
Approved by: https://github.com/jeffdaily
2025-08-12 21:15:56 +00:00
Anshul Sinha
c24ca7f4bf [FSDP][Collectives] skipping allgather when world size is 1 (#160135)
**Summary:** In its current state, FSDP collectives use CUDA synchronizations and communication ops regardless of the world size. However, now that replicate will use FSDP, there will be instances where group size = 1 and these synchronizations and ops run needlessly. I have updated the fsdp_params group to skip the foreach_all_gather and foreach_all_gather_copy_out APIs when world_size = 1. I have created a test that uses CommDebugMode to verify that the all-gather comm has been removed. I also edited an affected test that used 1-way FSDP, verifying and updating its assert statements for CommDebugMode. Below, I have included the link to the profile trace verifying these two APIs were skipped, and two test commands.
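
The shape of the optimization (hypothetical helper names, not the actual FSDP internals): when the group has a single rank, the local shard already is the full tensor, so the collective can be skipped entirely.

```python
def maybe_all_gather(local_shard, world_size, all_gather_fn):
    """Skip the collective when there are no peers to gather from."""
    if world_size == 1:
        return list(local_shard)          # local shard == full tensor
    return all_gather_fn(local_shard)

# Simulated collective for illustration: with world_size 1 it is never called.
calls = []
def fake_all_gather(shard):
    calls.append(shard)
    return list(shard)

out = maybe_all_gather([1.0, 2.0], world_size=1, all_gather_fn=fake_all_gather)
assert out == [1.0, 2.0] and calls == []   # no communication issued
```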

https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/anshulsi_f846ac3b-9467-4060-8e36-8cc3bc4449c3_devgpu263.prn2.facebook.com_652183.1753822140871934814.pt.trace.json

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160135
Approved by: https://github.com/weifengpy
2025-08-12 21:13:29 +00:00
AaronWang04
b4596895b9 [DTensor] Registers sharding rule for rms_norm (#159692)
Reduces collective calls in the forward pass from 2 to 1

In #158716 I added the sharding rule for the backward pass but didn't add the forward pass as it didn't get dispatched. After #159324 this should get properly dispatched hence I am adding it now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159692
Approved by: https://github.com/tianyu-l
2025-08-12 21:05:24 +00:00
xinan.lin
5a9c4cfce4 [Fix XPU CI][Inductor UT] Fix test cases broken by community. (#160403)
Fixes #160243, Fixes #160244, Fixes #160245

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160403
Approved by: https://github.com/janeyx99
2025-08-12 21:02:44 +00:00
Chien-Lin Chen
a354fa91e2 added class or module info for functions blocked by weight-only load (#159935)
Fixes #152985
In #152985, users were confused about why a weights-only load failed even though functions were registered in safe_globals.
Because the error message didn't make the critical failure reason clear, they couldn't figure out that only some functions were missing from the safe_globals registration.
This fix makes that point clearer.

Here's the new error message; the blocked-function information follows the warning message, with a line break to make it stand out.
```
_pickle.UnpicklingError: Weights only load failed. In PyTorch 2.6, we changed the default value of the `weights_only` argument in `torch.load` from `False` to `True`. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
Please file an issue with the following so that we can make `weights_only=True` compatible with your use case: WeightsUnpickler error:

Trying to call reduce for unrecognized function <built-in method _unpickle of type object at 0x641e8a57d1f0> which belongs to <class 'zoneinfo.ZoneInfo'>

Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.

To execute this test, run the following from the base repo dir:
    python test/test_serialization.py TestSerialization.test_weights_only_with_safe_zoneinfo_unpickle_registration_success

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

```
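
The extra context boils down to attaching the owner of the blocked callable to the message; a rough pure-Python sketch (not the actual weights-only unpickler code):

```python
def blocked_reduce_message(fn):
    """Describe a blocked callable, naming the class/module it belongs to."""
    msg = f"Trying to call reduce for unrecognized function {fn}"
    owner = getattr(fn, "__self__", None)   # set for bound/class methods
    if owner is not None:
        msg += f" which belongs to {owner}"
    return msg

# dict.fromkeys is a classmethod bound to dict, so its owner is the class:
print(blocked_reduce_message(dict.fromkeys))
```

For `<built-in method _unpickle of type object ...>` the owner resolves to `<class 'zoneinfo.ZoneInfo'>`, which is exactly the hint users needed to register the missing class.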

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159935
Approved by: https://github.com/mikaylagawarecki
2025-08-12 20:52:25 +00:00
Ankita George
f95b58c284 Remove usage of fsspec in HF consolidation script (#159392)
Moving towards just supporting local storage to take advantage of HF apis such as safe_open. This was already done in Storage component in https://github.com/pytorch/pytorch/pull/159405. This PR removes fsspec usages in consolidation script and relies on local storage only

Differential Revision: [D78997975](https://our.internmc.facebook.com/intern/diff/D78997975/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159392
Approved by: https://github.com/sibuachu
2025-08-12 20:41:06 +00:00
albanD
8e6a313858 Add ownership token when needed on GradientEdge (#160098)
We could avoid the token by introducing PyObject preservation for THPFunction, but I think that would add too much complexity given that this kind of issue is very rare.
Happy to be talked into doing it, though, if someone really wants it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160098
Approved by: https://github.com/ezyang, https://github.com/soulitzer
2025-08-12 20:14:18 +00:00
Paul de Supinski
7e91394955 Support NUMA Binding for Callable Entrypoints (#160163)
# Context
This is an extension of #149334.

# This PR
Add support for NUMA bindings with Callable entrypoints, such as `do_train` instead of `/usr/local/bin/python`.

Most notably, we utilize a hack in order to force `Process.start()` to use custom NUMA bindings for each subprocess. Please search for `HACK:` in the code to see a description of the implementation we chose, and #160006 for discussion of alternatives and why this is necessary.

Other changes:
* Remove unnecessary `--preferred` option from all binding strategies. By default, Linux already allocates memory to the NUMA node local to the CPU which triggered the allocation. (See [MPOL_LOCAL](https://man7.org/linux/man-pages/man2/set_mempolicy.2.html).)
* Refactor so that the main API is `maybe_wrap_command_with_numa_bindings`, which computes bindings for a single rank at a time, rather than `maybe_wrap_with_numa_bindings` which computed bindings for all ranks at once. This allowed for more code sharing between `Callable` and `str` entrypoints.
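
As a crude illustration of per-rank binding (this is not the PR's implementation, which binds by NUMA topology rather than a flat CPU split; the helper name is hypothetical):

```python
import os

def cpus_for_rank(rank, nprocs, cpus):
    """Split a flat CPU list evenly across ranks (illustration only)."""
    per_rank = max(1, len(cpus) // nprocs)
    start = rank * per_rank
    chunk = cpus[start:start + per_rank]
    return chunk or cpus          # fall back to all CPUs if oversubscribed

if hasattr(os, "sched_getaffinity"):            # Linux-only API
    available = sorted(os.sched_getaffinity(0))
else:
    available = list(range(os.cpu_count() or 1))

mine = cpus_for_rank(rank=0, nprocs=2, cpus=available)
# A subprocess would then pin itself with: os.sched_setaffinity(0, mine)
assert set(mine) <= set(available)
```

The hack mentioned above exists because `Process.start()` gives no hook to run such a binding step inside the child before user code executes.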

# Test Plan
## Automated
`$ pytest test/test_numa_binding.py`

## Manual
Using [this benchmark](https://gist.github.com/pdesupinski/bbe01ade455d86e989794f2c612e2d91), I ran

```
$ PYTHONUNBUFFERED=1 LOGLEVEL=INFO perf stat -e ls_dmnd_fills_from_sys.dram_io_far,ls_dmnd_fills_from_sys.dram_io_near -- python -m torch.distributed.run --standalone --nproc-per-node=8 --numa-binding=node --run-path mlp_train.py 2>&1 | tee node_callable.txt && PYTHONUNBUFFERED=1 LOGLEVEL=INFO perf stat -e ls_dmnd_fills_from_sys.dram_io_far,ls_dmnd_fills_from_sys.dram_io_near -- python -u -m torch.distributed.run --standalone --nproc-per-node=8 --run-path mlp_train.py 2>&1 | tee none_callable.txt
```

and observed
* 6.6% remote memory accesses with 'node' bindings
* 11.6% remote without bindings

I also ran similar with `str` entrypoints as before just to be sure it's still working.

NOTE: [--run-path triggers the code to be run inside a `Callable`.](017259f9c6/torch/distributed/run.py (L870))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160163
Approved by: https://github.com/d4l3k
2025-08-12 20:08:49 +00:00
Markus Hoehnerbach
89654db1ab [inductor] fix triton bucketize mask propagation (#159961)
See 6b414f56a4

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159961
Approved by: https://github.com/eellison
2025-08-12 19:59:32 +00:00
Natalia Gimelshein
2d0cdee394 move thread-local capture mode guard to include work.isStarted (#160398)
Per title, should fix capture errors that happen because nccl watchdog races with capture start.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160398
Approved by: https://github.com/aorenste
2025-08-12 19:25:04 +00:00
eqy
9903ca4f70 [cuDNN][64-bit indexing] update conv depthwise 64bit indexing dispatch condition to match native kernel (#156140)
The native kernel doesn't support batch splitting, so the previous check wasn't aggressive enough in dispatching to cuDNN.

https://github.com/pytorch/pytorch/issues/155225

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156140
Approved by: https://github.com/ngimel, https://github.com/atalman
2025-08-12 18:07:41 +00:00
PyTorch MergeBot
f341077ce4 Revert "[ROCm] Support large inputs for coalesceValuesKernel (#158281)"
This reverts commit a7abf57aab.

Reverted https://github.com/pytorch/pytorch/pull/158281 on behalf of https://github.com/clee2000 due to broke windows cuda build? [GH job link](https://github.com/pytorch/pytorch/actions/runs/16915172288/job/47927141460) [HUD commit link](a7abf57aab).  Not caught b/c PR didn't have ciflow/trunk ([comment](https://github.com/pytorch/pytorch/pull/158281#issuecomment-3180408766))
2025-08-12 17:57:57 +00:00
Edward Z. Yang
3cec82a7e9 Ensure outer aliasing on DTensor matches inner aliasing (#158954)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158954
Approved by: https://github.com/albanD, https://github.com/wconstab
2025-08-12 17:47:48 +00:00
Jerry Mannil
ee9f8ba11d [ROCm] Use opportunistic fastatomics based on heuristics (#159430)
* Opportunistic fast atomics work better with small sizes, since there is a higher chance of lanes doing atomics on the same address

Co-author: @amd-hhashemi

Reproducer:
```
import time
import torch

x = torch.randn((1_632_960, 128), device='cuda', dtype=torch.float)
ind = torch.randint(0, x.size(0), size=(5_079_670,), device='cuda')
src = torch.randn((5_079_670, 128), device='cuda', dtype=torch.float)

for _ in range(20):
    x.index_add_(0, ind, src)

torch.cuda.synchronize()  # ensure warm-up kernels have finished before timing
start_time = time.time()
for i in range(100):
    x.index_add_(0, ind, src)
torch.cuda.synchronize()
end_time = time.time()
mean_time = (end_time - start_time)/100
print(f"Avg time for index_add_: {mean_time * 1e6:.2f} us")
```

Perf numbers:
```
Before:
Avg time for index_add_: 25652.16 us

After:
Avg time for index_add_: 2675.15 us
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159430
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily
2025-08-12 17:13:54 +00:00
David Berard
1f4057c11a [inductor] remove no_x_dim (#159810)
no_x_dim is used to indicate that a reduction operates on a single row, and data loaded for the reduction is 1-dimensional.

no_x_dim was introduced in https://github.com/pytorch/pytorch/pull/102444 - in which there was bad perf in some reductions, and using 1D tensors fixed the perf issue.

However, it appears that this perf issue no longer exists in current Triton versions. https://github.com/pytorch/pytorch/pull/118822 checked this, and we can also check this on H100 benchmarks (linked below). And another motivation for removing this behavior is that it enables larger loads, which we observe is necessary for good performance on certain shapes on Blackwell.

H100 inference benchmarks:
https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2004%20Aug%202025%2004%3A13%3A24%20GMT&stopTime=Mon%2C%2011%20Aug%202025%2004%3A13%3A24%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=gh/davidberard98/396/orig&lCommit=a6bcd4692fb39fa2fad260f290bff545d4425829&rBranch=main&rCommit=e96c7c4bb0f6aeae2ab3b6f040f7d67edbec199a

H100 training benchmarks:
https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2004%20Aug%202025%2004%3A13%3A24%20GMT&stopTime=Mon%2C%2011%20Aug%202025%2004%3A13%3A24%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(h100)&lBranch=gh/davidberard98/396/orig&lCommit=a6bcd4692fb39fa2fad260f290bff545d4425829&rBranch=main&rCommit=e96c7c4bb0f6aeae2ab3b6f040f7d67edbec199a

Overall, the benchmarks show minimal change in performance.

Differential Revision: [D79599286](https://our.internmc.facebook.com/intern/diff/D79599286)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159810
Approved by: https://github.com/ngimel, https://github.com/eellison
2025-08-12 17:10:31 +00:00
Jovian Anthony Jaison
94b91a8763 [redone][pytorch] Moving torch.compile worker process logs to a dedicated rank based log directory (#160352)
Summary:
Writing torch.compile worker logs to dedicated_log_rank{RANK} if we're running on mast.
ref: D79456310 (got reverted because of linter)

Testing:
Refer to Differential Revision: D79917440

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160352
Approved by: https://github.com/masnesral
2025-08-12 16:49:08 +00:00
Xinya Zhang
a7abf57aab [ROCm] Support large inputs for coalesceValuesKernel (#158281)
# Description

`.coalesce` cannot handle large inputs on ROCm due to the maximum grid size limit.

This PR splits axis `X` into axes `X` and `Y`, and repurposes `Z` for the original `Y` on ROCm, to avoid this limitation.

Confirmed the new approach can handle large inputs. Correctness needs validation.
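
The general trick can be sketched as follows (illustrative cap; the real per-axis limit and the kernel's index mapping live in the launch code):

```python
def split_grid(total_blocks, max_per_axis):
    """Split a 1-D block count exceeding the per-axis grid limit into an
    (x, y) pair whose product covers all blocks."""
    grid_x = min(total_blocks, max_per_axis)
    grid_y = -(-total_blocks // grid_x)   # ceiling division
    return grid_x, grid_y

# A block count far beyond a hypothetical per-axis cap of 65535:
x, y = split_grid(10_000_000, 65_535)
assert x * y >= 10_000_000
# Inside the kernel, the flat block index is then reconstructed as
#   block_id = blockIdx.y * gridDim.x + blockIdx.x   (with a bounds check),
# which is why the original Y axis has to move to Z.
```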

# Testing Command

`python torch_spmv.py 22500000 272500000`

## Script `torch_spmv.py`

``` python
import torch
import argparse

def parse_args():
    parser = argparse.ArgumentParser(
        description="Sparse COO Matrix by Dense Vector Multiplication using PyTorch"
    )
    parser.add_argument("n", type=int, help="Size of the NxN matrix")
    parser.add_argument("nnz", type=int, help="Number of non-zero entries")
    return parser.parse_args()

def main():
    args = parse_args()
    n = args.n
    nnz = args.nnz
    dtype = torch.float32
    device = torch.device('cuda')

    # Generate random indices for the sparse matrix in COO format.
    torch.manual_seed(42)
    rows = torch.randint(0, n, (nnz,), dtype=torch.int64, device=device)
    cols = torch.randint(0, n, (nnz,), dtype=torch.int64, device=device)
    indices = torch.stack([rows, cols], dim=0)

    # Generate random values.
    values = torch.randn(nnz, dtype=torch.float32, device=device)

    # Create the sparse COO matrix and move it to the target device.
    sparse_matrix = torch.sparse_coo_tensor(indices, values, size=(n, n), dtype=torch.float32, device=device)
    sparse_matrix = sparse_matrix.coalesce()

    # Generate a random dense vector.
    dense_vector = torch.randn(n, dtype=torch.float32, device=device)

    # Perform sparse matrix - dense vector multiplication.
    # Using torch.sparse.mm which expects a 2D tensor for the vector.
    result = torch.sparse.mm(sparse_matrix, dense_vector.unsqueeze(1)).squeeze()
    # result = torch.mv(sparse_matrix, dense_vector)

    # Print the result.
    print("Result of the multiplication:")
    print(torch.sum(result))

if __name__ == "__main__":
    main()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158281
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
2025-08-12 16:42:55 +00:00
PyTorch MergeBot
f7b2f3314c Revert "[triton_heuristics] Optimize the triton launcher in pt2 (#160000)"
This reverts commit d0e2240f68.

Reverted https://github.com/pytorch/pytorch/pull/160000 on behalf of https://github.com/davidberard98 due to D80054972 failing with test_triton_kernel_2d_autotune_grad_False_dynamic_True_backend_inductor_grid_type_1_tdlp_1 ([comment](https://github.com/pytorch/pytorch/pull/160000#issuecomment-3180144676))
2025-08-12 16:33:02 +00:00
Jeff Daily
9d37c960a4 [ROCm][CI] use new benchmark image for dynamo (#160421)
Follow-up to #160047, which separated the rocm image into default CI and benchmarks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160421
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-08-12 16:07:19 +00:00
PyTorch MergeBot
b219ca2a00 Revert "Update triton xpu commit to support python 3.14 (#160183)"
This reverts commit 7fbc22855c.

Reverted https://github.com/pytorch/pytorch/pull/160183 on behalf of https://github.com/clee2000 due to I'm not sure how, but it seems to have broken inductor/test_extension_backend.py::ExtensionBackendTests::test_open_device_registration [GH job link](https://github.com/pytorch/pytorch/actions/runs/16911267995/job/47917091939) [HUD commit link](7fbc22855c).  Maybe because the docker build changed?  Note to self: not bad TD ([comment](https://github.com/pytorch/pytorch/pull/160183#issuecomment-3179840160))
2025-08-12 15:29:19 +00:00
atalman
b7db86600a Fix Tensor illustration, use permalinks for image embedding in Readme.md (#160416)
Fixes Tensor illustration being broken on pypi.org. Also uses permalinks instead of links to images for embedding as per this suggestion of Alban: https://github.com/pytorch/pytorch/pull/160187#discussion_r2262978006

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160416
Approved by: https://github.com/malfet
2025-08-12 15:15:12 +00:00
James Wu
9708fcf92d Account for triton kernel source code hidden in custom ops properly in AOTAutogradCache (#160120)
This PR fixes a bug where user defined triton kernels hidden behind `triton_op` do not register source code changes. If a user *only* changes a triton kernel source_code, because triton kernels are hidden under the custom op, dynamo hasn't traced into them yet.

This means that at AOTAutograd time, we don't know the list of triton kernels defined by custom ops. This is an initial fix for the issue: parse the AST of the custom op looking for triton kernels. This won't catch more degenerate cases where the custom op calls other custom ops/functions that then call triton kernels, so the toplevel compiled graph doesn't know about them. To handle that, we'd have to trace through the custom op at dynamo time.

This should handle 99% of cases, though. I added an expectedFailure test to show the limitation.
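
The AST-based detection can be sketched roughly like this (hypothetical helper, not the actual AOTAutogradCache code):

```python
import ast
import inspect

def called_names(fn):
    """Collect names of plain function calls in fn's source (best effort)."""
    tree = ast.parse(inspect.getsource(fn))
    return {
        node.func.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
    }

def my_triton_kernel(x):
    return x

def my_custom_op(x):
    return my_triton_kernel(x) + 1   # stand-in for a wrapped triton kernel

# Cache-key logic can now discover the kernel referenced in the op body:
assert "my_triton_kernel" in called_names(my_custom_op)
```

As the PR notes, this misses indirect calls (the callee isn't in the op's own source), which is exactly the expectedFailure case.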

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160120
Approved by: https://github.com/zou3519
2025-08-12 14:11:06 +00:00
Wang, Chuanqi
a288b15ea9 [CI] Reduce XPU Windows build time (#159763)
Reduces the build time from about 2.5 hours to about 1.5 hours.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159763
Approved by: https://github.com/EikanWang, https://github.com/atalman
2025-08-12 14:04:29 +00:00
Wang, Chuanqi
7fbc22855c Update triton xpu commit to support python 3.14 (#160183)
Follow PR #159725
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160183
Approved by: https://github.com/EikanWang, https://github.com/atalman
2025-08-12 14:02:36 +00:00
IvanKobzarev
f33ce40bc0 [bucketing] Bucket only adjacent collectives to prevent reordering (#159983)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159983
Approved by: https://github.com/wconstab, https://github.com/eellison
2025-08-12 11:57:00 +00:00
Animesh Jain
4d5b3f2d5a [dynamo][guards] Install dict watchers for recursive dict tag optimization (#159796)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159796
Approved by: https://github.com/jansel
2025-08-12 09:49:11 +00:00
zeshengzong
f990490a23 Add label_smoothing param in nn.BCELoss and nn.BCEWithLogitsLoss (#150282)
Fixes #91545

## Changes

- Add `label_smoothing` param and docs
- Add test case for `label_smoothing`
- Remove duplicate description in `nn.BCELoss` and `nn.BCEWithLogitsLoss`
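
A plain-Python sketch of what binary label smoothing does (the exact smoothing formula here is an assumption for illustration; see the PR's docs for the implemented definition):

```python
import math

def bce_with_label_smoothing(p, target, label_smoothing=0.0):
    """Binary cross-entropy where the hard target is pulled toward 0.5."""
    t = target * (1.0 - label_smoothing) + 0.5 * label_smoothing
    return -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))

plain = bce_with_label_smoothing(0.9, 1.0)                          # hard target
smoothed = bce_with_label_smoothing(0.9, 1.0, label_smoothing=0.1)  # t = 0.95
assert smoothed > plain   # smoothing penalizes over-confident predictions
```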

##  Test Result

```bash
pytest -s test/test_nn.py -k test_bce
```

![image](https://github.com/user-attachments/assets/30c0b7fe-fe49-4aa0-9b05-4d70403a7b05)

![image](https://github.com/user-attachments/assets/4fe3fd1c-54b8-4012-afd9-133ce9fb4964)

![image](https://github.com/user-attachments/assets/5cad019a-3a4c-475a-9fde-9c1acad5792d)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150282
Approved by: https://github.com/cyyever, https://github.com/mikaylagawarecki
2025-08-12 09:37:03 +00:00
morrison-turnansky
b9003ed3d8 Dynamo Deep Dive Documentation Fix (#158860)
changed SourceBuilder to VariableBuilder

Fixes #158447

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158860
Approved by: https://github.com/mlazos
2025-08-12 08:53:33 +00:00
Laith Sakka
fea7e9dd37 extract shape in _view_has_unbacked_input (#160255)
Summary: We were still getting a DDE on reshape! I looked deeper and found an issue in `_view_has_unbacked_input`: namely, when the input is `[[,,]]` it needs to be normalized to `[..]`.
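
The normalization in question can be sketched as (illustrative helper, not the real `_view_has_unbacked_input` code):

```python
def normalize_shape(shape):
    """Flatten one level of accidental nesting, e.g. [[2, 3]] -> [2, 3]."""
    if len(shape) == 1 and isinstance(shape[0], (list, tuple)):
        return list(shape[0])
    return list(shape)

assert normalize_shape([[2, 3]]) == [2, 3]
assert normalize_shape([2, 3]) == [2, 3]
```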

Test Plan:
existing tests.

Rollback Plan:

Differential Revision: D79951119

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160255
Approved by: https://github.com/bobrenjc93
2025-08-12 08:38:19 +00:00
Jovian Anthony Jaison
9a0f7a3bb0 [retry-land][pytorch][dynamo_compile] Log stack_trace to dynamo_compile (#160348)
refer: https://github.com/pytorch/pytorch/pull/159655

The earlier PR failed on dynamo/test_utils.py::TestDynamoTimed::test_dynamo_timed.
Updated test_dynamo_timed and re-ran locally to verify.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160348
Approved by: https://github.com/masnesral
2025-08-12 06:24:54 +00:00
Animesh Jain
01bcf9a40d Bump transformers pin (#159291)
Trying to update hf pin.

Benchmarking run to figure out issues

<img width="1356" height="123" alt="image" src="https://github.com/user-attachments/assets/fbc435f3-a7cb-4280-9636-2ea6d15d7b6d" />

Retrying - https://github.com/pytorch/pytorch/pull/156118

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159291
Approved by: https://github.com/BoyuanFeng, https://github.com/huydhn

Co-authored-by: Huy Do <huydhn@gmail.com>
2025-08-12 05:14:17 +00:00
Animesh Jain
8d3d1c8443 [dynamo] fixes to propagate tag safeness (#159807)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159807
Approved by: https://github.com/jansel
2025-08-12 04:50:13 +00:00
PyTorch UpdateBot
0f3b10b8ee [audio hash update] update the pinned audio hash (#160384)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160384
Approved by: https://github.com/pytorchbot
2025-08-12 04:38:04 +00:00
Boyuan Feng
5f1010fbb3 [Graph Partition] Pass all OSS unit tests (#154667)
Graph partition leads to 6.2% speedup on vision_maskrcnn, 5.8% speedup on yolov3. [P1819700563](https://www.internalfb.com/phabricator/paste/view/P1819700563), 39.5% speedup on speech_transformer inference [P1830602200](https://www.internalfb.com/phabricator/paste/view/P1830602200), 85% speedup on speech_transformer training [P1831115315](https://www.internalfb.com/phabricator/paste/view/P1831115315).

Ran the same diff on two different days, and both runs show a speedup on average.

[first TorchInductor Benchmark ci run](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2021%20Jul%202025%2016%3A37%3A55%20GMT&stopTime=Mon%2C%2028%20Jul%202025%2016%3A37%3A55%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=bf/partition-turn-on&lCommit=75ef90fe89b82c967362a2d40fdf1af047202bc2&rBranch=main&rCommit=abcb24f4de11f8fedf2c2c9ff53b6092ef42306d)
<img width="1885" height="752" alt="image" src="https://github.com/user-attachments/assets/13bba9fc-5dbf-42ad-8558-d54f7e367b41" />

[second TorchInductorBenchmark ci run](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Wed%2C%2023%20Jul%202025%2016%3A38%3A27%20GMT&stopTime=Wed%2C%2030%20Jul%202025%2016%3A38%3A27%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=bf/partition-turn-on&lCommit=66de27e29338c26b1be94733049868cb0309ea52&rBranch=main&rCommit=70d2e9ba455c3c910f6f95b24171c8eee7bc00bf)
<img width="2513" height="1030" alt="image" src="https://github.com/user-attachments/assets/3a413dcb-2314-4292-919a-7ca181f9eeac" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154667
Approved by: https://github.com/eellison
2025-08-12 04:37:58 +00:00
Nikita Shulga
edaa151d0d [CI] Move CUDA tests to trunk workflow (#160379)
The trunk workflow is run before a PR is merged anyway, but roughly 3X
less frequently than the pull workflow, according to [Flambeau](https://pytorchci.grafana.net/public-dashboards/1c571e79090443eaaa9811db71f8d23b)
<img width="796" height="573" alt="image" src="https://github.com/user-attachments/assets/0235e610-4e1c-4be5-88bf-ea8278d1c656" />

I.e. this will probably result in a somewhat longer time to signal, but considering that the frequency of changes to eager PyTorch-on-CUDA has slowed down and Inductor changes are decorated with ciflow/inductor, this looks like an acceptable tradeoff to reduce costs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160379
Approved by: https://github.com/izaitsevfb
2025-08-12 04:23:50 +00:00
rzou
10bc36fe84 Get tensor subclasses and torch.library.triton_op to dispatch correctly (#160341)
Short-term fix for https://github.com/pytorch/pytorch/issues/160333

The problem is:
1) `triton_op` adds a decomposition for FunctionalTensorMode for this operation
2) Tensor Subclasses rely on FunctionalTensorMode's `__torch_dispatch__` returning NotImplemented.
3) `triton_op`'s FunctionalTensorMode decomposition takes precedence over FunctionalTensorMode's decomposition.

The easy fix is to copy-paste the FunctionalTensorMode's NotImplemented
return logic into the decomposition.
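The mechanism behind the fix can be sketched in plain Python (hypothetical names, not the actual PyTorch implementation): in `__torch_dispatch__`-style protocols, a handler that returns `NotImplemented` tells the dispatcher to try the next handler, which is how tensor subclasses get their turn. The decomposition must replicate that check, or it swallows the call before the subclass sees it.

```python
# Hedged sketch (pure Python, hypothetical names) of the NotImplemented
# fallback pattern that the fix copies into triton_op's decomposition.
class SubclassTensor:
    """Stand-in for a tensor subclass that wants to handle ops itself."""
    pass

def mode_decomposition(args):
    # If any argument is a subclass, step aside by returning NotImplemented
    # so the subclass's own handler gets a chance to run.
    if any(isinstance(a, SubclassTensor) for a in args):
        return NotImplemented
    return "handled by mode"

def subclass_handler(args):
    return "handled by subclass"

def dispatch(args, handlers):
    # Try handlers in priority order; NotImplemented means "keep looking".
    for handler in handlers:
        result = handler(args)
        if result is not NotImplemented:
            return result
    raise TypeError("no handler accepted the call")
```

With the check in place, plain inputs stay with the mode while subclass inputs fall through to the subclass handler; without it, the mode's decomposition would win unconditionally, which is the bug described above.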

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160341
Approved by: https://github.com/drisspg
2025-08-12 04:09:37 +00:00
PyTorch UpdateBot
32e5e2f596 [vllm hash update] update the pinned vllm hash (#160259)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160259
Approved by: https://github.com/pytorchbot
2025-08-12 04:04:53 +00:00
Scott Todd
bfc873d02e [ROCm][Windows] Revert copying hipblaslt and rocblas dirs. (#159083)
This reverts the changes from b367e5f6a6. This will also close https://github.com/pytorch/pytorch/pull/158922.

Since 30387ab2e4, ROCm is bootstrapped using the 'rocm' Python module which contains these files (see https://github.com/ROCm/TheRock/blob/main/docs/packaging/python_packaging.md), so they do not need to be bundled into torch/lib.

There was also a bug in here - if `ROCM_DIR` is unset, the code crashes:
```
  File "D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\setuptools\_distutils\dist.py", line 1002, in run_command
    cmd_obj.run()
  File "D:\b\pytorch_main\setup.py", line 853, in run
    rocm_dir_path = Path(os.environ["ROCM_DIR"])
                         ~~~~~~~~~~^^^^^^^^^^^^
  File "<frozen os>", line 714, in __getitem__
KeyError: 'ROCM_DIR'
```
The code could have checked for `ROCM_PATH` too.
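A minimal sketch of that suggestion (hypothetical helper name, not the code in setup.py): look up `ROCM_DIR` first, fall back to `ROCM_PATH`, and return `None` instead of raising a bare `KeyError` when neither is set.

```python
import os
from pathlib import Path

def find_rocm_dir(env=None):
    """Return the ROCm install dir from ROCM_DIR or ROCM_PATH, else None."""
    if env is None:
        env = os.environ
    for var in ("ROCM_DIR", "ROCM_PATH"):
        value = env.get(var)
        if value:
            return Path(value)
    return None
```

The caller can then emit a clear error (or skip the copy step entirely, as this revert does) rather than crashing with the `KeyError` traceback shown above.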

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159083
Approved by: https://github.com/jeffdaily
2025-08-12 02:45:49 +00:00
Scott Todd
eed9dbf70f [ROCm] Add torch/_rocm_init.py to .gitignore. (#159806)
Follow-up to https://github.com/pytorch/pytorch/pull/155285.

Build scripts like https://github.com/ROCm/TheRock/blob/main/external-builds/pytorch/build_prod_wheels.py generate this file with contents like:

```python
def initialize():
    import rocm_sdk
    rocm_sdk.initialize_process(
        preload_shortnames=['amd_comgr', 'amdhip64', 'hiprtc', 'hipblas', 'hipfft', 'hiprand', 'hipsparse', 'hipsolver', 'hipblaslt', 'miopen'],
        check_version='7.0.0rc20250804')
```

We may also have https://github.com/pytorch/pytorch/blob/main/tools/amd_build/build_amd.py do the same thing as more of that build support moves into the upstream PyTorch repository itself (see https://github.com/pytorch/pytorch/issues/159520).

This file is then loaded if present here: a7f3bdf550/torch/__init__.py (L145-L157)

Given that the file is generated by build scripts, I think adding it to `.gitignore` makes sense, as that will prevent accidental check-ins and keep local history cleaner.
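The load-if-present pattern can be sketched generically (hypothetical names; the real logic lives in torch/__init__.py at the link above): import the generated module if it exists and call its `initialize()` hook, treating absence as a no-op.

```python
import importlib

def run_optional_init(module_name):
    """Import module_name if present and call its initialize() hook.

    Returns True if the hook ran, False if the module is absent or has
    no initialize() attribute -- mirroring how a build-generated file
    like torch/_rocm_init.py is consumed.
    """
    try:
        mod = importlib.import_module(module_name)
    except ImportError:
        return False  # generated file was never created: nothing to do
    init = getattr(mod, "initialize", None)
    if init is None:
        return False
    init()
    return True
```

Because the module only exists after a build script writes it, ignoring it in git keeps checkouts clean without affecting this runtime behavior.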

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159806
Approved by: https://github.com/jeffdaily
2025-08-12 02:24:21 +00:00
Natalia Gimelshein
be53f609aa fix retaining multimem in symmetric memory (#160343)
fixes OOM in #160289

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160343
Approved by: https://github.com/eqy
2025-08-12 02:03:20 +00:00