Commit Graph

3475 Commits

Author SHA1 Message Date
PyTorch MergeBot
1e0656f063 Revert "Always build USE_DISTRIBUTED. (#160449)"
This reverts commit de893e96c7.

Reverted https://github.com/pytorch/pytorch/pull/160449 on behalf of https://github.com/jeanschmidt due to internal changes breaks import checks, see [D81845053](https://www.internalfb.com/diff/D81845053) ([comment](https://github.com/pytorch/pytorch/pull/160449#issuecomment-3264887002))
2025-09-08 07:04:36 +00:00
PyTorch MergeBot
104f2680e0 Revert "Add return-max-scores to flex-attention (#161667)"
This reverts commit 486b20b73c.

Reverted https://github.com/pytorch/pytorch/pull/161667 on behalf of https://github.com/huydhn due to Sorry for reverting your change but reverting https://github.com/pytorch/pytorch/pull/161730 does not seem to fix all trunk failures ([comment](https://github.com/pytorch/pytorch/pull/161667#issuecomment-3263512642))
2025-09-07 06:00:55 +00:00
Edward Z. Yang
b2b4add0e7 Docs on export joint with descriptors (#159006)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159006
Approved by: https://github.com/SherlockNoMad
2025-09-06 03:02:58 +00:00
drisspg
486b20b73c Add return-max-scores to flex-attention (#161667)
# Summary

### Update

API

```Py
class AuxRequest(NamedTuple):
    """Request which auxiliary outputs to compute from flex_attention.

    Each field is a boolean indicating whether that auxiliary output should be computed.
    """

    lse: bool = False
    max_scores: bool = False

class AuxOutput(NamedTuple):
    """Auxiliary outputs from flex_attention operation.

    Fields will be None if not requested, or contain the tensor if requested.
    """

    lse: Optional[Tensor] = None
    max_scores: Optional[Tensor] = None

  out_only = flex_attention(query, key, value, score_mod)
  out_max, aux_max = flex_attention(
      query,
      key,
      value,
      score_mod,
      return_aux=FlexAttentionAuxRequest(max_scores=True),
  )
  out_both, aux_both = flex_attention(
      query,
      key,
      value,
      score_mod,
      return_aux=FlexAttentionAuxRequest(lse=True, max_scores=True),
        )
```

Returns the max post mod scores from flex attention.

Not being able to break BC is kinda of annoying here since we end up with a combinatorial problem where if we need to add any more return vals we need to new kwargs that gate if they get returned by the function and need to support the 2**N additional args possible return groups.

Ideally there isn't much more we need to return, but we might want to think about how best to set this up for expansion in the future. I added kwarg only now

Maybe we make a `ExtraReturns` type kwarg that can grow and we don't need to keep adding new top level args.

We could also return a Struct that holds all the extra tensors and start deprecation cycle for logsumexp eventually returning just 1 `ExtraReturns` like struct with the tensors.

### Req Grad
I currently dont return a max_scores that supports backproping grads. I think this might be feasible  but since max is essentially 1 hot 	on the inputs and a reduction we would either need to save another `max_location` from the forward or find the max_score but also only apply to first occurence if there is multiple equivalent scores (need to check if thats we define for vanilla max op in torch).

For now no grad, we can re-visit if needed.

## Perf
I am going to disable for flex_decode. Since at least initially the motivation is for training. I also more hard than it should be to have ops return nuns or optional tensors, If return max is at the false, we should probably just create a tensor of size zero so that we don't slow down the hot path.

```Shell
🔝 Top 5 TFlops Deltas (by absolute %):
shape: (5, 7)
┌────────────────┬────────────────┬───────────────────────┬───────────────┬──────────────┬───────────┬───────────┐
│ attn_type      ┆ dtype          ┆ shape(B,Hq,M,Hkv,N,D) ┆ TFlops (base) ┆ TFlops (max) ┆ delta     ┆ pct_delta │
│ ---            ┆ ---            ┆ ---                   ┆ ---           ┆ ---          ┆ ---       ┆ ---       │
│ str            ┆ str            ┆ str                   ┆ f64           ┆ f64          ┆ f64       ┆ f64       │
╞════════════════╪════════════════╪═══════════════════════╪═══════════════╪══════════════╪═══════════╪═══════════╡
│ causal         ┆ torch.bfloat16 ┆ (4, 16, 2048, 16,     ┆ 249.514658    ┆ 243.078974   ┆ 6.435684  ┆ 2.647569  │
│                ┆                ┆ 2048, 64)             ┆               ┆              ┆           ┆           │
│ alibi          ┆ torch.bfloat16 ┆ (2, 16, 1024, 16,     ┆ 57.971274     ┆ 56.633641    ┆ 1.337633  ┆ 2.361905  │
│                ┆                ┆ 1024, 64)             ┆               ┆              ┆           ┆           │
│ noop           ┆ torch.bfloat16 ┆ (4, 16, 1024, 16,     ┆ 244.052884    ┆ 248.65129    ┆ -4.598406 ┆ -1.849339 │
│                ┆                ┆ 1024, 64)             ┆               ┆              ┆           ┆           │
│ noop           ┆ torch.bfloat16 ┆ (2, 16, 1024, 16,     ┆ 280.71254     ┆ 275.686991   ┆ 5.025549  ┆ 1.822918  │
│                ┆                ┆ 1024, 128)            ┆               ┆              ┆           ┆           │
│ sliding_window ┆ torch.bfloat16 ┆ (2, 16, 16384, 16,    ┆ 152.970031    ┆ 150.489109   ┆ 2.480923  ┆ 1.648573  │
│                ┆                ┆ 16384, 64)            ┆               ┆              ┆           ┆           │
└────────────────┴────────────────┴───────────────────────┴───────────────┴──────────────┴───────────┴───────────┘

🔺 Top 5 Positive TFlops Deltas (highest +%):
shape: (5, 7)
┌────────────────┬────────────────┬────────────────────────┬───────────────┬──────────────┬──────────┬───────────┐
│ attn_type      ┆ dtype          ┆ shape(B,Hq,M,Hkv,N,D)  ┆ TFlops (base) ┆ TFlops (max) ┆ delta    ┆ pct_delta │
│ ---            ┆ ---            ┆ ---                    ┆ ---           ┆ ---          ┆ ---      ┆ ---       │
│ str            ┆ str            ┆ str                    ┆ f64           ┆ f64          ┆ f64      ┆ f64       │
╞════════════════╪════════════════╪════════════════════════╪═══════════════╪══════════════╪══════════╪═══════════╡
│ causal         ┆ torch.bfloat16 ┆ (4, 16, 2048, 16,      ┆ 249.514658    ┆ 243.078974   ┆ 6.435684 ┆ 2.647569  │
│                ┆                ┆ 2048, 64)              ┆               ┆              ┆          ┆           │
│ alibi          ┆ torch.bfloat16 ┆ (2, 16, 1024, 16,      ┆ 57.971274     ┆ 56.633641    ┆ 1.337633 ┆ 2.361905  │
│                ┆                ┆ 1024, 64)              ┆               ┆              ┆          ┆           │
│ noop           ┆ torch.bfloat16 ┆ (2, 16, 1024, 16,      ┆ 280.71254     ┆ 275.686991   ┆ 5.025549 ┆ 1.822918  │
│                ┆                ┆ 1024, 128)             ┆               ┆              ┆          ┆           │
│ sliding_window ┆ torch.bfloat16 ┆ (2, 16, 16384, 16,     ┆ 152.970031    ┆ 150.489109   ┆ 2.480923 ┆ 1.648573  │
│                ┆                ┆ 16384, 64)             ┆               ┆              ┆          ┆           │
│ causal         ┆ torch.bfloat16 ┆ (4, 16, 1024, 16,      ┆ 161.031318    ┆ 158.597808   ┆ 2.43351  ┆ 1.534391  │
│                ┆                ┆ 1024, 64)              ┆               ┆              ┆          ┆           │
└────────────────┴────────────────┴────────────────────────┴───────────────┴──────────────┴──────────┴───────────┘

🔻 Top 5 Negative TFlops Deltas (lowest -%):
shape: (5, 7)
┌────────────────┬────────────────┬───────────────────────┬───────────────┬──────────────┬───────────┬───────────┐
│ attn_type      ┆ dtype          ┆ shape(B,Hq,M,Hkv,N,D) ┆ TFlops (base) ┆ TFlops (max) ┆ delta     ┆ pct_delta │
│ ---            ┆ ---            ┆ ---                   ┆ ---           ┆ ---          ┆ ---       ┆ ---       │
│ str            ┆ str            ┆ str                   ┆ f64           ┆ f64          ┆ f64       ┆ f64       │
╞════════════════╪════════════════╪═══════════════════════╪═══════════════╪══════════════╪═══════════╪═══════════╡
│ noop           ┆ torch.bfloat16 ┆ (4, 16, 1024, 16,     ┆ 244.052884    ┆ 248.65129    ┆ -4.598406 ┆ -1.849339 │
│                ┆                ┆ 1024, 64)             ┆               ┆              ┆           ┆           │
│ alibi          ┆ torch.bfloat16 ┆ (2, 16, 1024, 4,      ┆ 175.546923    ┆ 177.81205    ┆ -2.265127 ┆ -1.273888 │
│                ┆                ┆ 1024, 128)            ┆               ┆              ┆           ┆           │
│ sliding_window ┆ torch.bfloat16 ┆ (4, 16, 16384, 4,     ┆ 156.282597    ┆ 158.209134   ┆ -1.926537 ┆ -1.217715 │
│                ┆                ┆ 16384, 64)            ┆               ┆              ┆           ┆           │
│ sliding_window ┆ torch.bfloat16 ┆ (2, 16, 2048, 16,     ┆ 232.542929    ┆ 235.140136   ┆ -2.597207 ┆ -1.104536 │
│                ┆                ┆ 2048, 128)            ┆               ┆              ┆           ┆           │
│ alibi          ┆ torch.bfloat16 ┆ (2, 16, 1024, 16,     ┆ 169.652791    ┆ 171.475986   ┆ -1.823195 ┆ -1.063236 │
│                ┆                ┆ 1024, 128)            ┆               ┆              ┆           ┆           │
└────────────────┴────────────────┴───────────────────────┴───────────────┴──────────────┴───────────┴───────────┘
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161667
Approved by: https://github.com/Chillee, https://github.com/BoyuanFeng
2025-09-05 23:21:46 +00:00
Edward Yang
de893e96c7 Always build USE_DISTRIBUTED. (#160449)
Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160449
Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/dcci
2025-09-05 20:15:11 +00:00
PyTorch MergeBot
adae7f66aa Revert "Always build USE_DISTRIBUTED. (#160449)"
This reverts commit c37103234a.

Reverted https://github.com/pytorch/pytorch/pull/160449 on behalf of https://github.com/jeanschmidt due to Breaking internal build rules, see D81756619 ([comment](https://github.com/pytorch/pytorch/pull/160449#issuecomment-3259430011))
2025-09-05 18:58:47 +00:00
Edward Yang
c37103234a Always build USE_DISTRIBUTED. (#160449)
Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160449
Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/dcci
2025-09-04 19:43:17 +00:00
Frank Lin
0c0e056a9e [CUDA] Reuse blocks with record_stream during CUDA Graph capture in the CUDACachingAllocator (#158352)
## Introduction

During CUDA Graph capture, the CUDA caching allocator currently defers reclaiming blocks until capture ends. This is because CUDA forbids querying events recorded during capture (the CUDA operation is not executed during the capture stage), so the allocator cannot use its normal event-based logic. However, capture records an DAG (we call it **capturing graph**) of work. We can use the capturing graph to determine when a block’s old lifetime is fully before future work, and safely reuse it within the same capture.

This PR adds an experimental flag `graph_capture_record_stream_reuse: True|False (default: False)`. When enabled, the allocator inserts lightweight free markers and uses capture ordering to decide if a freed block is safe to reuse during capture. If the proof cannot be established, we fall back to the existing post-capture path.

## Terms

* **Free marker**: A capture-legal no-op (created with `cudaGraphAddEmptyNode`) inserted after the last captured use of the block on each stream that used it.
* **Terminal**: The set of the lastest operations of the stream (or the capturing graph). Any newly captured op on that stream will attach after all nodes in this set. For a stream currently capturing, it is the set of nodes returned in `dependencies_out` by `cudaStreamGetCaptureInfo`.

## When can we reuse a block during capture?

### Strong Rule (Graph-Wide Safety)

This rule provides a universal guarantee that a block is safe for reuse by any stream in the graph.

> A block is safe to reuse if every free marker is a predecessor of every terminal of all active streams in the graph.

Why it's safe:

This rule establishes a strict global ordering. Since any new operation on any stream must be appended after that stream's terminals, this condition guarantees that the block's new lifetime begins only after its old lifetime has completely ended everywhere. This prevents lifetime overlaps when the graph is replayed, ensuring correctness.

### Per-stream Rule (A Practical Optimization)

The strong rule, while safe, is often unnecessarily restrictive. The `DeviceCachingAllocator` introduces a crucial constraint that allows for a simpler check.

In `DeviceCachingAllocator`, `get_free_block` only returns blocks whose `block->stream == p.stream()`. In other words, we never reuse a block on a stream different from the allocation stream. This means we don't need to verify safety across the entire graph. We only need to confirm that the block is safe to reuse from the perspective of its own allocation stream.

> Reuse a block for allocations on stream S if every free marker is a predecessor of every node in the terminal set of S.

In short, a block is considered **reusable** on stream S as long as all marker marking it "free" are guaranteed to complete before any new work that might need it on stream S begins.

## Implementation

* On `free(block)` during capture
  * For each stream in `block->stream_uses` and the allocation stream, insert a free marker (empty node) and make it that stream’s tail.
  * If we cannot place markers for all such streams (for example, a stream is not in capture), defer to the post-capture path.
  * Otherwise, store the marker handles and keep the block in the capture-private structures.
* On `allocate(stream)` during capture (attempt per-stream reclaim)
  * Query the allocation stream S’s terminal via `cudaStreamGetCaptureInfo`.
  * For each deferred block, check whether it is allocated on this stream, and each of its free markers is a predecessor of the terminal.
    * If yes, hand the block to S for immediate reuse within the same capture.
    * If no, keep it deferred; it will be reconsidered as capture progresses and S’s terminal advances.
* On capture end
  * Any still-deferred blocks follow the existing post-capture reclamation (event insertion/polling). External behavior remains unchanged if we cannot prove safety during capture.

## Examples (2 streams)

<img width="641" height="801" alt="pytorch-remove-cudagraph-defer-reclaiming (6)" src="https://github.com/user-attachments/assets/41adc835-d448-483b-99ba-b4341cb7d2a2" />

* Case 0 — Unsafe
The two frees are not ordered with respect to each other. For stream 1, the other stream’s free marker does not precede this stream’s terminal, so the per-stream condition fails.
Counterexample intuition for the unsafe setups: imagine `f2(x)` runs for a long time. If DeviceCachingAllocator reused block `x` on a stream whose terminal is not ordered after the free markers, the new lifetime could overlap the old one on replay, risking use-after-free or data corruption. The per-stream rule prevents exactly this.
* Case 1 — Reusable on stream 1
Stream 1’s terminal is after both frees, so every free marker precedes stream 1’s terminal. The block is reusable for allocations on stream 1.
* Case 2 — Not reusable on stream 2, but this cannot occur in `DeviceCachingAllocator`
This depicts reusing the block on stream 2 while stream 1’s free is not yet ordered before stream 2’s terminal. Though the block is not safe to reuse on stream 2, DeviceCachingAllocator will not choose that block for stream 2 anyway: `get_free_block` rejects blocks whose `stream != p.stream()`. So this case is unreachable.
* Case 3 — Safe (strong rule holds)
In this scenario, the terminal nodes of all streams are positioned after the block's free markers, satisfying the strong rule. This guarantees the block is safe for reuse by any stream in the capturing graph. However, since `DeviceCachingAllocator ` only reuses a block on its original allocation stream, verifying this strong condition is unnecessary. We only need to ensure the per-stream rule is met for the specific stream requesting the block.
* Case 4 — Freeing after a join
See the note below.

## Edge Case: Freeing after a join

Our current dependency tracking has a limitation in scenarios where a block is freed after a stream join, see @galv's [comments here](https://github.com/pytorch/pytorch/pull/158352#pullrequestreview-3112565198)).

In the case 4, we have a missed opportunity. Because the block's usage is not explicitly marked, we cannot determine that the block's actual last use may have occurred much earlier, long before the join. Then, we must wait for the subsequent join before the block can be reused.

## Thanks
Thanks to @galv for his great idea around graph parsing and empty nodes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158352
Approved by: https://github.com/ngimel, https://github.com/eqy

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-04 17:21:26 +00:00
William Wen
f36f285953 [dynamo] change error_on_graph_break/fullgraph semantics (#161747)
This PR implements the semantics change to `torch._dynamo.error_on_graph_break`:
- ~`torch.compile` now has a new `error_on_graph_break` kwarg that serves as a lower-priority toggle for erroring/continuing on graph breaks~
- `error_on_graph_break` is a new internal `torch.compile `setting that is lower-priority than `fullgraph`. It allows the user to toggle erroring/continuing on graph breaks.
- `error_on_graph_break` does nothing when `fullgraph=True`
- `error_on_graph_break` does NOT guarantee a single graph

Followup [DONE]: need to change the programming model docs to reflect the 3 graph break modes for compilation:
- `fullgraph=True`: enforce one graph, no graph breaks, cannot be toggled
- `fullgraph=False, error_on_graph_break=True`: errors on graph breaks, latter can be toggled during compile time
- `fullgraph=False, error_on_graph_break=False`: resumes tracing on graph breaks, latter can be toggled during compile time

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161747
Approved by: https://github.com/mlazos
ghstack dependencies: #161739
2025-09-04 17:10:17 +00:00
PyTorch MergeBot
b7dad7dd49 Revert "Always build USE_DISTRIBUTED. (#160449)"
This reverts commit 90b08643c3.

Reverted https://github.com/pytorch/pytorch/pull/160449 on behalf of https://github.com/jeanschmidt due to Already discussed with @ezyang about the internal quirks and errors ([comment](https://github.com/pytorch/pytorch/pull/160449#issuecomment-3254219358))
2025-09-04 15:25:07 +00:00
Saurabh Mishra
1281470155 [DCP][HuggingFace] Add Support for dequantization of SafeTensors checkpoints (#160682)
This PR introduces the QuantizedHuggingFaceReader component which enables the reading and dequantization of the quantized tensors in the SafeTensors checkpoint. Following capabilities are inrtoduced:
- Configuration the target DType and the block size.
- Multi threaded dequantization for efficiency

Test Plan:
buck test //caffe2/test/distributed/checkpoint\:test_quantized_hf_storage
```
Time elapsed: 2:34.1s
Tests finished: Pass 31. Fail 0. Fatal 0. Skip 0. Build failure 0
```

Differential Revision: D80174674

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160682
Approved by: https://github.com/ankitageorge
2025-09-04 01:09:53 +00:00
Edward Yang
90b08643c3 Always build USE_DISTRIBUTED. (#160449)
Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160449
Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/dcci
2025-09-03 07:33:55 +00:00
FFFrog
d789451ff6 [OpenReg] Migrate Accelerator Document from source/notes into source/accelerator (#161845)
As the tile stated.

As the document grows, the content will become more and more, so in order to make it easier for users to read and easier for developers to maintain, we have split this file into several separate files and placed them in a dedicated directory called "accelerator".
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161845
Approved by: https://github.com/albanD
2025-09-03 03:12:18 +00:00
PyTorch MergeBot
4e42aa8ffc Revert "Always build USE_DISTRIBUTED. (#160449)"
This reverts commit b7034e9c92.

Reverted https://github.com/pytorch/pytorch/pull/160449 on behalf of https://github.com/jeanschmidt due to Breaking internal builds, can't be landed with forward fix due to internal tooling problems ([comment](https://github.com/pytorch/pytorch/pull/160449#issuecomment-3246689684))
2025-09-02 20:28:42 +00:00
Justin Chu
524b78d4f6 [ONNX] Refactor torchscript based exporter (#161323)
Refactor torchscript based exporter logic to move them to a single (private) location for better code management. Original public module and method apis are preserved.

- Updated module paths in `torch/csrc/autograd/python_function.cpp` accordingly
- Removed `check_onnx_broadcast` from `torch/autograd/_functions/utils.py` because it is private&unused

@albanD / @soulitzer could you review changes in `torch/csrc/autograd/python_function.cpp` and
`torch/autograd/_functions/utils.py`? Thanks!

## BC Breaking
- **Deprecated members in `torch.onnx.verification` are removed**

Differential Revision: [D81236421](https://our.internmc.facebook.com/intern/diff/D81236421)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161323
Approved by: https://github.com/titaiwangms, https://github.com/angelayi
2025-09-02 16:10:30 +00:00
Dev Sashidhar
d5e0f4202b Fixes broken memory_viz link in CUDA memory docs (#161426)
Fixes #161375

The  "Using the visualizer" section in torch_cuda_memory.md had a link to  https://pytorch.org/memory_viz written in inline Markdown link form. Strangely the same syntax worked earlier on the page as the issuer mentioned, but in this spot it's rendered sa a broken link.

I wasn't able to pinpoint why the second occurrence was treated differently, but switching it to the Markdown autolink form fixes the problem consistently. I tested this by rebuilding the docs locally with make html and serving the HTML with a local http.server. With the autolink, the link resolves correctly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161426
Approved by: https://github.com/soulitzer
2025-09-02 02:06:54 +00:00
Edward Yang
b7034e9c92 Always build USE_DISTRIBUTED. (#160449)
Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160449
Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/dcci
2025-09-01 23:00:21 +00:00
PyTorch MergeBot
63a9c23fe9 Revert "[CUDA] Reuse blocks with record_stream during CUDA Graph capture in the CUDACachingAllocator (#158352)"
This reverts commit 190c391a28.

Reverted https://github.com/pytorch/pytorch/pull/158352 on behalf of https://github.com/atalman due to Broke cuda 13.0 nightly builds https://github.com/pytorch/pytorch/actions/runs/17382188549/job/49341981474 ([comment](https://github.com/pytorch/pytorch/pull/158352#issuecomment-3242871629))
2025-09-01 16:27:03 +00:00
Frank Lin
190c391a28 [CUDA] Reuse blocks with record_stream during CUDA Graph capture in the CUDACachingAllocator (#158352)
## Introduction

During CUDA Graph capture, the CUDA caching allocator currently defers reclaiming blocks until capture ends. This is because CUDA forbids querying events recorded during capture (the CUDA operation is not executed during the capture stage), so the allocator cannot use its normal event-based logic. However, capture records an DAG (we call it **capturing graph**) of work. We can use the capturing graph to determine when a block’s old lifetime is fully before future work, and safely reuse it within the same capture.

This PR adds an experimental flag `graph_capture_record_stream_reuse: True|False (default: False)`. When enabled, the allocator inserts lightweight free markers and uses capture ordering to decide if a freed block is safe to reuse during capture. If the proof cannot be established, we fall back to the existing post-capture path.

## Terms

* **Free marker**: A capture-legal no-op (created with `cudaGraphAddEmptyNode`) inserted after the last captured use of the block on each stream that used it.
* **Terminal**: The set of the lastest operations of the stream (or the capturing graph). Any newly captured op on that stream will attach after all nodes in this set. For a stream currently capturing, it is the set of nodes returned in `dependencies_out` by `cudaStreamGetCaptureInfo`.

## When can we reuse a block during capture?

### Strong Rule (Graph-Wide Safety)

This rule provides a universal guarantee that a block is safe for reuse by any stream in the graph.

> A block is safe to reuse if every free marker is a predecessor of every terminal of all active streams in the graph.

Why it's safe:

This rule establishes a strict global ordering. Since any new operation on any stream must be appended after that stream's terminals, this condition guarantees that the block's new lifetime begins only after its old lifetime has completely ended everywhere. This prevents lifetime overlaps when the graph is replayed, ensuring correctness.

### Per-stream Rule (A Practical Optimization)

The strong rule, while safe, is often unnecessarily restrictive. The `DeviceCachingAllocator` introduces a crucial constraint that allows for a simpler check.

In `DeviceCachingAllocator`, `get_free_block` only returns blocks whose `block->stream == p.stream()`. In other words, we never reuse a block on a stream different from the allocation stream. This means we don't need to verify safety across the entire graph. We only need to confirm that the block is safe to reuse from the perspective of its own allocation stream.

> Reuse a block for allocations on stream S if every free marker is a predecessor of every node in the terminal set of S.

In short, a block is considered **reusable** on stream S as long as all marker marking it "free" are guaranteed to complete before any new work that might need it on stream S begins.

## Implementation

* On `free(block)` during capture
  * For each stream in `block->stream_uses` and the allocation stream, insert a free marker (empty node) and make it that stream’s tail.
  * If we cannot place markers for all such streams (for example, a stream is not in capture), defer to the post-capture path.
  * Otherwise, store the marker handles and keep the block in the capture-private structures.
* On `allocate(stream)` during capture (attempt per-stream reclaim)
  * Query the allocation stream S’s terminal via `cudaStreamGetCaptureInfo`.
  * For each deferred block, check whether it is allocated on this stream, and each of its free markers is a predecessor of the terminal.
    * If yes, hand the block to S for immediate reuse within the same capture.
    * If no, keep it deferred; it will be reconsidered as capture progresses and S’s terminal advances.
* On capture end
  * Any still-deferred blocks follow the existing post-capture reclamation (event insertion/polling). External behavior remains unchanged if we cannot prove safety during capture.

## Examples (2 streams)

<img width="641" height="801" alt="pytorch-remove-cudagraph-defer-reclaiming (6)" src="https://github.com/user-attachments/assets/41adc835-d448-483b-99ba-b4341cb7d2a2" />

* Case 0 — Unsafe
The two frees are not ordered with respect to each other. For stream 1, the other stream’s free marker does not precede this stream’s terminal, so the per-stream condition fails.
Counterexample intuition for the unsafe setups: imagine `f2(x)` runs for a long time. If DeviceCachingAllocator reused block `x` on a stream whose terminal is not ordered after the free markers, the new lifetime could overlap the old one on replay, risking use-after-free or data corruption. The per-stream rule prevents exactly this.
* Case 1 — Reusable on stream 1
Stream 1’s terminal is after both frees, so every free marker precedes stream 1’s terminal. The block is reusable for allocations on stream 1.
* Case 2 — Not reusable on stream 2, but this cannot occur in `DeviceCachingAllocator`
This depicts reusing the block on stream 2 while stream 1’s free is not yet ordered before stream 2’s terminal. Though the block is not safe to reuse on stream 2, DeviceCachingAllocator will not choose that block for stream 2 anyway: `get_free_block` rejects blocks whose `stream != p.stream()`. So this case is unreachable.
* Case 3 — Safe (strong rule holds)
In this scenario, the terminal nodes of all streams are positioned after the block's free markers, satisfying the strong rule. This guarantees the block is safe for reuse by any stream in the capturing graph. However, since `DeviceCachingAllocator ` only reuses a block on its original allocation stream, verifying this strong condition is unnecessary. We only need to ensure the per-stream rule is met for the specific stream requesting the block.
* Case 4 — Freeing after a join
See the note below.

## Edge Case: Freeing after a join

Our current dependency tracking has a limitation in scenarios where a block is freed after a stream join, see @galv's [comments here](https://github.com/pytorch/pytorch/pull/158352#pullrequestreview-3112565198)).

In the case 4, we have a missed opportunity. Because the block's usage is not explicitly marked, we cannot determine that the block's actual last use may have occurred much earlier, long before the join. Then, we must wait for the subsequent join before the block can be reused.

## Thanks
Thanks to @galv for his great idea around graph parsing and empty nodes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158352
Approved by: https://github.com/ngimel

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-01 09:25:01 +00:00
Zheng, Zhaoqiong
6737e2c996 update supported OS for Intel client GPU (#161699)
update supported OS for Intel client GPU

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161699
Approved by: https://github.com/chuanqi129, https://github.com/malfet
2025-09-01 05:45:09 +00:00
Paul de Supinski
768a1017c5 Allow parallel start NUMA binding (#161576)
# Context
In #161183, we added NUMA-binding support for `Callable` entrypoints to `elastic_launch`.

However, we would raise an exception if the subprocesses would be spawned in parallel via `ThreadPoolExecutor`, which is an option configurable via the `TORCH_MP_PARALLEL_START` environment variable (see diff).

The logic here was that `os.sched_setaffinity`, which we used to set CPU affinities, is [per process](https://docs.python.org/3/library/os.html#os.sched_setaffinity), so there could be a race condition during a parallel start:

> Restrict the process with PID pid (or the current process if zero) to a set of CPUs. mask is an iterable of integers representing the set of CPUs to which the process should be restricted.

But on further reading, the Linux docs say [`sched_setaffinity` is per *thread*.](https://man7.org/linux/man-pages/man2/sched_setaffinity.2.html) As it turns out, the Python doc is a misnomer.

I [verified that `sched_setaffinity` only affects the calling thread, not the entire calling process.](https://gist.github.com/pdesupinski/7e2de3cbe5bb48d489f257b83ccddf07)

The upshot is that we actually *can* safely use the inheritance trick from #161183 even with parallel start, since the setting will be inherited from the calling thread, and `os.sched_setaffinity` only affects the calling thread.

# This PR
Remove restrictions against parallel start for NUMA binding.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161576
Approved by: https://github.com/d4l3k
2025-08-28 01:15:58 +00:00
FFFrog
d2db6c86b0 [OpenReg] Add Develop Notes for Integrating New Backend into PyTorch (#158644)
To facilitate the integration of the new backend, we plan to publish a new development note that details all the key components,hoping to speed up the development of other accelerators.

This PR is the beginning of this note, and involve the part of registration of operators and we will gradually improve it and keep in sync with OpenReg's code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158644
Approved by: https://github.com/albanD
2025-08-27 14:47:25 +00:00
PyTorch MergeBot
9f6e1b8730 Revert "[ROCm] SDPA fix mem fault when dropout is enabled (#154864)"
This reverts commit 3caddd4daa.

Reverted https://github.com/pytorch/pytorch/pull/154864 on behalf of https://github.com/atalman due to reverted internally ([comment](https://github.com/pytorch/pytorch/pull/154864#issuecomment-3225554119))
2025-08-26 20:03:59 +00:00
Will Constable
e3d68dfae2 [DTensor] Make default RNG semantics match user-passed generator (#160482)
Previously, DTensor kept its own copy of the generator state after the
first time a random operator was called on a DTensor. This copy would
evolve independently from the generator outside of DTensor.

After adding support for users to pass a specific generator into
random operators (e.g. `uniform_(..., generator=)`), it was determined
(in discussion on #159991) to change the semantics so that any random
operations performed on DTensor would evolve the state of the publicly
visible generators (either the default one or user-passed one).

The upsides are (1) it is now possible to call torch.manual_seed() at
any point in the program and have a consistent effect on DTensor, (2)
DTensor ops have an observable effect on the generator.  The downside is
that users are now responsible for seeding their generator before using
DTensor, ensuring all ranks use the same seed.

Fixes #159991

confirmed docs rendered OK

<img width="897" height="414" alt="image" src="https://github.com/user-attachments/assets/c082f0f0-5447-47aa-834f-65342eb237cd" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160482
Approved by: https://github.com/wanchaol
2025-08-25 04:21:19 +00:00
Chuanhao Zhuge
74280d0913 [muon] Introduce Muon optimizer to PyTorch (#160213)
A single-device version of Muon. Algorithm refers Keller Jordan's [Muon blogpost](https://kellerjordan.github.io/posts/muon/), and optionally incorporates [Moonshot's](https://github.com/MoonshotAI/Moonlight/blob/master/Moonlight.pdf) learning rate adjustment strategy.

This implementation maintains a minimalist API and is consistent with other optimizer conventions. PyTorch team prefers to handle parameter filtering at a higher level, with the Muon optimizer performing only the msign computation for orthogonalization on all parameters it receives. Users are responsible for grouping parameters for different optimizers as needed. An example usage is shown below, and a more detailed example will be added to the [PyTorch examples](https://github.com/pytorch/examples) directory.

**Usage**

```python
    model = MyModelForCausalLM
    # filter out your params manually
    muon_params = [...]
    adamw_params = [...]
    muon = Muon(
        params = muon_params
        lr=lr,
        wd=wd,
    )
    adamw = AdamW(
        params = adamw_params
        lr=lr,
        wd=wd,
    )

    # in training loop
    loss = model(input)
    loss.backward()
    muon.step()
    adamw.step()
    muon.zero_grad()
    adamw.zero_grad()
```

~~**Additional usage**~~
~~Users are also able to pass in self-defined `msign` function for orthogonalization, and learning rate adjustment function. Interface defined below:~~

```python
~~AdjustLrFn: TypeAlias = Callable[[float, torch.Size], float]~~
~~MsignFn: TypeAlias = Callable[[Tensor, BaseMsignFnConfig], Tensor]~~
```

As discussed with team and in comment, we prefer to make the interface simpler and cleaner, thus we removed the callback interface, and canonicalize the original NS algorithm for Muon. The only configs available to users are `ns_steps`, `coefficients`, and `eps`, configurable through kwargs.

By default, we use 5-step Newton-Schulz, with coefficients proposed by [Keller](https://kellerjordan.github.io/posts/muon/). We use LR adjustment proposed by [Moonshot](https://github.com/MoonshotAI/Moonlight/blob/master/Moonlight.pdf), which grafts learning rate from AdamW.

**Testing**

~~1. Unit tests: the newly introduced Muon is covered in `test/test_optim.py`. We updated the test cases to pass named parameters to the optimizer under test. Additionally, we introduced a new test case to verify that when the user provides an empty FQN list, Muon correctly falls back to AdamW behavior.~~

As discussed, in order not to complicate the codebase, we prefer not to include reference implementation into PyTorch. We also updated the interface so we don't need to test the FQN based filtering. Muon is covered by the existing `test_optim.py` unit test.

2. End-to-end test: we added a training script that pre-trains a QWEN-like model on `openwebtext-100k` dataset. We trained for one epoch and the resulting loss curve is compared against the Moonshot implementation to confirm behavioral consistency.
<img width="1102" height="472" alt="Screenshot 2025-07-29 at 1 04 12 AM" src="https://github.com/user-attachments/assets/ceab0733-497d-4070-8032-02ae7995c64c" />

**Numerics**
We evaluate our implementation with existing implementation to confirm numerical consistency.

As discussed, our implementation closely follows the algorithm described in [Keller's post](https://kellerjordan.github.io/posts/muon/), while incorporating the learning rate adjustment from [Moonlight](https://github.com/MoonshotAI/Moonlight/blob/master/Moonlight.pdf). This captures a key insight that allows users to reuse hyper-parameters tuned for `adamW`, making Muon a drop-in swap.

As expected, the numerics difference mainly comes from `adjust_lr`, a max of ~5% relative diff in an example unit test setup below.

```python
    # dummy model and data
    model0 = Linear(10, 10, bias=False)
    model1 = copy.deepcopy(model0)
    inputs = torch.randn(8, 10)
    targets = torch.randn(8, 10)
    loss = MSELoss()

    lr = 1e-3
    wd = 0.1
    momentum = 0.95

    opt_ref_muon = KellySingleDeviceMuon(
        params=model0.parameters(),
        lr=lr,
        weight_decay=wd,
        momentum=momentum,
    )

    opt_exp_muon = Muon(
        params=model1.parameters(),
        lr=lr,
        weight_decay=wd,
        momentum=momentum,
    )

    out_ref = model0(inputs)
    loss_ref = loss(out_ref, targets)
    opt_ref_muon.zero_grad()
    loss_ref.backward()
    opt_ref_muon.step()

    out_exp = model1(inputs)
    loss_exp = loss(out_exp, targets)
    opt_exp_muon.zero_grad()
    loss_exp.backward()
    opt_exp_muon.step()

    for p_ref, p_exp in zip(model0.parameters(), model1.parameters()):
        torch.testing.assert_close(p_ref, p_exp)
```

As explained above, including this `adjust_lr` is preferable. This is validated by an e2e training runs on training a qwen-2-like 0.5b model, where the curves show that training with `adjust_lr` converges more effectively than without.
<img width="1179" height="464" alt="Screenshot 2025-08-18 at 10 12 33 AM" src="https://github.com/user-attachments/assets/e797d3da-c2f0-4187-b99e-5d48b7437c3c" />

**Performance**
Training for one epoch of openwebtext-100k on eight H100 GPUs with DDP:

- adamw_ddp finishes in 13.12 min
- pytorch_muon_ddp finishes in 13.45 min

Muon runs ~20s slower compared to AdamW. Assuming no other changes, Muon is *2.5%* slower than AdamW.

AdamW: Optimizer.step() takes ~13.5 ms, step time ~930 ms
<img width="726" height="590" alt="Screenshot 2025-07-29 at 1 56 14 AM" src="https://github.com/user-attachments/assets/ebcd7e1c-d129-4b20-9396-39f568edf03d" />

Muon: Optimizer.step() takes ~54 ms, step time ~960 ms
<img width="751" height="597" alt="Screenshot 2025-07-29 at 2 02 20 AM" src="https://github.com/user-attachments/assets/72f5b904-ebd5-4502-a6ff-d3e9e5a6da81" />

**Note**
We restrict the implementation to accept only 2D parameters.

An alternative approach is to allow parameters with more than two dimensions and apply orthogonalization over the last two dimensions. We opt not to go with this approach as it can be error-prone. For example, with a kernel shaped `[in_channel, height, width, out_channel]`, applying orthogonalization to the last two dimensions is not meaningful.

Since Muon is designed to operate orthogonalization on 2D matrices, preserving this assumption keeps the implementation clean and sound.

**Next Steps**

1. Add `MuP`
2. Open-source optimized triton kernel for symmetric matmul. A preliminary benchmark found 1.23x - 1.48x speedup on small - large (n = 256 -> 16384) matrices.
3. Open-source unsharded Muon co-designed with FSDP2.

****

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160213
Approved by: https://github.com/janeyx99
2025-08-24 08:03:04 +00:00
Paul de Supinski
33346b5814 Support NUMA Binding for Callable Entrypoints, Take 2 (#161183)
# Context
In #160163, we added support for NUMA binding for `Callable` entrypoints to `elastic_launch`. This requires special consideration, because they go through a different path to spawn subprocesses compared to `str` entrypoints, a path which does not provide a straightforward way to utilize `numactl` CLI. See #160006 for a full description of the challenges.

Although #160163 worked in initial local experiments, we ran into some linker errors in other environments when we tried to call `numactl`. This appeared to be due to interactions with how the `LD_PRELOAD` environment variable was being set.

# This PR
On further thought, the most straightforward, foolproof solution here is to use [the trick that @d4l3k suggested.](https://github.com/pytorch/pytorch/issues/160006#issuecomment-3162018836)

Specifically, for each local rank `i`:
1. The parent process sets its own CPU affinity to what local rank `i`'s should be.
2. Then, the parent spawns the subprocess for local rank `i`.
3. Finally, the parent resets its own CPU affinity to what it was originally.

There were other solutions that would work just for `Callable` entrypoints, but I believe this is the simplest one that can work for *both* `str` and `Callable`, and it's pretty simple.

This required a bit of refactoring:
1. Turn all the `_get_.*_numactl_options` into functions which return a set of logical CPUs to bind to, rather than options like `--cpunodebind=0`.
2. Instead of wrapping commands with `numactl`, use `os.sched_setaffinity` to bind to the CPUs from (1.).
3. Put this all inside a context manager which encapsulates applying and restoring the bindings in the parent process.
4. Use the context manager for both `str` and `Callable` paths

# Test Plan
## Automated
`$ pytest test/test_numa_binding.py`

## Manual
See [doc.](https://docs.google.com/document/d/1vxD-OKYBTT27jbBwtW9iz9g0tNM0u-i0tiTJg_ieQA8/edit?tab=t.0) Meta only, but TLDR tried out every combination of `str`, `Callable`, binding disabled, and binding enabled on the same model and saw 2x SM utilization for binding enabled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161183
Approved by: https://github.com/d4l3k
2025-08-23 07:23:22 +00:00
Justin Chu
419a2dbf5f [ONNX] Remove enable_fake_mode and exporter_legacy (#161222)
Remove enable_fake_mode and exporter_legacy entirely. Even though this is bc breaking, `enable_fake_mode` is no longer compatible with the latest version of transformers, and so it is no longer useful.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161222
Approved by: https://github.com/titaiwangms
2025-08-22 22:15:27 +00:00
PyTorch MergeBot
c7a77470c5 Revert "[DTensor] Make default RNG semantics match user-passed generator (#160482)"
This reverts commit d1faf2ef04.

Reverted https://github.com/pytorch/pytorch/pull/160482 on behalf of https://github.com/jeffdaily due to failing cuda and rocm jobs ([comment](https://github.com/pytorch/pytorch/pull/160482#issuecomment-3214694297))
2025-08-22 15:04:28 +00:00
Will Constable
d1faf2ef04 [DTensor] Make default RNG semantics match user-passed generator (#160482)
Previously, DTensor kept its own copy of the generator state after the
first time a random operator was called on a DTensor. This copy would
evolve independently from the generator outside of DTensor.

After adding support for users to pass a specific generator into
random operators (e.g. `uniform_(..., generator=)`), it was determined
(in discussion on #159991) to change the semantics so that any random
operations performed on DTensor would evolve the state of the publicly
visible generators (either the default one or user-passed one).

The upsides are (1) it is now possible to call torch.manual_seed() at
any point in the program and have a consistent effect on DTensor, (2)
DTensor ops have an observable effect on the generator.  The downside is
that users are now responsible for seeding their generator before using
DTensor, ensuring all ranks use the same seed.

Fixes #159991

confirmed docs rendered OK

<img width="897" height="414" alt="image" src="https://github.com/user-attachments/assets/c082f0f0-5447-47aa-834f-65342eb237cd" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160482
Approved by: https://github.com/wanchaol
2025-08-21 22:02:16 +00:00
Andy Lugo
3caddd4daa [ROCm] SDPA fix mem fault when dropout is enabled (#154864)
Fixes issue that exhibited a device side memory access fault due to incorrect tensor life management

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154864
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-08-21 14:23:13 +00:00
Jane Xu
8f766d6839 Add ScalarType -> shim conversion, add stable::Tensor.scalar_type (#160557)
TL;DR: Moving to ScalarType in user extensions and removing deprecated dtypes.

This change _modifies_ the from/to behavior between ScalarType and StableValue! Whereas before, user extensions could only in abstract pass around obfuscated dtypes appearing as int32_ts, now, users can confidently use torch::headeronly::ScalarType in their extensions for major scalar types. This PR enables ABI stability by adding a translation layer through the shim, so that even if the ScalarType enum values change in the future, user extensions need not fear.

Then we add a Tensor scalar_type API which reuses the from/to logic to return to the user a nice ScalarType (vs an abstracted int32_t).

I then changed the test to test the scalar_type API.

This code change required some refactoring because of circular dependencies.

## BC Breaking note
This commit is (narrowly) BC-breaking for unpopular dtypes: `quint*`s, `qint*`s, `Bits*`, `dummy_uint*`s, `dummy_int*`s, `Float8_e8m0fnu`, and `Float4_e2m1fn_x2` in the narrow use case where an extension retrieves a Tensor dtype of the above and passes it into `aoti_torch_call_dispatcher`. As of now, I believe there are 0 users of this use case, so the benefits of this change significantly justify BC-breaking this API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160557
Approved by: https://github.com/mikaylagawarecki, https://github.com/malfet
2025-08-19 22:13:47 +00:00
PyTorch MergeBot
eba20d2d74 Revert "[WIP] Merge Test (#160998)"
This reverts commit ef761c4353.

Reverted https://github.com/pytorch/pytorch/pull/160998 on behalf of https://github.com/ZainRizvi due to Undoing test merge ([comment](https://github.com/pytorch/pytorch/pull/160998#issuecomment-3202125839))
2025-08-19 20:30:39 +00:00
John Stawinski
ef761c4353 [WIP] Merge Test (#160998)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160998
Approved by: https://github.com/ZainRizvi
2025-08-19 20:26:07 +00:00
FFFrog
284b719005 Remove the uncessary empty file (#160728)
As the title stated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160728
Approved by: https://github.com/Skylion007
2025-08-19 10:54:08 +00:00
henrylhtsang
98373e5ad2 [doc] AOTI debugging guide (#160430)
Folded from https://discuss.pytorch.org/t/a-beginners-guide-to-debugging-aot-inductor-cuda-illegal-memory-access/222188

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160430
Approved by: https://github.com/angelayi
2025-08-14 23:42:17 +00:00
Howard Huang
198b5fd2d4 [PP] Add DualPipeV schedule (#159591)
Added the DualPipeV schedule according to http://github.com/deepseek-ai/DualPipe/blob/main/dualpipe/dualpipev.py#L11

<img width="3633" height="486" alt="image" src="https://github.com/user-attachments/assets/4e843bb9-87cd-4d11-936c-7dfe8ee12f16" />

This schedule doesn't perform the actual "overlap" during execution, but provides the scaffolding and schedule definition we need to run it E2E in torchtitan. Supporting the overlapped operation will be worked on in following PRs.

Tests:
```sh
python test/distributed/pipelining/test_schedule_multiproc.py -k test_v_shape_schedules
python test/distributed/pipelining/test_schedule.py -k test_pipeline_order_for_v_schedules
```

Also tested in TorchTitan and is running.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159591
Approved by: https://github.com/wconstab
2025-08-14 14:58:35 +00:00
RajeshvShiyal
5ace061254 finfo eps doc fix (#160502)
Existing documentation for torch.finfo().eps is as below:
| eps             | float  | The smallest representable number such that ``1.0 + eps != 1.0``.          |

Proposed documentation for torch.finfo().eps is as below:
| eps             | float  | The difference between 1.0 and the next smallest representable float larger than 1.0.	|

Fixes #160397

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160502
Approved by: https://github.com/ngimel
2025-08-14 01:49:35 +00:00
Mikayla Gawarecki
1196bb1c2e Add utility to get computed kernel in torch.library (#158393)
Adds `OperatorEntry::getComputedKernelForDispatchKey` which returns the KernelFunction corresponding to `OperatorEntry.dispatchTable_[dispatch_ix]` for a given dispatch key
- Specifically it returns a `SafeKernelFunction` that holds a `KernelToken`. This `KernelToken` is registered to the `KernelFunction` in `OperatorEntry.kernels_` and will be invalidated when the `KernelFunction` is destructed (i.e. when the `AnnotatedKernel` that holds this `KernelFunction` is removed from `kernels_`, which happens when the corresponding impl is deregistered).
- `SafeKernelFunction` can be called via `callBoxed`, the validity of the token will be checked before this happens
- `SafeKernelFunction` is pybinded and `getComputedKernelForDispatchKey` is exposed to the frontend ia `torch.library.get_kernel`

Related to https://github.com/pytorch/pytorch/issues/155330

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158393
Approved by: https://github.com/albanD
2025-08-13 21:00:59 +00:00
Svetlana Karslioglu
114a6c4043 Add placeholder for the User Guide (#159379)
- Add pytorch_overview.md
- Add pytorch_main_components.md
- Reorganize top nav to have Get Started, User Guide, Reference API, Community, Tutorials
- Move notes under user guide

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159379
Approved by: https://github.com/albanD

Co-authored-by: sekyondaMeta <127536312+sekyondaMeta@users.noreply.github.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-08-13 14:56:04 +00:00
Paul de Supinski
7e91394955 Support NUMA Binding for Callable Entrypoints (#160163)
# Context
This is an extension of #149334.

# This PR
Add support for NUMA bindings with Callable entrypoints, such as `do_train` instead of `/usr/local/bin/python`.

Most notably, we utilize a hack in order to force `Process.start()` to use custom NUMA bindings for each subprocess. Please search for `HACK:` in the code to see a description of the implementation we chose, and #160006 for discussion of alternatives and why this is necessary.

Other changes:
* Remove unnecessary `--preferred` option from all binding strategies. By default, Linux already allocates memory to the NUMA node local to the CPU which triggered the allocation. (See [MPOL_LOCAL](https://man7.org/linux/man-pages/man2/set_mempolicy.2.html).)
* Refactor so that the main API is `maybe_wrap_command_with_numa_bindings`, which computes bindings for a single rank at a time, rather than `maybe_wrap_with_numa_bindings` which computed bindings for all ranks at once. This allowed for more code sharing between `Callable` and `str` entrypoints.

# Test Plan
## Automated
`$ pytest test/test_numa_binding.py`

## Manual
Using [this benchmark,](https://gist.github.com/pdesupinski/bbe01ade455d86e989794f2c612e2d91), ran

```
$ PYTHONUNBUFFERED=1 LOGLEVEL=INFO perf stat -e ls_dmnd_fills_from_sys.dram_io_far,ls_dmnd_fills_from_sys.dram_io_near -- python -m torch.distributed.run --standalone --nproc-per-node=8 --numa-binding=node --run-path mlp_train.py 2>&1 | tee node_callable.txt && PYTHONUNBUFFERED=1 LOGLEVEL=INFO perf stat -e ls_dmnd_fills_from_sys.dram_io_far,ls_dmnd_fills_from_sys.dram_io_near -- python -u -m torch.distributed.run --standalone --nproc-per-node=8 --run-path mlp_train.py 2>&1 | tee none_callable.txt
```

and observed
* 6.6% remote memory accesses with 'node' bindings
* 11.6% remote without bindings

I also ran similar with `str` entrypoints as before just to be sure it's still working.

NOTE: [--run-path triggers the code to be run inside a `Callable`.](017259f9c6/torch/distributed/run.py (L870))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160163
Approved by: https://github.com/d4l3k
2025-08-12 20:08:49 +00:00
morrison-turnansky
b9003ed3d8 Dynamo Deep Dive Documentation Fix (#158860)
changed SourceBuilder to VariableBuilder

Fixes #158447

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158860
Approved by: https://github.com/mlazos
2025-08-12 08:53:33 +00:00
Jane Xu
9b803cdbe2 [BE] Remove more optim entries from docs coverage ignore list (#160194)
This PR does privatize ReduceLRSchedulerOnPlateau.is_better -> ReduceLRSchedulerOnPlateau._is_better because that API was never meant to be public. A GitHub search for it also reveals that the API is not commonly used much. https://github.com/search?q=.is_better%28&type=code&p=2

If you do use this API and you rely on it for some reason, please file an issue. In the meantime, you can access it through `_is_better(...)`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160194
Approved by: https://github.com/albanD, https://github.com/Skylion007
2025-08-09 00:09:45 +00:00
Syed Tousif Ahmed
2247aa6d1d Documents tuning NVLink performance on H100/H200 (#159792)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159792
Approved by: https://github.com/ngimel
2025-08-08 20:28:24 +00:00
Andres Lugo
5f5f508aa8 [ROCm] Ck backend UX refactor (#152951)
Refactors how the enablement/disablement of CK Gemms and SDPA works.

- Adds USE_ROCM_CK_GEMM compile flag for enabling CK gemms.
- USE_ROCM_CK_GEMM is set to True by default on Linux
- Updates USE_CK_FLASH_ATTENTION to USE_ROCM_CK_SDPA.
- USE_ROCM_CK_SDPA is set to False by default
- (USE_CK_FLASH_ATTENTION still works for now, but will be deprecated in a future release)
- Prevents these CK libraries from being used unless pytorch has been built specifically with the functionality AND is running on a system architecture that supports it.
- the getters for these library backends will also do some validity checking in case the user used an environment variable to change the backend. If invalid, (i.e. one of the cases mentioned above is false) the backend will be set as the current non-CK default

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152951
Approved by: https://github.com/eqy, https://github.com/jeffdaily, https://github.com/m-gallus

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Co-authored-by: Jithun Nair <jithun.nair@amd.com>
Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-08-08 18:40:17 +00:00
Yu, Guangye
84f7e88aef Add unified memory APIs for torch.accelerator (#152932)
# Motivation
The following API will be put under torch.accelerator
- empty_cache
- max_memory_allocated
- max_memory_reserved
- memory_allocated
- memory_reserved
- memory_stats
- reset_accumulated_memory_stats
- reset_peak_memory_stats

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152932
Approved by: https://github.com/albanD
ghstack dependencies: #138222
2025-08-08 17:41:22 +00:00
PyTorch MergeBot
74da2604c9 Revert "Add unified memory APIs for torch.accelerator (#152932)"
This reverts commit 15f1173e5d.

Reverted https://github.com/pytorch/pytorch/pull/152932 on behalf of https://github.com/jithunnair-amd due to Broke ROCm periodic runs on MI300 e.g. https://github.com/pytorch/pytorch/actions/runs/16764977800/job/47470050573 ([comment](https://github.com/pytorch/pytorch/pull/138222#issuecomment-3164941815))
2025-08-07 16:34:36 +00:00
Yu, Guangye
15f1173e5d Add unified memory APIs for torch.accelerator (#152932)
# Motivation
The following API will be put under torch.accelerator
- empty_cache
- max_memory_allocated
- max_memory_reserved
- memory_allocated
- memory_reserved
- memory_stats
- reset_accumulated_memory_stats
- reset_peak_memory_stats

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152932
Approved by: https://github.com/albanD
ghstack dependencies: #138222
2025-08-06 02:22:18 +00:00
Zheng, Zhaoqiong
0ba09a6d34 fix link for tutorial of inductor on windows (#159853)
fix link issue from https://docs.pytorch.org/tutorials/prototype/inductor_windows.html to https://docs.pytorch.org/tutorials/unstable/inductor_windows.html due to structure change with pr https://github.com/pytorch/tutorials/pull/3489
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159853
Approved by: https://github.com/sekyondaMeta

Co-authored-by: sekyondaMeta <127536312+sekyondaMeta@users.noreply.github.com>
Co-authored-by: Zesheng Zong <zesheng.zong@outlook.com>
2025-08-05 18:37:47 +00:00
Oguz Ulgen
a29ed5e1ac Add torch compile force disable caches alias (#158072)
Bunch of people keep thinking current alias only disables inductor cache because it has the name inductor in it. lets globalize the name

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158072
Approved by: https://github.com/ezyang
2025-08-02 23:23:17 +00:00
Svetlana Karslioglu
e4e2701429 Add the RunLLM widget to the website (#152055)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152055
Approved by: https://github.com/albanD
2025-07-31 20:53:53 +00:00
Neil Tenenholtz
1ebcba4e1b Fix typo in link to torch memory_viz tool (#159214)
Fixes a small typo in the torch_cuda_memory docs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159214
Approved by: https://github.com/yewentao256, https://github.com/HDCharles, https://github.com/Skylion007
2025-07-31 18:50:54 +00:00
Boyuan Feng
435edbcb5d [Graph Partition] add graph partition doc (#159450)
This pr adds doc for graph partition.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159450
Approved by: https://github.com/eellison
2025-07-30 17:01:10 +00:00
Svetlana Karslioglu
d214901133 Add a title to distributed._dist2.md (#159385)
Sphinx likes titles and complains about them when they are not there. So adding a title to address this Wartning in the build:
```
WARNING: toctree contains reference to document 'distributed._dist2' that doesn't have a title: no link will be generated
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159385
Approved by: https://github.com/d4l3k
2025-07-30 04:09:41 +00:00
PaliC
b57d1ef110 [BE] Remove __reduce_deploy__ (#158291)
This PR removes the integration point torch.fx had with torch::deploy (and another minor change).

Note: This PR has some broken mypy errors, but I believe those should have been in the code base beforehand, and should be fixed in a separate PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158291
Approved by: https://github.com/albanD
ghstack dependencies: #158290
2025-07-30 01:36:03 +00:00
PaliC
dd7c996d5c [BE] Remove torch deploy | remove torch deploy specific files (#158290)
This PR removes specific files found in pytorch which are only used for torch::deploy. This is mostly testing code and a debugger.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158290
Approved by: https://github.com/albanD
2025-07-30 01:36:03 +00:00
William Wen
df58db8831 [dynamo, docs] add recompilation, observability, reporting issues docs (#159062)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159062
Approved by: https://github.com/svekars, https://github.com/zou3519, https://github.com/anijain2305
2025-07-29 23:23:51 +00:00
anwang
c55e72bea1 [Re-land][Inductor] Support native Inductor as backend for MTIA (#159211)
The previous [diff/PR] (https://github.com/pytorch/pytorch/pull/158526) was reverted due to this docstring lint error:
<img width="1736" height="722" alt="image" src="https://github.com/user-attachments/assets/216b1720-4002-48da-b5f3-32b5d48aaa54" />
I didn't add the docstring cause I thought I'm not supposed to add docstring for an EXISTING function.

So this diff/PR is an exactly copy of the previous one, except for adding the docstring.

-------------
This diff/PR includes the changes to support native Inductor integration for MTIA. The goal is to support `torch.compile(backend="inductor")` for MTIA. Inductor should generate code(triton kernel + python wrapper code) similar to CUDA. And the triton kernels can be launched eagerly.

The changes include:
- Add MTIA device interfaces used by Dynamo and Inductor, including APIs on device, stream, event, etc.
- Add required torch.mtia APIs, like is_bf16_supported, memory_allocated, set_stream_by_id, etc.
- MTIA specific codegen logic, for example, loading MTIA dynamic_library.
- Other necessary changes to integrate with Inductor codegn, following other devices like CUDA, XPU.
- Integrate with the [empty_strided_mtia](https://www.internalfb.com/code/fbsource/[0d017d3a4a1bdff7253f9c66a9f38e77bd62166b]/fbcode/caffe2/aten/src/ATen/native/mtia/EmptyTensor.cpp?lines=49%2C63%2C71%2C74%2C78) API that we’ve added for the new MTIA ATen backend.
- A change in Inductor runtime to avoid re-initialize MTIADriver.
- BUCK changes to include ATen-mtia in Inductor, and to use -USE_MTIA preprocessor flag.
- Update `test_mnist_e2e.py` to cover native Inductor as backend, using the `--use_native_inductor` flag.
- Add a personal script(`scripts/anwang/run_native_inductor_script.py`) for testing purpose.

Note:
- This approach(option 3) aims to provide a pytorch native approach of Inductor integration for MTIA, minimizing the onboarding overhead. The downside of this approach is that it doesn't leverage MTIA specific graph optimization, and is limited to eagerly launch overhead.
- MTIA will support another approach(option 2) to provide best performance, based on WrapperFxCodegen. We should be able to reuse the fundamental changes of this diff for option 2, like the device interfaces, steam/event APIs, etc, especially as WrapperFxCodegen inherits PythonWrapperCodegen.

Internal:
References:
- [post for context](https://fb.workplace.com/groups/mtiasw/permalink/1718377262384606/)
- [Inductor integration discussion(option 1/2/3)](https://docs.google.com/document/d/1p6363OXtVIRv1hPoaKlRSK3j-iir3QIbDd5bjyqCNig/edit?tab=t.0#heading=h.7s4ns6wcnhmb)
- [Project design doc(option 3)](https://docs.google.com/document/d/1jXUmhgoV9WvkMf-bcY3Od_kK9K_RDOdgHdt1LoQ5Tc4/edit?tab=t.0#heading=h.y43gwdqlv46w)
- [early prototying diff](https://www.internalfb.com/diff/D75110196)
- [MPS integration PR](https://github.com/pytorch/pytorch/pull/153959)
- [empty_strided_xpu PR](https://github.com/pytorch/pytorch/pull/126678)

Differential Revision: [D79040806](https://our.internmc.facebook.com/intern/diff/D79040806/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159211
Approved by: https://github.com/eellison, https://github.com/blaine-rister, https://github.com/jansel
2025-07-29 17:03:24 +00:00
Justin Chu
de529ef002 [ONNX] onnx.md to simplify deprecated entities (#159312)
Simplify documentation of deprecated entities and remove the auto-generated page for JitScalarType
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159312
Approved by: https://github.com/titaiwangms
2025-07-29 14:24:17 +00:00
William Wen
ffccb90ff4 [dynamo, docs] add fullgraph=False docs (#159050)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159050
Approved by: https://github.com/svekars, https://github.com/anijain2305
ghstack dependencies: #157985, #158055, #158531
2025-07-29 01:53:47 +00:00
William Wen
f916f34739 [dynamo, docs] non-strict programming model docs (#158531)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158531
Approved by: https://github.com/AlannaBurke, https://github.com/mlazos, https://github.com/anijain2305
ghstack dependencies: #157985, #158055

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-07-29 01:53:47 +00:00
William Wen
c32994ce4b [docs, dynamo] add fullgraph=True, common graph breaks docs (#158055)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158055
Approved by: https://github.com/AlannaBurke, https://github.com/anijain2305
ghstack dependencies: #157985

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-07-29 01:53:41 +00:00
William Wen
433e43cbec [dynamo, docs] programming model dynamo core concepts (#157985)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157985
Approved by: https://github.com/svekars, https://github.com/anijain2305
2025-07-29 01:53:34 +00:00
PyTorch MergeBot
fe0ff12dab Revert "[Inductor] Support native Inductor as backend for MTIA (#158526)"
This reverts commit cd68559d04.

Reverted https://github.com/pytorch/pytorch/pull/158526 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/158526#issuecomment-3122186057))
2025-07-26 17:58:00 +00:00
anwang
cd68559d04 [Inductor] Support native Inductor as backend for MTIA (#158526)
This diff/PR includes the changes to support native Inductor integration for MTIA. The goal is to support `torch.compile(backend="inductor")` for MTIA. Inductor should generate code(triton kernel + python wrapper code) similar to CUDA. And the triton kernels can be launched eagerly.

The changes include:
- Add MTIA device interfaces used by Dynamo and Inductor, including APIs on device, stream, event, etc.
- Add required torch.mtia APIs, like is_bf16_supported, memory_allocated, set_stream_by_id, etc.
- MTIA specific codegen logic, for example, loading MTIA dynamic_library.
- Other necessary changes to integrate with Inductor codegn, following other devices like CUDA, XPU.
- Integrate with the [empty_strided_mtia](https://www.internalfb.com/code/fbsource/[0d017d3a4a1bdff7253f9c66a9f38e77bd62166b]/fbcode/caffe2/aten/src/ATen/native/mtia/EmptyTensor.cpp?lines=49%2C63%2C71%2C74%2C78) API that we’ve added for the new MTIA ATen backend.
- A change in Inductor runtime to avoid re-initialize MTIADriver.
- BUCK changes to include ATen-mtia in Inductor, and to use -USE_MTIA preprocessor flag.
- Update `test_mnist_e2e.py` to cover native Inductor as backend, using the `--use_native_inductor` flag.
- Add a personal script(`scripts/anwang/run_native_inductor_script.py`) for testing purpose.

Note:
- This approach(option 3) aims to provide a pytorch native approach of Inductor integration for MTIA, minimizing the onboarding overhead. The downside of this approach is that it doesn't leverage MTIA specific graph optimization, and is limited to eagerly launch overhead.
- MTIA will support another approach(option 2) to provide best performance, based on WrapperFxCodegen. We should be able to reuse the fundamental changes of this diff for option 2, like the device interfaces, steam/event APIs, etc, especially as WrapperFxCodegen inherits PythonWrapperCodegen.

Internal:
References:
- [post for context](https://fb.workplace.com/groups/mtiasw/permalink/1718377262384606/)
- [Inductor integration discussion(option 1/2/3)](https://docs.google.com/document/d/1p6363OXtVIRv1hPoaKlRSK3j-iir3QIbDd5bjyqCNig/edit?tab=t.0#heading=h.7s4ns6wcnhmb)
- [Project design doc(option 3)](https://docs.google.com/document/d/1jXUmhgoV9WvkMf-bcY3Od_kK9K_RDOdgHdt1LoQ5Tc4/edit?tab=t.0#heading=h.y43gwdqlv46w)
- [early prototying diff](https://www.internalfb.com/diff/D75110196)
- [MPS integration PR](https://github.com/pytorch/pytorch/pull/153959)
- [empty_strided_xpu PR](https://github.com/pytorch/pytorch/pull/126678)

Differential Revision: [D78458745](https://our.internmc.facebook.com/intern/diff/D78458745/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158526
Approved by: https://github.com/blaine-rister, https://github.com/jansel, https://github.com/eellison
2025-07-26 08:16:34 +00:00
Mikayla Gawarecki
36cf8f1ed8 [BE] Use .md instead of .rst for nn.aliases doc (#158666)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158666
Approved by: https://github.com/janeyx99
ghstack dependencies: #158491, #158654
2025-07-25 22:03:55 +00:00
Mikayla Gawarecki
1e79872f2e [BE] More torch.nn docs coverage test (except for torch.nn.parallel) (#158654)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158654
Approved by: https://github.com/janeyx99
ghstack dependencies: #158491
2025-07-25 22:03:55 +00:00
Mikayla Gawarecki
9e8f27cc79 [BE] Make torch.nn.modules.* satisfy the docs coverage test (#158491)
Options to address the "undocumented python objects":

1. Reference the functions in the .rst via the torch.nn.modules namespace. Note that this changes the generated doc filenames / locations for most of these functions!
2. [Not an option] Monkeypatch `__module__` for these objects (broke several tests in CI due to `inspect.findsource` failing after this change)
3. Update the .rst files to also document the torch.nn.modules forms of these functions, duplicating docs.

#### [this is the docs page added](https://docs-preview.pytorch.org/pytorch/pytorch/158491/nn.aliases.html)
This PR takes option 3 by adding an rst page nn.aliases that documents the aliases in nested namespaces, removing all the torch.nn.modules.* entries from the coverage skiplist except
- NLLLoss2d (deprecated)
- Container (deprecated)
- CrossMapLRN2d (what is this?)
- NonDynamicallyQuantizableLinear

This mostly required adding docstrings to `forward`, `extra_repr` and `reset_parameters`. Since forward arguments are already part of the module docstrings I just added a very basic docstring.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158491
Approved by: https://github.com/janeyx99
2025-07-25 22:03:55 +00:00
raghavhrishi
7ef3c3357d NUMA binding integration with elastic agent and torchrun (#149334)
Implements #148689

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149334
Approved by: https://github.com/d4l3k

Co-authored-by: Paul de Supinski <pdesupinski@gmail.com>
2025-07-25 21:19:49 +00:00
Joel Schlosser
316c188a5e Remove torch.functional entries from the doc ignore list (#158581)
Options to address the "undocumented python objects":
1. Reference the functions in the .rst via the `torch.functional` namespace. Note that this changes the generated doc filenames / locations for most of these functions!
2. Document these functions by referencing them from the `torch.` namespace instead, in line with common usage. This would also require setting the `__module__` for these functions and moving entries from `torch.functional`'s `__all__` -> `torch`'s `__all__`, which is BC-breaking.
3. Update the .rst files to also document the `torch.functional` forms of these functions, duplicating docs.

This PR takes option (3) above and:
* Removes all 20 `torch.functional` entries from the doc ignore list
* Removes `torch.functional.align_tensors()` entirely, since we don't want to document it.
    * This is technically BC-breaking, although the previous impl simply errored out. This change could be moved to a separate isolated PR for safety.
* Introduces `torch.aliases.md` as a hidden page for the `torch.functional` aliases to the `torch` analogue functions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158581
Approved by: https://github.com/janeyx99
2025-07-25 17:19:01 +00:00
PyTorch MergeBot
c8316d0e79 Revert "[BE] Remove torch deploy | remove torch deploy specific files (#158290)"
This reverts commit 6ed2cb6ccd.

Reverted https://github.com/pytorch/pytorch/pull/158290 on behalf of https://github.com/ZainRizvi due to Reverting as per offline discussion to fix internal breaks.  @PaliC will reland this as a codev diff. Instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/158288#issuecomment-3119037960))
2025-07-25 16:09:39 +00:00
PyTorch MergeBot
a9f6770edd Revert "[BE] Remove __reduce_deploy__ (#158291)"
This reverts commit 9c68c4d08f.

Reverted https://github.com/pytorch/pytorch/pull/158291 on behalf of https://github.com/ZainRizvi due to Reverting as per offline discussion to fix internal breaks.  @PaliC will reland this as a codev diff. Instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/158288#issuecomment-3119037960))
2025-07-25 16:09:39 +00:00
Jeff Daily
9b29166f57 [ROCm] add flag torch.backends.miopen.immediate (#158951)
The MIOpen integration has changed over the years.  In the past, the MIOpen default for benchmark was True and if it were set to False it would use MIOpen Immediate Mode.  But with #145294 the MIOpen benchmark default changed to False and to activate immediate mode you would set the deterministic flag to True.  This has proved too restrictive because benchmark and deterministic flags are independent from immediate mode.  Thus, immediate mode needs its own flag.  Though MIOpen still masquerades behind torch.backends.cudnn and its flags, it seemed inappropriate to add an miopen-exclusive flag to the set of cudnn flags.  This PR adds the first miopen-only flag to control its immediate mode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158951
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-07-25 04:01:51 +00:00
Xuehai Pan
f5e2de928b [BE] fix remaining flake8 v7 warnings (#159044)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159044
Approved by: https://github.com/Skylion007
ghstack dependencies: #159043
2025-07-25 02:56:34 +00:00
Ti-Tai Wang
da35562bba [ONNX] Filter out torchscript sentences (#158850)
Fixes #157300

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158850
Approved by: https://github.com/justinchuby, https://github.com/svekars
2025-07-24 20:59:06 +00:00
Wei (Will) Feng
693197eed6 [doc] remove FSDP1 developer note (#158991)
this resolve pytorch doc audit - we remove fsdp1 doc and promote fsdp2

https://docs.pytorch.org/tutorials/intermediate/FSDP_tutorial.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158991
Approved by: https://github.com/svekars, https://github.com/mori360
ghstack dependencies: #158989
2025-07-24 08:21:54 +00:00
Wei (Will) Feng
68349118b5 [doc] add weifengpy to torch distributed pocs (#158989)
<img width="415" height="355" alt="Screenshot 2025-07-23 at 16 02 12" src="https://github.com/user-attachments/assets/35b6bb45-d5ed-4d74-8369-e8e66aaa2618" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158989
Approved by: https://github.com/mori360
2025-07-24 04:42:33 +00:00
Mikayla Gawarecki
7f649ed4f8 Add basic torch.hash_tensor op (#154149)
Added `torch.hash_tensor` reduction function with a `mode` argument that defaults to reduction with xor.

- The hash is always uint64.
- Integers will be casted to uint64 before performing the xor_sum reduction
- Floats will be upcasted to double and then bitcasted to uint64 before performing the xor_sum reduction

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154149
Approved by: https://github.com/albanD
2025-07-23 22:28:03 +00:00
fduwjj
82f8e04f27 Update distributed maintainers (#158900)
I maintain couple components of distributed like devicemesh, c10d and PGNCCL, gloo, etc. Can I be marked not as emeritus? Thanks!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158900
Approved by: https://github.com/albanD
2025-07-23 21:53:27 +00:00
PaliC
9c68c4d08f [BE] Remove __reduce_deploy__ (#158291)
This PR removes the integration point torch.fx had with torch::deploy (and another minor change).

Note: This PR has some broken mypy errors, but I believe those should have been in the code base beforehand, and should be fixed in a separate PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158291
Approved by: https://github.com/albanD
ghstack dependencies: #158288, #158290
2025-07-23 20:27:28 +00:00
PaliC
6ed2cb6ccd [BE] Remove torch deploy | remove torch deploy specific files (#158290)
This PR removes specific files found in pytorch which are only used for torch::deploy. This is mostly testing code and a debugger.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158290
Approved by: https://github.com/albanD
ghstack dependencies: #158288
2025-07-23 20:27:28 +00:00
drisspg
691736ae07 Add kernel options to flex docs (#158875)
Fixes https://github.com/pytorch/pytorch/issues/158741
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158875
Approved by: https://github.com/BoyuanFeng, https://github.com/albanD
2025-07-23 19:05:19 +00:00
Panagiotis Kourdis
fd47401536 [doc] Updates to distributed.md for XCCL backend (#155834)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155834
Approved by: https://github.com/guangyey, https://github.com/AlannaBurke, https://github.com/d4l3k

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-07-22 21:01:43 +00:00
PyTorch MergeBot
6341311333 Revert "Add unified memory APIs for torch.accelerator (#152932)"
This reverts commit 2ad5c25cfc.

Reverted https://github.com/pytorch/pytorch/pull/152932 on behalf of https://github.com/ZainRizvi due to Very sorry but this is still breaking internally. @albanD would you be able to help get this past the finish line? D78496124 has more details on the failure and the workaround might be to do something like what's in D78684669. To validate the fixes internally, you can follow the instructions here to ghimport the changes: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/138222#issuecomment-3100195370))
2025-07-22 01:01:41 +00:00
PyTorch MergeBot
4c18e85300 Revert "[BE] Remove torch deploy | remove torch deploy specific files (#158290)"
This reverts commit a6de309ca1.

Reverted https://github.com/pytorch/pytorch/pull/158290 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally, see D78496147 for details. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/158288#issuecomment-3099826158))
2025-07-21 23:17:39 +00:00
PyTorch MergeBot
920f26c761 Revert "[BE] Remove __reduce_deploy__ (#158291)"
This reverts commit 0b9fb91f17.

Reverted https://github.com/pytorch/pytorch/pull/158291 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally, see D78496147 for details. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/158288#issuecomment-3099826158))
2025-07-21 23:17:38 +00:00
Jane Xu
7cc5d03dfc Document the rest of the specific optimizer module APIs (#158669)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158669
Approved by: https://github.com/albanD
ghstack dependencies: #158483
2025-07-19 07:27:15 +00:00
Jane Xu
f73594164a [BE] document Adadelta and Adagrad APIs properly (#158483)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158483
Approved by: https://github.com/albanD
2025-07-19 07:27:15 +00:00
Svetlana Karslioglu
79e49efadd Pull latest Sphinx theme (#158595)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158595
Approved by: https://github.com/albanD
2025-07-18 18:46:47 +00:00
PyTorch MergeBot
9a7c2f1f64 Revert "Add torch compile force disable caches alias (#158072)"
This reverts commit 2ecf083b72.

Reverted https://github.com/pytorch/pytorch/pull/158072 on behalf of https://github.com/jeffdaily due to fails on rocm, signal ignored while rocm was unstable ([comment](https://github.com/pytorch/pytorch/pull/158072#issuecomment-3086740829))
2025-07-18 04:58:24 +00:00
angelayi
66c9bc5062 [export] Add runnable code to export docs (#158506)
Preview: https://docs-preview.pytorch.org/pytorch/pytorch/158506/export.html

Yay I can add runnable code to export docs now
Also moved export API reference to a different file.

With these changes, we can start to consolidate the [export tutorial](https://docs.pytorch.org/tutorials/intermediate/torch_export_tutorial.html) with the docs on pytorch docs. We just need to move the section on DDE and 0/1 specialization, and then I think we can delete the export tutorial.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158506
Approved by: https://github.com/pianpwk, https://github.com/svekars
2025-07-17 20:15:22 +00:00
Oguz Ulgen
2ecf083b72 Add torch compile force disable caches alias (#158072)
Bunch of people keep thinking current alias only disables inductor cache because it has the name inductor in it. lets globalize the name

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158072
Approved by: https://github.com/ezyang
2025-07-17 15:40:36 +00:00
Jiang, Yanbing
f4d8bc46c7 Enable TF32 as fp32 internal precision for matmul/linear/conv (#157520)
### Description

This PR is to enable TF32 as fp32 internal precision for matmul/linear/conv in `mkldnn backend`. Since we have refined fp32 precision API in https://github.com/pytorch/pytorch/pull/125888, we can easily extend the API to support TF32 for `mkldnn backend`.

```
torch.backends.mkldnn.matmul.fp32_precision = 'tf32'
torch.backends.mkldnn.conv.fp32_precision = "tf32"
```

Related kernel update and UTs update are done. And the wrapper `bf32_on_and _off` is updated to `reduced_f32_on_and_off`, and it can run tests 3 times, one is reduced_f32 OFF, the other two are reduced_f32 ON (including `bf32 ON` and `tf32 ON`).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157520
Approved by: https://github.com/mingfeima, https://github.com/jansel
2025-07-17 08:57:34 +00:00
PaliC
0b9fb91f17 [BE] Remove __reduce_deploy__ (#158291)
This PR removes the integration point torch.fx had with torch::deploy (and another minor change).

Note: This PR has some broken mypy errors, but I believe those should have been in the code base beforehand, and should be fixed in a separate PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158291
Approved by: https://github.com/albanD
ghstack dependencies: #158288, #158290
2025-07-17 05:56:26 +00:00
PaliC
a6de309ca1 [BE] Remove torch deploy | remove torch deploy specific files (#158290)
This PR removes specific files found in pytorch which are only used for torch::deploy. This is mostly testing code and a debugger.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158290
Approved by: https://github.com/albanD
ghstack dependencies: #158288
2025-07-17 05:56:18 +00:00
Yu, Guangye
2ad5c25cfc Add unified memory APIs for torch.accelerator (#152932)
# Motivation
The following API will be put under torch.accelerator
- empty_cache
- max_memory_allocated
- max_memory_reserved
- memory_allocated
- memory_reserved
- memory_stats
- reset_accumulated_memory_stats
- reset_peak_memory_stats

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152932
Approved by: https://github.com/albanD
ghstack dependencies: #138222
2025-07-17 01:56:01 +00:00
Yiming Zhou
a9ee4250d5 [4/n] Remove references to TorchScript in PyTorch docs (#158317)
Summary: jit.rst

Test Plan:
CI

Rollback Plan:

Differential Revision: D78309840

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158317
Approved by: https://github.com/svekars, https://github.com/zhxchen17
2025-07-16 20:01:34 +00:00
angelayi
1cc62c2cb9 [export] Update docs (#157750)
Preview: https://docs-preview.pytorch.org/pytorch/pytorch/157750/export.html

Changes:
* Rename draft_export.md -> export.draft_export.md for consistency.
* Removed non-strict section in export, instead pointed to programming model doc.
* Extended "Expressing Dynamism" section to include Dim hints, ShapeCollection, and AdditionalInputs.
* Removed Specialization section in favor of programming model doc
* Added pt2 archive doc
* Cleaned up sidebar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157750
Approved by: https://github.com/pianpwk
2025-07-16 19:53:12 +00:00
Jiang, Yanbing
900fba4c07 Update warning of TF32 (#158209)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158209
Approved by: https://github.com/jansel
2025-07-16 01:28:50 +00:00
Yiming Zhou
05dfd312cf [3/n] Remove references to TorchScript in PyTorch docs (#158315)
Summary:
- cpp_index.rst
- fx.md
- jit_builtin_functions.rst
- jit_python_reference.md
- jit_unsupported.md

cpu_threading
large_scale_deployment

Test Plan:
CI

Rollback Plan:

Differential Revision: D78309320

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158315
Approved by: https://github.com/svekars, https://github.com/zhxchen17
2025-07-15 21:14:18 +00:00
Yiming Zhou
0640cfa38c [2/n] Remove references to TorchScript in PyTorch docs (#158306)
Summary: Removed jit_language_reference.md

Test Plan:
CI

Rollback Plan:

Differential Revision: D78308133

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158306
Approved by: https://github.com/svekars, https://github.com/zhxchen17
2025-07-15 20:57:23 +00:00
Yiming Zhou
19625daf88 [1/n] Remove references to TorchScript in PyTorch docs (#158305)
Summary: Removed jit_language_reference_v2.md

Test Plan:
CI

Rollback Plan:

Differential Revision: D78308009

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158305
Approved by: https://github.com/jingsh, https://github.com/svekars
2025-07-15 20:16:53 +00:00
Ti-Tai Wang
5606c516fd [ONNX] Remove legacy Dort (#158258)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158258
Approved by: https://github.com/justinchuby, https://github.com/malfet
2025-07-15 19:14:06 +00:00
Jason Ansel
31326a9ad7 Fix typo in torch.set_float32_matmul_precision docs (#158191)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158191
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-07-12 18:23:11 +00:00
Ti-Tai Wang
2eff14c445 [ONNX] Delete torch.onnx.dynamo_export (#158130)
It's deprecated since torch==2.7.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158130
Approved by: https://github.com/justinchuby
2025-07-12 02:30:47 +00:00
Tristan Rice
0d77364ee3 dist2: cleanup non-option methods on PG (missing, timeouts) (#158123)
This updates the ProcessGroup.* API to include timeouts on all non-option based overloaded methods. This also adds 2 missing ones `alltoall_base` and `barrier`.

Following design in: https://docs.google.com/document/d/13R-1t_yESTvmAjcCN-wQjQQadIEu0JNIdS65uZawZzY/edit?tab=t.0#heading=h.3ctbqqopzc89

Test plan:

```
pytest test/distributed/test_dist2.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158123
Approved by: https://github.com/Skylion007, https://github.com/fduwjj
2025-07-12 00:06:37 +00:00
Shivam Raikundalia
11d6ad8b2e [Docs] Update PT2 Profiler Torch-Compiled Region Image (#158066)
Summary: In Pytorch 2.5 we added source code attribution to PT2 traces. Each Torch-Compiled Region will now have its frame id and frame compile id associated with it. Update the image in the doc and add a description of this in the doc itself

Test Plan:
{F1980179183}

Rollback Plan:

Differential Revision: D78118228

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158066
Approved by: https://github.com/aaronenyeshi
2025-07-11 07:56:45 +00:00
zeshengzong
b4fc42ca80 Add torch.segment_reduce docs (#154352)
Fixes #153138

## Test Result

![image](https://github.com/user-attachments/assets/62346d62-d048-4259-906b-f8261e10b4cc)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154352
Approved by: https://github.com/albanD
2025-07-11 06:16:38 +00:00
Jerry Zhang
11a86ad2fa Remove pytorch quant docs since we are moving to torchao (#157766)
Summary:
att

Test Plan:
doc page generated from CI

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157766
Approved by: https://github.com/Skylion007
2025-07-11 03:21:47 +00:00
Howard Huang
8532033679 RPC tutorial audit (#157938)
Fix [T228333894](https://www.internalfb.com/intern/tasks/?t=228333894)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157938
Approved by: https://github.com/AlannaBurke
2025-07-10 14:15:37 +00:00
Dmitry Rogozhkin
b146ca74f0 docs: add get_default_backend_for_device to distributed documentation (#156783)
`torch.distributed.get_default_backend_for_device()` API was added to torch 2.6, but is still missing in distributed documentation. This commit addresses the gap.

CC: @guangyey, @EikanWang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156783
Approved by: https://github.com/guangyey, https://github.com/malfet
2025-07-10 05:11:30 +00:00
Tristan Rice
ed051c3084 torch.distributed: add initial _dist2 prototype API (#157841)
This adds the initial dist2 API as proposed in https://docs.google.com/document/d/13R-1t_yESTvmAjcCN-wQjQQadIEu0JNIdS65uZawZzY/edit?tab=t.0#heading=h.3ctbqqopzc89

This is a WIP experimental API and is a sandbox for a number of new features and quality of life improvements/changes to c10d.

Test plan:

```
pytest test/distributed/test_dist2.py
```

Docs

```
cd docs
make html
```

![Screenshot 2025-07-08 at 13-39-23 Object Oriented Distributed API - torch distributed _dist2 — PyTorch main documentation](https://github.com/user-attachments/assets/9c03a7ec-09e5-42b9-8478-1ec28bc2b6bd)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157841
Approved by: https://github.com/fduwjj
2025-07-09 23:40:43 +00:00
Dhia-naouali
eaf32fffb7 fixed a tiny typo in torch.compiler.md (#157462)
Fixes #157444

there was a typo in [docs/source/torch.compiler.md](https://github.com/pytorch/pytorch/blob/main/docs/source/torch.compiler.md) : see -> seen
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157462
Approved by: https://github.com/Skylion007, https://github.com/svekars
2025-07-02 19:15:15 +00:00
PyTorch MergeBot
023887fc5a Revert "Switch to standard pep517 sdist generation (#152098)"
This reverts commit f16053f0c9.

Reverted https://github.com/pytorch/pytorch/pull/152098 on behalf of https://github.com/malfet due to IMO this PR needs to be split into few helper ones, with better test plan ([comment](https://github.com/pytorch/pytorch/pull/152098#issuecomment-3024223880))
2025-07-01 14:14:52 +00:00
Ti-Tai Wang
c174f3a6a5 [ONNX] Delete deprecated tutorial page link (#157310)
Related to https://github.com/pytorch/tutorials/issues/3420

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157310
Approved by: https://github.com/justinchuby
2025-07-01 01:18:26 +00:00
Klaus Zimmermann
f16053f0c9 Switch to standard pep517 sdist generation (#152098)
Generate source tarball with PEP 517 conform build tools instead of the custom routine in place right now.

Closes #150461.

The current procedure for generating the source tarball consists in creation of a source tree by manual copying and pruning of source files.

This PR replaces that with a call to the standard [build tool](https://build.pypa.io/en/stable/), which works with the build backend to produce an sdist. For that to work correctly, the build backend also needs to be configured. In the case of Pytorch, the backend currently is (the legacy version of) the setuptools backend, the source dist part of which is mostly configured via the `MANIFEST.in` file.

The resulting source distribution can be used to install directly from source with `pip install ./torch-{version}.tar.gz` or to build wheels directly from source with `pip wheel ./torch-{version}.tar.gz`; both should be considered experimental for now.

## Issues

### sdist name
According to PEP 517, the name of the source distribution file must coincide with the project name, or [more precisely](https://peps.python.org/pep-0517/#source-distributions), the source distribution of a project that generates `{NAME}-{...}.whl` wheels are required to be named `{NAME}-{...}.tar.gz`. Currently, the source tarball is called `pytorch-{...}.tar.gz`, but the generated wheels and python package are called `torch-{...}`.

### Symbolic Links
The source tree at the moment contains a small number of symbolic links. This [has been seen as problematic](https://github.com/pypa/pip/issues/5919) largely because of lack of support on Windows, but also because of [a problem in setuptools](https://github.com/pypa/setuptools/issues/4937). Particularly unfortunate is a circular symlink in the third party `ittapi` module, which can not be resolved by replacing it with a copy.

PEP 721 (now integrated in the [Source Distribution Format Specification](https://packaging.python.org/en/latest/specifications/source-distribution-format/#source-distribution-archive-features)) allows for symbolic links, but only if they don't point outside the destination directory and if they don't contain `../` in their target.

The list of symbolic links currently is as follows:

<details>

|source|target|problem|solution|
|-|-|-|-|
| `.dockerignore` | `.gitignore` |  ok (individual file) ||
| `docs/requirements.txt` | `../.ci/docker/requirements-docs.txt` |`..` in target|swap source and target[^1]|
| `functorch/docs/source/notebooks` | `../../notebooks/` |`..` in target|swap source and target[^1]|
| `.github/ci_commit_pins/triton.txt` | `../../.ci/docker/ci_commit_pins/triton.txt` |  ok (omitted from sdist)||
| `third_party/flatbuffers/docs/source/CONTRIBUTING.md` | `../../CONTRIBUTING.md` |`..` in target|omit from sdist[^2]|
| `third_party/flatbuffers/java/src/test/java/DictionaryLookup` | `../../../../tests/DictionaryLookup` |`..` in target|omit from sdist[^3]|
| `third_party/flatbuffers/java/src/test/java/MyGame` | `../../../../tests/MyGame` |`..` in target|omit from sdist[^3]|
| `third_party/flatbuffers/java/src/test/java/NamespaceA` | `../../../../tests/namespace_test/NamespaceA` |`..` in target|omit from sdist[^3]|
| `third_party/flatbuffers/java/src/test/java/NamespaceC` | `../../../../tests/namespace_test/NamespaceC` |`..` in target|omit from sdist[^3]|
| `third_party/flatbuffers/java/src/test/java/optional_scalars` | `../../../../tests/optional_scalars` |`..` in target|omit from sdist[^3]|
| `third_party/flatbuffers/java/src/test/java/union_vector` | `../../../../tests/union_vector` |`..` in target|omit from sdist[^3]|
| `third_party/flatbuffers/kotlin/benchmark/src/jvmMain/java` | `../../../../java/src/main/java` |`..` in target|omit from sdist[^3]|
| `third_party/ittapi/rust/ittapi-sys/c-library` | `../../` |`..` in target|omit from sdist[^4]|
| `third_party/ittapi/rust/ittapi-sys/LICENSES` | `../../LICENSES` |`..` in target|omit from sdist[^4]|
| `third_party/opentelemetry-cpp/buildscripts/pre-merge-commit` | `./pre-commit` | ok (individual file)||
| `third_party/opentelemetry-cpp/third_party/prometheus-cpp/cmake/project-import-cmake/sample_client.cc` | `../../push/tests/integration/sample_client.cc` |`..` in target|omit from sdist[^5]|
| `third_party/opentelemetry-cpp/third_party/prometheus-cpp/cmake/project-import-cmake/sample_server.cc` | `../../pull/tests/integration/sample_server.cc` |`..` in target|omit from sdist[^5]|
| `third_party/opentelemetry-cpp/third_party/prometheus-cpp/cmake/project-import-pkgconfig/sample_client.cc` | `../../push/tests/integration/sample_client.cc` |`..` in target|omit from sdist[^5]|
| `third_party/opentelemetry-cpp/third_party/prometheus-cpp/cmake/project-import-pkgconfig/sample_server.cc` | `../../pull/tests/integration/sample_server.cc` |`..` in target|omit from sdist[^5]|
| `third_party/XNNPACK/tools/xngen` | `xngen.py` |  ok (individual file)||

</details>

The introduction of symbolic links inside the `.ci/docker` folder creates a new problem, however, because Docker's `COPY` command does not allow symlinks in this way. We work around that by using `tar ch` to dereference the symlinks before handing them over to `docker build`.

[^1]: These resources can be naturally considered to be part of the docs, so moving the actual files into the place of the current symlinks and replacing them with (unproblematic) symlinks can be said to improve semantics as well.

[^2]: The flatbuffers docs already actually use the original file, not the symlink and in the most recent releases, starting from flatbuffers-25.1.21 the symlink is replaced by the actual file thanks to a documentation overhaul.

[^3]: These resources are flatbuffers tests for java and kotlin and can be omitted from our sdist.

[^4]: We don't need to ship the rust bindings for ittapi.

[^5]: These are demonstration examples for how to link to prometheus-cpp using cmake and can be omitted.

### Nccl
Nccl used to be included as a submodule. However, with #146073 (first released in v2.7.0-rc1), the submodule was removed and replaced with a build time checkout procedure in `tools/build_pytorch_libs.py`, which checks out the required version of nccl from the upstream repository based on a commit pin recorded in `.ci/docker/ci_commit_pins/nccl-cu{11,12}.txt`.
This means that a crucial third party dependency is missing from the source distribution and as the `.ci` folder is omitted from the source distribution, it is not possible to use the build time download.
However, it *is* possible to use a system provided Nccl using the `USE_SYSTEM_NCCL` environment variable, which now also is the default for the official Pytorch wheels.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152098
Approved by: https://github.com/atalman
2025-06-30 19:07:34 +00:00
Saiteja Samudrala
2796f31b5e [DCP] OSS Zero Overhead Checkpointing Implementation (#156207)
Summary: This diff updates DCP driver code/APIs to support Zero Overhead Checkpointing

Test Plan: Test with TorchTitan on this PR: https://github.com/pytorch/torchtitan/pull/1287

Differential Revision: D72391401

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156207
Approved by: https://github.com/teja-rao
2025-06-29 03:19:48 +00:00
Justin Chu
5692cbb818 [ONNX] Delete symbolic caffe2 (#157102)
Caffe2 is removed from pytorch. This is a clean up.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157102
Approved by: https://github.com/titaiwangms, https://github.com/cyyever
2025-06-28 05:22:02 +00:00
Jane Xu
4048a144ab Address richard's comments on libtorch_stable_abi note (#156324)
Followups from #155984

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156324
Approved by: https://github.com/zou3519
2025-06-27 19:19:12 +00:00
Svetlana Karslioglu
2860f5c4f5 Remove mentioning of TorchScript in Export doc (#156969)
Remove mentioning of TorchScript

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156969
Approved by: https://github.com/angelayi

Co-authored-by: Angela Yi <yiangela7@gmail.com>
2025-06-27 17:59:15 +00:00
rzou
aa2d54148d Add AOTDispatcher config to set backward autocast behavior (#156356)
This PR adds a new config `backward_pass_autocast`, to set the backward autocast
behavior. It does not change the existing behavior.

The reason why we need this is that torch.compile acquires a forward and
backward graph at the time of the forward pass. This means that
implemented naively, if there are any context managers active outside
the call to torch.compile, the backward graph will also get the
behaviors from those context managers. This PR gives users a way to
tweak the autocast behavior of the backward pass.

Please see torch._functorch.config for the options to the
`backward_pass_autocast` config.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156356
Approved by: https://github.com/bdhirsh
ghstack dependencies: #155354
2025-06-27 14:58:58 +00:00
Vasiliy Kuznetsov
414ad47045 revamp dtype documentation for 2025 (#156087)
The dtype documentation has not been updated in awhile, let's do a revamp.

1. combine the duplicated docs for dtypes from `tensors.rst` and `tensor_attributes.rst` to live in `tensor_attributes.rst`, and link to that page from `tensors.rst`
2. split the dtype table into floating point and integer dtypes
3. add the definition of shell dtype
4. add the float8 and MX dtypes as shell dtypes to the dtype table
5. remove legacy quantized dtypes from the table
6. add the definition of various dtype suffixes ("fn", etc)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156087
Approved by: https://github.com/albanD
2025-06-27 13:10:23 +00:00
Geevarghese George
7f6e7103a3 Convert to markdown: jit_python_reference.rst, jit_unsupported.rst, jit_utils.rst, library.rst (#155404)
Fixes #155024

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155404
Approved by: https://github.com/svekars
2025-06-26 21:09:46 +00:00
haozhe.zhu
53e0b9c393 refine fp32 precision api (#125888)
Based on the [conversation](https://github.com/pytorch/pytorch/issues/121791), we plan to drop the "highest, high, medium" to represent fp32  internal computation data types . Instead, we will directly use the algorithm to represent it.

### Design Choice: Directly use algorithms name like "TF32", "BF16".
#### Pros
 - The names are more informative. 'tf32' is more informative than a simple "high".
 - Easier to extend new algorithm like `tf32x3`
#### Cons
 - "HIGHEST, HIGH, MEDIUM" indicated the relative precision between different algorithms. However, we can have more documents to discuss them.

### We provide a layered structure for backends/operators.
('f32' is short for 'fp32_precision')
![image](https://github.com/user-attachments/assets/f89143e5-d6a1-4865-9351-9a50439f5067)

### We provide 3 fp32 compute precision can be set:
 - **"ieee"**: Not allowed to use any other internal computation data types .
 - **"tf32"**: Allowed to use tf32 as internal computation data types.
 - **"bf16"**: Allowed to use bf16 as internal computation data types.
 - **"none"**:  Precision's are not set. Can be override by its father node.

### Overriding Precision Settings
Child node can be override by its father node if it is set to default.
For current default settings:
```
backend = generic, op = all, precision setting = none
    backend = cuda, op = all, precision setting = none
        backend = cuda, op = conv, precision setting = tf32
        backend = cuda, op = rnn, precision setting = tf32
        backend = cuda, op = matmul, precision setting = none
    backend = matmul, op = all, precision setting = none
        backend = matmul, op = conv, precision setting = none
        backend = matmul, op = rnn, precision setting = none
        backend = matmul, op = matmul, precision setting = none
```
 - If the user set `torch.backends.mkldnn.fp32_precision="bf16"`, his child nodes `torch.backends.mkldnn.matmul.fp32_precision` / `torch.backends.mkldnn.conv.fp32_precision` / `torch.backends.mkldnn.rnn.fp32_precision` will also be override to "bf16".
 - If the user set `torch.backends.fp32_precision="bf16"`,  `torch.backends.mkldnn.fp32_precision` and his child nodes will also we override to "bf16".

### Backward Compatible
Since new API allow user to have more fine-grained control. There will be some conflict. For example, previous `torch.backends.cudnn.allow_tf32` are not enough to represent the status for `torch.backends.cudnn.rnn.fp32_precision="ieee"` and `torch.backends.cudnn.conv.fp32_precision="tf32"`. Therefore, our goal for backward compatible is
 - If the user only uses previous APIs, it will work as previous expectations.
 - If the user use **new** API to change the status to an **un-representable** status for old API, and try to access the status by **old** API. We will raise Runtime Error and point the document for user.

### Test Plan
```
python test/test_cuda.py -k test_fp32_precision_with_tf32
python test/test_cuda.py -k test_fp32_precision_with_float32_matmul_precision
python test/test_cuda.py -k test_invalid_status_for_legacy_api
python test/test_mkldnn.py -k test_mlkdnn_get_set
python test/test_mkldnn.py -k test_generic_precision
python test/test_mkldnn.py -k test_invalid
python test/test_mkldnn.py -k test_default_use_parent
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125888
Approved by: https://github.com/jgong5, https://github.com/albanD

Co-authored-by: Jiang, Yanbing <yanbing.jiang@intel.com>
2025-06-26 10:32:20 +00:00
Mikayla Gawarecki
2c6324a1eb Delete sections referencing torchscript in serialization docs (#156648)
Address [T228333890](https://www.internalfb.com/intern/tasks/?t=228333890)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156648
Approved by: https://github.com/svekars
2025-06-25 23:41:24 +00:00
albanD
a25d1443fa Mark TorchServe as all emeritus (#156865)
As per title and to follow the broader tutorial cleanup work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156865
Approved by: https://github.com/svekars, https://github.com/malfet, https://github.com/seemethere
2025-06-25 23:34:57 +00:00
Yu, Guangye
9b498d3bb2 Update docs for torch.device (#156686)
# Motivation
Update the doc, to make `torch.device`'s constructor officially support the following methods:
- A device string, which is a string representation of the device type and optionally the device ordinal.
- A device type and a device ordinal.
- A device ordinal, which is treated as the current accelerator type.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156686
Approved by: https://github.com/albanD
2025-06-25 02:12:36 +00:00
PyTorch MergeBot
6459a5c7a9 Revert "Add unified memory APIs for torch.accelerator (#152932)"
This reverts commit 35e44067c4.

Reverted https://github.com/pytorch/pytorch/pull/152932 on behalf of https://github.com/Camyll due to internal build failures ([comment](https://github.com/pytorch/pytorch/pull/138222#issuecomment-3002206756))
2025-06-25 00:11:35 +00:00
Yu, Guangye
35e44067c4 Add unified memory APIs for torch.accelerator (#152932)
# Motivation
The following API will be put under torch.accelerator
- empty_cache
- max_memory_allocated
- max_memory_reserved
- memory_allocated
- memory_reserved
- memory_stats
- reset_accumulated_memory_stats
- reset_peak_memory_stats

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152932
Approved by: https://github.com/albanD
ghstack dependencies: #138222
2025-06-24 07:57:48 +00:00
Edward Z. Yang
17eb649d55 Implement guard collectives (optimized version) (#156562)
This is a remix of https://github.com/pytorch/pytorch/pull/155558

Instead of mediating guard collective via a config option, in this one it's done via a `set_stance` like API. The motivation is that checking for the config value on entry on torch.compile is apparently quite expensive, according to functorch_maml_omniglot. So this makes it a bit cheaper.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156562
Approved by: https://github.com/Microve
2025-06-24 04:59:49 +00:00
Animesh Jain
fab85fc5f9 [compile][hierarchical compilation] Release nested_compile_region API (#156449)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156449
Approved by: https://github.com/zou3519, https://github.com/jansel
2025-06-21 15:14:59 +00:00
Xuehai Pan
2ccfd14e23 [BE] fix typos in docs/ (#156080)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156080
Approved by: https://github.com/cyyever, https://github.com/albanD
2025-06-21 02:47:32 +00:00
Justin Chu
fbbab794ef [ONNX] Implement Attention-23 (#156431)
Implement Attention-23 using sdpa and flexattention.

- I used copilot for this.
- Also updated the conversion logic to remove trailing None inputs.

@gramalingam @kunal-vaishnavi @titaiwangms
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156431
Approved by: https://github.com/titaiwangms

Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-06-20 23:54:57 +00:00
Bhagirath Mehta
de1930a429 Add ONNX dynamo metadata documentation (#155816)
Describe auto-generated metadata when calling torch.onnx.export

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155816
Approved by: https://github.com/justinchuby, https://github.com/titaiwangms

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-06-20 20:12:22 +00:00
sekyondaMeta
18e4c461fb Update index.md (#155143)
Related to: https://github.com/pytorch/pytorch/issues/152134
Update to index.md to add language for Stable and Unstable

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155143
Approved by: https://github.com/AlannaBurke, https://github.com/atalman

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-20 18:53:32 +00:00
windsonsea
9944cd0949 Convert to markdown: quantization-accuracy-debugging.rst, quantization-backend-configuration.rst, quantization-support.rst, random.rst (#155520)
Related to #155032

-   quantization-accuracy-debugging.rst: [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155520/quantization-accuracy-debugging.html) vs [main](https://docs.pytorch.org/docs/main/quantization-accuracy-debugging.html)
-  quantization-backend-configuration.rst: [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155520/quantization-backend-configuration.html) vs [main](https://docs.pytorch.org/docs/main/quantization-backend-configuration.html)
-  quantization-support.rst: [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155520/quantization-support.html) vs [main](https://docs.pytorch.org/docs/main/quantization-support.html)
-  random.rst: [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155520/random.html) vs [main](https://docs.pytorch.org/docs/main/random.html)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155520
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-18 18:46:04 +00:00
nirajkamalk
202d2ae53a Convert rst to md: rpc.rst, signal.rst, size.rst, special.rst (#155430)
Fixes #155033

- [x] [rpc.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/rpc.rst)
- [x] [signal.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/signal.rst)
- [x] [size.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/size.rst)
- [sparse.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/sparse.rst) fixed in #155438 due to large size.
- [x] [special.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/special.rst)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155430
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-18 01:27:04 +00:00
Jane Xu
e8bfce9a43 Document how to use stack-based APIs with StableIValue (#155984)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155984
Approved by: https://github.com/albanD, https://github.com/zou3519
2025-06-18 01:10:23 +00:00
PyTorch MergeBot
fa4f07b5b8 Revert "[Docs] Convert to markdown to fix 155032 (#155520)"
This reverts commit cd66ff8030.

Reverted https://github.com/pytorch/pytorch/pull/155520 on behalf of https://github.com/atalman due to breaks multiple test_quantization.py::TestQuantizationDocs::test_quantization_ ([comment](https://github.com/pytorch/pytorch/pull/155520#issuecomment-2981996091))
2025-06-17 22:22:50 +00:00
windsonsea
cd66ff8030 [Docs] Convert to markdown to fix 155032 (#155520)
Fix #155032

-   quantization-accuracy-debugging.rst: [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155520/quantization-accuracy-debugging.html) vs [main](https://docs.pytorch.org/docs/main/quantization-accuracy-debugging.html)
-  quantization-backend-configuration.rst: [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155520/quantization-backend-configuration.html) vs [main](https://docs.pytorch.org/docs/main/quantization-backend-configuration.html)
-  quantization-support.rst: [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155520/quantization-support.html) vs [main](https://docs.pytorch.org/docs/main/quantization-support.html)
-  quantization.rst: [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155520/quantization.html) vs [main](https://docs.pytorch.org/docs/main/quantization.html)
-  random.rst: [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155520/random.html) vs [main](https://docs.pytorch.org/docs/main/random.html)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155520
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-17 20:29:45 +00:00
Svetlana Karslioglu
fc5ae12293 Fix issue with right-nav (#156119)
Enable on page right nav. For autosummary, we need to set `"show_toc_level": 2` so that navigation is enabled. Example:
* Main: https://docs.pytorch.org/docs/main/special.html - right nav (under On this page) is empty.
* Preview: https://docs-preview.pytorch.org/pytorch/pytorch/156119/special.html - right nav (under On this page) has a all the object listed
<img width="1125" alt="Screenshot 2025-06-16 at 2 48 16 PM" src="https://github.com/user-attachments/assets/0790bb72-5997-4542-9847-0a89be4598c0" />
vs
<img width="1030" alt="Screenshot 2025-06-16 at 2 48 55 PM" src="https://github.com/user-attachments/assets/4897c49c-044d-4bea-a8cd-490c90cca2b0" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156119
Approved by: https://github.com/albanD
2025-06-17 18:09:51 +00:00
windsonsea
7fcad0231c [Docs] Convert to markdown to fix 155025 (#155789)
Related to #155025

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155789
Approved by: https://github.com/svekars
2025-06-17 15:08:14 +00:00
Wei (Will) Feng
4a8f5e752b [FSDP2] explain user contract for fully_shard (#156070)
<img width="896" alt="Screenshot 2025-06-16 at 1 36 00 AM" src="https://github.com/user-attachments/assets/7cdea256-2454-49c7-8b32-24549a13134d" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156070
Approved by: https://github.com/mori360
2025-06-17 10:03:19 +00:00
Justin Silver
008345be9d Fix #155018 (convert distributed rst to markdown) (#155528)
Used [rst2myst tool](https://rst-to-myst.readthedocs.io/en/latest/)

Fixes #155018

Docs comparison (check out the 'new' whenever docs build)

1. distributed.checkpoint ([old](https://docs.pytorch.org/docs/main/distributed.checkpoint.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155528/distributed.checkpoint.html))
2. distributed.elastic ([old](https://docs.pytorch.org/docs/main/distributed.elastic.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155528/distributed.elastic.html))
3. distributed.fsdp.fully_shard ([old](https://docs.pytorch.org/docs/main/distributed.fsdp.fully_shard.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155528/distributed.fsdp.fully_shard.html))
4. distributed.optim ([old](https://docs.pytorch.org/docs/main/distributed.optim.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155528/distributed.optim.html))
5. distributed.pipelining ([old](https://docs.pytorch.org/docs/main/distributed.pipelining.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155528/distributed.pipelining.html))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155528
Approved by: https://github.com/wz337, https://github.com/svekars
2025-06-16 20:46:09 +00:00
Julian De la Barrera Brandner
2dc1627451 [doc] Add documentation for division by zero behavior in autograd (#155987)
Fixes #128796

This PR adds documentation about the behavior of division by zero operations in PyTorch's autograd system. The documentation explains:

1. How division by zero produces `inf` values following IEEE-754 floating point arithmetic
2. How autograd handles these cases and why masking after division can lead to `nan` gradients
3. Provides concrete examples showing the issue
4. Recommends two solutions:
   - Masking before division
   - Using MaskedTensor (experimental API)

The documentation is added to the autograd notes section, making it easily discoverable for users who encounter this common issue.

This addresses the original issue #128796 which requested better documentation of this behavior to help users avoid common pitfalls when dealing with division by zero in their models.

dditional changes:
- Fixed formatting consistency by replacing curly apostrophes with straight apostrophes in the existing documentation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155987
Approved by: https://github.com/soulitzer

Co-authored-by: sekyondaMeta <127536312+sekyondaMeta@users.noreply.github.com>
2025-06-16 19:02:12 +00:00
windsonsea
fbd88ae2b5 Convert to markdown: checkpoint.rst (#156009)
Related to #155014

Use two commits to have a try.
```bash
 1800  git mv docs/source/checkpoint.rst docs/source/checkpoint.md
 1802  git commit -m "[Docs] Rename checkpoint.rst"
 1803  git push origin ckpoint

# update the markdown file
 1805  git add .
 1806  git commit -m "modify checkpoint.md"
 1807  git push origin ckpoint
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156009
Approved by: https://github.com/svekars
2025-06-16 17:48:23 +00:00
windsonsea
a10024d7de Convert complex_numbers.rst to markdown (#156039)
Related to #155014

Have a try by following https://github.com/pytorch/pytorch/pull/155899#issuecomment-2974715750

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156039
Approved by: https://github.com/svekars
2025-06-16 17:24:37 +00:00
Dhia naouali
c620d0b5c7 convert: rst to myst pr2/2 (#155911)
Fixes #155038
parent [PR](https://github.com/pytorch/pytorch/pull/155375) (made two PRs to pass sanity check)
this PR converts the following three .rst files with the mentioned referenced in each file

- [torch.compiler_faq](https://github.com/pytorch/pytorch/blob/main/docs/source/torch.compiler_faq.rst)
  - torch.compiler_troubleshooting
  - nonsupported_numpy_feats
  - torchdynamo_fine_grain_tracing

- [torch.compiler_fine_grain_apis](https://github.com/pytorch/pytorch/blob/main/docs/source/torch.compiler_fine_grain_apis.rst)
  - None

- [torch.compiler_get_started](https://github.com/pytorch/pytorch/blob/main/docs/source/torch.compiler_get_started.rst)
  - torch.compiler_overview
  - torch.compiler_api
  - torchdynamo_fine_grain_tracing

I made the suggested edits by the maintainers as commented in the parent PR
(used git mv on all files, yet it still appeared as delete-create action)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155911
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-16 00:44:44 +00:00
Animesh Jain
54976bca10 [dynamo] Provide helper functions for guard filter hook (#155083)
Collection of ready-made guard filters. One issue is that they are not composable - `filter1(filter2(guard))`. On the other hand, they are easy to use.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155083
Approved by: https://github.com/zhxchen17, https://github.com/jansel
2025-06-15 17:49:36 +00:00
Edgar Romo Montiel
ca3cabd24a Convert to markdown: named_tensor.rst, nested.rst, nn.attention.bias.rst, nn.attention.experimental.rst, nn.attention.flex_attention.rst #155028 (#155696)
Fixes #155028

This pull request updates the documentation  by transitioning from .rst to .md format. It introduces new Markdown files for the documentation of named_tensor, nested, nn.attention.bias, nn.attention.experimental, and nn.attention.flex_attention

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155696
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-14 03:32:00 +00:00
dggaytan
3003c681ef Converting .rst files to .md files (#155377)
Fixes #155036
This pull request updates the documentation for several modules by transitioning from .rst to .md format, improving readability and usability. It introduces new Markdown files for the documentation of torch.ao.ns._numeric_suite, torch.ao.ns._numeric_suite_fx, AOTInductor, AOTInductor Minifier, and the torch.compiler API

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155377
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-13 22:54:27 +00:00