pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-06 12:20:52 +01:00

Author	SHA1	Message	Date
PyTorch MergeBot	1e0656f063	Revert "Always build USE_DISTRIBUTED. (#160449 )" This reverts commit `de893e96c7`. Reverted https://github.com/pytorch/pytorch/pull/160449 on behalf of https://github.com/jeanschmidt due to internal changes breaks import checks, see [D81845053](https://www.internalfb.com/diff/D81845053) ([comment](https://github.com/pytorch/pytorch/pull/160449#issuecomment-3264887002))	2025-09-08 07:04:36 +00:00
PyTorch MergeBot	104f2680e0	Revert "Add return-max-scores to flex-attention (#161667 )" This reverts commit `486b20b73c`. Reverted https://github.com/pytorch/pytorch/pull/161667 on behalf of https://github.com/huydhn due to Sorry for reverting your change but reverting https://github.com/pytorch/pytorch/pull/161730 does not seem to fix all trunk failures ([comment](https://github.com/pytorch/pytorch/pull/161667#issuecomment-3263512642))	2025-09-07 06:00:55 +00:00
Edward Z. Yang	b2b4add0e7	Docs on export joint with descriptors (#159006 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/159006 Approved by: https://github.com/SherlockNoMad	2025-09-06 03:02:58 +00:00
drisspg	486b20b73c	Add return-max-scores to flex-attention (#161667 ) # Summary ### Update API ```Py class AuxRequest(NamedTuple): """Request which auxiliary outputs to compute from flex_attention. Each field is a boolean indicating whether that auxiliary output should be computed. """ lse: bool = False max_scores: bool = False class AuxOutput(NamedTuple): """Auxiliary outputs from flex_attention operation. Fields will be None if not requested, or contain the tensor if requested. """ lse: Optional[Tensor] = None max_scores: Optional[Tensor] = None out_only = flex_attention(query, key, value, score_mod) out_max, aux_max = flex_attention( query, key, value, score_mod, return_aux=FlexAttentionAuxRequest(max_scores=True), ) out_both, aux_both = flex_attention( query, key, value, score_mod, return_aux=FlexAttentionAuxRequest(lse=True, max_scores=True), ) ``` Returns the max post mod scores from flex attention. Not being able to break BC is kinda of annoying here since we end up with a combinatorial problem where if we need to add any more return vals we need to new kwargs that gate if they get returned by the function and need to support the 2**N additional args possible return groups. Ideally there isn't much more we need to return, but we might want to think about how best to set this up for expansion in the future. I added kwarg only now Maybe we make a `ExtraReturns` type kwarg that can grow and we don't need to keep adding new top level args. We could also return a Struct that holds all the extra tensors and start deprecation cycle for logsumexp eventually returning just 1 `ExtraReturns` like struct with the tensors. ### Req Grad I currently dont return a max_scores that supports backproping grads. I think this might be feasible but since max is essentially 1 hot on the inputs and a reduction we would either need to save another `max_location` from the forward or find the max_score but also only apply to first occurence if there is multiple equivalent scores (need to check if thats we define for vanilla max op in torch). For now no grad, we can re-visit if needed. ## Perf I am going to disable for flex_decode. Since at least initially the motivation is for training. I also more hard than it should be to have ops return nuns or optional tensors, If return max is at the false, we should probably just create a tensor of size zero so that we don't slow down the hot path. ```Shell 🔝 Top 5 TFlops Deltas (by absolute %): shape: (5, 7) ┌────────────────┬────────────────┬───────────────────────┬───────────────┬──────────────┬───────────┬───────────┐ │ attn_type ┆ dtype ┆ shape(B,Hq,M,Hkv,N,D) ┆ TFlops (base) ┆ TFlops (max) ┆ delta ┆ pct_delta │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ str ┆ str ┆ f64 ┆ f64 ┆ f64 ┆ f64 │ ╞════════════════╪════════════════╪═══════════════════════╪═══════════════╪══════════════╪═══════════╪═══════════╡ │ causal ┆ torch.bfloat16 ┆ (4, 16, 2048, 16, ┆ 249.514658 ┆ 243.078974 ┆ 6.435684 ┆ 2.647569 │ │ ┆ ┆ 2048, 64) ┆ ┆ ┆ ┆ │ │ alibi ┆ torch.bfloat16 ┆ (2, 16, 1024, 16, ┆ 57.971274 ┆ 56.633641 ┆ 1.337633 ┆ 2.361905 │ │ ┆ ┆ 1024, 64) ┆ ┆ ┆ ┆ │ │ noop ┆ torch.bfloat16 ┆ (4, 16, 1024, 16, ┆ 244.052884 ┆ 248.65129 ┆ -4.598406 ┆ -1.849339 │ │ ┆ ┆ 1024, 64) ┆ ┆ ┆ ┆ │ │ noop ┆ torch.bfloat16 ┆ (2, 16, 1024, 16, ┆ 280.71254 ┆ 275.686991 ┆ 5.025549 ┆ 1.822918 │ │ ┆ ┆ 1024, 128) ┆ ┆ ┆ ┆ │ │ sliding_window ┆ torch.bfloat16 ┆ (2, 16, 16384, 16, ┆ 152.970031 ┆ 150.489109 ┆ 2.480923 ┆ 1.648573 │ │ ┆ ┆ 16384, 64) ┆ ┆ ┆ ┆ │ └────────────────┴────────────────┴───────────────────────┴───────────────┴──────────────┴───────────┴───────────┘ 🔺 Top 5 Positive TFlops Deltas (highest +%): shape: (5, 7) ┌────────────────┬────────────────┬────────────────────────┬───────────────┬──────────────┬──────────┬───────────┐ │ attn_type ┆ dtype ┆ shape(B,Hq,M,Hkv,N,D) ┆ TFlops (base) ┆ TFlops (max) ┆ delta ┆ pct_delta │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ str ┆ str ┆ f64 ┆ f64 ┆ f64 ┆ f64 │ ╞════════════════╪════════════════╪════════════════════════╪═══════════════╪══════════════╪══════════╪═══════════╡ │ causal ┆ torch.bfloat16 ┆ (4, 16, 2048, 16, ┆ 249.514658 ┆ 243.078974 ┆ 6.435684 ┆ 2.647569 │ │ ┆ ┆ 2048, 64) ┆ ┆ ┆ ┆ │ │ alibi ┆ torch.bfloat16 ┆ (2, 16, 1024, 16, ┆ 57.971274 ┆ 56.633641 ┆ 1.337633 ┆ 2.361905 │ │ ┆ ┆ 1024, 64) ┆ ┆ ┆ ┆ │ │ noop ┆ torch.bfloat16 ┆ (2, 16, 1024, 16, ┆ 280.71254 ┆ 275.686991 ┆ 5.025549 ┆ 1.822918 │ │ ┆ ┆ 1024, 128) ┆ ┆ ┆ ┆ │ │ sliding_window ┆ torch.bfloat16 ┆ (2, 16, 16384, 16, ┆ 152.970031 ┆ 150.489109 ┆ 2.480923 ┆ 1.648573 │ │ ┆ ┆ 16384, 64) ┆ ┆ ┆ ┆ │ │ causal ┆ torch.bfloat16 ┆ (4, 16, 1024, 16, ┆ 161.031318 ┆ 158.597808 ┆ 2.43351 ┆ 1.534391 │ │ ┆ ┆ 1024, 64) ┆ ┆ ┆ ┆ │ └────────────────┴────────────────┴────────────────────────┴───────────────┴──────────────┴──────────┴───────────┘ 🔻 Top 5 Negative TFlops Deltas (lowest -%): shape: (5, 7) ┌────────────────┬────────────────┬───────────────────────┬───────────────┬──────────────┬───────────┬───────────┐ │ attn_type ┆ dtype ┆ shape(B,Hq,M,Hkv,N,D) ┆ TFlops (base) ┆ TFlops (max) ┆ delta ┆ pct_delta │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ str ┆ str ┆ f64 ┆ f64 ┆ f64 ┆ f64 │ ╞════════════════╪════════════════╪═══════════════════════╪═══════════════╪══════════════╪═══════════╪═══════════╡ │ noop ┆ torch.bfloat16 ┆ (4, 16, 1024, 16, ┆ 244.052884 ┆ 248.65129 ┆ -4.598406 ┆ -1.849339 │ │ ┆ ┆ 1024, 64) ┆ ┆ ┆ ┆ │ │ alibi ┆ torch.bfloat16 ┆ (2, 16, 1024, 4, ┆ 175.546923 ┆ 177.81205 ┆ -2.265127 ┆ -1.273888 │ │ ┆ ┆ 1024, 128) ┆ ┆ ┆ ┆ │ │ sliding_window ┆ torch.bfloat16 ┆ (4, 16, 16384, 4, ┆ 156.282597 ┆ 158.209134 ┆ -1.926537 ┆ -1.217715 │ │ ┆ ┆ 16384, 64) ┆ ┆ ┆ ┆ │ │ sliding_window ┆ torch.bfloat16 ┆ (2, 16, 2048, 16, ┆ 232.542929 ┆ 235.140136 ┆ -2.597207 ┆ -1.104536 │ │ ┆ ┆ 2048, 128) ┆ ┆ ┆ ┆ │ │ alibi ┆ torch.bfloat16 ┆ (2, 16, 1024, 16, ┆ 169.652791 ┆ 171.475986 ┆ -1.823195 ┆ -1.063236 │ │ ┆ ┆ 1024, 128) ┆ ┆ ┆ ┆ │ └────────────────┴────────────────┴───────────────────────┴───────────────┴──────────────┴───────────┴───────────┘ ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/161667 Approved by: https://github.com/Chillee, https://github.com/BoyuanFeng	2025-09-05 23:21:46 +00:00
Edward Yang	de893e96c7	Always build USE_DISTRIBUTED. (#160449 ) Signed-off-by: Edward Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/160449 Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/dcci	2025-09-05 20:15:11 +00:00
PyTorch MergeBot	adae7f66aa	Revert "Always build USE_DISTRIBUTED. (#160449 )" This reverts commit `c37103234a`. Reverted https://github.com/pytorch/pytorch/pull/160449 on behalf of https://github.com/jeanschmidt due to Breaking internal build rules, see D81756619 ([comment](https://github.com/pytorch/pytorch/pull/160449#issuecomment-3259430011))	2025-09-05 18:58:47 +00:00
Edward Yang	c37103234a	Always build USE_DISTRIBUTED. (#160449 ) Signed-off-by: Edward Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/160449 Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/dcci	2025-09-04 19:43:17 +00:00
Frank Lin	0c0e056a9e	[CUDA] Reuse blocks with record_stream during CUDA Graph capture in the CUDACachingAllocator (#158352 ) ## Introduction During CUDA Graph capture, the CUDA caching allocator currently defers reclaiming blocks until capture ends. This is because CUDA forbids querying events recorded during capture (the CUDA operation is not executed during the capture stage), so the allocator cannot use its normal event-based logic. However, capture records an DAG (we call it capturing graph) of work. We can use the capturing graph to determine when a block’s old lifetime is fully before future work, and safely reuse it within the same capture. This PR adds an experimental flag `graph_capture_record_stream_reuse: True\|False (default: False)`. When enabled, the allocator inserts lightweight free markers and uses capture ordering to decide if a freed block is safe to reuse during capture. If the proof cannot be established, we fall back to the existing post-capture path. ## Terms * Free marker: A capture-legal no-op (created with `cudaGraphAddEmptyNode`) inserted after the last captured use of the block on each stream that used it. * Terminal: The set of the lastest operations of the stream (or the capturing graph). Any newly captured op on that stream will attach after all nodes in this set. For a stream currently capturing, it is the set of nodes returned in `dependencies_out` by `cudaStreamGetCaptureInfo`. ## When can we reuse a block during capture? ### Strong Rule (Graph-Wide Safety) This rule provides a universal guarantee that a block is safe for reuse by any stream in the graph. > A block is safe to reuse if every free marker is a predecessor of every terminal of all active streams in the graph. Why it's safe: This rule establishes a strict global ordering. Since any new operation on any stream must be appended after that stream's terminals, this condition guarantees that the block's new lifetime begins only after its old lifetime has completely ended everywhere. This prevents lifetime overlaps when the graph is replayed, ensuring correctness. ### Per-stream Rule (A Practical Optimization) The strong rule, while safe, is often unnecessarily restrictive. The `DeviceCachingAllocator` introduces a crucial constraint that allows for a simpler check. In `DeviceCachingAllocator`, `get_free_block` only returns blocks whose `block->stream == p.stream()`. In other words, we never reuse a block on a stream different from the allocation stream. This means we don't need to verify safety across the entire graph. We only need to confirm that the block is safe to reuse from the perspective of its own allocation stream. > Reuse a block for allocations on stream S if every free marker is a predecessor of every node in the terminal set of S. In short, a block is considered reusable on stream S as long as all marker marking it "free" are guaranteed to complete before any new work that might need it on stream S begins. ## Implementation * On `free(block)` during capture * For each stream in `block->stream_uses` and the allocation stream, insert a free marker (empty node) and make it that stream’s tail. * If we cannot place markers for all such streams (for example, a stream is not in capture), defer to the post-capture path. * Otherwise, store the marker handles and keep the block in the capture-private structures. * On `allocate(stream)` during capture (attempt per-stream reclaim) * Query the allocation stream S’s terminal via `cudaStreamGetCaptureInfo`. * For each deferred block, check whether it is allocated on this stream, and each of its free markers is a predecessor of the terminal. * If yes, hand the block to S for immediate reuse within the same capture. * If no, keep it deferred; it will be reconsidered as capture progresses and S’s terminal advances. * On capture end * Any still-deferred blocks follow the existing post-capture reclamation (event insertion/polling). External behavior remains unchanged if we cannot prove safety during capture. ## Examples (2 streams) <img width="641" height="801" alt="pytorch-remove-cudagraph-defer-reclaiming (6)" src="https://github.com/user-attachments/assets/41adc835-d448-483b-99ba-b4341cb7d2a2" /> * Case 0 — Unsafe The two frees are not ordered with respect to each other. For stream 1, the other stream’s free marker does not precede this stream’s terminal, so the per-stream condition fails. Counterexample intuition for the unsafe setups: imagine `f2(x)` runs for a long time. If DeviceCachingAllocator reused block `x` on a stream whose terminal is not ordered after the free markers, the new lifetime could overlap the old one on replay, risking use-after-free or data corruption. The per-stream rule prevents exactly this. * Case 1 — Reusable on stream 1 Stream 1’s terminal is after both frees, so every free marker precedes stream 1’s terminal. The block is reusable for allocations on stream 1. * Case 2 — Not reusable on stream 2, but this cannot occur in `DeviceCachingAllocator` This depicts reusing the block on stream 2 while stream 1’s free is not yet ordered before stream 2’s terminal. Though the block is not safe to reuse on stream 2, DeviceCachingAllocator will not choose that block for stream 2 anyway: `get_free_block` rejects blocks whose `stream != p.stream()`. So this case is unreachable. * Case 3 — Safe (strong rule holds) In this scenario, the terminal nodes of all streams are positioned after the block's free markers, satisfying the strong rule. This guarantees the block is safe for reuse by any stream in the capturing graph. However, since `DeviceCachingAllocator ` only reuses a block on its original allocation stream, verifying this strong condition is unnecessary. We only need to ensure the per-stream rule is met for the specific stream requesting the block. * Case 4 — Freeing after a join See the note below. ## Edge Case: Freeing after a join Our current dependency tracking has a limitation in scenarios where a block is freed after a stream join, see @galv's [comments here](https://github.com/pytorch/pytorch/pull/158352#pullrequestreview-3112565198)). In the case 4, we have a missed opportunity. Because the block's usage is not explicitly marked, we cannot determine that the block's actual last use may have occurred much earlier, long before the join. Then, we must wait for the subsequent join before the block can be reused. ## Thanks Thanks to @galv for his great idea around graph parsing and empty nodes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158352 Approved by: https://github.com/ngimel, https://github.com/eqy Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-09-04 17:21:26 +00:00
William Wen	f36f285953	[dynamo] change error_on_graph_break/fullgraph semantics (#161747 ) This PR implements the semantics change to `torch._dynamo.error_on_graph_break`: - ~`torch.compile` now has a new `error_on_graph_break` kwarg that serves as a lower-priority toggle for erroring/continuing on graph breaks~ - `error_on_graph_break` is a new internal `torch.compile `setting that is lower-priority than `fullgraph`. It allows the user to toggle erroring/continuing on graph breaks. - `error_on_graph_break` does nothing when `fullgraph=True` - `error_on_graph_break` does NOT guarantee a single graph Followup [DONE]: need to change the programming model docs to reflect the 3 graph break modes for compilation: - `fullgraph=True`: enforce one graph, no graph breaks, cannot be toggled - `fullgraph=False, error_on_graph_break=True`: errors on graph breaks, latter can be toggled during compile time - `fullgraph=False, error_on_graph_break=False`: resumes tracing on graph breaks, latter can be toggled during compile time Pull Request resolved: https://github.com/pytorch/pytorch/pull/161747 Approved by: https://github.com/mlazos ghstack dependencies: #161739	2025-09-04 17:10:17 +00:00
PyTorch MergeBot	b7dad7dd49	Revert "Always build USE_DISTRIBUTED. (#160449 )" This reverts commit `90b08643c3`. Reverted https://github.com/pytorch/pytorch/pull/160449 on behalf of https://github.com/jeanschmidt due to Already discussed with @ezyang about the internal quirks and errors ([comment](https://github.com/pytorch/pytorch/pull/160449#issuecomment-3254219358))	2025-09-04 15:25:07 +00:00
Saurabh Mishra	1281470155	[DCP][HuggingFace] Add Support for dequantization of SafeTensors checkpoints (#160682 ) This PR introduces the QuantizedHuggingFaceReader component which enables the reading and dequantization of the quantized tensors in the SafeTensors checkpoint. Following capabilities are inrtoduced: - Configuration the target DType and the block size. - Multi threaded dequantization for efficiency Test Plan: buck test //caffe2/test/distributed/checkpoint\:test_quantized_hf_storage ``` Time elapsed: 2:34.1s Tests finished: Pass 31. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` Differential Revision: D80174674 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160682 Approved by: https://github.com/ankitageorge	2025-09-04 01:09:53 +00:00
Edward Yang	90b08643c3	Always build USE_DISTRIBUTED. (#160449 ) Signed-off-by: Edward Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/160449 Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/dcci	2025-09-03 07:33:55 +00:00
FFFrog	d789451ff6	[OpenReg] Migrate Accelerator Document from source/notes into source/accelerator (#161845 ) As the tile stated. As the document grows, the content will become more and more, so in order to make it easier for users to read and easier for developers to maintain, we have split this file into several separate files and placed them in a dedicated directory called "accelerator". Pull Request resolved: https://github.com/pytorch/pytorch/pull/161845 Approved by: https://github.com/albanD	2025-09-03 03:12:18 +00:00
PyTorch MergeBot	4e42aa8ffc	Revert "Always build USE_DISTRIBUTED. (#160449 )" This reverts commit `b7034e9c92`. Reverted https://github.com/pytorch/pytorch/pull/160449 on behalf of https://github.com/jeanschmidt due to Breaking internal builds, can't be landed with forward fix due to internal tooling problems ([comment](https://github.com/pytorch/pytorch/pull/160449#issuecomment-3246689684))	2025-09-02 20:28:42 +00:00
Justin Chu	524b78d4f6	[ONNX] Refactor torchscript based exporter (#161323 ) Refactor torchscript based exporter logic to move them to a single (private) location for better code management. Original public module and method apis are preserved. - Updated module paths in `torch/csrc/autograd/python_function.cpp` accordingly - Removed `check_onnx_broadcast` from `torch/autograd/_functions/utils.py` because it is private&unused @albanD / @soulitzer could you review changes in `torch/csrc/autograd/python_function.cpp` and `torch/autograd/_functions/utils.py`? Thanks! ## BC Breaking - Deprecated members in `torch.onnx.verification` are removed Differential Revision: [D81236421](https://our.internmc.facebook.com/intern/diff/D81236421) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161323 Approved by: https://github.com/titaiwangms, https://github.com/angelayi	2025-09-02 16:10:30 +00:00
Dev Sashidhar	d5e0f4202b	Fixes broken memory_viz link in CUDA memory docs (#161426 ) Fixes #161375 The "Using the visualizer" section in torch_cuda_memory.md had a link to https://pytorch.org/memory_viz written in inline Markdown link form. Strangely the same syntax worked earlier on the page as the issuer mentioned, but in this spot it's rendered sa a broken link. I wasn't able to pinpoint why the second occurrence was treated differently, but switching it to the Markdown autolink form fixes the problem consistently. I tested this by rebuilding the docs locally with make html and serving the HTML with a local http.server. With the autolink, the link resolves correctly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161426 Approved by: https://github.com/soulitzer	2025-09-02 02:06:54 +00:00
Edward Yang	b7034e9c92	Always build USE_DISTRIBUTED. (#160449 ) Signed-off-by: Edward Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/160449 Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/dcci	2025-09-01 23:00:21 +00:00
PyTorch MergeBot	63a9c23fe9	Revert "[CUDA] Reuse blocks with record_stream during CUDA Graph capture in the CUDACachingAllocator (#158352 )" This reverts commit `190c391a28`. Reverted https://github.com/pytorch/pytorch/pull/158352 on behalf of https://github.com/atalman due to Broke cuda 13.0 nightly builds https://github.com/pytorch/pytorch/actions/runs/17382188549/job/49341981474 ([comment](https://github.com/pytorch/pytorch/pull/158352#issuecomment-3242871629))	2025-09-01 16:27:03 +00:00
Frank Lin	190c391a28	[CUDA] Reuse blocks with record_stream during CUDA Graph capture in the CUDACachingAllocator (#158352 ) ## Introduction During CUDA Graph capture, the CUDA caching allocator currently defers reclaiming blocks until capture ends. This is because CUDA forbids querying events recorded during capture (the CUDA operation is not executed during the capture stage), so the allocator cannot use its normal event-based logic. However, capture records an DAG (we call it capturing graph) of work. We can use the capturing graph to determine when a block’s old lifetime is fully before future work, and safely reuse it within the same capture. This PR adds an experimental flag `graph_capture_record_stream_reuse: True\|False (default: False)`. When enabled, the allocator inserts lightweight free markers and uses capture ordering to decide if a freed block is safe to reuse during capture. If the proof cannot be established, we fall back to the existing post-capture path. ## Terms * Free marker: A capture-legal no-op (created with `cudaGraphAddEmptyNode`) inserted after the last captured use of the block on each stream that used it. * Terminal: The set of the lastest operations of the stream (or the capturing graph). Any newly captured op on that stream will attach after all nodes in this set. For a stream currently capturing, it is the set of nodes returned in `dependencies_out` by `cudaStreamGetCaptureInfo`. ## When can we reuse a block during capture? ### Strong Rule (Graph-Wide Safety) This rule provides a universal guarantee that a block is safe for reuse by any stream in the graph. > A block is safe to reuse if every free marker is a predecessor of every terminal of all active streams in the graph. Why it's safe: This rule establishes a strict global ordering. Since any new operation on any stream must be appended after that stream's terminals, this condition guarantees that the block's new lifetime begins only after its old lifetime has completely ended everywhere. This prevents lifetime overlaps when the graph is replayed, ensuring correctness. ### Per-stream Rule (A Practical Optimization) The strong rule, while safe, is often unnecessarily restrictive. The `DeviceCachingAllocator` introduces a crucial constraint that allows for a simpler check. In `DeviceCachingAllocator`, `get_free_block` only returns blocks whose `block->stream == p.stream()`. In other words, we never reuse a block on a stream different from the allocation stream. This means we don't need to verify safety across the entire graph. We only need to confirm that the block is safe to reuse from the perspective of its own allocation stream. > Reuse a block for allocations on stream S if every free marker is a predecessor of every node in the terminal set of S. In short, a block is considered reusable on stream S as long as all marker marking it "free" are guaranteed to complete before any new work that might need it on stream S begins. ## Implementation * On `free(block)` during capture * For each stream in `block->stream_uses` and the allocation stream, insert a free marker (empty node) and make it that stream’s tail. * If we cannot place markers for all such streams (for example, a stream is not in capture), defer to the post-capture path. * Otherwise, store the marker handles and keep the block in the capture-private structures. * On `allocate(stream)` during capture (attempt per-stream reclaim) * Query the allocation stream S’s terminal via `cudaStreamGetCaptureInfo`. * For each deferred block, check whether it is allocated on this stream, and each of its free markers is a predecessor of the terminal. * If yes, hand the block to S for immediate reuse within the same capture. * If no, keep it deferred; it will be reconsidered as capture progresses and S’s terminal advances. * On capture end * Any still-deferred blocks follow the existing post-capture reclamation (event insertion/polling). External behavior remains unchanged if we cannot prove safety during capture. ## Examples (2 streams) <img width="641" height="801" alt="pytorch-remove-cudagraph-defer-reclaiming (6)" src="https://github.com/user-attachments/assets/41adc835-d448-483b-99ba-b4341cb7d2a2" /> * Case 0 — Unsafe The two frees are not ordered with respect to each other. For stream 1, the other stream’s free marker does not precede this stream’s terminal, so the per-stream condition fails. Counterexample intuition for the unsafe setups: imagine `f2(x)` runs for a long time. If DeviceCachingAllocator reused block `x` on a stream whose terminal is not ordered after the free markers, the new lifetime could overlap the old one on replay, risking use-after-free or data corruption. The per-stream rule prevents exactly this. * Case 1 — Reusable on stream 1 Stream 1’s terminal is after both frees, so every free marker precedes stream 1’s terminal. The block is reusable for allocations on stream 1. * Case 2 — Not reusable on stream 2, but this cannot occur in `DeviceCachingAllocator` This depicts reusing the block on stream 2 while stream 1’s free is not yet ordered before stream 2’s terminal. Though the block is not safe to reuse on stream 2, DeviceCachingAllocator will not choose that block for stream 2 anyway: `get_free_block` rejects blocks whose `stream != p.stream()`. So this case is unreachable. * Case 3 — Safe (strong rule holds) In this scenario, the terminal nodes of all streams are positioned after the block's free markers, satisfying the strong rule. This guarantees the block is safe for reuse by any stream in the capturing graph. However, since `DeviceCachingAllocator ` only reuses a block on its original allocation stream, verifying this strong condition is unnecessary. We only need to ensure the per-stream rule is met for the specific stream requesting the block. * Case 4 — Freeing after a join See the note below. ## Edge Case: Freeing after a join Our current dependency tracking has a limitation in scenarios where a block is freed after a stream join, see @galv's [comments here](https://github.com/pytorch/pytorch/pull/158352#pullrequestreview-3112565198)). In the case 4, we have a missed opportunity. Because the block's usage is not explicitly marked, we cannot determine that the block's actual last use may have occurred much earlier, long before the join. Then, we must wait for the subsequent join before the block can be reused. ## Thanks Thanks to @galv for his great idea around graph parsing and empty nodes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158352 Approved by: https://github.com/ngimel Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-09-01 09:25:01 +00:00
Zheng, Zhaoqiong	6737e2c996	update supported OS for Intel client GPU (#161699 ) update supported OS for Intel client GPU Pull Request resolved: https://github.com/pytorch/pytorch/pull/161699 Approved by: https://github.com/chuanqi129, https://github.com/malfet	2025-09-01 05:45:09 +00:00
Paul de Supinski	768a1017c5	Allow parallel start NUMA binding (#161576 ) # Context In #161183, we added NUMA-binding support for `Callable` entrypoints to `elastic_launch`. However, we would raise an exception if the subprocesses would be spawned in parallel via `ThreadPoolExecutor`, which is an option configurable via the `TORCH_MP_PARALLEL_START` environment variable (see diff). The logic here was that `os.sched_setaffinity`, which we used to set CPU affinities, is [per process](https://docs.python.org/3/library/os.html#os.sched_setaffinity), so there could be a race condition during a parallel start: > Restrict the process with PID pid (or the current process if zero) to a set of CPUs. mask is an iterable of integers representing the set of CPUs to which the process should be restricted. But on further reading, the Linux docs say [`sched_setaffinity` is per thread.](https://man7.org/linux/man-pages/man2/sched_setaffinity.2.html) As it turns out, the Python doc is a misnomer. I [verified that `sched_setaffinity` only affects the calling thread, not the entire calling process.](https://gist.github.com/pdesupinski/7e2de3cbe5bb48d489f257b83ccddf07) The upshot is that we actually can safely use the inheritance trick from #161183 even with parallel start, since the setting will be inherited from the calling thread, and `os.sched_setaffinity` only affects the calling thread. # This PR Remove restrictions against parallel start for NUMA binding. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161576 Approved by: https://github.com/d4l3k	2025-08-28 01:15:58 +00:00
FFFrog	d2db6c86b0	[OpenReg] Add Develop Notes for Integrating New Backend into PyTorch (#158644 ) To facilitate the integration of the new backend, we plan to publish a new development note that details all the key components,hoping to speed up the development of other accelerators. This PR is the beginning of this note, and involve the part of registration of operators and we will gradually improve it and keep in sync with OpenReg's code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158644 Approved by: https://github.com/albanD	2025-08-27 14:47:25 +00:00
PyTorch MergeBot	9f6e1b8730	Revert "[ROCm] SDPA fix mem fault when dropout is enabled (#154864 )" This reverts commit `3caddd4daa`. Reverted https://github.com/pytorch/pytorch/pull/154864 on behalf of https://github.com/atalman due to reverted internally ([comment](https://github.com/pytorch/pytorch/pull/154864#issuecomment-3225554119))	2025-08-26 20:03:59 +00:00
Will Constable	e3d68dfae2	[DTensor] Make default RNG semantics match user-passed generator (#160482 ) Previously, DTensor kept its own copy of the generator state after the first time a random operator was called on a DTensor. This copy would evolve independently from the generator outside of DTensor. After adding support for users to pass a specific generator into random operators (e.g. `uniform_(..., generator=)`), it was determined (in discussion on #159991) to change the semantics so that any random operations performed on DTensor would evolve the state of the publicly visible generators (either the default one or user-passed one). The upsides are (1) it is now possible to call torch.manual_seed() at any point in the program and have a consistent effect on DTensor, (2) DTensor ops have an observable effect on the generator. The downside is that users are now responsible for seeding their generator before using DTensor, ensuring all ranks use the same seed. Fixes #159991 confirmed docs rendered OK <img width="897" height="414" alt="image" src="https://github.com/user-attachments/assets/c082f0f0-5447-47aa-834f-65342eb237cd" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/160482 Approved by: https://github.com/wanchaol	2025-08-25 04:21:19 +00:00
Chuanhao Zhuge	74280d0913	[muon] Introduce Muon optimizer to PyTorch (#160213 ) A single-device version of Muon. Algorithm refers Keller Jordan's [Muon blogpost](https://kellerjordan.github.io/posts/muon/), and optionally incorporates [Moonshot's](https://github.com/MoonshotAI/Moonlight/blob/master/Moonlight.pdf) learning rate adjustment strategy. This implementation maintains a minimalist API and is consistent with other optimizer conventions. PyTorch team prefers to handle parameter filtering at a higher level, with the Muon optimizer performing only the msign computation for orthogonalization on all parameters it receives. Users are responsible for grouping parameters for different optimizers as needed. An example usage is shown below, and a more detailed example will be added to the [PyTorch examples](https://github.com/pytorch/examples) directory. Usage ```python model = MyModelForCausalLM # filter out your params manually muon_params = [...] adamw_params = [...] muon = Muon( params = muon_params lr=lr, wd=wd, ) adamw = AdamW( params = adamw_params lr=lr, wd=wd, ) # in training loop loss = model(input) loss.backward() muon.step() adamw.step() muon.zero_grad() adamw.zero_grad() ``` ~~Additional usage~~ ~~Users are also able to pass in self-defined `msign` function for orthogonalization, and learning rate adjustment function. Interface defined below:~~ ```python ~~AdjustLrFn: TypeAlias = Callable[[float, torch.Size], float]~~ ~~MsignFn: TypeAlias = Callable[[Tensor, BaseMsignFnConfig], Tensor]~~ ``` As discussed with team and in comment, we prefer to make the interface simpler and cleaner, thus we removed the callback interface, and canonicalize the original NS algorithm for Muon. The only configs available to users are `ns_steps`, `coefficients`, and `eps`, configurable through kwargs. By default, we use 5-step Newton-Schulz, with coefficients proposed by [Keller](https://kellerjordan.github.io/posts/muon/). We use LR adjustment proposed by [Moonshot](https://github.com/MoonshotAI/Moonlight/blob/master/Moonlight.pdf), which grafts learning rate from AdamW. Testing ~~1. Unit tests: the newly introduced Muon is covered in `test/test_optim.py`. We updated the test cases to pass named parameters to the optimizer under test. Additionally, we introduced a new test case to verify that when the user provides an empty FQN list, Muon correctly falls back to AdamW behavior.~~ As discussed, in order not to complicate the codebase, we prefer not to include reference implementation into PyTorch. We also updated the interface so we don't need to test the FQN based filtering. Muon is covered by the existing `test_optim.py` unit test. 2. End-to-end test: we added a training script that pre-trains a QWEN-like model on `openwebtext-100k` dataset. We trained for one epoch and the resulting loss curve is compared against the Moonshot implementation to confirm behavioral consistency. <img width="1102" height="472" alt="Screenshot 2025-07-29 at 1 04 12 AM" src="https://github.com/user-attachments/assets/ceab0733-497d-4070-8032-02ae7995c64c" /> Numerics We evaluate our implementation with existing implementation to confirm numerical consistency. As discussed, our implementation closely follows the algorithm described in [Keller's post](https://kellerjordan.github.io/posts/muon/), while incorporating the learning rate adjustment from [Moonlight](https://github.com/MoonshotAI/Moonlight/blob/master/Moonlight.pdf). This captures a key insight that allows users to reuse hyper-parameters tuned for `adamW`, making Muon a drop-in swap. As expected, the numerics difference mainly comes from `adjust_lr`, a max of ~5% relative diff in an example unit test setup below. ```python # dummy model and data model0 = Linear(10, 10, bias=False) model1 = copy.deepcopy(model0) inputs = torch.randn(8, 10) targets = torch.randn(8, 10) loss = MSELoss() lr = 1e-3 wd = 0.1 momentum = 0.95 opt_ref_muon = KellySingleDeviceMuon( params=model0.parameters(), lr=lr, weight_decay=wd, momentum=momentum, ) opt_exp_muon = Muon( params=model1.parameters(), lr=lr, weight_decay=wd, momentum=momentum, ) out_ref = model0(inputs) loss_ref = loss(out_ref, targets) opt_ref_muon.zero_grad() loss_ref.backward() opt_ref_muon.step() out_exp = model1(inputs) loss_exp = loss(out_exp, targets) opt_exp_muon.zero_grad() loss_exp.backward() opt_exp_muon.step() for p_ref, p_exp in zip(model0.parameters(), model1.parameters()): torch.testing.assert_close(p_ref, p_exp) ``` As explained above, including this `adjust_lr` is preferable. This is validated by an e2e training runs on training a qwen-2-like 0.5b model, where the curves show that training with `adjust_lr` converges more effectively than without. <img width="1179" height="464" alt="Screenshot 2025-08-18 at 10 12 33 AM" src="https://github.com/user-attachments/assets/e797d3da-c2f0-4187-b99e-5d48b7437c3c" /> Performance Training for one epoch of openwebtext-100k on eight H100 GPUs with DDP: - adamw_ddp finishes in 13.12 min - pytorch_muon_ddp finishes in 13.45 min Muon runs ~20s slower compared to AdamW. Assuming no other changes, Muon is 2.5% slower than AdamW. AdamW: Optimizer.step() takes ~13.5 ms, step time ~930 ms <img width="726" height="590" alt="Screenshot 2025-07-29 at 1 56 14 AM" src="https://github.com/user-attachments/assets/ebcd7e1c-d129-4b20-9396-39f568edf03d" /> Muon: Optimizer.step() takes ~54 ms, step time ~960 ms <img width="751" height="597" alt="Screenshot 2025-07-29 at 2 02 20 AM" src="https://github.com/user-attachments/assets/72f5b904-ebd5-4502-a6ff-d3e9e5a6da81" /> Note We restrict the implementation to accept only 2D parameters. An alternative approach is to allow parameters with more than two dimensions and apply orthogonalization over the last two dimensions. We opt not to go with this approach as it can be error-prone. For example, with a kernel shaped `[in_channel, height, width, out_channel]`, applying orthogonalization to the last two dimensions is not meaningful. Since Muon is designed to operate orthogonalization on 2D matrices, preserving this assumption keeps the implementation clean and sound. Next Steps 1. Add `MuP` 2. Open-source optimized triton kernel for symmetric matmul. A preliminary benchmark found 1.23x - 1.48x speedup on small - large (n = 256 -> 16384) matrices. 3. Open-source unsharded Muon co-designed with FSDP2. **** Pull Request resolved: https://github.com/pytorch/pytorch/pull/160213 Approved by: https://github.com/janeyx99	2025-08-24 08:03:04 +00:00
Paul de Supinski	33346b5814	Support NUMA Binding for Callable Entrypoints, Take 2 (#161183 ) # Context In #160163, we added support for NUMA binding for `Callable` entrypoints to `elastic_launch`. This requires special consideration, because they go through a different path to spawn subprocesses compared to `str` entrypoints, a path which does not provide a straightforward way to utilize `numactl` CLI. See #160006 for a full description of the challenges. Although #160163 worked in initial local experiments, we ran into some linker errors in other environments when we tried to call `numactl`. This appeared to be due to interactions with how the `LD_PRELOAD` environment variable was being set. # This PR On further thought, the most straightforward, foolproof solution here is to use [the trick that @d4l3k suggested.](https://github.com/pytorch/pytorch/issues/160006#issuecomment-3162018836) Specifically, for each local rank `i`: 1. The parent process sets its own CPU affinity to what local rank `i`'s should be. 2. Then, the parent spawns the subprocess for local rank `i`. 3. Finally, the parent resets its own CPU affinity to what it was originally. There were other solutions that would work just for `Callable` entrypoints, but I believe this is the simplest one that can work for both `str` and `Callable`, and it's pretty simple. This required a bit of refactoring: 1. Turn all the `_get_.*_numactl_options` into functions which return a set of logical CPUs to bind to, rather than options like `--cpunodebind=0`. 2. Instead of wrapping commands with `numactl`, use `os.sched_setaffinity` to bind to the CPUs from (1.). 3. Put this all inside a context manager which encapsulates applying and restoring the bindings in the parent process. 4. Use the context manager for both `str` and `Callable` paths # Test Plan ## Automated `$ pytest test/test_numa_binding.py` ## Manual See [doc.](https://docs.google.com/document/d/1vxD-OKYBTT27jbBwtW9iz9g0tNM0u-i0tiTJg_ieQA8/edit?tab=t.0) Meta only, but TLDR tried out every combination of `str`, `Callable`, binding disabled, and binding enabled on the same model and saw 2x SM utilization for binding enabled. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161183 Approved by: https://github.com/d4l3k	2025-08-23 07:23:22 +00:00
Justin Chu	419a2dbf5f	[ONNX] Remove enable_fake_mode and exporter_legacy (#161222 ) Remove enable_fake_mode and exporter_legacy entirely. Even though this is bc breaking, `enable_fake_mode` is no longer compatible with the latest version of transformers, and so it is no longer useful. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161222 Approved by: https://github.com/titaiwangms	2025-08-22 22:15:27 +00:00
PyTorch MergeBot	c7a77470c5	Revert "[DTensor] Make default RNG semantics match user-passed generator (#160482 )" This reverts commit `d1faf2ef04`. Reverted https://github.com/pytorch/pytorch/pull/160482 on behalf of https://github.com/jeffdaily due to failing cuda and rocm jobs ([comment](https://github.com/pytorch/pytorch/pull/160482#issuecomment-3214694297))	2025-08-22 15:04:28 +00:00
Will Constable	d1faf2ef04	[DTensor] Make default RNG semantics match user-passed generator (#160482 ) Previously, DTensor kept its own copy of the generator state after the first time a random operator was called on a DTensor. This copy would evolve independently from the generator outside of DTensor. After adding support for users to pass a specific generator into random operators (e.g. `uniform_(..., generator=)`), it was determined (in discussion on #159991) to change the semantics so that any random operations performed on DTensor would evolve the state of the publicly visible generators (either the default one or user-passed one). The upsides are (1) it is now possible to call torch.manual_seed() at any point in the program and have a consistent effect on DTensor, (2) DTensor ops have an observable effect on the generator. The downside is that users are now responsible for seeding their generator before using DTensor, ensuring all ranks use the same seed. Fixes #159991 confirmed docs rendered OK <img width="897" height="414" alt="image" src="https://github.com/user-attachments/assets/c082f0f0-5447-47aa-834f-65342eb237cd" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/160482 Approved by: https://github.com/wanchaol	2025-08-21 22:02:16 +00:00
Andy Lugo	3caddd4daa	[ROCm] SDPA fix mem fault when dropout is enabled (#154864 ) Fixes issue that exhibited a device side memory access fault due to incorrect tensor life management Pull Request resolved: https://github.com/pytorch/pytorch/pull/154864 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-08-21 14:23:13 +00:00
Jane Xu	8f766d6839	Add ScalarType -> shim conversion, add stable::Tensor.scalar_type (#160557 ) TL;DR: Moving to ScalarType in user extensions and removing deprecated dtypes. This change _modifies_ the from/to behavior between ScalarType and StableValue! Whereas before, user extensions could only in abstract pass around obfuscated dtypes appearing as int32_ts, now, users can confidently use torch::headeronly::ScalarType in their extensions for major scalar types. This PR enables ABI stability by adding a translation layer through the shim, so that even if the ScalarType enum values change in the future, user extensions need not fear. Then we add a Tensor scalar_type API which reuses the from/to logic to return to the user a nice ScalarType (vs an abstracted int32_t). I then changed the test to test the scalar_type API. This code change required some refactoring because of circular dependencies. ## BC Breaking note This commit is (narrowly) BC-breaking for unpopular dtypes: `quint`s, `qint`s, `Bits`, `dummy_uint`s, `dummy_int*`s, `Float8_e8m0fnu`, and `Float4_e2m1fn_x2` in the narrow use case where an extension retrieves a Tensor dtype of the above and passes it into `aoti_torch_call_dispatcher`. As of now, I believe there are 0 users of this use case, so the benefits of this change significantly justify BC-breaking this API. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160557 Approved by: https://github.com/mikaylagawarecki, https://github.com/malfet	2025-08-19 22:13:47 +00:00
PyTorch MergeBot	eba20d2d74	Revert "[WIP] Merge Test (#160998 )" This reverts commit `ef761c4353`. Reverted https://github.com/pytorch/pytorch/pull/160998 on behalf of https://github.com/ZainRizvi due to Undoing test merge ([comment](https://github.com/pytorch/pytorch/pull/160998#issuecomment-3202125839))	2025-08-19 20:30:39 +00:00
John Stawinski	ef761c4353	[WIP] Merge Test (#160998 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160998 Approved by: https://github.com/ZainRizvi	2025-08-19 20:26:07 +00:00
FFFrog	284b719005	Remove the uncessary empty file (#160728 ) As the title stated. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160728 Approved by: https://github.com/Skylion007	2025-08-19 10:54:08 +00:00
henrylhtsang	98373e5ad2	[doc] AOTI debugging guide (#160430 ) Folded from https://discuss.pytorch.org/t/a-beginners-guide-to-debugging-aot-inductor-cuda-illegal-memory-access/222188 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160430 Approved by: https://github.com/angelayi	2025-08-14 23:42:17 +00:00
Howard Huang	198b5fd2d4	[PP] Add DualPipeV schedule (#159591 ) Added the DualPipeV schedule according to http://github.com/deepseek-ai/DualPipe/blob/main/dualpipe/dualpipev.py#L11 <img width="3633" height="486" alt="image" src="https://github.com/user-attachments/assets/4e843bb9-87cd-4d11-936c-7dfe8ee12f16" /> This schedule doesn't perform the actual "overlap" during execution, but provides the scaffolding and schedule definition we need to run it E2E in torchtitan. Supporting the overlapped operation will be worked on in following PRs. Tests: ```sh python test/distributed/pipelining/test_schedule_multiproc.py -k test_v_shape_schedules python test/distributed/pipelining/test_schedule.py -k test_pipeline_order_for_v_schedules ``` Also tested in TorchTitan and is running. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159591 Approved by: https://github.com/wconstab	2025-08-14 14:58:35 +00:00
RajeshvShiyal	5ace061254	finfo eps doc fix (#160502 ) Existing documentation for torch.finfo().eps is as below: \| eps \| float \| The smallest representable number such that ``1.0 + eps != 1.0``. \| Proposed documentation for torch.finfo().eps is as below: \| eps \| float \| The difference between 1.0 and the next smallest representable float larger than 1.0. \| Fixes #160397 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160502 Approved by: https://github.com/ngimel	2025-08-14 01:49:35 +00:00
Mikayla Gawarecki	1196bb1c2e	Add utility to get computed kernel in torch.library (#158393 ) Adds `OperatorEntry::getComputedKernelForDispatchKey` which returns the KernelFunction corresponding to `OperatorEntry.dispatchTable_[dispatch_ix]` for a given dispatch key - Specifically it returns a `SafeKernelFunction` that holds a `KernelToken`. This `KernelToken` is registered to the `KernelFunction` in `OperatorEntry.kernels_` and will be invalidated when the `KernelFunction` is destructed (i.e. when the `AnnotatedKernel` that holds this `KernelFunction` is removed from `kernels_`, which happens when the corresponding impl is deregistered). - `SafeKernelFunction` can be called via `callBoxed`, the validity of the token will be checked before this happens - `SafeKernelFunction` is pybinded and `getComputedKernelForDispatchKey` is exposed to the frontend ia `torch.library.get_kernel` Related to https://github.com/pytorch/pytorch/issues/155330 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158393 Approved by: https://github.com/albanD	2025-08-13 21:00:59 +00:00
Svetlana Karslioglu	114a6c4043	Add placeholder for the User Guide (#159379 ) - Add pytorch_overview.md - Add pytorch_main_components.md - Reorganize top nav to have Get Started, User Guide, Reference API, Community, Tutorials - Move notes under user guide Pull Request resolved: https://github.com/pytorch/pytorch/pull/159379 Approved by: https://github.com/albanD Co-authored-by: sekyondaMeta <127536312+sekyondaMeta@users.noreply.github.com> Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-08-13 14:56:04 +00:00
Paul de Supinski	7e91394955	Support NUMA Binding for Callable Entrypoints (#160163 ) # Context This is an extension of #149334. # This PR Add support for NUMA bindings with Callable entrypoints, such as `do_train` instead of `/usr/local/bin/python`. Most notably, we utilize a hack in order to force `Process.start()` to use custom NUMA bindings for each subprocess. Please search for `HACK:` in the code to see a description of the implementation we chose, and #160006 for discussion of alternatives and why this is necessary. Other changes: * Remove unnecessary `--preferred` option from all binding strategies. By default, Linux already allocates memory to the NUMA node local to the CPU which triggered the allocation. (See [MPOL_LOCAL](https://man7.org/linux/man-pages/man2/set_mempolicy.2.html).) * Refactor so that the main API is `maybe_wrap_command_with_numa_bindings`, which computes bindings for a single rank at a time, rather than `maybe_wrap_with_numa_bindings` which computed bindings for all ranks at once. This allowed for more code sharing between `Callable` and `str` entrypoints. # Test Plan ## Automated `$ pytest test/test_numa_binding.py` ## Manual Using [this benchmark,](https://gist.github.com/pdesupinski/bbe01ade455d86e989794f2c612e2d91), ran ``` $ PYTHONUNBUFFERED=1 LOGLEVEL=INFO perf stat -e ls_dmnd_fills_from_sys.dram_io_far,ls_dmnd_fills_from_sys.dram_io_near -- python -m torch.distributed.run --standalone --nproc-per-node=8 --numa-binding=node --run-path mlp_train.py 2>&1 \| tee node_callable.txt && PYTHONUNBUFFERED=1 LOGLEVEL=INFO perf stat -e ls_dmnd_fills_from_sys.dram_io_far,ls_dmnd_fills_from_sys.dram_io_near -- python -u -m torch.distributed.run --standalone --nproc-per-node=8 --run-path mlp_train.py 2>&1 \| tee none_callable.txt ``` and observed * 6.6% remote memory accesses with 'node' bindings * 11.6% remote without bindings I also ran similar with `str` entrypoints as before just to be sure it's still working. NOTE: [--run-path triggers the code to be run inside a `Callable`.](`017259f9c6/torch/distributed/run.py (L870)`) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160163 Approved by: https://github.com/d4l3k	2025-08-12 20:08:49 +00:00
morrison-turnansky	b9003ed3d8	Dynamo Deep Dive Documentation Fix (#158860 ) changed SourceBuilder to VariableBuilder Fixes #158447 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158860 Approved by: https://github.com/mlazos	2025-08-12 08:53:33 +00:00
Jane Xu	9b803cdbe2	[BE] Remove more optim entries from docs coverage ignore list (#160194 ) This PR does privatize ReduceLRSchedulerOnPlateau.is_better -> ReduceLRSchedulerOnPlateau._is_better because that API was never meant to be public. A GitHub search for it also reveals that the API is not commonly used much. https://github.com/search?q=.is_better%28&type=code&p=2 If you do use this API and you rely on it for some reason, please file an issue. In the meantime, you can access it through `_is_better(...)`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160194 Approved by: https://github.com/albanD, https://github.com/Skylion007	2025-08-09 00:09:45 +00:00
Syed Tousif Ahmed	2247aa6d1d	Documents tuning NVLink performance on H100/H200 (#159792 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159792 Approved by: https://github.com/ngimel	2025-08-08 20:28:24 +00:00
Andres Lugo	5f5f508aa8	[ROCm] Ck backend UX refactor (#152951 ) Refactors how the enablement/disablement of CK Gemms and SDPA works. - Adds USE_ROCM_CK_GEMM compile flag for enabling CK gemms. - USE_ROCM_CK_GEMM is set to True by default on Linux - Updates USE_CK_FLASH_ATTENTION to USE_ROCM_CK_SDPA. - USE_ROCM_CK_SDPA is set to False by default - (USE_CK_FLASH_ATTENTION still works for now, but will be deprecated in a future release) - Prevents these CK libraries from being used unless pytorch has been built specifically with the functionality AND is running on a system architecture that supports it. - the getters for these library backends will also do some validity checking in case the user used an environment variable to change the backend. If invalid, (i.e. one of the cases mentioned above is false) the backend will be set as the current non-CK default Pull Request resolved: https://github.com/pytorch/pytorch/pull/152951 Approved by: https://github.com/eqy, https://github.com/jeffdaily, https://github.com/m-gallus Co-authored-by: Jeff Daily <jeff.daily@amd.com> Co-authored-by: Jithun Nair <jithun.nair@amd.com> Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>	2025-08-08 18:40:17 +00:00
Yu, Guangye	84f7e88aef	Add unified memory APIs for torch.accelerator (#152932 ) # Motivation The following API will be put under torch.accelerator - empty_cache - max_memory_allocated - max_memory_reserved - memory_allocated - memory_reserved - memory_stats - reset_accumulated_memory_stats - reset_peak_memory_stats Pull Request resolved: https://github.com/pytorch/pytorch/pull/152932 Approved by: https://github.com/albanD ghstack dependencies: #138222	2025-08-08 17:41:22 +00:00
PyTorch MergeBot	74da2604c9	Revert "Add unified memory APIs for torch.accelerator (#152932 )" This reverts commit `15f1173e5d`. Reverted https://github.com/pytorch/pytorch/pull/152932 on behalf of https://github.com/jithunnair-amd due to Broke ROCm periodic runs on MI300 e.g. https://github.com/pytorch/pytorch/actions/runs/16764977800/job/47470050573 ([comment](https://github.com/pytorch/pytorch/pull/138222#issuecomment-3164941815))	2025-08-07 16:34:36 +00:00
Yu, Guangye	15f1173e5d	Add unified memory APIs for torch.accelerator (#152932 ) # Motivation The following API will be put under torch.accelerator - empty_cache - max_memory_allocated - max_memory_reserved - memory_allocated - memory_reserved - memory_stats - reset_accumulated_memory_stats - reset_peak_memory_stats Pull Request resolved: https://github.com/pytorch/pytorch/pull/152932 Approved by: https://github.com/albanD ghstack dependencies: #138222	2025-08-06 02:22:18 +00:00
Zheng, Zhaoqiong	0ba09a6d34	fix link for tutorial of inductor on windows (#159853 ) fix link issue from https://docs.pytorch.org/tutorials/prototype/inductor_windows.html to https://docs.pytorch.org/tutorials/unstable/inductor_windows.html due to structure change with pr https://github.com/pytorch/tutorials/pull/3489 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159853 Approved by: https://github.com/sekyondaMeta Co-authored-by: sekyondaMeta <127536312+sekyondaMeta@users.noreply.github.com> Co-authored-by: Zesheng Zong <zesheng.zong@outlook.com>	2025-08-05 18:37:47 +00:00
Oguz Ulgen	a29ed5e1ac	Add torch compile force disable caches alias (#158072 ) Bunch of people keep thinking current alias only disables inductor cache because it has the name inductor in it. lets globalize the name Pull Request resolved: https://github.com/pytorch/pytorch/pull/158072 Approved by: https://github.com/ezyang	2025-08-02 23:23:17 +00:00
Svetlana Karslioglu	e4e2701429	Add the RunLLM widget to the website (#152055 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/152055 Approved by: https://github.com/albanD	2025-07-31 20:53:53 +00:00
Neil Tenenholtz	1ebcba4e1b	Fix typo in link to torch memory_viz tool (#159214 ) Fixes a small typo in the torch_cuda_memory docs Pull Request resolved: https://github.com/pytorch/pytorch/pull/159214 Approved by: https://github.com/yewentao256, https://github.com/HDCharles, https://github.com/Skylion007	2025-07-31 18:50:54 +00:00
Boyuan Feng	435edbcb5d	[Graph Partition] add graph partition doc (#159450 ) This pr adds doc for graph partition. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159450 Approved by: https://github.com/eellison	2025-07-30 17:01:10 +00:00
Svetlana Karslioglu	d214901133	Add a title to distributed._dist2.md (#159385 ) Sphinx likes titles and complains about them when they are not there. So adding a title to address this Wartning in the build: ``` WARNING: toctree contains reference to document 'distributed._dist2' that doesn't have a title: no link will be generated ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/159385 Approved by: https://github.com/d4l3k	2025-07-30 04:09:41 +00:00
PaliC	b57d1ef110	[BE] Remove __reduce_deploy__ (#158291 ) This PR removes the integration point torch.fx had with torch::deploy (and another minor change). Note: This PR has some broken mypy errors, but I believe those should have been in the code base beforehand, and should be fixed in a separate PR Pull Request resolved: https://github.com/pytorch/pytorch/pull/158291 Approved by: https://github.com/albanD ghstack dependencies: #158290	2025-07-30 01:36:03 +00:00
PaliC	dd7c996d5c	[BE] Remove torch deploy \| remove torch deploy specific files (#158290 ) This PR removes specific files found in pytorch which are only used for torch::deploy. This is mostly testing code and a debugger. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158290 Approved by: https://github.com/albanD	2025-07-30 01:36:03 +00:00
William Wen	df58db8831	[dynamo, docs] add recompilation, observability, reporting issues docs (#159062 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159062 Approved by: https://github.com/svekars, https://github.com/zou3519, https://github.com/anijain2305	2025-07-29 23:23:51 +00:00
anwang	c55e72bea1	[Re-land][Inductor] Support native Inductor as backend for MTIA (#159211 ) The previous [diff/PR] (https://github.com/pytorch/pytorch/pull/158526) was reverted due to this docstring lint error: <img width="1736" height="722" alt="image" src="https://github.com/user-attachments/assets/216b1720-4002-48da-b5f3-32b5d48aaa54" /> I didn't add the docstring cause I thought I'm not supposed to add docstring for an EXISTING function. So this diff/PR is an exactly copy of the previous one, except for adding the docstring. ------------- This diff/PR includes the changes to support native Inductor integration for MTIA. The goal is to support `torch.compile(backend="inductor")` for MTIA. Inductor should generate code(triton kernel + python wrapper code) similar to CUDA. And the triton kernels can be launched eagerly. The changes include: - Add MTIA device interfaces used by Dynamo and Inductor, including APIs on device, stream, event, etc. - Add required torch.mtia APIs, like is_bf16_supported, memory_allocated, set_stream_by_id, etc. - MTIA specific codegen logic, for example, loading MTIA dynamic_library. - Other necessary changes to integrate with Inductor codegn, following other devices like CUDA, XPU. - Integrate with the [empty_strided_mtia](https://www.internalfb.com/code/fbsource/[0d017d3a4a1bdff7253f9c66a9f38e77bd62166b]/fbcode/caffe2/aten/src/ATen/native/mtia/EmptyTensor.cpp?lines=49%2C63%2C71%2C74%2C78) API that we’ve added for the new MTIA ATen backend. - A change in Inductor runtime to avoid re-initialize MTIADriver. - BUCK changes to include ATen-mtia in Inductor, and to use -USE_MTIA preprocessor flag. - Update `test_mnist_e2e.py` to cover native Inductor as backend, using the `--use_native_inductor` flag. - Add a personal script(`scripts/anwang/run_native_inductor_script.py`) for testing purpose. Note: - This approach(option 3) aims to provide a pytorch native approach of Inductor integration for MTIA, minimizing the onboarding overhead. The downside of this approach is that it doesn't leverage MTIA specific graph optimization, and is limited to eagerly launch overhead. - MTIA will support another approach(option 2) to provide best performance, based on WrapperFxCodegen. We should be able to reuse the fundamental changes of this diff for option 2, like the device interfaces, steam/event APIs, etc, especially as WrapperFxCodegen inherits PythonWrapperCodegen. Internal: References: - [post for context](https://fb.workplace.com/groups/mtiasw/permalink/1718377262384606/) - [Inductor integration discussion(option 1/2/3)](https://docs.google.com/document/d/1p6363OXtVIRv1hPoaKlRSK3j-iir3QIbDd5bjyqCNig/edit?tab=t.0#heading=h.7s4ns6wcnhmb) - [Project design doc(option 3)](https://docs.google.com/document/d/1jXUmhgoV9WvkMf-bcY3Od_kK9K_RDOdgHdt1LoQ5Tc4/edit?tab=t.0#heading=h.y43gwdqlv46w) - [early prototying diff](https://www.internalfb.com/diff/D75110196) - [MPS integration PR](https://github.com/pytorch/pytorch/pull/153959) - [empty_strided_xpu PR](https://github.com/pytorch/pytorch/pull/126678) Differential Revision: [D79040806](https://our.internmc.facebook.com/intern/diff/D79040806/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159211 Approved by: https://github.com/eellison, https://github.com/blaine-rister, https://github.com/jansel	2025-07-29 17:03:24 +00:00
Justin Chu	de529ef002	[ONNX] onnx.md to simplify deprecated entities (#159312 ) Simplify documentation of deprecated entities and remove the auto-generated page for JitScalarType Pull Request resolved: https://github.com/pytorch/pytorch/pull/159312 Approved by: https://github.com/titaiwangms	2025-07-29 14:24:17 +00:00
William Wen	ffccb90ff4	[dynamo, docs] add fullgraph=False docs (#159050 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159050 Approved by: https://github.com/svekars, https://github.com/anijain2305 ghstack dependencies: #157985, #158055, #158531	2025-07-29 01:53:47 +00:00
William Wen	f916f34739	[dynamo, docs] non-strict programming model docs (#158531 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158531 Approved by: https://github.com/AlannaBurke, https://github.com/mlazos, https://github.com/anijain2305 ghstack dependencies: #157985, #158055 Co-authored-by: Svetlana Karslioglu <svekars@meta.com>	2025-07-29 01:53:47 +00:00
William Wen	c32994ce4b	[docs, dynamo] add fullgraph=True, common graph breaks docs (#158055 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158055 Approved by: https://github.com/AlannaBurke, https://github.com/anijain2305 ghstack dependencies: #157985 Co-authored-by: Svetlana Karslioglu <svekars@meta.com>	2025-07-29 01:53:41 +00:00
William Wen	433e43cbec	[dynamo, docs] programming model dynamo core concepts (#157985 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/157985 Approved by: https://github.com/svekars, https://github.com/anijain2305	2025-07-29 01:53:34 +00:00
PyTorch MergeBot	fe0ff12dab	Revert "[Inductor] Support native Inductor as backend for MTIA (#158526 )" This reverts commit `cd68559d04`. Reverted https://github.com/pytorch/pytorch/pull/158526 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/158526#issuecomment-3122186057))	2025-07-26 17:58:00 +00:00
anwang	cd68559d04	[Inductor] Support native Inductor as backend for MTIA (#158526 ) This diff/PR includes the changes to support native Inductor integration for MTIA. The goal is to support `torch.compile(backend="inductor")` for MTIA. Inductor should generate code(triton kernel + python wrapper code) similar to CUDA. And the triton kernels can be launched eagerly. The changes include: - Add MTIA device interfaces used by Dynamo and Inductor, including APIs on device, stream, event, etc. - Add required torch.mtia APIs, like is_bf16_supported, memory_allocated, set_stream_by_id, etc. - MTIA specific codegen logic, for example, loading MTIA dynamic_library. - Other necessary changes to integrate with Inductor codegn, following other devices like CUDA, XPU. - Integrate with the [empty_strided_mtia](https://www.internalfb.com/code/fbsource/[0d017d3a4a1bdff7253f9c66a9f38e77bd62166b]/fbcode/caffe2/aten/src/ATen/native/mtia/EmptyTensor.cpp?lines=49%2C63%2C71%2C74%2C78) API that we’ve added for the new MTIA ATen backend. - A change in Inductor runtime to avoid re-initialize MTIADriver. - BUCK changes to include ATen-mtia in Inductor, and to use -USE_MTIA preprocessor flag. - Update `test_mnist_e2e.py` to cover native Inductor as backend, using the `--use_native_inductor` flag. - Add a personal script(`scripts/anwang/run_native_inductor_script.py`) for testing purpose. Note: - This approach(option 3) aims to provide a pytorch native approach of Inductor integration for MTIA, minimizing the onboarding overhead. The downside of this approach is that it doesn't leverage MTIA specific graph optimization, and is limited to eagerly launch overhead. - MTIA will support another approach(option 2) to provide best performance, based on WrapperFxCodegen. We should be able to reuse the fundamental changes of this diff for option 2, like the device interfaces, steam/event APIs, etc, especially as WrapperFxCodegen inherits PythonWrapperCodegen. Internal: References: - [post for context](https://fb.workplace.com/groups/mtiasw/permalink/1718377262384606/) - [Inductor integration discussion(option 1/2/3)](https://docs.google.com/document/d/1p6363OXtVIRv1hPoaKlRSK3j-iir3QIbDd5bjyqCNig/edit?tab=t.0#heading=h.7s4ns6wcnhmb) - [Project design doc(option 3)](https://docs.google.com/document/d/1jXUmhgoV9WvkMf-bcY3Od_kK9K_RDOdgHdt1LoQ5Tc4/edit?tab=t.0#heading=h.y43gwdqlv46w) - [early prototying diff](https://www.internalfb.com/diff/D75110196) - [MPS integration PR](https://github.com/pytorch/pytorch/pull/153959) - [empty_strided_xpu PR](https://github.com/pytorch/pytorch/pull/126678) Differential Revision: [D78458745](https://our.internmc.facebook.com/intern/diff/D78458745/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158526 Approved by: https://github.com/blaine-rister, https://github.com/jansel, https://github.com/eellison	2025-07-26 08:16:34 +00:00
Mikayla Gawarecki	36cf8f1ed8	[BE] Use .md instead of .rst for nn.aliases doc (#158666 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158666 Approved by: https://github.com/janeyx99 ghstack dependencies: #158491, #158654	2025-07-25 22:03:55 +00:00
Mikayla Gawarecki	1e79872f2e	[BE] More torch.nn docs coverage test (except for torch.nn.parallel) (#158654 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158654 Approved by: https://github.com/janeyx99 ghstack dependencies: #158491	2025-07-25 22:03:55 +00:00
Mikayla Gawarecki	9e8f27cc79	[BE] Make torch.nn.modules.* satisfy the docs coverage test (#158491 ) Options to address the "undocumented python objects": 1. Reference the functions in the .rst via the torch.nn.modules namespace. Note that this changes the generated doc filenames / locations for most of these functions! 2. [Not an option] Monkeypatch `__module__` for these objects (broke several tests in CI due to `inspect.findsource` failing after this change) 3. Update the .rst files to also document the torch.nn.modules forms of these functions, duplicating docs. #### [this is the docs page added](https://docs-preview.pytorch.org/pytorch/pytorch/158491/nn.aliases.html) This PR takes option 3 by adding an rst page nn.aliases that documents the aliases in nested namespaces, removing all the torch.nn.modules.* entries from the coverage skiplist except - NLLLoss2d (deprecated) - Container (deprecated) - CrossMapLRN2d (what is this?) - NonDynamicallyQuantizableLinear This mostly required adding docstrings to `forward`, `extra_repr` and `reset_parameters`. Since forward arguments are already part of the module docstrings I just added a very basic docstring. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158491 Approved by: https://github.com/janeyx99	2025-07-25 22:03:55 +00:00
raghavhrishi	7ef3c3357d	NUMA binding integration with elastic agent and torchrun (#149334 ) Implements #148689 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149334 Approved by: https://github.com/d4l3k Co-authored-by: Paul de Supinski <pdesupinski@gmail.com>	2025-07-25 21:19:49 +00:00
Joel Schlosser	316c188a5e	Remove torch.functional entries from the doc ignore list (#158581 ) Options to address the "undocumented python objects": 1. Reference the functions in the .rst via the `torch.functional` namespace. Note that this changes the generated doc filenames / locations for most of these functions! 2. Document these functions by referencing them from the `torch.` namespace instead, in line with common usage. This would also require setting the `__module__` for these functions and moving entries from `torch.functional`'s `__all__` -> `torch`'s `__all__`, which is BC-breaking. 3. Update the .rst files to also document the `torch.functional` forms of these functions, duplicating docs. This PR takes option (3) above and: * Removes all 20 `torch.functional` entries from the doc ignore list * Removes `torch.functional.align_tensors()` entirely, since we don't want to document it. * This is technically BC-breaking, although the previous impl simply errored out. This change could be moved to a separate isolated PR for safety. * Introduces `torch.aliases.md` as a hidden page for the `torch.functional` aliases to the `torch` analogue functions Pull Request resolved: https://github.com/pytorch/pytorch/pull/158581 Approved by: https://github.com/janeyx99	2025-07-25 17:19:01 +00:00
PyTorch MergeBot	c8316d0e79	Revert "[BE] Remove torch deploy \| remove torch deploy specific files (#158290 )" This reverts commit `6ed2cb6ccd`. Reverted https://github.com/pytorch/pytorch/pull/158290 on behalf of https://github.com/ZainRizvi due to Reverting as per offline discussion to fix internal breaks. @PaliC will reland this as a codev diff. Instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/158288#issuecomment-3119037960))	2025-07-25 16:09:39 +00:00
PyTorch MergeBot	a9f6770edd	Revert "[BE] Remove __reduce_deploy__ (#158291 )" This reverts commit `9c68c4d08f`. Reverted https://github.com/pytorch/pytorch/pull/158291 on behalf of https://github.com/ZainRizvi due to Reverting as per offline discussion to fix internal breaks. @PaliC will reland this as a codev diff. Instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/158288#issuecomment-3119037960))	2025-07-25 16:09:39 +00:00
Jeff Daily	9b29166f57	[ROCm] add flag torch.backends.miopen.immediate (#158951 ) The MIOpen integration has changed over the years. In the past, the MIOpen default for benchmark was True and if it were set to False it would use MIOpen Immediate Mode. But with #145294 the MIOpen benchmark default changed to False and to activate immediate mode you would set the deterministic flag to True. This has proved too restrictive because benchmark and deterministic flags are independent from immediate mode. Thus, immediate mode needs its own flag. Though MIOpen still masquerades behind torch.backends.cudnn and its flags, it seemed inappropriate to add an miopen-exclusive flag to the set of cudnn flags. This PR adds the first miopen-only flag to control its immediate mode. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158951 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-07-25 04:01:51 +00:00
Xuehai Pan	f5e2de928b	[BE] fix remaining flake8 v7 warnings (#159044 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159044 Approved by: https://github.com/Skylion007 ghstack dependencies: #159043	2025-07-25 02:56:34 +00:00
Ti-Tai Wang	da35562bba	[ONNX] Filter out torchscript sentences (#158850 ) Fixes #157300 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158850 Approved by: https://github.com/justinchuby, https://github.com/svekars	2025-07-24 20:59:06 +00:00
Wei (Will) Feng	693197eed6	[doc] remove FSDP1 developer note (#158991 ) this resolve pytorch doc audit - we remove fsdp1 doc and promote fsdp2 https://docs.pytorch.org/tutorials/intermediate/FSDP_tutorial.html Pull Request resolved: https://github.com/pytorch/pytorch/pull/158991 Approved by: https://github.com/svekars, https://github.com/mori360 ghstack dependencies: #158989	2025-07-24 08:21:54 +00:00
Wei (Will) Feng	68349118b5	[doc] add weifengpy to torch distributed pocs (#158989 ) <img width="415" height="355" alt="Screenshot 2025-07-23 at 16 02 12" src="https://github.com/user-attachments/assets/35b6bb45-d5ed-4d74-8369-e8e66aaa2618" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/158989 Approved by: https://github.com/mori360	2025-07-24 04:42:33 +00:00
Mikayla Gawarecki	7f649ed4f8	Add basic torch.hash_tensor op (#154149 ) Added `torch.hash_tensor` reduction function with a `mode` argument that defaults to reduction with xor. - The hash is always uint64. - Integers will be casted to uint64 before performing the xor_sum reduction - Floats will be upcasted to double and then bitcasted to uint64 before performing the xor_sum reduction Pull Request resolved: https://github.com/pytorch/pytorch/pull/154149 Approved by: https://github.com/albanD	2025-07-23 22:28:03 +00:00
fduwjj	82f8e04f27	Update distributed maintainers (#158900 ) I maintain couple components of distributed like devicemesh, c10d and PGNCCL, gloo, etc. Can I be marked not as emeritus? Thanks! Pull Request resolved: https://github.com/pytorch/pytorch/pull/158900 Approved by: https://github.com/albanD	2025-07-23 21:53:27 +00:00
PaliC	9c68c4d08f	[BE] Remove __reduce_deploy__ (#158291 ) This PR removes the integration point torch.fx had with torch::deploy (and another minor change). Note: This PR has some broken mypy errors, but I believe those should have been in the code base beforehand, and should be fixed in a separate PR Pull Request resolved: https://github.com/pytorch/pytorch/pull/158291 Approved by: https://github.com/albanD ghstack dependencies: #158288, #158290	2025-07-23 20:27:28 +00:00
PaliC	6ed2cb6ccd	[BE] Remove torch deploy \| remove torch deploy specific files (#158290 ) This PR removes specific files found in pytorch which are only used for torch::deploy. This is mostly testing code and a debugger. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158290 Approved by: https://github.com/albanD ghstack dependencies: #158288	2025-07-23 20:27:28 +00:00
drisspg	691736ae07	Add kernel options to flex docs (#158875 ) Fixes https://github.com/pytorch/pytorch/issues/158741 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158875 Approved by: https://github.com/BoyuanFeng, https://github.com/albanD	2025-07-23 19:05:19 +00:00
Panagiotis Kourdis	fd47401536	[doc] Updates to distributed.md for XCCL backend (#155834 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/155834 Approved by: https://github.com/guangyey, https://github.com/AlannaBurke, https://github.com/d4l3k Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>	2025-07-22 21:01:43 +00:00
PyTorch MergeBot	6341311333	Revert "Add unified memory APIs for torch.accelerator (#152932 )" This reverts commit `2ad5c25cfc`. Reverted https://github.com/pytorch/pytorch/pull/152932 on behalf of https://github.com/ZainRizvi due to Very sorry but this is still breaking internally. @albanD would you be able to help get this past the finish line? D78496124 has more details on the failure and the workaround might be to do something like what's in D78684669. To validate the fixes internally, you can follow the instructions here to ghimport the changes: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/138222#issuecomment-3100195370))	2025-07-22 01:01:41 +00:00
PyTorch MergeBot	4c18e85300	Revert "[BE] Remove torch deploy \| remove torch deploy specific files (#158290 )" This reverts commit `a6de309ca1`. Reverted https://github.com/pytorch/pytorch/pull/158290 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally, see D78496147 for details. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/158288#issuecomment-3099826158))	2025-07-21 23:17:39 +00:00
PyTorch MergeBot	920f26c761	Revert "[BE] Remove __reduce_deploy__ (#158291 )" This reverts commit `0b9fb91f17`. Reverted https://github.com/pytorch/pytorch/pull/158291 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally, see D78496147 for details. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/158288#issuecomment-3099826158))	2025-07-21 23:17:38 +00:00
Jane Xu	7cc5d03dfc	Document the rest of the specific optimizer module APIs (#158669 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158669 Approved by: https://github.com/albanD ghstack dependencies: #158483	2025-07-19 07:27:15 +00:00
Jane Xu	f73594164a	[BE] document Adadelta and Adagrad APIs properly (#158483 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158483 Approved by: https://github.com/albanD	2025-07-19 07:27:15 +00:00
Svetlana Karslioglu	79e49efadd	Pull latest Sphinx theme (#158595 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158595 Approved by: https://github.com/albanD	2025-07-18 18:46:47 +00:00
PyTorch MergeBot	9a7c2f1f64	Revert "Add torch compile force disable caches alias (#158072 )" This reverts commit `2ecf083b72`. Reverted https://github.com/pytorch/pytorch/pull/158072 on behalf of https://github.com/jeffdaily due to fails on rocm, signal ignored while rocm was unstable ([comment](https://github.com/pytorch/pytorch/pull/158072#issuecomment-3086740829))	2025-07-18 04:58:24 +00:00
angelayi	66c9bc5062	[export] Add runnable code to export docs (#158506 ) Preview: https://docs-preview.pytorch.org/pytorch/pytorch/158506/export.html Yay I can add runnable code to export docs now Also moved export API reference to a different file. With these changes, we can start to consolidate the [export tutorial](https://docs.pytorch.org/tutorials/intermediate/torch_export_tutorial.html) with the docs on pytorch docs. We just need to move the section on DDE and 0/1 specialization, and then I think we can delete the export tutorial. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158506 Approved by: https://github.com/pianpwk, https://github.com/svekars	2025-07-17 20:15:22 +00:00
Oguz Ulgen	2ecf083b72	Add torch compile force disable caches alias (#158072 ) Bunch of people keep thinking current alias only disables inductor cache because it has the name inductor in it. lets globalize the name Pull Request resolved: https://github.com/pytorch/pytorch/pull/158072 Approved by: https://github.com/ezyang	2025-07-17 15:40:36 +00:00
Jiang, Yanbing	f4d8bc46c7	Enable TF32 as fp32 internal precision for matmul/linear/conv (#157520 ) ### Description This PR is to enable TF32 as fp32 internal precision for matmul/linear/conv in `mkldnn backend`. Since we have refined fp32 precision API in https://github.com/pytorch/pytorch/pull/125888, we can easily extend the API to support TF32 for `mkldnn backend`. ``` torch.backends.mkldnn.matmul.fp32_precision = 'tf32' torch.backends.mkldnn.conv.fp32_precision = "tf32" ``` Related kernel update and UTs update are done. And the wrapper `bf32_on_and _off` is updated to `reduced_f32_on_and_off`, and it can run tests 3 times, one is reduced_f32 OFF, the other two are reduced_f32 ON (including `bf32 ON` and `tf32 ON`). Pull Request resolved: https://github.com/pytorch/pytorch/pull/157520 Approved by: https://github.com/mingfeima, https://github.com/jansel	2025-07-17 08:57:34 +00:00
PaliC	0b9fb91f17	[BE] Remove __reduce_deploy__ (#158291 ) This PR removes the integration point torch.fx had with torch::deploy (and another minor change). Note: This PR has some broken mypy errors, but I believe those should have been in the code base beforehand, and should be fixed in a separate PR Pull Request resolved: https://github.com/pytorch/pytorch/pull/158291 Approved by: https://github.com/albanD ghstack dependencies: #158288, #158290	2025-07-17 05:56:26 +00:00
PaliC	a6de309ca1	[BE] Remove torch deploy \| remove torch deploy specific files (#158290 ) This PR removes specific files found in pytorch which are only used for torch::deploy. This is mostly testing code and a debugger. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158290 Approved by: https://github.com/albanD ghstack dependencies: #158288	2025-07-17 05:56:18 +00:00
Yu, Guangye	2ad5c25cfc	Add unified memory APIs for torch.accelerator (#152932 ) # Motivation The following API will be put under torch.accelerator - empty_cache - max_memory_allocated - max_memory_reserved - memory_allocated - memory_reserved - memory_stats - reset_accumulated_memory_stats - reset_peak_memory_stats Pull Request resolved: https://github.com/pytorch/pytorch/pull/152932 Approved by: https://github.com/albanD ghstack dependencies: #138222	2025-07-17 01:56:01 +00:00
Yiming Zhou	a9ee4250d5	[4/n] Remove references to TorchScript in PyTorch docs (#158317 ) Summary: jit.rst Test Plan: CI Rollback Plan: Differential Revision: D78309840 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158317 Approved by: https://github.com/svekars, https://github.com/zhxchen17	2025-07-16 20:01:34 +00:00
angelayi	1cc62c2cb9	[export] Update docs (#157750 ) Preview: https://docs-preview.pytorch.org/pytorch/pytorch/157750/export.html Changes: * Rename draft_export.md -> export.draft_export.md for consistency. * Removed non-strict section in export, instead pointed to programming model doc. * Extended "Expressing Dynamism" section to include Dim hints, ShapeCollection, and AdditionalInputs. * Removed Specialization section in favor of programming model doc * Added pt2 archive doc * Cleaned up sidebar Pull Request resolved: https://github.com/pytorch/pytorch/pull/157750 Approved by: https://github.com/pianpwk	2025-07-16 19:53:12 +00:00
Jiang, Yanbing	900fba4c07	Update warning of TF32 (#158209 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/158209 Approved by: https://github.com/jansel	2025-07-16 01:28:50 +00:00
Yiming Zhou	05dfd312cf	[3/n] Remove references to TorchScript in PyTorch docs (#158315 ) Summary: - cpp_index.rst - fx.md - jit_builtin_functions.rst - jit_python_reference.md - jit_unsupported.md cpu_threading large_scale_deployment Test Plan: CI Rollback Plan: Differential Revision: D78309320 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158315 Approved by: https://github.com/svekars, https://github.com/zhxchen17	2025-07-15 21:14:18 +00:00
Yiming Zhou	0640cfa38c	[2/n] Remove references to TorchScript in PyTorch docs (#158306 ) Summary: Removed jit_language_reference.md Test Plan: CI Rollback Plan: Differential Revision: D78308133 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158306 Approved by: https://github.com/svekars, https://github.com/zhxchen17	2025-07-15 20:57:23 +00:00
Yiming Zhou	19625daf88	[1/n] Remove references to TorchScript in PyTorch docs (#158305 ) Summary: Removed jit_language_reference_v2.md Test Plan: CI Rollback Plan: Differential Revision: D78308009 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158305 Approved by: https://github.com/jingsh, https://github.com/svekars	2025-07-15 20:16:53 +00:00
Ti-Tai Wang	5606c516fd	[ONNX] Remove legacy Dort (#158258 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158258 Approved by: https://github.com/justinchuby, https://github.com/malfet	2025-07-15 19:14:06 +00:00
Jason Ansel	31326a9ad7	Fix typo in torch.set_float32_matmul_precision docs (#158191 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158191 Approved by: https://github.com/Skylion007, https://github.com/malfet	2025-07-12 18:23:11 +00:00
Ti-Tai Wang	2eff14c445	[ONNX] Delete torch.onnx.dynamo_export (#158130 ) It's deprecated since torch==2.7. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158130 Approved by: https://github.com/justinchuby	2025-07-12 02:30:47 +00:00
Tristan Rice	0d77364ee3	dist2: cleanup non-option methods on PG (missing, timeouts) (#158123 ) This updates the ProcessGroup.* API to include timeouts on all non-option based overloaded methods. This also adds 2 missing ones `alltoall_base` and `barrier`. Following design in: https://docs.google.com/document/d/13R-1t_yESTvmAjcCN-wQjQQadIEu0JNIdS65uZawZzY/edit?tab=t.0#heading=h.3ctbqqopzc89 Test plan: ``` pytest test/distributed/test_dist2.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/158123 Approved by: https://github.com/Skylion007, https://github.com/fduwjj	2025-07-12 00:06:37 +00:00
Shivam Raikundalia	11d6ad8b2e	[Docs] Update PT2 Profiler Torch-Compiled Region Image (#158066 ) Summary: In Pytorch 2.5 we added source code attribution to PT2 traces. Each Torch-Compiled Region will now have its frame id and frame compile id associated with it. Update the image in the doc and add a description of this in the doc itself Test Plan: {F1980179183} Rollback Plan: Differential Revision: D78118228 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158066 Approved by: https://github.com/aaronenyeshi	2025-07-11 07:56:45 +00:00
zeshengzong	b4fc42ca80	Add `torch.segment_reduce` docs (#154352 ) Fixes #153138 ## Test Result ![image](https://github.com/user-attachments/assets/62346d62-d048-4259-906b-f8261e10b4cc) Pull Request resolved: https://github.com/pytorch/pytorch/pull/154352 Approved by: https://github.com/albanD	2025-07-11 06:16:38 +00:00
Jerry Zhang	11a86ad2fa	Remove pytorch quant docs since we are moving to torchao (#157766 ) Summary: att Test Plan: doc page generated from CI Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/157766 Approved by: https://github.com/Skylion007	2025-07-11 03:21:47 +00:00
Howard Huang	8532033679	RPC tutorial audit (#157938 ) Fix [T228333894](https://www.internalfb.com/intern/tasks/?t=228333894) Pull Request resolved: https://github.com/pytorch/pytorch/pull/157938 Approved by: https://github.com/AlannaBurke	2025-07-10 14:15:37 +00:00
Dmitry Rogozhkin	b146ca74f0	docs: add get_default_backend_for_device to distributed documentation (#156783 ) `torch.distributed.get_default_backend_for_device()` API was added to torch 2.6, but is still missing in distributed documentation. This commit addresses the gap. CC: @guangyey, @EikanWang Pull Request resolved: https://github.com/pytorch/pytorch/pull/156783 Approved by: https://github.com/guangyey, https://github.com/malfet	2025-07-10 05:11:30 +00:00
Tristan Rice	ed051c3084	torch.distributed: add initial _dist2 prototype API (#157841 ) This adds the initial dist2 API as proposed in https://docs.google.com/document/d/13R-1t_yESTvmAjcCN-wQjQQadIEu0JNIdS65uZawZzY/edit?tab=t.0#heading=h.3ctbqqopzc89 This is a WIP experimental API and is a sandbox for a number of new features and quality of life improvements/changes to c10d. Test plan: ``` pytest test/distributed/test_dist2.py ``` Docs ``` cd docs make html ``` ![Screenshot 2025-07-08 at 13-39-23 Object Oriented Distributed API - torch distributed _dist2 — PyTorch main documentation](https://github.com/user-attachments/assets/9c03a7ec-09e5-42b9-8478-1ec28bc2b6bd) Pull Request resolved: https://github.com/pytorch/pytorch/pull/157841 Approved by: https://github.com/fduwjj	2025-07-09 23:40:43 +00:00
Dhia-naouali	eaf32fffb7	fixed a tiny typo in torch.compiler.md (#157462 ) Fixes #157444 there was a typo in [docs/source/torch.compiler.md](https://github.com/pytorch/pytorch/blob/main/docs/source/torch.compiler.md) : see -> seen Pull Request resolved: https://github.com/pytorch/pytorch/pull/157462 Approved by: https://github.com/Skylion007, https://github.com/svekars	2025-07-02 19:15:15 +00:00
PyTorch MergeBot	023887fc5a	Revert "Switch to standard pep517 sdist generation (#152098 )" This reverts commit `f16053f0c9`. Reverted https://github.com/pytorch/pytorch/pull/152098 on behalf of https://github.com/malfet due to IMO this PR needs to be split into few helper ones, with better test plan ([comment](https://github.com/pytorch/pytorch/pull/152098#issuecomment-3024223880))	2025-07-01 14:14:52 +00:00
Ti-Tai Wang	c174f3a6a5	[ONNX] Delete deprecated tutorial page link (#157310 ) Related to https://github.com/pytorch/tutorials/issues/3420 Pull Request resolved: https://github.com/pytorch/pytorch/pull/157310 Approved by: https://github.com/justinchuby	2025-07-01 01:18:26 +00:00
Klaus Zimmermann	f16053f0c9	Switch to standard pep517 sdist generation (#152098 ) Generate source tarball with PEP 517 conform build tools instead of the custom routine in place right now. Closes #150461. The current procedure for generating the source tarball consists in creation of a source tree by manual copying and pruning of source files. This PR replaces that with a call to the standard [build tool](https://build.pypa.io/en/stable/), which works with the build backend to produce an sdist. For that to work correctly, the build backend also needs to be configured. In the case of Pytorch, the backend currently is (the legacy version of) the setuptools backend, the source dist part of which is mostly configured via the `MANIFEST.in` file. The resulting source distribution can be used to install directly from source with `pip install ./torch-{version}.tar.gz` or to build wheels directly from source with `pip wheel ./torch-{version}.tar.gz`; both should be considered experimental for now. ## Issues ### sdist name According to PEP 517, the name of the source distribution file must coincide with the project name, or [more precisely](https://peps.python.org/pep-0517/#source-distributions), the source distribution of a project that generates `{NAME}-{...}.whl` wheels are required to be named `{NAME}-{...}.tar.gz`. Currently, the source tarball is called `pytorch-{...}.tar.gz`, but the generated wheels and python package are called `torch-{...}`. ### Symbolic Links The source tree at the moment contains a small number of symbolic links. This [has been seen as problematic](https://github.com/pypa/pip/issues/5919) largely because of lack of support on Windows, but also because of [a problem in setuptools](https://github.com/pypa/setuptools/issues/4937). Particularly unfortunate is a circular symlink in the third party `ittapi` module, which can not be resolved by replacing it with a copy. PEP 721 (now integrated in the [Source Distribution Format Specification](https://packaging.python.org/en/latest/specifications/source-distribution-format/#source-distribution-archive-features)) allows for symbolic links, but only if they don't point outside the destination directory and if they don't contain `../` in their target. The list of symbolic links currently is as follows: <details> \|source\|target\|problem\|solution\| \|-\|-\|-\|-\| \| `.dockerignore` \| `.gitignore` \| ✅ ok (individual file) \|\| \| `docs/requirements.txt` \| `../.ci/docker/requirements-docs.txt` \|❗`..` in target\|swap source and target[^1]\| \| `functorch/docs/source/notebooks` \| `../../notebooks/` \|❗`..` in target\|swap source and target[^1]\| \| `.github/ci_commit_pins/triton.txt` \| `../../.ci/docker/ci_commit_pins/triton.txt` \| ✅ ok (omitted from sdist)\|\| \| `third_party/flatbuffers/docs/source/CONTRIBUTING.md` \| `../../CONTRIBUTING.md` \|❗`..` in target\|omit from sdist[^2]\| \| `third_party/flatbuffers/java/src/test/java/DictionaryLookup` \| `../../../../tests/DictionaryLookup` \|❗`..` in target\|omit from sdist[^3]\| \| `third_party/flatbuffers/java/src/test/java/MyGame` \| `../../../../tests/MyGame` \|❗`..` in target\|omit from sdist[^3]\| \| `third_party/flatbuffers/java/src/test/java/NamespaceA` \| `../../../../tests/namespace_test/NamespaceA` \|❗`..` in target\|omit from sdist[^3]\| \| `third_party/flatbuffers/java/src/test/java/NamespaceC` \| `../../../../tests/namespace_test/NamespaceC` \|❗`..` in target\|omit from sdist[^3]\| \| `third_party/flatbuffers/java/src/test/java/optional_scalars` \| `../../../../tests/optional_scalars` \|❗`..` in target\|omit from sdist[^3]\| \| `third_party/flatbuffers/java/src/test/java/union_vector` \| `../../../../tests/union_vector` \|❗`..` in target\|omit from sdist[^3]\| \| `third_party/flatbuffers/kotlin/benchmark/src/jvmMain/java` \| `../../../../java/src/main/java` \|❗`..` in target\|omit from sdist[^3]\| \| `third_party/ittapi/rust/ittapi-sys/c-library` \| `../../` \|❗`..` in target\|omit from sdist[^4]\| \| `third_party/ittapi/rust/ittapi-sys/LICENSES` \| `../../LICENSES` \|❗`..` in target\|omit from sdist[^4]\| \| `third_party/opentelemetry-cpp/buildscripts/pre-merge-commit` \| `./pre-commit` \|✅ ok (individual file)\|\| \| `third_party/opentelemetry-cpp/third_party/prometheus-cpp/cmake/project-import-cmake/sample_client.cc` \| `../../push/tests/integration/sample_client.cc` \|❗`..` in target\|omit from sdist[^5]\| \| `third_party/opentelemetry-cpp/third_party/prometheus-cpp/cmake/project-import-cmake/sample_server.cc` \| `../../pull/tests/integration/sample_server.cc` \|❗`..` in target\|omit from sdist[^5]\| \| `third_party/opentelemetry-cpp/third_party/prometheus-cpp/cmake/project-import-pkgconfig/sample_client.cc` \| `../../push/tests/integration/sample_client.cc` \|❗`..` in target\|omit from sdist[^5]\| \| `third_party/opentelemetry-cpp/third_party/prometheus-cpp/cmake/project-import-pkgconfig/sample_server.cc` \| `../../pull/tests/integration/sample_server.cc` \|❗`..` in target\|omit from sdist[^5]\| \| `third_party/XNNPACK/tools/xngen` \| `xngen.py` \| ✅ ok (individual file)\|\| </details> The introduction of symbolic links inside the `.ci/docker` folder creates a new problem, however, because Docker's `COPY` command does not allow symlinks in this way. We work around that by using `tar ch` to dereference the symlinks before handing them over to `docker build`. [^1]: These resources can be naturally considered to be part of the docs, so moving the actual files into the place of the current symlinks and replacing them with (unproblematic) symlinks can be said to improve semantics as well. [^2]: The flatbuffers docs already actually use the original file, not the symlink and in the most recent releases, starting from flatbuffers-25.1.21 the symlink is replaced by the actual file thanks to a documentation overhaul. [^3]: These resources are flatbuffers tests for java and kotlin and can be omitted from our sdist. [^4]: We don't need to ship the rust bindings for ittapi. [^5]: These are demonstration examples for how to link to prometheus-cpp using cmake and can be omitted. ### Nccl Nccl used to be included as a submodule. However, with #146073 (first released in v2.7.0-rc1), the submodule was removed and replaced with a build time checkout procedure in `tools/build_pytorch_libs.py`, which checks out the required version of nccl from the upstream repository based on a commit pin recorded in `.ci/docker/ci_commit_pins/nccl-cu{11,12}.txt`. This means that a crucial third party dependency is missing from the source distribution and as the `.ci` folder is omitted from the source distribution, it is not possible to use the build time download. However, it is possible to use a system provided Nccl using the `USE_SYSTEM_NCCL` environment variable, which now also is the default for the official Pytorch wheels. Pull Request resolved: https://github.com/pytorch/pytorch/pull/152098 Approved by: https://github.com/atalman	2025-06-30 19:07:34 +00:00
Saiteja Samudrala	2796f31b5e	[DCP] OSS Zero Overhead Checkpointing Implementation (#156207 ) Summary: This diff updates DCP driver code/APIs to support Zero Overhead Checkpointing Test Plan: Test with TorchTitan on this PR: https://github.com/pytorch/torchtitan/pull/1287 Differential Revision: D72391401 Pull Request resolved: https://github.com/pytorch/pytorch/pull/156207 Approved by: https://github.com/teja-rao	2025-06-29 03:19:48 +00:00
Justin Chu	5692cbb818	[ONNX] Delete symbolic caffe2 (#157102 ) Caffe2 is removed from pytorch. This is a clean up. Pull Request resolved: https://github.com/pytorch/pytorch/pull/157102 Approved by: https://github.com/titaiwangms, https://github.com/cyyever	2025-06-28 05:22:02 +00:00
Jane Xu	4048a144ab	Address richard's comments on libtorch_stable_abi note (#156324 ) Followups from #155984 Pull Request resolved: https://github.com/pytorch/pytorch/pull/156324 Approved by: https://github.com/zou3519	2025-06-27 19:19:12 +00:00
Svetlana Karslioglu	2860f5c4f5	Remove mentioning of TorchScript in Export doc (#156969 ) Remove mentioning of TorchScript Pull Request resolved: https://github.com/pytorch/pytorch/pull/156969 Approved by: https://github.com/angelayi Co-authored-by: Angela Yi <yiangela7@gmail.com>	2025-06-27 17:59:15 +00:00
rzou	aa2d54148d	Add AOTDispatcher config to set backward autocast behavior (#156356 ) This PR adds a new config `backward_pass_autocast`, to set the backward autocast behavior. It does not change the existing behavior. The reason why we need this is that torch.compile acquires a forward and backward graph at the time of the forward pass. This means that implemented naively, if there are any context managers active outside the call to torch.compile, the backward graph will also get the behaviors from those context managers. This PR gives users a way to tweak the autocast behavior of the backward pass. Please see torch._functorch.config for the options to the `backward_pass_autocast` config. Pull Request resolved: https://github.com/pytorch/pytorch/pull/156356 Approved by: https://github.com/bdhirsh ghstack dependencies: #155354	2025-06-27 14:58:58 +00:00
Vasiliy Kuznetsov	414ad47045	revamp dtype documentation for 2025 (#156087 ) The dtype documentation has not been updated in awhile, let's do a revamp. 1. combine the duplicated docs for dtypes from `tensors.rst` and `tensor_attributes.rst` to live in `tensor_attributes.rst`, and link to that page from `tensors.rst` 2. split the dtype table into floating point and integer dtypes 3. add the definition of shell dtype 4. add the float8 and MX dtypes as shell dtypes to the dtype table 5. remove legacy quantized dtypes from the table 6. add the definition of various dtype suffixes ("fn", etc) Pull Request resolved: https://github.com/pytorch/pytorch/pull/156087 Approved by: https://github.com/albanD	2025-06-27 13:10:23 +00:00
Geevarghese George	7f6e7103a3	Convert to markdown: jit_python_reference.rst, jit_unsupported.rst, jit_utils.rst, library.rst (#155404 ) Fixes #155024 Pull Request resolved: https://github.com/pytorch/pytorch/pull/155404 Approved by: https://github.com/svekars	2025-06-26 21:09:46 +00:00
haozhe.zhu	53e0b9c393	refine fp32 precision api (#125888 ) Based on the [conversation](https://github.com/pytorch/pytorch/issues/121791), we plan to drop the "highest, high, medium" to represent fp32 internal computation data types . Instead, we will directly use the algorithm to represent it. ### Design Choice: Directly use algorithms name like "TF32", "BF16". #### Pros - The names are more informative. 'tf32' is more informative than a simple "high". - Easier to extend new algorithm like `tf32x3` #### Cons - "HIGHEST, HIGH, MEDIUM" indicated the relative precision between different algorithms. However, we can have more documents to discuss them. ### We provide a layered structure for backends/operators. ('f32' is short for 'fp32_precision') ![image](https://github.com/user-attachments/assets/f89143e5-d6a1-4865-9351-9a50439f5067) ### We provide 3 fp32 compute precision can be set: - "ieee": Not allowed to use any other internal computation data types . - "tf32": Allowed to use tf32 as internal computation data types. - "bf16": Allowed to use bf16 as internal computation data types. - "none": Precision's are not set. Can be override by its father node. ### Overriding Precision Settings Child node can be override by its father node if it is set to default. For current default settings: ``` backend = generic, op = all, precision setting = none backend = cuda, op = all, precision setting = none backend = cuda, op = conv, precision setting = tf32 backend = cuda, op = rnn, precision setting = tf32 backend = cuda, op = matmul, precision setting = none backend = matmul, op = all, precision setting = none backend = matmul, op = conv, precision setting = none backend = matmul, op = rnn, precision setting = none backend = matmul, op = matmul, precision setting = none ``` - If the user set `torch.backends.mkldnn.fp32_precision="bf16"`, his child nodes `torch.backends.mkldnn.matmul.fp32_precision` / `torch.backends.mkldnn.conv.fp32_precision` / `torch.backends.mkldnn.rnn.fp32_precision` will also be override to "bf16". - If the user set `torch.backends.fp32_precision="bf16"`, `torch.backends.mkldnn.fp32_precision` and his child nodes will also we override to "bf16". ### Backward Compatible Since new API allow user to have more fine-grained control. There will be some conflict. For example, previous `torch.backends.cudnn.allow_tf32` are not enough to represent the status for `torch.backends.cudnn.rnn.fp32_precision="ieee"` and `torch.backends.cudnn.conv.fp32_precision="tf32"`. Therefore, our goal for backward compatible is - If the user only uses previous APIs, it will work as previous expectations. - If the user use new API to change the status to an un-representable status for old API, and try to access the status by old API. We will raise Runtime Error and point the document for user. ### Test Plan ``` python test/test_cuda.py -k test_fp32_precision_with_tf32 python test/test_cuda.py -k test_fp32_precision_with_float32_matmul_precision python test/test_cuda.py -k test_invalid_status_for_legacy_api python test/test_mkldnn.py -k test_mlkdnn_get_set python test/test_mkldnn.py -k test_generic_precision python test/test_mkldnn.py -k test_invalid python test/test_mkldnn.py -k test_default_use_parent ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/125888 Approved by: https://github.com/jgong5, https://github.com/albanD Co-authored-by: Jiang, Yanbing <yanbing.jiang@intel.com>	2025-06-26 10:32:20 +00:00
Mikayla Gawarecki	2c6324a1eb	Delete sections referencing torchscript in serialization docs (#156648 ) Address [T228333890](https://www.internalfb.com/intern/tasks/?t=228333890) Pull Request resolved: https://github.com/pytorch/pytorch/pull/156648 Approved by: https://github.com/svekars	2025-06-25 23:41:24 +00:00
albanD	a25d1443fa	Mark TorchServe as all emeritus (#156865 ) As per title and to follow the broader tutorial cleanup work. Pull Request resolved: https://github.com/pytorch/pytorch/pull/156865 Approved by: https://github.com/svekars, https://github.com/malfet, https://github.com/seemethere	2025-06-25 23:34:57 +00:00
Yu, Guangye	9b498d3bb2	Update docs for torch.device (#156686 ) # Motivation Update the doc, to make `torch.device`'s constructor officially support the following methods: - A device string, which is a string representation of the device type and optionally the device ordinal. - A device type and a device ordinal. - A device ordinal, which is treated as the current accelerator type. Pull Request resolved: https://github.com/pytorch/pytorch/pull/156686 Approved by: https://github.com/albanD	2025-06-25 02:12:36 +00:00
PyTorch MergeBot	6459a5c7a9	Revert "Add unified memory APIs for torch.accelerator (#152932 )" This reverts commit `35e44067c4`. Reverted https://github.com/pytorch/pytorch/pull/152932 on behalf of https://github.com/Camyll due to internal build failures ([comment](https://github.com/pytorch/pytorch/pull/138222#issuecomment-3002206756))	2025-06-25 00:11:35 +00:00
Yu, Guangye	35e44067c4	Add unified memory APIs for torch.accelerator (#152932 ) # Motivation The following API will be put under torch.accelerator - empty_cache - max_memory_allocated - max_memory_reserved - memory_allocated - memory_reserved - memory_stats - reset_accumulated_memory_stats - reset_peak_memory_stats Pull Request resolved: https://github.com/pytorch/pytorch/pull/152932 Approved by: https://github.com/albanD ghstack dependencies: #138222	2025-06-24 07:57:48 +00:00
Edward Z. Yang	17eb649d55	Implement guard collectives (optimized version) (#156562 ) This is a remix of https://github.com/pytorch/pytorch/pull/155558 Instead of mediating guard collective via a config option, in this one it's done via a `set_stance` like API. The motivation is that checking for the config value on entry on torch.compile is apparently quite expensive, according to functorch_maml_omniglot. So this makes it a bit cheaper. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/156562 Approved by: https://github.com/Microve	2025-06-24 04:59:49 +00:00
Animesh Jain	fab85fc5f9	[compile][hierarchical compilation] Release nested_compile_region API (#156449 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/156449 Approved by: https://github.com/zou3519, https://github.com/jansel	2025-06-21 15:14:59 +00:00
Xuehai Pan	2ccfd14e23	[BE] fix typos in docs/ (#156080 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/156080 Approved by: https://github.com/cyyever, https://github.com/albanD	2025-06-21 02:47:32 +00:00
Justin Chu	fbbab794ef	[ONNX] Implement Attention-23 (#156431 ) Implement Attention-23 using sdpa and flexattention. - I used copilot for this. - Also updated the conversion logic to remove trailing None inputs. @gramalingam @kunal-vaishnavi @titaiwangms Pull Request resolved: https://github.com/pytorch/pytorch/pull/156431 Approved by: https://github.com/titaiwangms Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-06-20 23:54:57 +00:00
Bhagirath Mehta	de1930a429	Add ONNX dynamo metadata documentation (#155816 ) Describe auto-generated metadata when calling torch.onnx.export Pull Request resolved: https://github.com/pytorch/pytorch/pull/155816 Approved by: https://github.com/justinchuby, https://github.com/titaiwangms Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-06-20 20:12:22 +00:00
sekyondaMeta	18e4c461fb	Update index.md (#155143 ) Related to: https://github.com/pytorch/pytorch/issues/152134 Update to index.md to add language for Stable and Unstable Pull Request resolved: https://github.com/pytorch/pytorch/pull/155143 Approved by: https://github.com/AlannaBurke, https://github.com/atalman Co-authored-by: Svetlana Karslioglu <svekars@meta.com>	2025-06-20 18:53:32 +00:00
windsonsea	9944cd0949	Convert to markdown: quantization-accuracy-debugging.rst, quantization-backend-configuration.rst, quantization-support.rst, random.rst (#155520 ) Related to #155032 - ✅ quantization-accuracy-debugging.rst: [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155520/quantization-accuracy-debugging.html) vs [main](https://docs.pytorch.org/docs/main/quantization-accuracy-debugging.html) - ✅ quantization-backend-configuration.rst: [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155520/quantization-backend-configuration.html) vs [main](https://docs.pytorch.org/docs/main/quantization-backend-configuration.html) - ✅ quantization-support.rst: [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155520/quantization-support.html) vs [main](https://docs.pytorch.org/docs/main/quantization-support.html) - ✅ random.rst: [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155520/random.html) vs [main](https://docs.pytorch.org/docs/main/random.html) Pull Request resolved: https://github.com/pytorch/pytorch/pull/155520 Approved by: https://github.com/svekars Co-authored-by: Svetlana Karslioglu <svekars@meta.com>	2025-06-18 18:46:04 +00:00
nirajkamalk	202d2ae53a	Convert rst to md: rpc.rst, signal.rst, size.rst, special.rst (#155430 ) Fixes #155033 - [x] [rpc.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/rpc.rst) - [x] [signal.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/signal.rst) - [x] [size.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/size.rst) - [sparse.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/sparse.rst) fixed in #155438 due to large size. - [x] [special.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/special.rst) Pull Request resolved: https://github.com/pytorch/pytorch/pull/155430 Approved by: https://github.com/svekars Co-authored-by: Svetlana Karslioglu <svekars@meta.com>	2025-06-18 01:27:04 +00:00
Jane Xu	e8bfce9a43	Document how to use stack-based APIs with StableIValue (#155984 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/155984 Approved by: https://github.com/albanD, https://github.com/zou3519	2025-06-18 01:10:23 +00:00
PyTorch MergeBot	fa4f07b5b8	Revert "[Docs] Convert to markdown to fix 155032 (#155520 )" This reverts commit `cd66ff8030`. Reverted https://github.com/pytorch/pytorch/pull/155520 on behalf of https://github.com/atalman due to breaks multiple test_quantization.py::TestQuantizationDocs::test_quantization_ ([comment](https://github.com/pytorch/pytorch/pull/155520#issuecomment-2981996091))	2025-06-17 22:22:50 +00:00
windsonsea	cd66ff8030	[Docs] Convert to markdown to fix 155032 (#155520 ) Fix #155032 - ✅ quantization-accuracy-debugging.rst: [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155520/quantization-accuracy-debugging.html) vs [main](https://docs.pytorch.org/docs/main/quantization-accuracy-debugging.html) - ✅ quantization-backend-configuration.rst: [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155520/quantization-backend-configuration.html) vs [main](https://docs.pytorch.org/docs/main/quantization-backend-configuration.html) - ✅ quantization-support.rst: [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155520/quantization-support.html) vs [main](https://docs.pytorch.org/docs/main/quantization-support.html) - ✅ quantization.rst: [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155520/quantization.html) vs [main](https://docs.pytorch.org/docs/main/quantization.html) - ✅ random.rst: [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155520/random.html) vs [main](https://docs.pytorch.org/docs/main/random.html) Pull Request resolved: https://github.com/pytorch/pytorch/pull/155520 Approved by: https://github.com/svekars Co-authored-by: Svetlana Karslioglu <svekars@meta.com>	2025-06-17 20:29:45 +00:00
Svetlana Karslioglu	fc5ae12293	Fix issue with right-nav (#156119 ) Enable on page right nav. For autosummary, we need to set `"show_toc_level": 2` so that navigation is enabled. Example: * Main: https://docs.pytorch.org/docs/main/special.html - right nav (under On this page) is empty. * Preview: https://docs-preview.pytorch.org/pytorch/pytorch/156119/special.html - right nav (under On this page) has a all the object listed <img width="1125" alt="Screenshot 2025-06-16 at 2 48 16 PM" src="https://github.com/user-attachments/assets/0790bb72-5997-4542-9847-0a89be4598c0" /> vs <img width="1030" alt="Screenshot 2025-06-16 at 2 48 55 PM" src="https://github.com/user-attachments/assets/4897c49c-044d-4bea-a8cd-490c90cca2b0" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/156119 Approved by: https://github.com/albanD	2025-06-17 18:09:51 +00:00
windsonsea	7fcad0231c	[Docs] Convert to markdown to fix 155025 (#155789 ) Related to #155025 Pull Request resolved: https://github.com/pytorch/pytorch/pull/155789 Approved by: https://github.com/svekars	2025-06-17 15:08:14 +00:00
Wei (Will) Feng	4a8f5e752b	[FSDP2] explain user contract for fully_shard (#156070 ) <img width="896" alt="Screenshot 2025-06-16 at 1 36 00 AM" src="https://github.com/user-attachments/assets/7cdea256-2454-49c7-8b32-24549a13134d" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/156070 Approved by: https://github.com/mori360	2025-06-17 10:03:19 +00:00
Justin Silver	008345be9d	Fix #155018 (convert distributed rst to markdown) (#155528 ) Used [rst2myst tool](https://rst-to-myst.readthedocs.io/en/latest/) Fixes #155018 Docs comparison (check out the 'new' whenever docs build) 1. distributed.checkpoint ([old](https://docs.pytorch.org/docs/main/distributed.checkpoint.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155528/distributed.checkpoint.html)) 2. distributed.elastic ([old](https://docs.pytorch.org/docs/main/distributed.elastic.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155528/distributed.elastic.html)) 3. distributed.fsdp.fully_shard ([old](https://docs.pytorch.org/docs/main/distributed.fsdp.fully_shard.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155528/distributed.fsdp.fully_shard.html)) 4. distributed.optim ([old](https://docs.pytorch.org/docs/main/distributed.optim.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155528/distributed.optim.html)) 5. distributed.pipelining ([old](https://docs.pytorch.org/docs/main/distributed.pipelining.html) vs. [new](https://docs-preview.pytorch.org/pytorch/pytorch/155528/distributed.pipelining.html)) Pull Request resolved: https://github.com/pytorch/pytorch/pull/155528 Approved by: https://github.com/wz337, https://github.com/svekars	2025-06-16 20:46:09 +00:00
Julian De la Barrera Brandner	2dc1627451	[doc] Add documentation for division by zero behavior in autograd (#155987 ) Fixes #128796 This PR adds documentation about the behavior of division by zero operations in PyTorch's autograd system. The documentation explains: 1. How division by zero produces `inf` values following IEEE-754 floating point arithmetic 2. How autograd handles these cases and why masking after division can lead to `nan` gradients 3. Provides concrete examples showing the issue 4. Recommends two solutions: - Masking before division - Using MaskedTensor (experimental API) The documentation is added to the autograd notes section, making it easily discoverable for users who encounter this common issue. This addresses the original issue #128796 which requested better documentation of this behavior to help users avoid common pitfalls when dealing with division by zero in their models. dditional changes: - Fixed formatting consistency by replacing curly apostrophes with straight apostrophes in the existing documentation Pull Request resolved: https://github.com/pytorch/pytorch/pull/155987 Approved by: https://github.com/soulitzer Co-authored-by: sekyondaMeta <127536312+sekyondaMeta@users.noreply.github.com>	2025-06-16 19:02:12 +00:00
windsonsea	fbd88ae2b5	Convert to markdown: checkpoint.rst (#156009 ) Related to #155014 Use two commits to have a try. ```bash 1800 git mv docs/source/checkpoint.rst docs/source/checkpoint.md 1802 git commit -m "[Docs] Rename checkpoint.rst" 1803 git push origin ckpoint # update the markdown file 1805 git add . 1806 git commit -m "modify checkpoint.md" 1807 git push origin ckpoint ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/156009 Approved by: https://github.com/svekars	2025-06-16 17:48:23 +00:00
windsonsea	a10024d7de	Convert complex_numbers.rst to markdown (#156039 ) Related to #155014 Have a try by following https://github.com/pytorch/pytorch/pull/155899#issuecomment-2974715750 Pull Request resolved: https://github.com/pytorch/pytorch/pull/156039 Approved by: https://github.com/svekars	2025-06-16 17:24:37 +00:00
Dhia naouali	c620d0b5c7	convert: rst to myst pr2/2 (#155911 ) Fixes #155038 parent [PR](https://github.com/pytorch/pytorch/pull/155375) (made two PRs to pass sanity check) this PR converts the following three .rst files with the mentioned referenced in each file - [torch.compiler_faq](https://github.com/pytorch/pytorch/blob/main/docs/source/torch.compiler_faq.rst) - torch.compiler_troubleshooting - nonsupported_numpy_feats - torchdynamo_fine_grain_tracing - [torch.compiler_fine_grain_apis](https://github.com/pytorch/pytorch/blob/main/docs/source/torch.compiler_fine_grain_apis.rst) - None - [torch.compiler_get_started](https://github.com/pytorch/pytorch/blob/main/docs/source/torch.compiler_get_started.rst) - torch.compiler_overview - torch.compiler_api - torchdynamo_fine_grain_tracing I made the suggested edits by the maintainers as commented in the parent PR (used git mv on all files, yet it still appeared as delete-create action) Pull Request resolved: https://github.com/pytorch/pytorch/pull/155911 Approved by: https://github.com/svekars Co-authored-by: Svetlana Karslioglu <svekars@meta.com>	2025-06-16 00:44:44 +00:00
Animesh Jain	54976bca10	[dynamo] Provide helper functions for guard filter hook (#155083 ) Collection of ready-made guard filters. One issue is that they are not composable - `filter1(filter2(guard))`. On the other hand, they are easy to use. Pull Request resolved: https://github.com/pytorch/pytorch/pull/155083 Approved by: https://github.com/zhxchen17, https://github.com/jansel	2025-06-15 17:49:36 +00:00
Edgar Romo Montiel	ca3cabd24a	Convert to markdown: named_tensor.rst, nested.rst, nn.attention.bias.rst, nn.attention.experimental.rst, nn.attention.flex_attention.rst #155028 (#155696 ) Fixes #155028 This pull request updates the documentation by transitioning from .rst to .md format. It introduces new Markdown files for the documentation of named_tensor, nested, nn.attention.bias, nn.attention.experimental, and nn.attention.flex_attention Pull Request resolved: https://github.com/pytorch/pytorch/pull/155696 Approved by: https://github.com/svekars Co-authored-by: Svetlana Karslioglu <svekars@meta.com>	2025-06-14 03:32:00 +00:00
dggaytan	3003c681ef	Converting .rst files to .md files (#155377 ) Fixes #155036 This pull request updates the documentation for several modules by transitioning from .rst to .md format, improving readability and usability. It introduces new Markdown files for the documentation of torch.ao.ns._numeric_suite, torch.ao.ns._numeric_suite_fx, AOTInductor, AOTInductor Minifier, and the torch.compiler API Pull Request resolved: https://github.com/pytorch/pytorch/pull/155377 Approved by: https://github.com/svekars Co-authored-by: Svetlana Karslioglu <svekars@meta.com>	2025-06-13 22:54:27 +00:00

1 2 3 4 5 ...

3475 Commits