pytorch/docs/source/notes
Frank Lin bec6541d84 [CUDA][CUDAGraph] Reduce capture overhead in CUDA Graph memory reuse (#162186)
Previous work #158352 delivered a CUDAGraph memory-footprint reduction with no replay-time impact, but capture time regressed (up to 20× slower) due to repeated full-graph traversals. See the previous benchmark results [here](https://github.com/pytorch/pytorch/pull/158352#issuecomment-3215947565).

This PR removes the capture/replay overhead while preserving the memory savings:

1. **Terminals as free markers**
   We stop inserting empty nodes and instead record the current stream terminals as free markers. This avoids mutating the user’s graph and keeps semantics unchanged.

2. **Incremental, cached reachability**
   We add a **per-graph reuse context** that caches reverse-traversal state:

   * `graph_reuse_context[graph].visited[stream]` tracks nodes already seen from that stream’s terminal frontier.
   * On each allocation during capture, we resume traversal from the latest terminals and only visit unseen nodes.
   * A block is freed when all its recorded markers are in the visited set of its allocation stream—i.e., all markers are proven predecessors of future work (see the sketch after this list).

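For illustration, here is a minimal Python sketch of this bookkeeping. The actual implementation lives in the C++ CUDA caching allocator; `GraphReuseContext`, the `predecessors` callback, and the block fields used here are hypothetical stand-ins for cudaGraph nodes and allocator state.

```python
from collections import defaultdict

class GraphReuseContext:
    """Per-graph cache of reverse-traversal state (illustrative only)."""

    def __init__(self):
        # visited[stream]: nodes already proven reachable (walking backwards)
        # from that stream's terminal frontier in earlier traversals.
        self.visited = defaultdict(set)

    def mark_free(self, block, stream_terminals):
        # Instead of inserting empty nodes into the user's graph, record the
        # stream's current terminals as this block's free markers.
        block.markers = set(stream_terminals)

    def update_visited(self, stream, current_terminals, predecessors):
        # Resume the reverse traversal from the latest terminals, visiting
        # only nodes not seen in previous traversals from this stream.
        stack = [t for t in current_terminals if t not in self.visited[stream]]
        while stack:
            node = stack.pop()
            if node in self.visited[stream]:
                continue
            self.visited[stream].add(node)
            stack.extend(predecessors(node))

    def can_reuse(self, block):
        # Safe to reuse once every free marker is a proven predecessor of the
        # allocation stream's future work.
        return block.markers <= self.visited[block.alloc_stream]
```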
We sweep synthetic multi-stream CUDA Graphs built by `capture_benchmark.py` (same as before: random interleavings of alloc/free/join events with given probabilities, see [the gist here](https://gist.github.com/eee4017/e2092d215b1d4bd46534148939af39e3)) and compare median capture/replay times and memory; see [the full performance results here](https://docs.google.com/spreadsheets/d/e/2PACX-1vRPvdd9Xa8W87ixbiA0da_qvOhrUAjUpFz0G-_j-MsDnoeRyhEa4_ut_W3rqcg1VVZVFJ-gucwov-3b/pubhtml?gid=1468302443&single=true). On an NVIDIA H100 PCIe across 24 configs, the optimization preserves the reserved-memory reduction at ~24–98%, leaves allocated memory unchanged, and brings capture time back to baseline (0.96–1.04× vs. baseline), with replay time unchanged (0.97–1.11×).
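
For reference, a simplified sketch of the kind of workload such a sweep captures (not the actual `capture_benchmark.py`; the function name, probabilities, and sizes are illustrative). It uses the standard fork/join pattern for multi-stream capture via `torch.cuda.Stream.wait_stream`:

```python
import random
import torch

def capture_random_workload(num_streams=4, steps=200,
                            p_alloc=0.5, p_join=0.2, seed=0):
    rng = random.Random(seed)
    streams = [torch.cuda.Stream() for _ in range(num_streams)]
    live = []  # tensors allocated (and possibly freed) inside the capture
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        origin = torch.cuda.current_stream()
        for s in streams:
            s.wait_stream(origin)  # fork every stream into the capture
        for _ in range(steps):
            s = rng.choice(streams)
            with torch.cuda.stream(s):
                r = rng.random()
                if r < p_alloc or not live:
                    # alloc: launches a fill kernel on stream s
                    live.append(torch.full((1 << 18,), 1.0, device="cuda"))
                elif r < p_alloc + p_join:
                    s.wait_stream(rng.choice(streams))  # join two streams
                else:
                    live.pop(rng.randrange(len(live)))  # free during capture
        for s in streams:
            origin.wait_stream(s)  # join back before capture ends
    return g

g = capture_random_workload()
g.replay()
torch.cuda.synchronize()
```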

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162186
Approved by: https://github.com/eqy, https://github.com/ngimel
2025-09-30 22:28:46 +00:00
amp_examples.rst
autograd.rst [doc] Add documentation for division by zero behavior in autograd (#155987) 2025-06-16 19:02:12 +00:00
broadcasting.rst Fix comment on broadcasting example to clarify dimension mismatch (#162177) 2025-09-29 16:47:48 +00:00
cpu_threading_torchscript_inference.rst [3/n] Remove references to TorchScript in PyTorch docs (#158315) 2025-07-15 21:14:18 +00:00
cuda.rst [CUDA][CUDAGraph] Reduce capture overhead in CUDA Graph memory reuse (#162186) 2025-09-30 22:28:46 +00:00
custom_operators.rst
ddp.rst
extending.func.rst
extending.rst [autograd][docs] Add more details on why save_for_backward is important in extending autograd note (#153005) 2025-05-09 16:36:57 +00:00
faq.rst
get_start_xpu.rst update supported OS for Intel client GPU (#161699) 2025-09-01 05:45:09 +00:00
gradcheck.rst [BE] fix typos in docs/ (#156080) 2025-06-21 02:47:32 +00:00
hip.rst [ROCm] Ck backend UX refactor (#152951) 2025-08-08 18:40:17 +00:00
large_scale_deployments.rst [3/n] Remove references to TorchScript in PyTorch docs (#158315) 2025-07-15 21:14:18 +00:00
libtorch_stable_abi.md Add ScalarType -> shim conversion, add stable::Tensor.scalar_type (#160557) 2025-08-19 22:13:47 +00:00
mkldnn.rst Enable TF32 as fp32 internal precision for matmul/linear/conv (#157520) 2025-07-17 08:57:34 +00:00
modules.rst
mps.rst
multiprocessing.rst [BE] fix typos in docs/ (#156080) 2025-06-21 02:47:32 +00:00
numerical_accuracy.rst Update warning of TF32 (#158209) 2025-07-16 01:28:50 +00:00
out.rst add Out Notes (#151306) 2025-04-24 20:25:09 +00:00
randomness.rst
serialization.rst Delete sections referencing torchscript in serialization docs (#156648) 2025-06-25 23:41:24 +00:00
windows.rst Removing conda references from PyTorch Docs (#152702) 2025-05-20 20:33:28 +00:00