pytorch/docs/source/notes
Frank Lin bec6541d84 [CUDA][CUDAGraph] Reduce capture overhead in CUDA Graph memory reuse (#162186)
Previous work #158352 delivered a CUDAGraph memory-footprint reduction with no replay-time impact, but capture time regressed (up to 20× slower) due to repeated full-graph traversals. See the previous benchmark results [here](https://github.com/pytorch/pytorch/pull/158352#issuecomment-3215947565).

This PR removes the capture/replay overhead while preserving the memory savings:

1. **Terminals as free markers**
   We stop inserting empty nodes and instead record the current stream terminals as free markers. This avoids mutating the user’s graph and keeps semantics unchanged.

2. **Incremental, cached reachability**
   We add a **per-graph reuse context** that caches reverse-traversal state:

   * `graph_reuse_context[graph].visited[stream]` tracks nodes already seen from that stream’s terminal frontier.
   * On each allocation during capture, we resume traversal from the latest terminals and only visit unseen nodes.
   * A block is freed when all its recorded markers are in the visited set of its allocation stream—i.e., all markers are proven predecessors of future work (see the sketch after this list).

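For illustration, here is a minimal Python sketch of this bookkeeping. The actual implementation lives in the C++ CUDA caching allocator; `GraphReuseContext`, the `predecessors` callback, and the block fields used here are hypothetical stand-ins for cudaGraph nodes and allocator state.

```python
from collections import defaultdict

class GraphReuseContext:
    """Per-graph cache of reverse-traversal state (illustrative only)."""

    def __init__(self):
        # visited[stream]: nodes already proven reachable (walking backwards)
        # from that stream's terminal frontier in earlier traversals.
        self.visited = defaultdict(set)

    def mark_free(self, block, stream_terminals):
        # Instead of inserting empty nodes into the user's graph, record the
        # stream's current terminals as this block's free markers.
        block.markers = set(stream_terminals)

    def update_visited(self, stream, current_terminals, predecessors):
        # Resume the reverse traversal from the latest terminals, visiting
        # only nodes not seen in previous traversals from this stream.
        stack = [t for t in current_terminals if t not in self.visited[stream]]
        while stack:
            node = stack.pop()
            if node in self.visited[stream]:
                continue
            self.visited[stream].add(node)
            stack.extend(predecessors(node))

    def can_reuse(self, block):
        # Safe to reuse once every free marker is a proven predecessor of the
        # allocation stream's future work.
        return block.markers <= self.visited[block.alloc_stream]
```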
We sweep synthetic multi-stream CUDA Graphs built by `capture_benchmark.py` (same as before: random interleavings of alloc/free/join events with given probabilities, see [the gist here](https://gist.github.com/eee4017/e2092d215b1d4bd46534148939af39e3)) and compare median capture/replay times and memory; see [the full performance results here](https://docs.google.com/spreadsheets/d/e/2PACX-1vRPvdd9Xa8W87ixbiA0da_qvOhrUAjUpFz0G-_j-MsDnoeRyhEa4_ut_W3rqcg1VVZVFJ-gucwov-3b/pubhtml?gid=1468302443&single=true). On an NVIDIA H100 PCIe across 24 configs, the optimization preserves the reserved-memory reduction at ~24–98%, leaves allocated memory unchanged, and brings capture time back to baseline (0.96–1.04× vs. baseline), with replay time unchanged (0.97–1.11×).
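
For reference, a simplified sketch of the kind of workload such a sweep captures (not the actual `capture_benchmark.py`; the function name, probabilities, and sizes are illustrative). It uses the standard fork/join pattern for multi-stream capture via `torch.cuda.Stream.wait_stream`:

```python
import random
import torch

def capture_random_workload(num_streams=4, steps=200,
                            p_alloc=0.5, p_join=0.2, seed=0):
    rng = random.Random(seed)
    streams = [torch.cuda.Stream() for _ in range(num_streams)]
    live = []  # tensors allocated (and possibly freed) inside the capture
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        origin = torch.cuda.current_stream()
        for s in streams:
            s.wait_stream(origin)  # fork every stream into the capture
        for _ in range(steps):
            s = rng.choice(streams)
            with torch.cuda.stream(s):
                r = rng.random()
                if r < p_alloc or not live:
                    # alloc: launches a fill kernel on stream s
                    live.append(torch.full((1 << 18,), 1.0, device="cuda"))
                elif r < p_alloc + p_join:
                    s.wait_stream(rng.choice(streams))  # join two streams
                else:
                    live.pop(rng.randrange(len(live)))  # free during capture
        for s in streams:
            origin.wait_stream(s)  # join back before capture ends
    return g

g = capture_random_workload()
g.replay()
torch.cuda.synchronize()
```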

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162186
Approved by: https://github.com/eqy, https://github.com/ngimel
2025-09-30 22:28:46 +00:00
amp_examples.rst
autograd.rst [doc] Add documentation for division by zero behavior in autograd (#155987) 2025-06-16 19:02:12 +00:00
broadcasting.rst Fix comment on broadcasting example to clarify dimension mismatch (#162177) 2025-09-29 16:47:48 +00:00
cpu_threading_torchscript_inference.rst [3/n] Remove references to TorchScript in PyTorch docs (#158315) 2025-07-15 21:14:18 +00:00
cuda.rst [CUDA][CUDAGraph] Reduce capture overhead in CUDA Graph memory reuse (#162186) 2025-09-30 22:28:46 +00:00
custom_operators.rst
ddp.rst
extending.func.rst
extending.rst [autograd][docs] Add more details on why save_for_backward is important in extending autograd note (#153005) 2025-05-09 16:36:57 +00:00
faq.rst
get_start_xpu.rst update supported OS for Intel client GPU (#161699) 2025-09-01 05:45:09 +00:00
gradcheck.rst [BE] fix typos in docs/ (#156080) 2025-06-21 02:47:32 +00:00
hip.rst [ROCm] Ck backend UX refactor (#152951) 2025-08-08 18:40:17 +00:00
large_scale_deployments.rst [3/n] Remove references to TorchScript in PyTorch docs (#158315) 2025-07-15 21:14:18 +00:00
libtorch_stable_abi.md Add ScalarType -> shim conversion, add stable::Tensor.scalar_type (#160557) 2025-08-19 22:13:47 +00:00
mkldnn.rst Enable TF32 as fp32 internal precision for matmul/linear/conv (#157520) 2025-07-17 08:57:34 +00:00
modules.rst
mps.rst
multiprocessing.rst [BE] fix typos in docs/ (#156080) 2025-06-21 02:47:32 +00:00
numerical_accuracy.rst Update warning of TF32 (#158209) 2025-07-16 01:28:50 +00:00
out.rst add Out Notes (#151306) 2025-04-24 20:25:09 +00:00
randomness.rst
serialization.rst Delete sections referencing torchscript in serialization docs (#156648) 2025-06-25 23:41:24 +00:00
windows.rst Removing conda references from PyTorch Docs (#152702) 2025-05-20 20:33:28 +00:00