Mirror of https://github.com/zebrajr/pytorch.git, synced 2025-12-06 12:20:52 +01:00
Don't GC as often when collecting cudagraphs (#158193)
TL;DR: Cuts vLLM cudagraph collection from 80s -> 24s
Stop garbage collecting by default on every cudagraph recording. The old behavior can be re-enabled by setting `TORCH_CUDAGRAPH_GC=1` or the config `force_cudagraph_gc`.
We were previously garbage collecting at the beginning of each cudagraph
capture. vLLM collects 5427 graphs and most of those garbage collections weren't
actually collecting any memory (CPU or GPU). This changes it to collect no more
often than every 10s, so if we're capturing in a loop we don't burn all our
cycles looking for garbage.
(These numbers have a lot of variance from run to run but give the correct
general scale.)
```
       | calls | total | synchronize | gcs  | collect | empty cache | sys freed | cuda freed |
-------+-------+-------+-------------+------+---------+-------------+-----------+------------+
before | 5427  |   78s |       1.48s | 5427 |  53.22s |       1.21s |    145855 | 1539309568 |
-------+-------+-------+-------------+------+---------+-------------+-----------+------------+
after  | 5427  |   24s |          0s |    3 |   1.53s |       0.84s |       592 | 1539309568 |
-------+-------+-------+-------------+------+---------+-------------+-----------+------------+
```
total - the total time reported by vLLM's "Graph capturing finished" log.
The rest of these are measured in torch.cuda.graphs.graph.__enter__():
calls - number of times torch.cuda.graphs.graph.__enter__ was called
synchronize - duration of the torch.cuda.synchronize call
gcs - number of times gc.collect was called
collect - duration of the gc.collect call
empty cache - duration of the torch.cuda.empty_cache call
sys freed - the number of bytes reported freed by gc.collect
cuda freed - the number of bytes reported freed, as measured by torch.cuda.memory_reserved
So it seems like the heavy lifting is done by torch.cuda.empty_cache(), which is
fairly quick.
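Per-phase numbers like "gcs" and "collect" in the table can be gathered by wrapping each step of `__enter__` in an accumulating timer. A minimal sketch of that kind of instrumentation, using only the standard library (the phase names and the `timed` helper are illustrative, not PyTorch internals):

```python
import contextlib
import gc
import time
from collections import defaultdict

# Accumulate total duration and call count per capture phase.
totals: dict[str, float] = defaultdict(float)
counts: dict[str, int] = defaultdict(int)

@contextlib.contextmanager
def timed(phase: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        totals[phase] += time.perf_counter() - start
        counts[phase] += 1

# Example: time the GC phase of a (mock) capture loop.
for _ in range(3):
    with timed("collect"):
        gc.collect()
```

Summing `totals` per phase over all 5427 captures is what lets a column like "collect: 53.22s" be attributed to one specific call.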
Cudagraph results from the TorchInductor Performance Dashboard (this is from the original version using the GC clock, so the real results will be slightly better than this):
<img width="1494" height="382" alt="image" src="https://github.com/user-attachments/assets/69b705ef-47ce-4b6e-9733-1ec941cad93d" />
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158193
Approved by: https://github.com/ngimel
This commit is contained in:
parent cae4746952
commit e20736bf1d
```diff
@@ -88,4 +88,12 @@ This whitelist is dominant over all other flags dynamic=False, force_nn_module_p
 and force_parameter_static_shapes.
 """
 
+# force a python GC before recording cudagraphs
+force_cudagraph_gc: bool = Config(env_name_default="TORCH_CUDAGRAPH_GC", default=False)
+"""
+If True (the backward-compatible behavior) then gc.collect() before recording
+any cudagraph.
+"""
+
+
 install_config_module(sys.modules[__name__])
```
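The `env_name_default` pattern above lets an environment variable override the coded default, which is how `TORCH_CUDAGRAPH_GC=1` re-enables the old behavior. A minimal stand-in for that behavior using only the standard library (the real `Config` class lives in torch; `bool_config` and its truthiness rules here are assumptions for illustration):

```python
import os

def bool_config(env_name: str, default: bool) -> bool:
    """Read a boolean flag from the environment, falling back to `default`.

    Illustrative stand-in for torch's Config(env_name_default=...);
    treats "1" as true and "0"/"" as false, like TORCH_CUDAGRAPH_GC=1.
    """
    raw = os.environ.get(env_name)
    if raw is None:
        return default
    return raw.strip() not in ("0", "false", "False", "")

# With the env var unset, the coded default (GC off) wins:
os.environ.pop("TORCH_CUDAGRAPH_GC", None)
force_cudagraph_gc = bool_config("TORCH_CUDAGRAPH_GC", default=False)

# Setting TORCH_CUDAGRAPH_GC=1 re-enables the old always-GC behavior:
os.environ["TORCH_CUDAGRAPH_GC"] = "1"
force_cudagraph_gc = bool_config("TORCH_CUDAGRAPH_GC", default=False)
```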
|
```diff
@@ -233,7 +233,15 @@ class graph:
     def __enter__(self) -> None:
         # Free as much memory as we can for the graph
         torch.cuda.synchronize()
-        gc.collect()
+
+        if torch.compiler.config.force_cudagraph_gc:
+            # Originally we unconditionally garbage collected here. On one hand
+            # that's nice because we have a chance to collect more memory, but
+            # on the other hand it is REALLY expensive, especially for doing
+            # multiple cudagraph captures in a row. In theory it will only help
+            # when a dead python cycle is holding onto CUDA memory.
+            gc.collect()
+
         torch.cuda.empty_cache()
 
         # Stackoverflow seems comfortable with this pattern
```
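The effect of that gate in a capture loop can be shown with a CUDA-free mock. The `enter_capture` function and counter below are illustrative only; the real `__enter__` also calls `torch.cuda.synchronize()` and `torch.cuda.empty_cache()`, which are omitted here since this sketch runs without a GPU:

```python
import gc

# Stand-in for torch.compiler.config.force_cudagraph_gc (default off).
force_cudagraph_gc = False

gc_calls = 0

def enter_capture() -> None:
    """Mock of graph.__enter__: full GC only when explicitly forced."""
    global gc_calls
    if force_cudagraph_gc:
        gc.collect()
        gc_calls += 1

# Capturing many graphs in a row no longer pays for a full GC each time.
for _ in range(100):
    enter_capture()
```

With the flag off, a hundred (or 5427) captures perform zero collections, which is where the bulk of the 78s -> 24s saving comes from.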
|
|||
|
|
```diff
@@ -28,7 +28,7 @@ class _LazyModule:
     # NOTE: Add additional used imports here.
     if TYPE_CHECKING:
         import onnx
-        import onnx_ir  # type: ignore[import-untyped, import-not-found]
+        import onnx_ir  # type: ignore[import-untyped]
         import onnxscript
         import onnxscript._framework_apis.torch_2_8 as onnxscript_apis
```
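The `if TYPE_CHECKING:` block in that hunk is the standard pattern for imports that only the type checker should see. A minimal sketch of how it behaves at runtime (the `_optional_dep` name is a stand-in invented for this example, not part of the PyTorch code):

```python
from typing import TYPE_CHECKING

# TYPE_CHECKING is False at runtime, so the guarded imports are skipped
# entirely; static checkers like mypy treat the block as taken and still
# see the names. This keeps heavy or optional dependencies (onnx,
# onnxscript, ...) off the runtime import path unless actually used.
if TYPE_CHECKING:
    import array as _optional_dep  # stand-in for an optional dependency

def dep_imported_at_runtime() -> bool:
    # The guarded name exists only for the type checker, not at runtime.
    return "_optional_dep" in globals()
```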