[profiler] enable CUPTI range profiler in build (#125685)

Fixes #125272
## About
(This is a re-spin of PR #106617)

Kineto introduced a new profiler mode that reads performance counters from NVIDIA GPUs (the CUPTI Range Profiler API), added in PR [#75616](https://github.com/pytorch/pytorch/pull/75616). Support for the range profiler mode was disabled because it requires linking against an NVIDIA PerfWorks library (`libnvperf_host.so`). This PR adds that link.

The change includes:
* Updates to the cmake build files to find `libnvperf_host.so` and set `CUDA_nvperf_host_LIBRARY`
* WIP: use the above cmake variable in Kineto; this PR will be updated after the Kineto PR https://github.com/pytorch/kineto/pull/724 has landed

## Example usage of CUPTI profiler
The code snippet below shows how to configure the PyTorch profiler in CUPTI Profiler mode. Any code included in the profiling window will be profiled by CUPTI/Kineto. Note how the `_ExperimentalConfig` struct is used to configure the profiler metrics.
```
    with torch.profiler.profile(
        activities=[torch.profiler.ProfilerActivity.CUDA],
        record_shapes=True,
        on_trace_ready=trace_handler,
        experimental_config=torch.profiler._ExperimentalConfig(
            profiler_metrics=[
                "kineto__tensor_core_insts",
                "dram__bytes_read.sum",
                "dram__bytes_write.sum"],
            profiler_measure_per_kernel=False),
    ) as prof:
        res = train_batch(modeldef)
        prof.step()
```
For a full example see this [xor.py](https://gist.github.com/briancoutinho/b1ec7919d8ea2bf1f019b4f4cd50ea80) gist.
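The snippet above passes a `trace_handler` callback, which is not defined in this PR. A minimal sketch of one is below; the output path and naming scheme are my own assumptions, not part of the PR:

```python
import os
import tempfile

def trace_handler(prof):
    # Invoked by torch.profiler.profile once a trace is ready.
    # Export a Chrome-trace JSON file; the path naming here is just an example.
    out_path = os.path.join(tempfile.gettempdir(), f"cupti_trace_{os.getpid()}.json")
    prof.export_chrome_trace(out_path)
    print(f"Wrote trace to {out_path}")
    return out_path
```

The exported file can then be opened in `chrome://tracing` or Perfetto.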

### Details of how to configure the CUPTI profiler
The `_ExperimentalConfig` structure can be used to pass metrics to the profiler:
```
   profiler_metrics : a list of CUPTI profiler metrics used
       to measure GPU performance events. Any metric supported by CUPTI can be used; see
       https://docs.nvidia.com/cupti/r_main.html#r_profiler
       There are two special alias metrics, `kineto__tensor_core_insts` and `kineto__cuda_core_flops`, for FLOPS counting.
   profiler_measure_per_kernel (bool) : whether to profile metrics per kernel
      or for the entire measurement duration.
```
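To make the two measurement modes concrete, here is a sketch of both configurations. This is a config fragment only; it assumes a CUDA build of PyTorch with Kineto, and the metric choices are just examples:

```python
import torch

# Per-kernel counters: one measurement per launched kernel (higher overhead).
per_kernel_cfg = torch.profiler._ExperimentalConfig(
    profiler_metrics=["kineto__cuda_core_flops"],
    profiler_measure_per_kernel=True,
)

# Whole-range counters: a single aggregate over the entire profiling window.
aggregate_cfg = torch.profiler._ExperimentalConfig(
    profiler_metrics=["dram__bytes_read.sum", "dram__bytes_write.sum"],
    profiler_measure_per_kernel=False,
)
```

Either config is passed via the `experimental_config` argument of `torch.profiler.profile`, as in the snippet above.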

## Testing
Built from source together with the Kineto [PR](https://github.com/pytorch/kineto/pull/724):
```
$> USE_CUDA=1 python setup.py install
--   CUDA_cupti_LIBRARY = /public/apps/cuda/11.6/extras/CUPTI/lib64/libcupti.so
--   CUDA_nvperf_host_LIBRARY = /public/apps/cuda/11.6/extras/CUPTI/lib64/libnvperf_host.so
```

Then run the example [xor.py](https://gist.github.com/briancoutinho/b1ec7919d8ea2bf1f019b4f4cd50ea80). Note that this works on V100 and newer GPUs only. Enable logging to aid debugging:
```
>$ export KINETO_LOG_LEVEL=1
>$ python xor.py
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiActivityProfiler.cpp:167] CUDA versions. CUPTI: 16; Runtime: 11060; Driver: 11040
  Log file: /tmp/libkineto_activities_1683060.json
  Trace start time: 2023-02-11 19:11:47  Trace duration: 500ms
  Warmup duration: 0s
  Max GPU buffer size: 128MB
  Enabled activities: cuda_profiler_range
Cupti Profiler metrics : kineto__tensor_core_insts, dram__bytes_read.sum, dram__bytes_write.sum
Cupti Profiler measure per kernel : 0
Cupti Profiler max ranges : 10
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiActivityProfiler.cpp:638] Enabling GPU tracing
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiActivityProfiler.cpp:567] Running child profiler CuptiRangeProfiler for 500 ms
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiRangeProfiler.cpp:104] Configuring 3 CUPTI metrics
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiRangeProfiler.cpp:109]    sm__inst_executed_pipe_tensor.sum
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiRangeProfiler.cpp:109]    dram__bytes_read.sum
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiRangeProfiler.cpp:109]    dram__bytes_write.sum
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiActivityProfiler.cpp:575] Running child profiler CuptiRangeProfiler for 500 ms
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiActivityProfiler.cpp:672] Tracing starting in 9s
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiActivityProfiler.cpp:677] Tracing will end in 10s
STAGE:2023-02-11 19:11:37 1683060:1683060 ActivityProfilerController.cpp:310] Completed Stage: Warm Up
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiActivityProfiler.cpp:693] Starting child profiler session
```
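Once the run completes, the exported Chrome trace can be checked for the counter events. A small stdlib-only sketch; the helper name is my own, but the `__cupti_profiler__` substring matches what the unit test in this PR looks for:

```python
import json

def count_cupti_profiler_events(trace_path):
    # Count trace events emitted by the CUPTI Range Profiler; Kineto tags
    # their names with "__cupti_profiler__".
    with open(trace_path) as f:
        trace = json.load(f)
    return sum(
        1
        for evt in trace.get("traceEvents", [])
        if "__cupti_profiler__" in evt.get("name", "")
    )
```

A count of zero is expected in environments where CUPTI lacks profiling privileges (e.g. unprivileged containers).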

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125685
Approved by: https://github.com/sraikund16
Commit 2deea9e6e9 by briancoutinho, 2024-05-08 02:34:28 +00:00, committed by PyTorch MergeBot (parent 9fedf41b60). 2 changed files with 58 additions and 0 deletions.


```
@@ -1841,6 +1841,8 @@ if(USE_KINETO)
    set(CUPTI_LIB_NAME "cupti.lib")
  endif()
  set(NVPERF_HOST_LIB_NAME "libnvperf_host.so")
  find_library(CUPTI_LIBRARY_PATH ${CUPTI_LIB_NAME} PATHS
      ${CUDA_SOURCE_DIR}
      ${CUDA_SOURCE_DIR}/extras/CUPTI/lib64
@@ -1855,13 +1857,27 @@ if(USE_KINETO)
      ${CUDA_SOURCE_DIR}/include
      NO_DEFAULT_PATH)
  find_library(NVPERF_HOST_LIBRARY_PATH ${NVPERF_HOST_LIB_NAME} PATHS
      ${CUDA_SOURCE_DIR}
      ${CUDA_SOURCE_DIR}/lib
      ${CUDA_SOURCE_DIR}/lib64
      ${CUDA_SOURCE_DIR}/extras/CUPTI/lib64
      NO_DEFAULT_PATH)
  if(CUPTI_LIBRARY_PATH AND CUPTI_INCLUDE_DIR)
    message(STATUS "  CUPTI_INCLUDE_DIR = ${CUPTI_INCLUDE_DIR}")
    set(CUDA_cupti_LIBRARY ${CUPTI_LIBRARY_PATH})
    message(STATUS "  CUDA_cupti_LIBRARY = ${CUDA_cupti_LIBRARY}")
    # CUPTI Range Profiler requires the NVPerf library
    # for configuring metrics
    if(NVPERF_HOST_LIBRARY_PATH)
      set(CUDA_nvperf_host_LIBRARY ${NVPERF_HOST_LIBRARY_PATH})
      message(STATUS "  CUDA_nvperf_host_LIBRARY = ${NVPERF_HOST_LIBRARY_PATH}")
    endif()
    message(STATUS "Found CUPTI")
    set(LIBKINETO_NOCUPTI OFF CACHE STRING "" FORCE)
    # I've only tested this sanity check on Linux; if someone
    # runs into this bug on another platform feel free to
    # generalize it accordingly
```


```
@@ -699,6 +699,48 @@ class TestProfiler(TestCase):
        if torch.cuda.is_available():
            check_metrics(stats, "device_memory_usage", deallocs=["[memory]"])

    @unittest.skipIf(not kineto_available(), "Kineto is required")
    @unittest.skipIf(not torch.cuda.is_available(), "CUDA is required")
    def test_kineto_cupti_range_profiler(self):
        """CUPTI provides a newer Profiling API from CUDA 10.0 that enables measuring
        performance events for the GPU. This is supported as an experimental pytorch profiler feature.
        Read more here https://docs.nvidia.com/cupti/r_main.html#r_profiler.
        """
        exp_config = _ExperimentalConfig(
            profiler_metrics=[
                # Metrics list at https://docs.nvidia.com/cupti/r_main.html#r_profiler
                # or use kineto__tensor_core_insts, kineto__cuda_core_flops
                "kineto__tensor_core_insts",
                "dram__bytes_read.sum",
                "dram__bytes_write.sum",
            ],
            profiler_measure_per_kernel=True,
        )
        with _profile(
            use_cuda=True, use_kineto=True, experimental_config=exp_config
        ) as p:
            self.payload(use_cuda=True)

        def check_trace(fname):
            with open(fname) as f:
                trace = json.load(f)
                self.assertTrue("traceEvents" in trace)
                events = trace["traceEvents"]
                found_cupti_profiler_events = False
                for evt in events:
                    self.assertTrue("name" in evt)
                    if "__cupti_profiler__" in evt["name"]:
                        found_cupti_profiler_events = True
                # PyTorch OSS CI runs in docker containers where the Range Profiler
                # does not have sufficient privilege level (CUPTI_ERROR_INSUFFICIENT_PRIVILEGES).
                # We can check that the profiler does not crash the job and the trace is not
                # malformed, however do not check the actual presence of data.
                self.assertTrue(1 or found_cupti_profiler_events)

        with TemporaryFileName(mode="w+") as fname:
            p.export_chrome_trace(fname)
            check_trace(fname)

    @unittest.skipIf(
        IS_JETSON, "Jetson has a guard against OOM since host and gpu memory are shared"
    )
```