Commit Graph

660 Commits

Author SHA1 Message Date
Jeff Daily
d401e4e70a [ROCm][CUDA] add unit test utility busy_wait_for_flag (#166218)
torch.cuda._busy_wait_for_flag() will launch a kernel that spins until a flag is set by a corresponding torch.cuda._clear_flag(). These **must** be run on separate streams or it will deadlock.

When used correctly these kernels will put work on the GPU that is more predictable than torch.cuda._sleep() in cases where the unit test is depending on the GPU being busy.

Fixes #120318.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166218
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-10-29 22:40:23 +00:00
Way Wang
4a94591321 filter out alloc-free pairs from trace plot (#165752)
Summary:
When dealing with a large memory trace, the resulting plot can be challenging to interpret and analyze.
This commit introduces a feature that enables filtering of allocations that have already been freed, providing a more focused view.
The remaining events in the plot often warrant closer examination, as they may be indicative of potential out-of-memory (OOM) issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165752
Approved by: https://github.com/zdevito
2025-10-29 12:44:54 +00:00
Maggie Moss
84fe848503 Fix pyrefly error syntax (2/n) (#166448)
Ensrues pyrefly ignores only silence one error code.

After this, only ~40 files left to clean up .

pyrefly check
lintrunner

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166448
Approved by: https://github.com/Skylion007
2025-10-29 00:36:40 +00:00
Yuanyuan Chen
a60d9e1f6d Fix flake8 B028 warnings (#166224)
This PR fixes flake8 B028 warning by specifying stacklevel=2 in `warnings.warn`. The advantage is that users can know more contextual information about PyTorch warnings.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166224
Approved by: https://github.com/ezyang
2025-10-26 06:18:55 +00:00
Maggie Moss
eb83c3ca23 Clean up unused Pyrefly suppressions (#166178)
Cleaning up ignores that are no longer needed in the repo and adding select suppressions so the main branch is clean.

test plan:
`lintrunner -a`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166178
Approved by: https://github.com/oulgen
2025-10-25 05:32:21 +00:00
Eddie Yan
e64a814ae7 [CUDA] Add experimental green context support for SM carveout (#159104)
Low-level PyTorch APIs should be usable/stable enough at this point but we might move the underlying driver API usage a bit from here...

Built on top of @drisspg 's branch

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159104
Approved by: https://github.com/ngimel, https://github.com/malfet, https://github.com/kwen2501

Co-authored-by: drisspg <drisspguessous@gmail.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-10-22 21:38:52 +00:00
Maggie Moss
0be0de4ffa Add type suppressions to _inductor/runtime (#165918)
Original PR that did this was reverted due to merge conflicts.

Trying it again

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165918
Approved by: https://github.com/oulgen
2025-10-21 02:54:22 +00:00
Yu, Guangye
b2f5c25b27 Introduce a generic API torch._C._accelerator_setAllocatorSettings (#165291)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165291
Approved by: https://github.com/albanD
ghstack dependencies: #165288, #165289
2025-10-19 15:34:36 +00:00
Shivam Raikundalia
a25a649e70 [Mem Snapshot] Add Metadata Field (#165490)
Summary:
The implementation adds the ability to:

Set custom metadata strings that will be attached to all subsequent allocations
Clear or change the metadata at any point
View the metadata in memory snapshots via _dump_snapshot()

Test Plan: Added test in test_cuda.py and check manually in snapshot to see that metadata was added.

Differential Revision: D84654933

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165490
Approved by: https://github.com/yushangdi
2025-10-17 23:46:02 +00:00
PyTorch MergeBot
11e2084308 Revert "[Mem Snapshot] Add Metadata Field (#165490)"
This reverts commit 5b3ea75895.

Reverted https://github.com/pytorch/pytorch/pull/165490 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](https://github.com/pytorch/pytorch/pull/165490#issuecomment-3413491091))
2025-10-17 02:01:53 +00:00
Shivam Raikundalia
5b3ea75895 [Mem Snapshot] Add Metadata Field (#165490)
Summary:
The implementation adds the ability to:

Set custom metadata strings that will be attached to all subsequent allocations
Clear or change the metadata at any point
View the metadata in memory snapshots via _dump_snapshot()

Test Plan: Added test in test_cuda.py and check manually in snapshot to see that metadata was added.

Differential Revision: D84654933

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165490
Approved by: https://github.com/yushangdi
2025-10-16 22:54:27 +00:00
Sarthak Tandon
66ea76ec44 [ROCm][tunableop] Improvements to tunableop Numerical Check (#163079)
Modified the flag PYTORCH_TUNABLEOP_NUMERICAL_CHECK, so that it accepts the numerical tolerances in the format atol_rtol as compared to the previous 0 and 1. Retains previous functionality with default values as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163079
Approved by: https://github.com/naromero77amd, https://github.com/jeffdaily
2025-10-15 22:26:47 +00:00
Sarthak Tandon
7f9b745494 [ROCm][tunableop] Modified Online Tuning Mode to add Instant Logging (#163965)
- Added instant logging in online tuning mode, so that each tuned GEMM is instantly written
- Allows us to have saved tuning configs, in cases of crashes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163965
Approved by: https://github.com/naromero77amd, https://github.com/jeffdaily
2025-10-15 20:02:31 +00:00
Raman Kumar
df26c51478 error message for instantiating CUDA Stream if CUDA not available (#159868)
Fixes #159744
Summary:
```
import torch

# Generate input data
input_tensor = torch.randn(3, 3)
stream = torch.cuda.Stream()

# Call the API
input_tensor.record_stream(stream)
```

⚠️ will now show an error message
`torch.cuda.Stream requires CUDA support`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159868
Approved by: https://github.com/malfet, https://github.com/isuruf
2025-10-11 23:21:35 +00:00
PyTorch MergeBot
1e42fde45e Revert "[CUDA] Add experimental green context support for SM carveout (#159104)"
This reverts commit 746fe78ecd.

Reverted https://github.com/pytorch/pytorch/pull/159104 on behalf of https://github.com/malfet due to Breaks Windows CD build ([comment](https://github.com/pytorch/pytorch/pull/159104#issuecomment-3378675515))
2025-10-07 20:51:22 +00:00
Eddie Yan
746fe78ecd [CUDA] Add experimental green context support for SM carveout (#159104)
Low-level PyTorch APIs should be usable/stable enough at this point but we might move the underlying driver API usage a bit from here...

Built on top of @drisspg 's branch

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159104
Approved by: https://github.com/ngimel

Co-authored-by: drisspg <drisspguessous@gmail.com>
2025-10-06 23:11:23 +00:00
Yuanyuan Chen
48b54b45d6 Replace pynvml with nvidia-ml-py in win-test.sh (#164681)
pynvml was deprecated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164681
Approved by: https://github.com/Aidyn-A, https://github.com/eqy
2025-10-06 21:57:26 +00:00
Yuanyuan Chen
35c4130fd1 [2/N] Fix ruff warnings (#164460)
Apply ruff `SIM` rules.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164460
Approved by: https://github.com/ezyang
2025-10-04 03:40:32 +00:00
Maggie Moss
f414aa8e0d Add pyrefly suppressions (3/n) (#164588)
Adds suppressions to pyrefly will typecheck clean: https://github.com/pytorch/pytorch/issues/163283

Test plan:
dmypy restart && python3 scripts/lintrunner.py -a
pyrefly check

step 1: uncomment lines in the pyrefly.toml file
step 2: run pyrefly check
step 3: add suppressions, clean up unused suppressions
before: https://gist.github.com/maggiemoss/bb31574ac8a59893c9cf52189e67bb2d

after:

 0 errors (1,970 ignored)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164588
Approved by: https://github.com/oulgen
2025-10-03 22:03:03 +00:00
PyTorch MergeBot
8ec8c14ace Revert "[CUDA] Add experimental green context support for SM carveout (#159104)"
This reverts commit 3c59351c6e.

Reverted https://github.com/pytorch/pytorch/pull/159104 on behalf of https://github.com/clee2000 due to failed lint, pyfmt not caught pyi file, I think they need special handling since theyre not in the changed files list? ([comment](https://github.com/pytorch/pytorch/pull/159104#issuecomment-3367077208))
2025-10-03 20:15:56 +00:00
Eddie Yan
3c59351c6e [CUDA] Add experimental green context support for SM carveout (#159104)
Low-level PyTorch APIs should be usable/stable enough at this point but we might move the underlying driver API usage a bit from here...

Built on top of @drisspg 's branch

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159104
Approved by: https://github.com/ngimel

Co-authored-by: drisspg <drisspguessous@gmail.com>
2025-10-03 18:59:12 +00:00
Yuanyuan Chen
315ffdc1e4 [4/N] Apply ruff UP035 rule to python code (#164206)
Follows #164104

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164206
Approved by: https://github.com/albanD
2025-10-01 19:05:53 +00:00
Yuanyuan Chen
e30f01b5b5 [1/N] Simplify "in" operation for containers of a single item (#164224)
These issues are detected by ruff [FURB171](https://docs.astral.sh/ruff/rules/single-item-membership-test/#single-item-membership-test-furb171).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164224
Approved by: https://github.com/rec, https://github.com/Skylion007
2025-09-30 19:59:43 +00:00
FFFrog
6ba83e06a5 [AMP] Add deprecated decorator for torch.xxx.amp.autocast class (#163654)
As the title stated.

**Changes:**
- torch.cuda.amp.autocast
- torch.cpu.amp.autocast
- add explicit `__new__` and `__init_subclass__` for those class above for inspect.signature to retrieve correct signature

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163654
Approved by: https://github.com/Skylion007
2025-09-27 14:37:12 +00:00
PyTorch MergeBot
112e204797 Revert "[CUDA] Compare major version of the runtime device arch against the built version of the pytorch binary (#161299)"
This reverts commit 7163dce1e0.

Reverted https://github.com/pytorch/pytorch/pull/161299 on behalf of https://github.com/nWEIdia due to Incorrectly suppressing useful warnings when running sm89 binary on sm86 ([comment](https://github.com/pytorch/pytorch/pull/161299#issuecomment-3335127621))
2025-09-25 17:13:32 +00:00
Wei Wang
7163dce1e0 [CUDA] Compare major version of the runtime device arch against the built version of the pytorch binary (#161299)
Fixes misleading warning messages when running on sm12x devices using binaries built with sm120.
PyTorch binary built with sm120 is compatible with e.g. sm121, so no need for the warning of incompatibility.

Also allow the 'matched_cuda_warn' message to show when e.g. the user is running a binary built with only sm90 on sm12x, so that the user would be prompted to get a build which supports e.g. sm120.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161299
Approved by: https://github.com/eqy, https://github.com/atalman
2025-09-24 23:59:19 +00:00
atalman
9d0d98acfe Use cuda nvrtc so file based on cuda version used by torch (#163642)
Fixes https://github.com/pytorch/pytorch/issues/162367

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163642
Approved by: https://github.com/msaroufim
2025-09-24 14:23:39 +00:00
Mark Saroufim
fc84743707 Implement CUDA stream protocol (#163614)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163614
Approved by: https://github.com/eqy
2025-09-23 21:02:08 +00:00
atalman
3c64b2abab CUDA 13.0 Warning update for supported architectures (#163585)
Please see build script: 8da008678f/.ci/manywheel/build_cuda.sh (L69-L71)

This should display correct warning:
``
Please install PyTorch with a following CUDA
configurations: 12.6 12.8 13.0 following instructions at
https://pytorch.org/get-started/locally/
``
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163585
Approved by: https://github.com/malfet
2025-09-23 11:27:11 +00:00
Valentin Andrei
bb5be56619 [torch][cuda][device_limits] Library for querying device hardware limits for flops and bandwidth (#162942)
In various benchmarks scattered across the repo, the limits for flops/second and memory bandwidth are usually hardcoded for a single device. This utility could help in providing a more structured way to query the device capabilities. If this is approved, we can use it when reporting flops efficiency and bandwidth relative to peak in the benchmarks and tests. The intent is to add more devices, more parameters (e.g. L2 cache bandwidth, NVLink, etc.) for both CPUs and accelerators.

Testing:

```
import torch

if torch.cuda.is_available():
    device = torch.cuda.current_device()
    mod = torch.get_device_module('cuda')
    hw = mod._device_limits.GPULimits(device)

    print(hw.get_tflops_per_second(torch.float16))
    print(hw.get_tflops_per_second(torch.float32))
    print(hw.get_tflops_per_second(torch.float64))
    print(hw.get_tflops_per_second(torch.bfloat16))
    print(hw.get_tflops_per_second(torch.int8))
    print(hw.get_memory_bandwidth_Bps() / 1e9)
    print(hw.get_shared_memory_bandwidth_Bps() / 1e9)

# Output on an H100 GPU
1070.53056
535.26528
66.90816
1070.53056
2141.06112
4893.696
33454.08
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162942
Approved by: https://github.com/ngimel, https://github.com/albanD
2025-09-23 04:48:19 +00:00
Yuanyuan Chen
60c2bdedcd Replace Literal[None] with None in typing (#163489)
This PR replaces Literal[None] with None in typing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163489
Approved by: https://github.com/Skylion007, https://github.com/mlazos
2025-09-22 22:10:08 +00:00
PyTorch MergeBot
4b7aed89d8 Revert "[torch][cuda][device_limits] Library for querying device hardware limits for flops and bandwidth (#162942)"
This reverts commit 627482a7b7.

Reverted https://github.com/pytorch/pytorch/pull/162942 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it needs some fixes for CUDA 13 ([comment](https://github.com/pytorch/pytorch/pull/162942#issuecomment-3308784448))
2025-09-18 17:49:16 +00:00
Xinya Zhang
e769026bcb [ROCm] Remove HIPBLASLT_ALLOW_TF32 from codebase (#162998)
A few UT failures are caused by `HIPBLASLT_ALLOW_TF32`

Fixes #157094
Fixes #157093
Fixes #157092
Fixes #157091
Fixes #157064
Fixes #157063
Fixes #157062
Fixes #157061
Fixes #157042
Fixes #157041
Fixes #157039
Fixes #157004

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162998
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-18 13:53:48 +00:00
vandrei
627482a7b7 [torch][cuda][device_limits] Library for querying device hardware limits for flops and bandwidth (#162942)
In various benchmarks scattered across the repo, the limits for flops/second and memory bandwidth are usually hardcoded for a single device. This utility could help in providing a more structured way to query the device capabilities. If this is approved, we can use it when reporting flops efficiency and bandwidth relative to peak in the benchmarks and tests. The intent is to add more devices, more parameters (e.g. L2 cache bandwidth, NVLink, etc.) for both CPUs and accelerators.

Testing:

```
import torch

if torch.cuda.is_available():
    device = torch.cuda.current_device()
    mod = torch.get_device_module('cuda')
    hw = mod._device_limits.GPULimits(device)

    print(hw.get_tflops_per_second(torch.float16))
    print(hw.get_tflops_per_second(torch.float32))
    print(hw.get_tflops_per_second(torch.float64))
    print(hw.get_tflops_per_second(torch.bfloat16))
    print(hw.get_tflops_per_second(torch.int8))
    print(hw.get_memory_bandwidth_Bps() / 1e9)
    print(hw.get_shared_memory_bandwidth_Bps() / 1e9)

# Output on an H100 GPU
1070.53056
535.26528
66.90816
1070.53056
2141.06112
4893.696
33454.08
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162942
Approved by: https://github.com/ngimel
2025-09-18 06:40:07 +00:00
Mark Saroufim
a89d5e97ec compile_kernel remove header_code arg (#163165)
We previously asked users to seperate these because we didn't have any way of adding extern C declarations. Now we don't and we don't need this confusing flag anymore

BC breaking but is fine for this API since it doesn't have major users yet. Please just put your all your code in `kernel_source` moving forward

## BC note
The header_code parameter has been removed from torch.cuda._compile_kernel. Previously, users could pass separate header code that would be prepended to the kernel source. Now, header code must be included directly in the kernel_source parameter.

Note this only affects torch.cuda._compile_kernel, which is a private API.

Example:

Before
```python
kernel = compile_kernel(
    kernel_source="global void my_kernel() { ... }",
    kernel_name="my_kernel",
    header_code="#define SCALE 2.0f\n__device_ float scale(float x) { return x * SCALE; }"
  )
```

After
```python
kernel_source = """
#define SCALE 2.0f
device float scale(float x) { return x * SCALE; }

global void my_kernel() { ... }
"""
kernel = _compile_kernel(kernel_source, "my_kernel")
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163165
Approved by: https://github.com/janeyx99, https://github.com/albanD
2025-09-17 19:47:32 +00:00
PyTorch MergeBot
66308fb470 Revert "[ROCm] Remove HIPBLASLT_ALLOW_TF32 from codebase (#162998)"
This reverts commit cef815dc2c.

Reverted https://github.com/pytorch/pytorch/pull/162998 on behalf of https://github.com/huydhn due to Sorry for reverting this, but it seems to break a test in trunk ([comment](https://github.com/pytorch/pytorch/pull/162998#issuecomment-3300280242))
2025-09-16 20:39:41 +00:00
joshuamarkovic
559e8d1c20 [doc]: Small typos (#162982)
Small typo fixes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162982
Approved by: https://github.com/ezyang, https://github.com/zou3519
2025-09-16 17:42:19 +00:00
jiannanWang
b6a48ff69f [BE] Add Documentation for Device APIs (#162834)
Added documentation for torch.cuda APIs.
Fixed docstring for xpu and mtia is_bf16_supported API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162834
Approved by: https://github.com/janeyx99

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-09-16 17:01:06 +00:00
Xinya Zhang
cef815dc2c [ROCm] Remove HIPBLASLT_ALLOW_TF32 from codebase (#162998)
A few UT failures are caused by `HIPBLASLT_ALLOW_TF32`

Fixes #157094, #157093, #157092, #157091, #157064, #157063, #157062, #157061, #157042, #157041, #157039, #157004

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162998
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-16 12:48:45 +00:00
Mark Saroufim
090e6838a0 compile_kernel enable pch (#162972)
Enabling automatic pre compiled headers per https://docs.nvidia.com/cuda/nvrtc/index.html#example-automatic-pch-cuda-12-8

I'm seeing large speedups in compilation times using PCH on average but the max compilation time with PCH is worst which is why I can't enable it by default. `load_inline()` also supports precompiled headers and does not enable them by default

```
Without PCH: 270.58 ms average
With PCH:    115.27 ms average
```

```
Without PCH: Max: 337.99 ms
With PCH: Max: 383.82 ms
```

```python
source) [marksaroufim@devgpu005]~/pytorch% python simple_pch_benchmark.py
============================================================
Simple PCH Compilation Benchmark
============================================================
Device: NVIDIA B200
Iterations: 100

Testing WITHOUT PCH:
------------------------------
Compiling kernel 100 times WITHOUT PCH...
  Completed 10/100 compilations
  Completed 20/100 compilations
  Completed 30/100 compilations
  Completed 40/100 compilations
  Completed 50/100 compilations
  Completed 60/100 compilations
  Completed 70/100 compilations
  Completed 80/100 compilations
  Completed 90/100 compilations
  Completed 100/100 compilations
Average: 270.58 ms (±6.99 ms)
Min: 264.09 ms
Max: 337.99 ms

Testing WITH PCH:
------------------------------
Compiling kernel 100 times WITH PCH...
  Completed 10/100 compilations
  Completed 20/100 compilations
  Completed 30/100 compilations
  Completed 40/100 compilations
  Completed 50/100 compilations
  Completed 60/100 compilations
  Completed 70/100 compilations
  Completed 80/100 compilations
  Completed 90/100 compilations
  Completed 100/100 compilations
Average: 115.27 ms (±27.32 ms)
Min: 110.65 ms
Max: 383.82 ms

```

## Benchmarking script

```python
#!/usr/bin/env python3
import argparse
import os
import sys
import time
from statistics import mean, stdev

import torch
from torch.cuda._utils import _nvrtc_compile

def benchmark_compilation(use_pch, iterations=100):
    """Compile the same kernel many times with or without PCH."""

    # CUB kernel that benefits from PCH
    kernel_source = """
    #include <cub/block/block_reduce.cuh>
    #include <cub/block/block_scan.cuh>
    #include <cub/warp/warp_reduce.cuh>

    extern "C"
    __global__ void test_kernel(const float* input, float* output, int n) {
        using BlockReduce = cub::BlockReduce<float, 256>;
        using BlockScan = cub::BlockScan<float, 256>;
        using WarpReduce = cub::WarpReduce<float>;

        __shared__ union {
            typename BlockReduce::TempStorage reduce;
            typename BlockScan::TempStorage scan;
            typename WarpReduce::TempStorage warp[8];
        } temp_storage;

        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        float val = (idx < n) ? input[idx] : 0.0f;

        float sum = BlockReduce(temp_storage.reduce).Sum(val);
        __syncthreads();

        float scan_result;
        BlockScan(temp_storage.scan).ExclusiveSum(val, scan_result);
        __syncthreads();

        int warp_id = threadIdx.x / 32;
        float warp_sum = WarpReduce(temp_storage.warp[warp_id]).Sum(val);

        if (threadIdx.x == 0) {
            output[blockIdx.x] = sum + scan_result + warp_sum;
        }
    }
    """

    device = torch.cuda.current_device()
    major, minor = torch.cuda.get_device_capability(device)
    compute_capability = f"{major}{minor}"

    compile_times = []

    print(
        f"Compiling kernel {iterations} times {'WITH' if use_pch else 'WITHOUT'} PCH..."
    )

    for i in range(iterations):
        # Use unique kernel name to avoid caching between iterations
        kernel_name = f"test_kernel_{i}"
        unique_source = kernel_source.replace("test_kernel", kernel_name)

        start = time.perf_counter()

        ptx, mangled_name = _nvrtc_compile(
            unique_source,
            kernel_name,
            compute_capability,
            header_code="",
            nvcc_options=["-std=c++17"],
            auto_pch=use_pch,
        )

        elapsed = time.perf_counter() - start
        compile_times.append(elapsed * 1000)  # Convert to ms

        # Progress indicator
        if (i + 1) % 10 == 0:
            print(f"  Completed {i + 1}/{iterations} compilations")

    return compile_times

def main():
    parser = argparse.ArgumentParser(description="Simple PCH Compilation Benchmark")
    parser.add_argument("--pch", action="store_true", help="Test with PCH only")
    parser.add_argument("--no-pch", action="store_true", help="Test without PCH only")
    parser.add_argument(
        "--iterations", type=int, default=100, help="Number of compilations"
    )
    args = parser.parse_args()

    print("=" * 60)
    print("Simple PCH Compilation Benchmark")
    print("=" * 60)
    print(f"Device: {torch.cuda.get_device_name()}")
    print(f"Iterations: {args.iterations}")
    print()

    # Determine what to test
    test_both = not args.pch and not args.no_pch

    results = {}

    # Test without PCH
    if args.no_pch or test_both:
        print("Testing WITHOUT PCH:")
        print("-" * 30)
        times_no_pch = benchmark_compilation(use_pch=False, iterations=args.iterations)

        if times_no_pch:
            avg_no_pch = mean(times_no_pch)
            std_no_pch = stdev(times_no_pch) if len(times_no_pch) > 1 else 0
            print(f"Average: {avg_no_pch:.2f} ms (±{std_no_pch:.2f} ms)")
            print(f"Min: {min(times_no_pch):.2f} ms")
            print(f"Max: {max(times_no_pch):.2f} ms")
            results["no_pch"] = avg_no_pch
        print()

    # Test with PCH
    if args.pch or test_both:
        print("Testing WITH PCH:")
        print("-" * 30)
        times_with_pch = benchmark_compilation(
            use_pch=True, iterations=args.iterations
        )

        if times_with_pch:
            avg_with_pch = mean(times_with_pch)
            std_with_pch = stdev(times_with_pch) if len(times_with_pch) > 1 else 0
            print(f"Average: {avg_with_pch:.2f} ms (±{std_with_pch:.2f} ms)")
            print(f"Min: {min(times_with_pch):.2f} ms")
            print(f"Max: {max(times_with_pch):.2f} ms")
            results["pch"] = avg_with_pch
        print()

if __name__ == "__main__":
    main()

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162972
Approved by: https://github.com/albanD, https://github.com/janeyx99
2025-09-15 22:55:39 +00:00
Thien Tran
84186c39ed [NVRTC] Enable compiling templated kernels (#162875)
Per NVRTC doc - https://docs.nvidia.com/cuda/nvrtc/index.html#accessing-lowered-names, we can compile a templated kernel (e.g. `kernel<float>`) with the following steps

NVRTC side
- (new) `nvrtcAddNameExpression` -> C++ template e.g. `f<float>`
- `nvrtcCompileProgram`
- (new) `nvrtcGetLoweredName` -> get mangled name. need to do a copy since later this string is freed after NVRTC program is destroyed
- `nvrtcDestroyProgram`

CUDA side
- use mangled name instead of normal name -> profit
- `extern "C"` is not even needed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162875
Approved by: https://github.com/msaroufim
2025-09-14 06:17:36 +00:00
Aaryaman Vasishta
4a757e1e17 [ROCm] Support torch.cuda._compile_kernel (#162510)
Supports `torch.cuda._compile_kernel` on ROCm. Related to https://github.com/pytorch/pytorch/pull/151484
Tested on Windows with gfx1201. Testing on Linux pending.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162510
Approved by: https://github.com/mycpuorg, https://github.com/msaroufim
2025-09-12 00:18:47 +00:00
Mark Saroufim
7345454e2e compile_kernel: Handle python floats as c double (#162626)
This was an open todo in the code and probably a footgun in waiting

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162626
Approved by: https://github.com/malfet
2025-09-11 06:03:25 +00:00
Mark Saroufim
12e993f533 compile_kernel large shared memory fix (#162647)
Alternate solution to https://github.com/pytorch/pytorch/pull/162328

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162647
Approved by: https://github.com/eqy
2025-09-11 05:52:46 +00:00
Mark Saroufim
4fd2a2b273 Add cuda headers automatically for compile_kernel (#162634)
Issue was pointed out before by @ngimel and more recently by https://gau-nernst.github.io/nvrtc-matmul/#missing-cuda-and-c-headers- by @gau-nernst

Benefit is now we can add

`#include <cuda_fp16.h>` without crapping out
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162634
Approved by: https://github.com/ngimel
2025-09-11 00:20:33 +00:00
Yu, Guangye
c03d8d4082 Revert "Generalize torch._C._set_allocator_settings to be generic (#156175)" (#161626)
This reverts commit 908c5cc4c0.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161626
Approved by: https://github.com/atalman
ghstack dependencies: #161625
2025-08-27 21:37:14 +00:00
Daniel Galvez
cf94cadbee [CUDAGraph] Add getter for cuda graph exec (#161294)
This is far simpler than #155164 since we never destroy the cudaGraphExec_t.

The request comes from TRT-LLM specifically. The motivation is that some power users would like to mutate specific kernel parameters via APIs like `cudaGraphExec*SetParams` after a cuda graph has been instantiated. For example, a common request has been to be able to change the sequence length of attention kernels, after having captured a graph for the largest possible sequence length. It turns out that the host overhead you eliminate via cuda graphs in LLM inference ends up causing an increase in computation time when you size your kernels to the maximum possible sequence length (which I believe is done in both TRT-LLM and vLLM). Attention is the most problematic kernel because its computation time is quadratic in the sequence length, rather than linear.

This can work if your attention kernel can work for arbitrary shapes (this is not the case for all attention implementations! Many of them specialize with templates), and you have a persistent kernel that allocates only as many blocks as you have SM's (so you don't have to figure out how many blocks to allocate for a specific sequence length). Using a conditional SWITCH node is a better generic approach to this problem, but that requires more infrastructure work.

Note that this requires knowledge of the exact location of the value in your kernel's parameter buffer to mutate. It won't work with arbitrary stream capture code whose kernels you don't know before hand. So I expect this code path to be rarely used.

Testing:

```
pytest -s -k raw_graph_exec test/test_cuda.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161294
Approved by: https://github.com/ngimel, https://github.com/BoyuanFeng, https://github.com/eellison, https://github.com/eqy
2025-08-25 20:57:37 +00:00
Natalia Gimelshein
726dce3c94 [nccl symm mem] don't use arg for mempool, correctly use symmetric registration in hooks (#161238)
Per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161238
Approved by: https://github.com/kwen2501, https://github.com/syed-ahmed
2025-08-25 03:09:32 +00:00
FFFrog
284b719005 Remove the uncessary empty file (#160728)
As the title stated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160728
Approved by: https://github.com/Skylion007
2025-08-19 10:54:08 +00:00
Yu, Guangye
84f7e88aef Add unified memory APIs for torch.accelerator (#152932)
# Motivation
The following API will be put under torch.accelerator
- empty_cache
- max_memory_allocated
- max_memory_reserved
- memory_allocated
- memory_reserved
- memory_stats
- reset_accumulated_memory_stats
- reset_peak_memory_stats

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152932
Approved by: https://github.com/albanD
ghstack dependencies: #138222
2025-08-08 17:41:22 +00:00