Commit Graph

212 Commits

Author SHA1 Message Date
Nikita Shulga
6b0c6f2856 [BE] Delete pre-CUDA-10.1 code from SparseCUDABlas (#155079)
As the latest PyTorch is no longer buildable against CUDA-10, this is essentially dead code.

Made a small change to the hipify script to rename `cusparseGetErrorString` to `hipsparseGetErrorString`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155079
Approved by: https://github.com/atalman, https://github.com/cyyever
2025-06-04 03:29:24 +00:00
Peter Y. Yeh
43390d8b13 ROCm Sparsity through HipSparseLT (#150578)
TLDR:

- This pull request introduces support for hipSPARSELt in ROCm; the current use case is semi-structured sparsity (see the sketch below).
- Requires **ROCm 6.4** && **gfx942/gfx950**.
- The average performance uplift (compared to dense operations) is ~20% in ROCm 6.4, with further performance lifts expected along the way.
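
A minimal sketch (not from the PR) of exercising 2:4 semi-structured sparsity via `torch.sparse`, assuming a ROCm 6.4 build on gfx942/gfx950 where hipSPARSELt backs the sparse kernels:

```python
import torch
from torch.sparse import to_sparse_semi_structured

# Weight with a 2:4 pattern: 2 of every 4 elements along each row are kept.
w = torch.Tensor([0, 0, 1, 1]).tile((3072, 192)).half().cuda()  # 3072 x 768
w_sparse = to_sparse_semi_structured(w)

x = torch.rand(768, 128).half().cuda()
y = torch.mm(w_sparse, x)  # dispatches to hipSPARSELt on supported builds
```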

### Dense vs. Sparse Performance Comparison

#### **NT (Row-major)**
**Average Uplift**: `1.20`

| M     | N      | K      | hipsparselt-bench (us) | hipblaslt-bench get all (us) | Uplift |
|-------|--------|--------|-------------------------|-------------------------------|--------|
| 14336 | 8      | 4096   | 20.05                   | 25.3                          | 1.26   |
| 4096  | 8      | 14336  | 21.07                   | 25.28                         | 1.20   |
| 3072  | 3072   | 10240  | 299.05                  | 351.82                        | 1.18   |
| 3072  | 1536   | 768    | 18.56                   | 20.05                         | 1.08   |
| 3072  | 17664  | 768    | 163.13                  | 173.91                        | 1.07   |
| 3072  | 196608 | 768    | 1717.30                 | 1949.63                       | 1.14   |
| 3072  | 24576  | 768    | 206.84                  | 242.98                        | 1.17   |
| 3072  | 6144   | 768    | 53.90                   | 56.88                         | 1.06   |
| 3072  | 98304  | 768    | 833.77                  | 962.28                        | 1.15   |
| 768   | 1536   | 768    | 8.53                    | 19.65                         | 2.30   |
| 768   | 17664  | 768    | 46.02                   | 46.84                         | 1.02   |
| 768   | 196608 | 768    | 463.15                  | 540.46                        | 1.17   |
| 768   | 24576  | 768    | 54.32                   | 59.55                         | 1.10   |
| 768   | 6144   | 768    | 19.47                   | 20.15                         | 1.03   |
| 768   | 98304  | 768    | 231.88                  | 258.73                        | 1.12   |

---

#### **NN (Row-major)**
**Average Uplift**: `1.13`

| M   | N      | K     | hipsparselt-bench (us) | hipblaslt-bench get all (us) | Uplift |
|-----|--------|-------|-------------------------|-------------------------------|--------|
| 768 | 1536   | 3072  | 27.50                   | 28.78                         | 1.05   |
| 768 | 17664  | 3072  | 125.06                  | 158.94                        | 1.27   |
| 768 | 196608 | 3072  | 1568.38                 | 1767.12                       | 1.13   |
| 768 | 24576  | 3072  | 171.05                  | 203.49                        | 1.19   |
| 768 | 6144   | 3072  | 58.72                   | 60.39                         | 1.03   |
| 768 | 98304  | 3072  | 787.15                  | 887.60                        | 1.13   |

-------------------------

This pull request introduces support for hipSPARSELt in ROCm, alongside various updates and improvements to the codebase and test suite. The changes primarily involve adding configuration flags, updating conditional checks, and ensuring compatibility with hipSPARSELt.

### ROCm and hipSPARSELt Support:

* [`BUILD.bazel`](diffhunk://#diff-7fc57714ef13c3325ce2a1130202edced92fcccc0c6db34a72f7b57f60d552a3R292): Added `@AT_HIPSPARSELT_ENABLED@` substitution to enable hipSPARSELt support.
* [`aten/CMakeLists.txt`](diffhunk://#diff-0604597797bb21d7c39150f9429d6b2ace10b79ab308514ad03f76153ae8249bR104-R110): Introduced a conditional flag to enable hipSPARSELt support based on ROCm version.
* [`aten/src/ATen/CMakeLists.txt`](diffhunk://#diff-ce80f3115ab2f6be5142f0678a1fc92c6b2d7727766ce44f48726c99e720f777R37): Added `AT_HIPSPARSELT_ENABLED` configuration.
* [`aten/src/ATen/cuda/CUDAConfig.h.in`](diffhunk://#diff-8bb82da825ca87c28233abacffa1b0566c73a54990b7a77f3f5108d3718fea15R11): Defined `AT_HIPSPARSELT_ENABLED` macro.
* `caffe2/CMakeLists.txt`, `cmake/Dependencies.cmake`, `cmake/public/LoadHIP.cmake`: Included hipSPARSELt in the ROCm dependencies. [[1]](diffhunk://#diff-c5ee05f1e918772792ff6f2a3f579fc2f182e57b1709fd786ef6dc711fd68b27R1380) [[2]](diffhunk://#diff-12e8125164bbfc7556b1781a8ed516e333cc0bf058acb7197f7415be44606c72L1084-R1084) [[3]](diffhunk://#diff-b98e27b9a5f196a6965a99ee5a7bb15b3fc633d6375b767635b1b04ccb2fd3d5R153)

### Codebase Updates:

* [`aten/src/ATen/native/sparse/cuda/cuSPARSELtOps.cpp`](diffhunk://#diff-ae921dd1584ab98fdd9c25a3521047795de702223f5b65fdaa45a5bd92b4d1f3R1-R6): Added hipSPARSELt support checks and initialization functions. Updated various methods to conditionally handle hipSPARSELt. [[1]](diffhunk://#diff-ae921dd1584ab98fdd9c25a3521047795de702223f5b65fdaa45a5bd92b4d1f3R1-R6) [[2]](diffhunk://#diff-ae921dd1584ab98fdd9c25a3521047795de702223f5b65fdaa45a5bd92b4d1f3R22-R67) [[3]](diffhunk://#diff-ae921dd1584ab98fdd9c25a3521047795de702223f5b65fdaa45a5bd92b4d1f3R78-R85) [[4]](diffhunk://#diff-ae921dd1584ab98fdd9c25a3521047795de702223f5b65fdaa45a5bd92b4d1f3R97-R109) [[5]](diffhunk://#diff-ae921dd1584ab98fdd9c25a3521047795de702223f5b65fdaa45a5bd92b4d1f3R183-R188) [[6]](diffhunk://#diff-ae921dd1584ab98fdd9c25a3521047795de702223f5b65fdaa45a5bd92b4d1f3L134-R200) [[7]](diffhunk://#diff-ae921dd1584ab98fdd9c25a3521047795de702223f5b65fdaa45a5bd92b4d1f3R213-R222) [[8]](diffhunk://#diff-ae921dd1584ab98fdd9c25a3521047795de702223f5b65fdaa45a5bd92b4d1f3L217-R285)

### Test Suite Updates:

* [`test/test_sparse_semi_structured.py`](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR50-R65): Added checks for hipSPARSELt availability and updated test conditions to skip tests not supported on ROCm. [[1]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR50-R65) [[2]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR228) [[3]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR239) [[4]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR250) [[5]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR579) [[6]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR624) [[7]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR661) [[8]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR695) [[9]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR730) [[10]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR755) [[11]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR771) [[12]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR809) [[13]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR844) [[14]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cL840-R854) [[15]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR1005)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150578
Approved by: https://github.com/jeffdaily
2025-05-31 02:03:40 +00:00
Dmitry Nikolaev
f419067e50 [ROCm] improve sparse addmm, enable complex (#153262)
PR to:
- enable complex data types for sparse matmul on ROCm
- fix sparse addmm/baddbmm on ROCm
- fix sparse hipification for ROCm
- fix/enable sparse tests on ROCm (~40 tests total):
```
test_sparse_csr.py::TestSparseCSRCUDA::test_bmm_cuda_*
test_sparse.py::TestSparseCUDA::test_sparse_matmul_cuda_*
test_sparse_csr.py::TestSparseCSRCUDA::test_mm_cuda_float64
test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_all_sparse_csr_SparseCS*
test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_*
```
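
A minimal sketch (assumptions, not from the PR) of the operations these tests exercise, namely complex sparse matmul and sparse addmm, on a ROCm device (exposed as `cuda`):

```python
import torch

a = torch.randn(8, 8, dtype=torch.complex64, device="cuda").to_sparse()
b = torch.randn(8, 8, dtype=torch.complex64, device="cuda").to_sparse()
c = torch.sparse.mm(a, b)  # complex sparse @ sparse, enabled on ROCm here

m = torch.randn(8, 8, device="cuda")
csr = torch.randn(8, 8, device="cuda").to_sparse_csr()
out = torch.addmm(m, csr, m)  # sparse addmm path fixed on ROCm
```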

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153262
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony
2025-05-19 22:23:18 +00:00
Zhe Qu
1e9666b32d Add cudaLaunchKernel to cuda_to_hip_mappings (#153690)
Summary: as $title
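
A sketch of checking the new entry, assuming the mapping tables live at their usual spot in `torch.utils.hipify.cuda_to_hip_mappings`:

```python
from torch.utils.hipify.cuda_to_hip_mappings import CUDA_TO_HIP_MAPPINGS

for table in CUDA_TO_HIP_MAPPINGS:
    if "cudaLaunchKernel" in table:
        print(table["cudaLaunchKernel"])  # expected: ("hipLaunchKernel", ...)
```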

Test Plan:
Used in D74789639

Rollback Plan:

Reviewed By: cenzhaometa

Differential Revision: D74789639

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153690
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-05-16 23:37:11 +00:00
Yulun Wang
3aa84775e7 [hipify] Replace cuda error cudaErrorContextIsDestroyed (#153576)
Summary: The CUDA symbol `cudaErrorContextIsDestroyed` is not converted to `hipErrorContextIsDestroyed`. Add this conversion.

Test Plan: CI

Differential Revision: D74542735

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153576
Approved by: https://github.com/xw285cornell, https://github.com/cyyever
2025-05-16 16:19:42 +00:00
Tristan Rice
d1dd2c1fc8 gloo: cuda (#153406)
This enables Gloo CUDA when used with a backend that supports GPUDirect, which currently is only the IBVERBS backend.

This requires some changes to Gloo which are in https://github.com/pytorch/gloo/pull/441

Since we're now depending on gloo_cuda, we need to split ProcessGroupGloo into two pieces: one with the CPU bits (libtorch_cpu) and one with the CUDA kernels (libtorch_cuda). This unfortunately requires some major refactoring, as some CPU code is shared across both.

The gloo submodule is updated to depend on the new Gloo changes.

Test plan:

```py
import os
import time

transport = "TCP"
#transport = "IBVERBS"

os.environ["GLOO_DEVICE_TRANSPORT"] = transport
rank = int(os.environ["RANK"])
os.environ["CUDA_VISIBLE_DEVICES"] = str(rank)

ibv = "mlx5_0:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_9:1,mlx5_10:1,mlx5_11:1".split(",")[rank]
ibv_name, ibv_port = ibv.split(":")
os.environ["TORCH_GLOO_IBV_NAME"] = ibv_name
os.environ["TORCH_GLOO_IBV_PORT"] = ibv_port
os.environ["TORCH_GLOO_IBV_INDEX"] = "3"

import torch
import torch.distributed as dist

dist.init_process_group("gloo")

rank = dist.get_rank()

# initial sanity check
#device = "cpu"
#t = torch.zeros(10, device=device)
#dist.all_reduce(t)
#print("sanity complete")

device = "cpu"

iters = 10
warmup_iters = 2

for nelem in [10, 100, 1000, 10000, 100000, 1000000, 10000000, 100000000]:
    t = torch.zeros(nelem, device=device)

    torch.cuda.current_stream().synchronize()
    for i in range(warmup_iters):
        dist.all_reduce(t)

    torch.cuda.current_stream().synchronize()

    start = time.perf_counter()

    for i in range(iters):
        dist.all_reduce(t)

    torch.cuda.current_stream().synchronize()

    dur = (time.perf_counter() - start)
    qps = iters/dur

    bandwidth_gb = t.nbytes * iters / dur / 1e9

    gb = t.nbytes / 1e9

    if rank == 0:
        print(f"{transport=} {device=} {iters=} {nelem=} {qps=} {gb=} {bandwidth_gb=}\n", end="")
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153406
Approved by: https://github.com/fduwjj
2025-05-16 01:13:13 +00:00
Anthony Shoumikhin
7d39e73c57 Fix more URLs (#153277)
Or ignore them.
Found by running the lint_urls.sh script locally with https://github.com/pytorch/pytorch/pull/153246

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153277
Approved by: https://github.com/malfet
2025-05-14 16:23:50 +00:00
Benson Ma
192f7140d1 [fbgemm_gpu] Replace C10_CUDA_KERNEL_LAUNCH_CHECK() in the KernelLauncher (#153178)
Summary:
- Replace `C10_CUDA_KERNEL_LAUNCH_CHECK()` in the `KernelLauncher`, as the
  latter does not print `__FILE__` and `__LINE__`

The existing `C10_CUDA_KERNEL_LAUNCH_CHECK()` implementation does not print the source file and line number when a CUDA kernel launch throws an error, leaving users confused with a context-less message like `CUDA error: invalid arguments`.  This new check is a slimmed re-implementation of the macro with extra context information added to the error (beyond just file and line number) so that we can at least locate the FBGEMM source file or template where the error first surfaces.

Test Plan:
```
buck2 run 'fbcode//mode/opt' fbcode//deeplearning/fbgemm/fbgemm_gpu/test/utils:kernel_launcher

buck2 run 'fbcode//mode/opt-amd-gpu' fbcode//deeplearning/fbgemm/fbgemm_gpu/test/utils:kernel_launcher
```

Reviewed By: sryap

Differential Revision: D74364031

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153178
Approved by: https://github.com/atalman, https://github.com/huydhn
2025-05-09 17:43:16 +00:00
Anthony Shoumikhin
e2f9759bd0 Fix broken URLs (#152237)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152237
Approved by: https://github.com/huydhn, https://github.com/malfet
2025-04-27 09:56:42 +00:00
Benson Ma
2180e87d7c [fbgemm_gpu] Incorporate Torch DSA (#151148)
Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/1035

X-link: https://github.com/pytorch/FBGEMM/pull/3950

- Incorporate the PyTorch DSA infrastructure into the FBGEMM kernel launcher
  utility

Test Plan:
```
# Nvidia
buck2 test 'fbcode//mode/opt' fbcode//deeplearning/fbgemm/fbgemm_gpu/test/utils:tensor_accessor_builder
buck2 test 'fbcode//mode/opt' fbcode//deeplearning/fbgemm/fbgemm_gpu/test/utils:tensor_accessor_builder_with_memcheck
buck2 run 'fbcode//mode/opt'  -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=a100  -c fbcode.platform=platform010 fbcode//deeplearning/fbgemm/fbgemm_gpu/test/utils:kernel_launcher

# AMD
buck2 run mode/opt-amd-gpu -c fbcode.platform=platform010 fbcode//deeplearning/fbgemm/fbgemm_gpu/test/utils:tensor_accessor_builder_with_memcheck
buck2 run mode/opt-amd-gpu -c fbcode.platform=platform010 fbcode//deeplearning/fbgemm/fbgemm_gpu/test/utils:kernel_launcher
buck2 run mode/opt-amd-gpu -c fbcode.platform=platform010 fbcode//deeplearning/fbgemm/fbgemm_gpu/test/tbe:split_embeddings_utils
```

Differential Revision: D72759030

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151148
Approved by: https://github.com/huydhn
2025-04-15 11:34:04 +00:00
Hollow Man
0692301e25 Catch OSError in general when writing files (#149464)
Redundant exception types in `except (PermissionError, OSError):`.  Write `except OSError:`, which catches exactly the same exceptions.

https://github.com/pytorch/pytorch/actions/runs/13935844871/job/39141062991

When hipifying files or writing cProfile files, catching PermissionError alone is not enough: the file may be in a location that is not writable at all, or other OS errors may occur while writing files.

This fix makes the code more robust.
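
A minimal sketch of the change (hypothetical path and contents): `PermissionError` is a subclass of `OSError`, so `except OSError` covers it plus read-only filesystems and other OS-level failures.

```python
contents = "// hipified output\n"
try:
    with open("/some/read-only/place/out_hip.cuh", "w") as fout:
        fout.write(contents)
except OSError as e:  # previously: except (PermissionError, OSError)
    print(f"skipping write: {e}")
```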

Example error log:
```log
  File "deepspeed/ops/adam/fused_adam.py", line 94, in __init__
    fused_adam_cuda = FusedAdamBuilder().load()
                      ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "deepspeed/ops/op_builder/builder.py", line 540, in load
    return self.jit_load(verbose)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "deepspeed/ops/op_builder/builder.py", line 587, in jit_load
    op_module = load(name=self.name,
                ^^^^^^^^^^^^^^^^^^^^
  File "torch/utils/cpp_extension.py", line 1597, in load
    return _jit_compile(
           ^^^^^^^^^^^^^
  File "torch/utils/cpp_extension.py", line 2031, in _jit_compile
    hipify_result = hipify_python.hipify(
                    ^^^^^^^^^^^^^^^^^^^^^
  File "torch/utils/hipify/hipify_python.py", line 1167, in hipify
    preprocess_file_and_save_result(output_directory, filepath, all_files, header_include_dirs,
  File "torch/utils/hipify/hipify_python.py", line 213, in preprocess_file_and_save_result
    result = preprocessor(output_directory, filepath, all_files, header_include_dirs, stats,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "torch/utils/hipify/hipify_python.py", line 940, in preprocessor
    output_source = RE_QUOTE_HEADER.sub(mk_repl('#include "{0}"', True), output_source)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "torch/utils/hipify/hipify_python.py", line 919, in repl
    preprocess_file_and_save_result(output_directory,
  File "torch/utils/hipify/hipify_python.py", line 213, in preprocess_file_and_save_result
    result = preprocessor(output_directory, filepath, all_files, header_include_dirs, stats,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "torch/utils/hipify/hipify_python.py", line 986, in preprocessor
    with clean_ctx.open(fout_path, 'w', encoding='utf-8') as fout:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "torch/utils/hipify/hipify_python.py", line 123, in open
    return open(fn, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: [Errno 30] Read-only file system: 'deepspeed/ops/csrc/adam/multi_tensor_apply_hip.cuh'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149464
Approved by: https://github.com/janeyx99
2025-03-21 02:42:50 +00:00
Ethan Wee
6cbf97ede8 [ROCm] enable HIPMallocAsyncAllocator (#149145)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149145
Approved by: https://github.com/izaitsevfb

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-03-19 23:42:35 +00:00
PyTorch MergeBot
e1d143cb7b Revert "[ROCm] enable HIPMallocAsyncAllocator (#149145)"
This reverts commit ee1a2b7810.

Reverted https://github.com/pytorch/pytorch/pull/149145 on behalf of https://github.com/izaitsevfb due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/149145#issuecomment-2738115728))
2025-03-19 21:12:13 +00:00
Ethan Wee
ee1a2b7810 [ROCm] enable HIPMallocAsyncAllocator (#149145)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149145
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-03-19 03:59:55 +00:00
tvukovic-amd
268de64005 [ROCm][Windows] Enable torchvision build with ROCm on Windows (#147382)
- Updated HIP flags for Windows (removed non Windows flags on Windows case, added runtime library)
- Set hipcc call for Windows case
- Removed CUDA flags (not used in ROCm) on Windows
- Updated Windows compiler (added case when using ROCm on Windows)
- Fixed path issue in hipify_python

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147382
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-03-18 23:37:05 +00:00
PyTorch MergeBot
9d37b501db Revert "[ROCm] enable HIPMallocAsyncAllocator (#149145)"
This reverts commit 2e02c07a5d.

Reverted https://github.com/pytorch/pytorch/pull/149145 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally.  @albanD, might you be able to help get this PR landed? See D71214814 for more details on the failure. To validate the fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/149145#issuecomment-2730104736))
2025-03-17 16:17:02 +00:00
Ethan Wee
2e02c07a5d [ROCm] enable HIPMallocAsyncAllocator (#149145)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149145
Approved by: https://github.com/jeffdaily
2025-03-14 18:21:27 +00:00
Peter Yeh
81dccd706b [ROCm] OCP FP8 Support for new GPUs (#146632)
TLDR: Follow-up to / builds on top of https://github.com/pytorch/pytorch/pull/144476; adds OCP FP8 support for gfx950.
Refer to https://github.com/pytorch/ao/pull/1677

This pull request includes several changes to improve compatibility and support for new GPU architectures and data types, particularly for ROCm. The key updates involve adding support for new ROCm versions and GPU architectures, updating data type handling, and removing outdated checks.

### Improvements to GPU Architecture and ROCm Version Support:
* [`aten/src/ATen/Context.cpp`](diffhunk://#diff-33de472d304acbe57d693c8567370c638068bedc1aa0ce8e9dc115dad05a7810L323-R326): Added support for new GPU architectures `gfx1200`, `gfx1201`, and `gfx950` based on ROCm version checks.
* [`aten/src/ATen/native/cuda/Blas.cpp`](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abL196-R199): Updated architecture support in multiple functions to include `gfx1200`, `gfx1201`, and `gfx950` based on ROCm version checks. [[1]](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abL196-R199) [[2]](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abL865-R876)

### Updates to Data Type Handling:
* [`aten/src/ATen/cuda/CUDADataType.h`](diffhunk://#diff-9188bb13b1a49f459141f5f9b875593d1c5ce2beb5ad711fdbaf5bc7089ec015L81-L98): Enhanced data type conversion to include new float8 types for both CUDA and ROCm environments.
* [`aten/src/ATen/cuda/tunable/GemmHipblaslt.h`](diffhunk://#diff-bfa1a3b5d4bef1892bf50338775f3b0fd8cd31fc1868148f3968b98aefb68e3fL29-R80): Updated `HipDataTypeFor` template to handle new float8 types and added hard-coded enum values for ROCm versions prior to 6.3.

### Removal of Outdated Checks:
* [`cmake/public/LoadHIP.cmake`](diffhunk://#diff-b98e27b9a5f196a6965a99ee5a7bb15b3fc633d6375b767635b1b04ccb2fd3d5L169-L197): Removed the check for `HIP_NEW_TYPE_ENUMS` as it is no longer necessary with the updated ROCm versions. [[1]](diffhunk://#diff-b98e27b9a5f196a6965a99ee5a7bb15b3fc633d6375b767635b1b04ccb2fd3d5L169-L197) [[2]](diffhunk://#diff-b98e27b9a5f196a6965a99ee5a7bb15b3fc633d6375b767635b1b04ccb2fd3d5L211-R182)

These changes ensure better compatibility and performance on newer hardware and software environments, particularly for users leveraging ROCm and CUDA for deep learning and scientific computing tasks.
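
A sketch of the dtype distinction at play (assumption: gfx950 uses the OCP formats rather than the older fnuz variants used on earlier hardware):

```python
import torch

ocp = [torch.float8_e4m3fn, torch.float8_e5m2]          # OCP spec formats
fnuz = [torch.float8_e4m3fnuz, torch.float8_e5m2fnuz]   # legacy ROCm variants

x = torch.randn(8, 8).to(torch.float8_e4m3fn)
print(x.dtype, torch.finfo(torch.float8_e4m3fn).max)    # max for e4m3fn is 448.0
```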

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146632
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-02-24 22:47:52 +00:00
PyTorch MergeBot
3e2d9d079e Revert "[ROCm] OCP FP8 Support for new GPUs (#146632)"
This reverts commit f95ab46797.

Reverted https://github.com/pytorch/pytorch/pull/146632 on behalf of https://github.com/jeanschmidt due to Breaking internal builds, I'll find someone to help merge this PR back to main ([comment](https://github.com/pytorch/pytorch/pull/146632#issuecomment-2676823614))
2025-02-23 12:04:50 +00:00
Peter Yeh
f95ab46797 [ROCm] OCP FP8 Support for new GPUs (#146632)
TLDR: Follow-up to / builds on top of https://github.com/pytorch/pytorch/pull/144476; adds OCP FP8 support for gfx950.
Refer to https://github.com/pytorch/ao/pull/1677

This pull request includes several changes to improve compatibility and support for new GPU architectures and data types, particularly for ROCm. The key updates involve adding support for new ROCm versions and GPU architectures, updating data type handling, and removing outdated checks.

### Improvements to GPU Architecture and ROCm Version Support:
* [`aten/src/ATen/Context.cpp`](diffhunk://#diff-33de472d304acbe57d693c8567370c638068bedc1aa0ce8e9dc115dad05a7810L323-R326): Added support for new GPU architectures `gfx1200`, `gfx1201`, and `gfx950` based on ROCm version checks.
* [`aten/src/ATen/native/cuda/Blas.cpp`](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abL196-R199): Updated architecture support in multiple functions to include `gfx1200`, `gfx1201`, and `gfx950` based on ROCm version checks. [[1]](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abL196-R199) [[2]](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abL865-R876)

### Updates to Data Type Handling:
* [`aten/src/ATen/cuda/CUDADataType.h`](diffhunk://#diff-9188bb13b1a49f459141f5f9b875593d1c5ce2beb5ad711fdbaf5bc7089ec015L81-L98): Enhanced data type conversion to include new float8 types for both CUDA and ROCm environments.
* [`aten/src/ATen/cuda/tunable/GemmHipblaslt.h`](diffhunk://#diff-bfa1a3b5d4bef1892bf50338775f3b0fd8cd31fc1868148f3968b98aefb68e3fL29-R80): Updated `HipDataTypeFor` template to handle new float8 types and added hard-coded enum values for ROCm versions prior to 6.3.

### Removal of Outdated Checks:
* [`cmake/public/LoadHIP.cmake`](diffhunk://#diff-b98e27b9a5f196a6965a99ee5a7bb15b3fc633d6375b767635b1b04ccb2fd3d5L169-L197): Removed the check for `HIP_NEW_TYPE_ENUMS` as it is no longer necessary with the updated ROCm versions. [[1]](diffhunk://#diff-b98e27b9a5f196a6965a99ee5a7bb15b3fc633d6375b767635b1b04ccb2fd3d5L169-L197) [[2]](diffhunk://#diff-b98e27b9a5f196a6965a99ee5a7bb15b3fc633d6375b767635b1b04ccb2fd3d5L211-R182)

These changes ensure better compatibility and performance on newer hardware and software environments, particularly for users leveraging ROCm and CUDA for deep learning and scientific computing tasks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146632
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-02-21 23:44:08 +00:00
Eli Uriegas
75a4b73816 utils: Update md5 call to be fips compliant (#147252)
Updates the md5 call to be FIPS compliant, per this issue:
* https://github.com/pytorch/pytorch/issues/147236

Not going to add a conditional here because the minimum Python version
that we support is already 3.9
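
Presumably the fix is the `usedforsecurity=False` keyword (added in Python 3.9), which tells FIPS-mode OpenSSL the digest is not used for security; a sketch:

```python
import hashlib

# Non-cryptographic fingerprint, allowed under FIPS mode.
digest = hashlib.md5(b"some cache key", usedforsecurity=False).hexdigest()
print(digest)
```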

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147252
Approved by: https://github.com/huydhn, https://github.com/Skylion007, https://github.com/malfet
2025-02-15 15:19:08 +00:00
Aaron Orenstein
2f9d378f7b PEP585 update - torch/utils (#145201)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145201
Approved by: https://github.com/bobrenjc93
2025-01-21 21:04:10 +00:00
Xiaodong Wang
3d3a07963f [reland][attempt2][AMD] Turn on TF32 for aten::mm (#144145)
Summary:
https://github.com/pytorch/pytorch/pull/143549 was reverted due to some
internal/oss tooling issue. Relanding.

hipblaslt supports TF32, so adding the support.
Original PR https://github.com/pytorch/pytorch/pull/139869
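
A sketch of the user-facing switch (the same one as on NVIDIA); note that ROCm builds may additionally gate TF32 behind environment configuration:

```python
import torch

torch.backends.cuda.matmul.allow_tf32 = True  # opt in to TF32 matmuls
a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")
c = torch.mm(a, b)  # may now run through hipblaslt's TF32 path on AMD
```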

Test Plan: CI

Differential Revision: D67785496

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144145
Approved by: https://github.com/jianyuh
2025-01-06 00:37:01 +00:00
Michal Gallus
93633d0e80 [ROCm][Windows] Fix export macros (#144098)
For correct import and export of functions when dynamic linkage is used for HIP libraries on Windows, the appropriate export/import macros need to be put in place. This pull request reuses the existing CUDA import/export macros by converting them to the corresponding HIP macros during the hipification process.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144098
Approved by: https://github.com/jeffdaily
2025-01-04 17:12:46 +00:00
PyTorch MergeBot
448c16ac87 Revert "[reland][AMD] Turn on TF32 for aten::mm (#143549)"
This reverts commit 41cdc7f735.

Reverted https://github.com/pytorch/pytorch/pull/143549 on behalf of https://github.com/malfet due to It breaks ROCM testing, see 06b4b96b34/1 ([comment](https://github.com/pytorch/pytorch/pull/143549#issuecomment-2559016960))
2024-12-23 06:47:36 +00:00
Xiaodong Wang
41cdc7f735 [reland][AMD] Turn on TF32 for aten::mm (#143549)
Summary:
hipblaslt supports TF32, so adding the support.

Original PR https://github.com/pytorch/pytorch/pull/139869

Test Plan: CI

Differential Revision: D67431681

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143549
Approved by: https://github.com/eqy
2024-12-22 21:05:05 +00:00
PyTorch MergeBot
7ab3177776 Revert "[AMD] Turn on TF32 for aten::mm (#139869)"
This reverts commit e0bdae7884.

Reverted https://github.com/pytorch/pytorch/pull/139869 on behalf of https://github.com/jeffdaily due to causing ROCm CI failures, need to investigate, revert for now ([comment](https://github.com/pytorch/pytorch/pull/139869#issuecomment-2546127069))
2024-12-16 16:46:48 +00:00
Xiaodong Wang
e0bdae7884 [AMD] Turn on TF32 for aten::mm (#139869)
Summary: hipblaslt supports TF32, so adding the support.

Test Plan: CI

Differential Revision: D65435392

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139869
Approved by: https://github.com/leitian
2024-12-15 10:02:29 +00:00
Alex Denisov
539286a67b Inductor annotations (#130429)
Add NVTX annotations around training phases and buffer computations

RFC/discussion: https://dev-discuss.pytorch.org/t/rfc-performance-profiling-at-scale-with-details-nvtx-annotations/2224

<img width="2160" alt="Screenshot 2024-07-10 at 11 48 04" src="https://github.com/pytorch/pytorch/assets/1175576/9ade139c-d393-473f-9b68-6c25da367dc4">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130429
Approved by: https://github.com/aorenste, https://github.com/eellison, https://github.com/albanD

Co-authored-by: Cedric GESTES <cedric.gestes@flex.ai>
2024-12-10 08:53:39 +00:00
Dmitry Nikolaev
c9db2c6328 [ROCm] cudagraph explicit sync only after capture_begin() (#138722)
Since ROCm 6.2, hipGraphExecDestroy doesn't immediately free memory: it waits for the next sync point, which ensures all hipGraphLaunch calls have finished before any memory is released. We need to ensure all async operations finish before deleting the object.

The capture_dev_ variable saves the device number when the capture_begin() method is called, but a CUDAGraph can be created and destroyed without capture_begin() ever being called. Initializing `capture_dev_ = UNDEFINED_DEVICE;` lets us detect such a case and skip the sync, as sketched below.
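
A minimal sketch of the edge case: a CUDAGraph created and destroyed without ever capturing must skip the post-destroy sync.

```python
import torch

g = torch.cuda.CUDAGraph()  # capture_begin() never called
del g                       # capture_dev_ == UNDEFINED_DEVICE, so the sync is skipped
```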

Tests impacted:
test_cuda.py::TestCuda::test_graph_make_graphed_callables_*
distributed/test_c10d_nccl.py::ProcessGroupNCCLTest::test_allreduce_in_cudagraph

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138722
Approved by: https://github.com/malfet, https://github.com/eqy, https://github.com/jeffdaily
2024-11-20 19:37:22 +00:00
Aaron Gokaslan
12e95aa4ee [BE]: Apply PERF401 autofixes from ruff (#140980)
* Automatically applies ruff rule PERF401, turning loops into equivalent list comprehensions, which are faster and do not leak the loop variables into the enclosing scope (see the example below).
* List comprehensions not only often have better typing, but also carry 50+% less overhead than for loops; they preserve length information and are easier for the interpreter to optimize.
* Manually went back and made mypy happy after the change.
* Also fixed style lints in files covered by flake8 but not by pyfmt
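
Example of the rewrite PERF401 performs:

```python
nums = range(10)

# before
squares = []
for n in nums:
    squares.append(n * n)

# after (what the autofix produces)
squares = [n * n for n in nums]
```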

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140980
Approved by: https://github.com/justinchuby, https://github.com/malfet
2024-11-20 17:52:07 +00:00
Kefei Lu
d2d1258b1b Speed up AMD AOT Inductor lowering by memoizing hipify trie to regex logic (#140156)
Summary:
AMD lowering duration is 1.55x that of H100. Profiling shows hipification-related functions took 22% of the overall lowering time.

This diff cuts that time by safely memoizing the trie-to-regex logic. The trick is to incrementally build a state for the trie during its construction; the state is the hash of all the words added to the trie (sketched below).
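
An illustrative sketch (not the actual hipify code) of the memoization scheme:

```python
import hashlib
import re

class MemoizedTrie:
    """Sketch: a trie whose regex export is memoized on a hash of its words."""

    def __init__(self):
        self.root = {}
        self._state = hashlib.md5()  # incrementally hashes every added word
        self._cache = {}             # state digest -> compiled regex

    def add(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node["\0"] = {}              # end-of-word marker
        self._state.update(word.encode())

    def _pattern(self, node):
        alts = [("" if ch == "\0" else re.escape(ch) + self._pattern(child))
                for ch, child in sorted(node.items())]
        return alts[0] if len(alts) == 1 else "(?:" + "|".join(alts) + ")"

    def regex(self):
        key = self._state.hexdigest()  # cheap: no trie walk when already cached
        if key not in self._cache:
            self._cache[key] = re.compile(self._pattern(self.root))
        return self._cache[key]

trie = MemoizedTrie()
for w in ("cudaMalloc", "cudaFree"):
    trie.add(w)
assert trie.regex().fullmatch("cudaFree")
assert trie.regex() is trie.regex()  # second call is a cache hit
```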

Differential Revision: D65659445

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140156
Approved by: https://github.com/ColinPeppler

Co-authored-by: Kefei Lu <kefeilu@meta.com>
2024-11-09 04:28:58 +00:00
Jeff Daily
7c7b2d89ba [ROCm] set hipblas workspace (#138791)
Fixes #138532.

This brings hipblas behavior in line with cublas behavior with respect to setting the workspace to an allocation from the caching allocator as well as the env var HIPBLAS_WORKSPACE_CONFIG.
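
A sketch of the env-var knob this enables; the `:SIZE:COUNT` syntax is the cuBLAS convention, assumed here to carry over to `HIPBLAS_WORKSPACE_CONFIG`:

```python
import os
os.environ["HIPBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # set before the first matmul

import torch
a = torch.randn(512, 512, device="cuda")
print((a @ a).sum())  # hipblas workspace now comes from the caching allocator
```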

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138791
Approved by: https://github.com/naromero77amd, https://github.com/eqy, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-10-29 01:37:55 +00:00
Tom Ritchford
c0582fd0f8 Remove unused Python variables in torch/[b-z]* (#136963)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136963
Approved by: https://github.com/ezyang
2024-10-19 16:45:22 +00:00
Xiaodong Wang
b14c9b7250 [AMD] Hipify torchaudio_decoder (#138181)
Summary:
X-link: https://github.com/pytorch/audio/pull/3843

Continue to hipify more torchaudio targets.

Test Plan:
CI

  buck build mode/opt-amd-gpu pytorch/audio/src/...

Differential Revision: D64298970

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138181
Approved by: https://github.com/houseroad
2024-10-17 23:37:37 +00:00
Xiaodong Wang
eea1f79a1d [AMD] use rccl.h instead of rccl/rccl.h (#135472)
Summary: We hipify NCCLUtils.h to include rccl/rccl.h instead of nccl.h. This follows the layout of the ROCm RPM suite (the header is installed at include/rccl/rccl.h); however, in the RCCL source tree the header is just src/rccl.h. Using rccl/rccl.h therefore finds the RPM's header but not the source tree's header.

Test Plan:
buck run mode/opt-amd-gpu -c hpc_comms.use_rccl=develop -c fbcode.split-dwarf=True  --config rccl.build_rdma_core=true --config rccl.adhoc_brcm=true  //aps_models/ads/icvr:icvr_launcher -- mode=local_ctr_cvr_cmf_rep_1000x_v1_no_atom   data_loader.dataset.table_ds=[2024-09-04]   data_loader.dataset.batch_size=512  max_ind_range=10

Without this diff, it shows NCCL version 2.18.

Differential Revision: D62371434

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135472
Approved by: https://github.com/jeffdaily, https://github.com/cenzhaometa
2024-10-10 08:55:57 +00:00
PyTorch MergeBot
7e8dace0de Revert "[ROCm] remove caffe2 from hipify (#137157)"
This reverts commit 40d8260745.

Reverted https://github.com/pytorch/pytorch/pull/137157 on behalf of https://github.com/xw285cornell due to this is breaking internal where we still use caffe2 ([comment](https://github.com/pytorch/pytorch/pull/137157#issuecomment-2400466131))
2024-10-08 17:45:45 +00:00
Jeff Daily
40d8260745 [ROCm] remove caffe2 from hipify (#137157)
- Remove all "MasqueradingAsCUDA" files and classes.
- Do not rename "CUDA" classes to "HIP".

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137157
Approved by: https://github.com/eqy
2024-10-05 12:48:54 +00:00
Michal Gallus
79562f3af8 [ROCm] Modify hipify script to work with Windows paths (#135360)
This change modifies the `hipify_python.py` script to properly detect all directories and `include`/`ignore` paths during the hipification process on Windows, by changing the path syntax convention to a UNIX-like one.

Since in many places the script assumes a UNIX-like convention by using paths with forward slashes `/`, I decided to accommodate it by converting Windows paths to UNIX-like ones. By doing so, the number of changes to the file is limited. Moreover, this early-on unification allows the rest of the code to keep its battle-tested Linux-like behaviour.

Another option would be to use the `Path` object from `pathlib` to represent all paths in the script; however, that would impact a broader share of the code and would hence require a more meticulous evaluation in terms of unaltered logic and edge cases.
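
A sketch of the normalization approach described above (hypothetical helper name; the real change lives inside `hipify_python.py`):

```python
def to_unix_path(path: str) -> str:
    """Turn a Windows path like 'C:\\src\\aten' into 'C:/src/aten'."""
    return path.replace("\\", "/")

print(to_unix_path(r"C:\src\pytorch\aten"))  # -> C:/src/pytorch/aten
```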

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135360
Approved by: https://github.com/jeffdaily, https://github.com/jithunnair-amd
2024-10-04 23:43:43 +00:00
Jeff Daily
c7b0d4b148 raw_alloc ignores PYTORCH_NO_CUDA_MEMORY_CACHING (#131114)
raw_alloc is used by cudnn, miopen, thrust, and tunableop.  Without this PR, the env var for disabling the caching allocator will only partially work.
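
A sketch of the knob in question; it must be set before CUDA initialization:

```python
import os
os.environ["PYTORCH_NO_CUDA_MEMORY_CACHING"] = "1"  # disable the caching allocator

import torch
# With this PR, raw_alloc callers (cudnn, miopen, thrust, tunableop) also
# bypass the cache rather than only the ordinary tensor allocations.
x = torch.randn(1024, device="cuda")
```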

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131114
Approved by: https://github.com/eqy, https://github.com/houseroad, https://github.com/albanD

Co-authored-by: Nichols A. Romero <nick.romero@amd.com>
2024-10-04 15:36:29 +00:00
PyTorch MergeBot
0d1701f310 Revert "raw_alloc ignores PYTORCH_NO_CUDA_MEMORY_CACHING (#131114)"
This reverts commit 7001907480.

Reverted https://github.com/pytorch/pytorch/pull/131114 on behalf of https://github.com/PaliC due to failing internal builds ([comment](https://github.com/pytorch/pytorch/pull/131114#issuecomment-2390615007))
2024-10-03 06:22:55 +00:00
Jeff Daily
7001907480 raw_alloc ignores PYTORCH_NO_CUDA_MEMORY_CACHING (#131114)
raw_alloc is used by cudnn, miopen, thrust, and tunableop.  Without this PR, the env var for disabling the caching allocator will only partially work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131114
Approved by: https://github.com/eqy, https://github.com/houseroad, https://github.com/albanD

Co-authored-by: Nichols A. Romero <nick.romero@amd.com>
2024-10-02 16:27:15 +00:00
cyy
c3d02fa390 [Reland2] Update NVTX to NVTX3 (#109843)
Another attempt to update NVTX to NVTX3. We now avoid changing the NVTX header inclusion of existing code. The advantage of NVTX3 over NVTX is that it is a header-only library, so linking with NVTX3 greatly simplifies our CMake and other build scripts for finding libraries in user environments. NVTX is indeed still present in the latest CUDA versions, but it's no longer a compiled library: it's now header-only. That's why there isn't a .lib file anymore.
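
A sketch of NVTX range annotations from PyTorch; the Python API is unchanged by the NVTX3 migration:

```python
import torch

torch.cuda.nvtx.range_push("my_region")  # open a named range in the profiler
y = torch.randn(8, device="cuda") * 2
torch.cuda.nvtx.range_pop()              # close it
```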

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109843
Approved by: https://github.com/peterbell10, https://github.com/eqy

Co-authored-by: Ivan Zaitsev <108101595+izaitsevfb@users.noreply.github.com>
2024-08-20 16:33:26 +00:00
Josh Fromm
f347174d61 Hipify Pytorch3D (#133343)
Summary:
X-link: https://github.com/fairinternal/pytorch3d/pull/45

X-link: https://github.com/facebookresearch/pytorch3d/pull/1851

Very minor change to extend hipification to a missing hipcub constant. This is needed to hipify some of the kernels in pytorch3d.

Differential Revision: D61171993

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133343
Approved by: https://github.com/houseroad
2024-08-15 23:39:07 +00:00
Jerry Mannil
42f647219a [ROCm] Add int4 support (#129710)
- Add AMD support for int4 kernel
  - Only supports CDNA2 and CDNA3 gpus for now
  - Uses `mfma_f32_16x16x16bf16` instruction for matrix multiply
  - Uses the `v_and_or_b32` instruction and the `__hfma2` intrinsic for unpacking bf16 values (sketched below)
  - Enable hipify for `__nv_bfloat16` and `__nv_bfloat162` data types
- Enable int4 unit tests for CDNA2 and CDNA3 AMD gpus
- Fix torchscript issues due to hipify for `__nv_bfloat16` type
  - TorchScript has its own implementation for bfloat16 type
    - Implemented in the `__nv_bfloat16` structure at [resource_strings.h](https://github.com/pytorch/pytorch/blob/main/torch/csrc/jit/codegen/fuser/cuda/resource_strings.h)
    - So, we shouldn't hipify any reference to `__nv_bfloat16` in the TorchScript implementation
    - Hence moved the direct `__nv_bfloat16` references in `codegen.cpp` and `cuda_codegen.cpp` to `resource_strings.h`, which is already exempted from hipify
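
An illustrative sketch of the int4 packing the kernel consumes, with plain Python arithmetic standing in for the `v_and_or_b32` mask/shift work:

```python
def pack_int4(lo: int, hi: int) -> int:
    """Pack two 4-bit values into one byte."""
    return ((hi & 0xF) << 4) | (lo & 0xF)

def unpack_int4(byte: int) -> tuple[int, int]:
    """Recover the two 4-bit values via mask and shift."""
    return byte & 0xF, (byte >> 4) & 0xF

assert unpack_int4(pack_int4(5, 9)) == (5, 9)
```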

Fixes #124699
Fixes pytorch-labs/gpt-fast/issues/154

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129710
Approved by: https://github.com/malfet
2024-07-09 19:49:12 +00:00
PyTorch MergeBot
d7b7f8b79f Revert "[ROCm] Add int4 support (#129710)"
This reverts commit d0ad13fa42.

Reverted https://github.com/pytorch/pytorch/pull/129710 on behalf of https://github.com/jeffdaily due to original ROCm PR did not have ciflow/rocm, missed signal ([comment](https://github.com/pytorch/pytorch/pull/129710#issuecomment-2214558368))
2024-07-08 16:07:53 +00:00
Jerry Mannil
d0ad13fa42 [ROCm] Add int4 support (#129710)
Add AMD support for int4 kernel using mfma_f32_16x16x16bf16 instruction.
Only supports CDNA2 and CDNA3 gpus for now.
Fixes #124699

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129710
Approved by: https://github.com/malfet
2024-07-07 23:54:22 +00:00
Aaron Orenstein
57536286e2 Flip default value for mypy disallow_untyped_defs [10/11] (#127847)
See #127836 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127847
Approved by: https://github.com/oulgen
ghstack dependencies: #127842, #127843, #127844, #127845, #127846
2024-06-08 18:50:06 +00:00
Xuehai Pan
8b08b0f340 [BE] enable ruff rule Q from flake8-quotes (#127713)
Enable [ruff rule `Q`](https://docs.astral.sh/ruff/rules/#flake8-quotes-q) from flake8-quotes. Fixes:

- [avoidable-escaped-quote (Q003)](https://docs.astral.sh/ruff/rules/avoidable-escaped-quote/#avoidable-escaped-quote-q003)
- [unnecessary-escaped-quote (Q004)](https://docs.astral.sh/ruff/rules/unnecessary-escaped-quote/#unnecessary-escaped-quote-q004)
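
Examples of what the two rules flag:

```python
bad_q003 = 'It\'s fine'      # avoidable-escaped-quote: use the other quote style
good_q003 = "It's fine"

bad_q004 = "He said \'hi\'"  # unnecessary-escaped-quote: the escape is a no-op
good_q004 = "He said 'hi'"
```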

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127713
Approved by: https://github.com/ezyang
2024-06-02 23:25:26 +00:00
youkaichao
82b4528788 [cudagraph] fix verbose graph logging (#126694)
According to the [doc](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html#group__CUDART__TYPES_1g0907ca7a1e7d0211b71ee49c5403072b):

> enum cudaGraphDebugDotFlags
> CUDA Graph debug write options
>
> Values
> cudaGraphDebugDotFlagsVerbose = 1<<0
> Output all debug data as if every debug flag is enabled
> cudaGraphDebugDotFlagsKernelNodeParams = 1<<2
> Adds cudaKernelNodeParams to output
> cudaGraphDebugDotFlagsMemcpyNodeParams = 1<<3
> Adds cudaMemcpy3DParms to output
> cudaGraphDebugDotFlagsMemsetNodeParams = 1<<4
> Adds cudaMemsetParams to output
> cudaGraphDebugDotFlagsHostNodeParams = 1<<5
> Adds cudaHostNodeParams to output
> cudaGraphDebugDotFlagsEventNodeParams = 1<<6
> Adds cudaEvent_t handle from record and wait nodes to output
> cudaGraphDebugDotFlagsExtSemasSignalNodeParams = 1<<7
> Adds cudaExternalSemaphoreSignalNodeParams values to output
> cudaGraphDebugDotFlagsExtSemasWaitNodeParams = 1<<8
> Adds cudaExternalSemaphoreWaitNodeParams to output
> cudaGraphDebugDotFlagsKernelNodeAttributes = 1<<9
> Adds cudaKernelNodeAttrID values to output
> cudaGraphDebugDotFlagsHandles = 1<<10
> Adds node handles and every kernel function handle to output
> cudaGraphDebugDotFlagsConditionalNodeParams = 1<<15
> Adds cudaConditionalNodeParams to output
>

`1 << 10` is not the most verbose flag; it is just one flag that adds node handles and every kernel function handle to the output. `1 << 0` is the most verbose flag, under the name `cudaGraphDebugDotFlagsVerbose`.

Here is an example graph, dumped with `1 << 10`:

```dot
digraph dot {
subgraph cluster_1 {
label="graph_1" graph[style="dashed"];
"graph_1_node_0"[style="solid" shape="rectangle" label="0
MEM_ALLOC
node handle: 0x000055D2889750F0
"];

"graph_1_node_1"[style="bold" shape="octagon" label="1
_Z3addPhS_S_m
node handle: 0x000055D288979A20
func handle: 0x000055D288978D40
"];

"graph_1_node_2"[style="solid" shape="trapezium"label="2
MEMCPY
node handle: 0x000055D28897A130
(DtoH,1024)
"];

"graph_1_node_3"[style="solid" shape="rectangle" label="3
MEM_FREE
node handle: 0x000055D2889890C0
"];

"graph_1_node_0" -> "graph_1_node_1";
"graph_1_node_1" -> "graph_1_node_2";
"graph_1_node_2" -> "graph_1_node_3";
}
}
```

The same graph dumped with `1 << 0`:

```dot
digraph dot {
subgraph cluster_1 {
label="graph_1" graph[style="dashed"];
"graph_1_node_0"[style="solid" shape="record" label="{
MEM_ALLOC
| {{ID | node handle} | {0 (topoId: 3) | 0x000055D2889750F0}}
| {{{poolProps | {allocType | handleTypes | {location | {type | id}}} | {PINNED | NONE | DEVICE | 0}}}}
| {{bytesize | dptr} | {1024 | 0x0000000A02000000}}
}"];

"graph_1_node_1"[style="bold" shape="record" label="{KERNEL
| {ID | 1 (topoId: 2) | _Z3addPhS_S_m\<\<\<4,256,0\>\>\>}
| {{node handle | func handle} | {0x000055D288979A20 | 0x000055D288978D40}}
| {accessPolicyWindow | {base_ptr | num_bytes | hitRatio | hitProp | missProp} | {0x0000000000000000 | 0 | 0.000000 | N | N}}
| {cooperative | 0}
| {priority | 0}
}"];

"graph_1_node_2"[style="solid" shape="record" label="{
MEMCPY
| {{ID | node handle} | {2 (topoId: 1) | 0x000055D28897A130}}
| {kind | DtoH (DEVICE to HOST PAGEABLE)}
| {{srcPtr | dstPtr} | {pitch | ptr | xsize | ysize | pitch | ptr | xsize | ysize} | {0 | 0x0000000A02000000 | 0 | 0 | 0 | 0x000055D287CA6DB0 | 0 | 0}}
| {{srcPos | {{x | 0} | {y | 0} | {z | 0}}} | {dstPos | {{x | 0} | {y | 0} | {z | 0}}} | {Extent | {{Width | 1024} | {Height | 1} | {Depth | 1}}}}
}"];

"graph_1_node_3"[style="solid" shape="record" label="{
MEM_FREE
| {{ID | node handle} | {3 (topoId: 0) | 0x000055D2889890C0}}
| {{dptr} | {0x0000000A02000000}}
}"];

"graph_1_node_0" -> "graph_1_node_1" [headlabel=0];
"graph_1_node_1" -> "graph_1_node_2" [headlabel=0];
"graph_1_node_2" -> "graph_1_node_3" [headlabel=0];
}
}
```
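
A sketch of reaching this dump path from Python (assuming a CUDA build); debug mode must be enabled before capture:

```python
import torch

g = torch.cuda.CUDAGraph()
g.enable_debug_mode()
x = torch.zeros(4, device="cuda")
with torch.cuda.graph(g):
    y = x + 1
g.debug_dump("graph_debug.dot")  # with this fix, the dump uses the verbose flag
```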

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126694
Approved by: https://github.com/eqy, https://github.com/eellison
2024-05-21 00:55:15 +00:00