pytorch/torch/_C
Luca Wehrstedt 60d94ea22b Add option to limit number of SMs used by matmul kernels (#147966)
Resubmission of #144974, which was reverted for unrelated reasons.

Newer matmul kernels, e.g. those targeting Hopper GPUs, sometimes use a "persistent" schedule, which consists of launching as many CUDA blocks as there are SMs on the GPU, with each block then working on multiple output tiles in a row. This eliminates the overhead of starting and finishing each tile, effectively pipelining across tiles. In previous generations these latencies could be hidden by running multiple CUDA blocks per SM but, with blocks becoming larger, only one can run per SM at a time, so this now needs to be handled in software.
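
To make the schedule concrete, here is a schematic sketch in plain Python (illustrative only, not actual kernel code; the SM and tile counts are made up for the example):

```python
# Illustrative pseudocode of a persistent schedule (not real kernel code).
NUM_SMS = 132    # e.g. an H100 GPU
NUM_TILES = 300  # output tiles of the matmul

# Classical schedule: one CUDA block per tile -> 300 blocks, each paying
# its own start/finish overhead.
# Persistent schedule: one block per SM, each striding over several tiles,
# so the epilogue of one tile overlaps with the prologue of the next.
for block_id in range(NUM_SMS):
    tiles = list(range(block_id, NUM_TILES, NUM_SMS))
    # block 0 handles tiles [0, 132, 264], block 1 handles [1, 133, 265], ...
```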

Persistent kernels become an issue when other kernels run concurrently. The classical example is an NCCL communication kernel running in the background. In such cases the matmul expects to be able to use all the SMs but is prevented from doing so because some of them are busy. This can lead to its blocks being scheduled as two separate waves on the available SMs. This "wave quantization" can double the latency of the matmul kernel.
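
In numbers (an illustrative sketch; the SM counts are assumptions chosen for the example):

```python
import math

total_sms = 132     # e.g. an H100 GPU
busy_sms = 20       # occupied by a background NCCL kernel (assumed for the example)
blocks = total_sms  # the persistent matmul still launches one block per SM

waves = math.ceil(blocks / (total_sms - busy_sms))
print(waves)  # 2: 112 blocks run first, the remaining 20 form a second wave
              # that takes roughly as long as the first, ~doubling latency
```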

While we wait for smarter solutions, such as automatic load balancing among the blocks, an easy way to unblock ourselves is to tell the matmuls to use only a subset of the GPU's SMs. For this I am introducing a global `sm_carveout` flag, which specifies how many SMs should be left available for other kernels.
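
As a minimal usage sketch, assuming the flag is exposed through private bindings named `torch._C._set_sm_carveout_experimental` / `torch._C._get_sm_carveout_experimental` (an assumption based on this PR's `__init__.pyi.in` change; the API is private and experimental):

```python
import torch

# Assumed binding names for the global sm_carveout flag (private/experimental).
prev = torch._C._get_sm_carveout_experimental()  # default: no carveout
torch._C._set_sm_carveout_experimental(8)        # leave 8 SMs free for other kernels

a = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
c = a @ b  # the cuBLAS kernel may now use at most (num_SMs - 8) SMs

torch._C._set_sm_carveout_experimental(prev)     # restore the previous setting
```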

For now I am only changing the cuBLAS kernels and the scaled-mm CUTLASS kernel; more kernels can be opted in later.

I tested this change manually, by using the Kineto profiler to look up the grid size of a scaled-mm kernel under different values of `sm_carveout` and verifying that it changed. Suggestions are welcome for a more automated test.
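
One option for automating this is to export a Kineto trace and read the kernel's launch grid out of it. A rough sketch, under the same naming assumption as above (the trace-parsing details are likewise assumptions about Kineto's chrome-trace format, which records a `grid` entry in each kernel event's `args`):

```python
import json
import tempfile

import torch

def largest_grid_x(carveout):
    torch._C._set_sm_carveout_experimental(carveout)  # assumed binding, see above
    a = torch.randn(4096, 4096, device="cuda")
    with torch.profiler.profile() as prof:
        a @ a
        torch.cuda.synchronize()
    with tempfile.TemporaryDirectory() as d:
        path = d + "/trace.json"
        prof.export_chrome_trace(path)
        with open(path) as f:
            trace = json.load(f)
    # Kernel events carry their launch grid as args["grid"] == [x, y, z].
    grids = [ev["args"]["grid"] for ev in trace["traceEvents"]
             if ev.get("cat") == "kernel" and "grid" in ev.get("args", {})]
    return max(g[0] for g in grids)

# With a carveout, the persistent kernel should launch fewer blocks.
assert largest_grid_x(0) > largest_grid_x(32)
```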

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147966
Approved by: https://github.com/danthe3rd
2025-02-26 12:01:12 +00:00
_dynamo [dynamo] replace hardcoded eval frame control flags skip_code_recursive_flag/cache_limit_hit_flag (#146355) 2025-02-18 21:37:12 +00:00
__init__.pyi.in Add option to limit number of SMs used by matmul kernels (#147966) 2025-02-26 12:01:12 +00:00
_aoti.pyi [AOTI XPU] Support AOT Inductor for Intel GPU. (#140269) 2024-12-10 05:05:08 +00:00
_autograd.pyi update _unsafe_set_version_counter to accept lists of tensors (#137921) 2025-02-04 04:51:11 +00:00
_cpu.pyi [CPUInductor] Fix SVE256 detection (#146207) 2025-02-01 18:51:34 +00:00
_cudnn.pyi Improve typing in torch/types.py (#145237) 2025-01-28 05:29:12 +00:00
_cusparselt.pyi [sparse] Add cuSPARSELt as a backend (#128534) 2024-08-21 22:06:07 +00:00
_distributed_autograd.pyi remove allow-untyped-defs for torch/_C/_distributed_autograd.pyi (#143369) 2024-12-17 18:09:28 +00:00
_distributed_c10d.pyi Fix type stubs for SymmetricMemory (#146310) 2025-02-21 19:59:43 +00:00
_distributed_rpc_testing.pyi Use Generic TypeAlias (PEP 585) and Union Type (PEP 604) in .pyi stub files (#129419) 2024-06-29 09:23:39 +00:00
_distributed_rpc.pyi Use Generic TypeAlias (PEP 585) and Union Type (PEP 604) in .pyi stub files (#129419) 2024-06-29 09:23:39 +00:00
_export.pyi [export] Implement cpp deserializer. (#136398) 2024-11-14 16:34:59 +00:00
_functions.pyi PEP585 update - torch/_C torch/_decomp torch/_lazy torch/_library torch/_numpy torch/_prims torch/_refs torch/_strobelight (#145102) 2025-01-18 20:47:12 +00:00
_functorch.pyi Use Generic TypeAlias (PEP 585) and Union Type (PEP 604) in .pyi stub files (#129419) 2024-06-29 09:23:39 +00:00
_instruction_counter.pyi Add compile time instruction count metric (#133834) 2024-08-27 23:29:02 +00:00
_itt.pyi
_lazy_ts_backend.pyi Use Generic TypeAlias (PEP 585) and Union Type (PEP 604) in .pyi stub files (#129419) 2024-06-29 09:23:39 +00:00
_lazy.pyi remove allow-untyped-defs for torch/_C/_lazy.pyi (#143370) 2024-12-17 17:18:10 +00:00
_monitor.pyi PEP585: More UP006 fixes (#146392) 2025-02-20 06:18:13 +00:00
_nn.pyi.in Add padding_side to pad_sequence with "left" and "right" options ("right" as default) (#131884) 2024-08-07 15:53:07 +00:00
_nvtx.pyi Inductor annotations (#130429) 2024-12-10 08:53:39 +00:00
_onnx.pyi [1/N] [Caffe2] Remove caffe2_aten_fallback code (#128675) 2024-06-17 21:25:59 +00:00
_profiler.pyi [Profiler] Create Auto-Trace Frontend for Trace ID (#139310) 2024-10-31 19:02:57 +00:00
_VariableFunctions.pyi.in Flip default value for mypy disallow_untyped_defs [1/11] (#127838) 2024-06-08 18:16:33 +00:00
_verbose.pyi
build.bzl
return_types.pyi.in [torchgen] reference generated comment to actual location of the generator and template (#130020) 2024-07-05 21:47:14 +00:00