Resubmission of #144974, which was reverted for unrelated reasons.

Newer matmul kernels, e.g. those targeting Hopper GPUs, sometimes use a "persistent" schedule, which consists of launching as many CUDA blocks as there are SMs on the GPU, with each such block then working on multiple output tiles in a row. This eliminates the overhead of starting and finishing each tile, effectively doing cross-tile pipelining. On previous generations these latencies could be hidden by running multiple CUDA blocks per SM but, with blocks becoming larger, only one can run at a time per SM, so this now needs to be handled in software.

Persistent kernels become an issue when other kernels are running concurrently, the classical example being a NCCL communication kernel in the background. In such cases the matmul expects to be able to use all the SMs but is prevented from doing so because some of them are busy. This can lead to its blocks being scheduled as two separate waves on the available SMs. This "wave quantization" can double the latency of the matmul kernel.

While we wait for smarter solutions, such as automatic load balancing among the blocks, an easy way to unblock ourselves is to tell the matmuls to only use a subset of the GPU's SMs. For this, I am introducing a global `sm_carveout` flag which specifies how many SMs should be left available for other kernels. For now I only change the cuBLAS kernels and the scaled-mm CUTLASS kernel; more kernels can be opted in later.

I tested this change manually, by using the Kineto profiler to look up the grid size of a scaled-mm kernel with different values of `sm_carveout` and making sure it changed. Suggestions are welcome for a more automated test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147966
Approved by: https://github.com/danthe3rd
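To make the intended usage concrete, here is a minimal sketch of setting the carveout and then checking its effect through the profiler, mirroring the manual test described above. The binding name `torch._C._set_sm_carveout_experimental` and the presence of a per-kernel `grid` field in the exported Kineto trace are assumptions for illustration, not a documented API:

```python
import json
import tempfile

import torch

# Assumed Python binding for the global `sm_carveout` flag: ask matmul
# kernels to leave 8 SMs free for concurrent kernels (e.g. NCCL).
torch._C._set_sm_carveout_experimental(8)

a = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)

with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CUDA]
) as prof:
    a @ b
    torch.cuda.synchronize()

# Kineto's chrome trace records each kernel's launch grid; with a persistent
# schedule the matmul's grid should shrink from (#SMs) to (#SMs - carveout).
with tempfile.NamedTemporaryFile(suffix=".json") as f:
    prof.export_chrome_trace(f.name)
    trace = json.load(open(f.name))
for ev in trace["traceEvents"]:
    if "gemm" in ev.get("name", "").lower():
        print(ev["name"], ev.get("args", {}).get("grid"))
```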