Resubmission of #144974, which was reverted for unrelated reasons.

Newer matmul kernels, e.g. those targeting Hopper GPUs, sometimes use a "persistent" schedule, which consists of launching as many CUDA blocks as there are SMs on the GPU, with each such block then working on multiple output tiles in a row. This eliminates the overhead of starting and finishing each tile, effectively doing cross-tile pipelining. On previous generations these latencies could be hidden by running multiple CUDA blocks per SM but, with blocks becoming larger, only one can run at a time per SM, so this now needs to be handled in software.

Persistent kernels become an issue when other kernels are running concurrently, the classical example being a NCCL communication kernel in the background. In such cases the matmul expects to be able to use all the SMs but is prevented from doing so because some of them are busy. This can lead to its blocks being scheduled as two separate waves on the available SMs. This "wave quantization" can double the latency of the matmul kernel.

While we wait for smarter solutions, such as automatic load balancing among the blocks, an easy way to unblock ourselves is to tell the matmuls to only use a subset of the GPU's SMs. For this, I am introducing a global `sm_carveout` flag which specifies how many SMs should be left available for other kernels. For now I only change the cuBLAS kernels and the scaled-mm CUTLASS kernel; more kernels can be opted in later.

I tested this change manually, by using the Kineto profiler to look up the grid size of a scaled-mm kernel with different values of `sm_carveout` and making sure it changed. Suggestions are welcome for a more automated test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147966
Approved by: https://github.com/danthe3rd
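To make the intended usage concrete, here is a minimal sketch of setting the carveout and then checking its effect through the profiler, mirroring the manual test described above. The binding name `torch._C._set_sm_carveout_experimental` and the presence of a per-kernel `grid` field in the exported Kineto trace are assumptions for illustration, not a documented API:

```python
import json
import tempfile

import torch

# Assumed Python binding for the global `sm_carveout` flag: ask matmul
# kernels to leave 8 SMs free for concurrent kernels (e.g. NCCL).
torch._C._set_sm_carveout_experimental(8)

a = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)

with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CUDA]
) as prof:
    a @ b
    torch.cuda.synchronize()

# Kineto's chrome trace records each kernel's launch grid; with a persistent
# schedule the matmul's grid should shrink from (#SMs) to (#SMs - carveout).
with tempfile.NamedTemporaryFile(suffix=".json") as f:
    prof.export_chrome_trace(f.name)
    trace = json.load(open(f.name))
for ev in trace["traceEvents"]:
    if "gemm" in ev.get("name", "").lower():
        print(ev["name"], ev.get("args", {}).get("grid"))
```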