pytorch/cmake
Natalia Gimelshein 53a1a022a9 [WIP] Initial implementation of Grouped Gemm API (#148531)
This PR provides an initial cutlass implementation of the grouped gemm API as described in this [document](https://docs.google.com/document/d/1985La6wUUVH1AGBkNhaGKUXzx-9ybtbUp567-vYVOM4/edit?tab=t.0#heading=h.g8lzbjnyzzx9). Any combination of 2d and 3d inputs is supported, with 2d inputs being jagged and the offsets of the jagged input given by the device tensor `offs`. Only H100 is supported, and only fp8_e4m3 with bf16 output and rowwise scaling. All dimensions of each individual gemm must be multiples of 16, which is a cutlass limitation.
I'll need to add those checks; for dynamic dimensions, unfortunately, the checks will have to be device asserts.
I had to copy-paste cutlass's `Sm90RowBroadcast` and `Sm90ColBroadcast` structs with minor changes to enable scales given as pointer arrays; ideally those should be part of cutlass itself.
I copied the schedules from the similar grouped gemm in FBGEMM, but there's a lot of room to improve perf, especially for `fast_accum=False`.
Next steps would be perf tuning and extending coverage to B100; I don't know how the cutlass grouped gemm example handles blockwise scaling on B100.
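
As a rough illustration of the jagged 2d x 3d case described above, here is a minimal pure-Python reference sketch. It is not the cutlass kernel or the PyTorch API: the names `grouped_mm_2d_3d` and `matmul`, the `multiple` parameter, and the reading of `offs` as the cumulative end-row offset of each group are assumptions made for illustration. fp8 inputs, bf16 output, and rowwise scaling are not modeled.

```python
def matmul(a, b):
    """Plain nested-list matrix multiply: (m x k) @ (k x n) -> (m x n)."""
    k = len(b)
    n = len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(k)) for j in range(n)]
            for row in a]


def grouped_mm_2d_3d(a, b, offs, multiple=16):
    """Hypothetical reference for a jagged 2d input times a 3d input.

    a:    (total_rows x k) matrix; groups are stacked along the rows.
    b:    list of (k x n) matrices, one per group.
    offs: ASSUMED to hold the cumulative end row of each group in `a`.
    multiple: every per-group dimension must be divisible by this
              (16 mirrors the limitation stated in the PR description).
    """
    if len(offs) != len(b):
        raise ValueError("need one offset per group")
    if offs and offs[-1] != len(a):
        raise ValueError("last offset must equal the number of rows in a")

    out = []
    start = 0
    for group, end in enumerate(offs):
        if end < start:
            raise ValueError("offsets must be non-decreasing")
        rows = end - start
        k = len(b[group])
        n = len(b[group][0])
        # The per-group size check; for dynamic sizes the real
        # implementation would have to do this on the device.
        if rows % multiple or k % multiple or n % multiple:
            raise ValueError(f"group {group}: dims must be multiples of {multiple}")
        block = a[start:end]
        if any(len(row) != k for row in block):
            raise ValueError(f"group {group}: inner dimensions do not match")
        out.extend(matmul(block, b[group]))
        start = end
    return out
```

With `multiple=1` the size check is effectively disabled, which makes it easy to try the grouping logic on small inputs; an empty group (two equal consecutive offsets) simply contributes no output rows.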

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148531
Approved by: https://github.com/drisspg
2025-03-11 21:49:46 +00:00
External [ROCm] Bump AOTriton to 0.9.2b (#148433) 2025-03-07 22:10:07 +00:00
Modules [ROCm][Windows] Fix OpenMP Flags for clang-cl (#148097) 2025-03-10 22:47:15 +00:00
Modules_CUDA_fix [NVIDIA] Full Family Blackwell Support codegen (#145436) 2025-01-24 04:36:00 +00:00
public [ROCm][Windows] Enable hipblaslt for Windows (#148563) 2025-03-10 21:07:16 +00:00
Allowlist.cmake
BuildVariables.cmake
Caffe2Config.cmake.in [Submodule] Remove third-party onnx-tensorrt (#126542) 2024-05-19 22:34:24 +00:00
CheckAbi.cmake remove abi uncertainty and potential abi conflict (#94306) 2023-02-09 09:54:04 +00:00
cmake_uninstall.cmake.in
Codegen.cmake [WIP] Initial implementation of Grouped Gemm API (#148531) 2025-03-11 21:49:46 +00:00
DebugHelper.cmake
Dependencies.cmake [ROCm][Windows] Enable hipblaslt for Windows (#148563) 2025-03-10 21:07:16 +00:00
FlatBuffers.cmake
GoogleTestPatch.cmake Simplify cmake code (#91546) 2023-02-08 01:05:19 +00:00
IncludeSource.cpp.in
iOS.cmake [executorch] Update iOS toolchain with a modern cmake syntax. (#115799) 2023-12-15 00:51:30 +00:00
Metal.cmake [MPS] Support includes in metal objects (#145087) 2025-01-18 05:35:22 +00:00
MiscCheck.cmake Remove CAFFE2_USE_EXCEPTION_PTR (#147247) 2025-03-06 02:56:23 +00:00
prioritized_text.txt [Build] Add linker script optimization (#121975) 2024-04-09 20:22:25 +00:00
ProtoBuf.cmake [BE] Cleanup CMake flag suppressions (#97584) 2023-03-27 18:46:09 +00:00
ProtoBufPatch.cmake Migrate PyTorch to C++17 (#85969) 2022-12-08 02:27:48 +00:00
Summary.cmake Revert "Reverting the PR adding Kleidiai-based int4 kernels (#145392)" (#145505) 2025-01-23 18:50:59 +00:00
TorchConfig.cmake.in Revert "Reverting the PR adding Kleidiai-based int4 kernels (#145392)" (#145505) 2025-01-23 18:50:59 +00:00
TorchConfigVersion.cmake.in
VulkanCodegen.cmake [BE][CMake] Use FindPython module (#124613) 2024-05-29 13:17:35 +00:00
VulkanDependencies.cmake [Vulkan] Remove GLSL Code Gen (#91912) 2023-01-10 20:29:47 +00:00