pytorch/test/cpp
PaulZhang12 3ed5f1fb77 [CUDA][cuBLAS] Aten GEMM overload for FP32 output from FP16/BF16 inputs (#150812)
Enable FP32 output from FP16/BF16 GEMMs in aten with cuBLAS. Accumulation for these GEMMs is generally already done in FP32. Adds the functionality to the following aten operators:
* mm
* bmm
* addmm
* baddmm
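
To illustrate why keeping the FP32 accumulator as the output can matter, here is a minimal pure-Python sketch. It is not the cuBLAS or aten code path: it emulates FP16 and FP32 rounding with the standard `struct` module, and the helper names (`to_fp16`, `to_fp32`, `dot_fp32_accumulate`) are hypothetical and introduced only for this example. It computes a single dot product (one element of a GEMM) from FP16 inputs with an FP32 accumulator, then compares keeping the FP32 result against rounding it back down to FP16.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision (FP16) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round x to the nearest IEEE 754 single-precision (FP32) value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def dot_fp32_accumulate(a, b):
    """Dot product of FP16 inputs using an emulated FP32 accumulator."""
    acc = 0.0
    for x, y in zip(a, b):
        # Each product and each running sum is rounded to FP32.
        acc = to_fp32(acc + to_fp32(x * y))
    return acc


K = 4096                       # reduction dimension (illustrative choice)
a = [to_fp16(0.1)] * K         # FP16 inputs
b = [to_fp16(0.1)] * K

out_fp32 = dot_fp32_accumulate(a, b)   # keep the accumulator as the output
out_fp16 = to_fp16(out_fp32)           # round the output down to FP16
exact = K * a[0] * b[0]                # exactly representable in a Python float

print("FP32 output error:", abs(out_fp32 - exact))
print("FP16 output error:", abs(out_fp16 - exact))
```

Near 40, adjacent FP16 values are `2**-5` apart, so the FP16 output can only land on that coarse grid, while the FP32 output retains the bits the accumulator already computed.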

Follow-up to customer issue: https://github.com/pytorch/pytorch/issues/146241#issuecomment-2781889390

Differential Revision: [D73126191](https://our.internmc.facebook.com/intern/diff/D73126191)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150812
Approved by: https://github.com/ngimel, https://github.com/eqy
2025-04-18 01:53:26 +00:00
aoti_abi_check [AOTI] Fix complex64 not defined (#132810) 2024-08-08 18:08:23 +00:00
aoti_inference [AOTInductor] Add states for constant folding process (#151273) 2025-04-17 16:41:38 +00:00
api Set requires grad in TensorMaker::make_tensor() (#148255) 2025-03-29 08:06:42 +00:00
c10d [c10d] Add _allgather_base , reduce_scatter , and _reduce_scatter_base into ProcessGroupMPI to enable FSDP with MPI backend (#150162) 2025-04-14 19:31:38 +00:00
common
dist_autograd Set RUNPATH so installed tests can find the required shared libraries (#136627) 2024-10-25 09:38:08 +00:00
jit [Profiler/Easy] Remove temp flag for on-demand Memory Snapshot (#151068) 2025-04-11 18:50:25 +00:00
lazy Introduce cache clearing APIs for the lazy graph executor (#144489) 2025-01-29 17:38:01 +00:00
lite_interpreter_runtime Add None return type to init -- tests (#132352) 2024-08-01 15:44:51 +00:00
monitor
profiler [codemod] Fix a few unused-variable issues in pytorch (#143517) 2024-12-19 00:18:08 +00:00
rpc [rpc] Fix unit test after c10::nullopt removal (#143690) 2024-12-20 23:36:07 +00:00
tensorexpr [CUDA][cuBLAS] Aten GEMM overload for FP32 output from FP16/BF16 inputs (#150812) 2025-04-18 01:53:26 +00:00
__init__.py