pytorch/torch/_inductor/codegen/cuda
Kai Londenberg 5d81ade484 [Inductor max autotune] Multithreaded Precompilation (#119386)
When using the Cutlass backend, the compilation
of CUDA source files can dominate the runtime of the
benchmarking performed as part of autotuning.

This change adds a multithreaded precompilation phase, which serves to pre-populate the compilation cache (both in-memory and, optionally, an on-disk sccache).

It also ensures that no unnecessary compilation
and benchmarking steps are performed, as was previously the case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119386
Approved by: https://github.com/aakhundov
2024-02-09 16:11:30 +00:00
cutlass_lib_extensions Delete a bunch of type-ignores (#113990) 2023-11-18 02:48:38 +00:00
__init__.py
cuda_cpp_scheduling.py Delete a bunch of type-ignores (#113990) 2023-11-18 02:48:38 +00:00
cuda_env.py Enable local_partial_types (#118467) 2024-01-28 13:38:22 +00:00
cuda_kernel.py [Inductor max autotune] Multithreaded Precompilation (#119386) 2024-02-09 16:11:30 +00:00
cuda_template.py [Inductor CUTLASS backend] Epilogue fusion codegen (Step 1) (#110890) 2023-11-06 19:42:10 +00:00
cutlass_epilogue_gen.py Enable possibly-undefined error code (#118533) 2024-01-30 21:07:01 +00:00
cutlass_utils.py Enable local_partial_types (#118467) 2024-01-28 13:38:22 +00:00
device_op_overrides.py [Inductor Intel GPU backend Upstream] Step 1/3: Generalize device-bias code in code generation. (#116020) 2023-12-22 08:42:51 +00:00
gemm_template.py [BE]: Enable F821 and fix bugs (#116579) 2024-01-01 08:40:46 +00:00