Mirror of https://github.com/zebrajr/pytorch.git, synced 2025-12-07 12:21:27 +01:00.
When using the Cutlass backend, compiling the CUDA source files can dominate the runtime of the benchmarking done during autotuning. This change adds a multithreaded precompilation phase that pre-populates the compilation cache (both the in-memory cache and, if one is configured, an on-disk sccache). It also ensures that no unnecessary compilation and benchmarking steps are performed, which was previously the case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119386
Approved by: https://github.com/aakhundov
| Name |
|---|
| cutlass_lib_extensions |
| __init__.py |
| cuda_cpp_scheduling.py |
| cuda_env.py |
| cuda_kernel.py |
| cuda_template.py |
| cutlass_epilogue_gen.py |
| cutlass_utils.py |
| device_op_overrides.py |
| gemm_template.py |
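The precompilation idea from the commit message can be sketched in plain Python. This is a minimal illustration under stated assumptions, not PyTorch Inductor's actual implementation: `source_key`, `precompile`, `autotune`, `fake_compile` and `fake_benchmark` are hypothetical names, the compile step stands in for an nvcc invocation, and the on-disk sccache layer is not modeled. Distinct, not-yet-cached sources are compiled in parallel on a thread pool to fill an in-memory cache, and duplicate candidates are collapsed so each one is compiled and benchmarked only once.

```python
import hashlib
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def source_key(source: str) -> str:
    """Hypothetical cache key: a SHA-256 hash of the kernel source."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


def precompile(
    sources: Iterable[str],
    compile_fn: Callable[[str], object],
    cache: Dict[str, object],
    max_workers: int = 4,
) -> Dict[str, object]:
    """Compile every distinct, not-yet-cached source in parallel (hypothetical helper)."""
    pending: Dict[str, str] = {}
    for src in sources:
        k = source_key(src)
        if k not in cache:
            pending.setdefault(k, src)  # duplicate sources are compiled only once
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {k: pool.submit(compile_fn, src) for k, src in pending.items()}
        # Results are stored from this thread only, so the dict needs no lock.
        for k, fut in futures.items():
            cache[k] = fut.result()
    return cache


def autotune(
    sources: Iterable[str],
    compile_fn: Callable[[str], object],
    benchmark_fn: Callable[[object], float],
    cache: Dict[str, object],
) -> str:
    """Precompile all candidates, then benchmark each distinct one exactly once."""
    candidates = list(dict.fromkeys(sources))  # drop duplicates, keep order
    precompile(candidates, compile_fn, cache)
    timings = {src: benchmark_fn(cache[source_key(src)]) for src in candidates}
    return min(timings, key=timings.get)


# Usage with stand-ins for the real compiler and benchmark (hypothetical).
compile_calls = []
calls_lock = threading.Lock()


def fake_compile(src: str) -> str:
    # Runs on worker threads, so the shared call log is guarded by a lock.
    with calls_lock:
        compile_calls.append(src)
    return f"binary({src})"


def fake_benchmark(binary: object) -> float:
    return float(len(str(binary)))  # pretend shorter binaries run faster


cache: Dict[str, object] = {}
best = autotune(["kernel_ab", "kernel_a", "kernel_ab"], fake_compile, fake_benchmark, cache)
best_again = autotune(["kernel_a"], fake_compile, fake_benchmark, cache)  # cache hit
```

A thread pool is a reasonable fit for this pattern because each real compilation would run as an external compiler process, so the Python threads would mostly wait on subprocesses rather than compete for the GIL. The second `autotune` call triggers no compilation at all, which mirrors the goal of skipping redundant compile and benchmark steps.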