PyTorch Fuser

The fuser accepts subgraphs wrapped in "fusion nodes" and tries to execute them by just-in-time (JIT) compiling kernels that run all the graph operations.
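
To make this concrete, the entry points declared in interface.h can be driven roughly as follows. This is a minimal sketch: the runFusedSubgraph wrapper and its arguments are illustrative assumptions, not code from the tree.

```cpp
#include <torch/csrc/jit/codegen/fuser/interface.h>

using namespace torch::jit;

// Hypothetical driver: `fusion_group` is a fusion node produced by the
// graph fuser pass, and `stack` carries its runtime tensor inputs.
void runFusedSubgraph(const Node* fusion_group, Stack& stack) {
  // Upfront: registering the fusion triggers upfront compilation
  // (fallback code, some shape inference) and returns a cache key.
  const int64_t key = registerFusion(fusion_group);

  // Runtime: generates, compiles (or fetches from cache), and launches
  // a kernel specialized to the concrete inputs on the stack.
  runFusion(key, stack);
}
```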

Code Organization

The fuser is designed hierarchically, with device-independent logic eventually deferring to device-specific logic and implementation. The device-specific code is (mostly) found in each device's subdirectory. The device-independent logic has six components:

  • The Interface (interface.h/cpp) has functions to register and run fusions, to query whether fusion is available on a device, and to aid debugging.
  • The Compiler (compiler.h/cpp) performs "upfront" and "runtime" compilation. When fusions are registered, upfront compilation produces fallback code and performs some shape inference. When a fusion is run, runtime compilation invokes code generation and the device-specific compilation logic.
  • The Code Generator (codegen.h/cpp) produces the string to be compiled on the device.
  • The Executor (executor.h/cpp) runs requested fusions. It performs shape inference, expands tensors as necessary, determines the device to run on, acquires a cached compiled kernel or requests that the Compiler produce a new one, invokes device-specific code to launch the kernel, and updates the stack (see the sketch after this list).
  • The Fallback (fallback.h/cpp) runs subgraphs that can't be fused, either because shape inference didn't determine a common tensor size or because the device the tensors are on doesn't support fusion.
  • The Kernel Specification Cache (kernel_cache.h/cpp) is a thread-safe cache holding the device-independent specifications produced during upfront compilation. These specifications each have their own thread-safe stores of compiled kernels that the Executor checks before requesting runtime compilation.
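
Taken together, the runtime path through these components amounts to a try-fuse-else-fallback dispatch. The sketch below paraphrases that flow under the assumption that the Executor's runFusion reports failure by returning false; it is illustrative, not copied from interface.cpp.

```cpp
#include <torch/csrc/jit/codegen/fuser/executor.h>
#include <torch/csrc/jit/codegen/fuser/fallback.h>

using namespace torch::jit;

// Illustrative paraphrase of the dispatch the Interface performs when
// a registered fusion is run.
void runFusionOrFallback(int64_t key, Stack& stack) {
  // The Executor looks up the device-independent specification, infers
  // shapes, and checks the specification's per-argument store of
  // compiled kernels before requesting runtime compilation. It signals
  // failure (e.g. no common tensor size) by returning false.
  const bool fused = fuser::runFusion(key, stack);
  if (!fused) {
    // The Fallback runs the original, unfused subgraph instead.
    fuser::runFallback(key, stack);
  }
}
```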

The device-specific logic for compiling and running fused code lives in FusedKernelCPU (cpu/fused_kernel.h/cpp) and FusedKernelCUDA (cuda/fused_kernel.h/cpp).
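
Both specialize a common FusedKernel interface declared in fused_kernel.h. The sketch below shows the pattern in simplified form; the member functions shown and the per-device behavior described in the comments (host-compiler compilation on CPU, NVRTC on CUDA) are assumptions kept deliberately minimal, not the exact class definitions.

```cpp
#include <cstdint>
#include <vector>

// Simplified sketch of the device-specialization pattern; the real
// FusedKernel also carries the generated code string and descriptors
// for the kernel's inputs and outputs.
struct FusedKernel {
  virtual ~FusedKernel() = default;
  // Launches the compiled kernel over `numel` elements; `arguments`
  // holds raw pointers to tensor buffers and size metadata.
  virtual void launch_raw(uint32_t numel, std::vector<void*>& arguments) const = 0;
};

// CPU: compiles the generated C++ into a shared library with the host
// compiler and resolves the kernel entry point via dlopen/dlsym.
struct FusedKernelCPU : FusedKernel {
  void launch_raw(uint32_t numel, std::vector<void*>& arguments) const override {
    // Call through the dlsym'd function pointer.
  }
};

// CUDA: compiles the generated CUDA C++ with NVRTC and launches the
// resulting kernel through the CUDA driver API.
struct FusedKernelCUDA : FusedKernel {
  void launch_raw(uint32_t numel, std::vector<void*>& arguments) const override {
    // Launch the NVRTC-compiled kernel (e.g. via cuLaunchKernel).
  }
};
```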