**Summary:** This PR contains the infrastructure of a new CUDA fuser. This CUDA fuser is based on many of the same principles as TensorExpressions and Halide, but the implementation is built from the ground up. The fusion pass itself is similar to the default CUDA fuser's, though it has undergone some refactoring and uses the new code generation infrastructure. For those interested in how the code generation in this PR works, we recommend reviewing _test/cpp/jit/test_gpu_fusion.cpp_ as well as the long comment section at the beginning of _torch/csrc/jit/codegen/cuda/transform_replay.h_.

One of the largest differences between our approach and that of TVM/Halide is the concept of the "TensorView". At a high level, a TensorView should be thought of much like a Tensor in PyTorch: it is an N-D object that can undergo transformations which change its dimensionality. Dimensionality changes are done through the operations split/merge/reorder/computeAt. These transformations are similar to split/fuse/reorder/compute_at in TVM; they modify how a tensor is iterated over to generate GPU code. Notably, in our scheme these transformations are applied to individual tensors and only impact how that tensor is generated.

**Warning:** This PR is purposefully not feature complete with the current fuser. We wanted to separate the infrastructure from the fusion capabilities. Once this lands, smaller incremental PRs will be submitted to expand the capabilities of the fuser.

**Short-term goals:** Parity with the current CUDA fuser (including performance):

- Dynamic shapes (no recompilation)
- Implicit handling of broadcast (broadcasted tensors are treated as tensors of the broadcasted size in the generated code)
- Dropout

**Mid-term goals:**

- Transposes fused with pointwise operations, where the transpose involves only 2 axes (across the fused operation)
- 1-D reductions fused with pointwise operations

Pull Request resolved: https://github.com/pytorch/pytorch/pull/34785
Reviewed By: ZolotukhinM
Differential Revision: D20650977
Pulled By: soumith
fbshipit-source-id: ee39c95a880e1b9822e874ed4cc180971572bf63
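A minimal sketch of the TensorView scheduling style described above (the helper `makeDummyTensor` and the exact `reorder`/`computeAt` signatures are illustrative assumptions, not guaranteed API; see _test/cpp/jit/test_gpu_fusion.cpp_ for real usage):

```cpp
// Build a 2-D tensor view [I, J] and a pointwise op on it (illustrative).
TensorView* tv0 = makeDummyTensor(2);
TensorView* tv1 = add(tv0, new Float(1.0f));

// Transformations change how tv1 is iterated over, not its logical contents.
tv1->split(0, 128);      // [I, J] -> [I/128, 128, J]
tv1->reorder({{1, 2}});  // swap axes 1 and 2 -> [I/128, J, 128]
tv0->computeAt(tv1, 1);  // compute tv0 inside tv1's outermost loop
```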
#pragma once

#include <ATen/cuda/CUDAContext.h>
#include <torch/csrc/WindowsTorchApiMacro.h>

#include <torch/csrc/jit/codegen/cuda/fusion.h>

/*
 * The APIs exposed in this file are used by manager.h/cpp.
 *
 * The code here handles CUDA code generation and execution from Fusion IR.
 * NVRTC is used for kernel compilation. The CUDA Driver API is used to load
 * and execute the compiled kernel.
 *
 * A stringification trick is used to unify the IO data structure for kernel
 * execution. We stringify the data structure and insert it directly into the
 * generated CUDA source to avoid searching for header files at runtime.
 * The header file is included twice: once as C++ code so host code can
 * prepare the IO data, and a second time for stringification.
 */
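
// Illustrative sketch of the NVRTC + CUDA Driver API flow that compileKernel
// is expected to follow (simplified: error handling, compile options, and the
// real generated source are omitted; "kernel" is a placeholder name):
//
//   nvrtcProgram prog;
//   nvrtcCreateProgram(&prog, kernel_src, "fusion.cu", 0, nullptr, nullptr);
//   nvrtcCompileProgram(prog, 0, nullptr);
//   size_t ptx_size;
//   nvrtcGetPTXSize(prog, &ptx_size);
//   std::vector<char> ptx(ptx_size);
//   nvrtcGetPTX(prog, ptx.data());
//   nvrtcDestroyProgram(&prog);
//   cuModuleLoadData(&module, ptx.data());
//   cuModuleGetFunction(&function, module, "kernel");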

namespace torch {
namespace jit {
namespace fuser {
namespace cuda {

// Include the IO data structure for host code: in this expansion, STRINGIFY
// passes its arguments through unchanged, so the header compiles as plain C++.
#define STRINGIFY(...) __VA_ARGS__
#include <torch/csrc/jit/codegen/cuda/data_struct_str.h>
#undef STRINGIFY
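
// Minimal self-contained sketch of the double-include trick, using a
// hypothetical toy_struct_str.h whose entire body is wrapped in STRINGIFY:
//
//   // toy_struct_str.h
//   STRINGIFY(
//     struct Pair { int a; int b; };
//   )
//
//   #define STRINGIFY(...) __VA_ARGS__  // 1st include: compiles as C++
//   #include "toy_struct_str.h"
//   #undef STRINGIFY
//
//   #define STRINGIFY(...) #__VA_ARGS__ // 2nd include: yields a string literal
//   static auto pair_src =
//   #include "toy_struct_str.h"
//       ;
//   #undef STRINGIFY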

class CudaKernel {
 public:
  CudaKernel() = default;

  CUmodule& getModule() {
    return module_;
  }

  CUfunction& getFunction() {
    return function_;
  }

  int16_t device_;
  CUmodule module_;
  CUfunction function_;
  int max_blocks_;

  // WARNING:
  // The block and grid dimension setters below are for testing purposes only.
  // They are not intended for general use and should only be used with
  // the runTestKernel() function.
  void block(unsigned int x = 1, unsigned int y = 1, unsigned int z = 1) {
    block_ = dim3(x, y, z);
  }
  void grid(unsigned int x = 1, unsigned int y = 1, unsigned int z = 1) {
    grid_ = dim3(x, y, z);
  }

  dim3 block_;
  dim3 grid_;
};
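
// Illustrative sketch of how a compiled CudaKernel is expected to be launched
// through the Driver API (simplified; the argument marshalling is hypothetical):
//
//   void* args[] = {&input_ptr, &output_ptr};  // hypothetical kernel params
//   cuLaunchKernel(
//       entry.getFunction(),
//       entry.grid_.x, entry.grid_.y, entry.grid_.z,
//       entry.block_.x, entry.block_.y, entry.block_.z,
//       0,        // dynamic shared memory bytes
//       nullptr,  // default stream; real code would use the current stream
//       args,
//       nullptr);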

// include IO data structure for stringification
#define STRINGIFY(...) #__VA_ARGS__
static auto typeinfo =
#include "data_struct_str.h"
    ;
#undef STRINGIFY

// Compiles a Fusion to a CUDA function:
// 1. JIT compilation via NVRTC generates CUDA C++ kernel code;
// 2. the CUDA Driver API loads the compiled code as function_.
TORCH_CUDA_API void compileKernel(Fusion& fusion, CudaKernel& entry);

// Runs the loaded kernel through function_.
// inputs/outputs are given in the sense of a PyTorch JIT IR node. This
// function wraps the IO data structure for tensors on the host.
TORCH_CUDA_API void runKernel(
    CudaKernel& entry,
    const at::ArrayRef<IValue>& inputs,
    std::vector<at::Tensor>& outputs);

// Facility API to run a kernel in tests.
TORCH_CUDA_API void runTestKernel(
    CudaKernel& entry,
    const std::vector<at::Tensor>& inputs,
    std::vector<at::Tensor>& outputs);
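
// Illustrative end-to-end test usage (hypothetical shapes; assumes a Fusion
// describing a pointwise op over 1024 elements has been built elsewhere):
//
//   CudaKernel entry;
//   compileKernel(fusion, entry);
//   entry.grid(8);     // 8 blocks
//   entry.block(128);  // 128 threads per block
//   std::vector<at::Tensor> inputs{at::randn({1024}, at::kCUDA)};
//   std::vector<at::Tensor> outputs{at::empty({1024}, at::kCUDA)};
//   runTestKernel(entry, inputs, outputs);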

} // namespace cuda
} // namespace fuser
} // namespace jit
} // namespace torch