Summary:
Things changed in this PR that require review:
1. aten/src/ATen/core/interned_strings.h
2. torch/csrc/jit/ir/alias_analysis.h : exposing createValue to allow efficient mutation
3. torch/csrc/jit/runtime/symbolic_shape_registry.cpp : added gelu/tanh/erf to the registry
4. torch/jit/_script.py : throws when scripting a model that uses autocast as a decorator, since that is not supported
nvfuser code update:
1. codegen improvements and performance tuning
2. integration bug fixes for shape expression logic
3. kernel segmentation update to address perf regression from horizontal fusion
4. scalar CPU tensor promotion to support inter-device operations between a CPU scalar tensor and a CUDA tensor (see the sketch after this list)
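As a rough illustration of the pattern item 4 refers to, here is a minimal eager-mode libtorch snippet that mixes a 0-dim CPU "scalar" tensor with a CUDA tensor. It does not exercise nvfuser itself, and the program structure is just an assumption made for the sake of a runnable example.

#include <iostream>
#include <torch/torch.h>

int main() {
  // Skip quietly on machines without a CUDA device.
  if (!torch::cuda::is_available()) {
    return 0;
  }
  // A CUDA tensor and a 0-dim CPU tensor acting as a scalar.
  auto cuda_t = torch::ones({4}, torch::kCUDA);
  auto cpu_scalar = torch::tensor(2.0);  // lives on CPU
  // The CPU scalar tensor is promoted so the op runs on the CUDA device;
  // this is the inter-device case the fuser-side promotion targets.
  auto out = cuda_t * cpu_scalar;
  std::cout << out << std::endl;
  return 0;
}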
Things reverted from local changes:
aten::gelu with approximation (tracked in PR: https://github.com/pytorch/pytorch/pull/61439)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72127
Reviewed By: HamidShojanazeri
Differential Revision: D34113233
Pulled By: jbschlosser
fbshipit-source-id: b82cde32b71e324eca0ea57cb8c9f9647278ca74
(cherry picked from commit e009bc5c4e)
41 lines
1.2 KiB
C++
#pragma once

#include <c10/macros/Export.h>

#include <torch/csrc/jit/ir/ir.h>

/*
 * This file handles compilation and execution of a CudaFusionGroup.
 *
 * A CudaFusionGroup node comes with `attr::Subgraph` containing the computation
 * graph. We compile the graph to generate a CUDA function and cache it in a
 * registry. We cache & reuse kernels across nodes sharing an identical graph.
 *
 * After compilation, we assign the cache key to the node as an integer
 * attribute, `attr::cache_id`.
 */

namespace torch {
namespace jit {
namespace fuser {
namespace cuda {

// Get `fusion_node` ready for execution:
// find or compile the `CudaKernel` for the graph stored in `attr::Subgraph`;
// this function assigns `attr::cache_id` to `fusion_node`.
TORCH_CUDA_CU_API void compileCudaFusionGroup(Node* fusion_node);

// Execute `fusion_node`.
// The current protocol is that the function allocates output tensors and
// appends them to `stack` after execution.
// TODO: support shape inference. Right now we only handle static shapes.
TORCH_CUDA_CU_API void runCudaFusionGroup(
    const Node* fusion_node,
    Stack& stack);

TORCH_CUDA_CU_API void CudaFuseGraph(std::shared_ptr<Graph>& graph);

} // namespace cuda
} // namespace fuser
} // namespace jit
} // namespace torch
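To show how these declarations fit together, here is a hedged usage sketch. The include path, the wrapper function, and the loop over the graph are assumptions made for illustration (the node kind `prim::CudaFusionGroup` is the one the fuser pass produces); this is not code from the file above.

#include <torch/csrc/jit/codegen/cuda/manager.h>  // assumed path for the header above
#include <torch/csrc/jit/ir/ir.h>

namespace torch {
namespace jit {

// Hypothetical driver: fuse eligible subgraphs, compile each fusion node,
// then execute it against an input stack.
void fuseCompileAndRun(std::shared_ptr<Graph>& graph, Stack& stack) {
  // Rewrite eligible subgraphs into CudaFusionGroup nodes.
  fuser::cuda::CudaFuseGraph(graph);

  for (Node* node : graph->nodes()) {
    if (node->kind() != prim::CudaFusionGroup) {
      continue;
    }
    // Finds or compiles the kernel for `attr::Subgraph` and stamps
    // `attr::cache_id` on the node.
    fuser::cuda::compileCudaFusionGroup(node);
    // Inputs are expected on `stack`; outputs are appended after execution.
    fuser::cuda::runCudaFusionGroup(node, stack);
  }
}

} // namespace jit
} // namespace torch

Because kernels are cached and reused across nodes that share an identical subgraph, repeated compile calls for the same graph hit the registry; `attr::cache_id` is simply the integer key into that cache.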