pytorch/torch/csrc/jit/codegen/cuda/docs
jjsjann123 99e0a87bbb [nvFuser] Latency improvements for pointwise + reduction fusion (#45218)
Summary:
A lot of changes are in this update; some highlights:

- Added Doxygen config file
- Split the fusion IR (higher-level, TE-like IR) from the kernel IR (lower-level, CUDA-like IR)
- Improved latency with dynamic shape handling for the fusion logic
- Prevent recompilation for pointwise + reduction fusions when not needed
- Improvements to inner dimension reduction performance
- Added a cache mapping inputs to kernels and kernel launch parameters, with an eviction policy
- Added reduction fusions with multiple outputs (still single reduction stage)
- Fixed code generation bugs for symbolic tiled GEMM example
- Added thread predicates to prevent shared memory from being loaded multiple times
- Improved placement of thread synchronization around shared memory and removed a read-before-write race
- Fixed FP16 reduction fusions whose output would come back as FP32

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45218

Reviewed By: ezyang

Differential Revision: D23905183

Pulled By: soumith

fbshipit-source-id: 12f5ad4cbe03e9a25043bccb89e372f8579e2a79
2020-09-24 23:17:20 -07:00
images [nvFuser] Latency improvements for pointwise + reduction fusion (#45218) 2020-09-24 23:17:20 -07:00
.gitignore [nvFuser] Latency improvements for pointwise + reduction fusion (#45218) 2020-09-24 23:17:20 -07:00
documentation.h [nvFuser] Latency improvements for pointwise + reduction fusion (#45218) 2020-09-24 23:17:20 -07:00
fuser.doxygen [nvFuser] Latency improvements for pointwise + reduction fusion (#45218) 2020-09-24 23:17:20 -07:00
main_page.md [nvFuser] Latency improvements for pointwise + reduction fusion (#45218) 2020-09-24 23:17:20 -07:00