Summary:
Have basic reduction fusion working, and have improved code generator to approach performance of eager mode reductions. Coming soon will be pointwise-reduction fusions in a way that should prevent the possibility of hitting regressions. Also working on performant softmax kernels in the code generator which may be our next fusion target.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40864
Reviewed By: ngimel
Differential Revision: D22392877
Pulled By: soumith
fbshipit-source-id: 457448a807d628b1035f6d90bc0abe8a87bf8447
Summary:
We've got quite a few things going on, preparing a push back to upstream so we don't get too desynced.
- Major refactor of transform replay. It is now far more robust and fixes bugs discovered in reductions. Preparing for extension to explicit broadcast ops which will be the last major memory pattern for op coverage. Broadcast ops will allow us to express up to and potentially beyond norms and gemms.
- Initial runtime expression evaluator. This allows us to evaluate expressions at runtime. Will be useful for determining our grid/block layout at runtime, so we don't have to manually compute them according to the code we're trying to generate.
- Moving to int64 and double for scalar representations to match PyTorch JIT.
- Improvements in codegen interface where we return Tensor like object instead of parent class Val.
- Add `addcmul` and `lerp` ops
- General updates, fixes, test additions, test inprovements.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39579
Differential Revision: D21974001
Pulled By: soumith
fbshipit-source-id: 7f7ccc91593466e948f3ce90f8f9b7fbc5c28de2
Summary:
Adds reduction support for the code generator. Reductions are fully supported with split/merge/reorder/rfactor/computeAt/unroll operators. There is also cross thread (intra-block) reduction support.
The two remaining pieces missing for reduction support is:
- Safety: If cross thread reduction was used, child operators shouldn't be able to bind that thread dim anymore
- Cross block reduction: we will want inter-block reduction support to match parity with tensor iterator
PR also provides FP16 support for fusions now. We insert casts on FP16 inputs to FP32, and we insert casts to FP16 on FP16 outputs.
Also working towards reductions and shape inference for reductions in the fusion pass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38627
Reviewed By: albanD
Differential Revision: D21663196
Pulled By: soumith
fbshipit-source-id: 3ff2df563f86c39cd5821ab9c1148149e5172a9e
Summary:
This PR added more supported operations in CUDA fuser. We are covering major point-wise operations supported in legacy fuser.
In an attempt to adapt to legacy executor:
1. added an naive shape propagation pass on pytorch JIT IR;
2. small refactor on graph partitioning;
3. fallback interpreter execution of fusion group;
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37849
Reviewed By: yf225
Differential Revision: D21444320
Pulled By: soumith
fbshipit-source-id: 712e18ab8497f8d58a07e6f8d200cdab52cf0d74
Summary:
Unrolling support has been added in a way that we get good performing code on GPUs. Not sure how long this link will last but an example of a generated unrolled kernel is:
https://godbolt.org/z/i0uAv3
What can be seen from there is multiple calls of "ld.global.f32" without "ld.store.f32" in between them (and vice versa). This means that we are launching multiple loads that can be run in parallel, as well as multiple stores that can be run in parallel. This can be a crucial optimization for memory bound kernels. This was generally a point of concern in TVM as an attempt of a similar kernel from TVM produces: https://godbolt.org/z/Vu97vG which surrounds load - store pairs in conditional branches preventing the benefits of unrolling.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36435
Reviewed By: ZolotukhinM
Differential Revision: D21024011
Pulled By: soumith
fbshipit-source-id: e852e282fa7a304aba962e1926f756098c011fe0
Summary:
This PR completely refactors the code lowering process from our IR to CUDA. Before we had one giant step that would go from a relatively high level IR straight to CUDA, now we're lowering this first into concepts like ForLoop, IfThenElse, TensorIndex, Allocate. This lowering will allow us to do more complex code lowering like reductions and unrolling. Unrolling will quickly follow this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36199
Reviewed By: dzhulgakov
Differential Revision: D20925220
Pulled By: soumith
fbshipit-source-id: 8f621c694c68a1aad8653e625d7287fe2d8b35dc