The capture_triton decorator returns a wrapper whose calls are routed
through the triton kernel wrapper HOP (higher-order operator). This is
useful for make_fx tracing and non-strict export. However, the HOP
dispatch is slow (~1ms per call) and not necessary in certain situations.
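A minimal sketch of the flow being described, assuming the
torch._library.capture_triton entry point (import paths may have moved in
later releases); add_kernel is an illustrative elementwise kernel, not
taken from the PR:
```python
import torch
import triton
import triton.language as tl
from torch.fx.experimental.proxy_tensor import make_fx
from torch._library import capture_triton

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n_elements = out.numel()
    grid = (triton.cdiv(n_elements, 16),)
    # The wrapped launch dispatches through the triton kernel wrapper HOP,
    # so it shows up as a graph node instead of an opaque side effect.
    capture_triton(add_kernel)[grid](x, y, out, n_elements, BLOCK_SIZE=16)
    return out

x = torch.randn(32, device="cuda")
gm = make_fx(add)(x, x)  # the kernel launch is captured in gm.graph
```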
This PR skips the HOP dispatch for any capture_triton-wrapped triton
kernel that is registered as the implementation of a `@triton_op` custom
operator. We do this by adding a new thread-local flag that controls
whether a capture_triton-wrapped triton kernel goes through HOP dispatch.
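A hypothetical illustration of the thread-local-flag pattern described
above; the names here are invented for illustration and do not match
PyTorch's internals:
```python
import contextlib
import threading

# Hypothetical flag holder; each thread sees its own value, so disabling
# dispatch on one thread does not affect concurrent tracing on others.
_tls = threading.local()

def hop_dispatch_enabled() -> bool:
    # Default to True: capture_triton-wrapped kernels normally dispatch
    # through the HOP.
    return getattr(_tls, "use_hop_dispatch", True)

@contextlib.contextmanager
def disable_hop_dispatch():
    # Flip the flag for the current thread only, restoring it on exit.
    prev = hop_dispatch_enabled()
    _tls.use_hop_dispatch = False
    try:
        yield
    finally:
        _tls.use_hop_dispatch = prev
```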
Test Plan:
- new test and existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132822
Approved by: https://github.com/SherlockNoMad
This is the initial version of an API to create custom operators whose
implementations are backed by triton kernels. While user-defined triton
kernels work out-of-the-box with torch.compile, you may wish to
construct a custom operator if you need to compose with other PyTorch
subsystems, like Tensor subclasses or vmap.
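A minimal sketch of the intended usage, assuming the
torch._library.triton_op and capture_triton entry points from this PR;
the operator name mylib::add and the kernel are illustrative:
```python
import torch
import triton
import triton.language as tl
from torch._library import capture_triton, triton_op

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

# triton_op infers the schema from the type annotations; mutates_args
# declares which inputs the kernel writes to (none here, since out is
# allocated inside the op).
@triton_op("mylib::add", mutates_args=())
def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n_elements = out.numel()
    grid = (triton.cdiv(n_elements, 16),)
    capture_triton(add_kernel)[grid](x, y, out, n_elements, BLOCK_SIZE=16)
    return out

# The result is a genuine custom op, so it composes with subsystems such
# as torch.compile, vmap, and Tensor subclasses:
y = torch.compile(add)(torch.randn(32, device="cuda"),
                       torch.randn(32, device="cuda"))
```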
I'm hoping to get design feedback on this and ship it so that we can
begin experimenting with it alongside customers.
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130637
Approved by: https://github.com/albanD