Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48718
This PR rewrites structured kernels to use a class-based mechanism (instead of defining free-standing meta and impl functions, you define them as methods on a class), and adds enough customizability to the class to support TensorIterator. To show it works, `add` is made a structured kernel. Don't forget to check https://github.com/pytorch/rfcs/pull/9 for a mostly up-to-date high-level description of what's going on here.
High-level structure of this PR (the order in which you should review the files):
* TensorMeta.h - TensorMeta is deleted entirely; instead, meta functions will call `set_output` to allocate/resize their outputs. MetaBase gets a new `maybe_get_output` virtual method for retrieving the (possibly non-existent) output tensor in a meta function; this makes it easier to implement special promotion behavior, e.g., as TensorIterator does. (The first sketch after this list illustrates this interface.)
* TensorIterator.cpp - Two major changes: first, we add TensorIteratorBase::set_output, which is a "light" version of TensorIterator::set_output; it sets up the internal data structures in TensorIterator but doesn't do allocation (that is assumed to have been handled by the structured kernels framework). The control flow is: someone calls the subclass's set_output, which allocates the output and then calls the parent class (TensorIteratorBase::set_output) to populate the fields in TensorIterator so that later TensorIterator phases can keep track of the tensor (also shown in the first sketch after this list). Second, we add some tests for meta tensors and skip the parts of TensorIterator that are not needed when data is not available.
* tools/codegen/model.py - One new field in native_functions.yaml: structured_inherits. This lets you override the parent class of a structured meta class; normally it's MetaBase, but you can make it point at TensorIteratorBase instead for TensorIterator-based kernels.
* tools/codegen/gen.py - We now generate all of the classes we promised. It's kind of hairy because this is the first draft. Check the RFC for what the output looks like, and then follow the logic here; the second sketch after this list shows roughly what the generated classes look like. There are some complications: I need to continue to generate old-style wrapper functions even if an operator is structured, because SparseCPU/SparseCUDA/etc. won't actually use structured kernels to start. The most complicated code generation is the instantiation of `set_output`, which by and large replicates the logic in `TensorIterator::set_output`. This will continue to live in codegen for the foreseeable future, as we would like to specialize this logic per device.
* aten/src/ATen/native/UpSampleNearest1d.cpp - The previous structured kernel is ported to the new format. The changes are very modest.
* aten/src/ATen/native/BinaryOps.cpp - `add` is ported to structured; the second sketch after this list gives a rough outline of the port.
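For orientation, here is a minimal, self-contained sketch of the two pieces described in the first two bullets: a MetaBase-style interface with `set_output`/`maybe_get_output`, and a structured-kernel subclass whose `set_output` allocates and then defers to the "light" parent implementation for bookkeeping. The class names, parameter types, and allocation call here are illustrative stand-ins, not the code in this PR:

```cpp
#include <ATen/ATen.h>
#include <array>

// Illustrative sketch only -- names and exact signatures are stand-ins,
// not the verbatim classes from TensorMeta.h / TensorIterator.h.

// The MetaBase-style interface: meta functions call set_output to declare
// their outputs, and maybe_get_output to peek at a user-supplied out= tensor
// (which may be undefined), e.g. for special type promotion behavior.
struct MetaBaseSketch {
  virtual void set_output(int64_t output_idx, at::IntArrayRef sizes,
                          at::IntArrayRef strides, at::TensorOptions options,
                          at::DimnameList names) = 0;
  virtual const at::Tensor& maybe_get_output(int64_t output_idx) = 0;
  virtual ~MetaBaseSketch() = default;
};

// Stand-in for the "light" TensorIteratorBase::set_output: bookkeeping only,
// no allocation (allocation is the structured-kernels framework's job).
struct TensorIteratorBaseSketch : public MetaBaseSketch {
  void set_output(int64_t output_idx, at::IntArrayRef sizes,
                  at::IntArrayRef strides, at::TensorOptions options,
                  at::DimnameList names) override {
    // The real version records the output in TensorIterator's operand list so
    // later phases can find it; this sketch just counts registrations.
    registered_outputs_++;
  }
  int registered_outputs_ = 0;
};

// The structured-kernel subclass: allocate first, then call the parent class
// so TensorIterator's internal data structures get populated.
struct StructuredKernelSketch final : public TensorIteratorBaseSketch {
  void set_output(int64_t output_idx, at::IntArrayRef sizes,
                  at::IntArrayRef strides, at::TensorOptions options,
                  at::DimnameList names) override {
    outputs_[output_idx] = at::empty(sizes, options);             // allocate
    TensorIteratorBaseSketch::set_output(output_idx, sizes, strides, options,
                                         names);                  // bookkeeping
  }
  const at::Tensor& maybe_get_output(int64_t output_idx) override {
    return outputs_[output_idx];
  }
  std::array<at::Tensor, 1> outputs_;
};
```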
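And a rough outline of what the generated class pair and the hand-written `add` port look like under this scheme (compare the RFC); the namespaces, names, and exact TensorIteratorConfig flags below are approximations, not the verbatim generated headers or BinaryOps.cpp:

```cpp
#include <ATen/ATen.h>
#include <ATen/native/TensorIterator.h>

// Illustrative sketch of the class pair the codegen emits for a structured,
// TensorIterator-based operator like add, plus the hand-written meta/impl
// bodies. Treat names and flags as approximations.

namespace sketch_meta {
// Generated: the meta class. Because add uses structured_inherits, it derives
// from TensorIteratorBase rather than from the plain MetaBase.
struct add_Tensor : public at::TensorIteratorBase {
  void meta(const at::Tensor& self, const at::Tensor& other, at::Scalar alpha);
};
} // namespace sketch_meta

namespace sketch_native {
// Generated: the class whose impl() you define in BinaryOps.cpp.
struct add_out : public sketch_meta::add_Tensor {
  void impl(const at::Tensor& self, const at::Tensor& other, at::Scalar alpha,
            const at::Tensor& out);
};
} // namespace sketch_native

// Hand-written: meta() builds the TensorIterator, which is what ends up
// calling set_output (allocating or resizing the result as appropriate).
void sketch_meta::add_Tensor::meta(const at::Tensor& self,
                                   const at::Tensor& other,
                                   at::Scalar alpha) {
  build(at::TensorIteratorConfig()
            .set_check_mem_overlap(true)
            .add_output(maybe_get_output())
            .add_input(self)
            .add_input(other)
            .allow_cpu_scalars(true)
            .promote_inputs_to_common_dtype(true)
            .cast_common_dtype_to_outputs(true)
            .enforce_safe_casting_to_output(true));
  (void)alpha;  // the real meta also validates alpha against the common dtype
}

// Hand-written: impl() does the actual computation; the real version invokes
// the existing add_stub on *this (the TensorIterator built by meta()).
void sketch_native::add_out::impl(const at::Tensor& self,
                                  const at::Tensor& other, at::Scalar alpha,
                                  const at::Tensor& out) {
  (void)self; (void)other; (void)alpha; (void)out;  // elided in this sketch
}
```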
TODO:
* Work out an appropriate entry point for static runtime, since native:: function stubs are no longer generated
* Refactor TensorIteratorConfig construction into helper functions, like before
* Make Tensor-Scalar addition structured to fix perf regression
* Fix `verify_api_visibility.cpp`
* Refactor tools/codegen/gen.py for clarity
* Figure out why header changes resulted in undefined reference to `at::Tensor::operator[](long) const`
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: bhosmer
Differential Revision: D25278031
Pulled By: ezyang
fbshipit-source-id: 57c43a6e5df21929b68964d485995fbbae4d1f7b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48182
I'm planning to add a bunch more argument fields following
https://github.com/pytorch/pytorch/pull/45890#discussion_r503646917 and
it will be a lot more convenient if the arguments get to live
in their own dedicated struct. The type checker will tell you if
I've done it wrong. No change to output.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: ljk53
Differential Revision: D25057897
Pulled By: ezyang
fbshipit-source-id: dd377181dad6ab0c894d19d83408b7812775a691
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45277
Implements structured kernels as per https://github.com/pytorch/rfcs/pull/9 and ports upsample_nearest1d to use the framework.
The general structure of this diff:
- Define a new syntax for specifying structured kernels in `native_functions.yaml`. You put `structured: True` on the `out` function (that's what you implement) and `structured_delegate: foo.out` on the functional/inplace variants to define them in terms of the `out` function. There's a bunch of new consistency checking to see if you've done this right, though the error messages are of varying quality. This is most of what's going on in tools.codegen.model
- NativeFunctionGroup turns into StructuredNativeFunctions. Previously I thought that maybe we would use this grouping mechanism for both structured and unstructured kernels, but it turned out that Jiakai needed to make his own grouping structure. So now I've specialized it for structured kernels, which also means I get to add a bunch of invariants, like requiring structured kernels to have both a functional and an out variant. This is the lower bundle of changes in tools.codegen.model
- When you make an out kernel structured, this induces us to generate a new meta function signature for you, in which you write shape checking and output allocation code. The signatures of these are defined by `tools.codegen.api.meta` and generated into `MetaFunctions.h`. Coverage here is very bare bones and will be driven by the actual operators we port as we go.
- The meaty part of code generation is what we do when we have some grouped StructuredNativeFunctions. We continue to generate a wrapper per function type, but they're a bit different: they call your meta functions and refer to the actual implementations in `out` (see the sketch after this list).
- Then there's a port of `upsample_nearest1d`; easiest to review by just looking at what the final code looks like.
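To give a feel for the two pieces described above, here is an illustrative sketch of the flow in this (pre-class-based) version: a meta function that does shape checking and returns a TensorMeta, and a functional wrapper that allocates from it and then runs the out kernel. The struct fields, function names, and checks below are stand-ins rather than the code this diff actually generates:

```cpp
#include <ATen/ATen.h>

// Illustrative sketch of the TensorMeta-era flow -- field names, signatures,
// and shape checks are approximations, not the generated or hand-written code.

// Roughly what a TensorMeta carries: enough metadata to allocate the output.
// (Stride calculation is listed as a missing piece below.)
struct TensorMetaSketch {
  at::DimVector sizes;
  at::TensorOptions options;
};

// Hand-written meta function (declared via MetaFunctions.h in the real thing):
// shape checking plus output metadata, but no allocation and no kernel work.
TensorMetaSketch upsample_nearest1d_meta_sketch(const at::Tensor& self,
                                                at::IntArrayRef output_size) {
  TORCH_CHECK(output_size.size() == 1, "expected 1 output size");
  TORCH_CHECK(self.dim() == 3, "expected a 3-D (N, C, W) input tensor");
  return {at::DimVector{self.size(0), self.size(1), output_size[0]},
          self.options()};
}

// Generated functional wrapper (sketch): call the meta function, allocate the
// output from its metadata, then run the out kernel on it. The real generated
// wrapper calls the backend's out implementation directly, which is how the
// extra redispatch goes away; this sketch goes through the public out API
// only to stay self-contained.
at::Tensor upsample_nearest1d_wrapper_sketch(const at::Tensor& self,
                                             at::IntArrayRef output_size) {
  auto meta = upsample_nearest1d_meta_sketch(self, output_size);
  auto out = at::empty(meta.sizes, meta.options);
  at::upsample_nearest1d_out(out, self, output_size, /*scales=*/c10::nullopt);
  return out;
}
```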
Missing pieces:
- Stride calculation in TensorMeta
- Sufficient sanity checking for inplace/out variants
- Enough rope to make TensorIterator work
This PR improves instruction counts on `upsample_nearest1d` because it eliminates an extra redispatch. Testing `at::upsample_nearest1d(x, {10});`
* Functional: before 1314105, after 1150705
* Out: before 915705, after 838405
These numbers may be jittered by up to +-16400 (which is the difference when I tested against an unaffected operator, `at::upsample_linear1d`), though that may also be because unrelated changes affected all operators globally.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Differential Revision: D24253555
Test Plan: Imported from OSS
Reviewed By: smessmer
Pulled By: ezyang
fbshipit-source-id: 4ef58dd911991060f13576864c8171f9cc614456