Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68687
This adds `NativeFunction.root_name` which is the canonical name
for the operator group. i.e. the BaseOperatorName without inplace or
double-underscores. In the previous PR I referred to this as
`base_name` but confusingly `BaseOperatorName` does potentially
include inplace or double-underscores.
I also add the property to `NativeFunctionsGroup` so that grouped
functions with type `Union[NativeFunction, NativeFunctionsGroup]`
can have the property queried without needing `isinstance` checks.
Test Plan: Imported from OSS
Reviewed By: jbschlosser
Differential Revision: D32596271
Pulled By: albanD
fbshipit-source-id: 8b6dad806ec8d796dcd70fc664604670d668cae7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69327
Original commit changeset: d44096d88265
Original Phabricator Diff: D32144240 (668574af4a)
Test Plan:
CI
original diff failed 175 builds in CI
Reviewed By: airboyang, anjali411
Differential Revision: D32809407
fbshipit-source-id: c7c8e69bcee0274992e2d5da901f035332e60071
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64432
Original PR description + feedback here: https://github.com/pytorch/pytorch/pull/63048
I've addressed all of the feedback in the original PR and made some pretty large changes, listed below.
**Table of Contents**
- Starting points
- List of the main changes from the original PR
- Next Steps
- Example codegen output (for a view, mutation, and view+mutation op)
**Starting Points**
A good place to start when looking through the PR:
* Alban mentioned that this is a useful mental model (thanks Ed for originally making this clear to me). Semantically, the pass currently does THREE things, which are all needed by functorch - all fused together into one big pass.
* (a) alias removal, which replaces {view} calls with {view}_copy calls, and manually tracks aliasing information, so that when one tensor is mutated, we re-apply the same mutation to all of the aliases. This is the bulk of the work - once this is done, the next 2 things are trivial to implement.
* (b) mutation removal, which is easy to do once we know that there are no aliases. Every mutation `a.add_(b)` becomes `a.replace_(a.add(b))`
* (c) reapplying views: all of the `{view}_copy` calls are replaced with `{view}` calls again. This is an optimization that we can make specifically for functorch (and strided backends), that only care about mutation removal and not alias removal
* XLA and Vulkan only want (a), or (a) + (b). Later, we'll want to split this out so that you can actually opt into different versions of this logic.
* There is currently no {view}_copy replacement, because the pass just <replace views with copies> and <replace copies with views> steps have been combined. Later, we'll want to actually implement {view}_copy variants of each view operator, probably with codegen.
* documentation breadcrumb 1, in `FunctionalTensorWrapper.cpp`: https://github.com/pytorch/pytorch/pull/64432/files#diff-a0bac99bf205dba5b94cb64fc2466d3d55d991887572f9cd6a02e27b3a91dd60R59 (you might have to expand the `FunctionalTensorWrapper.cpp` file, which GitHub closes by default because it's large)
* documentation breadcrumb 2, in `FunctionalTensorWrapper.h`: https://github.com/pytorch/pytorch/pull/64432/files#diff-c945c71a4ccac65871f24a912e8904f9a5088b24a32e636727ea9c8fe920708aR12
* Reading through the codegen output at the bottom of this description.
**Main changes from the original PR**
(1) I use lambdas instead of a giant enum to handle all of the different views.
This results in less boilerplate per view op (and more stuff that can be codegen'd). Every `ViewMeta` object now contains a `forward` and `reverse` lambda, that knows how to replay the view and its inverse. This makes the actual code that executes the replaying logic a lot less boilerplate-y (see `Alias::sync_update_operations` and `FunctionalTensorWrapper::sync_`)
(2) Every tensor during the functionalization pass is always wrapped in a `FunctionalTensorWrapper`.
This is potentially unnecessary for Vulkan/XLA, and will have a mild perf impact, but for now this PR just targets the functorch use case. I previously had a complicated design a (`FunctionalTensorImplBase` class) to avoid needing the wrapper for XLA, but it had some subtleties that are gonna require more thought to fix, so I'm pushing that off for now.
(3) `FunctionalTensorWrapper` objects accurately report stride information.
It's a little annoying to do this though, because the logic that calculates stride info for each view isn't easily separated from the actual view kernels in core, `at::native::{view}`. I do this by adding logic in each `at::functionalization::{view}` kernel to call the reference implementation `at::native::{view}`. I don't do anything with the output aside from taking it's size/stride/storage_offset to set the actual output tensor's size/stride/storage_offset correctly. There's another annoying part to this: I'm pretty sure that we want to pass in the actual *wrapper* tensors directly into the native kernels, not their inner unwrapped values. But there are some `at::native::{view}` kernels that call other tensor methods, which re-invokes the dispatcher, calling functionalization/functorch kernels that try do the unwrapping.
To do this, right now I have an `AutoDispatchDirectlyToNative` guard that basically ensures that any tensor methods called inside of the at::native::{view} op always redispatch straight to the CPU kernel (which will be another at::native:: kernel). This feels kind of heavy handed, but I'm not sure of a better way to do it.
(4) `FunctionalTensorWrapper` objects accurately report aliasing information.
There's a new `FunctionalStorageImpl` class (subclass of `StorageImpl`) that allows tensors in the functionalization pass to accurately alias storage. If two tensors `a` and `b` in a functionalized program are views of one another, then `a.storage.is_alias_of(b.storage)` should return true. I added this in a pretty similar way to how meta tensors allocate storage, although I don't pass in an actual allocator (I think this is fine because you should never resize a functional tensor's storage).
One thing I'm not sure about - should `FunctionalTensorWrapper` set `storage_access_should_throw_`: (a) always, (b) never, (c) only if its wrapped tensor has it set.
Right now I have it not set, mostly because calling the reference view functions (`at::native::{view}`) requires looking at the storage. But that means that if you try to access storage from python in a functionalized program, you'll get silent garbage instead of an error. Related question: are we planning on exposing meta tensor storage to python in the future (even though it contains garbage)?
(5) better docs :)
**View operator coverage**
(6) The functionalization pass now gets math-composite view ops for free.
I didn't add the `Functionalize` dispatch key to the composite set, because I don't want composite ops like `torch.ones` to get decomposed before hitting the functionalization pass. Instead, I added codegen to manually register the `at::native::` kernels of composite view ops. This is a little hairy, because the names of the `at::native::` kernels aren't easily accessible. They're stored in a `Dict[DispatchKey, BackendIndex]`. I made a best-effort attempt to get each view kernel's name, basically by assuming that every view op has either a composite or cpu implementation.
There's also a hardcoded list of composite view ops in `gen_inplace_or_view_type.py`, but it looks like it's wrong. This is probably worth rationalizing later, but instead I created a new list of the "complete" set of composite view ops, and preserved the old set by hardcoding the delta between the two sets.
(7) I've added codegen for ops that are both views AND mutations, like `transpose_()` (why do we even have these {emoji:1f622}).
From some light testing, it looks like they work correctly with one caveat: I had a hard time ensuring that functorch programs that mutate their inputs using ops like `transpose_()` preserve the input mutations after the program finishes running. For (in my corresponding functorch branch) I emit a warning when this happens, and just don't preserve the mutation
(8) I added `{view}_inverse` implementations for every view op, in `FunctionalInverses.cpp`.
These are needed to take mutations made to views and replay them back onto the base. To reduce boilerplate, the codegen generates function declarations for each `{view}_inverse` function, so you get a nice compiler error when someone eventually adds a new view op.
The only view ops currently not supported are (a) as_strided, and (b) the sparse view ops (values()/indices()).
I can add support for as_strided, but it needs an `as_strided_inverse()` function. That will look really similar to the `as_strided_backward()` function in FunctionsManual.cpp, but it has some noticeable differences: we basically want an `as_strided_embed` for autograd and `as_strided_scatter` for functionalization. We also will probably need them to be primitives w.r.t to autograd, since the currently implementation for autograd uses view().copy_() calls that XLA won't be able to handle. I'm wondering if anyone has any objections, but otherwise I can make those change (which will require writing backward formulas for `as_strided_embed` and `as_strided_scatter`).
I did a bunch of manual testing that all looks pretty good, but it's definitely not fully tested. Ed pointed out that once XLA uses this pass (or at least once there's a POC), we can just run the existing xla view test suite. Hopefully that delay is okay - if it's not, maybe we can think about using OpInfos similar to how functorch uses them for testing.
Note: there's some duplication with autograd's view code. Every `{view}_inverse` implementation is really similar to the implementation for that view listed in `derivatives.yaml`. There are some major differences though:
* the autograd implementations over those backwards functions (like `permute_backwards()`, in `FunctionsManual.cpp`) internally call other view ops. For functoinalization, we want them to (eventually call `{view}_copy` operators).
* For view ops that take a subset of the original storage, like `slice/select/diagonal/as_strided()`, the autograd backward functions fill the "spaces" in the inverse call with zeroes. For functionalizations, we want to fill them with the value of `base` at those positions. It looks like this currently applies to 6 total ops (since we can ignore composites):
* select
* slice
* diagonal
* as_stridied
* split
* split_with_sizes
A nice end state would probably be for the autograd + functoinalization codegen to both look at the same yaml (either `derivatives.yaml`, or something else), and automatically generate the right thing. I didn't leave that in scope for this PR though.
**Current State + Next Steps**
There are a bunch of followups after this PR eventually lands. Roughly in order:
* Use the current pass to register problematic composite ops in functorch. Also, nested `functionalize()` calls aren't supported yet (I mostly just need to remove some debug asserts and test it).
* Work on freeing up dispatch key space in the by deduplicating the `{backend}`/`Autograd{backend}`/`Sparse{backend}`/`Quantized{backend}` keys
* Once we have more dispatch keys, split up this pass into 3 pieces - it's currently fused, and doesn't do the right thing for vulkan/XLA. Specifically, all of the `{view}` calls in the current pass's view-replay logic should turn into `{view}_copy` calls that vulkan/XLA know how to implement, and there will be separate passes for (a) removing mutations, and (b) turning `{view}_copy` calls back into `{view}` calls. For Vulkan, we eventually want a pass that ONLY removes aliasing and view calls, and doesn't remove mutations. We can also probably make the 2 new passes user dispatch keys to save dispatch key space, if they'll only be used by functorch anyway.
* Do more of a dive on perf for the vulkan/xla use cases. There are several areas to improve perf with varying levels of effort required. The simplest one that I'll probably do regardless is to codegen the out-of-place kernels instead of using a boxed fallback. Getting a POC working for xla will also be useful to test the view operator coverage.
**Example Codegen Output**
View Op:
```
::std::vector<at::Tensor> split_Tensor(c10::DispatchKeySet ks, const at::Tensor & self, int64_t split_size, int64_t dim) {
auto self_ = at::functionalization::impl::unwrapFunctionalTensor(self);
::std::vector<at::Tensor> out;
{
at::AutoDispatchBelowFunctionalize guard;
auto tmp_output = at::redispatch::split(ks & c10::after_func_keyset, self_, split_size, dim);
out = at::functionalization::impl::wrapFunctionalTensor(tmp_output);
// I'm fusing the [alias removal], [mutation removal], [add views back] passes together.
// Later, we'll want to turn them into separate passes (since e.g. vulkan only cares about alias removal).
}
at::functionalization::ViewMeta view_meta = at::functionalization::ViewMeta(
[split_size, dim](const at::Tensor& base, int64_t mutated_view_idx) -> at::Tensor {
return base.split(split_size, dim)[mutated_view_idx];
},
[split_size, dim](const at::Tensor& base, const at::Tensor& mutated_view, int64_t mutated_view_idx) -> at::Tensor {
return at::functionalization::impl::split_inverse(base, mutated_view, mutated_view_idx, split_size, dim);
}
);
at::functionalization::impl::set_view_meta(out, self, view_meta);
at::AutoDispatchDirectlyToNative native_guard;
::std::vector<at::Tensor> reference_tensor_output = at::native::split(self, split_size, dim);
at::functionalization::impl::set_strides(out, reference_tensor_output);
return out;
}
```
Mutation Op:
```
at::Tensor & add__Tensor(c10::DispatchKeySet ks, at::Tensor & self, const at::Tensor & other, const at::Scalar & alpha) {
at::functionalization::impl::sync(self);
at::functionalization::impl::sync(other);
auto self_ = at::functionalization::impl::unwrapFunctionalTensor(self);
auto other_ = at::functionalization::impl::unwrapFunctionalTensor(other);
at::Tensor tmp_output;
{
at::AutoDispatchBelowFunctionalize guard;
// The functionalization pass explicitly doesn't pass out= parameters to the redispatch
tmp_output = at::redispatch::add(
ks & c10::after_func_keyset, self_, other_, alpha);
}
self.replace_(tmp_output);
at::functionalization::impl::maybe_add_update(self);
return self;
}
```
View + Mutation Op:
```
at::Tensor & transpose_(c10::DispatchKeySet ks, at::Tensor & self, int64_t dim0, int64_t dim1) {
at::functionalization::ViewMeta view_meta = at::functionalization::ViewMeta(
[dim0, dim1](const at::Tensor& base, int64_t mutated_view_idx) -> at::Tensor {
return base.transpose(dim0, dim1);
},
[dim0, dim1](const at::Tensor& base, const at::Tensor& mutated_view, int64_t mutated_view_idx) -> at::Tensor {
return at::functionalization::impl::transpose_inverse(base, mutated_view, dim0, dim1);
}
);
at::functionalization::impl::mutate_view_meta(self, view_meta);
// See Note [Propagating strides in the functionalization pass]
// Directly update the sizes/strides/storage_offset fields on self using the inplace call.
// I need the guard because I don't want the at::native kernel to end up calling more functionalization/functorch kernels.
// Its only job is to directly compute the output size/stride/storage_offset metadata.
at::AutoDispatchDirectlyToNative native_guard;
at::native::transpose_(self, dim0, dim1);
return self;
}
```
Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D31942093
Pulled By: bdhirsh
fbshipit-source-id: b95598dae35dd1842fa8b1d8d1448332f3afaadf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63094
This PR:
- Moves `FileManager` and its dependencies (`assert_never` and other imports) to `utils.py`, and updates all of the call-sites with the fresh imports
- Passes the list of NativeFunction objects into `gen_trace_type` directly, instead of requiring the function to regenerate it (we already have it)
The purpose of the reshuffling is to avoid circular dependencies in the next PR, where I add codegen for the functionalization pass, which gets called from `gen.py` (but depends on some stuff from the autograd codegen - in partulcar, the list of view ops).
Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D31942096
Pulled By: bdhirsh
fbshipit-source-id: 36118facae61f25f8922bb43ad2818c80b53504e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61746
**Summary**
This commit introduces a new feature for structured kernels that allows
kernels to declare quantities as "precomputed" in
`native_functions.yaml`, compute them once in the `meta` function and
reuse them again in the `impl`. The names and types of these quantities
are used to generate code for a struct containing them that the `meta`
function must return. In the case of a handful of surveyed kernels
(`all,`, `any`, `avg_pool2d`), these quantities that are used both in
the `meta` and `impl` have the same meaning as certain kernel arguments
and in fact supersede them. Accordingly, the correspondence between a
kernel argument and the precomputed elements that supersede it is also
captured in `native_functions.yaml`. This information is used to unpack
the struct returned by `meta` and pass its contents correctly to the
`impl` function.
The primary goal is to avoid recompute and enhance developer experience
(e.g. sometimes people can forget to compute these elements while
porting a kernel).
Test Plan: Imported from OSS
Reviewed By: tugsbayasgalan
Differential Revision: D30407831
Pulled By: SplitInfinity
fbshipit-source-id: 00975525ea373721fe52d06f75cd4ac91f3dc556
Summary:
This PR implements the necessary hooks/stubs/enums/etc for complete ONNX Runtime (ORT) Eager Mode integration. The actual extension will live out of tree at https://github.com/pytorch/ort.
We have been [working on this at Microsoft](https://github.com/microsoft/onnxruntime-pytorch/tree/eager-ort/torch_onnxruntime) for the last few months, and are finally ready to contribute the PyTorch core changes upstream (nothing major or exciting, just the usual boilerplate for adding new backends).
The ORT backend will allow us to ferry [almost] all torch ops into granular ONNX kernels that ORT will eagerly execute against any devices it supports (therefore, we only need a single ORT backend from a PyTorch perspective).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58248
Reviewed By: astaff
Differential Revision: D30344992
Pulled By: albanD
fbshipit-source-id: 69082b32121246340d686e16653626114b7714b2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58661
I removed dispatch: CompositeImplicitAutograd: math_silu_backward
Definitely not right, but I don't know how it works with structured core.
Keeping it will trigger an assertion failure
```
assert dispatch.keys() != {DispatchKey.CompositeImplicitAutograd}, \
f"unexpected name for singleton CompositeImplicitAutograd dispatch entry: expected {cpp.name(func)} " \
f"but got {dispatch[DispatchKey.CompositeImplicitAutograd]}. Rename your implementation to the expected " \
"name, then delete the dispatch table"
```
Test Plan: Imported from OSS
Reviewed By: soulitzer
Differential Revision: D28572530
Pulled By: ezyang
fbshipit-source-id: 410f03bddf79cda7c9f0fd66f697383ee2925d32
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60214
Relanding this PR, but with a fix for windows cuda builds (example failure in master here: https://github.com/pytorch/pytorch/runs/2852662871)
This is identical to the original PR except for one change in `tools/codegen/gen.py`: `static constexpr` -> `static CONSTEXPR_EXCEPT_WIN_CUDA`
This actually took a while to figure out, until I tracked down a previous pytorch PR that encountered a similar issue: https://github.com/pytorch/pytorch/pull/40675
This reverts commit 6d0fb85a62.
Test Plan: Imported from OSS
Reviewed By: ezyang
Differential Revision: D29213932
Pulled By: bdhirsh
fbshipit-source-id: b90c7c10e5a51f8d6173ddca673b418e5774c248
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59115
This PR beefs up the `at::_ops::` API as a source of truth for compile-time information about each operator.
### Changes
For every op defined in native_functions.yaml, e.g. `at::_ops::add_Tensor` previously defined an unambiguous function; effectively an unambiguously named version of the C++ API that you could decltype() successfully because it had no overloads with a user-facing macro: `decltype(ATEN_FN2(add, Tensor)) // expands to decltype(at::_ops::add_Tensor)`.
Now, `at::_ops::add_Tensor` is a struct containing a few static fields and methods (declared in `Operators.h`, defined in `Operators.cpp`):
```
struct TORCH_API add_Tensor {
using schema = at::Tensor (const at::Tensor &, const at::Tensor &, const at::Scalar &);
using ptr_schema = at::Tensor (*)(const at::Tensor &, const at::Tensor &, const at::Scalar &);
static constexpr const char* name = "aten::add";
static constexpr const char* overload_name = "Tensor";
static constexpr const char* schema_str = "add.Tensor(Tensor self, Tensor other, *, Scalar alpha=1) -> Tensor";
static at::Tensor call(const at::Tensor & self, const at::Tensor & other, const at::Scalar & alpha);
static at::Tensor redispatch(c10::DispatchKeySet dispatchKeySet, const at::Tensor & self, const at::Tensor & ot
};
```
What used to be the function `at::_ops::add_Tensor` can now be accessed as `at::_ops::add_Tensor::call`, and I've added a new macro to access the entire struct (naming suggestions welcome) - `ATEN_OP2(add, Tensor)`.
### Motivation
There were two motivations for this change:
**Codegen refactor**
The `at::_ops::` API as it exists now is (yet another) C++ entry point into the dispatcher, in addition to the Function, Method, and Redispatch APIs. Instead, after this PR, the existing three API's are all inline-able wrapper API's that call into the `at::_ops` API to do the real work. The function and method API's call into `at::_ops::{op}::call`, while the redispatch API calls into `at::_ops::{op}::redispatch`.
This will hopefully make it easier to pile in any future C++ API's that we want to code-generate. It also means that stuff like the string name, overload name, and schema of each operator is consolidated in a single place, rather than having the codegen hardcode various strings in multiple codegen output files.
**Extra compile-time metadata**
In the [boxed CPU fallback PR](https://github.com/pytorch/pytorch/pull/58065/files#diff-c9b55f0d692a9bea8019c6f19bc46877f1efa0f9d4fc2086cf299b52768343b4R31) above this in the stack, I added a new API that external backends can use to call directly into their boxed fallback from an unboxed context. Adding extra metadata to `at::_ops` means that XLA's usage of that API doesn't require passing in the string name and overload of each name as arguments; we can just infer them.
The updated API looks like this (see [the XLA-side PR ](https://github.com/pytorch/xla/pull/2945/files#diff-5e65c3c1d847191cb691d1874732e971f09fa1aad7a980a555c3b0504a5b6470R250) for more examples)
```
return at::native::call_fallback_fn<&xla_cpu_fallback, ATEN_OP2(add, Tensor)>::call(a, b, 1.0);
```
**Characteristics of the `at::_ops` API**
(I also commented this in the codegen)
(1) It follows the Dispatcher API.
This means, e.g., that it takes in the expanded arguments rather than `TensorOptions`. This is kind of necessary for perf, if we want to `at::_ops` to serve as the main implementation of the existing C++ API's. For example: if it followed the C++ API, then all of the faithful C++ factory functions would need to wrap their arguments into TensorOptions only to unwrap them again.
(2) Overload names are disambiguated.
This is the same as before; it's helpful for pytorch extenders who would like to decltype() an aten operator, that has overloads, e.g. decltype(at::_ops::mul_Tensor::call)
(3) No argument defaulting is allowed.
This is more of an implementation detail to avoid #include cycles, since TensorBody.h (which defines the Tensor class) needs to include this file. The #include situation is precarious though!
(4) manual_cpp_bindings and faithful names are not included in the API.
I think that this is one we have a choice with. This applies to stuff like __dispatch__is_complex(), and add_outf(). These aren't "real native_functions.yaml ops", they're just additional functions provided by the C++ API. They're implemented as wrappers in Functions.h that call into the actual operators defined here, i.e. at::_ops::is_complex::call() and at::_ops::add_out::call(). This means that ATEN_OP(is_complex) will not fastpath, and will go through the dispatcher. It also means that `ATEN_OP2(add, out)` is automatically faithful and takes its out argument at the end (this is just because it follows the dispatcher API).
**Details**
Instead of codegen'ing the existing 3 API's in `Functions.cpp`, `TensorMethods.cpp` and `RedispatchFunctions.cpp`, I codegen them directly into the headers: `Functions.h`, `TensorBody.h`, and `RedispatchFunctions.h`. I mostly did this for perf, since we want to avoid introducing an extra function call in the hot path of every operator. These functions are also now all one-liners that call into `at::_ops`, so the compiler should just inline them all anyway.
The main downside in doing that though was that I had to bend over backwards in a few cases to avoid cyclical #include statements. The issue is that `TensorBody.h` now includes `Operators.h` (because the codegen'd method API is implemented by calling into `at::_ops`), but `TensorBody.h` also includes the definition of the Tensor class. That means that `Operators.h` can't be aware of the Tensor class; it needs to forward declare everything and avoid using the Tensor class directly. To fix cyclic includes, I had to:
- Not allow defaulting in the `at::_ops` API
- Move some code that was called when translating from C++ to Dispatcher API's directly into the codegen template (`check_tensor_options_and_extract_memory_format`)
It's not great, but I don't think this specific include cycle will break down in the near future; the only code that we need to call before getting to `Operators.cpp` is the translations from various API's to the dispatcher API; there aren't many of them, and there's no major reason for them to live an external utils file somewhere.
Moving the code into the headers also meant that the codegen no longer needs to deal with `Functions.cpp`/`TensorMethods.cpp`/`RedispatchFunctions.cpp`. All of the functions that used to be defined in `TensorMethods.cpp` seemed small enough for me to lump into `TensorBody.h`, but some of the functions in `Functions.cpp` looked pretty big to put in a header, so I moved the file to `aten/src/ATen/native/Functions.cpp`.
It might be worth keeping `TensorMethods.cpp` there and leaving it too, in-case we have any beefy hand-written tensor methods that we don't want to put in a header.
**Perf**
I ran a few benchmarks in callgrind, and didn't see a noticeable instruction count change when calling `at::add()`. I also saw in the output that `at::add()` was successfully getting inlined.
There's also probably a light risk of binary size increase; I think that there's a binary size regression test that I can run in phabricator (going to try it). I can also try inspecting `libtorch.so` directly and seeing if it's any bigger, but my hope is that the inline-ing means that we aren't generated separate symbols for `at::add` and `at::_ops::add_Tensor::call`.
Test Plan: Imported from OSS
Reviewed By: ezyang
Differential Revision: D28833086
Pulled By: bdhirsh
fbshipit-source-id: 55f322a8378cb9a3cb6642f72aa291be381dd95b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57510
This is a re-write of https://github.com/pytorch/pytorch/pull/56835, which is significantly shorter thanks to the data model change in the PR below this one in the stack. See the original description in the linked PR for details.
The functional changes in this PR are the same as in the above linked one, so the description is the same with a few small changes:
- I don't bother generating `at::xla::{op}` entries for CPU fallbacks. After looking around, I see precedent for that. For example, we don't have `at::cpu::{op}` entries for composite ops- if you really want to bypass the dispatcher you need to call `at::compositeimplicitautograd::{op}`. Maybe we should revisit that later if we find an important use case for having full namespace coverage, but that doesn't seem worth half-fixing for external backends in this PR.
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D28474364
Pulled By: bdhirsh
fbshipit-source-id: 4d58b60e5debad6f1ff06420597d8df8505b2876
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57361
Data model change in the codegen, which splits backend-specific information out of `NativeFunction`
### Overview
Currently in the codegen, native_functions.yaml has backend-specific information about each operator that is encoded directly into the data model, in the `NativeFunction` object. That's reasonable, since the native_functions.yaml is the source of truth for information about an operator, and the data model encodes that information into types.
Now that external backends can use the codegen though, that information is technically incomplete/inaccurate. In another PR, I tried patching the information on the `NativeFunction` object with the additional external information, by updating the `dispatch` entry to contain the external backend kernel name and dispatch key.
Instead, this PR tries to split out that information. The `NativeFunction` class contains all information about an operator from native_functions.yaml that's backend-independent and is known never to change regardless of what extra information backends provide. We also build up a backend "index", which is basically a mapping from [backend] -> [backend-specific-metadata]. Reading in an external backend yaml just involves updating that index with the new backend.
There were a few places where `NativeFunction` used the dispatch table directly, that I encoded as properties directly on the NativeFunction object (e.g. `is_abstract`). They were mostly around whether or not the operator has a composite kernel, which isn't something that's going to change for any external backends.
This has a few advantages:
- We can more easily re-use the existing logic in `native_function.py` and `register_dispatch_key.py` for both native and external backends, since they both involve a NativeFunction + a particular backend index
- The data in the data model will be the same regardless of how the codegen is run. Running the codegen with a new external backend doesn't change the data inside of NativeFunction or an existing backend index. It just adds a new index for that backend.
- There are several of codegen areas that don't care about backend-specific information: mostly the tracing and autograd codegen. We can reason about the codegen there more easily, knowing that backend-specific info is entirely uninvolved.
An alternative to this split would be to augment the NativeFunction objects with external backend information at the time that we create them. So the external codegen could read both native_functions.yaml and the external backend's yaml at the same time, and construct a NativeObject with a full dispatch table (including the XLA entry), and the correct setting of structured (taking into account both yamls). One disadvantage to this approach is that NativeFunction objects now contain different stuff depending on how you ran the codegen, and you have to make sure that any changes to the codegen can properly handle all the different variants.
### Data Model Changes
Removed 3 classes, which are used by the external codegen:
- ExternalBackendFunction
- ExternalBackendFunctionsGroup
- ExternalBackendMetadata
And added two new ones:
- BackendIndex
- BackendMetadata
`BackendIndex` contains any info that's specific to that backend, plus a mapping from operator names to backend specific metadata about the operator. One example of backend-specific info that's not operator-dependent is the fact that XLA prefers to implement functional kernels instead of out kernels (and so when they eventually mark an op as structured, they're going to mark the functional op and not the out op).
`BackendMetadata` contains info specific to an (operator, backend) pair. Right now, that's just (a) the name of the kernel, and (b) whether or not that operator is structured.
### Questions
I wanted to get this PR up earlier so I could get feedback, but there are a few things I want to call out:
**Dealing with `structured`.**
This PR separates out the notion of `structured` into two bits of information:
- Does [operator] have a meta() function. This is backend-agnostic, and is represented by the `structured` property on `NativeFunction`, same as before. This is used, e.g., to decide what signatures to add to `MetaFunctions.h`.
- Does [operator, backend] have an impl() function. This is backend dependent; even though technically all in-tree backends are forced to write impl() functions for an operator when we port the op to structured in native_functions.yaml, out-of-tree backends can decide to opt in independently. This is represented as a property on `BackendMetadata`. This is used in most other cases, e.g. in `RegisterDispatchKey` when we're deciding whether or not to gen a structured or unstructured wrapper.
I also baked `is_structured_dispatch_key` directly into each BackendIndex. So for operators marked "structured" in native_functions.yaml, their corresponding CPU/CUDA BackendIndex entries will be marked structured, and all others (except for potentially external backends) will not.
I ended up trying to deal with `structured` in this change since it's technically backend dependent (XLA can opt kernels into structured separately from in-tree ops), but that may have been too ambitious: it's technically not relevant until we actually add support for structured external kernels. If it's not clear that this is the right path for dealing with structured and we want to push that off, I'm fine with backing out the bits of this PR that make `structured` backend-dependent. I don't see anything *too* controversial related to structured in the change, but I tried to call out any areas in the comments
**Localizing the fact that external backends follow Dispatcher convention.**
Another thing that's sort of backend specific that I didn't totally address in this PR is the fact the fact that in-tree backends follow the Native API while external backends follow the Dispatcher API. I painted over that in `native_functions.py` by adding a helper, `kernel_signature`, that takes in a native function and gives you the "correct" signature for the specified backend- NativeSignature for in-tree backends, and DispatcherSignature for out-of-tree backends. In order to make that fully useable though, we'll need `NativeSignature` and `DispatcherSignature` to have matching interfaces. I didn't bother with that in this PR, which is why `gen_external_aten_fallbacks.py` still has a bunch of direct references to the dispatcher API. Thinking of adding it in a later PR but wanted to see if anyone has other opinions.
Maybe `is_external()` shouldn't even be a property on the BackendMetadata, and anything the codegen does that requires asking for that information should just be better abstracted away.
**Thoughts on the `BackendIndex` / `BackendMetadata` breakdown.**
One thing that's annoying right now is that to query for various pieces of metadata, you call helper functions like `backend_index.structured(f)`, which queries that particular backend and tells you if that specific NativeFunctionGroup is structured for that backend. It has to return an `Optional[bool]` though, since you have to handle the case where that operator doesn't have a kernel for that backend at all. So users of those helpers end up with a bunch of optionals that they need to unpack, even if they know at some point that the result isn't None. I think it would be easier instead to just store the NativeFunction object as a field directly on the BackendMetadata. Curious if there are any other opinions on a better way to model it though.
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D28474362
Pulled By: bdhirsh
fbshipit-source-id: 41a00821acf172467d764cb41e771e096542f661
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56871
foreach kernels fall back to slow path when tensor are on different devices
Generated by codemod:
```
fastmod '(- func: _foreach.*)' '${1}
device_check: NoCheck # foreach kernels fall back to slow path when tensor are on different devices' aten/src/ATen/native/native_functions.yaml
```
ghstack-source-id: 127914017
Test Plan: autotest
Reviewed By: ezyang
Differential Revision: D27986560
fbshipit-source-id: b0cd963cdba04b4e1589bbf369eb26b48d523968
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56601
Updating it to ensure that RegistrationDeclarations.yaml is completely
unchanged
This reverts commit 90e532f3ef.
Test Plan: Imported from OSS
Reviewed By: ailzhang
Differential Revision: D27915305
Pulled By: bdhirsh
fbshipit-source-id: 491a025c44221690dad849f9a2166934130c0fec
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55351
We incorrectly used `Tensor&` to mean "the underlying
TensorImpl cannot be changed", as explained in
https://github.com/zdevito/ATen/issues/27#issuecomment-330717839 .
This diff gets us on the path to fixing this problem: we have an
incremental way to fix individual native functions so that we can
apply any handwritten fixes a few at a time. It gets the migration
started with the `resize` family of native functions.
ghstack-source-id: 127092677
Test Plan: fitsships
Reviewed By: ezyang
Differential Revision: D27583983
fbshipit-source-id: 4eeeec85f5d268e9d0f1645eb9396914a9f9557f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56082
The native_functions.yaml changes were done by codemod using the
following script:
```
import ruamel.yaml
from ruamel.yaml.tokens import CommentToken
from ruamel.yaml.error import CommentMark
from tools.codegen.model import * # noqa: F403
with open("aten/src/ATen/native/native_functions.yaml", "r") as f:
contents = f.read()
yaml = ruamel.yaml.YAML()
yaml.preserve_quotes = True
yaml.width = 1000
yaml.boolean_representation = ['False', 'True']
r = yaml.load(contents)
convert = '''\
acos
acosh
asin
asinh
atan
atanh
cos
cosh
digamma
erf
erfc
erfinv
exp
expm1
exp2
lgamma
log
log10
log1p
log2
reciprocal
sigmoid
sin
sinc
sinh
special_entr
sqrt
tan
tanh'''.split()
for e in r:
f = NativeFunction.from_yaml(e, Location("", 0))
if f.structured or f.structured_delegate is not None:
continue
n = f.func.name.name.base
if n not in convert:
continue
# mutate e to make changes
if f.func.kind() == SchemaKind.out:
e.insert(1, 'structured', True)
e.insert(2, 'structured_inherits', 'TensorIteratorBase')
else:
# TODO: The .out overload assumption is not sound in general
e.insert(1, 'structured_delegate', f'{n}.out')
e['dispatch'].pop('CPU', None)
e['dispatch'].pop('CUDA', None)
e['dispatch'].pop('CPU, CUDA', None)
e['dispatch'].pop('CompositeExplicitAutograd', None)
*_, last_k = e.keys()
needs_fixup = False
if not e['dispatch']:
if last_k == 'dispatch':
needs_fixup = True
del e['dispatch']
# Manually fix up newlines at the end, because ruamel
# made some bad life choices about where to associate trailing
# whitespace for nested dicts; see
# https://stackoverflow.com/questions/42172399/modifying-yaml-using-ruamel-yaml-adds-extra-new-lines
if needs_fixup:
*_, last_k = e.keys()
# post_key, pre_key, post_value, pre_value
e.ca.items[last_k] = [None, None, CommentToken('\n\n', CommentMark(0), None), None]
with open("aten/src/ATen/native/native_functions.yaml.new", "w") as f:
yaml.dump(r, f)
```
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: bhosmer
Differential Revision: D27777769
Pulled By: ezyang
fbshipit-source-id: 1ecbac7cb3e0093167bb61c7d2b1ecb95b8ae17c
Summary:
Generally wildcard imports are bad for the reasons described here: https://www.flake8rules.com/rules/F403.html
This PR replaces wildcard imports with an explicit list of imported items where possible, and adds a `# noqa: F403` comment in the other cases (mostly re-exports in `__init__.py` files).
This is a prerequisite for https://github.com/pytorch/pytorch/issues/55816, because currently [`tools/codegen/dest/register_dispatch_key.py` simply fails if you sort its imports](https://github.com/pytorch/pytorch/actions/runs/742505908).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55838
Test Plan: CI. You can also run `flake8` locally.
Reviewed By: jbschlosser
Differential Revision: D27724232
Pulled By: samestep
fbshipit-source-id: 269fb09cb4168f8a51fd65bfaacc6cda7fb87c34
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54403
A few important points about InferenceMode behavior:
1. All tensors created in InferenceMode are inference tensors except for view ops.
- view ops produce output has the same is_inference_tensor property as their input.
Namely view of normal tensor inside InferenceMode produce a normal tensor, which is
exactly the same as creating a view inside NoGradMode. And view of
inference tensor outside InferenceMode produce inference tensor as output.
2. All ops are allowed inside InferenceMode, faster than normal mode.
3. Inference tensor cannot be saved for backward.
Test Plan: Imported from OSS
Reviewed By: ezyang
Differential Revision: D27316483
Pulled By: ailzhang
fbshipit-source-id: e03248a66d42e2d43cfe7ccb61e49cc4afb2923b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54969
With all use cases to hacky wrapper removed, all kernels will be
dispatched with c10 full dispatcher.
ghstack-source-id: 125434790
Test Plan: buck build //caffe2/aten/...
Reviewed By: ezyang, walterddr
Differential Revision: D27436596
fbshipit-source-id: 7a146d1f4a983b4a81f8552be4eec6c482b6bea2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54470
```
git grep -l 'DefaultBackend' | xargs sed -i 's/DefaultBackend/CompositeExplicitAutograd/g'
```
Plus a quick fixup in native/README.md
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: bdhirsh
Differential Revision: D27253240
Pulled By: ezyang
fbshipit-source-id: 964df951ea8b52fa72937f3cc66aeaf49a702e6f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54466
I had to very carefully audit all the use sites since there are a lot
of other uses of the string Math; I did most of the conversion by
grepping for all occurrences of Math and then doing a search
replace.
I also updated documentation for clarity.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: ngimel
Differential Revision: D27253239
Pulled By: ezyang
fbshipit-source-id: afb485d07ff39575742a4f0e1e205179b60bc953
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54436
An operator entry with no dispatch table implicitly generates a Math
entry, so you don't need to define one yourself. I also added
some asserts in the codegen to fail on these cases.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: ailzhang
Differential Revision: D27235381
Pulled By: ezyang
fbshipit-source-id: f8c905090b863120f4f3656c37e2b7f26e8bb9ef
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54427
A StructuredNativeFunctions is no longer guaranteed to actually
be structured (test structured property for that), so we rename
this to a more neutral name.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: ailzhang
Differential Revision: D27235380
Pulled By: ezyang
fbshipit-source-id: 2b438d615bf06a47fc9c7bf6eb66fd8b4df31bc8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54426
Previously, we only put NativeFunctions in StructuredNativeFunctions
if the out variant advertised that the kernel was structured. However,
there are a few code generation things that can take advantage of
this trio structure, even if the kernel itself hasn't been ported
to be structured. So better to always group things when they are
related, and then let clients decide whether or not to use the
structure or throw it away.
While doing this, I had hoped that there weren't any functional/inplace
pairs that didn't also have an out variant. This turned out to not
be true. These are probably all oversights and should get fixed at
some point.
Bill of changes:
- The actual operational change happens in
StructuredNativeFunctions.from_dict; then I need to relax some
__post_init__ invariants. To tell if a StructuredNativeFunctions
is actually structured, there is a new structured property, which
is queried from a few new locations in code
- Refactor native_functions.py into gen_structured/gen_unstructured
functions so I can easily call gen_unstructured from two contexts
I intend to s/StructuredNativeFunctions/NativeFunctionsGroup/ but
for ease of review this rename hasn't been done in this PR.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: ailzhang
Differential Revision: D27235379
Pulled By: ezyang
fbshipit-source-id: d8a15de9abb75b365348ab94e67b830704e30cf0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51508
No substantive changes. The codegen for this file was getting a
bit long so I moved it off into tools.codegen.dest submodule (I
wanted to do tools.codegen.gen but that conflicts with the existing
module; oy vey!) To do this I had to move some other functions around
so that they were more generally accessible. Otherwise
self-explanatory.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: ljk53
Differential Revision: D26187856
Pulled By: ezyang
fbshipit-source-id: fd3784571d03d01c4acb7ca589fcde4492526408
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51115
Add enum type for dispatch key. Prepare to implement the DispatchTable
computation logic in python for static dispatch.
Verified byte-for-byte compatibility of the codegen output.
Test Plan: Imported from OSS
Reviewed By: bhosmer
Differential Revision: D26077430
Pulled By: ljk53
fbshipit-source-id: 86e74f3eb32266f31622a2ff6350b91668c8ff42
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50794
Original commit changeset: b4a7948088c0
There are some subtle extra tweaks on top of the original. I can unbundle them, but I've opted to keep it with the port because it's the easiest way to make sure the changes are exercised.
* There's a bugfix in the codegen to test if a dispatch key is structured *before* short circuiting because the dispatch key was missing in the table. This accounts for mixed structured-nonstructured situations where the dispatch table is present, but the relevant structured key isn't (because the dispatch table only exists to register, e.g., QuantizedCPU)
* Dispatch tables for functions which delegate to structured kernels don't have Math entries from generated for them.
* It's now illegal to specify a structured dispatch key in a delegated structured kernel (it will be ignored!) add is now fixed to follow this
* There are some extra sanity checks for NativeFunctions validation
* Finally, unlike the original PR, I switched the .vec variant of upsample_nearest2d to also be DefaultBackend, bringing it inline with upsample_nearest1d.
ghstack-source-id: 120038038
Test Plan:
```
buck test mode/dev //coreai/tiefenrausch:python_tests -- --exact 'coreai/tiefenrausch:python_tests - test_can_run_local_async_inference_cpu (coreai.tiefenrausch.tests.python_test.TiefenrauschPY)' --run-disabled
```
Reviewed By: ngimel
Differential Revision: D25962873
fbshipit-source-id: d29a9c97f15151db3066ae5efe7a0701e6dc05a3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50769
There were a couple new of these lines added in the last couple of days but they're not necessary anymore.
This PR removes them and also adds an assertion to make sure we don't add any more.
ghstack-source-id: 120133715
Test Plan: waitforsandcastle
Reviewed By: bhosmer
Differential Revision: D25961316
fbshipit-source-id: e2befc5b6215b42decb2acedcacfb50734857e2f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49164
This PR removes the logic paths in codegen that were responsible for handling non-c10-full ops.
This only goes through our basic codegen. It does not simplify C++ code yet and it does not remove the codegen for generated unboxing wrappers yet.
ghstack-source-id: 119450487
Test Plan: waitforsandcastle
Reviewed By: ezyang
Differential Revision: D25462977
fbshipit-source-id: 7e70d14bea96948f5056d98125f3e6ba6bd78285
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49735
This is the final wave of autograd codegen data model migration.
After this PR:
- autograd codegen no longer depends on Declarations.yaml;
- autograd codegen sources are fully type annotated and pass mypy-strict check;
To avoid potential merge conflicts with other pending PRs, some structural
changes are intentionally avoided, e.g. didn't move inner methods out, didn't
change all inner methods to avoid reading outer function's variables, and etc.
Confirmed byte-for-byte compatible with the old codegen:
```
Run it before and after this PR:
.jenkins/pytorch/codegen-test.sh <baseline_output_dir>
.jenkins/pytorch/codegen-test.sh <test_output_dir>
Then run diff to compare the generated files:
diff -Naur <baseline_output_dir> <test_output_dir>
```
Confirmed clean mypy-strict run:
```
mypy --config mypy-strict.ini
```
Test Plan: Imported from OSS
Reviewed By: ezyang, bhosmer
Differential Revision: D25678879
Pulled By: ljk53
fbshipit-source-id: ba6e2eb6b9fb744208f7f79a922d933fcc3bde9f