Commit Graph

30 Commits

Author SHA1 Message Date
Guo Yejun
68ea9e9df5 gen_backend_stubs.py: fix typo for supported_autograd (#68562)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68562

Reviewed By: jbschlosser

Differential Revision: D32758608

Pulled By: bdhirsh

fbshipit-source-id: 496e1ec831edaa6fcc586f3c8f0361c31cad4e78
2022-02-07 17:26:04 -08:00
Brian Hirsh
7b8c43cd7c Revert "Revert D32498570: make codegen'd device guards not cuda-specific. Allow them to be used in external codegen" (#69951)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69951

This reverts commit 0ef523633f.

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D33113543

Pulled By: bdhirsh

fbshipit-source-id: b28073ee0870b413ea9f617f27671ae5c6f3c696
2022-01-04 14:53:21 -08:00
Brian Hirsh
bb5b4cceb6 Revert "Revert D32498569: allow external backend codegen to toggle whether to generate out= and inplace kernels" (#69950)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69950

This reverts commit f6cad53443.

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D33113545

Pulled By: bdhirsh

fbshipit-source-id: d6590294662588d36c09662dea65919ad4e1e288
2022-01-04 14:52:00 -08:00
Brian Hirsh
5e222d08a1 Revert "Revert D32498572: allow external backend codegen to be used without autograd kernels" (#69949)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69949

This reverts commit 33363cea64.

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D33113544

Pulled By: bdhirsh

fbshipit-source-id: e219f10d52776498c9ad273e97bca3e3406cf702
2021-12-21 08:19:37 -08:00
Peter Bell
7bb4b683b5 Codegen: Registration now only includes the functions used (#68689)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68689

Currently Register{DispatchKey}.cpp includes all of
`NativeFunctions.h`, so any operator signature change requires all
backend registration to be recompiled. However, most backends only
have registrations for a small fraction of operators so it makes sense
to only include the specific functions required.

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D32596273

Pulled By: albanD

fbshipit-source-id: 11d511f47937fbd5ff9f677c9914277b5d015c25
2021-12-15 14:29:32 -08:00
Brian Hirsh
33363cea64 Revert D32498572: allow external backend codegen to be used without autograd kernels
Test Plan: revert-hammer

Differential Revision:
D32498572 (b83b6f7424)

Original commit changeset: 3e7159c633f6

Original Phabricator Diff: D32498572 (b83b6f7424)

fbshipit-source-id: f93fa444c95a2423eef5975a2ecdb96f14e0c535
2021-12-14 15:28:49 -08:00
Brian Hirsh
f6cad53443 Revert D32498569: allow external backend codegen to toggle whether to generate out= and inplace kernels
Test Plan: revert-hammer

Differential Revision:
D32498569 (aa0cf68c17)

Original commit changeset: ebd932d042b9

Original Phabricator Diff: D32498569 (aa0cf68c17)

fbshipit-source-id: 21a393fa339510d926512a7983d33ece327b743d
2021-12-14 15:27:24 -08:00
Brian Hirsh
0ef523633f Revert D32498570: make codegen'd device guards not cuda-specific. Allow them to be used in external codegen
Test Plan: revert-hammer

Differential Revision:
D32498570 (2e7a91c45f)

Original commit changeset: 0ce6a5614417

Original Phabricator Diff: D32498570 (2e7a91c45f)

fbshipit-source-id: 7c64ce1b5e51a680b4aeae8721e0c9e15c793289
2021-12-14 15:04:10 -08:00
Brian Hirsh
2e7a91c45f make codegen'd device guards not cuda-specific. Allow them to be used in external codegen (#68531)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68531

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D32498570

Pulled By: bdhirsh

fbshipit-source-id: 0ce6a5614417671313b4d274ea84742c5b81d1b0
2021-12-14 10:25:04 -08:00
Brian Hirsh
aa0cf68c17 allow external backend codegen to toggle whether to generate out= and inplace kernels (#68530)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68530

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D32498569

Pulled By: bdhirsh

fbshipit-source-id: ebd932d042b988e19c71aa04a21677db9bdc9f04
2021-12-14 10:25:02 -08:00
Brian Hirsh
b83b6f7424 allow external backend codegen to be used without autograd kernels (#68529)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68529

Test Plan: Imported from OSS

Reviewed By: wconstab

Differential Revision: D32498572

Pulled By: bdhirsh

fbshipit-source-id: 3e7159c633f6a80b60faa068436a4c49ebe731ca
2021-12-14 10:23:12 -08:00
Will Constable
397183f44c Add Lazy Tensor codegen infra (#69020)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69020

Merges the lazy tensor codegen infra which has already been used on lazy_tensor_staging.

Test Plan: Test via lazy_tensor_staging branch

Reviewed By: alanwaketan, bdhirsh

Differential Revision: D32570613

fbshipit-source-id: 2cd5698644398bda69669683f8de79fd3b6639b5
2021-12-02 07:51:52 -08:00
Sean Silva
2e156f649e Sort output of *NativeFunctions.h (#67046)
Summary:
This ensures deterministic output, allowing systems like ccache to be
more effective.

cc ezyang bhosmer bdhirsh

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67046

Reviewed By: VitalyFedyunin

Differential Revision: D31896114

Pulled By: bdhirsh

fbshipit-source-id: d29ef0cf6c7e3408b104c5239b620eaa24327088
2021-10-29 09:03:39 -07:00
Brian Hirsh
665c148e42 move some codegen utilities into utils.py (#63094)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63094

This PR:
- Moves `FileManager` and its dependencies (`assert_never` and other imports) to `utils.py`, and updates all of the call-sites with the fresh imports
- Passes the list of NativeFunction objects into `gen_trace_type` directly, instead of requiring the function to regenerate it (we already have it)

The purpose of the reshuffling is to avoid circular dependencies in the next PR, where I add codegen for the functionalization pass, which gets called from `gen.py` (but depends on some stuff from the autograd codegen - in partulcar, the list of view ops).

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D31942096

Pulled By: bdhirsh

fbshipit-source-id: 36118facae61f25f8922bb43ad2818c80b53504e
2021-10-28 10:49:17 -07:00
Peter Bell
a9983ac09c Refactor structured set_output in Register{DispatchKey}.cpp (#62188)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62188

These parts of the `set_output` code are identical for all operators in the
kernel registration files. So, this moves them from being copied into every
class to two helper functions at the top of the file.

Test Plan: Imported from OSS

Reviewed By: soulitzer

Differential Revision: D29962045

Pulled By: albanD

fbshipit-source-id: 753b8aac755f3c91b77ffa2c30a89ac91a84b7c4
2021-08-27 09:38:27 -07:00
Peter Bell
d454c9e76e Migrate THCTensor_copyIgnoringOverlaps to ATen (#63505)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63505

This isn't a public operator, just a helper function used in CUDA_tensor_apply.

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D30441305

Pulled By: ngimel

fbshipit-source-id: 84fabc701cbd8479e02d80f373a3dd62d70df2ce
2021-08-24 18:50:28 -07:00
Brian Hirsh
8bc2ba3fe3 detect missing kernels from external backends in codegen (#60737)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60737

Test Plan: Imported from OSS

Reviewed By: ezyang, jdonald

Differential Revision: D29392615

Pulled By: bdhirsh

fbshipit-source-id: d49d013243dbc8c8b55fbdb0b9b3eed38df52255
2021-07-08 15:33:04 -07:00
Brian Hirsh
7318747a3b move all external kernels into a class for better compiler error messages (#59839)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59839

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D29047680

Pulled By: bdhirsh

fbshipit-source-id: 18cf4124be440a0a343b5983e1a4165db808e7c1
2021-07-08 15:31:02 -07:00
Brian Hirsh
9134b0e42f add a boxed CPU fallback kernel (#58065)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58065

This PR replaces the existing code-generated CPU fallback kernels that XLA uses with a single boxed CPU fallback.

Current state: there are a couple different design ideas that I want to point out, but the logic for the actually kernel is mostly done and passing tests.

### Design

To preface, I'm not 100% tied to the current design and I'm putting the PR up now for opinions and totally open to alternatives, some of which I listed below. Actually after writing this description, I'm leaning toward the following changes:
* Confirm whether or not we can remove all C++ logging info directly in the yaml.

**Current Design**

All of the CPU fallback codegen is deleted. In its place, XLA (and other external backends, later) can choose to opt into a CPU fallback by adding the following code in a C++ file. I have an corresponding [xla-side PR with the xla changes](https://github.com/pytorch/xla/pull/2945/files#diff-1a005c10039f0cb11130a3b740f5de716d2f10acaea121017016025861886798R1).

There's no actual requirement to split up the code into a .h and .cpp file, but that's necessary in the XLA case because they sometimes need to call the fallback directly from their handcrafted kernels.

```
// xla_cpu_fallback.h
#include <ATen/native/CPUFallback.h>
...
void xla_cpu_fallback(const c10::OperatorHandle& op, torch::jit::Stack* stack);
...
```
```
// xla_cpu_fallback.cpp
#include "torch_xla/csrc/aten_cpu_fallback.h"
...
void xla_cpu_fallback(const c10::OperatorHandle& op, torch::jit::Stack* stack) {
  // Do custom logging here
  ...
  // Call the actual boxed CPU fallback.
  at::native::cpu_fallback(op, stack);
}

TORCH_LIBRARY_IMPL(_, XLA, m) {
  m.fallback(torch::CppFunction::makeFromBoxedFunction<&xla_cpu_fallback>());
}
```

Now that the fallback is exposed in the backend, they can call it directly. Doing so requires converting from an unboxed to a boxed context, which we provide a utility function before. E.g.:
```
#include <ATen/native/CPUFallback.h>

at::Tensor addmm(const at::Tensor& self,const at::Tensor& mat1,const at::Tensor& mat2,const at::Scalar& beta,const at::Scalar& alpha) {
  ....
  if (...call_fallback...) {
    return at::native::call_fallback_fn<&xla_cpu_fallback, decltype(at::addmm)>::call("aten::addmm", self, mat1, mat2, beta, alpha);
  }
  ...
}
```

That `decltype(at::addmm)` logic isn't actually used everywhere in the xla-side PR yet, since you hit issues with overloads. I could use it everywhere once #58092 lands.

**Alternatives: The API for calling the CPU fallback directly is ugly, can we make it nicer?**
We could change the api to use `at::redispatch`, which would make it look something like this:
```
at::Tensor addmm(const at::Tensor& self,const at::Tensor& mat1,const at::Tensor& mat2,const at::Scalar& beta,const at::Scalar& alpha) {
  ....
  if (...call_fallback...) {
    return at::redispatch::addmm(c10::DispatchKeySet(c10::DispatchKey::CPUFallback), self, mat1, mat2, beta, alpha);
  }
  ...
}
```
Which definitely feels cleaner, but also requires adding a new DispatchKey just for this use case. Conditionally calling the CPU fallback doesn't sound like a hugely important use case, so I don't know if giving up one of our 64 dispatch key slots is worth the API improvement. Totally open to other opinions though!

Another more mild improvement that would avoid having to pass operator string names (including overloads) around would be to codegen (yet another) namespaced API. Something like this:
```
at::Tensor addmm(const at::Tensor& self,const at::Tensor& mat1,const at::Tensor& mat2,const at::Scalar& beta,const at::Scalar& alpha) {
  ....
  if (...call_fallback...) {
    return at::fallback::addmm<&xla_cpu_fallback>(self, mat1, mat2, beta, alpha);
  }
  ...
}
```

Writing that out actually I actually like it more (I think it'll let us get rid of `decltype(...)`). Maybe that is nice enough to warrant a new codegen API - I haven't tried adding that yet, but if people like it I'm happy to try it out.

**More alternatives**
The current design also involves the backend manually writing and registering the boxed fallback themselves, but an alternative would be for us to do it in codegen too: they would just need to pass in all of the C++ logging that they want done in the fallback, directly through the yaml. The main downsides:
* Backend code that wants to call the fallback needs to abide by whatever convention our codegen uses to name the generated boxed fallback.
* Passing custom C++ logging through yaml is just more fragile: right now xla uses an `iostream` to log each tensor arg in the operator, so we'd have to either force other backends into the same convention or figure something else out later.

To be fair, we actually already do that: XLA has custom per-tensor-arg logging for all of the generated `out` wrappers in the codegen, which we do by passing their C++ logging info through the yaml. This seems unnecessary though, since `out` wrappers just call into a functional kernel, which is hand written with its own custom logging. So my take is: try to remove custom C++ logging from the yaml, and if it turns out to be really necessary, then we may as well take advantage of that to codegen the fallback.

### Performance impact

While ops that fall back to CPU aren't exactly hot path, we probably don't want to use a boxed fallback if it turns out to be an absolute perf killer.

I ran my benchmarks using callgrind, benchmarking both `at::add` and `at::add_out` run on XLA. My callgrind benchmark for `at::add` can be found here (the add_out benchmark looks basically the same): https://www.internalfb.com/phabricator/paste/view/P415418587. I created the benchmark by hacking the existing xla C++ test build scripts and throwing in a reference to callgrind.

I also attached the full callgrind output for each benchmark; the full output is actually pretty noise and hard to parse, but I focused on everything underneath the `at::add()` call in the output, which was much more stable. My guess is that it's due to some heavyweight async startup processing that xla does.

`at::add`:
before: 88,505,130 instructions. Full output: https://www.internalfb.com/phabricator/paste/view/P415421001
after: 102,185,654 instructions. Full output: https://www.internalfb.com/phabricator/paste/view/P415421273
delta: ~15.5% increase

`at::add_out`:
before: 63,897,395 instructions. Full output: https://www.internalfb.com/intern/everpaste/?handle=GBrrKwtAPlix9wUEAOZtrFXpdO5UbsIXAAAz
after: 73,170,346 instructions. Full output: https://www.internalfb.com/phabricator/paste/view/P415423227
delta: ~14.5% increase

High level takeaway: A framework overhead increase of 10-20% doesn't seem too horrible for the CPU fallback use case.

For structured, functional ops that requires a CPU fallback, we're actually in an unfortunate situation: we're doing even more work than necessary. Our codegen automatically creates a `CompositeExplicitAutograd` kernel which calls into the `out` operator. So the extra work that we end up doing is:
* An extra dispatcher hop: (at::add -> CompositeExplicitAutograd -> CPUFallback -> at::native::add) instead of (at::add -> CPUFallback -> at::native::add)
* An unnecessary tensor allocation (the CompositeExplicitAutograd kernel uses at::empty() to create an output tensor, which is immediately overwritten by the CPU fallback)
* An unnecessary meta() call (the CompositeExplicitAutograd kernel calls it to create the output tensor, but we call it again in the CPU kernel).
* unboxing->boxing->unboxing logic (this is the only strictly required piece)

There are definitely ways to avoid the unnecessary work explained above: one would be to give the boxed fallback higher priority than composite keys (there's [an issue for it here](https://github.com/pytorch/pytorch/issues/55104)), and codegen fallthroughs for all composite ops. It'll require more infra to set up, so I see it as more of a perf knob that we can apply if we need it later.

Unfortunately I couldn't dig much deeper into the differences aside from the aggregate change in instructions, since it looks like callgrind fudged some of the instruction attribution (`at::to_cpu` takes up a ton of instructions, but I don't see any attribution for the `at::native::add` kernel anywhere).

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D28833085

Pulled By: bdhirsh

fbshipit-source-id: 537ebd5d7fb5858f1158764ff47132d503c3b92b
2021-06-25 16:26:50 -07:00
Brian Hirsh
e341bab8ae bugfix: ensure that at::{dispatch_key}:: API gets external linkage (#58569)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58569

This should allow external C++ files that aren't compiled into `libtorch.so`/`libtorch_cpu.so` (including all of fbcode) to use fast path functions like `at::cpu::add()`, which skip the dispatcher.

So, after spending way too much time trying to figure out why I was getting linker errors when calling `at::meta::{op}` and `at::cpu::{op}` from C++ test files, I realized that we're not including the header files for C++ for the namespaced operator definitions. I.e. `RegisterCPU.cpp`, which provides definitions for the `at::cpu::{op}` fast path functions, wasn't including the `CPUFunctions.h` header.

Why that breaks stuff: the `CPUFunctions.h` header file is what marks each function with the `TORCH_API` macro, so without including it, when we build `libtorch.so` and `libtorch_cpu.so`, the compiler will look at the definition in `RegisterCPU.cpp`, not see a `TORCH_API`, and decide that the function should get internal linkage.

An alternative would be to directly mark the function definitions in `RegisterCPU.cpp` with `TORCH_API`, but this seemed cleaner.

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D28711300

Pulled By: bdhirsh

fbshipit-source-id: 535f245c20e977ff566d6da0757b3cefa137040b
2021-06-15 16:53:22 -07:00
albanD
30a18fe318 refactor yaml loader import, no runtime change (#59850)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59850

This whole stack does not change anything to the codegened code

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D29063816

Pulled By: albanD

fbshipit-source-id: ca3067443d8e6282c1077d3dafa3b4f330d43b28
2021-06-12 06:58:34 -07:00
albanD
504ec30109 avoid error string formatting aten codegen 28s -> 23s (#59848)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59848

This whole stack does not change anything to the codegened code

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D29063818

Pulled By: albanD

fbshipit-source-id: c68734672eeacd212d7bd9bebe3d53aaa20c3c24
2021-06-12 06:58:31 -07:00
Brian Hirsh
be47060af9 [remove xla from codegen] rename aten_xla_type.h -> DispatchKeyNativeFunctions.h (#58568)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58568

I split out the file rename into a separate commit to make the diff easier. The template file name is `aten_xla_type.h` -> `{DispatchKey}NativeFunctions.h`

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D28711298

Pulled By: bdhirsh

fbshipit-source-id: 2fa7d2abede560a2c577300f0b5a1f7de263d897
2021-05-26 12:53:19 -07:00
Brian Hirsh
86ce2950f6 remove xla-specific stuff from codegen (minus CPU fallback) (#58064)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58064

**Summary**
This PR tries to remove all xla-specific logic from the codegen except for two places:
- renaming the `aten_xla_type.h/cpp` template files; Going to do that in a separate PR just to make the diff easier to understand
- CPU fallback logic (everything in `aten_xla_type_default.h/cpp` and `gen_external_aten_fallbacks.py`). I'm trying to kill all of that logic in a subsequent PR by making the CPU fallback a boxed kernel, so it felt unnecessary to go through it all and remove the xla references here.

**Notable changes**
The xla codegen includes some custom logging in each kernel wrapper, so I added a few new knobs to the external yaml, that we now test. I have a corresponding [xla-side PR](https://github.com/pytorch/xla/pull/2944) with the new yaml changes, which look like this:
```
per_op_log: XLA_FN_TRACK(3)
per_argument_log: TF_VLOG(3)
cpu_fallback_counter: XLA_COUNTER("aten::{name}", 1)
extra_headers: >
     #include <tensorflow/compiler/xla/xla_client/debug_macros.h>
     #include <tensorflow/compiler/xla/xla_client/metrics.h>
     #include <tensorflow/compiler/xla/xla_client/tf_logging.h>
     #include <torch_xla/csrc/function_call_tracker.h>
     #include <torch_xla/csrc/aten_xla_type.h>
     #include <torch_xla/csrc/aten_xla_type_default.h>
```

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D28711095

Pulled By: bdhirsh

fbshipit-source-id: 90a48440f2e865a948184e2fb167ea240ada47bb
2021-05-26 12:52:13 -07:00
Brian Hirsh
1a9efbbc92 generate inplace/out kernels for xla (#57510)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57510

This is a re-write of https://github.com/pytorch/pytorch/pull/56835, which is significantly shorter thanks to the data model change in the PR below this one in the stack. See the original description in the linked PR for details.

The functional changes in this PR are the same as in the above linked one, so the description is the same with a few small changes:
- I don't bother generating `at::xla::{op}` entries for CPU fallbacks. After looking around, I see precedent for that. For example, we don't have `at::cpu::{op}` entries for composite ops- if you really want to bypass the dispatcher you need to call `at::compositeimplicitautograd::{op}`. Maybe we should revisit that later if we find an important use case for having full namespace coverage, but that doesn't seem worth half-fixing for external backends in this PR.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D28474364

Pulled By: bdhirsh

fbshipit-source-id: 4d58b60e5debad6f1ff06420597d8df8505b2876
2021-05-17 12:25:38 -07:00
Brian Hirsh
9354a68e7d [codegen] split out backend-specific information from NativeFunction in the model (#57361)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57361

Data model change in the codegen, which splits backend-specific information out of `NativeFunction`

### Overview
Currently in the codegen, native_functions.yaml has backend-specific information about each operator that is encoded directly into the data model, in the `NativeFunction` object. That's reasonable, since the native_functions.yaml is the source of truth for information about an operator, and the data model encodes that information into types.

Now that external backends can use the codegen though, that information is technically incomplete/inaccurate. In another PR, I tried patching the information on the `NativeFunction` object with the additional external information, by updating the `dispatch` entry to contain the external backend kernel name and dispatch key.

Instead, this PR tries to split out that information. The `NativeFunction` class contains all information about an operator from native_functions.yaml that's backend-independent and is known never to change regardless of what extra information backends provide. We also build up a backend "index", which is basically a mapping from [backend] -> [backend-specific-metadata]. Reading in an external backend yaml just involves updating that index with the new backend.

There were a few places where `NativeFunction` used the dispatch table directly, that I encoded as properties directly on the NativeFunction object (e.g. `is_abstract`). They were mostly around whether or not the operator has a composite kernel, which isn't something that's going to change for any external backends.

This has a few advantages:
- We can more easily re-use the existing logic in `native_function.py` and `register_dispatch_key.py` for both native and external backends, since they both involve a NativeFunction + a particular backend index
- The data in the data model will be the same regardless of how the codegen is run. Running the codegen with a new external backend doesn't change the data inside of NativeFunction or an existing backend index. It just adds a new index for that backend.
- There are several of codegen areas that don't care about backend-specific information: mostly the tracing and autograd codegen. We can reason about the codegen there more easily, knowing that backend-specific info is entirely uninvolved.

An alternative to this split would be to augment the NativeFunction objects with external backend information at the time that we create them. So the external codegen could read both native_functions.yaml and the external backend's yaml at the same time, and construct a NativeObject with a full dispatch table (including the XLA entry), and the correct setting of structured (taking into account both yamls). One disadvantage to this approach is that NativeFunction objects now contain different stuff depending on how you ran the codegen, and you have to make sure that any changes to the codegen can properly handle all the different variants.

### Data Model Changes
Removed 3 classes, which are used by the external codegen:
- ExternalBackendFunction
- ExternalBackendFunctionsGroup
- ExternalBackendMetadata

And added two new ones:
- BackendIndex
- BackendMetadata

`BackendIndex` contains any info that's specific to that backend, plus a mapping from operator names to backend specific metadata about the operator. One example of backend-specific info that's not operator-dependent is the fact that XLA prefers to implement functional kernels instead of out kernels (and so when they eventually mark an op as structured, they're going to mark the functional op and not the out op).

`BackendMetadata` contains info specific to an (operator, backend) pair. Right now, that's just (a) the name of the kernel, and (b) whether or not that operator is structured.

### Questions
I wanted to get this PR up earlier so I could get feedback, but there are a few things I want to call out:

**Dealing with `structured`.**
This PR separates out the notion of `structured` into two bits of information:
- Does [operator] have a meta() function. This is backend-agnostic, and is represented by the `structured` property on `NativeFunction`, same as before. This is used, e.g., to decide what signatures to add to `MetaFunctions.h`.
- Does [operator, backend] have an impl() function. This is backend dependent; even though technically all in-tree backends are forced to write impl() functions for an operator when we port the op to structured in native_functions.yaml, out-of-tree backends can decide to opt in independently. This is represented as a property on `BackendMetadata`. This is used in most other cases, e.g. in `RegisterDispatchKey` when we're deciding whether or not to gen a structured or unstructured wrapper.

I also baked `is_structured_dispatch_key` directly into each BackendIndex. So for operators marked "structured" in native_functions.yaml, their corresponding CPU/CUDA BackendIndex entries will be marked structured, and all others (except for potentially external backends) will not.

I ended up trying to deal with `structured` in this change since it's technically backend dependent (XLA can opt kernels into structured separately from in-tree ops), but that may have been too ambitious: it's technically not relevant until we actually add support for structured external kernels. If it's not clear that this is the right path for dealing with structured and we want to push that off, I'm fine with backing out the bits of this PR that make `structured` backend-dependent. I don't see anything *too* controversial related to structured in the change, but I tried to call out any areas in the comments

**Localizing the fact that external backends follow Dispatcher convention.**
Another thing that's sort of backend specific that I didn't totally address in this PR is the fact the fact that in-tree backends follow the Native API while external backends follow the Dispatcher API. I painted over that in `native_functions.py` by adding a helper, `kernel_signature`, that takes in a native function and gives you the "correct" signature for the specified backend- NativeSignature for in-tree backends, and DispatcherSignature for out-of-tree backends. In order to make that fully useable though, we'll need `NativeSignature` and `DispatcherSignature` to have matching interfaces. I didn't bother with that in this PR, which is why `gen_external_aten_fallbacks.py` still has a bunch of direct references to the dispatcher API. Thinking of adding it in a later PR but wanted to see if anyone has other opinions.

Maybe `is_external()` shouldn't even be a property on the BackendMetadata, and anything the codegen does that requires asking for that information should just be better abstracted away.

**Thoughts on the `BackendIndex` / `BackendMetadata` breakdown.**
One thing that's annoying right now is that to query for various pieces of metadata, you call helper functions like `backend_index.structured(f)`, which queries that particular backend and tells you if that specific NativeFunctionGroup is structured for that backend. It has to return an `Optional[bool]` though, since you have to handle the case where that operator doesn't have a kernel for that backend at all. So users of those helpers end up with a bunch of optionals that they need to unpack, even if they know at some point that the result isn't None. I think it would be easier instead to just store the NativeFunction object as a field directly on the BackendMetadata. Curious if there are any other opinions on a better way to model it though.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D28474362

Pulled By: bdhirsh

fbshipit-source-id: 41a00821acf172467d764cb41e771e096542f661
2021-05-17 12:25:35 -07:00
Brian Hirsh
3d9f10f530 [external codegen] better yaml error messaging, added explicit error message tests (#56597)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56597

3 small changes, all centered around error messaging.

1) Improved error messages when `gen_backend_stubs.py` receives invalid yaml

2) Added error message tests. I wasn't sure if there was a canonical way to do this, so I just wrote a test that takes in a list of (yaml input, expected error message) pairs and runs the codegen pipeline on each of them.

3) I also removed the LineLoader from the yaml parsing bit that reads in the external backend yaml file. Two reasons that I took it out:
 - The main reason we use it with native_functions.yaml is to easily pinpoint problems with new ops as they're added, that the codegen can pick up. 99% of these problems have to do with schema, which is irrelevant to the external yaml since it pulls the schema from native_functions
 - Not all operators have to appear in the external yaml. We could do something like "line: -1", but that's kind of weird.

If you think the line numbers would actually be of more use than I'm thinking of in the external yaml though, let me know!

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D28474363

Pulled By: bdhirsh

fbshipit-source-id: 8b5ec804b388dbbc0350a20c053da657fad0474f
2021-05-17 12:25:29 -07:00
Brian Hirsh
76fbd755c1 Reland of "D27708346: generate xla codegen in-tree" (#56601)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56601

Updating it to ensure that RegistrationDeclarations.yaml is completely
unchanged

This reverts commit 90e532f3ef.

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D27915305

Pulled By: bdhirsh

fbshipit-source-id: 491a025c44221690dad849f9a2166934130c0fec
2021-04-21 19:36:31 -07:00
Brian Hirsh
90e532f3ef Revert D27708346: generate xla codegen in-tree
Test Plan: revert-hammer

Differential Revision:
D27708346 (51d0212d0f)

Original commit changeset: 2289edd641f3

fbshipit-source-id: 86711c07db19833b9e772c558e12accba1432499
2021-04-21 11:07:45 -07:00
Brian Hirsh
51d0212d0f generate xla codegen in-tree (#55050)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55050

not ready for review yet

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D27708346

Pulled By: bdhirsh

fbshipit-source-id: 2289edd641f30277d7561cf2d48ec69c6a2137a9
2021-04-21 08:19:08 -07:00