100 Commits

Author SHA1 Message Date
Will Constable
3842140fd5 Update lazy_ir.py from lazy_tensor_staging (#72730)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72730

This diff contains changes from several PRs that landed on the lazy_tensor_staging branch.
- generating 'fallback' overrides for each codegenned op, useful for debugging
- supports operators which are missing aten:: symbols for op names, instead using their string counterpart
- makes the IR class a base class instead of hardcoding the assumption of TS

Test Plan: tested on lazy_tensor_staging branch

Reviewed By: desertfire

Differential Revision: D34178476

fbshipit-source-id: 7190b2e0d82b4eb1f4510c858c24446c6df3f9d0
(cherry picked from commit 6713d3f0ef)
2022-02-16 18:33:31 +00:00
francescocastelli
5e6f296612 Structured Kernel Precompute codegen handle fields without replacement (#71368)
Summary:
I've added parsing of an optional first line after the `precomputed` keyword in native_functions.yaml, for fields that are precomputed without replacing any argument. This line is optional, must come first, and does not contain an arrow.

These new fields are precomputed as before in the meta function and added to the precompute struct returned by the meta function. For now I've put them as the last arguments of the impl function, where they can be reused.

example:

native_functions.yaml:
```
  ...
  precomputed:
  - int numBatch, int numPlanes, int inputT, int inputH, int inputW   <- new
  - kernel_size -> int poolSizeT, int poolSizeH, int poolSizeW
  - output_size -> int outputT, int outputH, int outputW
```

meta:
```
TORCH_PRECOMPUTE_META_FUNC(fractional_max_pool3d)(
  const at::Tensor& input_,
  IntArrayRef pool_size,
  IntArrayRef output_size,
  const at::Tensor& randomSamples
) {
    ...

return TORCH_PRECOMPUTE_STRUCT(fractional_max_pool3d)().set_numBatch(numBatch).set_numPlanes(numPlanes).set_inputT(inputT).set_inputH(inputH).set_inputW(inputW)
  .set_poolSizeT(poolSizeT) ...
}
```

impl:
```
TORCH_IMPL_FUNC(fractional_max_pool3d_out_cpu)(
  const at::Tensor& input_,
  int64_t poolSizeT,
  int64_t poolSizeH,
  int64_t poolSizeW,
  int64_t outputT,
  int64_t outputH,
  int64_t outputW,
  const at::Tensor& randomSamples,
  const at::Tensor& output,
  const at::Tensor& indices,
  int64_t numBatch,    <- for now I've put them here
  int64_t numPlanes,
  int64_t inputT,
  int64_t inputH,
  int64_t inputW) {
```
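
A minimal sketch of the parsing rule described above, in plain Python. The `Precomputed` class and `parse_precomputed` function are hypothetical names used only for illustration; they are not the codegen's actual API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Precomputed:
    # kernel argument name -> precomputed fields that replace it
    replace: Dict[str, List[str]] = field(default_factory=dict)
    # precomputed fields that are added without replacing any argument
    add: List[str] = field(default_factory=list)


def parse_precomputed(lines: List[str]) -> Precomputed:
    """Hypothetical parser for the entries listed under `precomputed:`."""
    result = Precomputed()
    for index, line in enumerate(lines):
        if "->" not in line:
            # Only the very first entry is allowed to omit the arrow.
            if index != 0:
                raise ValueError(f"arrow-less entry must come first: {line!r}")
            result.add = [part.strip() for part in line.split(",")]
            continue
        arg, fields = line.split("->", 1)
        result.replace[arg.strip()] = [part.strip() for part in fields.split(",")]
    return result


entries = [
    "int numBatch, int numPlanes, int inputT, int inputH, int inputW",
    "kernel_size -> int poolSizeT, int poolSizeH, int poolSizeW",
    "output_size -> int outputT, int outputH, int outputW",
]
parsed = parse_precomputed(entries)
```

Here `parsed.add` holds the five new fields and `parsed.replace` maps `kernel_size` and `output_size` to the fields that supersede them.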

Fixes https://github.com/pytorch/pytorch/issues/71314

Pull Request resolved: https://github.com/pytorch/pytorch/pull/71368

Reviewed By: zou3519

Differential Revision: D33683984

Pulled By: bdhirsh

fbshipit-source-id: 33066dd92b8743aadf0dc8102f6bf0689f843242
(cherry picked from commit 64e46af6a4)
2022-02-08 03:56:56 +00:00
Will Constable
397183f44c Add Lazy Tensor codegen infra (#69020)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69020

Merges the lazy tensor codegen infra which has already been used on lazy_tensor_staging.

Test Plan: Test via lazy_tensor_staging branch

Reviewed By: alanwaketan, bdhirsh

Differential Revision: D32570613

fbshipit-source-id: 2cd5698644398bda69669683f8de79fd3b6639b5
2021-12-02 07:51:52 -08:00
Brian Hirsh
0032fa7725 Add a Functionalization pass in core (#64432)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64432

Original PR description + feedback here: https://github.com/pytorch/pytorch/pull/63048

I've addressed all of the feedback in the original PR and made some pretty large changes, listed below.

**Table of Contents**
- Starting points
- List of the main changes from the original PR
- Next Steps
- Example codegen output (for a view, mutation, and view+mutation op)

**Starting Points**

A good place to start when looking through the PR:
* Alban mentioned that this is a useful mental model (thanks Ed for originally making this clear to me). Semantically, the pass currently does THREE things, which are all needed by functorch - all fused together into one big pass.
  * (a) alias removal, which replaces {view} calls with {view}_copy calls, and manually tracks aliasing information, so that when one tensor is mutated, we re-apply the same mutation to all of the aliases. This is the bulk of the work - once this is done, the next 2 things are trivial to implement.
  * (b) mutation removal, which is easy to do once we know that there are no aliases. Every mutation `a.add_(b)` becomes `a.replace_(a.add(b))`
  * (c) reapplying views: all of the `{view}_copy` calls are replaced with `{view}` calls again. This is an optimization that we can make specifically for functorch (and strided backends), that only care about mutation removal and not alias removal
  * XLA and Vulkan only want (a), or (a) + (b). Later, we'll want to split this out so that you can actually opt into different versions of this logic.
  * There is currently no {view}_copy replacement, because the <replace views with copies> and <replace copies with views> steps have been combined. Later, we'll want to actually implement {view}_copy variants of each view operator, probably with codegen.
* documentation breadcrumb 1, in `FunctionalTensorWrapper.cpp`: https://github.com/pytorch/pytorch/pull/64432/files#diff-a0bac99bf205dba5b94cb64fc2466d3d55d991887572f9cd6a02e27b3a91dd60R59 (you might have to expand the `FunctionalTensorWrapper.cpp` file, which GitHub closes by default because it's large)
* documentation breadcrumb 2, in `FunctionalTensorWrapper.h`: https://github.com/pytorch/pytorch/pull/64432/files#diff-c945c71a4ccac65871f24a912e8904f9a5088b24a32e636727ea9c8fe920708aR12
* Reading through the codegen output at the bottom of this description.
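
To make steps (a) and (b) concrete, here is a toy sketch in plain Python. Lists stand in for tensors, and `slice_copy`, `slice_inverse`, and `add` are hypothetical helpers, not PyTorch operators.

```python
from typing import List


def slice_copy(base: List[int], start: int, end: int) -> List[int]:
    """Stand-in for a {view}_copy op: returns fresh storage, so nothing aliases."""
    return list(base[start:end])


def slice_inverse(base: List[int], mutated_view: List[int], start: int, end: int) -> List[int]:
    """Write a mutated view back into a copy of the base."""
    out = list(base)
    out[start:end] = mutated_view
    return out


def add(a: List[int], value: int) -> List[int]:
    """Out-of-place add: the functional replacement for an in-place add_."""
    return [x + value for x in a]


# Original (aliasing, mutating) program:
#   view = base[1:3]; view.add_(10)   # base is expected to observe the change
base = [0, 1, 2, 3]
view = slice_copy(base, 1, 3)            # (a) alias removal: view -> view_copy
view = add(view, 10)                     # (b) mutation removal: add_ -> add
base = slice_inverse(base, view, 1, 3)   # replay the mutation onto the base
other_alias = slice_copy(base, 0, 2)     # other aliases are regenerated from the updated base
```

After this runs, `base` reflects the mutation made through `view`, even though no storage was ever shared.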

**Main changes from the original PR**

(1)  I use lambdas instead of a giant enum to handle all of the different views.

This results in less boilerplate per view op (and more stuff that can be codegen'd). Every `ViewMeta` object now contains a `forward` and `reverse` lambda, that knows how to replay the view and its inverse. This makes the actual code that executes the replaying logic a lot less boilerplate-y (see `Alias::sync_update_operations` and `FunctionalTensorWrapper::sync_`)
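
A rough Python model of the forward/reverse idea, assuming lists as toy tensors. The `ViewMeta` dataclass below only mirrors the shape of the C++ object described here, and `make_slice_meta` and `replay_onto_base` are hypothetical helpers rather than the PR's replay logic.

```python
from dataclasses import dataclass
from typing import Callable, List

Tensor = List[int]  # toy stand-in for at::Tensor


@dataclass
class ViewMeta:
    forward: Callable[[Tensor], Tensor]          # replay the view on a base
    reverse: Callable[[Tensor, Tensor], Tensor]  # (base, mutated_view) -> updated base


def make_slice_meta(start: int, end: int) -> ViewMeta:
    def forward(base: Tensor) -> Tensor:
        return list(base[start:end])

    def reverse(base: Tensor, mutated_view: Tensor) -> Tensor:
        out = list(base)
        out[start:end] = mutated_view
        return out

    return ViewMeta(forward, reverse)


def replay_onto_base(base: Tensor, mutated_view: Tensor, metas: List[ViewMeta]) -> Tensor:
    """Push a mutation made through a chain of views back onto the base."""
    # intermediates[i] is the tensor that metas[i] was applied to.
    intermediates = [base]
    for meta in metas[:-1]:
        intermediates.append(meta.forward(intermediates[-1]))
    current = mutated_view
    # Undo the views from the innermost one outwards.
    for meta, parent in zip(reversed(metas), reversed(intermediates)):
        current = meta.reverse(parent, current)
    return current


base = [0, 1, 2, 3, 4, 5]
metas = [make_slice_meta(1, 5), make_slice_meta(1, 3)]  # base[1:5][1:3] is base[2:4]
updated = replay_onto_base(base, [20, 30], metas)
```

Because each view carries its own pair of closures, the replay loop needs no per-op branching, which is the boilerplate reduction claimed above.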

(2) Every tensor during the functionalization pass is always wrapped in a `FunctionalTensorWrapper`.

This is potentially unnecessary for Vulkan/XLA, and will have a mild perf impact, but for now this PR just targets the functorch use case. I previously had a complicated design (a `FunctionalTensorImplBase` class) to avoid needing the wrapper for XLA, but it had some subtleties that are gonna require more thought to fix, so I'm pushing that off for now.

(3) `FunctionalTensorWrapper` objects accurately report stride information.

It's a little annoying to do this though, because the logic that calculates stride info for each view isn't easily separated from the actual view kernels in core, `at::native::{view}`. I do this by adding logic in each `at::functionalization::{view}` kernel to call the reference implementation `at::native::{view}`. I don't do anything with the output aside from taking its size/stride/storage_offset to set the actual output tensor's size/stride/storage_offset correctly. There's another annoying part to this: I'm pretty sure that we want to pass in the actual *wrapper* tensors directly into the native kernels, not their inner unwrapped values. But there are some `at::native::{view}` kernels that call other tensor methods, which re-invokes the dispatcher, calling functionalization/functorch kernels that try to do the unwrapping.

To do this, right now I have an `AutoDispatchDirectlyToNative` guard that basically ensures that any tensor methods called inside of the at::native::{view} op always redispatch straight to the CPU kernel (which will be another at::native:: kernel). This feels kind of heavy handed, but I'm not sure of a better way to do it.

(4) `FunctionalTensorWrapper` objects accurately report aliasing information.

There's a new `FunctionalStorageImpl` class (subclass of `StorageImpl`) that allows tensors in the functionalization pass to accurately alias storage. If two tensors `a` and `b` in a functionalized program are views of one another, then `a.storage.is_alias_of(b.storage)` should return true. I added this in a pretty similar way to how meta tensors allocate storage, although I don't pass in an actual allocator (I think this is fine because you should never resize a functional tensor's storage).

One thing I'm not sure about - should `FunctionalTensorWrapper` set `storage_access_should_throw_`: (a) always, (b) never, (c) only if its wrapped tensor has it set.

Right now I have it not set, mostly because calling the reference view functions (`at::native::{view}`) requires looking at the storage. But that means that if you try to access storage from python in a functionalized program, you'll get silent garbage instead of an error. Related question: are we planning on exposing meta tensor storage to python in the future (even though it contains garbage)?

(5) better docs :)

**View operator coverage**

(6) The functionalization pass now gets math-composite view ops for free.

I didn't add the `Functionalize` dispatch key to the composite set, because I don't want composite ops like `torch.ones` to get decomposed before hitting the functionalization pass. Instead, I added codegen to manually register the `at::native::` kernels of composite view ops. This is a little hairy, because the names of the `at::native::` kernels aren't easily accessible. They're stored in a `Dict[DispatchKey, BackendIndex]`. I made a best-effort attempt to get each view kernel's name, basically by assuming that every view op has either a composite or cpu implementation.
There's also a hardcoded list of composite view ops in `gen_inplace_or_view_type.py`, but it looks like it's wrong. This is probably worth rationalizing later, but instead I created a new list of the "complete" set of composite view ops, and preserved the old set by hardcoding the delta between the two sets.

(7) I've added codegen for ops that are both views AND mutations, like `transpose_()` (why do we even have these {emoji:1f622}).

From some light testing, it looks like they work correctly with one caveat: I had a hard time ensuring that functorch programs that mutate their inputs using ops like `transpose_()` preserve the input mutations after the program finishes running. For now (in my corresponding functorch branch) I emit a warning when this happens, and just don't preserve the mutation.

(8) I added `{view}_inverse` implementations for every view op, in `FunctionalInverses.cpp`.

These are needed to take mutations made to views and replay them back onto the base. To reduce boilerplate, the codegen generates function declarations for each `{view}_inverse` function, so you get a nice compiler error when someone eventually adds a new view op.

The only view ops currently not supported are (a) as_strided, and (b) the sparse view ops (values()/indices()).

I can add support for as_strided, but it needs an `as_strided_inverse()` function. That will look really similar to the `as_strided_backward()` function in FunctionsManual.cpp, but it has some noticeable differences: we basically want an `as_strided_embed` for autograd and `as_strided_scatter` for functionalization. We also will probably need them to be primitives w.r.t. autograd, since the current implementation for autograd uses view().copy_() calls that XLA won't be able to handle. I'm wondering if anyone has any objections, but otherwise I can make those changes (which will require writing backward formulas for `as_strided_embed` and `as_strided_scatter`).

I did a bunch of manual testing that all looks pretty good, but it's definitely not fully tested. Ed pointed out that once XLA uses this pass (or at least once there's a POC), we can just run the existing xla view test suite. Hopefully that delay is okay - if it's not, maybe we can think about using OpInfos similar to how functorch uses them for testing.

Note: there's some duplication with autograd's view code. Every `{view}_inverse` implementation is really similar to the implementation for that view listed in `derivatives.yaml`. There are some major differences though:
* the autograd implementations of those backward functions (like `permute_backwards()`, in `FunctionsManual.cpp`) internally call other view ops. For functionalization, we want them to (eventually) call `{view}_copy` operators.
* For view ops that take a subset of the original storage, like `slice/select/diagonal/as_strided()`, the autograd backward functions fill the "spaces" in the inverse call with zeroes. For functionalization, we want to fill them with the value of `base` at those positions. It looks like this currently applies to 6 total ops (since we can ignore composites):
  * select
  * slice
  * diagonal
  * as_strided
  * split
  * split_with_sizes
A nice end state would probably be for the autograd + functionalization codegen to both look at the same yaml (either `derivatives.yaml`, or something else), and automatically generate the right thing. I didn't leave that in scope for this PR though.
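
The zero-fill versus base-fill difference can be illustrated with a toy `diagonal` on nested Python lists. Both functions below are hypothetical stand-ins written for this note, not the kernels in `FunctionsManual.cpp` or `FunctionalInverses.cpp`.

```python
from typing import List

Matrix = List[List[int]]


def diagonal_backward(grad: List[int], n: int) -> Matrix:
    """Autograd-style: positions outside the diagonal are filled with zeros."""
    return [[grad[i] if i == j else 0 for j in range(n)] for i in range(n)]


def diagonal_inverse(base: Matrix, mutated_view: List[int]) -> Matrix:
    """Functionalization-style: positions outside the diagonal keep base's values."""
    n = len(base)
    return [[mutated_view[i] if i == j else base[i][j] for j in range(n)] for i in range(n)]


base = [[1, 2], [3, 4]]
mutated_diag = [10, 40]

as_gradient = diagonal_backward(mutated_diag, 2)   # off-diagonal entries become 0
as_inverse = diagonal_inverse(base, mutated_diag)  # off-diagonal entries come from base
```

The two results agree on the diagonal and differ everywhere else, which is why the autograd formulas cannot be reused as-is.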

**Current State + Next Steps**

There are a bunch of followups after this PR eventually lands. Roughly in order:
* Use the current pass to register problematic composite ops in functorch. Also, nested `functionalize()` calls aren't supported yet (I mostly just need to remove some debug asserts and test it).
* Work on freeing up dispatch key space by deduplicating the `{backend}`/`Autograd{backend}`/`Sparse{backend}`/`Quantized{backend}` keys
* Once we have more dispatch keys, split up this pass into 3 pieces - it's currently fused, and doesn't do the right thing for vulkan/XLA. Specifically, all of the `{view}` calls in the current pass's view-replay logic should turn into `{view}_copy` calls that vulkan/XLA know how to implement, and there will be separate passes for (a) removing mutations, and (b) turning `{view}_copy` calls back into `{view}` calls. For Vulkan, we eventually want a pass that ONLY removes aliasing and view calls, and doesn't remove mutations. We can also probably make the 2 new passes user dispatch keys to save dispatch key space, if they'll only be used by functorch anyway.
* Do more of a dive on perf for the vulkan/xla use cases. There are several areas to improve perf with varying levels of effort required. The simplest one that I'll probably do regardless is to codegen the out-of-place kernels instead of using a boxed fallback. Getting a POC working for xla will also be useful to test the view operator coverage.

**Example Codegen Output**

View Op:
```
::std::vector<at::Tensor> split_Tensor(c10::DispatchKeySet ks, const at::Tensor & self, int64_t split_size, int64_t dim) {

      auto self_ = at::functionalization::impl::unwrapFunctionalTensor(self);
      ::std::vector<at::Tensor> out;
      {
        at::AutoDispatchBelowFunctionalize guard;
        auto tmp_output = at::redispatch::split(ks & c10::after_func_keyset, self_, split_size, dim);
        out = at::functionalization::impl::wrapFunctionalTensor(tmp_output);
        // I'm fusing the [alias removal], [mutation removal], [add views back] passes together.
        // Later, we'll want to turn them into separate passes (since e.g. vulkan only cares about alias removal).
      }

      at::functionalization::ViewMeta view_meta = at::functionalization::ViewMeta(
        [split_size, dim](const at::Tensor& base, int64_t mutated_view_idx) -> at::Tensor {
          return base.split(split_size, dim)[mutated_view_idx];
        },
        [split_size, dim](const at::Tensor& base, const at::Tensor& mutated_view, int64_t mutated_view_idx) -> at::Tensor {
          return at::functionalization::impl::split_inverse(base, mutated_view, mutated_view_idx, split_size, dim);
        }
      );
      at::functionalization::impl::set_view_meta(out, self, view_meta);

      at::AutoDispatchDirectlyToNative native_guard;
      ::std::vector<at::Tensor> reference_tensor_output = at::native::split(self, split_size, dim);
      at::functionalization::impl::set_strides(out, reference_tensor_output);
      return out;

}
```

Mutation Op:
```
at::Tensor & add__Tensor(c10::DispatchKeySet ks, at::Tensor & self, const at::Tensor & other, const at::Scalar & alpha) {

      at::functionalization::impl::sync(self);
      at::functionalization::impl::sync(other);
      auto self_ = at::functionalization::impl::unwrapFunctionalTensor(self);
      auto other_ = at::functionalization::impl::unwrapFunctionalTensor(other);
      at::Tensor tmp_output;
      {
          at::AutoDispatchBelowFunctionalize guard;
          // The functionalization pass explicitly doesn't pass out= parameters to the redispatch
          tmp_output = at::redispatch::add(
            ks & c10::after_func_keyset, self_, other_, alpha);
      }

      self.replace_(tmp_output);
      at::functionalization::impl::maybe_add_update(self);
      return self;
}
```

View + Mutation Op:
```
at::Tensor & transpose_(c10::DispatchKeySet ks, at::Tensor & self, int64_t dim0, int64_t dim1) {

      at::functionalization::ViewMeta view_meta = at::functionalization::ViewMeta(
        [dim0, dim1](const at::Tensor& base, int64_t mutated_view_idx) -> at::Tensor {
          return base.transpose(dim0, dim1);
        },
        [dim0, dim1](const at::Tensor& base, const at::Tensor& mutated_view, int64_t mutated_view_idx) -> at::Tensor {
          return at::functionalization::impl::transpose_inverse(base, mutated_view, dim0, dim1);
        }
      );
      at::functionalization::impl::mutate_view_meta(self, view_meta);
      // See  Note [Propagating strides in the functionalization pass]
      // Directly update the sizes/strides/storage_offset fields on self using the inplace call.
      // I need the guard because I don't want the at::native kernel to end up calling more functionalization/functorch kernels.
      // Its only job is to directly compute the output size/stride/storage_offset metadata.
      at::AutoDispatchDirectlyToNative native_guard;
      at::native::transpose_(self, dim0, dim1);
      return self;

}
```

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D31942093

Pulled By: bdhirsh

fbshipit-source-id: b95598dae35dd1842fa8b1d8d1448332f3afaadf
2021-10-28 10:51:17 -07:00
Brian Hirsh
665c148e42 move some codegen utilities into utils.py (#63094)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63094

This PR:
- Moves `FileManager` and its dependencies (`assert_never` and other imports) to `utils.py`, and updates all of the call-sites with the fresh imports
- Passes the list of NativeFunction objects into `gen_trace_type` directly, instead of requiring the function to regenerate it (we already have it)

The purpose of the reshuffling is to avoid circular dependencies in the next PR, where I add codegen for the functionalization pass, which gets called from `gen.py` (but depends on some stuff from the autograd codegen - in particular, the list of view ops).

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D31942096

Pulled By: bdhirsh

fbshipit-source-id: 36118facae61f25f8922bb43ad2818c80b53504e
2021-10-28 10:49:17 -07:00
Edward Yang
ece0221854 Rename int to long, add more C++ types. (#66108)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66108

BC-breaking change: intT is now longT (which aligns it more accurately with how
the types are referred to in C++).  The benefit of this is that we can idiomatically
express all C++ dtypes (with intT now mapping to int32_t).  These types are needed
for ufunc codegen in a later patch.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D31385761

Pulled By: ezyang

fbshipit-source-id: ec6f3a0953794313470dbe14911f23ac116be425
2021-10-08 08:25:06 -07:00
Ivan Yashchuk
53c0d91db9 Make autograd codegen for differentiable outputs safer to use (#65823)
Summary:
This PR adds raising an error when `len(output_differentiability) != len(outputs)`

The notes in derivatives.yaml say that
> 'output_differentiability' and value a list of the same length as the number of outputs from the forward function.

but it was not enforced in codegen leading to confusion and unexpected bugs https://github.com/pytorch/pytorch/issues/65061#issuecomment-930271126.
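
A minimal sketch of the kind of check this adds, assuming the codegen has the parsed list and the number of forward outputs at hand. `check_output_differentiability` is a hypothetical helper name, and the error message is illustrative.

```python
from typing import List, Optional


def check_output_differentiability(
    op_name: str,
    output_differentiability: Optional[List[bool]],
    num_outputs: int,
) -> None:
    """Raise if the yaml entry does not list one flag per forward output."""
    if output_differentiability is None:
        # The field is optional; nothing to validate when it is absent.
        return
    if len(output_differentiability) != num_outputs:
        raise RuntimeError(
            f"{op_name}: output_differentiability has "
            f"{len(output_differentiability)} entries but the function "
            f"returns {num_outputs} outputs"
        )


# A two-output op with a matching two-entry list passes silently.
check_output_differentiability("toy_op", [True, False], 2)
```

Failing at codegen time turns a silent mismatch into an immediate, named error.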

cc ezyang albanD zou3519 gqchen pearu nikitaved soulitzer Lezcano Varal7

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65823

Reviewed By: mrshenli

Differential Revision: D31307312

Pulled By: albanD

fbshipit-source-id: caeb949e9249310dffd237e77871e6d0d784e298
2021-10-01 07:27:57 -07:00
Michael Dagitses
543185a0fd support using gradients named for outputs in derivatives (#63947)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63947

Fixes #62196

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D30541485

Pulled By: dagitses

fbshipit-source-id: ea1dd0edd1a51936a295631e52b85e9c022a9c87
2021-09-18 07:31:45 -07:00
Meghan Lele
968d7ee46a [structured] Preserve computed elements from meta func to impl (#61746)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61746

**Summary**
This commit introduces a new feature for structured kernels that allows
kernels to declare quantities as "precomputed" in
`native_functions.yaml`, compute them once in the `meta` function and
reuse them again in the `impl`. The names and types of these quantities
are used to generate code for a struct containing them that the `meta`
function must return. In the case of a handful of surveyed kernels
(`all,`, `any`, `avg_pool2d`), these quantities that are used both in
the `meta` and `impl` have the same meaning as certain kernel arguments
and in fact supersede them. Accordingly, the correspondence between a
kernel argument and the precomputed elements that supersede it is also
captured in `native_functions.yaml`. This information is used to unpack
the struct returned by `meta` and pass its contents correctly to the
`impl` function.

The primary goal is to avoid recompute and enhance developer experience
(e.g. sometimes people can forget to compute these elements while
porting a kernel).
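
The flow can be sketched in plain Python as follows. This is a toy model: `PoolPrecompute`, `pool_meta`, `pool_impl`, and `run_structured` are hypothetical names, and the "kernel" only computes an output shape for stride 1 with no padding.

```python
from dataclasses import asdict, dataclass
from typing import Sequence, Tuple


@dataclass
class PoolPrecompute:
    # Fields that supersede the `kernel_size` argument in the impl.
    kH: int
    kW: int


def pool_meta(kernel_size: Sequence[int]) -> PoolPrecompute:
    """Validate the argument once and return the precomputed fields."""
    if len(kernel_size) not in (1, 2):
        raise ValueError("kernel_size must have one or two elements")
    kH = kernel_size[0]
    kW = kernel_size[0] if len(kernel_size) == 1 else kernel_size[1]
    return PoolPrecompute(kH=kH, kW=kW)


def pool_impl(height: int, width: int, kH: int, kW: int) -> Tuple[int, int]:
    """Receives kH/kW directly; it never sees or re-parses kernel_size."""
    return (height - kH + 1, width - kW + 1)


def run_structured(height: int, width: int, kernel_size: Sequence[int]) -> Tuple[int, int]:
    """Stand-in for the generated glue that unpacks the struct for the impl."""
    precompute = pool_meta(kernel_size)
    return pool_impl(height, width, **asdict(precompute))


out_shape = run_structured(8, 6, [3])
```

The impl's signature makes it impossible to forget the normalization step, since the raw argument is no longer available there.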

Test Plan: Imported from OSS

Reviewed By: tugsbayasgalan

Differential Revision: D30407831

Pulled By: SplitInfinity

fbshipit-source-id: 00975525ea373721fe52d06f75cd4ac91f3dc556
2021-09-01 14:34:25 -07:00
Richard Zou
389380ffcc [reland] Refactor Tensor::to to call a primitive that is not copy_. (#62262)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62262

Context
-------
functorch is unable to vmap(grad(f)) when f contains a .to
call. This is because .to (when it is not a no-op) decomposes
to .copy_ under grad and the .copy_ is not compatible with vmap.

Fix
---
The fix for this is to have all Tensor::to variants call a new operator,
`_to_copy`, that always copies and is a primitive w.r.t. autograd so
that autograd decomposes Tensor::to into a call to `_to_copy`.
(This is related to https://github.com/pytorch/pytorch/issues/60956,
please let me know if you want to bikeshed the naming).

In order to get this done I had to do a bit of refactoring. All of the
`::to` implementations now call `to_impl` which may call `_to_copy`.

Autograd codegen changes
------------------------

The second thing I had to do was modify the autograd codegen. Right now,
autograd assumes that every output is either statically known to be
differentiable or not differentiable at codegen time. `_to_copy` is a
little special because its differentiability depends on the output
dtype. e.g. `torch.randn(3, requires_grad=True).to(torch.long)` is non
differentiable. To get this to work:
- I changed how `output_differentiability` in derivatives.yaml works.
- output_differentiability can now accept "conditions" for each of the
output arguments. A "condition" is some C++ code.
- We currently only support `output_differentiability` with conditions
if there is a single output. This is for convenience and can be changed
in the future.
- I added a new `output_differentiability_conditions` field to
DifferentiabilityInfo. This gets populated in load_derivatives.py.
- forward-mode and reverse-mode AD take
`output_differentiability_conditions` into account.
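
A sketch of how the loader might split plain booleans from conditions, under the single-output restriction stated above. `split_output_differentiability` is a hypothetical helper, and the condition string in the example is a placeholder rather than the real entry for `_to_copy`.

```python
from typing import List, Optional, Tuple, Union

Entry = Union[bool, str]


def split_output_differentiability(
    entries: Optional[List[Entry]],
) -> Tuple[Optional[List[bool]], Optional[List[str]]]:
    """Return (static flags, C++ conditions) for one derivatives.yaml entry."""
    if entries is None:
        return None, None
    conditions = [entry for entry in entries if isinstance(entry, str)]
    if not conditions:
        # Purely static case: every output is known at codegen time.
        return [bool(entry) for entry in entries], None
    if len(entries) != 1:
        raise RuntimeError("conditions are only supported for a single output")
    # The output is treated as differentiable; the condition is checked at runtime.
    return [True], conditions


# Placeholder condition: stands in for "some C++ code".
flags, conditions = split_output_differentiability(["result_dtype_is_floating"])
```

Keeping the conditions in a separate field leaves the existing boolean list usable by code that only understands the static case.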

Here's how the generated code for `VariableType::_to_copy`
[looks
like](https://gist.github.com/zou3519/93462df4bda1837acee345205b7cc849).
No other autogenerated code gets modified by this PR.

Performance benchmarking
------------------------
- I benchmarked [three
cases that demonstrate overhead](https://gist.github.com/zou3519/5b6985e6906b80eec5a0dd94ed5b6a1a).
- Case A: No-op .to(). Instruction count went from 50223 to 25623. I
have no clue why but this is a good thing.
- Case B: not-no-op .to(). Instruction count went from 665291 to 671961.
This is expected; `_to_copy` adds an additional dispatch.
- Case C: not-no-op .to() forward pass and backward pass. Instruction count
went from 4022841 to 4030057. This PR adds
an additional dispatch to .to() (so there should be one additional
dispatch in the forward pass) so this number looks reasonable.
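
For scale, the deltas implied by the counts above can be worked out with plain arithmetic. Only the six quoted instruction counts are used; the `counts` and `deltas` names are just for this snippet.

```python
# (before, after) instruction counts as quoted in the three cases above.
counts = {
    "A": (50223, 25623),
    "B": (665291, 671961),
    "C": (4022841, 4030057),
}

deltas = {}
for name, (before, after) in counts.items():
    delta = after - before
    percent = 100.0 * delta / before
    deltas[name] = (delta, round(percent, 2))
    print(f"Case {name}: {delta:+d} instructions ({percent:+.2f}%)")
```

Cases B and C grow by 6670 and 7216 instructions respectively (about 1.00% and 0.18%), while Case A drops by 24600 (about 49%).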

Test Plan
---------
- test_torch.py has a test_to
- test_cuda.py has test_to*
- test_autograd has tests (test_type_conversions) that exercise the
reverse-mode path
- test_ops.py has some tests (like log_softmax) that exercise the
reverse-mode and forward-mode AD path.
- test_quantization, test_namedtensor all exercise tensor.to as well.

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D29934998

Pulled By: zou3519

fbshipit-source-id: 820069acd66fd5af97b98f42edfca68572c9eb1c
2021-07-29 10:49:32 -07:00
albanD
4a36e2a223 Add forward AD inplace check and fix codegen (#60498)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60498

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D29914593

Pulled By: albanD

fbshipit-source-id: bde649d5a03639a240dfe5fe027c6a3f758428a4
2021-07-27 13:04:55 -07:00
Nikita Shulga
478098aaac Revert D29801652: Refactor Tensor::to to call a primitive that is not copy_.
Test Plan: revert-hammer

Differential Revision:
D29801652 (29bb3f4647)

Original commit changeset: bb01eb1acf3d

fbshipit-source-id: 93693bad8068d47a3a4c16f34f300e03ea573897
2021-07-26 19:37:17 -07:00
Richard Zou
29bb3f4647 Refactor Tensor::to to call a primitive that is not copy_. (#61458)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61458

Context
-------
functorch is unable to vmap(grad(f)) when f contains a .to
call. This is because .to (when it is not a no-op) decomposes
to .copy_ under grad and the .copy_ is not compatible with vmap.

Fix
---
The fix for this is to have all Tensor::to variants call a new operator,
`_to_copy`, that always copies and is a primitive w.r.t. autograd so
that autograd decomposes Tensor::to into a call to `_to_copy`.
(This is related to https://github.com/pytorch/pytorch/issues/60956,
please let me know if you want to bikeshed the naming).

In order to get this done I had to do a bit of refactoring. All of the
`::to` implementations now call `to_impl` which may call `_to_copy`.

Autograd codegen changes
------------------------

The second thing I had to do was modify the autograd codegen. Right now,
autograd assumes that every output is either statically known to be
differentiable or not differentiable at codegen time. `_to_copy` is a
little special because its differentiability depends on the output
dtype. e.g. `torch.randn(3, requires_grad=True).to(torch.long)` is non
differentiable. To get this to work:
- I changed how `output_differentiability` in derivatives.yaml works.
- output_differentiability can now accept "conditions" for each of the
output arguments. A "condition" is some C++ code.
- We currently only support `output_differentiability` with conditions
if there is a single output. This is for convenience and can be changed
in the future.
- I added a new `output_differentiability_conditions` field to
DifferentiabilityInfo. This gets populated in load_derivatives.py.
- forward-mode and reverse-mode AD take
`output_differentiability_conditions` into account.

Here's how the generated code for `VariableType::_to_copy`
[looks
like](https://gist.github.com/zou3519/93462df4bda1837acee345205b7cc849).
No other autogenerated code gets modified by this PR.

Performance benchmarking
------------------------
- I benchmarked [three
cases that demonstrate overhead](https://gist.github.com/zou3519/5b6985e6906b80eec5a0dd94ed5b6a1a).
- Case A: No-op .to(). Instruction count went from 50223 to 25623. I
have no clue why but this is a good thing.
- Case B: not-no-op .to(). Instruction count went from 665291 to 671961.
This is expected; `_to_copy` adds an additional dispatch.
- Case C: not-no-op .to() forward pass and backward pass. Instruction count
went from 4022841 to 4030057. This PR adds
an additional dispatch to .to() (so there should be one additional
dispatch in the forward pass) so this number looks reasonable.

Test Plan
---------
- test_torch.py has a test_to
- test_cuda.py has test_to*
- test_autograd has tests (test_type_conversions) that exercise the
reverse-mode path
- test_ops.py has some tests (like log_softmax) that exercise the
reverse-mode and forward-mode AD path.
- test_quantization, test_namedtensor all exercise tensor.to as well.

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D29801652

Pulled By: zou3519

fbshipit-source-id: bb01eb1acf3d79d84f284150d1be4be3b4ace351
2021-07-26 13:02:39 -07:00
Meghan Lele
1d2ea76afb clamp: port to structured kernel (#61361)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61361

This PR ports the `clamp` kernel to the structured format. In addition, it introduces `OptionalScalarRef` as a replacement for `c10::optional<Scalar>&`. The latter, although it is a reference type, can still involve copying the contained `Scalar` (e.g. if the actual parameter is a `Scalar` or if a `c10::optional<Scalar>` is constructed just to call a kernel). `OptionalScalarRef` contains only a `const Scalar&`, and stores a flag indicating whether the instance contains a value inside the `Scalar` itself, using a new tag.

For more information, see #55070.

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D29821533

Pulled By: SplitInfinity

fbshipit-source-id: 88d55df5a4b2c14b68a57e4905d90eea1b088d99
2021-07-23 02:02:07 -07:00
Meghan Lele
1c80b5220b nll_loss_forward: port to structured kernel (#61443)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61443

For more information, see #55070.

This PR also adds a new type, `OptionalTensorRef` as a replacement for `c10::optional<Tensor>&` in order to avoid the reference count manipulations that are inevitable with the latter. I have confirmed using Godbolt/Compiler Explorer that this class does indeed avoid manipulating the reference count of the `intrusive_ptr` inside the `Tensor` it refers to:

1. [P429709479](https://www.internalfb.com/phabricator/paste/view/P429709479) - Given a `const Tensor&` in scope, an `OptionalTensorRef` can be constructed without bumping refcount.
2. [P429709883](https://www.internalfb.com/phabricator/paste/view/P429709883) - Given an `OptionalTensorRef`, a `const Tensor&` can be produced without bumping refcount.
3. [P429710335](https://www.internalfb.com/phabricator/paste/view/P429710335) - When `OptionalTensorRef` is destructed, the refcount should not be decremented.
4. [P429769525](https://www.internalfb.com/phabricator/paste/view/P429769525) - `OptionalTensorRef` can be assigned without refcount manipulation.
5. [P429769882](https://www.internalfb.com/phabricator/paste/view/P429769882) - `OptionalTensorRef` can be move assigned without refcount manipulation.

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D29780666

Pulled By: SplitInfinity

fbshipit-source-id: 7af157215300e9254d635433cbd583f7329fe064
2021-07-20 11:45:44 -07:00
Edward Yang
3ad3f20bff Add an optional Device parameter to pin_memory/is_pinned that does nothing (#60201)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60201

This is to flush out BC/FC problems with adding this parameter.  A later
PR will actually add the desired functionality.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D29331880

Pulled By: ezyang

fbshipit-source-id: 6036716d6ae55e6ea7ef2348b6c34a39613c8dd5
2021-06-28 10:38:52 -07:00
Brian Hirsh
7bc86458e1 Revert "Revert D28833086: beef up at::_ops API" (#60214)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60214

Relanding this PR, but with a fix for windows cuda builds (example failure in master here: https://github.com/pytorch/pytorch/runs/2852662871)

This is identical to the original PR except for one change in `tools/codegen/gen.py`: `static constexpr` -> `static CONSTEXPR_EXCEPT_WIN_CUDA`

This actually took a while to figure out, until I tracked down a previous pytorch PR that encountered a similar issue: https://github.com/pytorch/pytorch/pull/40675

This reverts commit 6d0fb85a62.

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D29213932

Pulled By: bdhirsh

fbshipit-source-id: b90c7c10e5a51f8d6173ddca673b418e5774c248
2021-06-24 18:08:54 -07:00
Saketh Are
729f7cd52f Implement histogram operator on CPU (#58780)
Summary:
The existing [torch.histc](https://pytorch.org/docs/stable/generated/torch.histc.html) operator is limited in comparison to [numpy.histogram](https://numpy.org/doc/stable/reference/generated/numpy.histogram.html). This PR adds torch.histogram on CPU. The new operator replicates numpy.histogram's behavior, including support for caller-specified bin edges and weights. It was motivated by previous community requests for histogram.
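
As a rough illustration of the numpy.histogram-style semantics mentioned above (caller-specified bin edges plus optional weights), here is a pure-Python sketch. It is not the kernel added by this PR: `histogram_sketch` is a hypothetical helper, and it assumes the edges are sorted in increasing order and that every bin is half-open except the last one, which is closed on the right (as in numpy.histogram).

```python
from bisect import bisect_right
from typing import List, Optional, Sequence


def histogram_sketch(
    values: Sequence[float],
    edges: Sequence[float],
    weights: Optional[Sequence[float]] = None,
) -> List[float]:
    """Hypothetical helper: bin `values` into the bins delimited by `edges`."""
    if weights is None:
        weights = [1.0] * len(values)
    counts = [0.0] * (len(edges) - 1)
    for v, w in zip(values, weights):
        # Values outside [edges[0], edges[-1]] are ignored.
        if v < edges[0] or v > edges[-1]:
            continue
        # Bins are [lo, hi); clamping puts v == edges[-1] into the last bin.
        i = min(bisect_right(edges, v) - 1, len(counts) - 1)
        counts[i] += w
    return counts


unweighted = histogram_sketch([0.5, 1.0, 1.5, 3.0, 4.0], [0, 1, 2, 3])
weighted = histogram_sketch(
    [0.5, 1.0, 1.5, 3.0, 4.0], [0, 1, 2, 3], weights=[2.0, 1.0, 1.0, 0.5, 9.0]
)
```

With weights, each sample contributes its weight to its bin instead of a count of one.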

The implementation was [benchmarked](https://docs.google.com/spreadsheets/d/1xCR0jODchVvwdVSAjiLsNCkmyictA6j1LNfDpWOafjw/edit?usp=sharing) against numpy.histogram as well as torch.histc. This implementation is weakly faster than numpy.histogram across all types of inputs tested, and performs in line with torch.histc for the limited inputs histc supports.

mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58780

Test Plan:
Added unit tests, OpInfo for the new torch.histogram operator.

Tested execution time on a variety of input sizes and compared to numpy.histogram performance: https://docs.google.com/spreadsheets/d/1xCR0jODchVvwdVSAjiLsNCkmyictA6j1LNfDpWOafjw/edit?usp=sharing

Reviewed By: ezyang

Differential Revision: D29134626

Pulled By: saketh-are

fbshipit-source-id: f2773085de1697f6bc6ffdeffe9a81267f51bdfc
2021-06-22 10:06:04 -07:00
Brian Hirsh
6d0fb85a62 Revert D28833086: beef up at::_ops API
Test Plan: revert-hammer

Differential Revision:
D28833086 (e2129d1c06)

Original commit changeset: 55f322a8378c

fbshipit-source-id: e55bf812ec411bb6bee87654f1d65ff10c046106
2021-06-17 14:28:32 -07:00
Brian Hirsh
e2129d1c06 beef up at::_ops API (#59115)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59115

This PR beefs up the `at::_ops::` API as a source of truth for compile-time information about each operator.

### Changes
Previously, for every op defined in native_functions.yaml, `at::_ops` defined an unambiguous function, e.g. `at::_ops::add_Tensor`: effectively an unambiguously named version of the C++ API that you could decltype() successfully because it had no overloads. It came with a user-facing macro: `decltype(ATEN_FN2(add, Tensor)) // expands to decltype(at::_ops::add_Tensor)`.

Now, `at::_ops::add_Tensor` is a struct containing a few static fields and methods (declared in `Operators.h`, defined in `Operators.cpp`):
```
struct TORCH_API add_Tensor {
  using schema = at::Tensor (const at::Tensor &, const at::Tensor &, const at::Scalar &);
  using ptr_schema = at::Tensor (*)(const at::Tensor &, const at::Tensor &, const at::Scalar &);
  static constexpr const char* name = "aten::add";
  static constexpr const char* overload_name = "Tensor";
  static constexpr const char* schema_str = "add.Tensor(Tensor self, Tensor other, *, Scalar alpha=1) -> Tensor";
  static at::Tensor call(const at::Tensor & self, const at::Tensor & other, const at::Scalar & alpha);
  static at::Tensor redispatch(c10::DispatchKeySet dispatchKeySet, const at::Tensor & self, const at::Tensor & ot
};
```

What used to be the function `at::_ops::add_Tensor` can now be accessed as `at::_ops::add_Tensor::call`, and I've added a new macro to access the entire struct (naming suggestions welcome) - `ATEN_OP2(add, Tensor)`.

### Motivation

There were two motivations for this change:

**Codegen refactor**
The `at::_ops::` API as it exists now is (yet another) C++ entry point into the dispatcher, in addition to the Function, Method, and Redispatch APIs. Instead, after this PR, the existing three API's are all inline-able wrapper API's that call into the `at::_ops` API to do the real work. The function and method API's call into `at::_ops::{op}::call`, while the redispatch API calls into `at::_ops::{op}::redispatch`.

This will hopefully make it easier to pile in any future C++ API's that we want to code-generate. It also means that stuff like the string name, overload name, and schema of each operator is consolidated in a single place, rather than having the codegen hardcode various strings in multiple codegen output files.
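
To make the "one-liner wrapper" idea concrete, here is a minimal Python sketch of how a code generator could emit a function-API wrapper that forwards to `at::_ops::{op}::call`. The helper names (`ops_struct_name`, `render_function_wrapper`) and the exact output formatting are assumptions for illustration, not the actual `tools/codegen` code; argument defaults in the public API are deliberately left out.

```python
from typing import List, Tuple


def ops_struct_name(name: str, overload: str) -> str:
    # Overload names are disambiguated by suffixing, e.g. add + Tensor -> add_Tensor.
    return f"{name}_{overload}" if overload else name


def render_function_wrapper(
    name: str, overload: str, ret: str, params: List[Tuple[str, str]]
) -> str:
    """Hypothetical helper: `params` is a list of (cpp_type, arg_name) pairs."""
    decl = ", ".join(f"{cpp_type} {arg}" for cpp_type, arg in params)
    args = ", ".join(arg for _, arg in params)
    target = f"at::_ops::{ops_struct_name(name, overload)}::call"
    # The wrapper body is a single forwarding call, so it is cheap to inline.
    return f"inline {ret} {name}({decl}) {{ return {target}({args}); }}"


wrapper = render_function_wrapper(
    "add",
    "Tensor",
    "at::Tensor",
    [
        ("const at::Tensor &", "self"),
        ("const at::Tensor &", "other"),
        ("const at::Scalar &", "alpha"),
    ],
)
```

A redispatch wrapper would look the same, except that it targets `::redispatch` and threads a dispatch key set through as the first argument.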

**Extra compile-time metadata**
In the [boxed CPU fallback PR](https://github.com/pytorch/pytorch/pull/58065/files#diff-c9b55f0d692a9bea8019c6f19bc46877f1efa0f9d4fc2086cf299b52768343b4R31) above this in the stack, I added a new API that external backends can use to call directly into their boxed fallback from an unboxed context. Adding extra metadata to `at::_ops` means that XLA's usage of that API doesn't require passing in the string name and overload of each name as arguments; we can just infer them.

The updated API looks like this (see [the XLA-side PR ](https://github.com/pytorch/xla/pull/2945/files#diff-5e65c3c1d847191cb691d1874732e971f09fa1aad7a980a555c3b0504a5b6470R250) for more examples)
```
return at::native::call_fallback_fn<&xla_cpu_fallback, ATEN_OP2(add, Tensor)>::call(a, b, 1.0);
```

**Characteristics of the `at::_ops` API**
(I also commented this in the codegen)

(1) It follows the Dispatcher API.

This means, e.g., that it takes in the expanded arguments rather than `TensorOptions`. This is kind of necessary for perf, if we want `at::_ops` to serve as the main implementation of the existing C++ API's. For example: if it followed the C++ API, then all of the faithful C++ factory functions would need to wrap their arguments into TensorOptions only to unwrap them again.

(2) Overload names are disambiguated.

This is the same as before; it's helpful for pytorch extenders who would like to decltype() an aten operator that has overloads, e.g. `decltype(at::_ops::mul_Tensor::call)`.

(3) No argument defaulting is allowed.

This is more of an implementation detail to avoid #include cycles, since TensorBody.h (which defines the Tensor class) needs to include this file. The #include situation is precarious though!

(4) manual_cpp_bindings and faithful names are not included in the API.

I think that this is one we have a choice with. This applies to stuff like __dispatch__is_complex(), and add_outf(). These aren't "real native_functions.yaml ops", they're just additional functions provided by the C++ API. They're implemented as wrappers in Functions.h that call into the actual operators defined here, i.e. at::_ops::is_complex::call() and at::_ops::add_out::call(). This means that ATEN_OP(is_complex) will not fastpath, and will go through the dispatcher. It also means that `ATEN_OP2(add, out)` is automatically faithful and takes its out argument at the end (this is just because it follows the dispatcher API).

**Details**

Instead of codegen'ing the existing 3 API's in `Functions.cpp`, `TensorMethods.cpp` and `RedispatchFunctions.cpp`, I codegen them directly into the headers: `Functions.h`, `TensorBody.h`, and `RedispatchFunctions.h`. I mostly did this for perf, since we want to avoid introducing an extra function call in the hot path of every operator. These functions are also now all one-liners that call into `at::_ops`, so the compiler should just inline them all anyway.

The main downside in doing that though was that I had to bend over backwards in a few cases to avoid cyclical #include statements. The issue is that `TensorBody.h` now includes `Operators.h` (because the codegen'd method API is implemented by calling into `at::_ops`), but `TensorBody.h` also includes the definition of the Tensor class. That means that `Operators.h` can't be aware of the Tensor class; it needs to forward declare everything and avoid using the Tensor class directly. To fix cyclic includes, I had to:
- Not allow defaulting in the `at::_ops` API
- Move some code that was called when translating from C++ to Dispatcher API's directly into the codegen template (`check_tensor_options_and_extract_memory_format`)

It's not great, but I don't think this specific include cycle will break down in the near future; the only code that we need to call before getting to `Operators.cpp` is the translations from various API's to the dispatcher API; there aren't many of them, and there's no major reason for them to live in an external utils file somewhere.

Moving the code into the headers also meant that the codegen no longer needs to deal with `Functions.cpp`/`TensorMethods.cpp`/`RedispatchFunctions.cpp`. All of the functions that used to be defined in `TensorMethods.cpp` seemed small enough for me to lump into `TensorBody.h`, but some of the functions in `Functions.cpp` looked pretty big to put in a header, so I moved the file to `aten/src/ATen/native/Functions.cpp`.

It might be worth keeping `TensorMethods.cpp` around too, in case we have any beefy hand-written tensor methods that we don't want to put in a header.

**Perf**
I ran a few benchmarks in callgrind, and didn't see a noticeable instruction count change when calling `at::add()`. I also saw in the output that `at::add()` was successfully getting inlined.

There's also probably a light risk of binary size increase; I think that there's a binary size regression test that I can run in phabricator (going to try it). I can also try inspecting `libtorch.so` directly and seeing if it's any bigger, but my hope is that the inlining means that we aren't generating separate symbols for `at::add` and `at::_ops::add_Tensor::call`.

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D28833086

Pulled By: bdhirsh

fbshipit-source-id: 55f322a8378cb9a3cb6642f72aa291be381dd95b
2021-06-17 13:09:46 -07:00
Kurt Mohler
fe8e5eb260 Change native functions to take c10::string_view args instead of std::string (#57680)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53546

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57680

Reviewed By: malfet

Differential Revision: D28511799

Pulled By: ezyang

fbshipit-source-id: 43142f994d048b28b3279ccdb7a28cbaa3190973
2021-05-20 18:15:45 -07:00
Brian Hirsh
1a9efbbc92 generate inplace/out kernels for xla (#57510)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57510

This is a re-write of https://github.com/pytorch/pytorch/pull/56835, which is significantly shorter thanks to the data model change in the PR below this one in the stack. See the original description in the linked PR for details.

The functional changes in this PR are the same as in the above linked one, so the description is the same with a few small changes:
- I don't bother generating `at::xla::{op}` entries for CPU fallbacks. After looking around, I see precedent for that. For example, we don't have `at::cpu::{op}` entries for composite ops- if you really want to bypass the dispatcher you need to call `at::compositeimplicitautograd::{op}`. Maybe we should revisit that later if we find an important use case for having full namespace coverage, but that doesn't seem worth half-fixing for external backends in this PR.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D28474364

Pulled By: bdhirsh

fbshipit-source-id: 4d58b60e5debad6f1ff06420597d8df8505b2876
2021-05-17 12:25:38 -07:00
Brian Hirsh
9354a68e7d [codegen] split out backend-specific information from NativeFunction in the model (#57361)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57361

Data model change in the codegen, which splits backend-specific information out of `NativeFunction`

### Overview
Currently in the codegen, native_functions.yaml has backend-specific information about each operator that is encoded directly into the data model, in the `NativeFunction` object. That's reasonable, since the native_functions.yaml is the source of truth for information about an operator, and the data model encodes that information into types.

Now that external backends can use the codegen though, that information is technically incomplete/inaccurate. In another PR, I tried patching the information on the `NativeFunction` object with the additional external information, by updating the `dispatch` entry to contain the external backend kernel name and dispatch key.

Instead, this PR tries to split out that information. The `NativeFunction` class contains all information about an operator from native_functions.yaml that's backend-independent and is known never to change regardless of what extra information backends provide. We also build up a backend "index", which is basically a mapping from [backend] -> [backend-specific-metadata]. Reading in an external backend yaml just involves updating that index with the new backend.

There were a few places where `NativeFunction` used the dispatch table directly, that I encoded as properties directly on the NativeFunction object (e.g. `is_abstract`). They were mostly around whether or not the operator has a composite kernel, which isn't something that's going to change for any external backends.

This has a few advantages:
- We can more easily re-use the existing logic in `native_function.py` and `register_dispatch_key.py` for both native and external backends, since they both involve a NativeFunction + a particular backend index
- The data in the data model will be the same regardless of how the codegen is run. Running the codegen with a new external backend doesn't change the data inside of NativeFunction or an existing backend index. It just adds a new index for that backend.
- There are several codegen areas that don't care about backend-specific information: mostly the tracing and autograd codegen. We can reason about the codegen there more easily, knowing that backend-specific info is entirely uninvolved.

An alternative to this split would be to augment the NativeFunction objects with external backend information at the time that we create them. So the external codegen could read both native_functions.yaml and the external backend's yaml at the same time, and construct a NativeFunction with a full dispatch table (including the XLA entry), and the correct setting of structured (taking into account both yamls). One disadvantage to this approach is that NativeFunction objects now contain different stuff depending on how you ran the codegen, and you have to make sure that any changes to the codegen can properly handle all the different variants.

### Data Model Changes
Removed 3 classes, which were used by the external codegen:
- ExternalBackendFunction
- ExternalBackendFunctionsGroup
- ExternalBackendMetadata

And added two new ones:
- BackendIndex
- BackendMetadata

`BackendIndex` contains any info that's specific to that backend, plus a mapping from operator names to backend specific metadata about the operator. One example of backend-specific info that's not operator-dependent is the fact that XLA prefers to implement functional kernels instead of out kernels (and so when they eventually mark an op as structured, they're going to mark the functional op and not the out op).

`BackendMetadata` contains info specific to an (operator, backend) pair. Right now, that's just (a) the name of the kernel, and (b) whether or not that operator is structured.
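
The split described above can be sketched with two small dataclasses. This is a simplified illustration, not the real data model: the field names (`prefers_functional`, `index`), the string operator keys, and the kernel names are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class BackendMetadata:
    # Info specific to one (operator, backend) pair.
    kernel: str        # name of the kernel
    structured: bool   # whether this backend's kernel is structured


@dataclass
class BackendIndex:
    dispatch_key: str
    # Backend-wide info that does not depend on the operator (hypothetical field).
    prefers_functional: bool = False
    # Mapping from operator name to backend-specific metadata.
    index: Dict[str, BackendMetadata] = field(default_factory=dict)

    def structured(self, op_name: str) -> Optional[bool]:
        # None means "this backend has no kernel for that operator at all".
        meta = self.index.get(op_name)
        return None if meta is None else meta.structured


indices = {
    "CPU": BackendIndex("CPU", index={"add.Tensor": BackendMetadata("add", True)}),
}
# Reading an external backend yaml only adds a new index; existing ones are untouched.
indices["XLA"] = BackendIndex(
    "XLA",
    prefers_functional=True,
    index={"add.Tensor": BackendMetadata("add", False)},
)
```

Because the lookup returns `Optional[bool]`, callers have to handle the "no kernel for this backend" case explicitly.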

### Questions
I wanted to get this PR up earlier so I could get feedback, but there are a few things I want to call out:

**Dealing with `structured`.**
This PR separates out the notion of `structured` into two bits of information:
- Does [operator] have a meta() function. This is backend-agnostic, and is represented by the `structured` property on `NativeFunction`, same as before. This is used, e.g., to decide what signatures to add to `MetaFunctions.h`.
- Does [operator, backend] have an impl() function. This is backend dependent; even though technically all in-tree backends are forced to write impl() functions for an operator when we port the op to structured in native_functions.yaml, out-of-tree backends can decide to opt in independently. This is represented as a property on `BackendMetadata`. This is used in most other cases, e.g. in `RegisterDispatchKey` when we're deciding whether or not to gen a structured or unstructured wrapper.

I also baked `is_structured_dispatch_key` directly into each BackendIndex. So for operators marked "structured" in native_functions.yaml, their corresponding CPU/CUDA BackendIndex entries will be marked structured, and all others (except for potentially external backends) will not.

I ended up trying to deal with `structured` in this change since it's technically backend dependent (XLA can opt kernels into structured separately from in-tree ops), but that may have been too ambitious: it's technically not relevant until we actually add support for structured external kernels. If it's not clear that this is the right path for dealing with structured and we want to push that off, I'm fine with backing out the bits of this PR that make `structured` backend-dependent. I don't see anything *too* controversial related to structured in the change, but I tried to call out any areas in the comments.

**Localizing the fact that external backends follow Dispatcher convention.**
Another thing that's sort of backend specific that I didn't totally address in this PR is the fact that in-tree backends follow the Native API while external backends follow the Dispatcher API. I painted over that in `native_functions.py` by adding a helper, `kernel_signature`, that takes in a native function and gives you the "correct" signature for the specified backend: NativeSignature for in-tree backends, and DispatcherSignature for out-of-tree backends. In order to make that fully useable though, we'll need `NativeSignature` and `DispatcherSignature` to have matching interfaces. I didn't bother with that in this PR, which is why `gen_external_aten_fallbacks.py` still has a bunch of direct references to the dispatcher API. Thinking of adding it in a later PR but wanted to see if anyone has other opinions.

Maybe `is_external()` shouldn't even be a property on the BackendMetadata, and anything the codegen does that requires asking for that information should just be better abstracted away.

**Thoughts on the `BackendIndex` / `BackendMetadata` breakdown.**
One thing that's annoying right now is that to query for various pieces of metadata, you call helper functions like `backend_index.structured(f)`, which queries that particular backend and tells you if that specific NativeFunctionGroup is structured for that backend. It has to return an `Optional[bool]` though, since you have to handle the case where that operator doesn't have a kernel for that backend at all. So users of those helpers end up with a bunch of optionals that they need to unpack, even if they know at some point that the result isn't None. I think it would be easier instead to just store the NativeFunction object as a field directly on the BackendMetadata. Curious if there are any other opinions on a better way to model it though.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D28474362

Pulled By: bdhirsh

fbshipit-source-id: 41a00821acf172467d764cb41e771e096542f661
2021-05-17 12:25:35 -07:00
albanD
c711c30c74 Revert "Revert D28387764: Codegen inplace forward AD formula from out of place one if needed" (#58231)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58231

This reverts commit 066e7699eb.

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D28412495

Pulled By: albanD

fbshipit-source-id: 97dd4580baac903805ab66ad55fe9570dec993ee
2021-05-14 08:35:38 -07:00
Edward Yang
066e7699eb Revert D28387764: Codegen inplace forward AD formula from out of place one if needed
Test Plan: revert-hammer

Differential Revision:
D28387764 (2279962162)

Original commit changeset: 7bf3929dd214

fbshipit-source-id: 473851cf7527b0edf303fdb46b9c07357ff7f340
2021-05-12 20:35:02 -07:00
albanD
2279962162 Codegen inplace forward AD formula from out of place one if needed (#57767)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57767

Test Plan: Imported from OSS

Reviewed By: agolynski

Differential Revision: D28387764

Pulled By: albanD

fbshipit-source-id: 7bf3929dd21425be653da112385e902aa50455a1
2021-05-12 18:49:20 -07:00
Brian Hirsh
76fbd755c1 Reland of "D27708346: generate xla codegen in-tree" (#56601)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56601

Updating it to ensure that RegistrationDeclarations.yaml is completely
unchanged

This reverts commit 90e532f3ef.

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D27915305

Pulled By: bdhirsh

fbshipit-source-id: 491a025c44221690dad849f9a2166934130c0fec
2021-04-21 19:36:31 -07:00
Scott Wolchok
1211bccc65 [PyTorch] Fix const correctness for resize native functions (#55351)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55351

We incorrectly used `Tensor&` to mean "the underlying
TensorImpl cannot be changed", as explained in
https://github.com/zdevito/ATen/issues/27#issuecomment-330717839 .
This diff gets us on the path to fixing this problem: we have an
incremental way to fix individual native functions so that we can
apply any handwritten fixes a few at a time. It gets the migration
started with the `resize` family of native functions.
ghstack-source-id: 127092677

Test Plan: fitsships

Reviewed By: ezyang

Differential Revision: D27583983

fbshipit-source-id: 4eeeec85f5d268e9d0f1645eb9396914a9f9557f
2021-04-21 14:51:41 -07:00
Brian Hirsh
90e532f3ef Revert D27708346: generate xla codegen in-tree
Test Plan: revert-hammer

Differential Revision:
D27708346 (51d0212d0f)

Original commit changeset: 2289edd641f3

fbshipit-source-id: 86711c07db19833b9e772c558e12accba1432499
2021-04-21 11:07:45 -07:00
Brian Hirsh
51d0212d0f generate xla codegen in-tree (#55050)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55050

not ready for review yet

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D27708346

Pulled By: bdhirsh

fbshipit-source-id: 2289edd641f30277d7561cf2d48ec69c6a2137a9
2021-04-21 08:19:08 -07:00
Brian Hirsh
eca98fedb5 split out NamedCType from CType. Remove direct string comparison from autograd codegen (#55334)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55334

The goal of this PR is to clean up some of the autograd codegen to compare C++ types using `CType` objects instead of raw strings. My last PR in the stack made that string comparison a little more fragile, since the raw C++ strings needed to be namespace-aware.

I confirmed byte-for-byte no codegen changes vs. the last PR (which added namespaces to the codegen) by running `diff -qr ../pytorch-common_test/torch/csrc/autograd/generated/ ../pytorch-callgrind_test_after2/torch/csrc/autograd/generated/` and `diff -qr ../pytorch-common_test/build/aten/src/ATen/ ../pytorch-callgrind_test_after2/build/aten/src/ATen/`

Note that a better end-state for the autograd codegen would be to do all of its type pattern matching directly off of JIT types, instead of off of CType’s (which are really just generated from JIT types, incorporating C++ specific semantics). That looks like it’ll require a pretty substantial change though, so I’m not doing it in this PR.

As part of this change (and after talking with ezyang), I split off the `CType` data class into a separate `NamedCType` class, which holds a name and a `CType`. This way, `CType` only knows about actual C++ types, making it easier to compare CType’s to each other in the codegen when we only care about the type. The core change is in `types.py`, but it required a bunch of downstream changes to update all of the places where we create `CType`s to create `NamedCType`s instead.

The main change in the autograd codegen was that I updated `SavedAttribute` to store a `NamedCType`. The other autograd changes all pretty much came from that change.
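
A minimal sketch of the `CType` / `NamedCType` split, assuming simplified stand-in classes rather than the real definitions in `types.py` (in particular, the base type is modelled here as a plain string). It shows why comparing types becomes easier once the name lives in a separate wrapper.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class BaseCType:
    type: str  # simplified: a plain string stands in for the C++ type

    def cpp_type(self) -> str:
        return self.type


@dataclass(frozen=True)
class ConstRefCType:
    elem: BaseCType

    def cpp_type(self) -> str:
        return f"const {self.elem.cpp_type()} &"


CType = Union[BaseCType, ConstRefCType]


@dataclass(frozen=True)
class NamedCType:
    # The name lives here, so CType values only describe C++ types.
    name: str
    type: CType

    def cpp_type(self) -> str:
        return self.type.cpp_type()


self_arg = NamedCType("self", ConstRefCType(BaseCType("at::Tensor")))
other_arg = NamedCType("other", ConstRefCType(BaseCType("at::Tensor")))

# Different bindings, same C++ type: compare the CType objects, not raw strings.
same_type = self_arg.type == other_arg.type
```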

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D27708347

Pulled By: bdhirsh

fbshipit-source-id: 3e07c80569c7b229c638f389e76e319bff6315f9
2021-04-16 11:43:08 -07:00
Brian Hirsh
947c7a8215 add C++ namespacing logic to ctypes (#55047)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55047

Added namespaces to all of the `CTypes` printed in the codegen. This is pretty much required if we want to use codegen externally, since we can no longer assume that we're inside of the `at::` namespace.

Important changes are in `types.py`.

How do we add the notion of namespaces to C++ types without people having to write "at::Tensor" everywhere? Before this PR, `CType` held a raw string representing the type, i.e. `BaseCType("Tensor", binds)`. This PR introduces a set of singleton base C++ types in `types.py`, that know how to print their namespace. Instead, we'd write `BaseCType(tensorT, binds)`, where printing `tensorT` will properly print out "at::Tensor".

This also means that you can't create arbitrary `CTypes`. If we need a new C++ type in the codegen, we need to add it to the list in `types.py`.

One blip in the design: we don't want to change `RegistrationDeclarations.yaml`, since that'll break external backends that ingest it. I added separate functions to display types without the namespace that are used to create `RegistrationDeclarations.yaml`. With an external codegen API though, we can eventually kill it :)
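
The namespacing scheme can be sketched roughly as follows. The class name `BaseCppType`, its fields, and the `strip_namespace` flag are assumptions for illustration; only `tensorT` and the `BaseCType(tensorT, binds)` call shape come from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BaseCppType:
    ns: Optional[str]  # namespace, or None for builtin types
    name: str

    def __str__(self) -> str:
        return self.name if self.ns is None else f"{self.ns}::{self.name}"


# Singleton base types: new C++ types must be added to this list explicitly.
tensorT = BaseCppType("at", "Tensor")
scalarT = BaseCppType("at", "Scalar")
boolT = BaseCppType(None, "bool")


@dataclass(frozen=True)
class BaseCType:
    type: BaseCppType
    name: str  # the binding name

    def cpp_type(self, *, strip_namespace: bool = False) -> str:
        # The namespace-free form is what a legacy yaml consumer would expect.
        return self.type.name if strip_namespace else str(self.type)


arg = BaseCType(tensorT, "self")
qualified = arg.cpp_type()
legacy = arg.cpp_type(strip_namespace=True)
```

Keeping the namespace on the singleton means call sites never spell out "at::Tensor" by hand.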

I also didn't realize until this PR that `Declarations.yaml` is still directly in use, by some python/autograd codegen. Rather than keep that yaml byte-for-byte compatible, I just updated the callsites in the autograd codegen to work with namespaces. In the NEXT pr, I try to clean up some of the autograd codegen to stop using raw strings to match against C++ types.

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D27708349

Pulled By: bdhirsh

fbshipit-source-id: 56a4f81fc101795bcb9ee1f722121480fb2356ad
2021-04-16 11:43:06 -07:00
Brian Hirsh
164bee1d09 Return a CType instead of a string for returns, beef up CType (#55046)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55046

Updating `returns` in the codegen to return a CType instead of a raw string.

This has the benefit of putting all stringifying logic through CType, which is useful in the followup PR when I add namespaces.

I also added new CTypes for other templated C++ types: array, vector and tuple. Mostly because it makes the namespacing logic in the next PR significantly easier. It also seems more natural to me that `BaseCType` shouldn't represent specializations of templated types.

There's a little bit of weirdness with types that are currently *only* used for returns, i.e. `TupleCType`. Returns aren't named, so I opted not to give it a name; we can add one later if we discover that we need it.
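
For illustration, templated C++ types could be modelled as their own classes along these lines. This is a sketch with assumed class shapes and output formatting, not the actual `types.py` code; note that `TupleCType` carries no name, matching the remark about unnamed returns.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class BaseCType:
    type: str  # a non-templated C++ type

    def cpp_type(self) -> str:
        return self.type


@dataclass(frozen=True)
class VectorCType:
    elem: object  # any object with a cpp_type() method

    def cpp_type(self) -> str:
        return f"std::vector<{self.elem.cpp_type()}>"


@dataclass(frozen=True)
class ArrayCType:
    elem: object
    size: int

    def cpp_type(self) -> str:
        return f"std::array<{self.elem.cpp_type()},{self.size}>"


@dataclass(frozen=True)
class TupleCType:
    elems: Tuple[object, ...]

    def cpp_type(self) -> str:
        inner = ", ".join(e.cpp_type() for e in self.elems)
        return f"std::tuple<{inner}>"


ret = TupleCType((BaseCType("Tensor"), VectorCType(BaseCType("Tensor"))))
returns_str = ret.cpp_type()
```

Because every wrapper delegates to its elements' `cpp_type()`, a later change to how base types print propagates through nested types automatically.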

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D27708348

Pulled By: bdhirsh

fbshipit-source-id: 230b210c3e53be1bd362105fbea8451055dc59a8
2021-04-16 11:41:46 -07:00
Edward Yang
6ec71ed4f9 Replace all direct cdata access with THPVariable_Unpack (#55799)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55799

I'm going to change the implementation of cdata soon so I need to
abstract over cdata access with a function.  Additionally, many
users are manually casting to THPVariable to access
the member so I can remove these unsafe casts in the client code
(the implementation, of course, is still doing an unsafe cast.)

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D27712130

Pulled By: ezyang

fbshipit-source-id: 95fcc013bf3913d67f2c634068eb5b3aab144cb3
2021-04-15 08:57:04 -07:00
albanD
1d49fd31c4 [reland] Add formulas and basic tests (#56083)
Summary:
Reland of https://github.com/pytorch/pytorch/pull/49098
See original issue for details.

The only difference from the previous PR is the fix of the _embedding_bag_dense_backward formula to stop declaring a backward formula for an argument that does not exist.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56083

Reviewed By: samestep

Differential Revision: D27778221

Pulled By: albanD

fbshipit-source-id: 159ef91ca931ef2ccfbc3d1c46c7880c32919dc9
2021-04-15 07:52:43 -07:00
Sam Estep
817fd932ac Revert D25607505: Add formulas and basic tests
Test Plan: revert-hammer

Differential Revision:
D25607505 (70f5905565)

Original commit changeset: fe2315d58768

fbshipit-source-id: 519d7426a6f32f0db51c4f360e5d5a79dbaac99d
2021-04-14 14:50:43 -07:00
albanD
70f5905565 Add formulas and basic tests (#49098)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49098

RFC: https://github.com/pytorch/rfcs/pull/11

This PR adds:
- Codegen support to define forward grad formulas and a few manual formulas
- Codegen support to automatically generate formulas as well as a few usages
- Tests for basic forward grad components
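
As a small numeric illustration of what an automatically generated element-wise forward formula computes, consider the `abs` entry shown further down: its backward formula is `grad * self.sgn()`, and the generated forward formula has the same shape with the input's forward grad in place of `grad`. The sketch below uses plain real-valued Python floats and hypothetical function names; it is not the generated code.

```python
def sgn(x: float) -> float:
    # Sign of a real number: -1.0, 0.0, or 1.0.
    return float((x > 0) - (x < 0))


def abs_backward(grad: float, self_: float) -> float:
    # Backward formula from the yaml entry: grad * self.sgn()
    return grad * sgn(self_)


def abs_forward(self_fw_grad: float, self_primal: float) -> float:
    # Element-wise forward formula: same expression, applied to the forward grad.
    return self_fw_grad * sgn(self_primal)


primal, tangent = -3.0, 2.0
fw_grad = abs_forward(tangent, primal)

# Finite-difference check of the directional derivative of abs at `primal`.
eps = 1e-6
finite_diff = (abs(primal + eps * tangent) - abs(primal)) / eps
```

For an element-wise op the Jacobian is diagonal, which is why the backward expression can be reused for the forward direction.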

Codegen generated examples.
For each of them, the only part that is changed is the if statement before the return checking for fw grad defined.

- For manual entry:
```yaml
- name: max(Tensor self) -> Tensor
  self: evenly_distribute_backward(grad, self, result)
  result: max_forward(self_fw_grad, self, result)
```

```cpp
Tensor max(const Tensor & self) {
  auto& self_ = unpack(self, "self", 0);
  auto _any_requires_grad = compute_requires_grad( self );
  std::shared_ptr<MaxBackward1> grad_fn;
  if (_any_requires_grad) {
    grad_fn = std::shared_ptr<MaxBackward1>(new MaxBackward1(), deleteNode);
    grad_fn->set_next_edges(collect_next_edges( self ));
    grad_fn->self_ = SavedVariable(self, false);
  }
  #ifndef NDEBUG
  c10::optional<Storage> self__storage_saved =
    self_.has_storage() ? c10::optional<Storage>(self_.storage()) : c10::nullopt;
  c10::intrusive_ptr<TensorImpl> self__impl_saved;
  if (self_.defined()) self__impl_saved = self_.getIntrusivePtr();
  #endif
  auto tmp = ([&]() {
    at::AutoNonVariableTypeMode non_var_type_mode(true);
    return at::max(self_);
  })();
  auto result = std::move(tmp);
  #ifndef NDEBUG
  if (self__storage_saved.has_value())
    AT_ASSERT(self__storage_saved.value().is_alias_of(self_.storage()));
  if (self__impl_saved) AT_ASSERT(self__impl_saved == self_.getIntrusivePtr());
  #endif
  if (grad_fn) {
      set_history(flatten_tensor_args( result ), grad_fn);
  }
  throw_error_for_complex_autograd(result, "max");
  if (isFwGradDefined(self)) {
      auto self_fw_grad = toLegacyFwGrad(self);
      auto self_primal = toLegacyPrimal(self);
      auto result_new_fw_grad = max_forward(self_fw_grad, self_primal, result);
      if (result_new_fw_grad.defined()) {
        result.set_fw_grad(result_new_fw_grad, /* level */ 0, /* is_inplace_op */ false);
      }
  }
  if (grad_fn) {
    grad_fn->result_ = SavedVariable(result, true);
  }
  return result;
}
```

- For element wise entry:
```yaml
- name: abs(Tensor self) -> Tensor
  self: grad * self.sgn()
  result: auto_element_wise
```

```cpp
Tensor abs(const Tensor & self) {
  auto& self_ = unpack(self, "self", 0);
  auto _any_requires_grad = compute_requires_grad( self );
  std::shared_ptr<AbsBackward> grad_fn;
  if (_any_requires_grad) {
    grad_fn = std::shared_ptr<AbsBackward>(new AbsBackward(), deleteNode);
    grad_fn->set_next_edges(collect_next_edges( self ));
    grad_fn->self_ = SavedVariable(self, false);
  }
  #ifndef NDEBUG
  c10::optional<Storage> self__storage_saved =
    self_.has_storage() ? c10::optional<Storage>(self_.storage()) : c10::nullopt;
  c10::intrusive_ptr<TensorImpl> self__impl_saved;
  if (self_.defined()) self__impl_saved = self_.getIntrusivePtr();
  #endif
  auto tmp = ([&]() {
    at::AutoNonVariableTypeMode non_var_type_mode(true);
    return at::abs(self_);
  })();
  auto result = std::move(tmp);
  #ifndef NDEBUG
  if (self__storage_saved.has_value())
    AT_ASSERT(self__storage_saved.value().is_alias_of(self_.storage()));
  if (self__impl_saved) AT_ASSERT(self__impl_saved == self_.getIntrusivePtr());
  #endif
  if (grad_fn) {
      set_history(flatten_tensor_args( result ), grad_fn);
  }
  throw_error_for_complex_autograd(result, "abs");
  if (isFwGradDefined(self)) {
      auto self_fw_grad = toLegacyFwGrad(self);
      auto self_primal = toLegacyPrimal(self);
      auto result_new_fw_grad = self_fw_grad * self_primal.sgn();
      if (result_new_fw_grad.defined()) {
        result.set_fw_grad(result_new_fw_grad, /* level */ 0, /* is_inplace_op */ false);
      }
  }
  return result;
}
```
- For linear entry:
```yaml
- name: clone(Tensor self, *, MemoryFormat? memory_format=None) -> Tensor
  self: grad
  result: auto_linear
```

```cpp
Tensor clone(const Tensor & self, c10::optional<MemoryFormat> memory_format) {
  auto& self_ = unpack(self, "self", 0);
  auto _any_requires_grad = compute_requires_grad( self );
  std::shared_ptr<CloneBackward> grad_fn;
  if (_any_requires_grad) {
    grad_fn = std::shared_ptr<CloneBackward>(new CloneBackward(), deleteNode);
    grad_fn->set_next_edges(collect_next_edges( self ));
  }
  #ifndef NDEBUG
  c10::optional<Storage> self__storage_saved =
    self_.has_storage() ? c10::optional<Storage>(self_.storage()) : c10::nullopt;
  c10::intrusive_ptr<TensorImpl> self__impl_saved;
  if (self_.defined()) self__impl_saved = self_.getIntrusivePtr();
  #endif
  auto tmp = ([&]() {
    at::AutoNonVariableTypeMode non_var_type_mode(true);
    return at::clone(self_, memory_format);
  })();
  auto result = std::move(tmp);
  #ifndef NDEBUG
  if (self__storage_saved.has_value())
    AT_ASSERT(self__storage_saved.value().is_alias_of(self_.storage()));
  if (self__impl_saved) AT_ASSERT(self__impl_saved == self_.getIntrusivePtr());
  #endif
  if (grad_fn) {
      set_history(flatten_tensor_args( result ), grad_fn);
  }
  if (isFwGradDefined(self)) {
      auto self_fw_grad = toLegacyFwGrad(self);
      auto result_new_fw_grad = at::clone(self_fw_grad, memory_format);
      if (result_new_fw_grad.defined()) {
        result.set_fw_grad(result_new_fw_grad, /* level */ 0, /* is_inplace_op */ false);
      }
  }
  return result;
}
```

- For no entry:
```yaml
- name: angle(Tensor self) -> Tensor
  self: angle_backward(grad, self)
```

```cpp
Tensor angle(const Tensor & self) {
  auto& self_ = unpack(self, "self", 0);
  auto _any_requires_grad = compute_requires_grad( self );
  std::shared_ptr<AngleBackward> grad_fn;
  if (_any_requires_grad) {
    grad_fn = std::shared_ptr<AngleBackward>(new AngleBackward(), deleteNode);
    grad_fn->set_next_edges(collect_next_edges( self ));
    grad_fn->self_ = SavedVariable(self, false);
  }
  #ifndef NDEBUG
  c10::optional<Storage> self__storage_saved =
    self_.has_storage() ? c10::optional<Storage>(self_.storage()) : c10::nullopt;
  c10::intrusive_ptr<TensorImpl> self__impl_saved;
  if (self_.defined()) self__impl_saved = self_.getIntrusivePtr();
  #endif
  auto tmp = ([&]() {
    at::AutoNonVariableTypeMode non_var_type_mode(true);
    return at::angle(self_);
  })();
  auto result = std::move(tmp);
  #ifndef NDEBUG
  if (self__storage_saved.has_value())
    AT_ASSERT(self__storage_saved.value().is_alias_of(self_.storage()));
  if (self__impl_saved) AT_ASSERT(self__impl_saved == self_.getIntrusivePtr());
  #endif
  if (grad_fn) {
      set_history(flatten_tensor_args( result ), grad_fn);
  }
  throw_error_for_complex_autograd(result, "angle");
  TORCH_CHECK(!(isFwGradDefined(self)), "Trying to use forward prop with angle that does not support it.");
  return result;
}
```
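
The four cases above can be summarized with a small Python sketch. It is only an illustration of the mapping from the `result:` entry to the forward-grad expression seen in the generated C++; `forward_formula` and its parameters are hypothetical names and not the actual codegen API.

```python
import re


def forward_formula(op_name, backward_formula, result_entry, call_args):
    """Return the C++ expression for the forward grad, or None if unsupported.

    Hypothetical helper that mirrors the four examples above.
    """
    if result_entry is None:
        # No entry: the generated code raises through TORCH_CHECK instead.
        return None
    if result_entry == "auto_element_wise":
        # Reuse the backward formula: grad -> self_fw_grad, self -> self_primal.
        expr = re.sub(r"\bgrad\b", "self_fw_grad", backward_formula)
        return re.sub(r"\bself\b", "self_primal", expr)
    if result_entry == "auto_linear":
        # For a linear op, apply the op itself to the forward grad.
        args = ", ".join(["self_fw_grad"] + list(call_args))
        return f"at::{op_name}({args})"
    # Manual entry: use the given formula, reading `self` from the primal.
    return re.sub(r"\bself\b", "self_primal", result_entry)


print(forward_formula("abs", "grad * self.sgn()", "auto_element_wise", []))
print(forward_formula("clone", "grad", "auto_linear", ["memory_format"]))
```

The word-boundary patterns leave `self_fw_grad` untouched, because `_` counts as a word character.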

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D25607505

Pulled By: albanD

fbshipit-source-id: fe2315d587689af1cd5968536fa26c680b8b8829
2021-04-14 14:13:30 -07:00
Sam Estep
4753100a3b Un-ignore F403 in .flake8 (#55838)
Summary:
Generally wildcard imports are bad for the reasons described here: https://www.flake8rules.com/rules/F403.html

This PR replaces wildcard imports with an explicit list of imported items where possible, and adds a `# noqa: F403` comment in the other cases (mostly re-exports in `__init__.py` files).

This is a prerequisite for https://github.com/pytorch/pytorch/issues/55816, because currently [`tools/codegen/dest/register_dispatch_key.py` simply fails if you sort its imports](https://github.com/pytorch/pytorch/actions/runs/742505908).
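
As a rough illustration of what F403 flags, the sketch below lists the `from module import *` statements in a piece of source text using the standard `ast` module. The helper name `wildcard_imports` is hypothetical and this is not how flake8 itself is implemented.

```python
import ast


def wildcard_imports(source):
    """Return the modules imported with `from module import *` in `source`."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom) and any(a.name == "*" for a in node.names):
            found.append(node.module)
    return found


example = "from os.path import *\nfrom os import sep\nimport sys\n"
print(wildcard_imports(example))  # ['os.path']
```

Each module reported this way either gets an explicit import list or a `# noqa: F403` comment.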

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55838

Test Plan: CI. You can also run `flake8` locally.

Reviewed By: jbschlosser

Differential Revision: D27724232

Pulled By: samestep

fbshipit-source-id: 269fb09cb4168f8a51fd65bfaacc6cda7fb87c34
2021-04-13 09:24:07 -07:00
Wenlei Xie
70af5db7ca Remove use_c10_dispatcher option (#54969)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54969

With all use cases of the hacky wrapper removed, all kernels will be
dispatched with the c10 full dispatcher.
ghstack-source-id: 125434790

Test Plan: buck build //caffe2/aten/...

Reviewed By: ezyang, walterddr

Differential Revision: D27436596

fbshipit-source-id: 7a146d1f4a983b4a81f8552be4eec6c482b6bea2
2021-03-31 16:24:24 -07:00
Edward Yang
6e8c4ad7fd s/StructuredNativeFunctions/NativeFunctionsGroup/ (#54427)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54427

A StructuredNativeFunctions is no longer guaranteed to actually
be structured (test the `structured` property for that), so we rename
this to a more neutral name.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D27235380

Pulled By: ezyang

fbshipit-source-id: 2b438d615bf06a47fc9c7bf6eb66fd8b4df31bc8
2021-03-23 00:43:57 -07:00
Edward Yang
d226985257 Read out layout from options directly, rather than via backend (#54074)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54074

I don't see why this shouldn't work.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: ljk53

Differential Revision: D27086594

Pulled By: ezyang

fbshipit-source-id: 1d5f1997017ec48c4140f43e44f0d8a3df28ac7f
2021-03-22 08:20:13 -07:00
Wenlei Xie
2ecb2c7931 Pass Scalar by reference (#53583)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53583

`Scalar` takes 32 bytes because `c10::complex<double>`
requires 16-byte alignment. Passing `Scalar` by reference
shows about a 1% improvement in instruction count.
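
The effect of the alignment requirement on the size can be checked with a small layout calculation. This is a sketch under stated assumptions: it models `Scalar` as a 4-byte tag followed by a 16-byte payload, which is an assumption for illustration and not taken from this commit.

```python
def struct_size(fields):
    """Size of a C-style struct from (size, alignment) pairs in declaration order."""
    offset = 0
    max_align = 1
    for size, align in fields:
        # Pad up to the field's alignment, then place the field.
        offset = (offset + align - 1) // align * align
        offset += size
        max_align = max(max_align, align)
    # The total size is rounded up to the strictest alignment.
    return (offset + max_align - 1) // max_align * max_align


# Assumed layout: 4-byte tag, then a 16-byte payload.
print(struct_size([(4, 4), (16, 16)]))  # 32 with a 16-byte aligned payload
print(struct_size([(4, 4), (16, 8)]))   # 24 if the payload only needed 8-byte alignment
```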

All the changes in this commit are codemoded except for
the following 4 files (which code-gen signatures):
```
tools/codegen/api/cpp.py
tools/codegen/api/native.py
tools/codegen/api/structured.py
caffe2/contrib/aten/gen_op.py
```

# Codemod

## Main Step

For the codemod part, here is the main command used:
```
fastmod --extensions h '([a-zA-Z_+]\([^)]*,?\s*)Scalar (\w+)' '${1}const Scalar& ${2}'
fastmod --extensions h '([a-zA-Z_+]\([^)]*,?\s*)optional<Scalar> (\w+)' '${1}const optional<Scalar>& ${2}'
fastmod --extensions cpp '([a-zA-Z_+]\([^)]*,?\s*)Scalar (\w+)' '${1}const Scalar& ${2}'
fastmod --extensions cpp '([a-zA-Z_+]\([^)]*,?\s*)optional<Scalar> (\w+)' '${1}const optional<Scalar>& ${2}'
```

As you can tell, it codemods both `Scalar` and `optional<Scalar>`. Apply these commands iteratively until reaching a fixed point (since one method signature might contain multiple `Scalar` parameters).
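
To see why several passes are needed, here is a sketch that applies the same regular expression with Python's `re` module instead of `fastmod`. The helper `codemod_until_fixed_point` is a hypothetical name; each pass rewrites only one `Scalar` parameter per signature, because the whole parameter list up to that point is consumed by a single match.

```python
import re

# Same pattern as the fastmod command; "${1}" becomes r"\g<1>" in Python.
PATTERN = re.compile(r"([a-zA-Z_+]\([^)]*,?\s*)Scalar (\w+)")
REPLACEMENT = r"\g<1>const Scalar& \g<2>"


def codemod_until_fixed_point(text, max_passes=20):
    """Apply the substitution repeatedly until the text stops changing."""
    passes = 0
    while passes < max_passes:
        new_text = PATTERN.sub(REPLACEMENT, text)
        if new_text == text:
            break
        text = new_text
        passes += 1
    return text, passes


result, passes = codemod_until_fixed_point("void foo(Scalar a, Scalar b);")
print(result)  # void foo(const Scalar& a, const Scalar& b);
print(passes)  # 2
```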

In retrospect, excluding `third_party` and `torch/csrc/jit` would have been a good idea. (I reverted those changes manually later; see https://github.com/pytorch/pytorch/pull/53479 for reference.)

## Pre-Step

Prior to applying the main command, since some `Scalar` parameters are written as `at::Scalar` or `c10::Scalar`, I codemoded some of them in advance. Here is an incomplete list:
```
fastmod --extensions h '([a-zA-Z_+]\([^)]*,?\s*)at::Scalar (\w+)' '${1}const at::Scalar& ${2}'
fastmod --extensions cpp '([a-zA-Z_+]\([^)]*,?\s*)at::Scalar (\w+)' '${1}const at::Scalar& ${2}'
fastmod --extensions h '([a-zA-Z_+]\([^)]*,?\s*)c10::optional<Scalar> (\w+)' '${1}const c10::optional<Scalar>& ${2}'
fastmod --extensions cpp '([a-zA-Z_+]\([^)]*,?\s*)c10::optional<Scalar> (\w+)' '${1}const c10::optional<Scalar>& ${2}'
```

## Fixup
There are a couple of post-codemod fixups. For example, `const Scalar` will be codemoded into `const const Scalar&`, and `at::Scalar` will be codemoded into `at::const Scalar&` (if the pre-step is not done comprehensively). Here is an incomplete list:
```
fastmod --extensions cpp 'const const Scalar' 'const Scalar'
fastmod --extensions h 'const const c10::optional<Scalar>' 'const c10::optional<Scalar>'
fastmod --extensions cpp 'const const c10::optional<Scalar>' 'const c10::optional<Scalar>'
fastmod 'at::const Scalar&' 'const at::Scalar&'
```
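
The double `const` can be reproduced with a sketch that runs the main pattern and then the first fixup through Python's `re` module. The function name `codemod_with_fixup` is hypothetical; it returns both the intermediate and the final text.

```python
import re

MAIN = re.compile(r"([a-zA-Z_+]\([^)]*,?\s*)Scalar (\w+)")


def codemod_with_fixup(text):
    """Run the main substitution, then collapse the doubled `const`."""
    intermediate = MAIN.sub(r"\g<1>const Scalar& \g<2>", text)
    fixed = intermediate.replace("const const Scalar", "const Scalar")
    return intermediate, fixed


intermediate, fixed = codemod_with_fixup("void f(const Scalar a);")
print(intermediate)  # void f(const const Scalar& a);
print(fixed)         # void f(const Scalar& a);
```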

## Supplementary

`cu` and `mm` files also need to be codemoded, for example:

```
fastmod --extensions cu 'at::const Scalar&' 'const at::Scalar&'
fastmod --extensions mm '([a-zA-Z_+]\([^)]*,?\s*)Scalar (\w+)' '${1}const Scalar& ${2}'
```

Function pointers are not covered by the main command. Here is an incomplete list of commands for them:

```
# Cover case: using index_fill_fn = void(*)(TensorIterator & iter, int64_t dim, int64_t self_dim_size, int64_t self_dim_stride, Scalar source);
fastmod --extensions h '(void\s*\(\s*\*\s*\)\([^)]*,?\s*)Scalar (\w+)' '${1}const Scalar& ${2}'

# Cover case: using softplus_fn = void (*)(TensorIterator&, Scalar, Scalar);
fastmod --extensions h '(void\s*\(\s*\*\s*\)\([^)]*,?\s*)Scalar([, \)])' '${1}const Scalar&${2}'
fastmod --extensions cpp '(void\s*\(\s*\*\s*\)\([^)]*,?\s*)Scalar([, \)])' '${1}const Scalar&${2}'
fastmod --extensions h '(void\s*\(\s*\*\s*\)\([^)]*,?\s*)optional<Scalar>([, \)])' '${1}const optional<Scalar>&${2}'
```

Some corner cases need to be fixed manually.

ghstack-source-id: 123970306

Test Plan: Imported from OSS

Reviewed By: smessmer

Differential Revision: D26904445

fbshipit-source-id: 8d8a002af4b5125f153a32f03c6956be7ae5671d
2021-03-15 23:17:06 -07:00
Ailing Zhang
aeb3e93351 Move view handling logic to gen_inplace_or_view_type.py (#53341)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53341

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D26973912

Pulled By: ailzhang

fbshipit-source-id: ea31bdef0beac6996d509f5d45ebefa3ea8e2b89
2021-03-11 21:25:15 -08:00
Ailing Zhang
9f75de278f Move common autograd utils functions from gen_variable_type.py to api/autograd.py. (#53340)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53340

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D26973914

Pulled By: ailzhang

fbshipit-source-id: 8367a08b27b25808782c77aadc3c67d07c354957
2021-03-11 19:58:45 -08:00
Edward Yang
37bf6c134b Register DefaultBackend implementations for functional/inplace structured operators (#53037)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53037

As remarked in #52277 it is easy to give an (inefficient, due to extra
redispatches) DefaultBackend implementation of foo and foo_ in terms of
foo_out.  This patch enables code generation for DefaultBackend in these
cases by default for all structured kernels.  You can see the payoff
in the MSNPU extension: it only has to register a kernel for add.out, and it
gets add and add_ kernels automatically.

The actual code changes are very modest:
- When DefaultBackend, call the dispatched (not direct native::)
  functions to allocate tensors, change device guard, etc
- Don't call impl() for DefaultBackend (as it doesn't exist); instead,
  directly generate a call to at::foo_out to do the actual work.
- Do NOT generate DefaultBackend implementation for foo_out.  Actually,
  there is a case to be made for this being a good idea with more infra;
  see comments inside.
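
The idea of deriving foo and foo_ from foo_out can be sketched in plain Python, with lists standing in for tensors. All names here (`add_out`, `functional_from_out`, `inplace_from_out`) are hypothetical and only model the pattern; the real generated code is C++ and goes through the dispatcher.

```python
def add_out(a, b, out):
    """Out variant: write a + b into the preallocated `out` and return it."""
    for i, (x, y) in enumerate(zip(a, b)):
        out[i] = x + y
    return out


def functional_from_out(out_fn):
    """Derive the functional variant: allocate a fresh output, then call out_fn."""
    def functional(a, b):
        out = [0] * len(a)
        return out_fn(a, b, out)
    return functional


def inplace_from_out(out_fn):
    """Derive the in-place variant: reuse the first argument as the output."""
    def inplace(a, b):
        return out_fn(a, b, a)
    return inplace


add = functional_from_out(add_out)
add_ = inplace_from_out(add_out)

x = [1, 2]
print(add(x, [3, 4]))   # [4, 6], x is unchanged
print(add_(x, [3, 4]))  # [4, 6], x itself is updated
```

A backend that provides only the out variant gets the other two for free, at the cost of the extra call.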

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D26731225

Pulled By: ezyang

fbshipit-source-id: 939da7cb69f694722ec293e5e42e74a755dd0985
2021-03-02 14:13:08 -08:00
Edward Yang
c5a67f1675 Fix minor inaccuracy in translate error reporting (#53032)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53032

Previously, you could get this error message:

```
Failed to synthesize the expression "Tensor & out".
When I failed, the following bindings were available in the context:

  const Tensor & self;
  const Tensor & other;
  Scalar alpha;
  const Tensor & op.outputs_[0];
```

There's a problem with this error message: it doesn't seem like there
is any 'out' argument available, but actually there is: the last
binding in the context is it.  We printed the *expression*, not
the *ctype name*.

After this patch, the context now prints as:

```
  const Tensor & self; // self
  const Tensor & other; // other
  Scalar alpha; // alpha
  const Tensor & out; // op.outputs_[0]
```

Now it becomes clear that it's a const mismatch.  Maybe we could also
beef up the error message so it points out near misses, but I'll leave
that to future work.
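
A minimal sketch of the new output format, assuming each binding carries a C++ type, a name, and the expression that produces it. The `Binding` class and `format_context` function are illustrative names, not the actual classes in the codegen.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Binding:
    ctype: str  # C++ type, e.g. "const Tensor &"
    name: str   # name the binding provides, e.g. "out"
    expr: str   # C++ expression that produces it, e.g. "op.outputs_[0]"


def format_context(bindings):
    """Print the type and name first, with the expression as a trailing comment."""
    return "\n".join(f"  {b.ctype} {b.name}; // {b.expr}" for b in bindings)


context = [
    Binding("const Tensor &", "self", "self"),
    Binding("const Tensor &", "other", "other"),
    Binding("Scalar", "alpha", "alpha"),
    Binding("const Tensor &", "out", "op.outputs_[0]"),
]
print(format_context(context))
```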

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: ljk53

Differential Revision: D26729768

Pulled By: ezyang

fbshipit-source-id: adb363551a7145eac788943c20969c86b1f8a81b
2021-03-02 14:11:28 -08:00
Brian Hirsh
d02a2bd5d1 codegen'd API for redispatching (#52008)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52008

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D26356079

Pulled By: bdhirsh

fbshipit-source-id: 1fd34fbb4dbc48cc8390cad99e30e0d04fc75a4f
2021-02-22 10:55:38 -08:00
Jiakai Liu
c9c4b871a5 [pytorch] reintroduce static dispatch (#51957)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51957

This is a simplified version of #51554.

Compared to #51554, this version only supports statically dispatching to
a specific backend. The benefit is that it skips the dispatch key
computation logic and thus has less framework overhead. The downside is
that if the input tensors do not match the specified backend, it throws
an error instead of falling back to regular dispatch.

Sample code:
```
Tensor empty(IntArrayRef size, TensorOptions options, c10::optional<MemoryFormat> memory_format) {
    return at::cpu::empty(size, options, memory_format);
}

// aten::conj(Tensor(a) self) -> Tensor(a)
Tensor conj(const Tensor & self) {
    return at::math::conj(self);
}

// aten::conj.out(Tensor self, *, Tensor(a!) out) -> Tensor(a!)
Tensor & conj_out(Tensor & out, const Tensor & self) {
    return at::cpu::conj_out(out, self);
}

// aten::conj.out(Tensor self, *, Tensor(a!) out) -> Tensor(a!)
Tensor & conj_outf(const Tensor & self, Tensor & out) {
    return at::cpu::conj_out(out, self);
}

// aten::_conj(Tensor self) -> Tensor
Tensor _conj(const Tensor & self) {
    return at::defaultbackend::_conj(self);
}
```

For ops without the specific backend dispatch, it will throw an error:
```
// aten::_use_cudnn_ctc_loss(Tensor log_probs, Tensor targets, int[] input_lengths, int[] target_lengths, int blank) -> bool
bool _use_cudnn_ctc_loss(const Tensor & log_probs, const Tensor & targets, IntArrayRef input_lengths, IntArrayRef target_lengths, int64_t blank) {
    TORCH_CHECK(false, "Static dispatch does not support _use_cudnn_ctc_loss for CPU.");
}
```
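
The two shapes of wrapper shown above could be emitted by a generator along these lines. This is a sketch only: `gen_static_dispatch` and its parameters are hypothetical, and the real codegen derives the signature and the backend namespace from `native_functions.yaml` rather than taking them as strings.

```python
def gen_static_dispatch(name, ret, params, args, backend_ns, backend="CPU"):
    """Emit a C++ wrapper that calls one backend directly, or fails loudly.

    backend_ns is the namespace holding the kernel (e.g. "cpu", "math"),
    or None when the op has no kernel for the chosen backend.
    """
    if backend_ns is None:
        body = f'TORCH_CHECK(false, "Static dispatch does not support {name} for {backend}.");'
    else:
        body = f"return at::{backend_ns}::{name}({', '.join(args)});"
    return f"{ret} {name}({params}) {{\n    {body}\n}}"


print(gen_static_dispatch("conj", "Tensor", "const Tensor & self", ["self"], "math"))
print(gen_static_dispatch("_conj", "Tensor", "const Tensor & self", ["self"], None))
```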

Differential Revision: D26337857

Test Plan: Imported from OSS

Reviewed By: bhosmer

Pulled By: ljk53

fbshipit-source-id: a8e95799115c349de3c09f04a26b01d21a679364
2021-02-19 11:41:39 -08:00
Edward Yang
4d85e30133 Support at::cpu on non-structured kernels (#51590)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51590

This PR backports a subset of Jiakai's changes from
https://github.com/pytorch/pytorch/pull/51554 that add support
for at::cpu in non-structured kernels.

The unusual bits:

- Need to add a new forward inference rule for doing conversions
  of const optional<Tensor>& to const Tensor&
- Need to give the wrapper functions a prefix so that the call to
  wrapper is not ambiguous

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: ljk53

Differential Revision: D26209871

Pulled By: ezyang

fbshipit-source-id: 8162686039675ab92a2af7a14f6b18941f8944df
2021-02-04 09:19:45 -08:00
Edward Yang
81c7c3bae5 Add api.structured; switch structured kernels to use const Tensor& everywhere (#51490)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51490

Mutable Tensor ref is a source of endless confusion for kernel writers;
if we're going to make everyone rewrite their kernels, might as well
also get rid of mutable Tensor& while we're at it.

This is a refactor-then-small-update double whammy.  The refactor
is to separate tools.codegen.api.structured from api.native for
describing the type signatures of structured kernels (previously,
I was naughtily reusing native for this purpose--now I need it to
behave differently as Tensor).  This started off as a copy paste, but
since there are not that many structured kernels so far I could delete
all of the legacy logic from native that didn't make sense (without
having to go out and fix all the use sites all at once).

One more small addition was teaching translate to convert Tensor& to const Tensor&.
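
The conversion rule can be pictured with a toy lookup. This is a sketch under assumptions: the bindings are modeled as a dictionary keyed by (type, name), and `translate` here is a simplified stand-in for the real function of the same name.

```python
def translate(bindings, goal_ctype, goal_name):
    """Return an expression of type goal_ctype for goal_name.

    bindings maps (ctype, name) to the C++ expression that provides it.
    """
    if (goal_ctype, goal_name) in bindings:
        return bindings[(goal_ctype, goal_name)]
    # Extra rule: a mutable reference can be used where a const one is wanted.
    if goal_ctype == "const Tensor &" and ("Tensor &", goal_name) in bindings:
        return bindings[("Tensor &", goal_name)]
    raise LookupError(f'Failed to synthesize the expression "{goal_ctype} {goal_name}".')


context = {("Tensor &", "out"): "out"}
print(translate(context, "const Tensor &", "out"))  # out
```

The reverse direction is deliberately absent: a `const Tensor &` binding cannot satisfy a request for `Tensor &`.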

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D26182413

Pulled By: ezyang

fbshipit-source-id: ed636866add3581179669cf9283f9835fcaddc06
2021-02-03 14:03:46 -08:00