Commit Graph

225 Commits

Author SHA1 Message Date
rzou
ea141d8134 functional compiled autograd (#144707)
This PR squashes together the following commits:

https://github.com/pytorch/pytorch/pull/144115
https://github.com/pytorch/pytorch/pull/143417
https://github.com/pytorch/pytorch/pull/143405
https://github.com/pytorch/pytorch/pull/143387
https://github.com/pytorch/pytorch/pull/143304
https://github.com/pytorch/pytorch/pull/143296

This is a refactor of compiled autograd to use "functional autograd". The end goal is to get compiled autograd's initial capture to stop specializing on Tensor metadata, thereby allowing compiled autograd to better handle Tensor subclasses.

For more information, please read the commit messages for each PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144707
Approved by: https://github.com/bdhirsh, https://github.com/xmfan, https://github.com/jansel
2025-01-27 05:20:56 +00:00
Nikita Shulga
82e353fffc [BE] Use nested namespaces in autograd/templates (#110618)
As PyTorch can now use C++17 language features
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110618
Approved by: https://github.com/soulitzer
2023-10-05 22:05:57 +00:00
Jason Ansel
c902b84e0b Compiled autograd (#103822)
This branch:
1) converts the autograd tape into an FX graph
2) caches that conversion using a "shadow" graph
3) compiles and runs the generated FX graph instead of the normal autograd

What works currently:
1) Caching, capture, and initial integration
2) Backwards hooks
3) Inlining AotAutograd generated subgraphs
4) torch.compiling the generated FX graph
5) Auto-detecting dynamic shapes based on changes

Future work
1) Larger scale testing
2) Boxed calling convention, so memory can be freed incrementally
3) Support hooks on SavedTensor
4) Additional testing by running eager autograd tests under compiled_autograd.enable() (see the sketch below)
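
Item 4 above references `compiled_autograd.enable()`; here is a minimal usage sketch of that entry point, assuming the `torch._dynamo.compiled_autograd.enable()` context manager and compiler-function argument exposed by later builds (module path and signature are assumptions, not part of this commit):

```python
# Sketch only: assumes torch._dynamo.compiled_autograd.enable() exists in this
# form; the exact entry point and signature may differ between releases.
import torch

model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.ReLU())
x = torch.randn(4, 8)

def compiler_fn(gm):
    # Compile the FX graph that compiled autograd captured from the autograd tape.
    return torch.compile(gm, backend="eager")

with torch._dynamo.compiled_autograd.enable(compiler_fn):
    loss = model(x).sum()
    loss.backward()  # backward runs through the captured/compiled FX graph
```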

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103822
Approved by: https://github.com/ezyang, https://github.com/albanD
2023-07-24 21:12:05 +00:00
albanD
73f009a2aa refactor manual function definitions (#43711)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43711

this makes them available in forward if needed

No change to the file content, just a copy-paste.

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D23454146

Pulled By: albanD

fbshipit-source-id: 6269a4aaf02ed53870fadf8b769ac960e49af195
2020-09-02 09:23:21 -07:00
Lucas Roberts
2bede78a05 add qr_backward functionality for wide case (#42216)
Summary:
Unblocks implementation of https://github.com/pytorch/pytorch/issues/27036. Note that this PR ***does not*** fix #27036.
Currently QR decomposition only supports the square and tall (a.k.a. skinny) cases.
This PR adds functionality for wide A matrices/tensors, includes 3 unit tests for the new case,
and restructures the `qr_backward` method to use the same Walther method as a helper.

cc albanD t-vi

I don't have a GPU machine, so I haven't tested on CUDA, but everything passes on my local machine on CPU.

The basic idea of the PR is noted in the comments in the `Functions.cpp` file, but I'll note it here too for clarity:

let $A_{m,n}$ be a matrix with $m < n$; then partition $A_{m,n}$ as $A_{m,n} = [\,X_{m,m} \mid Y_{m,n-m}\,]$
and take the QR of $X$, i.e. $X = QU$. The $Q$ here from $X$ is the same as the $Q$ from the QR of the entire $A$ matrix. Then transform $Y$ with the $Q$ rotation obtained from $X$ to get $V = Q^{T}Y$; now $R = [\,U \mid V\,]$, and similarly for the grads of each piece, e.g. if $\bar{A}$ is `grad_A` then
$\bar{A} = [\,\bar{X} \mid \bar{Y}\,]$ and $\bar{R} = [\,\bar{U} \mid \bar{V}\,]$, and then
$\bar{Y} = Q\bar{V}$, where $\bar{V}$ is the `narrow()` of `grad_R`.
$\bar{X}$ is calculated very similarly to the original Walther formula (exactly the same in the tall and square cases) but is slightly modified here for wide-case matrices.
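
As a sanity check, a gradcheck-style sketch of the wide case added here; it uses `torch.linalg.qr` (the current name for the op) in place of the `torch.qr` of this era, so treat the exact entry point as an assumption:

```python
# Sketch: verify qr_backward on a wide (m < n) matrix via gradcheck.
import torch

torch.manual_seed(0)
A = torch.randn(3, 5, dtype=torch.double, requires_grad=True)  # wide: m=3 < n=5

def qr_fn(a):
    return torch.linalg.qr(a, mode="reduced")  # returns (Q, R)

# Should report True once the wide case is supported, as added in this PR.
print(torch.autograd.gradcheck(qr_fn, (A,)))
```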

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42216

Reviewed By: glaringlee

Differential Revision: D23373118

Pulled By: albanD

fbshipit-source-id: 3702ba7e7e23923868c02cdb7e10a96036052344
2020-08-31 11:46:45 -07:00
Xiang Gao
a860be898e [resubmit] Add amax/amin (#43819)
Summary:
Resubmit for landing next week.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43819

Reviewed By: ngimel

Differential Revision: D23421906

Pulled By: mruberry

fbshipit-source-id: 23dd60d1e365bb1197d660c3bfad7ee07ba3e97f
2020-08-31 04:54:48 -07:00
Nikita Shulga
3f0120edb4 Revert D23360705: [pytorch][PR] Add amax/amin
Test Plan: revert-hammer

Differential Revision:
D23360705 (bcec8cc3f9)

Original commit changeset: 5bdeb08a2465

fbshipit-source-id: 76a9e199823c7585e55328bad0778bcd8cd49381
2020-08-28 18:01:25 -07:00
Gao, Xiang
bcec8cc3f9 Add amax/amin (#43092)
Summary:
Add max/min operators that only return values.

## Some important decisions to discuss
| **Question**                          | **Current State** |
|---------------------------------------|-------------------|
| Expose torch.max_values to python?    | No                |
| Remove max_values and only keep amax? | Yes               |
| Should amax support named tensors?    | Not in this PR    |

## Numpy compatibility

Reference: https://numpy.org/doc/stable/reference/generated/numpy.amax.html

| Parameter                                                                                                                                                                                                                                              | PyTorch Behavior                                                                  |
|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------|
| `axis`:  None or int or tuple of ints, optional. Axis or axes along which to operate. By default, flattened input is used. If this is a tuple of ints, the maximum is selected over multiple axes, instead of a single axis or all the axes as before. | Named `dim`, behavior same as `torch.sum` (https://github.com/pytorch/pytorch/issues/29137)                                |
| `out`: ndarray, optional. Alternative output array in which to place the result. Must be of the same shape and buffer length as the expected output.                                                                                                   | Same                                                                              |
| `keepdims`: bool, optional. If this is set to True, the axes which are reduced are left in the result as dimensions with size one. With this option, the result will broadcast correctly against the input array.                                      | implemented as `keepdim`                                                          |
| `initial`: scalar, optional. The minimum value of an output element. Must be present to allow computation on empty slice.                                                                                                                              | Not implemented in this PR. Better to implement for all reductions in the future. |
| `where`: array_like of bool, optional. Elements to compare for the maximum.                                                                                                                                                                            | Not implemented in this PR. Better to implement for all reductions in the future. |

**Note from numpy:**
> NaN values are propagated, that is if at least one item is NaN, the corresponding max value will be NaN as well. To ignore NaN values (MATLAB behavior), please use nanmax.

PyTorch has the same behavior
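
A short usage sketch of the resulting semantics described above (`dim` as a tuple, `keepdim`, and NaN propagation); `torch.amax`/`torch.amin` as they exist in current builds, so treat availability as an assumption:

```python
# Sketch of amax/amin semantics: reduce over a tuple of dims, keepdim,
# and NaN propagation (a NaN anywhere in the reduced slice yields NaN).
import torch

x = torch.arange(24, dtype=torch.float).reshape(2, 3, 4)
print(torch.amax(x, dim=(0, 2)))                 # max over dims 0 and 2 -> shape (3,)
print(torch.amin(x, dim=1, keepdim=True).shape)  # keepdim leaves a size-1 dim -> (2, 1, 4)

y = torch.tensor([1.0, float("nan"), 3.0])
print(torch.amax(y, dim=0))                      # nan, since NaN values are propagated
```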

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43092

Reviewed By: ngimel

Differential Revision: D23360705

Pulled By: mruberry

fbshipit-source-id: 5bdeb08a2465836764a5a6fc1a6cc370ae1ec09d
2020-08-28 12:51:03 -07:00
Xiang Gao
348e78b086 Evenly distribute output grad into all matching inputs for min/max/median (#43519)
Summary:
cc: ngimel mruberry
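
A tiny sketch of the behavior change described in the title, as I read it: when several inputs tie for the max, the incoming gradient is split evenly among them rather than routed to a single element (expected values are an assumption; verify against the PR's tests):

```python
# Sketch: with two tied maxima, each element should receive half of the
# output gradient after this change.
import torch

x = torch.tensor([1.0, 3.0, 3.0], requires_grad=True)
x.max().backward()
print(x.grad)  # expected: tensor([0.0000, 0.5000, 0.5000])
```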

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43519

Reviewed By: albanD

Differential Revision: D23312235

Pulled By: ngimel

fbshipit-source-id: 678bda54996df7f29acf96add928bb7042fc2069
2020-08-25 16:36:33 -07:00
Xiaomeng Yang
4ae832e106 Optimize SiLU (Swish) op in PyTorch (#42976)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42976

Optimize SiLU (Swish) op in PyTorch.

Some benchmark result

input = torch.rand(1024, 32768, dtype=torch.float, device="cpu")
forward: 221ms -> 133ms
backward: 600ms -> 170ms

input = torch.rand(1024, 32768, dtype=torch.double, device="cpu")
forward: 479ms -> 297ms
backward: 1438ms -> 387ms

input = torch.rand(8192, 32768, dtype=torch.float, device="cuda")
forward: 24.34ms -> 9.83ms
backward: 97.05ms -> 29.03ms

input = torch.rand(4096, 32768, dtype=torch.double, device="cuda")
forward: 44.24ms -> 30.15ms
backward: 126.21ms -> 49.68ms
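
For context, a rough sketch of the kind of timing loop such numbers come from, assuming `torch.nn.functional.silu` as the Python entry point for the op (the original harness is not part of this commit message):

```python
# Rough timing sketch (not the original harness): time SiLU forward and
# backward on one of the CPU shapes listed above.
import time
import torch
import torch.nn.functional as F

x = torch.rand(1024, 32768, dtype=torch.float, device="cpu", requires_grad=True)

start = time.perf_counter()
y = F.silu(x)
print(f"forward:  {(time.perf_counter() - start) * 1e3:.1f} ms")

start = time.perf_counter()
y.sum().backward()
print(f"backward: {(time.perf_counter() - start) * 1e3:.1f} ms")
```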

Test Plan: buck test mode/dev-nosan //caffe2/test:nn -- "SiLU"

Reviewed By: houseroad

Differential Revision: D23093593

fbshipit-source-id: 1ba7b95d5926c4527216ed211a5ff1cefa3d3bfd
2020-08-16 13:21:57 -07:00
kshitij12345
ab0a04dc9c Add torch.nansum (#38628)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/38349

Pull Request resolved: https://github.com/pytorch/pytorch/pull/38628

Reviewed By: VitalyFedyunin

Differential Revision: D22860549

Pulled By: mruberry

fbshipit-source-id: 87fcbfd096d83fc14b3b5622f2301073729ce710
2020-08-11 22:26:04 -07:00
Sebastian Messmer
1542c41a67 Change C++ frontend to take optional<Tensor> arguments (#41947)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41947

Previously, if an op took an optional `Tensor?` argument, the C++ frontend (i.e. `at::op()` and `Tensor::op()`)
were generated to take `Tensor`. A previous PR (https://github.com/pytorch/pytorch/pull/41610) changed the kernels
to be written with `c10::optional<Tensor>` instead of `Tensor`, but that did not touch the C++ frontend yet.

This PR changes the C++ frontend API to take `c10::optional<Tensor>` instead of `Tensor` as well.
This should be mostly BC-preserving. Since `Tensor` implicitly converts to `c10::optional<Tensor>`, any old code
calling an op with a `Tensor` would still work. There are likely corner cases that get broken though.
For example, C++ only ever does *one* implicit conversion. So if you call an op with a non-tensor object
that gets implicitly converted to a `Tensor`, then that previously worked since the API took a `Tensor` and
C++ allows one implicit conversion. Now it wouldn't work anymore because it would require two implicit conversions
(to `Tensor` and then to `c10::optional<Tensor>`) and C++ doesn't do that.

The main reasons for doing this are
- Make the C++ API more sane. Those arguments are optional and that should be visible from the signature.
- Allow easier integration for XLA and Autocast. Those backends generate code to wrap operators and forward
  operator arguments to calls to at::op(). After https://github.com/pytorch/pytorch/pull/41610, there was
  a mismatch because they had to implement operators with `optional<Tensor>` but call `at::op()` with `Tensor`,
  so they had to manually convert between those. After this PR, they can just forward the `optional<Tensor>`
  in their call to `at::op()`.
ghstack-source-id: 108873705

Test Plan: unit tests

Reviewed By: bhosmer

Differential Revision: D22704832

fbshipit-source-id: f4c00d457b178fbc124be9e884a538a3653aae1f
2020-07-31 16:11:55 -07:00
Vinnam Kim
825a387ea2 Fix bug on the backpropagation of LayerNorm when create_graph=True (#41595)
Summary:
Solves https://github.com/pytorch/pytorch/issues/41332.

I found that the bug in https://github.com/pytorch/pytorch/issues/41332 is caused by LayerNorm.

Current implementations of LayerNorm have a disparity between
1. [`create_graph=False` CUDA implementation](dde3d5f4a8/aten/src/ATen/native/cuda/layer_norm_kernel.cu (L145))
2. [`create_graph=True` implementation](dde3d5f4a8/tools/autograd/templates/Functions.cpp (L2536))

With this bug-fix, https://github.com/pytorch/pytorch/issues/41332 is solved.
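
A small repro-style sketch of the `create_graph=True` path this fix targets (shapes and the use of `F.layer_norm` are illustrative assumptions, not taken from the linked issue):

```python
# Sketch: double backward through LayerNorm, i.e. the create_graph=True path
# implemented in the autograd templates touched by this fix.
import torch
import torch.nn.functional as F

x = torch.randn(4, 16, dtype=torch.double, requires_grad=True)
w = torch.randn(16, dtype=torch.double, requires_grad=True)
b = torch.randn(16, dtype=torch.double, requires_grad=True)

y = F.layer_norm(x, (16,), weight=w, bias=b)
(gx,) = torch.autograd.grad(y.sum(), x, create_graph=True)  # differentiable grad
gx.norm().backward()                                        # second-order backward
print(x.grad.shape)  # torch.Size([4, 16])
```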

Ailing BIT-silence

Signed-off-by: Vinnam Kim <vinnamkim@gmail.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41595

Reviewed By: houseroad

Differential Revision: D22598415

Pulled By: BIT-silence

fbshipit-source-id: 63e390724bd935dc8e028b4dfb75d34a80558c3a
2020-07-22 00:19:12 -07:00
Xiaomeng Yang
80d5b3785b Add torch.logit function (#41062)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41062

Add torch.logit function

Test Plan: buck test mode/dev-nosan //caffe2/test:torch -- "logit"

Reviewed By: hl475

Differential Revision: D22406912

fbshipit-source-id: b303374f4c68850eb7477eb0645546a24b844606
2020-07-13 19:33:20 -07:00
Kurt Mohler
bba30d1bd8 Add undefined tensor gradient support to all backward functions (#39400)
Summary:
Adds the ability for all backward functions to accept undefined output gradient arguments. An undefined gradient is a Tensor that was created by the argumentless constructor `at::Tensor()`, where `tensor.defined() == false`.

Also adds new autograd nodes, UndefinedGrad and UndefinedGradBackward, that can be used from within Python code to inject undefined gradients into a backward function. A new test case is added to the backward function unit tests to use the UndefinedGrad node to ensure that undefined gradients do not break any backward functions.

Closes https://github.com/pytorch/pytorch/issues/33138
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39400

Differential Revision: D21936588

Pulled By: albanD

fbshipit-source-id: eccc5f55c77babe6dadcea4249d0c68a3c64e85d
2020-06-08 14:13:53 -07:00
Xiaomeng Yang
03eca384fd Optimize GroupNorm on CPU (#28203)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28203

Optimize GroupNorm on CPU
ghstack-source-id: 105149765

Test Plan: buck test mode/dev-nosan caffe2/test:nn -- "GroupNorm"

Reviewed By: houseroad

Differential Revision: D17901506

fbshipit-source-id: 5eb22ad0e8a9ab2533282b967b2818f690e48865
2020-06-03 23:52:16 -07:00
Aayush Naik
0829cadca3 Implement rad2deg, deg2rad (#38852)
Summary:
Resolves https://github.com/pytorch/pytorch/issues/38372.

cc mruberry
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38852

Differential Revision: D21868935

Pulled By: mruberry

fbshipit-source-id: ae6ded11b743c9d1cdc032984b4abe0a115290d6
2020-06-03 22:21:54 -07:00
kshitij12345
3487744821 Add torch.logcumsumexp (#36308)
Summary:
Creating a new PR as I am unable to push to pandeykartikey's branch since I don't have the permissions.

Closes https://github.com/pytorch/pytorch/issues/26411

Based on https://github.com/pytorch/pytorch/issues/32876. Thanks pandeykartikey for starting this out.

Have addressed the comments.
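
A quick usage sketch of the new op against the naive composition it replaces (numerics aside):

```python
# Sketch: logcumsumexp(x, dim) is a numerically stable log(cumsum(exp(x), dim)).
import torch

x = torch.randn(5)
stable = torch.logcumsumexp(x, dim=0)
naive = torch.log(torch.cumsum(torch.exp(x), dim=0))
print(torch.allclose(stable, naive))  # True for well-scaled inputs
```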

anjali411 agadetsky albanD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36308

Differential Revision: D21648573

Pulled By: albanD

fbshipit-source-id: bc1a8fc4ab474a1148298117a1549b0e46f7c3ff
2020-05-21 09:12:31 -07:00
Nik Ved
a7c29dbfa2 unfold_backward gets its own kernel (#36612)
Summary:
`unfold_backward` uses `index_add`, which causes a regression on CUDA because of the underlying `atomicAdd` and a regression on CPU because of limited parallelization. This PR attempts to replace `index_add` with a custom kernel.

Fixes [https://github.com/pytorch/pytorch/issues/17501](https://github.com/pytorch/pytorch/issues/17501).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36612

Differential Revision: D21450349

Pulled By: albanD

fbshipit-source-id: 09ec1fbd5d7290656700eca8e7fb7cf52323ec28
2020-05-08 13:18:36 -07:00
Nik Ved
8434247653 modify select_equals_backward to propagate only to a single value (#36316)
Summary:
Renames `select_equals_backward` to `select_first_equal_backward` and makes sure it propagates to a single value.
Fixes [https://github.com/pytorch/pytorch/issues/35699](https://github.com/pytorch/pytorch/issues/35699).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36316

Differential Revision: D21403848

Pulled By: albanD

fbshipit-source-id: b260cd79289162ee5733887d2afe8203945baee6
2020-05-06 10:50:24 -07:00
Emilio Castillo
25ba802ce4 Fix cdist backward calculation for p=2 (#37337)
Summary:
Closes https://github.com/pytorch/pytorch/issues/37154

Fixes a bug in `cdist` backward with `p=2`.
Under some circumstances, if the output has 0s, the gradient calculation of `sqrt` will be undefined, leading to NaNs in the input gradients.

This PR defines a subgradient for this case.

A test is also added to verify this behavior. I was only able to reproduce it under certain shapes, so the shape is taken explicitly from the example in https://github.com/pytorch/pytorch/issues/37154.
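
A sketch of the failure mode: duplicate rows make some pairwise distances exactly zero, where the derivative of `sqrt` blows up unless a subgradient is defined (the shape below is illustrative, not the one from the issue):

```python
# Sketch: identical rows produce exact-zero entries in cdist(a, b, p=2); with
# the subgradient fix, backward should yield finite (non-NaN) input gradients.
import torch

a = torch.randn(4, 3, requires_grad=True)
b = a.detach().clone()            # guarantees some exact-zero distances
d = torch.cdist(a, b, p=2)
d.sum().backward()
print(torch.isnan(a.grad).any())  # expected: tensor(False) after this fix
```
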
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37337

Differential Revision: D21403178

Pulled By: albanD

fbshipit-source-id: deef9678c1958524b552504920f19617f9ad1da6
2020-05-05 14:13:37 -07:00
Nikita Shulga
a5af478f29 Use full include path in autogenerated Functions.cpp (#35924)
Summary:
Preliminary step to merge https://github.com/pytorch/pytorch/pull/35220
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35924

Test Plan: CI

Differential Revision: D20832159

Pulled By: malfet

fbshipit-source-id: 29ff2e3c04c08c39c49f35414f94b76f0651859a
2020-04-02 22:46:09 -07:00
Nik Ved
35cdb78522 Make kl_div accept target in log space (#34586)
Summary:
Fixes [32520](https://github.com/pytorch/pytorch/issues/32520), implements [34536](https://github.com/pytorch/pytorch/issues/34536).

Here are some benchmarks:
```python
import torch
import torch.nn.functional as F
from IPython import get_ipython

ipython = get_ipython()

torch.set_num_threads(1)

for d in [5, 10, 20, 50, 100, 1000]:
    i = torch.rand(d, d)
    t = torch.rand(d, d)
    print(f"Size: {d}x{d}")
    ipython.magic("timeit F.kl_div(i, t, reduction='none', log_target=False)")
    ipython.magic("timeit F.kl_div(i, t.log(), reduction='none', log_target=True)")
```
Output:
```
Size: 5x5
16 µs ± 33 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
8.24 µs ± 17.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Size: 10x10
16.7 µs ± 17.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
8.7 µs ± 20.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Size: 20x20
17.7 µs ± 47.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
9.7 µs ± 28.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Size: 50x50
23.6 µs ± 60.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
15 µs ± 33.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Size: 100x100
42.8 µs ± 223 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
34 µs ± 17.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Size: 1000x1000
3.9 ms ± 1.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.45 ms ± 364 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34586

Differential Revision: D20652726

Pulled By: ezyang

fbshipit-source-id: 480697b4cd01341bbeee7514a8b812705a0600ea
2020-04-01 12:26:58 -07:00
Michael Ranieri
51d969e86a preprocessor cleanup (#33957)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33957

Lots of small preprocessor warning cleanups for Windows.

Test Plan: CI green

Reviewed By: malfet, albanD

Differential Revision: D20153582

fbshipit-source-id: 18fd61c466fd1f55ededdae4448b3009a9cedc04
2020-03-02 13:37:19 -08:00
mfkasim91
9d94f56ce0 Backward operation of torch.eig for real eigenvalues (#33090)
Summary:
Another pull request to follow up on issue https://github.com/pytorch/pytorch/issues/32531.
Here I implemented the backward operation for `torch.eig` under the condition that all the eigenvalues are real.

This pull request is independent of my other pull request https://github.com/pytorch/pytorch/issues/32932; there is no dependency between the two.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33090

Differential Revision: D19814347

Pulled By: albanD

fbshipit-source-id: 2fae30964e97987abb690544df8240aedeae56e8
2020-02-10 09:52:56 -08:00
anjali411
5b815d980e Added cummin
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32238

Differential Revision: D19416791

Pulled By: anjali411

fbshipit-source-id: 5aadc0a7a55af40d76f444ab7d7d47ec822f55a5
2020-01-17 10:51:58 -08:00
anjali411
8dc67a014f Add cummax
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32169

Differential Revision: D19393236

Pulled By: anjali411

fbshipit-source-id: 5dac6b0a4038eb48458d4a0b253418daeccbb6bc
2020-01-14 17:19:10 -08:00
Alexander Golynski
b783a75aa3 Fix scalar^tensor derivative for scalars that are zero
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32063

Test Plan: Imported from OSS

Differential Revision: D19394258

Pulled By: agolynski

fbshipit-source-id: 3eed0f9cc1b8c677c6948c927d007044be67fe7f
2020-01-14 11:11:23 -08:00
Alexander Golynski
fa60e1150d Fix tensor^tensor derivative for 0 base entries
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32062

Test Plan: Imported from OSS

Differential Revision: D19394259

Pulled By: agolynski

fbshipit-source-id: 836525e03573af838511ad5b4cc87ec2c1536a5e
2020-01-14 11:10:25 -08:00
leetanenbaum
5988d36f58 Fix cumprod error for tensors with zero elements (#32070)
Summary:
Currently `cumprod` crashes for tensors with non-empty dimensions but with zero elements, which could happen when some dimension is zero. This commit fixes the error by checking both `dim()` and `numel()` in cumprod backward.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32070

Differential Revision: D19373200

Pulled By: ezyang

fbshipit-source-id: d8ecde33f3330b40a7c611f6faa3b1d707ef2a9a
2020-01-13 09:50:27 -08:00
leetanenbaum
0b9cd410a9 Fix cumsum error for tensors with zero elements (#31694)
Summary:
Currently `cumsum` crashes for tensors with non-empty dimensions but with zero elements, which could happen when some dimension is zero. This commit fixes the error by checking both `dim()` and `numel()` in cumsum backward

Fixes https://github.com/pytorch/pytorch/issues/31515
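
A minimal sketch of the previously crashing case, a tensor with non-empty dimensions but zero elements going through cumsum backward (the same pattern applies to the cumprod fix above):

```python
# Sketch: a (0, 3) tensor has dim() == 2 but numel() == 0; backward through
# cumsum should now succeed instead of crashing.
import torch

x = torch.zeros(0, 3, requires_grad=True)
x.cumsum(dim=0).sum().backward()
print(x.grad.shape)  # torch.Size([0, 3])
```
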
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31694

Reviewed By: mrshenli

Differential Revision: D19266613

Pulled By: leedtan

fbshipit-source-id: 9407e0aa55440fed911c01a3580bb6c5eab62a16
2020-01-03 10:16:46 -08:00
Vitaly Fedyunin
e46babb637 explicitly provide memory format when calling to *_like operators
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30007

Test Plan: Imported from OSS

Differential Revision: D18575982

Pulled By: VitalyFedyunin

fbshipit-source-id: 83be0857fe1080216cd09547a2b3d34455a0cce4
2019-11-19 16:19:24 -08:00
Vitaly Fedyunin
dc9e7b73e1 explicitly provide memory format when calling to *_like operators (Redo of e3e06549)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30004

Test Plan: Imported from OSS

Differential Revision: D18575977

Pulled By: VitalyFedyunin

fbshipit-source-id: 344e9a11c93c7e4a822f424c94fa2255592d118e
2019-11-19 16:19:11 -08:00
Vitaly Fedyunin
20b73e1805 explicitly provide memory format when calling to *_like operators (Redo of 631b22d)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30001

Test Plan: Imported from OSS

Differential Revision: D18575979

Pulled By: VitalyFedyunin

fbshipit-source-id: d6fe8a6e1b45673f85a0dd49bd6becfadc5091b4
2019-11-19 16:18:58 -08:00
Igor Fedan
75309b45f3 explicitly provide memory format when calling to clone() at Indexing.cpp
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/28660

Test Plan: Imported from OSS

Differential Revision: D18333346

Pulled By: ifedan

fbshipit-source-id: 06590205d883a5096388a4ae318389244130972d
2019-11-07 05:38:32 -08:00
Vitaly Fedyunin
e3e06549c1 Autogenerated contiguous memory format for old *_like calls
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/29225

Test Plan: Imported from OSS

Differential Revision: D18330964

Pulled By: VitalyFedyunin

fbshipit-source-id: f357a0cc125bd90a62575bd461722b9e36e75cbf
2019-11-06 07:24:34 -08:00
Vitaly Fedyunin
d410fc5a81 Autogenerated contiguous memory format for old *_like calls
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/29222

Test Plan: Imported from OSS

Differential Revision: D18330966

Pulled By: VitalyFedyunin

fbshipit-source-id: 9e8da4e826cc43fac9828737ef744606491812a4
2019-11-06 07:24:21 -08:00
Xiaomeng Yang
2460dced8f Add torch.nn.GELU for GELU activation (#28944)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28944

Add torch.nn.GELU for GELU activation
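
A short usage sketch checking the new module against the exact (erf-based) GELU formula:

```python
# Sketch: torch.nn.GELU's default matches 0.5 * x * (1 + erf(x / sqrt(2))).
import math
import torch

x = torch.randn(8)
module_out = torch.nn.GELU()(x)
formula_out = 0.5 * x * (1.0 + torch.erf(x / math.sqrt(2.0)))
print(torch.allclose(module_out, formula_out))  # True
```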

Test Plan: buck test mode/dev-nosan //caffe2/test:nn -- "GELU"

Reviewed By: hl475, houseroad

Differential Revision: D18240946

fbshipit-source-id: 6284b30def9bd4c12bf7fb2ed08b1b2f0310bb78
2019-11-03 21:55:05 -08:00
titaneric
82f31e02a3 Remove the redundant calculation of derivative of power function (#28651)
Summary:
Hi, I noticed that PyTorch faces the same issue as HIPS/autograd#541.
I tried to solve it; hope it can help.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28651

Reviewed By: gchanan

Differential Revision: D18137163

Pulled By: albanD

fbshipit-source-id: 888bef65c72c4c15c2acdd4b13d5041008b1354e
2019-10-28 08:37:04 -07:00
Igor Fedan
bc57967e07 max_pool2d cuda should have channel last optimized kernels[Performance improvement] (#24872)
Summary:
max_pool2d_with_indices_cuda and max_pool2d_with_indices_backward_cuda should have channels-last optimized kernels (https://github.com/pytorch/pytorch/issues/23815).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24872

Differential Revision: D16964577

Pulled By: ifedan

fbshipit-source-id: 296dfef8e511a7ae2ed423e34e902d5401b3becb
2019-10-21 11:28:12 -07:00
Lu Fang
e9a91756cd Back out "[pytorch][PR] Migrate soft_margin_loss from the TH to Aten (CUDA+CPU)"
Summary: Original commit changeset: 9ddffe4dbbfa

Test Plan: ci

Reviewed By: yf225

Differential Revision: D17939581

fbshipit-source-id: 44a3b843bf1e7059fec57b9e3d12ed4886816145
2019-10-15 21:12:10 -07:00
Edward Yang
2aa84d927b Revert D17939700: Revert D17889288: [pytorch][PR] Migrate soft_margin_loss from the TH to Aten (CUDA+CPU)
Test Plan: revert-hammer

Differential Revision:
D17939700

Original commit changeset: 4fc6156ba388

fbshipit-source-id: dded0a2140d2c14cd2f2a574987ecc164b0e5bfe
2019-10-15 15:24:36 -07:00
Edward Yang
c44e33b578 Revert D17889288: [pytorch][PR] Migrate soft_margin_loss from the TH to Aten (CUDA+CPU)
Test Plan: revert-hammer

Differential Revision:
D17889288

Original commit changeset: 9ddffe4dbbfa

fbshipit-source-id: 4fc6156ba38834512b2f735ac0d03e34e69b7286
2019-10-15 14:35:28 -07:00
Thomas Viehmann
f461184505 Use grad_out for cudnn CTC loss (#27039)
Summary:
Using grad_out for CuDNN CTC loss fixes: https://github.com/pytorch/pytorch/issues/26797, https://github.com/pytorch/pytorch/issues/25833.

We also fix a cudnn incompatible change that surfaced during the testing: As of CuDNN 7.6 the semantics of the CTC loss gradients are different.
This leads us to disable CuDNN CTC for CuDNN < 7.6. To mitigate the impact on users, we convert the parameters for the native implementation if CuDNN isn't applicable (previously this would give an error.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27039

Differential Revision: D17910815

Pulled By: ngimel

fbshipit-source-id: 465b33612d3402f10c355aa7026a7e1ffaef3073
2019-10-15 11:36:37 -07:00
Divyansh Singhvi
3397d41b8a Wrapping namespace Reduction in namespace at (#26606) (#27422)
Summary:
1) Wrapped namespace `Reduction` in namespace `at`
2) Prefixed `at::` wherever `Reduction::` is used
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27422

Differential Revision: D17913759

Pulled By: yf225

fbshipit-source-id: 8f00ca01cad2e7f673d316b128abf59c026e216c
2019-10-15 11:05:40 -07:00
Andreas Koepf
9033ace9c4 Migrate soft_margin_loss from the TH to Aten (CUDA+CPU) (#27673)
Summary:
Replaces fused TH kernels with a 2-liner of regular Tensor functions.
Benchmarking revealed that performance improves compared to PyTorch 1.2.

Refs: https://github.com/pytorch/pytorch/issues/24631, https://github.com/pytorch/pytorch/issues/24632, https://github.com/pytorch/pytorch/issues/24764, https://github.com/pytorch/pytorch/issues/24765
VitalyFedyunin

### Benchmarking results on my laptop:

## 1.4.0a0+f63c9e8 output
```
PyTorch version: 1.4.0a0+f63c9e8
CPU Operator sanity check:
tensor(0.5926, grad_fn=<MeanBackward0>)
tensor([-0.0159, -0.0170, -0.0011, -0.0083, -0.0140, -0.0217, -0.0290, -0.0262,
        -0.0078, -0.0129])
double backward
tensor(-0.1540, grad_fn=<SumBackward0>)
ok

GPU Operator sanity check:
tensor(0.5601, device='cuda:0', grad_fn=<MeanBackward0>)
tensor([-0.0393, -0.0316, -0.0233, -0.0140, -0.0141, -0.0161, -0.0322, -0.0238,
        -0.0054, -0.0151], device='cuda:0')
double backward
tensor(-0.2148, device='cuda:0', grad_fn=<SumBackward0>)
ok

CPU warmup 1000 took 9.025700273923576e-05
CPU warmup 10000 took 0.0009383050055475906
CPU warmup 100000 took 0.0015631120040779933
CPU warmup TOTAL time 0.0026368020044174045
CPU forward 1000 took 6.919399311300367e-05
CPU forward 10000 took 0.00014462800754699856
CPU forward 100000 took 0.0011234670091653243
CPU forward 1000000 took 0.014555767003912479
CPU forward 10000000 took 0.13409724000666756
CPU forward 100000000 took 1.246048310000333
CPU forward TOTAL time 1.3961777170043206
CPU for- & backward 1000 took 0.0003219560021534562
CPU for- & backward 10000 took 0.00037290599721018225
CPU for- & backward 100000 took 0.001975035003852099
CPU for- & backward 1000000 took 0.02621342398924753
CPU for- & backward 10000000 took 0.2944270490115741
CPU for- & backward 100000000 took 1.6856628700043075
CPU for- & backward TOTAL time 2.0091958299890393

GPU warmup 1000 took 0.0002462909906171262
GPU warmup 10000 took 9.991199476644397e-05
GPU warmup 100000 took 0.00034347400651313365
GPU warmup TOTAL time 0.0007382350013358518
GPU forward 1000 took 9.67290106927976e-05
GPU forward 10000 took 9.349700121674687e-05
GPU forward 100000 took 9.384499571751803e-05
GPU forward 1000000 took 0.0004975290066795424
GPU forward 10000000 took 0.0017606960027478635
GPU forward 100000000 took 0.003572814996005036
GPU forward TOTAL time 0.006185991995153017
GPU for- & backward 1000 took 0.00035818999458570033
GPU for- & backward 10000 took 0.0003240450023440644
GPU for- & backward 100000 took 0.0003223370003979653
GPU for- & backward 1000000 took 0.00036740700306836516
GPU for- & backward 10000000 took 0.0003690610028570518
GPU for- & backward 100000000 took 0.0003672500024549663
GPU for- & backward TOTAL time 0.002197896988946013
```

## 1.2 output
```
PyTorch version: 1.2.0
CPU Operator sanity check:
tensor(0.5926, grad_fn=<SoftMarginLossBackward>)
tensor([-0.0159, -0.0170, -0.0011, -0.0083, -0.0140, -0.0217, -0.0290, -0.0262,
        -0.0078, -0.0129])
double backward
tensor(-0.1540, grad_fn=<SumBackward0>)
ok

GPU Operator sanity check:
tensor(0.5601, device='cuda:0', grad_fn=<SoftMarginLossBackward>)
tensor([-0.0393, -0.0316, -0.0233, -0.0140, -0.0141, -0.0161, -0.0322, -0.0238,
        -0.0054, -0.0151], device='cuda:0')
double backward
tensor(-0.2148, device='cuda:0', grad_fn=<SumBackward0>)
ok

CPU warmup 1000 took 8.422900282312185e-05
CPU warmup 10000 took 0.00036992700188420713
CPU warmup 100000 took 0.003682684007799253
CPU warmup TOTAL time 0.004169487991021015
CPU forward 1000 took 5.521099956240505e-05
CPU forward 10000 took 0.00036948200431652367
CPU forward 100000 took 0.003762389998883009
CPU forward 1000000 took 0.03725024699815549
CPU forward 10000000 took 0.3614480490068672
CPU forward 100000000 took 3.6139175269927364
CPU forward TOTAL time 4.016912263003178
CPU for- & backward 1000 took 0.0002734809968387708
CPU for- & backward 10000 took 0.0006605249946005642
CPU for- & backward 100000 took 0.005437346000690013
CPU for- & backward 1000000 took 0.051245586000732146
CPU for- & backward 10000000 took 0.5291594529990107
CPU for- & backward 100000000 took 5.23841712900321
CPU for- & backward TOTAL time 5.8253340990049765

GPU warmup 1000 took 0.0005757809994975105
GPU warmup 10000 took 0.0004058420017827302
GPU warmup 100000 took 0.0003764610009966418
GPU warmup TOTAL time 0.0013992580061312765
GPU forward 1000 took 0.0003543390048434958
GPU forward 10000 took 0.0003633670130511746
GPU forward 100000 took 0.0004807310033356771
GPU forward 1000000 took 0.0005875999922864139
GPU forward 10000000 took 0.0016903509967960417
GPU forward 100000000 took 0.014400018990272656
GPU forward TOTAL time 0.0179396449966589
GPU for- & backward 1000 took 0.0006167769897729158
GPU for- & backward 10000 took 0.0006845899915788323
GPU for- & backward 100000 took 0.000631830989732407
GPU for- & backward 1000000 took 0.0010741150035755709
GPU for- & backward 10000000 took 0.0017265130009036511
GPU for- & backward 100000000 took 0.014847910992102697
GPU for- & backward TOTAL time 0.01965981800458394
```

### Code used for performance test
```
import torch
import torch.nn.functional as F
import torch.nn as nn

from timeit import default_timer

torch.manual_seed(0)
cpu = torch.device('cpu')
gpu = torch.device('cuda')

loss_fn = F.soft_margin_loss

def run_benchmark(name, depth, require_grad, device, fn):
    total_start = default_timer()
    for i in range(3, 3 + depth):
        start = default_timer()
        n = 10 ** i
        a = torch.rand(n, requires_grad=require_grad, device=device)
        b = torch.rand(n, device=device)
        fn(a, b)
        end = default_timer()
        print('{} {} took {}'.format(name, n, end-start))
    total_end = default_timer()
    print('{} TOTAL time {}'.format(name, total_end-total_start))

def fwd_only(a, b):
    out = loss_fn(a, b)

def fwd_bck(a, b):
    out = loss_fn(a, b)
    out.backward()

def sanity_check(name, device):
    print('{} Operator sanity check:'.format(name))
    a = torch.rand(10, requires_grad=True, device=device)
    b = torch.rand(10, device=device)
    out = loss_fn(a,b)
    print(out)
    out.backward()
    print(a.grad)
    print('double backward')
    loss = loss_fn(a, b)
    loss2 = torch.autograd.grad(loss, a, create_graph=True)
    z = loss2[0].sum()
    print(z)
    z.backward()
    print('ok')
    print()

print('PyTorch version:', torch.__version__)
sanity_check('CPU', cpu)
sanity_check('GPU', gpu)
print()

run_benchmark('CPU warmup', 3, False, cpu, fwd_only)
run_benchmark('CPU forward', 6, False, cpu, fwd_only)
run_benchmark('CPU for- & backward', 6, True, cpu, fwd_bck)
print()

run_benchmark('GPU warmup', 3, False, gpu, fwd_only)
run_benchmark('GPU forward', 6, False, gpu, fwd_only)
run_benchmark('GPU for- & backward', 6, True, gpu, fwd_bck)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27673

Differential Revision: D17889288

Pulled By: ezyang

fbshipit-source-id: 9ddffe4dbbfab6180847a8fec32443910f18f0a9
2019-10-15 08:44:57 -07:00
Xiaomeng Yang
8b87f9a510 Add fused layer norm impl on CUDA in PyTorch (#27634)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27634

Add fused layer norm impl on CUDA in PyTorch

Performance benchmark compare to apex.FusedLayerNorm on a V100 machine.

**************************************
Shape = (128, 2097152)
  curr LayerNorm forward: 7.252584544941783ms
  apex LayerNorm forward: 10.366813436849043ms
  curr LayerNorm backward: 15.568048988003284ms
  apex LayerNorm backward: 20.869979876093566ms
**************************************
Shape = (256, 1048576)
  curr LayerNorm forward: 5.185673736967146ms
  apex LayerNorm forward: 6.3868385690730065ms
  curr LayerNorm backward: 13.942008479032665ms
  apex LayerNorm backward: 15.469660016940907ms
**************************************
Shape = (512, 524288)
  curr LayerNorm forward: 4.672068868065253ms
  apex LayerNorm forward: 4.717993081081659ms
  curr LayerNorm backward: 13.46354596503079ms
  apex LayerNorm backward: 14.04774487693794ms
**************************************
Shape = (1024, 262144)
  curr LayerNorm forward: 4.547273400006816ms
  apex LayerNorm forward: 5.378365494078025ms
  curr LayerNorm backward: 13.425063178874552ms
  apex LayerNorm backward: 14.235145597020164ms
**************************************
Shape = (2048, 131072)
  curr LayerNorm forward: 4.526399010093883ms
  apex LayerNorm forward: 4.775081946980208ms
  curr LayerNorm backward: 13.222738380078226ms
  apex LayerNorm backward: 13.59594238596037ms
**************************************
Shape = (4096, 65536)
  curr LayerNorm forward: 4.28789056581445ms
  apex LayerNorm forward: 4.48913648002781ms
  curr LayerNorm backward: 13.026655421825126ms
  apex LayerNorm backward: 13.57052089786157ms
**************************************
Shape = (8192, 32768)
  curr LayerNorm forward: 4.243518367875367ms
  apex LayerNorm forward: 4.34588153520599ms
  curr LayerNorm backward: 13.140627697808668ms
  apex LayerNorm backward: 13.49891544203274ms
**************************************
Shape = (16384, 16384)
  curr LayerNorm forward: 4.181216162163764ms
  apex LayerNorm forward: 4.268723972840235ms
  curr LayerNorm backward: 13.035593512002379ms
  apex LayerNorm backward: 13.463351831072941ms
**************************************
Shape = (32768, 8192)
  curr LayerNorm forward: 4.097899778978899ms
  apex LayerNorm forward: 4.109480210812762ms
  curr LayerNorm backward: 13.041268918896094ms
  apex LayerNorm backward: 13.586135944118723ms

Test Plan: buck test mode/dev-nosan caffe2/test:nn -- "LayerNorm"

Reviewed By: houseroad

Differential Revision: D17462420

fbshipit-source-id: d4a67d160bb4eff73ffac64af46c56c3845cf211
2019-10-14 21:26:33 -07:00
Anjali Chourdia
da669c25ee autograd: double backwards function for binary_cross_entropy loss
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/26983

Reviewed By: albanD

Differential Revision: D17714357

Pulled By: anjali411

fbshipit-source-id: cebfe09a9048c4be457b7f2718bc396c06ecabee
2019-10-04 08:29:22 -07:00
Vishwak Srinivasan
c643290982 Add derivative for cholesky_inverse (#26451)
Summary:
Changelog:

- Add derivative of cholesky_inverse. The equations are derived akin to the derivative of the solve methods, using the technique detailed [here](https://people.maths.ox.ac.uk/gilesm/files/NA-08-01.pdf).
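
A gradcheck-style sketch of the kind of test added in test_autograd.py; the SPD construction and the use of `torch.linalg.cholesky` (rather than the `torch.cholesky` of this era) are my own assumptions:

```python
# Sketch: check the new cholesky_inverse derivative on a well-conditioned
# lower-triangular Cholesky factor.
import torch

torch.manual_seed(0)
a = torch.randn(4, 4, dtype=torch.double)
spd = a @ a.t() + 4 * torch.eye(4, dtype=torch.double)   # symmetric positive definite
L = torch.linalg.cholesky(spd).requires_grad_(True)

def fn(l):
    # Re-apply tril() so gradcheck's perturbations stay in the valid input set.
    return torch.cholesky_inverse(l.tril())

print(torch.autograd.gradcheck(fn, (L,)))  # True
```
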
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26451

Test Plan:
- Added tests for cholesky_inverse in test_autograd.py

Closes https://github.com/pytorch/pytorch/issues/4669.

Differential Revision: D17548526

Pulled By: ezyang

fbshipit-source-id: 51aa8b900a8dc4012b01a73d432606f216f62c9d
2019-09-24 07:12:41 -07:00
Edward Yang
9b7011c5c2 Implement multiple dispatch (#26468) (#26501)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26501

Instead of considering only the TensorTypeSet of the first argument, we collect all Tensor and TensorList arguments and union them together before computing the dispatch type id.

XLA companion patch at https://github.com/pytorch/xla/pull/1031

Billing of changes:
* ATenDispatch fallback code (i.e., what gets run if there is no entry for a function in the table) now lives out-of-line in a function `getFallbackOp`. This gave me an opportunity to write a more detailed error message, providing information about what registrations were available. There is a TODO in the fallback code, suggesting that we could automatically redispatch in the event that there is no handler for the key. But this is a bit of a design question, because it's not clear if automatic redispatch would cover up errors in the dispatch table (i.e., there *should* have been something registered at some key, but there wasn't.)
* Collection of Tensor/TensorList arguments is done using the trusty old IterArgs helper class. A minor bit of refactoring I had to do to get here was move the IterArgs functionality in torch/csrc/utils/variadic.h into ATen/core.  There's some refactoring due on that file too (it has copies of some C++ helper pieces which already live in c10--you can't actually move the whole thing because it is literally incompatible with other code in the codebase). So instead of calling `type_set()` to get the type set of the dispatch argument, now we just call `at::detail::multi_dispatch_tensor_type_set` on all of the tensor/tensor list arguments.
* The code generator is adjusted to codegen collection of arguments as needed. There is a little bit of a hack in the code generator to turn 'self' arguments into '*this'.  I think this may be duplicated with some logic somewhere else but I have to double check.

The new generated code looks like this:

```
inline Tensor & Tensor::copy_(const Tensor & src, bool non_blocking) const {
    static auto table = globalATenDispatch().getOpTable("aten::copy_(Tensor(a!) self, Tensor src, bool non_blocking=False) -> Tensor(a!)");
    return table->getOp<Tensor & (Tensor &, const Tensor &, bool)>(at::detail::multi_dispatch_tensor_type_set(*this, src))(const_cast<Tensor&>(*this), src, non_blocking);
}
```

The key difference is that previously we wrote `type_set()` as argument to getOp; now it is a call to `multi_dispatch_tensor_type_set` which collects the type ids together.

After turning on multi-dispatch, I had to refactor existing code which previously dispatched one place, but now dispatches somewhere else. The primary component affected by this is sparse.

* Binary operations (add/sub/mul/div/addmm) now dispatch to sparse kernels even if you did add(dense, sparse). So I delete all the sparse handling code from dense kernels, and bulk up the sparse error handling to handle when the first argument is dense. In the case of addmm, I can eliminate the bridge code entirely (well, not quite: more on this below). I also updated the dispatch on sparse to actually point at sparse kernels. Pay special attention to the handling of `div_` by scalar: previously this logic lived in the "dense" `div_` implementation, but there is actually not any sparse kernel we dispatch to. I solved this particular problem by making a redispatch, but another valid approach would have been to add specific dispatches for sparse div on scalar. This codepath is poorly tested because it is only exercised from C++.
* One minor annoyance is that because I now want separate dispatch for dense and sparse, I also need to replicate the `add`, `add_`, `add_out` trifecta on the sparse side. I opted for a compromise here: I wrote a new `add_sparse` trifecta, but reused the implementation between CPU and CUDA. This means that I have to do another dispatch once I get to `add_out`. The alternative would have been to do twice as many copies for CPU and CUDA (thereby eliminating the extra dispatch) but that seemed distinctly not worth it.
* A lot of kernels in sparse assumed that the dispatch argument must be sparse. This is no longer true with dispatch, so I converted the asserts into plain error checking. This also means that we've perturbed the error message in the case of TestSparseOneOff.test_cuda_sparse_cpu_dense_add (I just updated the saved error message)
* `addmm` is a little bit even more special: the bridge code also handled broadcasting. I replicated the broadcasting logic between CPU and CUDA implementations to avoid an extra dispatch.
* `_sparse_addmm` gave me a bit of trouble, because I had forgotten why we had `torch.sparse.addmm` in the first place. But in the end, its changes followed along with the structural changes I made in addmm. I opted for an extra dispatch here for simplicity.
* c10d has some Variable-Tensor confusion in its sparse code. I've worked around it by judiciously inserting "no variable type" guards, but a more correct fix would be to just solve the confusion entirely.

Benchmark:

Apply the following patch to the base commit and this commit:

```
diff --git a/aten/src/ATen/native/Const.cpp b/aten/src/ATen/native/Const.cpp
new file mode 100644
index 0000000000..b66f4d3ece
--- /dev/null
+++ b/aten/src/ATen/native/Const.cpp
@@ -0,0 +1,10 @@
+#include <ATen/ATen.h>
+
+namespace at {
+namespace native {
+
+Tensor _const5(const Tensor& self, const Tensor& second, const Tensor& third, const Tensor& fourth, const Tensor& fifth) {
+  return self;
+}
+
+}} // namespace at::native
diff --git a/aten/src/ATen/native/native_functions.yaml b/aten/src/ATen/native/native_functions.yaml
index b494ed7950..fddae638bb 100644
--- a/aten/src/ATen/native/native_functions.yaml
+++ b/aten/src/ATen/native/native_functions.yaml
@@ -5878,3 +5878,9 @@
   dispatch:
     CPU: im2col_backward_cpu
     CUDA: im2col_backward_cuda
+
+# For benchmarking
+- func: _const5(Tensor self, Tensor second, Tensor third, Tensor fourth, Tensor fifth) -> Tensor
+  variants: function
+  dispatch:
+    CPU: _const5
```

Comparisons with timeit:

One-argument, representative case:

Before:

```
In [6]: %timeit x.reshape(1, 1)
1.46 µs ± 1.38 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [7]: %timeit x.reshape(1, 1)
1.48 µs ± 29.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [8]: %timeit x.reshape(1, 1)
1.52 µs ± 61.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

After:

```
In [3]: %timeit x.reshape(1, 1)
1.42 µs ± 1.34 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [4]: %timeit x.reshape(1, 1)
1.43 µs ± 1.01 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [5]: %timeit x.reshape(1, 1)
1.42 µs ± 0.982 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

Five-argument, synthetic case (we expect, with enough Tensor arguments, for there to be a slowdown, as we scale `O(n)` with number of arguments, compared to old dispatcher which is `O(1)` with number of arguments):

Before:

```
In [1]: import torch

In [2]: x = torch.zeros(1)

In [3]: %timeit torch._const5(x, x, x, x, x)
949 ns ± 1.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [4]: %timeit torch._const5(x, x, x, x, x)
954 ns ± 1.96 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [5]: %timeit torch._const5(x, x, x, x, x)
947 ns ± 0.601 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

After:

```
In [3]: %timeit torch._const5(x, x, x, x, x)
985 ns ± 9.11 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [4]: %timeit torch._const5(x, x, x, x, x)
984 ns ± 1.17 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [5]: %timeit torch._const5(x, x, x, x, x)
988 ns ± 0.555 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D17499154

Pulled By: ezyang

fbshipit-source-id: 8ea237c2e935134b0f4f8d6cfd89c6a93037c02c
2019-09-20 10:12:04 -07:00