Summary:
This one contained a tricky usage of `torch.symeig` that had to be replaced; I tested the replacement locally.
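The commit doesn't show the exact call that was rewritten, but the usual `torch.symeig` → `torch.linalg.eigh` migration looks roughly like this (a minimal sketch, not the actual change from this PR):
```python
import torch

A = torch.randn(4, 4)
A = A + A.T  # symmetric input

# Deprecated: e, V = torch.symeig(A, eigenvectors=True)  (reads the upper triangle by default)
# Replacement: torch.linalg.eigh returns ascending eigenvalues and the eigenvectors;
# pass UPLO="U" to match symeig's old default of using the upper triangle.
e, V = torch.linalg.eigh(A, UPLO="U")
```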
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57732
Reviewed By: bdhirsh
Differential Revision: D28328189
Pulled By: mruberry
fbshipit-source-id: 7f000fcbf2b029beabc76e5a89ff158b47977474
Summary:
Backward methods for `torch.lu` and `torch.lu_solve` require the `torch.lu_unpack` method.
However, while `torch.lu` is a Python wrapper over a native function (so its gradient is implemented via `autograd.Function`),
`torch.lu_solve` is a native function and cannot access `torch.lu_unpack`, which is implemented in Python.
Hence this PR presents a native (ATen) version of `lu_unpack`. With this function it also becomes possible to update the gradients for `torch.lu` so that backward+JIT is supported (`autograd.Function` does not support JIT).
~~The interface for this method is different from the original `torch.lu_unpack`, so it is decided to keep it hidden.~~
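For context, a minimal sketch of the Python-level round trip that this PR reimplements as a native ATen function (shapes are illustrative):
```python
import torch

A = torch.randn(3, 3)
LU, pivots = torch.lu(A)               # packed LU factorization
P, L, U = torch.lu_unpack(LU, pivots)  # the method needed by the lu/lu_solve backwards
assert torch.allclose(P @ L @ U, A, atol=1e-6)
```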
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46913
Reviewed By: albanD
Differential Revision: D28355725
Pulled By: mruberry
fbshipit-source-id: 281260f3b6e93c15b08b2ba66d5a221314b00e78
Summary:
This PR adds a note to the documentation that torch.svd is deprecated together with an upgrade guide on how to use `torch.linalg.svd` and `torch.linalg.svdvals` (Lezcano's instructions from https://github.com/pytorch/pytorch/issues/57549).
In addition, all usages of the old svd function are replaced with the new ones from the torch.linalg module, except for the `at::linalg_pinv` function, which fails the XLA CI build (https://github.com/pytorch/xla/issues/2755, see the failure in draft PR https://github.com/pytorch/pytorch/pull/57772).
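A condensed sketch of that upgrade guidance (see the linked issue for the full instructions):
```python
import torch

A = torch.randn(5, 3)

# Deprecated:           U, S, V = torch.svd(A)
# Singular values only: S = torch.svd(A, compute_uv=False)[1]
U, S, Vh = torch.linalg.svd(A, full_matrices=False)  # note: returns V^H, not V
V = Vh.transpose(-2, -1).conj()                      # only if V itself is needed
S_only = torch.linalg.svdvals(A)                     # replacement for compute_uv=False
```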
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57981
Reviewed By: ngimel
Differential Revision: D28345558
Pulled By: mruberry
fbshipit-source-id: 02dd9ae6efe975026e80ca128e9b91dfc65d7213
Summary:
Backward methods for `torch.lu` and `torch.lu_solve` require the `torch.lu_unpack` method.
However, while `torch.lu` is a Python wrapper over a native function (so its gradient is implemented via `autograd.Function`),
`torch.lu_solve` is a native function and cannot access `torch.lu_unpack`, which is implemented in Python.
Hence this PR presents a native (ATen) version of `lu_unpack`. With this function it also becomes possible to update the gradients for `torch.lu` so that backward+JIT is supported (`autograd.Function` does not support JIT).
~~The interface for this method is different from the original `torch.lu_unpack`, so it is decided to keep it hidden.~~
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46913
Reviewed By: astaff
Differential Revision: D28117714
Pulled By: mruberry
fbshipit-source-id: befd33db12ecc147afacac792418b6f4948fa4a4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50903
First part of #50010. Also fixes #51127.
Test Plan: Imported from OSS
Reviewed By: ngimel
Differential Revision: D27911345
Pulled By: mruberry
fbshipit-source-id: 7138fddc935802918ab9ff19f4bc1b9f4d745d41
Summary:
As per discussion here https://github.com/pytorch/pytorch/pull/57127#discussion_r624948215
Note that we cannot remove the optional type from the `dim` parameter because the default is to flatten the input tensor, which cannot be easily captured by a value other than `None`.
### BC Breaking Note
This PR changes the `ord` parameter of `torch.linalg.vector_norm` so that it no longer accepts `None` arguments. The default behavior of `2` is equivalent to the previous default of `None`.
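A short sketch of the behaviour described above:
```python
import torch

x = torch.tensor([3.0, -4.0])

torch.linalg.vector_norm(x)                    # ord=2 by default -> tensor(5.)
torch.linalg.vector_norm(x, ord=2)             # explicit, same result
torch.linalg.vector_norm(x, ord=float('inf'))  # tensor(4.)

y = torch.randn(2, 3)
torch.linalg.vector_norm(y)                    # dim=None flattens: norm over all 6 elements
```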
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57662
Reviewed By: albanD, mruberry
Differential Revision: D28228870
Pulled By: heitorschueroff
fbshipit-source-id: 040fd8055bbe013f64d3c8409bbb4b2c87c99d13
Summary:
The problem arises for sinc'(x) when x != 0 but x ** 2 == 0, which happens for some very small floats.
I realized that my solution from https://github.com/pytorch/pytorch/issues/56763 was incomplete when I did a quick implementation using `torch.autograd.Function` and still got a `NaN` from my derivative.
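A minimal reproduction of the underflow case described above (the exact value is only illustrative):
```python
import torch

x = torch.tensor(1e-23, dtype=torch.float32)
print(x != 0)      # tensor(True)
print(x * x == 0)  # tensor(True): x**2 underflows, so a formula dividing by it yields inf/NaN
```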
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56986
Reviewed By: gchanan
Differential Revision: D28093507
Pulled By: albanD
fbshipit-source-id: 2a30e1065b08c5c60de843a0778dedeb0fb295f4
Summary:
This is an automatic change generated by the following script:
```
#!/usr/bin/env python3
from subprocess import check_output, check_call
import os
def get_compiled_files_list():
    import json
    with open("build/compile_commands.json") as f:
        data = json.load(f)
    files = [os.path.relpath(node['file']) for node in data]
    for idx, fname in enumerate(files):
        if fname.startswith('build/') and fname.endswith('.DEFAULT.cpp'):
            files[idx] = fname[len('build/'):-len('.DEFAULT.cpp')]
    return files


def run_clang_tidy(fname):
    check_call(["python3", "tools/clang_tidy.py", "-c", "build", "-x", fname, "-s"])
    changes = check_output(["git", "ls-files", "-m"])
    if len(changes) == 0:
        return
    check_call(["git", "commit", "--all", "-m", f"NOLINT stubs for {fname}"])


def main():
    git_files = check_output(["git", "ls-files"]).decode("ascii").split("\n")
    compiled_files = get_compiled_files_list()
    for idx, fname in enumerate(git_files):
        if fname not in compiled_files:
            continue
        if fname.startswith("caffe2/contrib/aten/"):
            continue
        print(f"[{idx}/{len(git_files)}] Processing {fname}")
        run_clang_tidy(fname)


if __name__ == "__main__":
    main()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56892
Reviewed By: H-Huang
Differential Revision: D27991944
Pulled By: malfet
fbshipit-source-id: 5415e1eb2c1b34319a4f03024bfaa087007d7179
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54085
Fixes https://github.com/pytorch/pytorch/issues/50121.
This fixes two similar issues pointed out with the dtype that `torch.pow` performs its computation. Thanks ngimel for spotting the issues originally (comments [here](https://github.com/pytorch/pytorch/pull/53669#discussion_r594624355) and [here](https://github.com/pytorch/pytorch/pull/53669#discussion_r594719704))!
Before:
```
>>> torch.pow(2, torch.tensor([17], dtype=torch.uint8), out=torch.tensor([0]))
tensor([0])
>>> torch.pow(2, torch.tensor(17, dtype=torch.uint8), out=torch.tensor(0))
tensor(131072)
>>> torch.pow(2, torch.tensor([17], dtype=torch.uint8, device='cuda'), out=torch.tensor([0], device='cuda'))
tensor([131072], device='cuda:0')
>>> torch.pow(2, torch.tensor(17, dtype=torch.uint8, device='cuda'), out=torch.tensor(0, device='cuda'))
tensor(131072, device='cuda:0')
```
After:
```
>>> torch.pow(2, torch.tensor([17], dtype=torch.uint8), out=torch.tensor([0]))
tensor([0])
>>> torch.pow(2, torch.tensor(17, dtype=torch.uint8), out=torch.tensor(0))
tensor(0)
>>> torch.pow(2, torch.tensor([17], dtype=torch.uint8, device='cuda'), out=torch.tensor([0], device='cuda'))
tensor([0], device='cuda:0')
>>> torch.pow(2, torch.tensor(17, dtype=torch.uint8, device='cuda'), out=torch.tensor(0, device='cuda'))
tensor(0, device='cuda:0')
```
In all four cases above, `tensor(0, ...)` is the correct value because the computed "common dtype" among the inputs is expected to be `uint8`. Computing `2 ** 17` in uint8 will then overflow to zero. Finally, we cast the computed output to the output tensor's dtype, which is `int64`.
There were two separate issues fixed in this PR: one for cpu and one for cuda:
* For CPU, The `pow(Scalar, Tensor)` overload wasn't calling `set_wrapped_number(true)` after wrapping the scalar in a Tensor, which caused the "promoted" scalar to incorrectly participate in type promotion (see the documented behavior [here](aa8714dfed/c10/core/TensorImpl.h (L590)))
* For CUDA, the cuda kernels defined in `PowKernel.cu` were using the output's dtype to run the computation, instead of the common dtype.
As an aside: The CPU and CUDA kernels actually both use `iter.dtype()` instead of `iter.common_dtype()` to run the computation, which I fixed. The reason that only manifested here for CUDA is because TensorIterator has cpu-specific logic to create temporary outputs with the intermediate dtype (shown [here](aa8714dfed/aten/src/ATen/TensorIterator.cpp (L349))). I'm not sure what the end state is there- I can imagine that being something we're more okay doing for cpu than for cuda, but it also leads to hard-to-track-down inconsistencies between the two like in this case.
Test Plan: Imported from OSS
Reviewed By: ngimel
Differential Revision: D27096330
Pulled By: bdhirsh
fbshipit-source-id: a7e2909243851625cb3056d1e7abb2383bfe95f2
Summary:
Converts loops of the form:
```
for(int64_t VAR=0;VAR<LIMIT;VAR++)
```
to the form
```
for(const auto VAR : c10::irange(LIMIT))
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55148
Test Plan: Sandcastle
Reviewed By: ngimel
Differential Revision: D27447811
fbshipit-source-id: 6311a094ec4a81a0b57383aaee0ba1b1dc2445c4
Summary:
Fixes https://github.com/pytorch/pytorch/issues/51652.
In particular:
- the main implementation is in `torch.linalg.det` now. `torch.det` is just a deprecated alias to it
- add a new `OpInfo` for `torch.linalg.det`
- remove the old-style tests for `torch.det` (this is similar to what we did for `torch.linalg.slogdet`, see https://github.com/pytorch/pytorch/issues/49194)
- added an `out=` argument to `torch.linalg.det`, but **not** to `torch.det` (a short sketch follows below).
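A tiny sketch of the `out=` difference mentioned in the last bullet:
```python
import torch

A = torch.randn(3, 3)
out = torch.empty(())            # 0-dim result tensor
torch.linalg.det(A, out=out)     # new: out= is supported here
det = torch.det(A)               # the deprecated alias has no out= variant
```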
It is worth noting that I had to skip a few tests:
- `TestGradientsCuda::test_fn_gradgrad_linalg_det_cuda_float64`. This is not a regression: the functionality is broken also on master, but the test is not executed properly due to https://github.com/pytorch/pytorch/issues/53361.
And the following tests, which fail only on ROCm:
- `test_variant_consistency_jit_cuda_{float64,float32}`
- `test_fn_grad_cuda_float64`
I think that the ROCm tests fail because the current linalg.det backward is unstable if the matrix has repeated singular values, see https://github.com/pytorch/pytorch/issues/53364 .
(At the moment of writing some CI jobs are still running but I believe the build will be green, since the only difference wrt the last push is the skip of the ROCm tests)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53119
Reviewed By: H-Huang
Differential Revision: D27441999
Pulled By: mruberry
fbshipit-source-id: 5eab14c4f0a165e0cf9ec626c3f4bb23359f2a9e
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53511
torch.det depends on torch.prod, which in turn depends on several other functions that themselves depend on torch.prod, so there is a circular relationship; hence this PR enables complex backward support for several functions at once.
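A small sketch of the kind of call this enables (reducing to a real scalar before calling backward):
```python
import torch

x = torch.randn(4, dtype=torch.cdouble, requires_grad=True)
y = x.prod()           # complex output; its backward now supports complex inputs
y.abs().backward()     # reduce to a real scalar for .backward()
print(x.grad.dtype)    # torch.complex128
```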
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48125
Reviewed By: pbelevich
Differential Revision: D27188589
Pulled By: anjali411
fbshipit-source-id: bbb80f8ecb83a0c3bea2b917627d3cd3b84eb09a
Summary:
This PR adds autograd support for `torch.orgqr`.
Since `torch.orgqr` is one of the few functions that expose LAPACK's naming, and all other linear algebra routines were renamed a long time ago, I also added a new function with a new name; `torch.orgqr` is now an alias for it.
The new proposed name is `householder_product`. For a matrix `input` and a vector `tau`, LAPACK's orgqr operation takes the columns of `input` (called Householder vectors or elementary reflectors) and the scalars of `tau`, which together represent Householder matrices, and then computes the product of these matrices. See https://www.netlib.org/lapack/lug/node128.html.
Other linear algebra libraries that I'm aware of do not expose this LAPACK function, so there is some freedom in naming it. It is usually used internally only for QR decomposition, but it can be useful for deep learning tasks now that it supports differentiation.
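A minimal sketch of the intended usage, pairing the new name with `torch.geqrf` (this assumes the new function lands as `torch.linalg.householder_product`, with `torch.orgqr` kept as an alias):
```python
import torch

A = torch.randn(5, 3, dtype=torch.float64)
h, tau = torch.geqrf(A)                       # Householder vectors + scalars from LAPACK geqrf
Q = torch.linalg.householder_product(h, tau)  # product of the Householder matrices (LAPACK orgqr)
R = torch.triu(h)[:3]                         # R lives in the upper triangle of h
assert torch.allclose(Q @ R, A)
```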
Resolves https://github.com/pytorch/pytorch/issues/50104
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52637
Reviewed By: agolynski
Differential Revision: D27114246
Pulled By: mruberry
fbshipit-source-id: 9ab51efe52aec7c137aa018c7bd486297e4111ce
Summary:
Provides a faster formula for `cumprod` in the case when the input has zeros. This formula is non-differentiable, so we leave the previous formula for the cases when `at::GradMode::is_enabled()`.
This new formula gives speed-ups of up to 10x on CPU and 30x on GPU (see the benchmarks below).
The `cumsum` backward formula was rewritten so that no copies are necessary. We also removed a double negation in its formula. This gives a significant speed-up in CPU, while being almost as efficient as the formula with copies in GPU. We can see this speed-up when comparing the "No zeros" part of the benchmark.
Benchmarks:
N.B. The script times the forward and the backward of `cumprod` together, so the speed-ups should be even larger than those announced here.
<details>
<summary>Script</summary>
```python
from IPython import get_ipython
import torch
from itertools import product
torch.manual_seed(13)
torch.set_num_threads(1)
ipython = get_ipython()
cpu = torch.device('cpu')
cuda = torch.device('cuda')
def run_test(ndims, size, size_prod, zeros, device):
    print(f"ndims: {ndims}, tensor_size: {size}, size_prod: {size_prod}, zeros: {zeros}, device: {device}")
    for dim in range(ndims):
        sizes = ndims * [size]
        sizes[dim] = size_prod
        tensor = torch.rand(*sizes, device=device)
        with torch.no_grad():
            if zeros:
                # Set 0.1 of them to zero
                p_drop = 0.1
                mask = torch.full_like(tensor, 1.0 - p_drop)
                tensor = tensor * torch.bernoulli(mask)
            else:
                tensor = tensor + 1e-3
        tensor.requires_grad_()
        grad = torch.ones_like(tensor)
        # We test both forward + backward, meaning that the speed-up is actually greater than reported
        # That being said, this is more realistic than doing `retain_graph=True`
        command = "torch.autograd.grad([tensor.cumprod(dim)], [tensor], grad_outputs=[grad])"
        if device == cuda:
            command += "; torch.cuda.synchronize()"
        ipython.magic(f"timeit {command}")
    print()


for device, zeros in product([cuda, cpu], [True, False]):
    run_test(3, 300, 10, zeros, device)
    run_test(3, 300, 100, zeros, device)
    if device == cuda:
        run_test(3, 300, 300, zeros, device)
```
</details>
<details>
<summary>CPU This PR (Some regression small tensors, x4 speed-up large tensors)</summary>
```
Zeros:
ndims: 3, tensor_size: 300, size_prod: 10, zeros: True, device: cpu
28.2 ms ± 12.1 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
29.8 ms ± 78.9 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
24.5 ms ± 29.1 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
ndims: 3, tensor_size: 300, size_prod: 100, zeros: True, device: cpu
414 ms ± 3.63 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
428 ms ± 4.12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
382 ms ± 3.18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
No Zeros:
ndims: 3, tensor_size: 300, size_prod: 10, zeros: False, device: cpu
3.11 ms ± 9.72 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.83 ms ± 3.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.08 ms ± 10.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
ndims: 3, tensor_size: 300, size_prod: 100, zeros: False, device: cpu
92.2 ms ± 113 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
101 ms ± 101 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
87 ms ± 170 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
</details>
<details>
<summary>CUDA This PR (7-30x speed-up)</summary>
```
Zeros:
ndims: 3, tensor_size: 300, size_prod: 10, zeros: True, device: cuda
1.46 ms ± 2.07 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.48 ms ± 3.51 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.93 ms ± 8.07 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
ndims: 3, tensor_size: 300, size_prod: 100, zeros: True, device: cuda
10.5 ms ± 914 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
10.6 ms ± 509 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
11.7 ms ± 864 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
ndims: 3, tensor_size: 300, size_prod: 300, zeros: True, device: cuda
30.3 ms ± 5.16 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
30.6 ms ± 6.44 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
32.2 ms ± 2.34 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
No Zeros:
ndims: 3, tensor_size: 300, size_prod: 10, zeros: False, device: cuda
248 µs ± 335 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
252 µs ± 186 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
438 µs ± 254 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
ndims: 3, tensor_size: 300, size_prod: 100, zeros: False, device: cuda
2.1 ms ± 193 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.16 ms ± 380 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.59 ms ± 398 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
ndims: 3, tensor_size: 300, size_prod: 300, zeros: False, device: cuda
6.3 ms ± 857 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
6.39 ms ± 288 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
7.15 ms ± 233 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
</details>
<details>
<summary>CPU master</summary>
```
Zeros:
ndims: 3, tensor_size: 300, size_prod: 10, zeros: True, device: cpu
8.27 ms ± 12.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
10.8 ms ± 13.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
28.2 ms ± 74.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
ndims: 3, tensor_size: 300, size_prod: 100, zeros: True, device: cpu
1.53 s ± 116 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.95 s ± 4.38 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.86 s ± 3.58 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
No Zeros:
ndims: 3, tensor_size: 300, size_prod: 10, zeros: False, device: cpu
3.42 ms ± 20 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.25 ms ± 3.65 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.34 ms ± 3.04 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
ndims: 3, tensor_size: 300, size_prod: 100, zeros: False, device: cpu
104 ms ± 148 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
117 ms ± 99.5 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
94.8 ms ± 125 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
</details>
<details>
<summary>CUDA master</summary>
```
Zeros:
ndims: 3, tensor_size: 300, size_prod: 10, zeros: True, device: cuda
912 µs ± 431 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.05 ms ± 2.46 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2.74 ms ± 381 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
ndims: 3, tensor_size: 300, size_prod: 100, zeros: True, device: cuda
71.3 ms ± 7.91 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
85.4 ms ± 9.82 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
119 ms ± 6.21 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
ndims: 3, tensor_size: 300, size_prod: 300, zeros: True, device: cuda
646 ms ± 103 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
776 ms ± 81.7 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
917 ms ± 160 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
No Zeros:
ndims: 3, tensor_size: 300, size_prod: 10, zeros: False, device: cuda
301 µs ± 893 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
308 µs ± 236 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
592 µs ± 140 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
ndims: 3, tensor_size: 300, size_prod: 100, zeros: False, device: cuda
2.61 ms ± 375 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.68 ms ± 524 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.38 ms ± 736 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
ndims: 3, tensor_size: 300, size_prod: 300, zeros: False, device: cuda
7.89 ms ± 848 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
8.03 ms ± 517 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
9.24 ms ± 405 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
</details>
cc nikitaved
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53711
Reviewed By: jbschlosser
Differential Revision: D27059662
Pulled By: anjali411
fbshipit-source-id: be610d5590c0199b4412dff66fac47666faaff9d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53583
`Scalar` takes 32 bytes because `c10::complex<double>` requires 16-byte alignment. Passing `Scalar` by const reference shows about a 1% improvement in instruction count.
All the changes in this commit are codemoded except for
the following 4 files (which code-gen signatures):
```
tools/codegen/api/cpp.py
tools/codegen/api/native.py
tools/codegen/api/structured.py
caffe2/contrib/aten/gen_op.py
```
# Codemod
## Main Step
For the codemod part, here is the main command used:
```
fastmod --extensions h '([a-zA-Z_+]\([^)]*,?\s*)Scalar (\w+)' '${1}const Scalar& ${2}'
fastmod --extensions h '([a-zA-Z_+]\([^)]*,?\s*)optional<Scalar> (\w+)' '${1}const optional<Scalar>& ${2}'
fastmod --extensions cpp '([a-zA-Z_+]\([^)]*,?\s*)Scalar (\w+)' '${1}const Scalar& ${2}'
fastmod --extensions cpp '([a-zA-Z_+]\([^)]*,?\s*)optional<Scalar> (\w+)' '${1}const optional<Scalar>& ${2}'
```
As you can tell, it codemods both `Scalar` and `optional<Scalar>`. Apply these commands iteratively until reaching a fix-point (since one method signature might contain multiple `Scalar` parameters).
In retrospect, excluding `third_party` and `torch/csrc/jit` would have been a good idea (I reverted those changes manually later; see https://github.com/pytorch/pytorch/pull/53479 for reference).
## Pre-Step
Prior to applying the main command, since some `Scalar` parameters appear as `at::Scalar` or `c10::Scalar`, I codemodded some of them in advance. Here is an incomplete list:
```
fastmod --extensions h '([a-zA-Z_+]\([^)]*,?\s*)at::Scalar (\w+)' '${1}const at::Scalar& ${2}'
fastmod --extensions cpp '([a-zA-Z_+]\([^)]*,?\s*)at::Scalar (\w+)' '${1}const at::Scalar& ${2}'
fastmod --extensions h '([a-zA-Z_+]\([^)]*,?\s*)c10::optional<Scalar> (\w+)' '${1}const c10::optional<Scalar>& ${2}'
fastmod --extensions cpp '([a-zA-Z_+]\([^)]*,?\s*)c10::optional<Scalar> (\w+)' '${1}const c10::optional<Scalar>& ${2}'
```
## Fixup
There are a couple of post-codemod fixups. For example, `const Scalar` gets codemodded into `const const Scalar&`, and `at::Scalar` into `at::const Scalar&` (if the pre-step is not done comprehensively). Here is an incomplete list:
```
fastmod --extensions cpp 'const const Scalar' 'const Scalar'
fastmod --extensions h 'const const c10::optional<Scalar>' 'const c10::optional<Scalar>'
fastmod --extensions cpp 'const const c10::optional<Scalar>' 'const c10::optional<Scalar>'
fastmod 'at::const Scalar&' 'const at::Scalar&'
```
## Supplementary
`.cu` and `.mm` files also need to be codemodded, for example:
```
fastmod --extensions cu 'at::const Scalar&' 'const at::Scalar&'
fastmod --extensions mm '([a-zA-Z_+]\([^)]*,?\s*)Scalar (\w+)' '${1}const Scalar& ${2}'
```
Function pointers are not codemoded. Here is an incomplete list:
```
# Cover case: using index_fill_fn = void(*)(TensorIterator & iter, int64_t dim, int64_t self_dim_size, int64_t self_dim_stride, Scalar source);
fastmod --extensions h '(void\s*\(\s*\*\s*\)\([^)]*,?\s*)Scalar (\w+)' '${1}const Scalar& ${2}'
# Cover case: using softplus_fn = void (*)(TensorIterator&, Scalar, Scalar);
fastmod --extensions h '(void\s*\(\s*\*\s*\)\([^)]*,?\s*)Scalar([, \)])' '${1}const Scalar&${2}'
fastmod --extensions cpp '(void\s*\(\s*\*\s*\)\([^)]*,?\s*)Scalar([, \)])' '${1}const Scalar&${2}'
fastmod --extensions h '(void\s*\(\s*\*\s*\)\([^)]*,?\s*)optional<Scalar>([, \)])' '${1}const optional<Scalar>&${2}'
```
Some corner cases need to be fixed manually.
ghstack-source-id: 123970306
Test Plan: Imported from OSS
Reviewed By: smessmer
Differential Revision: D26904445
fbshipit-source-id: 8d8a002af4b5125f153a32f03c6956be7ae5671d
Summary:
As per title. Compared to the previous version, it is lighter on the usage of `at::solve` and `at::matmul` methods.
Fixes https://github.com/pytorch/pytorch/issues/51621
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52875
Reviewed By: mrshenli
Differential Revision: D26768653
Pulled By: anjali411
fbshipit-source-id: aab141968d02587440128003203fed4b94c4c655
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49683
This PR fixes the backward-through-`sparse_coo_tensor` bug by implementing a `sparse_mask_helper` function for n-dimensional sparse tensors on CPU and CUDA, which is used to reimplement the `sparse_constructor_values_backward` function.
A `sparse_mask` function was previously implemented for the backward of sparse-sparse matmul. However, the algorithm here is a little different because it must be applicable not only to matrices but to n-dimensional tensors. Thankfully it was not too hard to extend, and now both share the same code base.
Note that no new tests are required because the backward for sparse-sparse matmul now uses the new `sparse_mask_helper`.
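A small sketch of the construct-then-backward pattern this fixes (indices, values and shape are illustrative):
```python
import torch

i = torch.tensor([[0, 1, 1],
                  [2, 0, 2]])
v = torch.tensor([3., 4., 5.], requires_grad=True)
s = torch.sparse_coo_tensor(i, v, (2, 3))

s.to_dense().sum().backward()   # gradient flows back to the dense `values`
print(v.grad)                   # tensor([1., 1., 1.])
```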
ngimel, mruberry - kindly review this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50361
Reviewed By: zhangguanheng66
Differential Revision: D26270483
Pulled By: ngimel
fbshipit-source-id: ee4bda49ff86e769342674b64d3c4bc34eae38ef
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51706
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50280
As mentioned in gh-43874, this adds a `rounding_mode={'true', 'trunc', 'floor'}`
argument so `torch.div` can be used as a replacement for `floor_divide` during
the transitional period.
I've included dedicated kernels for truncated and floor division which
aren't strictly necessary for float, but do perform significantly better (~2x) than
doing true division followed by a separate rounding kernel.
Note: I introduce new overloads for `aten::div` instead of just adding a default
`rounding_mode` because various JIT passes rely on the exact operator schema.
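A sketch of the rounding modes described above (note: in released PyTorch, true division is the default rather than an explicit `'true'` string):
```python
import torch

a = torch.tensor([ 7., -7.])
b = torch.tensor([ 2.,  2.])

torch.div(a, b)                          # true division:     tensor([ 3.5000, -3.5000])
torch.div(a, b, rounding_mode='trunc')   # round toward zero: tensor([ 3., -3.])
torch.div(a, b, rounding_mode='floor')   # round toward -inf: tensor([ 3., -4.])  == floor_divide
```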
Test Plan: Imported from OSS
Reviewed By: ngimel
Differential Revision: D26123271
Pulled By: mruberry
fbshipit-source-id: 51a83717602114597ec9c4d946e35a392eb01d46
Summary:
**BC-breaking note:**
torch.svd() added support for complex inputs in PyTorch 1.7, but was not documented as doing so. The complex "V" tensor returned was actually the complex conjugate of what's expected. This PR fixes the discrepancy.
This will silently break all users of torch.svd() with complex inputs.
**Original PR Summary:**
This PR resolves https://github.com/pytorch/pytorch/issues/45821.
The problem was that when introducing the support of complex inputs for `torch.svd` it was overlooked that LAPACK/MAGMA returns the conjugate transpose of V matrix, not just the transpose of V. So `torch.svd` was silently returning U, S, V.conj() instead of U, S, V.
Behavior of `torch.linalg.pinv`, `torch.pinverse` and `torch.linalg.svd` (they depend on `torch.svd`) is not changed in this PR.
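A minimal sketch checking the convention after the fix:
```python
import torch

A = torch.randn(4, 4, dtype=torch.cdouble)
U, S, V = torch.svd(A)

# V is now returned un-conjugated, so the reconstruction is A = U @ diag(S) @ V^H
A_rec = U @ torch.diag(S).to(A.dtype) @ V.conj().transpose(-2, -1)
assert torch.allclose(A_rec, A)
```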
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51012
Reviewed By: bdhirsh
Differential Revision: D26047593
Pulled By: albanD
fbshipit-source-id: d1e08dbc3aab9ce1150a95806ef3b5da98b5d3ca
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48719
Attempt to break this PR (https://github.com/pytorch/pytorch/pull/33019) into two parts. As per our discussion with eellison, the first part is to make sure our aten::slice operator take optional parameters for begin/step/end. This will help with refactoring ir_emitter.cpp for genering handling for list and slice striding. Once this PR merged, we will submit a second PR with compiler change.
Test Plan:
None for this PR, but new tests will be added for the second part.
Imported from OSS
Reviewed By: jamesr66a
Differential Revision: D25929902
fbshipit-source-id: 5385df04e6d61ded0699b09bbfec6691396b56c3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33884
Mitigates https://github.com/pytorch/pytorch/issues/5261.
It's not possible for us to support cudnn RNN double backward due to limitations in the cudnn API. This PR makes it so that we raise an error if users try to compute the double backward of a cudnn RNN; the error message suggests using the non-cudnn RNN instead.
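A sketch of the suggested workaround: fall back to the non-cudnn implementation when a double backward is needed (requires a CUDA build; the flags-based fallback here is one way to do it, not necessarily the exact wording of the new error message):
```python
import torch

rnn = torch.nn.LSTM(8, 8).cuda()
x = torch.randn(5, 3, 8, device='cuda', requires_grad=True)

with torch.backends.cudnn.flags(enabled=False):  # the non-cudnn RNN supports double backward
    out, _ = rnn(x)
    (grad_x,) = torch.autograd.grad(out.sum(), x, create_graph=True)
    grad_x.sum().backward()
```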
Test Plan: - added some tests to check the error message
Reviewed By: albanD
Differential Revision: D20143544
Pulled By: zou3519
fbshipit-source-id: c2e49b3d8bdb9b34b561f006150e4c7551a78fac
Summary:
This PR adds `torch.linalg.slogdet`.
Changes compared to the original torch.slogdet:
- Complex input now works as in NumPy
- Added out= variant (allocates temporary and makes a copy for now)
- Updated `slogdet_backward` to work with complex input
Ref. https://github.com/pytorch/pytorch/issues/42666
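A short sketch of the NumPy-compatible behaviour on complex input described in the first bullet:
```python
import torch

A = torch.randn(3, 3, dtype=torch.cdouble)
sign, logabsdet = torch.linalg.slogdet(A)

# For complex input, `sign` is a complex number of modulus 1 and `logabsdet` is real,
# so sign * exp(logabsdet) recovers the determinant.
assert torch.allclose(sign * torch.exp(logabsdet), torch.linalg.det(A))
```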
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49194
Reviewed By: VitalyFedyunin
Differential Revision: D25916959
Pulled By: mruberry
fbshipit-source-id: cf9be8c5c044870200dcce38be48cd0d10e61a48
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50632
I'll port the following method tests in follow-up PRs:
`'baddbmm', 'addbmm', 'addmv', 'addr'`
After the tests are ported to OpInfo based tests, it would also be much easier to add tests with complex alpha and beta values.
Edit: it seems hard to port the broadcasting-variant tests because one ends up skipping `test_inplace_grad` and `test_variant_consistency_eager` even when the inputs do not need to be broadcast.
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D25947471
Pulled By: anjali411
fbshipit-source-id: 9faa7f1fd55a1269bad282adac2b39d19bfa4591
Summary:
Building on top of the work of anjali411 (https://github.com/pytorch/pytorch/issues/46640)
Things added in this PR:
1. Modify backward and double-backward formulas
2. Add complex support for `new module tests` and criterion tests (and add complex tests for L1)
3. Modify some existing tests to support complex
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49912
Reviewed By: zhangguanheng66
Differential Revision: D25853036
Pulled By: soulitzer
fbshipit-source-id: df619f1b71c450ab2818eb17804e0c55990aa8ad
Summary:
Fixes https://github.com/pytorch/pytorch/issues/47671
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49272
Test Plan:
```
x = torch.tensor([-2, -1, 0, 1, 2], dtype=torch.float32, requires_grad=True)
y = torch.nn.functional.elu_(x.clone(), alpha=-2)
grads = torch.ones_like(y)
y.backward(grads)
```
```
RuntimeError: In-place elu backward calculation is triggered with a negative slope which is not supported.
This is caused by calling in-place forward function with a negative slope, please call out-of-place
version instead.
```
Reviewed By: albanD
Differential Revision: D25569839
Pulled By: H-Huang
fbshipit-source-id: e3c6c0c2c810261566c10c0cc184fd81b280c650
Summary:
As per title.
CC IvanYashchuk (unfortunately I cannot add you as a reviewer for some reason).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50109
Reviewed By: gchanan
Differential Revision: D25828536
Pulled By: albanD
fbshipit-source-id: 3791c3dd4f5c2a2917eac62e6527ecd1edcb400d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49138
See for details: https://fb.quip.com/QRtJAin66lPN
We need to model optional types explicitly, mostly for schema inference. So we cannot pass a `Tensor?[]` as `ArrayRef<Tensor>`, instead we need to pass it as an optional type. This PR changes it to `torch::List<c10::optional<Tensor>>`. It also makes the ops c10-full that were blocked by this.
## Backwards Compatibility
- This should not break the Python API because the representation in Python is the same and python_arg_parser just transforms the python list into a `List<optional<Tensor>>` instead of into a `List<Tensor>`.
- This should not break serialized models because there's some logic that allows loading a serialized `List<Tensor>` as `List<optional<Tensor>>`, see https://github.com/pytorch/pytorch/pull/49138/files#diff-9315f5dd045f47114c677174dcaa2f982721233eee1aa19068a42ff3ef775315R57
- This will break backwards compatibility for the C++ API. There is no implicit conversion from `ArrayRef<Tensor>` (which was the old argument type) to `List<optional<Tensor>>`. One common call pattern is `tensor.index({indices_tensor})`, where indices_tensor is another `Tensor`, and that will continue working because the `{}` initializer_list constructor for `List<optional<Tensor>>` can take `Tensor` elements that are implicitly converted to `optional<Tensor>`. But another common call pattern was `tensor.index(indices_tensor)`, where previously the `Tensor` got implicitly converted to an `ArrayRef<Tensor>`; implicitly converting `Tensor -> optional<Tensor> -> List<optional<Tensor>>` would be two implicit conversions, and C++ doesn't allow chaining two implicit conversions. So those call sites have to be rewritten to `tensor.index({indices_tensor})`.
ghstack-source-id: 119269131
Test Plan:
## Benchmarks (C++ instruction counts):
### Forward
#### Script
```py
from torch.utils.benchmark import Timer
counts = Timer(
    stmt="""
        auto t = {{op call to measure}};
    """,
    setup="""
        using namespace torch::indexing;
        auto x = torch::ones({4, 4, 4});
    """,
    language="cpp",
).collect_callgrind(number=1_000)
print(counts)
```
#### Results
| Op call |before |after |delta |delta %|
|------------------------------------------------------------------------|---------|--------|-------|------|
|x[0] = 1 |11566015 |11566015|0 |0.00% |
|x.index({0}) |6807019 |6801019 |-6000 |-0.09%|
|x.index({0, 0}) |13529019 |13557019|28000 |0.21% |
|x.index({0, 0, 0}) |10677004 |10692004|15000 |0.14% |
|x.index({"..."}) |5512015 |5506015 |-6000 |-0.11%|
|x.index({Slice(None, None, None)}) |6866016 |6936016 |70000 |1.02% |
|x.index({None}) |8554015 |8548015 |-6000 |-0.07%|
|x.index({false}) |22400000 |22744000|344000 |1.54% |
|x.index({true}) |27624088 |27264393|-359695|-1.30%|
|x.index({"...", 0, true, Slice(1, None, 2), torch::tensor({1, 2})})|123472000|123463306|-8694|-0.01%|
### Autograd
#### Script
```py
from torch.utils.benchmark import Timer
counts = Timer(
    stmt="""
        auto t = {{op call to measure}};
    """,
    setup="""
        using namespace torch::indexing;
        auto x = torch::ones({4, 4, 4}, torch::requires_grad());
    """,
    language="cpp",
).collect_callgrind(number=1_000)
print(counts)
```
Note: the script measures the **forward** path of an op call with autograd enabled (i.e. calls into VariableType). It does not measure the backward path.
#### Results
| Op call |before |after |delta |delta %|
|------------------------------------------------------------------------|---------|--------|-------|------|
|x.index({0}) |14839019|14833019|-6000| 0.00% |
|x.index({0, 0}) |28342019|28370019|28000| 0.00% |
|x.index({0, 0, 0}) |24434004|24449004|15000| 0.00% |
|x.index({"..."}) |12773015|12767015|-6000| 0.00% |
|x.index({Slice(None, None, None)}) |14837016|14907016|70000| 0.47% |
|x.index({None}) |15926015|15920015|-6000| 0.00% |
|x.index({false}) |36958000|37477000|519000| 1.40% |
|x.index({true}) |41971408|42426094|454686| 1.08% |
|x.index({"...", 0, true, Slice(1, None, 2), torch::tensor({1, 2})}) |168184392|164545682|-3638710| -2.16% |
Reviewed By: bhosmer
Differential Revision: D25454632
fbshipit-source-id: 28ab0cffbbdbdff1c40b4130ca62ee72f981b76d
Summary:
I am opening this PR early to have a place to discuss design issues.
The biggest difference between `torch.qr` and `numpy.linalg.qr` is that the former takes a boolean parameter `some=True`, while the latter takes a string parameter `mode='reduced'`, which can be one of the following:
* `reduced`: completely equivalent to `some=True`; both are the default.
* `complete`: completely equivalent to `some=False`.
* `r`: returns only `r` instead of a tuple `(r, q)`. We have already decided that we don't want different return types depending on the parameters, so I propose to return `(r, empty_tensor)` instead. I **think** that in this mode it will be impossible to implement the backward pass, so we should raise an appropriate error in that case.
* `raw`: returns `(h, tau)` instead of `(q, r)`. Internally, `h` and `tau` are obtained by calling LAPACK's `dgeqrf` and are later used to compute the actual values of `(q, r)`. The numpy docs suggest that these might be useful for calling other LAPACK functions, but at the moment none of them is exposed by numpy, and I don't know how often it is used in the real world. Implementing the backward pass needs some attention: the most straightforward solution is to use `(h, tau)` to compute `(q, r)` and then use the normal logic for `qr_backward`, but there might be faster alternatives.
* `full`, `f`: alias for `reduced`, deprecated since numpy 1.8.0.
* `economic`, `e`: similar to `raw` but returns only `h` instead of `(h, tau)`. Deprecated since numpy 1.8.0.
To summarize:
* `reduced`, `complete` and `r` are straightforward to implement (see the sketch after this list).
* `raw` needs a bit of extra care, but I don't know how high priority it is: since it is used rarely, we might want to not support it right now and maybe implement it in the future?
* I think we should just leave `full` and `economic` out, and possibly add a note to the docs explaining what you need to use instead
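For reference, a sketch of how the retained modes look in `torch.linalg.qr` (leaving out `raw`/`full`/`economic`, as proposed above):
```python
import torch

A = torch.randn(5, 3)

Q, R = torch.linalg.qr(A)                   # mode='reduced' (default): Q is 5x3, R is 3x3
Q, R = torch.linalg.qr(A, mode='complete')  # Q is 5x5, R is 5x3
Q, R = torch.linalg.qr(A, mode='r')         # Q is an empty tensor; only R is computed
```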
/cc mruberry
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47764
Reviewed By: ngimel
Differential Revision: D25708870
Pulled By: mruberry
fbshipit-source-id: c25c70a23a02ec4322430d636542041e766ebe1b