Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39955
resolves https://github.com/pytorch/pytorch/issues/36323 by adding `torch.sgn` for complex tensors.
`torch.sgn` returns `x/abs(x)` for `x != 0` and returns `0 + 0j` for `x==0`
This PR doesn't test the correctness of the gradients. That will be done in the future as part of auditing all the ops, once we decide the autograd behavior (JAX vs TF) and add gradcheck.
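As a rough illustration of the definition above, the scalar rule can be sketched in plain Python. The helper name `sgn` below is a stand-in written for this note, not the ATen kernel, and it works on Python `complex` scalars rather than tensors.

```python
def sgn(z: complex) -> complex:
    """Scalar sketch of the rule: z / |z| for z != 0, and 0 + 0j for z == 0."""
    magnitude = abs(z)
    if magnitude == 0:
        return 0j
    # Dividing both components by |z| yields a value of magnitude 1.
    return complex(z.real / magnitude, z.imag / magnitude)


print(sgn(3 + 4j))  # unit-magnitude value with the same direction as 3 + 4j
print(sgn(0j))      # zero maps to zero
```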
Test Plan: Imported from OSS
Reviewed By: mruberry
Differential Revision: D23460526
Pulled By: anjali411
fbshipit-source-id: 70fc4e14e4d66196e27cf188e0422a335fc42f92
Summary:
These aliases are consistent with NumPy. Note that C++'s naming would be different (std::multiplies and std::divides), and that PyTorch's existing names (mul and div) are consistent with Python's dunders.
This also improves the instructions for adding an alias to clarify that dispatch keys should be removed when copying native_functions.yaml entries to create the alias entries.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44463
Reviewed By: ngimel
Differential Revision: D23670782
Pulled By: mruberry
fbshipit-source-id: 9f1bdf8ff447abc624ff9e9be7ac600f98340ac4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44393
torch.quantile now correctly propagates NaN, and this PR implements torch.nanquantile, which is similar to numpy.nanquantile.
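The difference between the two behaviors can be sketched in plain Python. This is only an illustration, assuming linear interpolation between the two nearest ranks (NumPy's default); the function names mirror the ops but the bodies are not the ATen implementation.

```python
import math


def quantile(values, q):
    """Linear-interpolation quantile that returns NaN if any input is NaN."""
    if any(math.isnan(v) for v in values):
        return math.nan
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    weight = position - lower
    return ordered[lower] + weight * (ordered[upper] - ordered[lower])


def nanquantile(values, q):
    """Same computation, but NaN entries are dropped first."""
    kept = [v for v in values if not math.isnan(v)]
    if not kept:
        return math.nan  # nothing left to rank
    return quantile(kept, q)
```

With input `[1.0, nan, 3.0]` and `q = 0.5`, the first function propagates the NaN while the second ignores it and interpolates over the remaining values.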
Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D23649613
Pulled By: heitorschueroff
fbshipit-source-id: 5201d076745ae1237cedc7631c28cf446be99936
Summary:
This PR:
- updates div to perform true division
- makes torch.true_divide an alias of torch.div
This follows on work in previous PyTorch releases that first deprecated div performing "integer" or "floor" division, then prevented it by throwing a runtime error.
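For readers unfamiliar with the distinction, here is a plain-Python sketch of the three division flavors on integers. The helper names are illustrative only; which rounding the old integer behavior of div used is not stated in this message, so both common variants are shown.

```python
import math


def true_divide(a: int, b: int) -> float:
    """True division: the result keeps its fractional part."""
    return a / b


def floor_divide(a: int, b: int) -> int:
    """Floor division: rounds toward negative infinity."""
    return a // b


def trunc_divide(a: int, b: int) -> int:
    """C-style integer division: rounds toward zero."""
    return math.trunc(a / b)


print(true_divide(7, 2), floor_divide(-7, 2), trunc_divide(-7, 2))
```

The two integer variants only disagree when the operands have opposite signs, which is one reason the "integer" versus "floor" wording matters.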
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42907
Reviewed By: ngimel
Differential Revision: D23622114
Pulled By: mruberry
fbshipit-source-id: 414c7e3c1a662a6c3c731ad99cc942507d843927
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44562
Adds a note that torch.median returns the smaller of the two middle elements for even-sized input, and refers users to torch.quantile for the mean of the middle values.
fixes https://github.com/pytorch/pytorch/issues/39520
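The two conventions in that note can be seen with Python's standard `statistics` module, used here purely as an analogy for the documented behavior (it is not what torch.median calls internally).

```python
import statistics

data = [1.0, 2.0, 3.0, 4.0]  # even number of elements

lower_middle = statistics.median_low(data)  # smaller of the two middle elements
mean_of_middle = statistics.median(data)    # mean of the two middle elements

print(lower_middle, mean_of_middle)
```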
Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D23657208
Pulled By: heitorschueroff
fbshipit-source-id: 2747aa652d1e7f10229d9299b089295aeae092c2
Summary:
This fixes a `katex` error I was getting trying to build the docs:
```
ParseError: KaTeX parse error: Undefined control sequence: \0 at position 55: …gin{cases}
```
This failure was introduced in https://github.com/pytorch/pytorch/issues/42523.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44481
Reviewed By: colesbury
Differential Revision: D23627700
Pulled By: mruberry
fbshipit-source-id: 9cc09c687a7d9349da79a0ac87d6c962c9cfbe2d
Summary:
**BC-breaking note**
This change is BC-breaking for C++ callers of linspace and logspace if they were providing a steps argument that could not be converted to an optional.
**PR note**
This PR deprecates calling linspace and logspace without setting steps explicitly by:
- updating the documentation to warn that not setting steps is deprecated
- warning (once) when linspace and logspace are called without steps being specified
A test for this behavior is added to test_tensor_creation_ops. The warning only appears once per process, however, so the test would pass even if no warning were thrown. Ideally there would be a mechanism to force all warnings, including those from TORCH_WARN_ONCE, to trigger.
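The once-per-process limitation can be reproduced with a small Python sketch. `warn_once` and `count_warnings` are hypothetical helpers written for this note; they only imitate the general idea behind TORCH_WARN_ONCE and are not PyTorch APIs.

```python
import warnings

_already_warned = set()


def warn_once(key: str, message: str) -> None:
    """Emit `message` the first time `key` is seen in this process, then stay silent."""
    if key in _already_warned:
        return
    _already_warned.add(key)
    warnings.warn(message, UserWarning)


def count_warnings(calls: int) -> int:
    """Call warn_once `calls` times and report how many warnings were actually emitted."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        for _ in range(calls):
            warn_once("steps", "not providing steps is deprecated")
    return len(caught)


first_run = count_warnings(3)   # only the first call warns
second_run = count_warnings(3)  # a later check in the same process sees nothing
```

A test that runs after the warning has already fired elsewhere in the process observes zero warnings, so it cannot require one without becoming order-dependent.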
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43860
Reviewed By: izdeby
Differential Revision: D23498980
Pulled By: mruberry
fbshipit-source-id: c48d7a58896714d184cb6ff2a48e964243fafc90
Summary:
1) Ports nonzero from THC to ATen.
2) Replaces most thrust uses with cub, to avoid synchronization and to improve performance. There is still one necessary synchronization point: communicating the number of nonzero elements from GPU to CPU.
3) Slightly changes the algorithm: we now first compute the number of nonzeros and then allocate a correctly sized output, instead of allocating a full-sized output, as was done before, to account for possibly all elements being nonzero.
4) Unfortunately, since the last transforms are still done with thrust, 2) is slightly beside the point; however, it is a step towards a future without thrust.
5) Hard-limits the number of elements in the input tensor to MAX_INT. The previous implementation allocated a Long tensor of size ndim*nelements, which would be at least 16 GB for a tensor with MAX_INT elements. It is reasonable to say that larger tensors could not be used anyway.
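The count-then-allocate idea and the memory estimate can be sketched in plain Python. This is a sequential, one-dimensional illustration with a made-up helper name (`nonzero_1d`); the real change is a CUDA kernel built on cub, which this does not attempt to model.

```python
def nonzero_1d(values):
    """Two-pass sketch: count nonzeros, allocate exactly that many slots, then fill."""
    count = sum(1 for v in values if v != 0)  # pass 1: number of nonzero elements
    out = [0] * count                         # allocate a correctly sized output
    write = 0
    for index, v in enumerate(values):        # pass 2: record the nonzero indices
        if v != 0:
            out[write] = index
            write += 1
    return out


# Size of the old up-front allocation: ndim * nelements int64 slots.
MAX_INT = 2**31 - 1
BYTES_PER_INT64 = 8
old_bytes_1d = 1 * MAX_INT * BYTES_PER_INT64
print(old_bytes_1d / 2**30)  # GiB needed for a 1-D input with MAX_INT elements
```

Even for `ndim == 1` the old allocation comes to just under 2**34 bytes, and it grows linearly with the number of dimensions.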
Benchmarking is done for tensors with approximately half non-zeros
<details><summary>Benchmarking script</summary>
<p>
```
import sys

import torch
from torch.utils._benchmark import Timer
from torch.utils._benchmark import Compare

device = "cuda"
results = []
for numel in (1024 * 128,):  # , 1024 * 1024, 1024 * 1024 * 128):
    inp = torch.randint(2, (numel,), device="cuda", dtype=torch.float)
    for ndim in range(2, 3):  # (1,4):
        if ndim == 1:
            shape = (numel,)
        elif ndim == 2:
            shape = (1024, numel // 1024)
        else:
            shape = (1024, 128, numel // 1024 // 128)
        inp = inp.reshape(shape)
        repeats = 3
        timer = Timer(stmt="torch.nonzero(inp, as_tuple=False)", label="Nonzero",
                      sub_label=f"number of elts {numel}",
                      description=f"ndim {ndim}", globals=globals())
        for i in range(repeats):
            results.append(timer.blocked_autorange())
        print(f"\rnumel {numel} ndim {ndim}", end="")
        sys.stdout.flush()

comparison = Compare(results)
comparison.print()
```
</p>
</details>
### Results
Before:
```
[--------------------------- Nonzero ---------------------------]
| ndim 1 | ndim 2 | ndim 3
1 threads: ------------------------------------------------------
number of elts 131072 | 55.2 | 71.7 | 90.5
number of elts 1048576 | 113.2 | 250.7 | 497.0
number of elts 134217728 | 8353.7 | 23809.2 | 54602.3
Times are in microseconds (us).
```
After:
```
[-------------------------- Nonzero --------------------------]
| ndim 1 | ndim 2 | ndim 3
1 threads: ----------------------------------------------------
number of elts 131072 | 48.6 | 79.1 | 90.2
number of elts 1048576 | 64.7 | 134.2 | 161.1
number of elts 134217728 | 3748.8 | 7881.3 | 9953.7
Times are in microseconds (us).
```
There is a real regression for smallish 2D tensors due to the added work of computing the number of nonzero elements. For other sizes, however, there are significant gains, along with drastically lower memory requirements. Perf gains would be even larger for tensors with fewer nonzeros.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44259
Reviewed By: izdeby
Differential Revision: D23581955
Pulled By: ngimel
fbshipit-source-id: 0b99a767fd60d674003d83f0848dc550d7a363dc
Summary:
When var and std are called without args (other than unbiased) they currently call into TH or THC. This PR:
- Removes the THC var_all and std_all functions and updates CUDA var and std to use the ATen reduction
- Fixes var's docs, which listed its arguments in the incorrect order
- Adds new tests comparing var and std with their NumPy counterparts
Performance appears to have improved as a result of this change. I ran experiments on 1D tensors, 1D tensors with every other element viewed ([::2]), 2D tensors and 2D transposed tensors. Some notable datapoints:
- torch.randn((8000, 8000))
  - var measured 0.0022215843200683594s on CUDA before the change
  - var measured 0.0020322799682617188s on CUDA after the change
- torch.randn((8000, 8000)).T
  - var measured 0.015128850936889648s on CUDA before the change
  - var measured 0.001912832260131836s on CUDA after the change
- torch.randn(8000 ** 2)
  - std measured 0.11031460762023926s on CUDA before the change
  - std measured 0.0017833709716796875s on CUDA after the change
Timings for var and std are, as expected, similar.
On the CPU, however, the performance change from making the analogous update was more complicated, and ngimel and I decided not to remove CPU var_all and std_all. ngimel wrote the following script that showcases how single-threaded CPU inference would suffer from this change:
```
import sys

import torch
import numpy as np
from torch.utils._benchmark import Timer
from torch.utils._benchmark import Compare

base = 8
multiplier = 1


def stdfn(a):
    # Hand-written (biased) standard deviation, used as a baseline
    meanv = a.mean()
    ac = a - meanv
    return torch.sqrt(((ac * ac).sum()) / a.numel())


results = []
num_threads = 1
for _ in range(7):
    size = base * multiplier
    input = torch.randn(size)
    tasks = [("torch.var(input)", "torch_var"),
             ("torch.var(input, dim=0)", "torch_var0"),
             ("stdfn(input)", "stdfn"),
             ("torch.sum(input, dim=0)", "torch_sum0")
             ]
    timers = [Timer(stmt=stmt, num_threads=num_threads, label="Index", sub_label=f"{size}",
                    description=label, globals=globals()) for stmt, label in tasks]
    repeats = 3
    for i, timer in enumerate(timers * repeats):
        results.append(
            timer.blocked_autorange()
        )
        print(f"\r{i + 1} / {len(timers) * repeats}", end="")
        sys.stdout.flush()
    multiplier *= 10
    print()

comparison = Compare(results)
comparison.print()
```
The TH timings using this script on my devfair are:
```
[------------------------------ Index ------------------------------]
| torch_var | torch_var0 | stdfn | torch_sum0
1 threads: ----------------------------------------------------------
8 | 16.0 | 5.6 | 40.9 | 5.0
80 | 15.9 | 6.1 | 41.6 | 4.9
800 | 16.7 | 12.0 | 42.3 | 5.0
8000 | 27.2 | 72.7 | 51.5 | 6.2
80000 | 129.0 | 715.0 | 133.0 | 18.0
800000 | 1099.8 | 6961.2 | 842.0 | 112.6
8000000 | 11879.8 | 68948.5 | 20138.4 | 1750.3
```
and the ATen timings are:
```
[------------------------------ Index ------------------------------]
| torch_var | torch_var0 | stdfn | torch_sum0
1 threads: ----------------------------------------------------------
8 | 4.3 | 5.4 | 41.4 | 5.4
80 | 4.9 | 5.7 | 42.6 | 5.4
800 | 10.7 | 11.7 | 43.3 | 5.5
8000 | 69.3 | 72.2 | 52.8 | 6.6
80000 | 679.1 | 676.3 | 129.5 | 18.1
800000 | 6770.8 | 6728.8 | 819.8 | 109.7
8000000 | 65928.2 | 65538.7 | 19408.7 | 1699.4
```
which demonstrates that performance is analogous to calling the existing var and std with `dim=0` on a 1D tensor. This would be a significant performance hit. Another simple script shows the performance is mixed when using multiple threads, too:
```
import torch
import time

# Benchmarking var and std, 1D with varying sizes
base = 8
multiplier = 1

op = torch.var
reps = 1000

for _ in range(7):
    size = base * multiplier
    t = torch.randn(size)
    elapsed = 0
    for _ in range(reps):
        start = time.time()
        op(t)
        end = time.time()
        elapsed += end - start
    multiplier *= 10

    print("Size: ", size)
    print("Avg. elapsed time: ", elapsed / reps)
```
```
var cpu TH vs ATen timings
Size: 8
Avg. elapsed time: 1.7853736877441406e-05 vs 4.9788951873779295e-06 (ATen wins)
Size: 80
Avg. elapsed time: 1.7803430557250977e-05 vs 6.156444549560547e-06 (ATen wins)
Size: 800
Avg. elapsed time: 1.8569469451904296e-05 vs 1.2302875518798827e-05 (ATen wins)
Size: 8000
Avg. elapsed time: 2.8756141662597655e-05 vs. 6.97789192199707e-05 (TH wins)
Size: 80000
Avg. elapsed time: 0.00026622867584228516 vs. 0.0002447957992553711 (ATen wins)
Size: 800000
Avg. elapsed time: 0.0010556647777557374 vs 0.00030616092681884767 (ATen wins)
Size: 8000000
Avg. elapsed time: 0.009990205764770508 vs 0.002938544034957886 (ATen wins)
std cpu TH vs ATen timings
Size: 8
Avg. elapsed time: 1.6681909561157225e-05 vs. 4.659652709960938e-06 (ATen wins)
Size: 80
Avg. elapsed time: 1.699185371398926e-05 vs. 5.431413650512695e-06 (ATen wins)
Size: 800
Avg. elapsed time: 1.768803596496582e-05 vs. 1.1279821395874023e-05 (ATen wins)
Size: 8000
Avg. elapsed time: 2.7791500091552735e-05 vs 7.031106948852539e-05 (TH wins)
Size: 80000
Avg. elapsed time: 0.00018650460243225096 vs 0.00024368906021118164 (TH wins)
Size: 800000
Avg. elapsed time: 0.0010522041320800782 vs 0.0003039860725402832 (ATen wins)
Size: 8000000
Avg. elapsed time: 0.009976618766784668 vs. 0.0029211788177490234 (ATen wins)
```
These results show the TH solution still performs better than the ATen solution with default threading for some sizes.
It seems like removing CPU var_all and std_all will require an improvement in ATen reductions. https://github.com/pytorch/pytorch/issues/40570 has been updated with this information.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43858
Reviewed By: zou3519
Differential Revision: D23498981
Pulled By: mruberry
fbshipit-source-id: 34bee046c4872d11c3f2ffa1b5beee8968b22050
Summary:
This PR adds the following aliases:
- not_equal for torch.ne
- greater for torch.gt
- greater_equal for torch.ge
- less for torch.lt
- less_equal for torch.le
These aliases are consistent with NumPy's naming for these functions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43870
Reviewed By: zou3519
Differential Revision: D23498975
Pulled By: mruberry
fbshipit-source-id: 78560df98c9f7747e804a420c1e53fd1dd225002
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43270
`torch.conj` is a very commonly used operator for complex tensors, but it's mathematically a no-op for real tensors. Switching to TensorFlow gradients for complex tensors (as discussed in #41857) would involve adding `torch.conj()` to the backward definitions for a lot of operators. In order to preserve autograd performance for real tensors and maintain NumPy compatibility for `torch.conj`, this PR updates `torch.conj()` so that it behaves the same as before for complex tensors but performs a view/returns the `self` tensor for tensors of non-complex dtypes. The documentation states that the returned tensor for a real input shouldn't be mutated. We could perhaps return an immutable tensor for this case in the future, when that functionality is available (zdevito ezyang).
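The aliasing consequence can be illustrated with a plain-Python sketch over lists. The `conj` helper below is a toy written for this note, not the tensor op; it only shows why the result for real input should not be mutated.

```python
def conj(values):
    """Return `values` itself for all-real input, or a new conjugated list otherwise."""
    if not any(isinstance(v, complex) for v in values):
        return values  # no copy: the caller gets the very same object back
    return [v.conjugate() if isinstance(v, complex) else v for v in values]


real_input = [1.0, -2.0]
complex_input = [1 + 2j, 3 - 4j]

same_object = conj(real_input) is real_input  # writing to the result would change the input
conjugated = conj(complex_input)              # a fresh list with negated imaginary parts
```

Skipping the copy is what keeps the real-valued path cheap, at the cost of the result sharing storage with the input.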
Test Plan: Imported from OSS
Reviewed By: mruberry
Differential Revision: D23460493
Pulled By: anjali411
fbshipit-source-id: 3b3bf0af55423b77ff2d0e29f5d2c160291ae3d9
Summary:
`torch.range` still hasn't been removed, long after version 0.5. This PR fixes the warning message. Alternatively, we could remove `torch.range`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43569
Reviewed By: ngimel
Differential Revision: D23408233
Pulled By: mruberry
fbshipit-source-id: 86c4f9f018ea5eddaf80b78a3c54dfa41cfc6fa6
Summary:
Adds two more "missing" NumPy aliases: arctanh and arcsinh, and simplifies the dispatch of other arc* aliases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43762
Reviewed By: ngimel
Differential Revision: D23396370
Pulled By: mruberry
fbshipit-source-id: 43eb0c62536615fed221d460c1dec289526fb23c
Summary:
Adds max/min operators that only return values.
## Some important decisions to discuss
| **Question** | **Current State** |
|---------------------------------------|-------------------|
| Expose torch.max_values to python? | No |
| Remove max_values and only keep amax? | Yes |
| Should amax support named tensors? | Not in this PR |
## Numpy compatibility
Reference: https://numpy.org/doc/stable/reference/generated/numpy.amax.html
| Parameter | PyTorch Behavior |
|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------|
| `axis`: None or int or tuple of ints, optional. Axis or axes along which to operate. By default, flattened input is used. If this is a tuple of ints, the maximum is selected over multiple axes, instead of a single axis or all the axes as before. | Named `dim`, behavior same as `torch.sum` (https://github.com/pytorch/pytorch/issues/29137) |
| `out`: ndarray, optional. Alternative output array in which to place the result. Must be of the same shape and buffer length as the expected output. | Same |
| `keepdims`: bool, optional. If this is set to True, the axes which are reduced are left in the result as dimensions with size one. With this option, the result will broadcast correctly against the input array. | implemented as `keepdim` |
| `initial`: scalar, optional. The minimum value of an output element. Must be present to allow computation on empty slice. | Not implemented in this PR. Better to implement for all reductions in the future. |
| `where`: array_like of bool, optional. Elements to compare for the maximum. | Not implemented in this PR. Better to implement for all reductions in the future. |
**Note from numpy:**
> NaN values are propagated, that is if at least one item is NaN, the corresponding max value will be NaN as well. To ignore NaN values (MATLAB behavior), please use nanmax.
PyTorch has the same behavior.
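The NaN rule quoted from NumPy can be sketched for a flat sequence in plain Python. The `amax` function here is an illustrative stand-in (no `dim` or `keepdim` handling), not the operator added by this PR. Python's built-in `max` does not give this guarantee, which is why the loop is written out.

```python
import math


def amax(values):
    """Maximum of a non-empty sequence; any NaN in the input makes the result NaN."""
    result = values[0]
    for v in values[1:]:
        if math.isnan(result):
            break  # once NaN is seen, nothing can replace it
        if math.isnan(v) or v > result:
            result = v
    return result
```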
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43092
Reviewed By: ngimel
Differential Revision: D23360705
Pulled By: mruberry
fbshipit-source-id: 5bdeb08a2465836764a5a6fc1a6cc370ae1ec09d
Summary:
To reduce the chance of conflicts, not all ops are fixed. Ops starting with the letter `f` will be fixed in a separate PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43583
Reviewed By: ZolotukhinM
Differential Revision: D23330347
Pulled By: mruberry
fbshipit-source-id: 3387cb1e495faebd16fb183039197c6d90972ad4
Summary:
Related to https://github.com/pytorch/pytorch/issues/38349
Implement NumPy-like functions `maximum` and `minimum`.
The `maximum` and `minimum` functions compare input tensors element-wise, returning a new tensor with the element-wise maxima/minima.
If one of the elements being compared is a NaN, then that element is returned. Neither `maximum` nor `minimum` supports complex inputs.
This PR also promotes the overloaded versions of torch.max and torch.min, by re-dispatching binary `torch.max` and `torch.min` to `torch.maximum` and `torch.minimum`.
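The element-wise semantics described above can be sketched over plain Python lists. These helpers are illustrative only: they assume equal-length inputs and skip broadcasting and dtype handling entirely.

```python
import math


def maximum(a, b):
    """Element-wise maximum of two equal-length sequences, propagating NaN."""
    out = []
    for x, y in zip(a, b):
        if math.isnan(x) or math.isnan(y):
            out.append(math.nan)  # a NaN on either side wins
        else:
            out.append(x if x >= y else y)
    return out


def minimum(a, b):
    """Element-wise minimum of two equal-length sequences, propagating NaN."""
    out = []
    for x, y in zip(a, b):
        if math.isnan(x) or math.isnan(y):
            out.append(math.nan)
        else:
            out.append(x if x <= y else y)
    return out
```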
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42579
Reviewed By: mrshenli
Differential Revision: D23153081
Pulled By: mruberry
fbshipit-source-id: 803506c912440326d06faa1b71964ec06775eac1
Summary:
This PR:
- Adds a method variant to movedim
- Fixes the movedim docs so it will actually appear in the documentation
- Fixes three view doc links which were broken
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43122
Reviewed By: ngimel
Differential Revision: D23166222
Pulled By: mruberry
fbshipit-source-id: 14971585072bbc04b5366d4cc146574839e79cdb
Summary:
This adds the torch.arccosh alias and updates alias testing to validate the consistency of the aliased and original operations. The alias testing is also updated to run on CPU and CUDA, which revealed a memory leak when tracing (see https://github.com/pytorch/pytorch/issues/43119).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43107
Reviewed By: ngimel
Differential Revision: D23156472
Pulled By: mruberry
fbshipit-source-id: 6155fac7954fcc49b95e7c72ed917c85e0eabfcd
Summary:
https://github.com/pytorch/pytorch/issues/40980
I have a few questions that came up while implementing the polygamma function,
so I opened this PR before completing it.
1. Some code blocks in the codebase were brought over from the cephes library (and I did the same):
```
/*
* The following function comes with the following copyright notice.
* It has been released under the BSD license.
*
* Cephes Math Library Release 2.8: June, 2000
* Copyright 1984, 1987, 1992, 2000 by Stephen L. Moshier
*/
```
Is it okay for me to use cephes code with this same copyright notice (it is already in the PyTorch codebase)?
2. There is no linting in the internal ATen library (as far as I know; I read https://github.com/pytorch/pytorch/blob/master/CONTRIBUTING.md).
How can I be sure my code follows the appropriate guidelines of this library?
3. There are already digamma and trigamma functions.
digamma is needed; however, the trigamma function becomes redundant if the polygamma function is added.
Is it okay for trigamma to stay, or should it be removed?
By the way, the CPU version now works fine with 3rd-order polygamma (which is what we need for variational inference with beta/gamma distributions), and I'm going to finish the GPU version soon.
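For reference, the relationship between polygamma and trigamma can be checked numerically with a small Python sketch. It uses the standard series for order n >= 1 with a simple Euler-Maclaurin tail estimate; this is an assumption-laden illustration for positive x, not the cephes-based code in this PR.

```python
import math


def polygamma(n: int, x: float, terms: int = 1000) -> float:
    """Approximate the n-th derivative of digamma at x > 0, for integer n >= 1.

    Uses the series (-1)**(n + 1) * n! * sum_{k >= 0} 1 / (x + k)**(n + 1),
    summing `terms` terms directly and estimating the remaining tail with
    the first two Euler-Maclaurin terms (integral plus half of the next term).
    """
    if n < 1 or x <= 0:
        raise ValueError("this sketch only handles n >= 1 and x > 0")
    head = sum(1.0 / (x + k) ** (n + 1) for k in range(terms))
    tail_integral = 1.0 / (n * (x + terms) ** n)
    tail_half_term = 0.5 / (x + terms) ** (n + 1)
    sign = 1.0 if n % 2 == 1 else -1.0
    return sign * math.factorial(n) * (head + tail_integral + tail_half_term)


trigamma_at_one = polygamma(1, 1.0)  # order 1 is trigamma; at x = 1 it equals pi**2 / 6
```

Order 1 reproduces trigamma, which is the redundancy raised in question 3.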
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42499
Reviewed By: gchanan
Differential Revision: D23110016
Pulled By: albanD
fbshipit-source-id: 246f4c2b755a99d9e18a15fcd1a24e3df5e0b53e
Summary:
This PR:
- updates test_op_normalization.py, which verifies that aliases are correctly translated in the JIT
- adds torch.linalg.det as an alias for torch.det
- moves the torch.linalg.outer alias to torch.outer (to be consistent with NumPy)
The torch.linalg.outer alias was put in the linalg namespace erroneously as a placeholder since it's a "linear algebra op" according to NumPy but is actually still in the main NumPy namespace.
The updates to test_op_normalization are necessary. Previously it was using method_tests to generate tests, and method_tests assumes test suites using it also use the device generic framework, which test_op_normalization did not. For example, some ops require decorators like `skipCPUIfNoLapack`, which only works in device generic test classes. Moving test_op_normalization to the device generic framework also lets these tests run on CPU and CUDA.
Continued reliance on method_tests() is excessive since the test suite is only interested in testing aliasing, and a simpler and more readable `AliasInfo` class is used for the required information. One impedance mismatch between method_tests and the new tests, for example, was how to handle ops in namespaces like torch.linalg.det. In the future this information will likely be folded into a common 'OpInfo' registry in the test suite.
The actual tests performed are similar to what they were previously: a scripted and traced version of the op is run and the test verifies that both graphs do not contain the alias name and do contain the aliased name.
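The shape of that check can be sketched in plain Python over a list of op names. Everything here is hypothetical scaffolding for illustration: `ALIASES`, `normalize`, and `check_alias` are not the JIT pass or the `AliasInfo` class, and a real graph is not a flat list of strings.

```python
# Alias -> original name pairs taken from the PRs in this log.
ALIASES = {
    "not_equal": "ne",
    "greater": "gt",
    "greater_equal": "ge",
    "less": "lt",
    "less_equal": "le",
    "true_divide": "div",
    "arccosh": "acosh",
}


def normalize(graph_ops):
    """Replace every alias name in a list of op names with the name it aliases."""
    return [ALIASES.get(op, op) for op in graph_ops]


def check_alias(alias, graph_ops):
    """Mimic the test's assertion: the alias is gone and the aliased op is present."""
    normalized = normalize(graph_ops)
    return alias not in normalized and ALIASES[alias] in normalized


print(normalize(["less_equal", "mul", "arccosh"]))
```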
The guidance for adding an alias has been updated accordingly.
cc mattip
Note:
ngimel suggests:
- deprecating and then removing the `torch.ger` name
- reviewing the implementation of `torch.outer`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42802
Reviewed By: zou3519
Differential Revision: D23059883
Pulled By: mruberry
fbshipit-source-id: 11321c2a7fb283a6e7c0d8899849ad7476be42d1