Summary:
Fixes: https://github.com/pytorch/pytorch/issues/26038
Somewhere between v1.1 and master, `nonzero` became `abstract` and was mistakenly marked as differentiable; to fix it, we need to put it into the TH section of `tools/autograd/derivatives.yaml`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26980
Differential Revision: D17632276
Pulled By: VitalyFedyunin
fbshipit-source-id: d6cabcc53348af6148cea5a1bd1af2ef12547373
Summary:
https://github.com/pytorch/pytorch/issues/24593
https://github.com/pytorch/pytorch/issues/24727
**torch.lt(Tensor a, Tensor b)**
will compute the common (highest) dtype based on the inputs and then compare the values. The result will be a Bool tensor
```
>>> x = torch.tensor([0], dtype=torch.int)
>>> y = torch.tensor([0.5], dtype=torch.double)
>>> x < y
tensor([True])
```
Previously it was impossible to compare two tensors with different dtypes.
**torch.lt(Tensor a, Tensor b, out=c)**
will compute the common (highest) dtype based on the inputs and then compare the values. The result can only be written to a Bool tensor
```
>>> x = torch.tensor([0], dtype=torch.int)
>>> y = torch.tensor([0.5], dtype=torch.double)
>>> z = torch.empty([1], dtype=torch.bool)
>>> torch.lt(x, y, out=z)
tensor([True])
```
Previously it was impossible to compare two tensors with different dtypes. Also, the result dtype could previously be Bool or Byte (deprecated); now only a Bool result is accepted.
**a.lt_(Tensor b)**
Expects `a` and `b` to have the same dtype, otherwise an overflow is possible (example: if `a` is uint8 and `b` is float32, `a` would be promoted to float32, the float32 result would then be cast back to uint8, with potential for overflow). Will not compute a common dtype. The result will have the dtype of `a`.
```
>>> x = torch.tensor([0], dtype=torch.double)
>>> y = torch.tensor([0.5], dtype=torch.double)
>>> x.lt_(y)
tensor([1.], dtype=torch.float64)
```
Works similarly to the previous implementation.
**torch.lt(Tensor a, Scalar b)**
will check that there is no overflow when converting `b` to the same type as `a`, then compute the common dtype and compare.
```
>>> x = torch.tensor([0], dtype=torch.double)
>>> x < 0.5
tensor([True])
>>> x = torch.tensor([0], dtype=torch.int)
>>> x < 0.5
tensor([True])
```
Fix https://github.com/pytorch/pytorch/issues/22301.
**torch.lt(Tensor a, Scalar b, out=c)**
will check that there is no overflow when converting `b` to the same type as `a`, then compute the common dtype and compare. The result can only be written to a Bool tensor
```
>>> x = torch.tensor([0], dtype=torch.double)
>>> z = torch.empty([1], dtype=torch.bool)
>>> torch.lt(x, 0.5, out=z)
tensor([True])
```
Previously the result dtype could be Bool or Byte (deprecated); now only a Bool result is accepted. The rest works similarly to the previous implementation.
**torch.lt_(Tensor a, Scalar b)**
will check that there is no overflow when converting `b` to the same type as `a`, then compute the common dtype and compare. The result will have the dtype of `a`.
```
>>> x = torch.tensor([0], dtype=torch.int)
>>> x.lt_(1)
tensor([1], dtype=torch.int32)
>>> x = torch.tensor([0], dtype=torch.int)
>>> x.lt_(1.0)
tensor([1], dtype=torch.int32)
```
Works similarly to the previous implementation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25998
Differential Revision: D17431853
Pulled By: ifedan
fbshipit-source-id: b5effc6a5d9b32da379395b32abc628b604faaf7
Summary:
Currently when a Vec256<T> (base) object contains -0.0, Vec256<T>::abs()
would not produce 0.0, but -0.0 instead. This commit fixes this issue.
This bug will mostly affect CPUs without AVX support, such as ARM,
PowerPC, and older Intel models.
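For illustration, a quick Python-level sanity check of the intended behavior (a sketch, not the commit's own test):
```python
import math
import torch

x = torch.tensor([-0.0])
y = x.abs()
# abs(-0.0) must be +0.0; the sign bit is visible through math.copysign
print(math.copysign(1.0, y.item()))  # expected: 1.0 after the fix
```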
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26422
Differential Revision: D17607346
fbshipit-source-id: e8d4595f0e88ad93018a61f89b9e3dcada485358
Summary:
The current Bernoulli distribution sampler is slightly off in that it returns true slightly too often. This is most obvious at very low p values, like p = 0, although it theoretically occurs at every probability. See https://github.com/pytorch/pytorch/issues/26807.
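A minimal Python check of the boundary case described above (a sketch, not this PR's test):
```python
import torch

# with p = 0 a correct sampler must never return 1
p = torch.zeros(1_000_000)
samples = torch.bernoulli(p)
print(samples.sum().item())  # expected: 0.0
```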
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26864
Differential Revision: D17610459
Pulled By: ezyang
fbshipit-source-id: 28215ff820a6046822513f284793e7b850d38438
Summary:
Change the default encoding used by `torch.load` to 'utf-8'.
This commit provides changes for cases where a user tries to `torch.load`
a pickled module with non-ASCII characters in the docstring, as
discussed in https://github.com/pytorch/pytorch/issues/21743. The default encoding was changed from 'ascii'
to 'utf-8'. Documentation for `torch.load` was updated and two tests
(loading py2 unicode module with unicode in it; error throwing when
user explicitly sets wrong encoding) were written.
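For illustration, the `encoding` argument can still be passed explicitly if the old behavior is needed (the file name below is hypothetical):
```python
import torch

# the default is now encoding='utf-8'
module = torch.load('legacy_py2_module.pt')

# the previous default can still be requested explicitly
module = torch.load('legacy_py2_module.pt', encoding='ascii')
```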
~~This commit provides changes for better error handling in cases
where user tries to `torch.load` a pickled module with non-ASCII
characters in the docstring as discussed in https://github.com/pytorch/pytorch/issues/21743.~~
Ping ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26421
Differential Revision: D17581633
Pulled By: yf225
fbshipit-source-id: f8e77dcf7907092771149aad8ede6cfb73c21620
Summary:
- Separates device type from default (test) device
- Adds multidevice decorator
- Updates generic tests to use multidevice decorator where applicable
TorchXLA wants to change the default test device based on the test environment. Separating the device type and the default (test) device enables that functionality.
Additionally, many existing tests only run on multiple devices and are required, as a consequence, to make CUDA-specific API calls. The multidevice decorator simplifies the existing code and limits the CUDA dependency. Eventually this should let us run multidevice tests on multiple device types.
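A rough sketch of what a multi-device test could look like under this framework (the module path and decorator name here are assumptions based on the description, not verified against the tree):
```python
import unittest
import torch
# assumed names: the multidevice decorator described in this PR lives in
# test/common_device_type.py alongside instantiate_device_type_tests
from common_device_type import instantiate_device_type_tests, deviceCountAtLeast

class TestMultiDevice(unittest.TestCase):
    # only instantiated/run when at least two devices of the tested type exist;
    # the framework passes in the list of device strings
    @deviceCountAtLeast(2)
    def test_copy_between_devices(self, devices):
        src = torch.randn(3, device=devices[0])
        dst = src.to(devices[1])
        self.assertTrue(torch.equal(src.cpu(), dst.cpu()))

instantiate_device_type_tests(TestMultiDevice, globals())

if __name__ == '__main__':
    unittest.main()
```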
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26594
Test Plan: tests were manually run with the CUDA test device set to 'cuda:1'.
Differential Revision: D17568910
Pulled By: mruberry
fbshipit-source-id: c442f748a31a970be8c21deb12a67c3b315c1128
Summary:
Currently, integer scalar exponents are always cast to double. This commit avoids the cast when the
tensor is also integral and the scalar is positive, to speed this up.
Benchmark (Debian Buster, g++ 8, Intel(R) Xeon(R) E-2136 CPU @ 3.30GHz, Debug
build, Turbo turned off):
```python
import timeit
for n, t in [(1000, 13000),
             (10_000, 1300)]:
    for e in (2, 3, 4):
        for dtype in ('torch.int16', 'torch.int32', 'torch.int64'):
            print(f'a.pow({e}) (a.numel() == {n}) for {t} times')
            print(f'dtype {dtype}, {t} times', end='\t\t')
            print(timeit.timeit(f'a.pow({e})',
                                setup=f'import torch; a = torch.arange({n}, device="cpu", dtype={dtype})',
                                number=t))
```
Before:
```
a.pow(2) (a.numel() == 1000) for 13000 times
dtype torch.int16, 13000 times 1.6958350749996498
a.pow(2) (a.numel() == 1000) for 13000 times
dtype torch.int32, 13000 times 0.7989626339999631
a.pow(2) (a.numel() == 1000) for 13000 times
dtype torch.int64, 13000 times 0.7973162800003593
a.pow(3) (a.numel() == 1000) for 13000 times
dtype torch.int16, 13000 times 1.8660746679997828
a.pow(3) (a.numel() == 1000) for 13000 times
dtype torch.int32, 13000 times 0.8101709959996697
a.pow(3) (a.numel() == 1000) for 13000 times
dtype torch.int64, 13000 times 0.8135280149999744
a.pow(4) (a.numel() == 1000) for 13000 times
dtype torch.int16, 13000 times 5.010833072999958
a.pow(4) (a.numel() == 1000) for 13000 times
dtype torch.int32, 13000 times 4.801007671999741
a.pow(4) (a.numel() == 1000) for 13000 times
dtype torch.int64, 13000 times 3.963344578000033
a.pow(2) (a.numel() == 10000) for 1300 times
dtype torch.int16, 1300 times 1.6216251330001796
a.pow(2) (a.numel() == 10000) for 1300 times
dtype torch.int32, 1300 times 0.5672429639998882
a.pow(2) (a.numel() == 10000) for 1300 times
dtype torch.int64, 1300 times 0.5544572270000572
a.pow(3) (a.numel() == 10000) for 1300 times
dtype torch.int16, 1300 times 1.656308512999658
a.pow(3) (a.numel() == 10000) for 1300 times
dtype torch.int32, 1300 times 1.502670819999821
a.pow(3) (a.numel() == 10000) for 1300 times
dtype torch.int64, 1300 times 0.5757876879997639
a.pow(4) (a.numel() == 10000) for 1300 times
dtype torch.int16, 1300 times 4.775718216999849
a.pow(4) (a.numel() == 10000) for 1300 times
dtype torch.int32, 1300 times 4.754745475000163
a.pow(4) (a.numel() == 10000) for 1300 times
dtype torch.int64, 1300 times 3.737249878000057
```
After:
```
a.pow(2) (a.numel() == 1000) for 13000 times
dtype torch.int16, 13000 times 1.1006453190002503
a.pow(2) (a.numel() == 1000) for 13000 times
dtype torch.int32, 13000 times 1.0849009019998448
a.pow(2) (a.numel() == 1000) for 13000 times
dtype torch.int64, 13000 times 1.093259106000005
a.pow(3) (a.numel() == 1000) for 13000 times
dtype torch.int16, 13000 times 1.0859826279997833
a.pow(3) (a.numel() == 1000) for 13000 times
dtype torch.int32, 13000 times 1.1076840900000207
a.pow(3) (a.numel() == 1000) for 13000 times
dtype torch.int64, 13000 times 1.0755480369998622
a.pow(4) (a.numel() == 1000) for 13000 times
dtype torch.int16, 13000 times 1.918211066999902
a.pow(4) (a.numel() == 1000) for 13000 times
dtype torch.int32, 13000 times 1.9183043200000611
a.pow(4) (a.numel() == 1000) for 13000 times
dtype torch.int64, 13000 times 1.930021430999659
a.pow(2) (a.numel() == 10000) for 1300 times
dtype torch.int16, 1300 times 0.7271483560002707
a.pow(2) (a.numel() == 10000) for 1300 times
dtype torch.int32, 1300 times 0.7289002070001516
a.pow(2) (a.numel() == 10000) for 1300 times
dtype torch.int64, 1300 times 0.7267536800000016
a.pow(3) (a.numel() == 10000) for 1300 times
dtype torch.int16, 1300 times 0.7301799359997858
a.pow(3) (a.numel() == 10000) for 1300 times
dtype torch.int32, 1300 times 0.7289195180001116
a.pow(3) (a.numel() == 10000) for 1300 times
dtype torch.int64, 1300 times 0.7270008230002531
a.pow(4) (a.numel() == 10000) for 1300 times
dtype torch.int16, 1300 times 1.5354506029998447
a.pow(4) (a.numel() == 10000) for 1300 times
dtype torch.int32, 1300 times 1.528263066999898
a.pow(4) (a.numel() == 10000) for 1300 times
dtype torch.int64, 1300 times 1.5369428439998956
```
---
Best viewed with whitespace changes turned off
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26020
Differential Revision: D17485400
Pulled By: VitalyFedyunin
fbshipit-source-id: 3a16b074825a5aab0f7e7af3d8100f9e4b7011a3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26620
This change updates torch.backends.quantized.engine to accept a string ("fbgemm"/"qnnpack"/"none" for now).
set_qengine and get_qengine return an int which represents the at::QEngine enum
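For illustration, the string form looks like this (assuming the fbgemm backend is available in the build):
```python
import torch

# select the quantization backend by name; "none" disables it
torch.backends.quantized.engine = 'fbgemm'
print(torch.backends.quantized.engine)  # 'fbgemm'
```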
Test Plan:
python test/test_torch.py
Imported from OSS
Differential Revision: D17533582
fbshipit-source-id: 5103263d0d59ff37d43dec27243cb76ba8ba633f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26548
This makes the naming more consistent with PyTorch's API. The original
concern was that `tensor.rename` might make the operation seem like it
is in-place. However, we have many "verb" APIs: `tensor.add(other)`, for
example, doesn't add other to tensor in-place, but `tensor.add_(other)`
does.
`tensor.rename_` does exactly the same thing as `tensor.rename`, but
in-place.
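For illustration (named tensors were experimental; a minimal sketch):
```python
import torch

x = torch.randn(2, 3, names=('N', 'C'))
y = x.rename('batch', 'channels')   # out-of-place: x keeps its old names
x.rename_('batch', 'channels')      # in-place: x itself is renamed
print(x.names)                      # ('batch', 'channels')
```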
Test Plan: - [namedtensor ci]
Differential Revision: D17502021
Pulled By: zou3519
fbshipit-source-id: 6a5b93136a820075013cd1e30fb8fc6b9d77d7d9
Summary:
In Python, `2 ^ 31` is bitwise XOR and evaluates to 29, which is not a big number. Corrected to `2 ** 31`.
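For reference:
```
>>> 2 ^ 31   # bitwise XOR
29
>>> 2 ** 31  # exponentiation
2147483648
```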
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26491
Differential Revision: D17494296
fbshipit-source-id: 83d320e8fb6d1b7df41e4474933a98107c8e4129
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26501
Instead of considering only the TensorTypeSet of the first argument, we collect all Tensor and TensorList arguments and union them together before computing the dispatch type id.
XLA companion patch at https://github.com/pytorch/xla/pull/1031
Billing of changes:
* ATenDispatch fallback code (i.e., what gets run if there is no entry for a function in the table) now lives out-of-line in a function `getFallbackOp`. This gave me an opportunity to write a more detailed error message, providing information about what registrations were available. There is a TODO in the fallback code, suggesting that we could automatically redispatch in the event that there is no handler for the key. But this is a bit of a design question, because it's not clear if automatic redispatch would cover up errors in the dispatch table (i.e., there *should* have been something registered at some key, but there wasn't.)
* Collection of Tensor/TensorList arguments is done using the trusty old IterArgs helper class. A minor bit of refactoring I had to do to get here was move the IterArgs functionality in torch/csrc/utils/variadic.h into ATen/core. There's some refactoring due on that file too (it has copies of some C++ helper pieces which already live in c10--you can't actually move the whole thing because it is literally incompatible with other code in the codebase). So instead of calling `type_set()` to get the type set of the dispatch argument, now we just call `at::detail::multi_dispatch_tensor_type_set` on all of the tensor/tensor list arguments.
* The code generator is adjusted to codegen collection of arguments as needed. There is a little bit of a hack in the code generator to turn 'self' arguments into '*this'. I think this may be duplicated with some logic somewhere else but I have to double check.
The new generated code looks like this:
```
inline Tensor & Tensor::copy_(const Tensor & src, bool non_blocking) const {
static auto table = globalATenDispatch().getOpTable("aten::copy_(Tensor(a!) self, Tensor src, bool non_blocking=False) -> Tensor(a!)");
return table->getOp<Tensor & (Tensor &, const Tensor &, bool)>(at::detail::multi_dispatch_tensor_type_set(*this, src))(const_cast<Tensor&>(*this), src, non_blocking);
}
```
The key difference is that previously we wrote `type_set()` as argument to getOp; now it is a call to `multi_dispatch_tensor_type_set` which collects the type ids together.
After turning on multi-dispatch, I had to refactor existing code which previously dispatched one place, but now dispatches somewhere else. The primary component affected by this is sparse.
* Binary operations (add/sub/mul/div/addmm) now dispatch to sparse kernels even if you did add(dense, sparse). So I delete all the sparse handling code from dense kernels, and bulk up the sparse error handling to handle when the first argument is dense. In the case of addmm, I can eliminate the bridge code entirely (well, not quite: more on this below). I also updated the dispatch on sparse to actually point at sparse kernels. Pay special attention to the handling of `div_` by scalar: previously this logic lived in the "dense" `div_` implementation, but there is actually not any sparse kernel we dispatch to. I solved this particular problem by making a redispatch, but another valid approach would have been to add specific dispatches for sparse div on scalar. This codepath is poorly tested because it is only exercised from C++.
* One minor annoyance is that because I now want separate dispatch for dense and sparse, I also need to replicate the `add`, `add_`, `add_out` trifecta on the sparse side. I opted for a compromise here: I wrote a new `add_sparse` trifecta, but reused the implementation between CPU and CUDA. This means that I have to do another dispatch once I get to `add_out`. The alternative would have been to do twice as many copies for CPU and CUDA (thereby eliminating the extra dispatch) but that seemed distinctly not worth it.
* A lot of kernels in sparse assumed that the dispatch argument must be sparse. This is no longer true with dispatch, so I converted the asserts into plain error checking. This also means that we've perturbed the error message in the case of TestSparseOneOff.test_cuda_sparse_cpu_dense_add (I just updated the saved error message)
* `addmm` is a little bit even more special: the bridge code also handled broadcasting. I replicated the broadcasting logic between CPU and CUDA implementations to avoid an extra dispatch.
* `_sparse_addmm` gave me a bit of trouble, because I had forgotten why we had `torch.sparse.addmm` in the first place. But in the end, its changes followed along with the structural changes I made in addmm. I opted for an extra dispatch here for simplicity.
* c10d has some Variable-Tensor confusion in its sparse code. I've worked around it by judiciously inserting "no variable type" guards, but a more correct fix would be to just solve the confusion entirely.
Benchmark:
Apply the following patch to the base commit and this commit:
```
diff --git a/aten/src/ATen/native/Const.cpp b/aten/src/ATen/native/Const.cpp
new file mode 100644
index 0000000000..b66f4d3ece
--- /dev/null
+++ b/aten/src/ATen/native/Const.cpp
@@ -0,0 +1,10 @@
+#include <ATen/ATen.h>
+
+namespace at {
+namespace native {
+
+Tensor _const5(const Tensor& self, const Tensor& second, const Tensor& third, const Tensor& fourth, const Tensor& fifth) {
+ return self;
+}
+
+}} // namespace at::native
diff --git a/aten/src/ATen/native/native_functions.yaml b/aten/src/ATen/native/native_functions.yaml
index b494ed7950..fddae638bb 100644
--- a/aten/src/ATen/native/native_functions.yaml
+++ b/aten/src/ATen/native/native_functions.yaml
@@ -5878,3 +5878,9 @@
dispatch:
CPU: im2col_backward_cpu
CUDA: im2col_backward_cuda
+
+# For benchmarking
+- func: _const5(Tensor self, Tensor second, Tensor third, Tensor fourth, Tensor fifth) -> Tensor
+ variants: function
+ dispatch:
+ CPU: _const5
```
Comparisons with timeit:
One-argument, representative case:
Before:
```
In [6]: %timeit x.reshape(1, 1)
1.46 µs ± 1.38 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [7]: %timeit x.reshape(1, 1)
1.48 µs ± 29.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [8]: %timeit x.reshape(1, 1)
1.52 µs ± 61.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```
After:
```
In [3]: %timeit x.reshape(1, 1)
1.42 µs ± 1.34 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [4]: %timeit x.reshape(1, 1)
1.43 µs ± 1.01 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [5]: %timeit x.reshape(1, 1)
1.42 µs ± 0.982 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```
Five-argument, synthetic case (we expect, with enough Tensor arguments, for there to be a slowdown, as we scale `O(n)` with number of arguments, compared to old dispatcher which is `O(1)` with number of arguments):
Before:
```
In [1]: import torch
In [2]: x = torch.zeros(1)
In [3]: %timeit torch._const5(x, x, x, x, x)
949 ns ± 1.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [4]: %timeit torch._const5(x, x, x, x, x)
954 ns ± 1.96 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [5]: %timeit torch._const5(x, x, x, x, x)
947 ns ± 0.601 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```
After:
```
In [3]: %timeit torch._const5(x, x, x, x, x)
985 ns ± 9.11 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [4]: %timeit torch._const5(x, x, x, x, x)
984 ns ± 1.17 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [5]: %timeit torch._const5(x, x, x, x, x)
988 ns ± 0.555 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: zou3519
Differential Revision: D17499154
Pulled By: ezyang
fbshipit-source-id: 8ea237c2e935134b0f4f8d6cfd89c6a93037c02c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26498
We should allocate an empty tensor as a result tensor when performing
binary ops. Currently some ops use `empty_like(self)` as the initial
result tensor before passing it into TensorIterator. This is not very
efficient because TensorIterator may resize the tensor due to
broadcasting, causing more memory allocation. By using an empty tensor
as the result tensor, we only need to allocate/resize memory once as
opposed to twice.
Also fixes https://github.com/pytorch/pytorch/issues/26495. The bug
there is that the implementation of `pow` is missing a resize in one
case.
Test Plan:
- new test
- run tests
Differential Revision: D17500025
Pulled By: zou3519
fbshipit-source-id: bff4949af5e75541c04669b961bcf2e1ec456faf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26468
Instead of considering only the TensorTypeSet of the first argument, we collect all Tensor and TensorList arguments and union them together before computing the dispatch type id.
XLA companion patch at https://github.com/pytorch/xla/pull/1031
Billing of changes:
* ATenDispatch fallback code (i.e., what gets run if there is no entry for a function in the table) now lives out-of-line in a function `getFallbackOp`. This gave me an opportunity to write a more detailed error message, providing information about what registrations were available. There is a TODO in the fallback code, suggesting that we could automatically redispatch in the event that there is no handler for the key. But this is a bit of a design question, because it's not clear if automatic redispatch would cover up errors in the dispatch table (i.e., there *should* have been something registered at some key, but there wasn't.)
* Collection of Tensor/TensorList arguments is done using the trusty old IterArgs helper class. A minor bit of refactoring I had to do to get here was move the IterArgs functionality in torch/csrc/utils/variadic.h into ATen/core. There's some refactoring due on that file too (it has copies of some C++ helper pieces which already live in c10--you can't actually move the whole thing because it is literally incompatible with other code in the codebase). So instead of calling `type_set()` to get the type set of the dispatch argument, now we just call `at::detail::multi_dispatch_tensor_type_set` on all of the tensor/tensor list arguments.
* The code generator is adjusted to codegen collection of arguments as needed. There is a little bit of a hack in the code generator to turn 'self' arguments into '*this'. I think this may be duplicated with some logic somewhere else but I have to double check.
The new generated code looks like this:
```
inline Tensor & Tensor::copy_(const Tensor & src, bool non_blocking) const {
static auto table = globalATenDispatch().getOpTable("aten::copy_(Tensor(a!) self, Tensor src, bool non_blocking=False) -> Tensor(a!)");
return table->getOp<Tensor & (Tensor &, const Tensor &, bool)>(at::detail::multi_dispatch_tensor_type_set(*this, src))(const_cast<Tensor&>(*this), src, non_blocking);
}
```
The key difference is that previously we wrote `type_set()` as argument to getOp; now it is a call to `multi_dispatch_tensor_type_set` which collects the type ids together.
After turning on multi-dispatch, I had to refactor existing code which previously dispatched one place, but now dispatches somewhere else. The primary component affected by this is sparse.
* Binary operations (add/sub/mul/div/addmm) now dispatch to sparse kernels even if you did add(dense, sparse). So I delete all the sparse handling code from dense kernels, and bulk up the sparse error handling to handle when the first argument is dense. In the case of addmm, I can eliminate the bridge code entirely (well, not quite: more on this below). I also updated the dispatch on sparse to actually point at sparse kernels. Pay special attention to the handling of `div_` by scalar: previously this logic lived in the "dense" `div_` implementation, but there is actually not any sparse kernel we dispatch to. I solved this particular problem by making a redispatch, but another valid approach would have been to add specific dispatches for sparse div on scalar. This codepath is poorly tested because it is only exercised from C++.
* One minor annoyance is that because I now want separate dispatch for dense and sparse, I also need to replicate the `add`, `add_`, `add_out` trifecta on the sparse side. I opted for a compromise here: I wrote a new `add_sparse` trifecta, but reused the implementation between CPU and CUDA. This means that I have to do another dispatch once I get to `add_out`. The alternative would have been to do twice as many copies for CPU and CUDA (thereby eliminating the extra dispatch) but that seemed distinctly not worth it.
* A lot of kernels in sparse assumed that the dispatch argument must be sparse. This is no longer true with dispatch, so I converted the asserts into plain error checking. This also means that we've perturbed the error message in the case of TestSparseOneOff.test_cuda_sparse_cpu_dense_add (I just updated the saved error message)
* `addmm` is a little bit even more special: the bridge code also handled broadcasting. I replicated the broadcasting logic between CPU and CUDA implementations to avoid an extra dispatch.
* `_sparse_addmm` gave me a bit of trouble, because I had forgotten why we had `torch.sparse.addmm` in the first place. But in the end, its changes followed along with the structural changes I made in addmm. I opted for an extra dispatch here for simplicity.
* c10d has some Variable-Tensor confusion in its sparse code. I've worked around it by judiciously inserting "no variable type" guards, but a more correct fix would be to just solve the confusion entirely.
Benchmark:
Apply the following patch to the base commit and this commit:
```
diff --git a/aten/src/ATen/native/Const.cpp b/aten/src/ATen/native/Const.cpp
new file mode 100644
index 0000000000..b66f4d3ece
--- /dev/null
+++ b/aten/src/ATen/native/Const.cpp
@@ -0,0 +1,10 @@
+#include <ATen/ATen.h>
+
+namespace at {
+namespace native {
+
+Tensor _const5(const Tensor& self, const Tensor& second, const Tensor& third, const Tensor& fourth, const Tensor& fifth) {
+ return self;
+}
+
+}} // namespace at::native
diff --git a/aten/src/ATen/native/native_functions.yaml b/aten/src/ATen/native/native_functions.yaml
index b494ed7950..fddae638bb 100644
--- a/aten/src/ATen/native/native_functions.yaml
+++ b/aten/src/ATen/native/native_functions.yaml
@@ -5878,3 +5878,9 @@
dispatch:
CPU: im2col_backward_cpu
CUDA: im2col_backward_cuda
+
+# For benchmarking
+- func: _const5(Tensor self, Tensor second, Tensor third, Tensor fourth, Tensor fifth) -> Tensor
+ variants: function
+ dispatch:
+ CPU: _const5
```
Comparisons with timeit:
One-argument, representative case:
Before:
```
In [6]: %timeit x.reshape(1, 1)
1.46 µs ± 1.38 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [7]: %timeit x.reshape(1, 1)
1.48 µs ± 29.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [8]: %timeit x.reshape(1, 1)
1.52 µs ± 61.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```
After:
```
In [3]: %timeit x.reshape(1, 1)
1.42 µs ± 1.34 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [4]: %timeit x.reshape(1, 1)
1.43 µs ± 1.01 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [5]: %timeit x.reshape(1, 1)
1.42 µs ± 0.982 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```
Five-argument, synthetic case (we expect, with enough Tensor arguments, for there to be a slowdown, as we scale `O(n)` with number of arguments, compared to old dispatcher which is `O(1)` with number of arguments):
Before:
```
In [1]: import torch
In [2]: x = torch.zeros(1)
In [3]: %timeit torch._const5(x, x, x, x, x)
949 ns ± 1.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [4]: %timeit torch._const5(x, x, x, x, x)
954 ns ± 1.96 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [5]: %timeit torch._const5(x, x, x, x, x)
947 ns ± 0.601 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```
After:
```
In [3]: %timeit torch._const5(x, x, x, x, x)
985 ns ± 9.11 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [4]: %timeit torch._const5(x, x, x, x, x)
984 ns ± 1.17 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [5]: %timeit torch._const5(x, x, x, x, x)
988 ns ± 0.555 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: bddppq
Differential Revision: D17481256
Pulled By: ezyang
fbshipit-source-id: b3206936b4ca8938d45ea90fd71422e0d80b5f96
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25653
Instead of considering only the TensorTypeSet of the first argument, we collect all Tensor and TensorList arguments and union them together before computing the dispatch type id.
Billing of changes:
* ATenDispatch fallback code (i.e., what gets run if there is no entry for a function in the table) now lives out-of-line in a function `getFallbackOp`. This gave me an opportunity to write a more detailed error message, providing information about what registrations were available. There is a TODO in the fallback code, suggesting that we could automatically redispatch in the event that there is no handler for the key. But this is a bit of a design question, because it's not clear if automatic redispatch would cover up errors in the dispatch table (i.e., there *should* have been something registered at some key, but there wasn't.)
* Collection of Tensor/TensorList arguments is done using the trusty old IterArgs helper class. A minor bit of refactoring I had to do to get here was move the IterArgs functionality in torch/csrc/utils/variadic.h into ATen/core. There's some refactoring due on that file too (it has copies of some C++ helper pieces which already live in c10--you can't actually move the whole thing because it is literally incompatible with other code in the codebase). So instead of calling `type_set()` to get the type set of the dispatch argument, now we just call `at::detail::multi_dispatch_tensor_type_set` on all of the tensor/tensor list arguments.
* The code generator is adjusted to codegen collection of arguments as needed. There is a little bit of a hack in the code generator to turn 'self' arguments into '*this'. I think this may be duplicated with some logic somewhere else but I have to double check.
After turning on multi-dispatch, I had to refactor existing code which previously dispatched one place, but now dispatches somewhere else. The primary component affected by this is sparse.
* Binary operations (add/sub/mul/div/addmm) now dispatch to sparse kernels even if you did add(dense, sparse). So I delete all the sparse handling code from dense kernels, and bulk up the sparse error handling to handle when the first argument is dense. In the case of addmm, I can eliminate the bridge code entirely (well, not quite: more on this below). I also updated the dispatch on sparse to actually point at sparse kernels. Pay special attention to the handling of `div_` by scalar: previously this logic lived in the "dense" `div_` implementation, but there is actually not any sparse kernel we dispatch to. I solved this particular problem by making a redispatch, but another valid approach would have been to add specific dispatches for sparse div on scalar. This codepath is poorly tested because it is only exercised from C++.
* One minor annoyance is that because I now want separate dispatch for dense and sparse, I also need to replicate the `add`, `add_`, `add_out` trifecta on the sparse side. I opted for a compromise here: I wrote a new `add_sparse` trifecta, but reused the implementation between CPU and CUDA. This means that I have to do another dispatch once I get to `add_out`. The alternative would have been to do twice as many copies for CPU and CUDA (thereby eliminating the extra dispatch) but that seemed distinctly not worth it.
* A lot of kernels in sparse assumed that the dispatch argument must be sparse. This is no longer true with dispatch, so I converted the asserts into plain error checking. This also means that we've perturbed the error message in the case of TestSparseOneOff.test_cuda_sparse_cpu_dense_add (I just updated the saved error message)
* `addmm` is a little bit even more special: the bridge code also handled broadcasting. I replicated the broadcasting logic between CPU and CUDA implementations to avoid an extra dispatch.
* `_sparse_addmm` gave me a bit of trouble, because I had forgotten why we had `torch.sparse.addmm` in the first place. But in the end, its changes followed along with the structural changes I made in addmm. I opted for an extra dispatch here for simplicity.
* c10d has some Variable-Tensor confusion in its sparse code. I've worked around it by judiciously inserting "no variable type" guards, but a more correct fix would be to just solve the confusion entirely.
Benchmark:
Apply the following patch to the base commit and this commit:
```
diff --git a/aten/src/ATen/native/Const.cpp b/aten/src/ATen/native/Const.cpp
new file mode 100644
index 0000000000..b66f4d3ece
--- /dev/null
+++ b/aten/src/ATen/native/Const.cpp
@@ -0,0 +1,10 @@
+#include <ATen/ATen.h>
+
+namespace at {
+namespace native {
+
+Tensor _const5(const Tensor& self, const Tensor& second, const Tensor& third, const Tensor& fourth, const Tensor& fifth) {
+ return self;
+}
+
+}} // namespace at::native
diff --git a/aten/src/ATen/native/native_functions.yaml b/aten/src/ATen/native/native_functions.yaml
index b494ed7950..fddae638bb 100644
--- a/aten/src/ATen/native/native_functions.yaml
+++ b/aten/src/ATen/native/native_functions.yaml
@@ -5878,3 +5878,9 @@
dispatch:
CPU: im2col_backward_cpu
CUDA: im2col_backward_cuda
+
+# For benchmarking
+- func: _const5(Tensor self, Tensor second, Tensor third, Tensor fourth, Tensor fifth) -> Tensor
+ variants: function
+ dispatch:
+ CPU: _const5
```
Comparisons with timeit:
One-argument, representative case:
Before:
```
In [6]: %timeit x.reshape(1, 1)
1.46 µs ± 1.38 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [7]: %timeit x.reshape(1, 1)
1.48 µs ± 29.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [8]: %timeit x.reshape(1, 1)
1.52 µs ± 61.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```
After:
```
In [3]: %timeit x.reshape(1, 1)
1.42 µs ± 1.34 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [4]: %timeit x.reshape(1, 1)
1.43 µs ± 1.01 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [5]: %timeit x.reshape(1, 1)
1.42 µs ± 0.982 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```
Five-argument, synthetic case (we expect, with enough Tensor arguments, for there to be a slowdown, as we scale `O(n)` with number of arguments, compared to old dispatcher which is `O(1)` with number of arguments):
Before:
```
In [1]: import torch
In [2]: x = torch.zeros(1)
In [3]: %timeit torch._const5(x, x, x, x, x)
949 ns ± 1.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [4]: %timeit torch._const5(x, x, x, x, x)
954 ns ± 1.96 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [5]: %timeit torch._const5(x, x, x, x, x)
947 ns ± 0.601 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```
After:
```
In [3]: %timeit torch._const5(x, x, x, x, x)
985 ns ± 9.11 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [4]: %timeit torch._const5(x, x, x, x, x)
984 ns ± 1.17 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [5]: %timeit torch._const5(x, x, x, x, x)
988 ns ± 0.555 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D17265918
Pulled By: ezyang
fbshipit-source-id: 221efe4e86a40f36abc81e2ebceaa7e251c90b3d
Summary:
- Moves all ROCm-requiring test_torch tests to TestTorchDeviceType
- Moves test_stft and test_lu from test_cuda
- Moves many CUDA-only test_torch tests to TestTorchDeviceType
- Combines several test_torch CPU tests with their CUDA variants
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26435
Differential Revision: D17470469
Pulled By: mruberry
fbshipit-source-id: 90bb7fc09465c53eb2ab8da52eb2c2509775c16f
Summary:
- Adds dtypes, dtypesIfCPU, and dtypesIfCUDA decorators.
- Eliminates the need for nontest members to be defined in an inherited base.
- Updates one test to use the decorators and updates TestTorchDeviceType with helpers.
This PR appears to be hanging the ROCm build, which is not entirely surprising. See https://github.com/pytorch/pytorch/issues/26394, which demonstrates that the ROCm build can be hung by commenting out a Python test that was never run on ROCm.
gchanan - what type list, if any, do you want to expose? I imagine most test suites will define their own lists like today. SCALAR_TYPES, QUANTIZED_TYPES, and ALL_TYPES seem reasonable to me. DOCUMENTED_TENSOR_TYPES will be removed, of course.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26375
Test Plan: Edit is to tests themselves.
Differential Revision: D17462294
Pulled By: mruberry
fbshipit-source-id: f8259ec66709749b1bf8077efc737676af901436
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25658
This unflattens `dim` according to the shape specified in `namedshape`.
`namedshape` may be either an OrderedDict or an iterable of (name, size)
tuples.
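A minimal sketch of the described usage (the dimension names are illustrative):
```python
import torch

x = torch.randn(2, 6, names=('N', 'C'))
# unflatten 'C' into two named dimensions via an iterable of (name, size) pairs
y = x.unflatten('C', (('C1', 2), ('C2', 3)))
print(y.names, y.shape)  # ('N', 'C1', 'C2') torch.Size([2, 2, 3])
```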
Future:
- It is possible to make it take a dict in Python >= 3.6 because those are
ordered by default, but I'll leave that task for the future.
Test Plan: - new tests [namedtensor ci]
Differential Revision: D17192655
Pulled By: zou3519
fbshipit-source-id: fd9bd2f462c23a4df1c23d66f2aa95076ff1b160
Summary:
Changelog:
- Modify existing implementation of pinverse to support batching on inputs
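For illustration, a batched input now works directly (a minimal sketch):
```python
import torch

a = torch.randn(3, 5, 4)   # a batch of three 5x4 matrices
p = torch.pinverse(a)      # previously only 2-D inputs were supported
print(p.shape)             # torch.Size([3, 4, 5])
```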
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26095
Test Plan: - Added tests in test_pinverse to test batched implementation
Differential Revision: D17408092
Pulled By: soumith
fbshipit-source-id: bba95eb193ce33a94ecfaf74da270d34b435e4af
Summary:
- Adds new decorators for skipping on ROCm, skipping on MKL, running only on the CPU and running only on CUDA
- Makes decorator skip semantics consistent
- Adds CUDA default stream requirement to MAGMA decorator
- Creates TestAutogradDeviceType
Note this PR originally moved test_cdist, but moving it caused failures in CI. There may be an undiagnosed issue with cdist or the test. The issue does not reproduce locally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26248
Test Plan: Change is to tests themselves.
Differential Revision: D17410386
Pulled By: mruberry
fbshipit-source-id: 8459df44f2a00f0e71680fbe713587a01d4b0300
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26252
Original commit changeset: 1375774f24c2
Testing to see if this is somehow the source of hangs on ROCm builds.
Test Plan: Change is to tests themselves. This diff is for testing the ROCm hang, however.
Differential Revision: D17390575
fbshipit-source-id: a6ffd5eb1df3971b99b6d42271a8d3d501ac79c6
Summary:
- Adds SkipCUDAIfRocm and skipCPUIfNoMkl decorators, ports corresponding tests
- Changes "SkipIf" input semantics for consistency
- Removes torchtest, which has been replaced with this new generic framework
- Refactors some common parts out of CUDA tests to TestTorchDeviceType
- Ensures all MAGMA tests run on default stream by putting the skipCUDANonDefaultStreamIf in the skipCUDAIfNoMagma decorator.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26244
Differential Revision: D17389060
Pulled By: mruberry
fbshipit-source-id: 1375774f24c2266049e6d4b899e7300ddf32eac8
Summary:
This PR moves many tests in test_torch.py to the generic device type framework. This means that many CUDA tests now run in test_torch.py and there is greater consistency in how tests for many device types are written.
One change is that all MAGMA tests are run on the default stream due to intermittent instability running MAGMA on the non-default stream. This is a known issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26232
Test Plan:
While this PR edits the tests itself, it was validated using two independent methods:
(1) The code was reviewed and it was verified that all deleted functions were actually moved.
(2) The output of the TestTorch CI was reviewed and test outputs were matched before and after this PR.
Differential Revision: D17386370
Pulled By: mruberry
fbshipit-source-id: 843d14911bbd52e8aac6861c0d9bc3d0d9418219
Summary:
This PR addresses https://github.com/pytorch/pytorch/issues/24851 by...
1. lets device types easily register themselves for testing
2. lets tests be written to run on multiple devices and with multiple dtypes
3. provides a mechanism to instantiate those tests so they are discoverable and filterable by unittest and pytest
It refactors three tests from test_torch.py to demonstrate how to use it.
`test_diagonal` is the simplest example. Most tests just need to be modified to accept 'device' as an argument. The framework will then instantiate `test_diagonal_cpu` and `test_diagonal_cuda` (when CUDA is available) which call `test_diagonal` with the appropriate 'device' argument.
`test_neg` also has dtype variants. It accepts both 'device' and 'dtype' as arguments, and the dtypes it runs with are specified with the 'dtypes' decorator. Dtypes can be specified for all device types and particular device types. The framework instantiates tests like `test_neg_cpu_torch.float`.
`test_inverse` has device-specific dependencies. These dependencies are expressed with the sugary 'skipCUDAIfNoMagma' and 'skipCPUIfNoLapack' decorators. These decorators are device-specific, so CPU testing is not skipped if Magma is not installed, and their conditions may be checked before or after the test case has been initialized. This means that skipCUDAIfNoMagma does not initialize CUDA. In fact, CUDA is only initialized if a CUDA test is run.
These instantiated tests may be run as usual and with pytest filtering it's easy to run one test on all device types, run all the tests for a particular device type, or run a device type and dtype combination.
See the note "Generic Device-Type Testing" for more detail.
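A rough sketch of the pattern described above (the module path and helper names are taken from this description and may not match the tree exactly):
```python
import unittest
import torch
from common_device_type import instantiate_device_type_tests, dtypes

class TestExampleDeviceType(unittest.TestCase):
    # instantiated as e.g. test_neg_cpu_torch.float and test_neg_cuda_torch.float
    @dtypes(torch.float, torch.double)
    def test_neg(self, device, dtype):
        t = torch.arange(-5, 5, device=device, dtype=dtype)
        self.assertTrue(torch.equal(t.neg(), -t))

# generates the per-device (and per-dtype) test variants so unittest/pytest can
# discover and filter them
instantiate_device_type_tests(TestExampleDeviceType, globals())

if __name__ == '__main__':
    unittest.main()
```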
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25967
Differential Revision: D17381987
Pulled By: mruberry
fbshipit-source-id: 4a639641130f0a59d22da0efe0951b24b5bc4bfb
Summary:
Because of 'return NotImplemented', `__contains__` returned True when the element is not a number, since
bool(NotImplemented) == True.
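For reference, the underlying Python pitfall (on the Python versions supported at the time; newer versions deprecate this conversion):
```
>>> bool(NotImplemented)
True
```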
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24156
Differential Revision: D16829895
Pulled By: zou3519
fbshipit-source-id: 9d3d58025b2b78b33a26fdfcfa6029d0d049f11f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25843
`tensor.align_to(*names)` permutes the dimensions of `tensor` and adds
additional 1-sized dimensions such that the output tensor has dimensions
in the same order as `names`. All dimensions of `tensor` must be
present in `names`, in addition, this function requires that all dims of
`tensor` be named.
`tensor.align_as(other)` is equivalent to
`tensor.align_to(*other.names)`.
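A minimal sketch of these semantics (dimension names are illustrative):
```python
import torch

x = torch.randn(2, 3, names=('C', 'N'))
# reorder to the requested name order; names missing from x become size-1 dims
y = x.align_to('N', 'C', 'H')
print(y.names, y.shape)      # ('N', 'C', 'H') torch.Size([3, 2, 1])

z = torch.randn(5, 2, 1, names=('N', 'C', 'H'))
print(x.align_as(z).names)   # ('N', 'C', 'H'), same as x.align_to(*z.names)
```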
I'm planning on changing `torch.align_tensors(*tensors)` to align closer
to these semantics because there didn't seem to be a clear use case for the old
semantics that preserve unnamed dimensions. That will come in a future
change.
Test Plan: - new tests [namedtensor ci]
Differential Revision: D17255549
Pulled By: zou3519
fbshipit-source-id: 1e437ad81e9359b4d5bd0e7e64c3a1be441fc3e3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25842
`tensor.refine_names(*names)` takes `tensor` and attempts to name its
dimensions `names` out-of-place. If a dimension `i` already had a name,
then it cannot be changed (so tensor.names[i] must equal names[i]);
if the original dimension did not have a name, then the new name
(names[i]) can be anything.
`tensor.refine_names(*names)` also accepts a glob '*' that greedily selects
names from `tensor`. Here are some examples:
- `Tensor[None].refine_names('N') -> Tensor[N]`
- `Tensor[N].refine_names('N') -> Tensor[N]`
- `Tensor[N].refine_names('D') -> Error!`
- `Tensor[N].refine_names(None) -> Error!`
- `Tensor[None, None].refine_names('*', D) -> Tensor[None, D]`
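The same examples written out in Python (a minimal sketch):
```python
import torch

x = torch.randn(2, 3)                  # names: (None, None)
print(x.refine_names('N', 'C').names)  # ('N', 'C'): unnamed dims accept any name
print(x.refine_names('*', 'C').names)  # (None, 'C'): '*' greedily keeps leading dims

y = torch.randn(2, names=('N',))
y.refine_names('N')                    # ok: the existing name matches
# y.refine_names('D') would raise, since an existing name cannot be changed
```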
Test Plan: - new tests [namedtensor ci]
Differential Revision: D17255548
Pulled By: zou3519
fbshipit-source-id: fdbdb3a12f24fbe37ce1e53ed09dc8a42589d928
Summary:
Changelog:
- De-duplicate the code in tests for torch.solve, torch.cholesky_solve, torch.triangular_solve
- Skip tests explicitly if requirements aren't met for e.g., if NumPy / SciPy aren't available in the environment
- Add generic helpers for these tests in test/common_utils.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25733
Test Plan:
- All tests should pass to confirm that the change is not erroneous
Clears one point specified in the discussion in https://github.com/pytorch/pytorch/issues/24333.
Differential Revision: D17315330
Pulled By: zou3519
fbshipit-source-id: c72a793e89af7e2cdb163521816d56747fd70a0e