Commit Graph

1776 Commits

Author SHA1 Message Date
Xiang Gao
903acc6b83 CUDA BFloat16 support of clamp, remainder, lshift, rshift (#45247)
Summary:
Add CUDA BFloat16 support of clamp, remainder, lshift, rshift

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45247

Reviewed By: dzhulgakov

Differential Revision: D24174258

Pulled By: ngimel

fbshipit-source-id: bfcd2d1b3746bb0527d590533f3c38b9c4d0a638
2020-10-07 20:37:06 -07:00
Vaidotas Simkus
e154b36685 Standardized clamp kernels to Numpy-like implementation (#43288)
Summary:
**BC-breaking note**

For ease of exposition let a_min be the value of the "min" argument to clamp, and a_max be the value of the "max" argument to clamp.

This PR changes the behavior of torch.clamp to always compute min(max(a, a_min), a_max). torch.clamp currently computes this in its vectorized CPU specializations:

78b95b6204/aten/src/ATen/cpu/vec256/vec256_double.h (L304)

but in other places it clamps differently:

78b95b6204/aten/src/ATen/cpu/vec256/vec256_base.h (L624)

78b95b6204/aten/src/ATen/native/cuda/UnaryOpsKernel.cu (L160)

These implementations are the same when a_min < a_max, but divergent when a_min > a_max. This divergence is easily triggered:

```
t = torch.arange(200).to(torch.float)
torch.clamp(t, 4, 2)[0]
: tensor(2.)

torch.clamp(t.cuda(), 4, 2)[0]
: tensor(4., device='cuda:0')

torch.clamp(torch.tensor(0), 4, 2)
: tensor(4)
```

This PR makes the behavior consistent with NumPy's clip. C++'s std::clamp's behavior is undefined when a_min > a_max, but Clang's std::clamp will return 10 in this case (although the program, per the above comment, is in error). Python has no standard clamp implementation.
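For illustration, a minimal sketch of the now-consistent behavior when a_min > a_max (matching numpy.clip), assuming a build that includes this change:

```python
import torch
import numpy as np

t = torch.arange(200, dtype=torch.float)

# min(max(a, 4), 2) always yields 2 when a_min=4 > a_max=2,
# now consistently on CPU, CUDA, and for zero-dim tensors.
print(torch.clamp(t, 4, 2)[0])             # tensor(2.)
print(torch.clamp(torch.tensor(0), 4, 2))  # tensor(2)
print(np.clip(np.arange(200), 4, 2)[0])    # 2
```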

**PR Summary**

Fixes the discrepancy between the AVX, CUDA, and base vector implementations of clamp, such that all implementations are consistent and use the min(max_vec, max(min_vec, x)) formula, thus making clamp equivalent to numpy.clip in all implementations.

The same fix as in https://github.com/pytorch/pytorch/issues/32587 but isolated to the kernel change only, so that the internal team can benchmark.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43288

Reviewed By: colesbury

Differential Revision: D24079453

Pulled By: mruberry

fbshipit-source-id: 67f30d2f2c86bbd3e87080b32f00e8fb131a53f7
2020-10-06 13:42:08 -07:00
KyleCZH
a9a9d0b181 Rocm skip test cases (#45782)
Summary:
Skip the following test cases for ROCm (when PYTORCH_TEST_WITH_ROCM=1):
- test_reference_numerics_tan_cuda_float64 (__main__.TestUnaryUfuncsCUDA)
- test_addmv_cuda_float16 (__main__.TestTorchDeviceTypeCUDA)
- test_logspace_cuda_float64 (__main__.TestTensorCreationCUDA)
- test_gloo_backend_2gpu_module (__main__.DistributedDataParallelTest)
cc jeffdaily pruthvistony

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45782

Reviewed By: VitalyFedyunin

Differential Revision: D24115581

Pulled By: xw285cornell

fbshipit-source-id: 4043a9fa19e242301b5007813c15b6b3873889c5
2020-10-05 15:12:25 -07:00
Xiang Gao
e1ff46b6e5 CUDA BFloat16 TopK (#44755)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44755

Reviewed By: mruberry

Differential Revision: D23741680

Pulled By: ngimel

fbshipit-source-id: 8fce92a26663336bcb831c72202fe2623a2ddaf0
2020-10-04 11:38:00 -07:00
Nikita Shulga
3a27fc966a Test torch.svd using complex float and double numbers (take 2) (#45795)
Summary:
Adds support for magmaSvd for complex numbers

Fixes use-after-free error in `apply_symeig`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45795

Reviewed By: ezyang

Differential Revision: D24096955

Pulled By: malfet

fbshipit-source-id: 0d8d8492f89fe722bbd5aed3528f244245b496d0
2020-10-03 11:33:28 -07:00
Nikita Shulga
5a47a2126d Revert D24018160: [pytorch][PR] Test torch.svd using complex float and double numbers
Test Plan: revert-hammer

Differential Revision:
D24018160 (888f3c12e7)

Original commit changeset: 1b6103f5af94

fbshipit-source-id: 3040250db25995fc0d41fd0f497550dded43cad9
2020-10-02 13:33:11 -07:00
Nikita Shulga
888f3c12e7 Test torch.svd using complex float and double numbers (#45572)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45572

Reviewed By: anjali411

Differential Revision: D24018160

Pulled By: malfet

fbshipit-source-id: 1b6103f5af94e9f74b73ed23aa02c0236b199b34
2020-10-02 08:29:14 -07:00
Ivan Yashchuk
77cd8e006b Added support for complex torch.symeig (#45121)
Summary:
This PR adds support for complex-valued input for `torch.symeig`.

TODO:
- [ ] complex cuda tests raise `RuntimeError: _th_bmm_out not supported on CUDAType for ComplexFloat`
Update: Added xfailing tests for complex dtypes on CUDA. Once support for complex `bmm` is added these tests will work.

Fixes https://github.com/pytorch/pytorch/issues/45061.
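A minimal sketch of the new complex path (assuming a build with this change; on CUDA the complex tests are still expected to fail until complex `bmm` lands, per the note above):

```python
import torch

a = torch.randn(4, 4, dtype=torch.complex128)
h = (a + a.conj().t()) / 2   # Hermitian input

# Eigenvalues of a Hermitian matrix are real, so evals is a float64 tensor.
evals, evecs = torch.symeig(h, eigenvectors=True)
print(evals.dtype, evecs.dtype)   # torch.float64 torch.complex128
```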

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45121

Reviewed By: mrshenli

Differential Revision: D24049649

Pulled By: anjali411

fbshipit-source-id: 2cd11f0e47d37c6ad96ec786762f2da57f25dac5
2020-10-01 08:57:13 -07:00
Nikita Shulga
c87ff2cb90 Enable transposed tensor copy for complex types (#45487)
Summary:
This enables a special copy operator for transposed tensors with more than 3600 elements:
417e3f85e5/aten/src/ATen/native/Copy.cpp (L19)

Steps to repro: `python -c "import torch; print(torch.svd(torch.randn(61, 61, dtype=torch.complex64)))"`

Fixes https://github.com/pytorch/pytorch/issues/45269

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45487

Reviewed By: anjali411

Differential Revision: D23984441

Pulled By: malfet

fbshipit-source-id: 10ce1d5f4425fb6de78e96adffd119e545b6624f
2020-09-29 19:22:05 -07:00
Mike Ruberry
b66ac1e928 Updates nonzero's as_tuple behavior to no longer warn. (#45413)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44284.

[torch.nonzero](https://pytorch.org/docs/master/generated/torch.nonzero.html?highlight=nonzero#torch.nonzero) is distinct from [numpy.nonzero](https://numpy.org/doc/1.18/reference/generated/numpy.nonzero.html?highlight=nonzero#numpy.nonzero). The former returns a tensor by default, and the latter returns a tuple of arrays. The `as_tuple` argument was added as part of an intended deprecation process to make torch.nonzero consistent with numpy.nonzero, but this was a confusing change for users. A better deprecation path would be to offer torch.argwhere consistent with [numpy.argwhere](https://numpy.org/doc/stable/reference/generated/numpy.argwhere.html?highlight=argwhere#numpy.argwhere), which is equivalent to the default torch.nonzero behavior. Once this is offered, a change to torch.nonzero should be more straightforward with less user disruption, if we decide that's the correct change to pursue.
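For reference, a small sketch of the difference described above (the as_tuple form mirrors numpy.nonzero, the default form mirrors numpy.argwhere):

```python
import torch
import numpy as np

t = torch.tensor([[1, 0], [0, 2]])

print(torch.nonzero(t))                  # default: one (n, ndim) tensor of indices
# tensor([[0, 0],
#         [1, 1]])
print(torch.nonzero(t, as_tuple=True))   # tuple of index tensors, like numpy.nonzero
# (tensor([0, 1]), tensor([0, 1]))
print(np.nonzero(t.numpy()))             # (array([0, 1]), array([0, 1]))
```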

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45413

Reviewed By: ngimel

Differential Revision: D23975015

Pulled By: mruberry

fbshipit-source-id: b59237d0d8c2df984e952b62d0a7c247b49d84dc
2020-09-29 12:16:59 -07:00
Mike Ruberry
b2925671b6 Updates deterministic flag to throw a warning, makes docs consistent (#45410)
Summary:
Per feedback in the recent design review. Also tweaks the documentation to clarify what "deterministic" means and adds a test for the behavior.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45410

Reviewed By: ngimel

Differential Revision: D23974988

Pulled By: mruberry

fbshipit-source-id: e48307da9c90418fc6834fbd67b963ba2fe0ba9d
2020-09-29 11:17:33 -07:00
Hong Xu
15f85eea18 Support bfloat16 and complex dtypes for logical_not (#43537)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43537

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23751950

Pulled By: mruberry

fbshipit-source-id: d07ecd9aae263eb8e00928d4fc981e0d66066fbb
2020-09-29 11:00:05 -07:00
Mike Ruberry
6d37126a10 Makes rdiv consistent with div (#45407)
Summary:
In addition to making rdiv consistent with div, this PR significantly expands division testing, accounting for floor_divide actually performing truncation division, too.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45407

Reviewed By: ngimel

Differential Revision: D23974967

Pulled By: mruberry

fbshipit-source-id: 82b46b07615603f161ab7cd1d3afaa6d886bfe95
2020-09-29 08:34:01 -07:00
Himangshu
7cde662f08 Add check for Complex Type to allow non integral alpha. (#45200)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45184

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45200

Reviewed By: gchanan

Differential Revision: D23940134

Pulled By: anjali411

fbshipit-source-id: cce7b1efc22ec189ba6c83e31ce712bb34997139
2020-09-29 07:36:46 -07:00
anjali411
534f2ae582 Disable inplace abs for complex tensors (#45069)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45069

`torch.abs` is a `C -> R` function for complex input. Following the general semantics in torch, the in-place version of abs should be disabled for complex input.
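A small sketch of the intended behavior (assuming a build with this change):

```python
import torch

z = torch.tensor([3 + 4j, -1 - 1j])

print(torch.abs(z))    # tensor([5.0000, 1.4142]) -- real-valued (C -> R) result

try:
    z.abs_()           # in-place abs cannot stay complex, so it is now rejected
except RuntimeError as e:
    print("in-place abs on a complex tensor raises:", e)
```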

Test Plan: Imported from OSS

Reviewed By: glaringlee, malfet

Differential Revision: D23818397

Pulled By: anjali411

fbshipit-source-id: b23b8d0981c53ba0557018824d42ed37ec13d4e2
2020-09-28 20:33:35 -07:00
Xiong Wei
0c8a6008ac Fix torch.pow when the scalar base is a complex number (#45259)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43829

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45259

Reviewed By: gchanan

Differential Revision: D23962073

Pulled By: anjali411

fbshipit-source-id: 1b16afbb98f33fa7bc53c6ca296c5ddfcbdd2b72
2020-09-28 18:25:53 -07:00
Xiang Gao
36c3fbc9e3 CUDA BFloat Conv (non-cuDNN) (#45007)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45007

Reviewed By: zou3519

Differential Revision: D23933174

Pulled By: ngimel

fbshipit-source-id: 84eb028f09c9197993fb9981c0efb535014e5f78
2020-09-28 11:42:42 -07:00
Mike Ruberry
8bdbedd4ee Revert "Updates and simplifies nonzero as_tuple behavior"
This reverts commit 8b143771d0.
2020-09-27 20:58:42 -07:00
Mike Ruberry
8b143771d0 Updates and simplifies nonzero as_tuple behavior 2020-09-27 20:56:30 -07:00
Xiong Wei
241afc9188 Migrate addr from the TH to Aten (CPU) (#44364)
Summary:
Related https://github.com/pytorch/pytorch/issues/24507
Fixes https://github.com/pytorch/pytorch/issues/24666

This PR modernizes the CPU implementation of the vector outer product.
The existing TH implementation of `torch.addr` is migrated to ATen; `torch.ger` uses the `addr` functions to compute the outer product.
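A small usage sketch of `torch.addr` and `torch.ger` (behavior is unchanged by the port; shown only for context):

```python
import torch

vec1 = torch.arange(3, dtype=torch.float)   # [0., 1., 2.]
vec2 = torch.arange(2, dtype=torch.float)   # [0., 1.]
M = torch.zeros(3, 2)

# addr computes beta * M + alpha * outer(vec1, vec2)
print(torch.addr(M, vec1, vec2, beta=1, alpha=2))
# tensor([[0., 0.],
#         [0., 2.],
#         [0., 4.]])

# ger is the plain outer product, implemented on top of addr
print(torch.ger(vec1, vec2))
```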

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44364

Reviewed By: ezyang

Differential Revision: D23866733

Pulled By: mruberry

fbshipit-source-id: 5159ea22f0e3c991123fe7c19cc9beb6ad00301e
2020-09-25 01:18:09 -07:00
Gao, Xiang
3f5eee666c Adjust TF32 tests (#44240)
Summary:
- The thresholds of some tests are bumped up. Depending on the random generator, these tests sometimes fail with things like "0.0059 is not smaller than 0.005". I ran `test_nn.py` and `test_torch.py` 10+ times to check that these are no longer flaky.
- Add `tf32_on_and_off` to new `matrix_exp` tests.
- Disable TF32 on test suites other than `test_nn.py` and `test_torch.py`

cc: ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44240

Reviewed By: mruberry

Differential Revision: D23882498

Pulled By: ngimel

fbshipit-source-id: 44a9ec08802c93a2efaf4e01d7487222478b6df8
2020-09-24 10:25:58 -07:00
Hong Xu
b470fa4500 Add complex number support for binary logical operators (#43174)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43174

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23684425

Pulled By: mruberry

fbshipit-source-id: 4857b16e18ec4c65327136badd7f04c74e32d330
2020-09-23 23:03:00 -07:00
kshitij12345
0b6b735863 [fix] type promotion atan2 (#43466)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43360

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43466

Reviewed By: malfet

Differential Revision: D23834928

Pulled By: mruberry

fbshipit-source-id: 2e7e0b4fcf1a846efc171c275d65a6daffd3c631
2020-09-23 22:23:05 -07:00
Ailing Zhang
9db3871288 Update true_divide_out to use at::. (#45079)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45079

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D23821701

Pulled By: ailzhang

fbshipit-source-id: 562eac10faba7a503eda0029a0b026c1fb85fe1e
2020-09-23 10:50:48 -07:00
Ivan Yashchuk
5b20bf4fd9 Added support for complex input for Cholesky decomposition (#44895)
Summary:
Cholesky decomposition now works for complex inputs.

Fixes https://github.com/pytorch/pytorch/issues/44637.
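A minimal sketch of the newly supported complex path (assuming a build with this change):

```python
import torch

# Build a Hermitian positive-definite matrix from a random complex matrix.
a = torch.randn(4, 4, dtype=torch.complex128)
h = a @ a.conj().t() + 4 * torch.eye(4, dtype=torch.complex128)

l = torch.cholesky(h)                        # lower-triangular Cholesky factor
print(torch.allclose(l @ l.conj().t(), h))   # True
```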

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44895

Reviewed By: ailzhang

Differential Revision: D23841583

Pulled By: anjali411

fbshipit-source-id: 3b1f34a7af17827884540696f8771a0d5b1df478
2020-09-23 08:25:56 -07:00
Xiang Gao
144dacd8d9 CUDA BFloat16 batched gemm (#45167)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45167

Reviewed By: mruberry

Differential Revision: D23860458

Pulled By: ngimel

fbshipit-source-id: 698de424a046963a30017b58d227fa510f85bf3f
2020-09-22 22:43:52 -07:00
Hong Xu
e2b40ce793 Support BFloat16 for binary logical operators on CUDA (#42485)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42485

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23684423

Pulled By: mruberry

fbshipit-source-id: edc2b46b726361d4c8bf8a4bf4e4a09197b20428
2020-09-22 11:42:34 -07:00
anjali411
58b6ab69e5 torch.sgn for complex tensors (#39955)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39955

resolves https://github.com/pytorch/pytorch/issues/36323 by adding `torch.sgn` for complex tensors.
`torch.sgn` returns `x/abs(x)` for `x != 0` and returns `0 + 0j` for `x==0`

This PR doesn't test the correctness of the gradients. It will be done as part of auditing all the ops in the future once we decide the autograd behavior (JAX vs TF) and add gradcheck.
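A small sketch of the semantics described above (assuming a build with this change):

```python
import torch

z = torch.tensor([3 + 4j, 0 + 0j])
print(torch.sgn(z))   # tensor([0.6000+0.8000j, 0.0000+0.0000j]) -- z / abs(z), with 0 mapped to 0

# For real tensors, sgn reduces to the usual sign.
x = torch.tensor([-1.5, 0.0, 2.0])
print(torch.sgn(x))   # tensor([-1., 0., 1.])
```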

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D23460526

Pulled By: anjali411

fbshipit-source-id: 70fc4e14e4d66196e27cf188e0422a335fc42f92
2020-09-22 08:24:53 -07:00
Gao, Xiang
dfb8f2d51f CUDA BFloat16 addmm, addmv (#44986)
Summary:
This PR was originally authored by slayton58. I took his implementation and added some tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44986

Reviewed By: mruberry

Differential Revision: D23806039

Pulled By: ngimel

fbshipit-source-id: 305d66029b426d8039fab3c3e011faf2bf87aead
2020-09-21 14:28:27 -07:00
Xiang Gao
581a364437 CUDA BFloat16 unary ops part 1 (#44813)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44813

Reviewed By: mruberry

Differential Revision: D23805816

Pulled By: ngimel

fbshipit-source-id: 28c645dc31f094c8b6c3d3803f0b4152f0475a64
2020-09-21 14:22:31 -07:00
Hong Xu
49db7b59e0 For logical tests, use the dtypes decorator (#42483)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42483

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23684424

Pulled By: mruberry

fbshipit-source-id: ba7ab5c3a6eaa0c16975728200f27d164ed4f852
2020-09-19 19:01:49 -07:00
Xiao Wang
d75c402755 Add cusolver to build, rewrite MAGMA inverse with cusolver (#42403)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42265

This PR adds cusolver to the pytorch build, and enables the use of cusolver/cublas library functions for GPU `torch.inverse` on certain tensor shapes.

Specifically, when

* the tensor is two dimensional (single batch), or
* has >2 dimensions (multiple batches) and `batch_size <= 2`, or
* magma is not linked,

cusolver/cublas will be used. In other conditions, the current implementation of MAGMA will still be used.

8c0949ae45/aten/src/ATen/native/cuda/BatchLinearAlgebra.cu (L742-L752)

The reason for this is that for tensors with large batch_size, `cublasXgetrfBatched` and `cublasXgetriBatched` don't perform very well. For `batch_size > 1`, we launch cusolver functions in multiple streams. This lets the cusolver functions run in parallel and can greatly increase performance. When `batch_size > 2`, the parallel-launched cusolver functions are slightly slower than the current magma implementation, so we still use the current magma impl.

On CUDA 9.2, there were some numerical issues detected, so cusolver impl will not be used. The cusolver impl will also not be used on platforms other than Nvidia CUDA.

060769feaf/aten/src/ATen/native/cuda/BatchLinearAlgebraLib.h (L10-L13)

Note that there is a new heuristic used before cusolver/cublas calls here:

8c0949ae45/aten/src/ATen/native/cuda/MiscUtils.h (L113-L121)

where `use_loop_launch = true` means launch single-batch cusolver functions in parallel, and `use_loop_launch = false` means use cublas_X_batched functions. When magma is enabled (only `batch_size <= 2` will be dispatched to cusolver/cublas), the heuristic always returns `true` and the cusolver calls are faster than small-batch_size magma calls. When magma is disabled, this adds the functionality of `torch.inverse`, which was previously disabled for all shapes (though large-batch_size cublas performance may not be as good as magma's).
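As a rough illustration of the shapes discussed above (a sketch only; which backend actually runs depends on the build and on the heuristics described here):

```python
import torch

single = torch.randn(512, 512, device='cuda')            # ndim == 2: cusolver path
small_batch = torch.randn(2, 512, 512, device='cuda')    # batch_size <= 2: cusolver/cublas path
large_batch = torch.randn(8, 512, 512, device='cuda')    # batch_size > 2: MAGMA path (when linked)

for t in (single, small_batch, large_batch):
    inv = torch.inverse(t)
    print(tuple(inv.shape))
```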

Checklist:
- [X] Add benchmark, cpu, gpu-before (magma), gpu-after (cusolver)
- [X] Rewrite single inverse (ndim == 2) with cusolver
- [X] Rewrite batched inverse (ndim > 2) with cublas
- [X] Add cusolver to build
- [x] Clean up functions related to `USE_MAGMA` define guard
- [x] Workaround for non-cuda platform
- [x] Workaround for cuda 9.2
- [x] Add zero size check
- [x] Add tests

Next step:

If cusolver doesn't cause any problem in pytorch build, and there are no major performance regressions reported after this PR being merged, I will start porting other cusolver/cublas functions for linear algebra to improve the performance.

<details>
<summary> benchmark 73499c6 </summary>

benchmark code: https://github.com/xwang233/code-snippet/blob/master/torch.inverse/inverse-cusolver.ipynb

shape meaning:

* `[] 2 torch.float32 -> torch.randn(2, 2, dtype=torch.float32)`
* `[2] 4 torch.float32 -> torch.randn(2, 4, 4, dtype=torch.float32)`

| shape | cpu_time (ms) | gpu_time_before (magma) (ms) | gpu_time_after (ms) |
| --- | --- | --- | --- |
| [] 2 torch.float32 |  0.095 |  7.534 |  0.129  |
| [] 4 torch.float32 |  0.009 |  7.522 |  0.129  |
| [] 8 torch.float32 |  0.011 |  7.647 |  0.138  |
| [] 16 torch.float32 |  0.075 |  7.582 |  0.135  |
| [] 32 torch.float32 |  0.073 |  7.573 |  0.191  |
| [] 64 torch.float32 |  0.134 |  7.694 |  0.288  |
| [] 128 torch.float32 |  0.398 |  8.073 |  0.491  |
| [] 256 torch.float32 |  1.054 |  11.860 |  1.074  |
| [] 512 torch.float32 |  5.218 |  14.130 |  2.582  |
| [] 1024 torch.float32 |  19.010 |  18.780 |  6.936  |
| [1] 2 torch.float32 |  0.009 |  0.113 |  0.128 ***regressed |
| [1] 4 torch.float32 |  0.009 |  0.113 |  0.131 ***regressed |
| [1] 8 torch.float32 |  0.011 |  0.116 |  0.129 ***regressed |
| [1] 16 torch.float32 |  0.015 |  0.122 |  0.135 ***regressed |
| [1] 32 torch.float32 |  0.032 |  0.177 |  0.178 ***regressed |
| [1] 64 torch.float32 |  0.070 |  0.420 |  0.281  |
| [1] 128 torch.float32 |  0.328 |  0.816 |  0.490  |
| [1] 256 torch.float32 |  1.125 |  1.690 |  1.084  |
| [1] 512 torch.float32 |  4.344 |  4.305 |  2.576  |
| [1] 1024 torch.float32 |  16.510 |  16.340 |  6.928  |
| [2] 2 torch.float32 |  0.009 |  0.113 |  0.186 ***regressed |
| [2] 4 torch.float32 |  0.011 |  0.115 |  0.184 ***regressed |
| [2] 8 torch.float32 |  0.012 |  0.114 |  0.184 ***regressed |
| [2] 16 torch.float32 |  0.019 |  0.119 |  0.173 ***regressed |
| [2] 32 torch.float32 |  0.050 |  0.170 |  0.240 ***regressed |
| [2] 64 torch.float32 |  0.120 |  0.429 |  0.375  |
| [2] 128 torch.float32 |  0.576 |  0.830 |  0.675  |
| [2] 256 torch.float32 |  2.021 |  1.748 |  1.451  |
| [2] 512 torch.float32 |  9.070 |  4.749 |  3.539  |
| [2] 1024 torch.float32 |  33.655 |  18.240 |  12.220  |
| [4] 2 torch.float32 |  0.009 |  0.112 |  0.318 ***regressed |
| [4] 4 torch.float32 |  0.010 |  0.115 |  0.319 ***regressed |
| [4] 8 torch.float32 |  0.013 |  0.115 |  0.320 ***regressed |
| [4] 16 torch.float32 |  0.027 |  0.120 |  0.331 ***regressed |
| [4] 32 torch.float32 |  0.085 |  0.173 |  0.385 ***regressed |
| [4] 64 torch.float32 |  0.221 |  0.431 |  0.646 ***regressed |
| [4] 128 torch.float32 |  1.102 |  0.834 |  1.055 ***regressed |
| [4] 256 torch.float32 |  4.042 |  1.811 |  2.054 ***regressed |
| [4] 512 torch.float32 |  18.390 |  4.884 |  5.087 ***regressed |
| [4] 1024 torch.float32 |  69.025 |  19.840 |  20.000 ***regressed |

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42403

Reviewed By: ailzhang, mruberry

Differential Revision: D23717984

Pulled By: ngimel

fbshipit-source-id: 54cbd9ea72a97989cff4127089938e8a8e29a72b
2020-09-18 20:43:29 -07:00
Gao, Xiang
e255a4e1fd Enable bfloat16 random kernels on Windows (#44918)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/33793

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44918

Reviewed By: pbelevich

Differential Revision: D23777548

Pulled By: ngimel

fbshipit-source-id: 9cf13166d7deba17bc72e402b82ed0afe347cb9b
2020-09-18 15:55:32 -07:00
Xiang Gao
7bd8a6913d CUDA BFloat div, addcdiv, addcmul, mean, var (#44758)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44758

Reviewed By: mruberry

Differential Revision: D23752317

Pulled By: ngimel

fbshipit-source-id: 77992cf991f4e2b4b6839de73ea7e6ce2e1061c6
2020-09-18 11:51:11 -07:00
Xiang Gao
f5440a448a CUDA BFloat16 i0 support (#44750)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44750

Reviewed By: glaringlee

Differential Revision: D23764383

Pulled By: ngimel

fbshipit-source-id: d0e784d89241e8028f97766fdac51fe1ab4c188c
2020-09-17 13:30:10 -07:00
Xiang Gao
c189328e5d CUDA BFloat16 unary ops part 2 (#44824)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44824

Reviewed By: mruberry

Differential Revision: D23752360

Pulled By: ngimel

fbshipit-source-id: 3aadaf9db9d4e4937aa38671e8589ecbeece709d
2020-09-17 10:57:43 -07:00
vfdev
24df3b7373 torch.empty_like and torch.zeros_like raise error if any memory format is provided with sparse input (#43699) (#44058)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43699

- Changed the order of `TORCH_CHECK` and `if (options.layout() == kSparse && self.is_sparse())`
inside `empty_like` method.

- [x] Added tests

EDIT:

More details on why we cannot take the `zeros_like` approach.
Python code :
```python
res = torch.zeros_like(input_coalesced, memory_format=torch.preserve_format)
```
is routed to
```c++
// TensorFactories.cpp
Tensor zeros_like(
    const Tensor& self,
    const TensorOptions& options,
    c10::optional<c10::MemoryFormat> optional_memory_format) {
  if (options.layout() == kSparse && self.is_sparse()) {
    auto res = at::empty({0}, options); // to be resized
    res.sparse_resize_and_clear_(
        self.sizes(), self.sparse_dim(), self.dense_dim());
    return res;
  }
  auto result = at::empty_like(self, options, optional_memory_format);
  return result.zero_();
}
```
and control reaches the `if (options.layout() == kSparse && self.is_sparse())` branch directly.

When we call in Python
```python
res = torch.empty_like(input_coalesced, memory_format=torch.preserve_format)
```
it is routed to
```c++
Tensor empty_like(
    const Tensor& self,
    const TensorOptions& options_,
    c10::optional<c10::MemoryFormat> optional_memory_format) {
  TORCH_CHECK(
    !(options_.has_memory_format() && optional_memory_format.has_value()),
    "Cannot set memory_format both in TensorOptions and explicit argument; please delete "
    "the redundant setter.");
  TensorOptions options =
      self.options()
          .merge_in(options_)
          .merge_in(TensorOptions().memory_format(optional_memory_format));
  TORCH_CHECK(
      !(options.layout() != kStrided &&
          optional_memory_format.has_value()),
      "memory format option is only supported by strided tensors");
  if (options.layout() == kSparse && self.is_sparse()) {
    auto result = at::empty({0}, options); // to be resized
    result.sparse_resize_and_clear_(
        self.sizes(), self.sparse_dim(), self.dense_dim());
    return result;
  }
```

cc pearu

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44058

Reviewed By: albanD

Differential Revision: D23672494

Pulled By: mruberry

fbshipit-source-id: af232274dd2b516dd6e875fc986e3090fa285658
2020-09-17 10:25:31 -07:00
Heitor Schueroff de Souza
28085cbd39 Fixed quantile nan propagation and implemented nanquantile (#44393)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44393

torch.quantile now correctly propagates NaN, and torch.nanquantile is implemented similarly to numpy.nanquantile.
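A small sketch of the two behaviors (assuming a build with this change):

```python
import torch

t = torch.tensor([1., 2., float('nan'), 4.])

print(torch.quantile(t, 0.5))     # tensor(nan) -- NaN now propagates
print(torch.nanquantile(t, 0.5))  # tensor(2.) -- NaN values are ignored
```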

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D23649613

Pulled By: heitorschueroff

fbshipit-source-id: 5201d076745ae1237cedc7631c28cf446be99936
2020-09-17 05:53:25 -07:00
Sameer Deshmukh
e18a2219dd Implement scatter reductions (CUDA), remove divide/subtract (#41977)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/33394 .

This PR does two things:
1. Implement CUDA scatter reductions with revamped GPU atomic operations.
2. Remove support for divide and subtract for CPU reduction as was discussed with ngimel .

I've also updated the docs to reflect the existence of only multiply and add.
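A minimal sketch of the surviving reductions (only "add" and "multiply"), using the in-place scatter:

```python
import torch

base = torch.ones(3, 4)
index = torch.tensor([[0, 1, 2, 0]])
src = torch.tensor([[2., 3., 4., 5.]])

# For dim=0: base[index[0][j], j] += src[0][j]
base.scatter_(0, index, src, reduce='add')
print(base)
# tensor([[3., 1., 1., 6.],
#         [1., 4., 1., 1.],
#         [1., 1., 5., 1.]])
```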

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41977

Reviewed By: mruberry

Differential Revision: D23748888

Pulled By: ngimel

fbshipit-source-id: ea643c0da03c9058e433de96db02b503514c4e9c
2020-09-16 23:25:21 -07:00
Muthu Arivoli
b61d3d8be8 Implement torch.kaiser_window (#44271)
Summary:
Related to https://github.com/pytorch/pytorch/issues/38349

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44271

Reviewed By: ngimel

Differential Revision: D23727972

Pulled By: mruberry

fbshipit-source-id: b4c931b2eb3a536231ad6d6c3cb66e52a13286ac
2020-09-16 20:41:31 -07:00
Xiang Gao
34331b0e0f CUDA BFloat16 and other improvements on abs (#44804)
Summary:
Not sure if ROCm supports `std::abs` today, let's see the CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44804

Reviewed By: mruberry

Differential Revision: D23748837

Pulled By: ngimel

fbshipit-source-id: ccf4e63279f3e5927a85d8d8f70ba4b8c334156b
2020-09-16 20:37:07 -07:00
Ivan Yashchuk
07d9cc80a4 Fix error code checks for triangular_solve (CPU) (#44720)
Summary:
Added missing error checks for the CPU version of `triangular_solve`.
Fixes https://github.com/pytorch/pytorch/issues/43141.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44720

Reviewed By: mruberry

Differential Revision: D23733400

Pulled By: ngimel

fbshipit-source-id: 9837e01b04a6bfd9181e08d46bf96329f292cae0
2020-09-16 13:54:45 -07:00
Natalia Gimelshein
e6101f5507 fixes lda condition for blas functions, fixes bug with beta=0 in addmv slow path (#44681)
Summary:
Per title. If `beta=0` and the slow path was taken, `nan` and `inf` in the result were not masked, as they are in other linear algebra functions. Similarly, since `mv` is implemented as `addmv` with `beta=0`, wrong results were sometimes produced by the `mv` slow path.
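A small sketch of the case being fixed (assuming a build with this change):

```python
import torch

A = torch.randn(3, 4)
x = torch.randn(4)
y = torch.full((3,), float('inf'))

# With beta=0 the inf/nan content of y must be ignored, not propagated.
out = torch.addmv(y, A, x, beta=0, alpha=1)
print(torch.isfinite(out).all())            # tensor(True)

# mv is implemented as addmv with beta=0, so it is covered by the same fix.
print(torch.allclose(out, torch.mv(A, x)))  # True
```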

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44681

Reviewed By: mruberry

Differential Revision: D23708653

Pulled By: ngimel

fbshipit-source-id: e2d5d3e6f69b194eb29b327e1c6f70035f3b231c
2020-09-16 11:47:56 -07:00
Xiang Gao
ee493e1a91 CUDA bfloat compare ops (#44748)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44748

Reviewed By: mruberry

Differential Revision: D23725997

Pulled By: ngimel

fbshipit-source-id: 4f89dce3a8b8f1295ced522011b59e60d756e749
2020-09-16 11:32:14 -07:00
Xiang Gao
06036f76b6 CUDA BFloat16 pow (#44760)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44760

Reviewed By: ngimel

Differential Revision: D23727936

Pulled By: mruberry

fbshipit-source-id: 8aa89e989294347d7f593b1a63ce4a1dbfdf783e
2020-09-16 10:01:21 -07:00
Mike Ruberry
686e281bcf Updates div to perform true division (#42907)
Summary:
This PR:

- updates div to perform true division
- makes torch.true_divide an alias of torch.div

This follows on work in previous PyTorch releases that first deprecated div performing "integer" or "floor" division, then prevented it by throwing a runtime error.
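A small sketch of the updated semantics (assuming a build with this change):

```python
import torch

a = torch.tensor([5, 3])
b = torch.tensor([2, 2])

print(torch.div(a, b))           # tensor([2.5000, 1.5000]) -- true division, even for integer inputs
print(torch.true_divide(a, b))   # same result; true_divide is now an alias of div
print(torch.floor_divide(a, b))  # tensor([2, 1]) for the old "floor"/truncation behavior
```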

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42907

Reviewed By: ngimel

Differential Revision: D23622114

Pulled By: mruberry

fbshipit-source-id: 414c7e3c1a662a6c3c731ad99cc942507d843927
2020-09-14 15:50:38 -07:00
kshitij12345
c68a99bd61 [numpy] Add torch.exp2 (#44184)
Summary:
Reference https://github.com/pytorch/pytorch/issues/42515

TODO
* [x] Add tests
* [x] Add docs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44184

Reviewed By: ngimel

Differential Revision: D23674237

Pulled By: mruberry

fbshipit-source-id: 7f4fb1900fad3051cd7fc9d3d7f6d985c5fb093c
2020-09-14 04:05:37 -07:00
kshitij12345
42f9f2f38f [fix] ReduceOps throw error if dim is repeated (#44281)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44273

TODO

* [x] Add test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44281

Reviewed By: zhangguanheng66

Differential Revision: D23569004

Pulled By: ezyang

fbshipit-source-id: 1ca6523fef168c8ce252aeb7ca418be346b297bf
2020-09-11 15:34:06 -07:00
guol-fnst
b6b1c01adf torch.view_as_complex fails with segfault for a zero dimensional tensor (#44175)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44061

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44175

Reviewed By: colesbury

Differential Revision: D23628103

Pulled By: anjali411

fbshipit-source-id: 6f70b5824150121a1617c0757499832923ae02b5
2020-09-11 08:35:49 -07:00
Xiao Wang
b5d75dddd9 Enable lerp on half type; fix output memory format (#43541)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43541

Reviewed By: zou3519

Differential Revision: D23499592

Pulled By: ezyang

fbshipit-source-id: 9efdd6cbf0a334ec035ddd467667ba874b892549
2020-09-10 21:50:35 -07:00
Peter Bell
129d52aef2 Fix uniqueness check in movedim (#44307)
Summary:
Noticed this bug in `torch.movedim` (https://github.com/pytorch/pytorch/issues/41480). [`std::unique`](https://en.cppreference.com/w/cpp/algorithm/unique) only guarantees uniqueness for _sorted_ inputs. The current check lets through non-unique values when they aren't adjacent to each other in the list, e.g. `(0, 1, 0)` wouldn't raise an exception and instead the algorithm fails later with an internal assert.
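A small sketch of the failure mode being fixed (assuming a build with this change):

```python
import torch

t = torch.randn(2, 3, 4)

# Non-adjacent duplicates such as (0, 1, 0) previously slipped past the check
# and failed later with an internal assert; now they raise a clear error.
try:
    torch.movedim(t, (0, 1, 0), (0, 1, 2))
except RuntimeError as e:
    print("repeated source dims rejected:", e)
```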

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44307

Reviewed By: mrshenli

Differential Revision: D23598311

Pulled By: zou3519

fbshipit-source-id: fd6cc43877c42bb243cfa85341c564b6c758a1bf
2020-09-10 17:41:07 -07:00
Mike Ruberry
c48f511c7e Moves some of TestTorchMathOps to OpInfos (#44277)
Summary:
This PR fixes three OpInfo-related bugs and moves some functions from TestTorchMathOps to be tested using the OpInfo pattern. The bugs are:

- A skip test path in test_ops.py incorrectly formatted its string argument
- Decorating the tests in common_device_type.py was incorrectly always applying decorators to the original test, not the op-specific variant of the test. This could cause the same decorator to be applied multiple times, overriding past applications.
- make_tensor was incorrectly constructing tensors in some cases

The functions moved are:

- asin
- asinh
- sinh
- acosh
- tan
- atan
- atanh
- tanh
- log
- log10
- log1p
- log2

In a follow-up PR more or all of the remaining functions in TestTorchMathOps will be refactored as OpInfo-based tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44277

Reviewed By: mrshenli, ngimel

Differential Revision: D23617361

Pulled By: mruberry

fbshipit-source-id: edb292947769967de9383f6a84eb327f027509e0
2020-09-10 17:31:50 -07:00
Kurt Mohler
28a23fce4c Deprecate torch.norm and torch.functional.norm (#44321)
Summary:
Part of https://github.com/pytorch/pytorch/issues/24802

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44321

Reviewed By: mrshenli

Differential Revision: D23617273

Pulled By: mruberry

fbshipit-source-id: 6f88b5cb097fd0acb9cf0e415172c5a86f94e9f2
2020-09-10 01:16:41 -07:00
Elias Ellison
e0c65abd38 Revert D23568330: [pytorch][PR] Moves some of TestTorchMathOps to OpInfos
Test Plan: revert-hammer

Differential Revision:
D23568330 (a953a825cc)

Original commit changeset: 03e69fccdbfd

fbshipit-source-id: 04ec6843c5eb3c84ddf226dad0088172d9bed84d
2020-09-09 15:48:56 -07:00
mattip
758c2b96f5 BUG: make cholesky_solve_out do broadcast, error checking (#43137)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42695

Tests and fixes `cholesky_solve_out` to use the error checking and broadcasting from `cholesky_solve`. The test segfaults before the fix and passes after it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43137

Reviewed By: izdeby

Differential Revision: D23568589

Pulled By: malfet

fbshipit-source-id: 41b67ba964b55e59f1897eef0d96e0f6e1725bef
2020-09-09 11:38:36 -07:00
Mike Ruberry
a953a825cc Moves some of TestTorchMathOps to OpInfos (#44277)
Summary:
This PR fixes three OpInfo-related bugs and moves some functions from TestTorchMathOps to be tested using the OpInfo pattern. The bugs are:

- A skip test path in test_ops.py incorrectly formatted its string argument
- Decorating the tests in common_device_type.py was incorrectly always applying decorators to the original test, not the op-specific variant of the test. This could cause the same decorator to be applied multiple times, overriding past applications.
- make_tensor was incorrectly constructing tensors in some cases

The functions moved are:

- asin
- asinh
- sinh
- acosh
- tan
- atan
- atanh
- tanh
- log
- log10
- log1p
- log2

In a follow-up PR more or all of the remaining functions in TestTorchMathOps will be refactored as OpInfo-based tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44277

Reviewed By: ngimel

Differential Revision: D23568330

Pulled By: mruberry

fbshipit-source-id: 03e69fccdbfd560217c34ce4e9a5f20e10d05a5e
2020-09-09 09:41:03 -07:00
Natalia Gimelshein
ecc6358dbe Port nonzero cuda from THC to ATen (#44259)
Summary:
1) Ports nonzero from THC to ATen
2) Replaces most thrust uses with cub, to avoid synchronization and to improve performance. There is still one necessary synchronization point, communicating the number of nonzero elements from GPU to CPU
3) Slightly changes the algorithm: we now first compute the number of nonzeros and then allocate a correct-sized output, instead of allocating a full-sized output (to account for possibly all elements being non-zero) as was done before
4) Unfortunately, since the last transforms are still done with thrust, 2) is slightly beside the point; however, it is a step towards a future without thrust
5) Hard limits the number of elements in the input tensor to MAX_INT. The previous implementation allocated a Long tensor of size ndim*nelements, which would be at least 16 GB for a tensor with MAX_INT elements. It is reasonable to say that larger tensors could not be used anyway.

Benchmarking is done for tensors with approximately half non-zeros
<details><summary>Benchmarking script</summary>
<p>

```
import torch
from torch.utils._benchmark import Timer
from torch.utils._benchmark import Compare
import sys

device = "cuda"
results = []
for numel in (1024 * 128,):#, 1024 * 1024, 1024 * 1024 * 128):
    inp = torch.randint(2, (numel,), device="cuda", dtype=torch.float)
    for ndim in range(2,3):#(1,4):
        if ndim == 1:
            shape = (numel,)
        elif ndim == 2:
            shape = (1024, numel // 1024)
        else:
            shape = (1024, 128, numel // 1024 // 128)
        inp = inp.reshape(shape)
        repeats = 3
        timer = Timer(stmt="torch.nonzero(inp, as_tuple=False)", label="Nonzero", sub_label=f"number of elts {numel}",
        description = f"ndim {ndim}", globals=globals())
        for i in range(repeats):
            results.append(timer.blocked_autorange())
        print(f"\rnumel {numel} ndim {ndim}", end="")
        sys.stdout.flush()

comparison = Compare(results)
comparison.print()
```
</p>
</details>

### Results
Before:
```
[--------------------------- Nonzero ---------------------------]
                                 |  ndim 1  |   ndim 2  |   ndim 3
 1 threads: ------------------------------------------------------
       number of elts 131072     |    55.2  |     71.7  |     90.5
       number of elts 1048576    |   113.2  |    250.7  |    497.0
       number of elts 134217728  |  8353.7  |  23809.2  |  54602.3

 Times are in microseconds (us).
```
After:
```
[-------------------------- Nonzero --------------------------]
                                |  ndim 1  |  ndim 2  |  ndim 3
1 threads: ----------------------------------------------------
      number of elts 131072     |    48.6  |    79.1  |    90.2
      number of elts 1048576    |    64.7  |   134.2  |   161.1
      number of elts 134217728  |  3748.8  |  7881.3  |  9953.7

Times are in microseconds (us).

```
There's a real regression for smallish 2D tensors due to the added work of computing the number of nonzero elements; however, for other sizes there are significant gains, and the memory requirements are drastically lower. Perf gains would be even larger for tensors with fewer nonzeros.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44259

Reviewed By: izdeby

Differential Revision: D23581955

Pulled By: ngimel

fbshipit-source-id: 0b99a767fd60d674003d83f0848dc550d7a363dc
2020-09-08 20:52:51 -07:00
Mike Ruberry
bb861e1d69 Ports CUDA var and std reduce all (with no out argument) to ATen, fixes var docs (#43858)
Summary:
When var and std are called without args (other than unbiased) they currently call into TH or THC. This PR:

- Removes the THC var_all and std_all functions and updates CUDA var and std to use the ATen reduction
- Fixes var's docs, which listed its arguments in the incorrect order
- Adds new tests comparing var and std with their NumPy counterparts

Performance appears to have improved as a result of this change. I ran experiments on 1D tensors, 1D tensors with every other element viewed ([::2]), 2D tensors and 2D transposed tensors. Some notable datapoints:

- torch.randn((8000, 8000))
  - var measured 0.0022215843200683594s on CUDA before the change
  - var measured 0.0020322799682617188s on CUDA after the change
- torch.randn((8000, 8000)).T
  - var measured .015128850936889648 on CUDA before the change
  - var measured 0.001912832260131836 on CUDA after the change
- torch.randn(8000 ** 2)
  - std measured 0.11031460762023926 on CUDA before the change
  - std measured 0.0017833709716796875 on CUDA after the change

Timings for var and std are, as expected, similar.

On the CPU, however, the performance change from making the analogous update was more complicated, and ngimel and I decided not to remove CPU var_all and std_all. ngimel wrote the following script that showcases how single-threaded CPU inference would suffer from this change:

```
import torch
import numpy as np
from torch.utils._benchmark import Timer
from torch.utils._benchmark import Compare
import sys
base = 8
multiplier = 1

def stdfn(a):
    meanv = a.mean()
    ac = a-meanv
    return torch.sqrt(((ac*ac).sum())/a.numel())

results = []
num_threads=1
for _ in range(7):
    size = base*multiplier
    input = torch.randn(size)

    tasks = [("torch.var(input)", "torch_var"),
             ("torch.var(input, dim=0)", "torch_var0"),
             ("stdfn(input)", "stdfn"),
             ("torch.sum(input, dim=0)", "torch_sum0")
            ]
    timers = [Timer(stmt=stmt, num_threads=num_threads, label="Index", sub_label=f"{size}",
    description=label, globals=globals()) for stmt, label in tasks]
    repeats = 3

    for i, timer in enumerate(timers * repeats):
        results.append(
            timer.blocked_autorange()
        )
        print(f"\r{i + 1} / {len(timers) * repeats}", end="")
        sys.stdout.flush()
    multiplier *=10
print()

comparison = Compare(results)

comparison.print()
```

The TH timings using this script on my devfair are:

```
[------------------------------ Index ------------------------------]
        | torch_var | torch_var0 |  stdfn  | torch_sum0
1 threads: ----------------------------------------------------------
   8    |   16.0  |    5.6  |   40.9 |    5.0
   80    |   15.9  |    6.1  |   41.6 |    4.9
   800   |   16.7  |   12.0  |   42.3 |    5.0
   8000   |   27.2  |   72.7  |   51.5 |    6.2
   80000  |   129.0  |   715.0  |  133.0 |   18.0
   800000  |  1099.8  |  6961.2  |  842.0 |   112.6
   8000000 |  11879.8  |  68948.5  | 20138.4 |  1750.3
```

and the ATen timings are:

```
[------------------------------ Index ------------------------------]
               |  torch_var  |  torch_var0  |   stdfn   |  torch_sum0
1 threads: ----------------------------------------------------------
      8              |       4.3   |       5.4    |     41.4  |       5.4
      80            |       4.9   |       5.7    |     42.6  |       5.4
      800          |      10.7   |      11.7    |     43.3  |       5.5
      8000        |      69.3   |      72.2    |     52.8  |       6.6
      80000      |     679.1   |     676.3    |    129.5  |      18.1
      800000    |    6770.8   |    6728.8    |    819.8  |     109.7
      8000000  |   65928.2   |   65538.7    |  19408.7  |    1699.4
```

which demonstrates that performance is analogous to calling the existing var and std with `dim=0` on a 1D tensor. This would be a significant performance hit. Another simple script shows the performance is mixed when using multiple threads, too:

```
import torch
import time

# Benchmarking var and std, 1D with varying sizes
base = 8
multiplier = 1

op = torch.var
reps = 1000

for _ in range(7):
    size = base * multiplier
    t = torch.randn(size)
    elapsed = 0
    for _ in range(reps):
        start = time.time()
        op(t)
        end = time.time()
        elapsed += end - start
    multiplier *= 10

    print("Size: ", size)
    print("Avg. elapsed time: ", elapsed / reps)
```

```
var cpu TH vs ATen timings

Size:  8
Avg. elapsed time:  1.7853736877441406e-05 vs 4.9788951873779295e-06 (ATen wins)
Size:  80
Avg. elapsed time:  1.7803430557250977e-05 vs 6.156444549560547e-06 (ATen wins)
Size:  800
Avg. elapsed time:  1.8569469451904296e-05 vs 1.2302875518798827e-05 (ATen wins)
Size:  8000
Avg. elapsed time:  2.8756141662597655e-05 vs. 6.97789192199707e-05 (TH wins)
Size:  80000
Avg. elapsed time:  0.00026622867584228516 vs. 0.0002447957992553711 (ATen wins)
Size:  800000
Avg. elapsed time:  0.0010556647777557374 vs 0.00030616092681884767 (ATen wins)
Size:  8000000
Avg. elapsed time:  0.009990205764770508 vs 0.002938544034957886 (ATen wins)

std cpu TH vs ATen timings

Size:  8
Avg. elapsed time:  1.6681909561157225e-05 vs. 4.659652709960938e-06 (ATen wins)
Size:  80
Avg. elapsed time:  1.699185371398926e-05 vs. 5.431413650512695e-06 (ATen wins)
Size:  800
Avg. elapsed time:  1.768803596496582e-05 vs. 1.1279821395874023e-05 (ATen wins)
Size:  8000
Avg. elapsed time:  2.7791500091552735e-05  vs 7.031106948852539e-05 (TH wins)
Size:  80000
Avg. elapsed time:  0.00018650460243225096 vs 0.00024368906021118164 (TH wins)
Size:  800000
Avg. elapsed time:  0.0010522041320800782 vs 0.0003039860725402832 (ATen wins)
Size:  8000000
Avg. elapsed time:  0.009976618766784668 vs. 0.0029211788177490234 (ATen wins)
```

These results show the TH solution still performs better than the ATen solution with default threading for some sizes.

It seems like removing CPU var_all and std_all will require an improvement in ATen reductions. https://github.com/pytorch/pytorch/issues/40570 has been updated with this information.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43858

Reviewed By: zou3519

Differential Revision: D23498981

Pulled By: mruberry

fbshipit-source-id: 34bee046c4872d11c3f2ffa1b5beee8968b22050
2020-09-06 09:40:54 -07:00
Muthu Arivoli
719d29dab5 Implement torch.i0 and torch.kaiser_window (#43132)
Summary:
Related to https://github.com/pytorch/pytorch/issues/38349

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43132

Reviewed By: smessmer

Differential Revision: D23479072

Pulled By: mruberry

fbshipit-source-id: 4fb1de44830771c6a7222cf19f7728d9ac7c043b
2020-09-05 23:11:47 -07:00
Gao, Xiang
5a0d65b06b Further expand coverage of addmm/addmv, fix 0 stride (#43980)
Summary:
- test beta=0, self=nan
- test transposes
- fixes broadcasting of addmv
- not supporting tf32 yet, will do it in future PR together with other testing fixes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43980

Reviewed By: mruberry

Differential Revision: D23507559

Pulled By: ngimel

fbshipit-source-id: 14ee39d1a0e13b9482932bede3fccb61fe6d086d
2020-09-04 23:03:23 -07:00
yangu
6cecf7ec68 Enable test_cublas_config_deterministic_error for windows (#42796)
Summary:
test_cublas_config_deterministic_error can pass on Windows, so enable it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42796

Reviewed By: seemethere

Differential Revision: D23520002

Pulled By: malfet

fbshipit-source-id: eccedbbf202b1cada795071a34e266b2c635c2cf
2020-09-04 09:52:57 -07:00
Xiang Gao
bc45c47aa3 Expand the coverage of test_addmm and test_addmm_sizes (#43831)
Summary:
- This test is very fast and very important, so it makes no sense to mark it as slowTest
- This test should also run on CUDA
- This test should check alpha and beta support
- This test should check `out=` support
- manual computation should use a list instead of index_put because a list is much faster
- precision for TF32 needs to be fixed. Will do it in a future PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43831

Reviewed By: ailzhang

Differential Revision: D23435032

Pulled By: ngimel

fbshipit-source-id: d1b8350addf1e2fe180fdf3df243f38d95aa3f5a
2020-09-02 20:51:49 -07:00
Vasiliy Kuznetsov
6a6552576d rename _min_max to _aminmax (#44001)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44001

This is to align with the naming in numpy and in
https://github.com/pytorch/pytorch/pull/43092

Test Plan:
```
python test/test_torch.py TestTorchDeviceTypeCPU.test_aminmax_cpu_float32
python test/test_torch.py TestTorchDeviceTypeCUDA.test_aminmax_cuda_float32
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23465298

fbshipit-source-id: b599035507156cefa53942db05f93242a21c8d06
2020-09-02 18:07:55 -07:00
Vasiliy Kuznetsov
486a9fdab2 _min_max.dim: CUDA implementation (#42943)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42943

Adds a CUDA kernel for _min_max_val.dim

Test Plan:
correctness:
```
python test/test_torch.py TestTorchDeviceTypeCUDA.test_minmax_cuda_float32
```

performance: ~50% savings on a tensor representative of quantization workloads: https://gist.github.com/vkuzo/3e16c645e07a79dd66bcd50629ff5db0

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23086797

fbshipit-source-id: 04a2d310f64a388d48ab8131538dbd287900ca4a
2020-09-02 18:07:51 -07:00
Vasiliy Kuznetsov
834279f4ab _min_max_val.dim: CPU implementation (#42894)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42894

Continuing the min_max kernel implementation, this PR adds the
CPU path when a dim is specified.  Next PR will replicate for CUDA.

Note: after a discussion with ngimel, we are taking the fast path
of calculating the values only and not the indices, since that is what
is needed for quantization, and calculating indices would require support
for reductions on 4 outputs which is additional work.  So, the API
doesn't fully match `min.dim` and `max.dim`.

Flexible on the name, let me know if something else is better.

Test Plan:
correctness:
```
python test/test_torch.py TestTorchDeviceTypeCPU.test_minmax_cpu_float32
```

performance: seeing a 49% speedup on a min+max tensor with similar shapes
to what we care about for quantization observers (bench:
https://gist.github.com/vkuzo/b3f24d67060e916128a51777f9b89326). For
other shapes (more dims, different dim sizes, etc), I've noticed a
speedup as low as 20%, but we don't have a good use case to optimize
that so perhaps we can save that for a future PR.

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23086798

fbshipit-source-id: b24ce827d179191c30eccf31ab0b2b76139b0ad5
2020-09-02 18:07:47 -07:00
Vasiliy Kuznetsov
78994d165f min_max kernel: add CUDA (#42868)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42868

Adds a CUDA kernel for the _min_max function.

Note: this is a re-submit of https://github.com/pytorch/pytorch/pull/41805; it was faster to resubmit than to resurrect that one. Thanks to durumu for writing the original implementation!

Future PRs will add index support, docs, and hook this up to observers.

Test Plan:
```
python test/test_torch.py TestTorchDeviceTypeCUDA.test_minmax_cuda_float32
```

Basic benchmarking shows a 50% reduction in time to calculate min + max:
https://gist.github.com/vkuzo/b7dd91196345ad8bce77f2e700f10cf9

TODO

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23057766

fbshipit-source-id: 70644d2471cf5dae0a69343fba614fb486bb0891
2020-09-02 18:06:03 -07:00
anjali411
129f406062 Make torch.conj() a composite function and return self for real tensors (#43270)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43270

`torch.conj` is a very commonly used operator for complex tensors, but it's mathematically a no-op for real tensors. Switching to tensorflow gradients for complex tensors (as discussed in #41857) would involve adding `torch.conj()` to the backward definitions for a lot of operators. In order to preserve autograd performance for real tensors and maintain numpy compatibility for `torch.conj`, this PR updates `torch.conj()` so that it behaves the same for complex tensors but returns the `self` tensor (a view-like no-op) for tensors of non-complex dtypes. The documentation states that the returned tensor for a real input shouldn't be mutated. We could perhaps return an immutable tensor for this case in the future when that functionality is available (zdevito ezyang).
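A small sketch of the resulting behavior (assuming a build with this change):

```python
import torch

z = torch.tensor([1 + 2j, 3 - 4j])
x = torch.tensor([1., 2., 3.])

print(torch.conj(z))   # tensor([1.-2.j, 3.+4.j])

# For real dtypes conj is a no-op that shares storage with the input,
# so the result must not be mutated (per the note above).
print(torch.conj(x).data_ptr() == x.data_ptr())   # True
```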

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D23460493

Pulled By: anjali411

fbshipit-source-id: 3b3bf0af55423b77ff2d0e29f5d2c160291ae3d9
2020-09-02 17:06:04 -07:00
kshitij12345
b6b5ebc345 Add torch.vdot (#43004)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42747
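A small sketch of the new op (assuming a build with this change); vdot conjugates its first argument, following NumPy:

```python
import torch

a = torch.tensor([1 + 2j, 3 + 4j])
b = torch.tensor([5 + 6j, 7 + 8j])

print(torch.vdot(a, b))       # tensor(70.-8.j)
print((a.conj() * b).sum())   # same value: conj(a) . b
```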

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43004

Reviewed By: mruberry

Differential Revision: D23318935

Pulled By: anjali411

fbshipit-source-id: 12d4824b7cb42bb9ca703172c54ec5c663d9e325
2020-09-02 09:00:30 -07:00
Peter Bell
c88ac25679 Check for internal memory overlap in some indexing-type functions (#43423)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43423

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23298652

Pulled By: zou3519

fbshipit-source-id: c13c59aec0c6967ef0d6365d782c1f4c98c04227
2020-09-02 08:51:50 -07:00
Peter Bell
5807bb92d3 TensorIteratorConfig: Check memory overlap by default (#43422)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43422

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D23298653

Pulled By: zou3519

fbshipit-source-id: a7b66a8a828f4b35e31e8be0c07e7fe9339181f2
2020-09-02 08:50:29 -07:00
Hong Xu
4bb5d33076 is_numpy_scalar should also consider bool and complex types (#43644)
Summary:
Before this PR,

```python
import torch
import numpy as np

a = torch.tensor([1, 2], dtype=torch.bool)
c = np.array([1, 2], dtype=np.bool)
print(a[0] == c[0])

a = torch.tensor([1, 2], dtype=torch.complex64)
c = np.array([1, 2], dtype=np.complex64)
print(a[0] == c[0])

 # This case is still broken
a = torch.tensor([1 + 1j, 2 + 2j], dtype=torch.complex64)
c = np.array([1 + 1j, 2 + 2j], dtype=np.complex64)
print(a[0] == c[0])
```

outputs

```
False
False
False
```

After this PR, it outputs:

```
tensor(True)
/home/user/src/pytorch/torch/tensor.py:25: ComplexWarning: Casting complex values to real discards the imaginary part
  return f(*args, **kwargs)
tensor(True)
tensor(False)
```

Related issue: https://github.com/pytorch/pytorch/issues/43579

cc anjali411 mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43644

Reviewed By: ailzhang

Differential Revision: D23425569

Pulled By: anjali411

fbshipit-source-id: a868209376b30cea601295e54015c47803923054
2020-09-02 07:41:50 -07:00
Xiang Gao
b1f19c20d6 Run function check and out check in TestTensorDeviceOps (#43830)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43830

Reviewed By: ailzhang

Differential Revision: D23438101

Pulled By: mruberry

fbshipit-source-id: b581ce779ea2f50ea8dfec51d5469031ec7a0a67
2020-09-01 08:21:53 -07:00
kiyosora
3682df77db Implementing NumPy-like function torch.heaviside() (#42523)
Summary:
- Related to https://github.com/pytorch/pytorch/issues/38349
- Implementing the NumPy-like function `torch.heaviside()`.
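A small sketch of the new function (assuming a build with this change):

```python
import torch

input = torch.tensor([-1.5, 0.0, 2.0])
values = torch.tensor([0.5])   # value to use where input == 0

print(torch.heaviside(input, values))   # tensor([0.0000, 0.5000, 1.0000])
```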

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42523

Reviewed By: ngimel

Differential Revision: D23416743

Pulled By: mruberry

fbshipit-source-id: 9975bd9c9fa73bd0958fe9879f79a692aeb722d5
2020-08-31 15:54:56 -07:00
kshitij12345
0394c5a283 [fix] torch.multinomial : fix for 0 size dim (#43775)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43768

TO-DO:
* [x] Add test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43775

Reviewed By: ZolotukhinM

Differential Revision: D23421979

Pulled By: ngimel

fbshipit-source-id: 949fcdd30f18d17ae1c372fa6ca6a0b8d0d538ce
2020-08-31 11:57:42 -07:00
Xiang Gao
4ef12be900 Add __complex__ (#43844)
Summary:
fixes https://github.com/pytorch/pytorch/issues/43833

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43844

Reviewed By: ZolotukhinM

Differential Revision: D23422000

Pulled By: ngimel

fbshipit-source-id: ebc6a27a9b04c77c3977e6c184cefce9e817cc2f
2020-08-31 11:39:41 -07:00
Gao, Xiang
c5d0f091b2 addmm/addmv should accept complex alpha and beta (#43827)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43827

Reviewed By: malfet

Differential Revision: D23415869

Pulled By: ngimel

fbshipit-source-id: a47b76df5fb751f76d36697f5fd95c69dd3a6efe
2020-08-31 11:35:58 -07:00
Xiang Gao
a860be898e [resubmit] Add amax/amin (#43819)
Summary:
Resubmit for landing next week.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43819

Reviewed By: ngimel

Differential Revision: D23421906

Pulled By: mruberry

fbshipit-source-id: 23dd60d1e365bb1197d660c3bfad7ee07ba3e97f
2020-08-31 04:54:48 -07:00
Jeff Daily
8fb7c50250 Enable complex blas for ROCm. (#43744)
Summary:
Revert "Skips some complex tests on ROCm (https://github.com/pytorch/pytorch/issues/42759)".  This reverts commit 55b1706775.

Use new cuda_to_hip_mappings.py from https://github.com/pytorch/pytorch/issues/43004.

Fixes https://github.com/pytorch/pytorch/pull/42383#issuecomment-670771922

CC sunway513

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43744

Reviewed By: glaringlee

Differential Revision: D23391263

Pulled By: ngimel

fbshipit-source-id: ddf734cea3ba69c24f0d79cf1b87c05cdb45ec3d
2020-08-30 22:43:54 -07:00
Xiang Gao
550fb2fd52 Expand the coverage of test_blas_empty (#43822)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43822

Reviewed By: mruberry

Differential Revision: D23413359

Pulled By: ngimel

fbshipit-source-id: fcdb337e32ed2d1c791fa0762d5233b346b26d14
2020-08-29 12:13:15 -07:00
Nikita Shulga
d10056652b Enable torch.half for lt and masked_select (#43704)
Summary:
Enable testing of those options in `TestTorchDeviceTypeCPU.test_logical_cpu` and `TestTorchDeviceTypeCPU.test_masked_select_cpu_float16`
Add `view_as_real` testing for `torch.complex32` type

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43704

Reviewed By: albanD

Differential Revision: D23373070

Pulled By: malfet

fbshipit-source-id: 00f17f23b48513379a414227aea91e2d3c0dd5f9
2020-08-29 02:37:26 -07:00
Nikita Shulga
64906497cd Revert D23391941: [pytorch][PR] Implementing NumPy-like function torch.heaviside()
Test Plan: revert-hammer

Differential Revision:
D23391941 (a1eae6d158)

Original commit changeset: 7b942321a625

fbshipit-source-id: c2a7418a1fedaa9493300945c30e2392fc0d08ee
2020-08-28 19:16:58 -07:00
Kurt Mohler
68b9daa9bf Add torch.linalg.norm (#42749)
Summary:
Adds `torch.linalg.norm` function that matches the behavior of `numpy.linalg.norm`.

Additional changes:
* Add support for dimension wrapping in `frobenius_norm` and `nuclear_norm`
* Fix `out` argument behavior for `nuclear_norm`
* Fix issue where `frobenius_norm` allowed duplicates in `dim` argument
* Add `_norm_matrix`

Closes https://github.com/pytorch/pytorch/issues/24802
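
A quick illustrative sketch of the new API (matching `numpy.linalg.norm` semantics):
```python
import torch

A = torch.arange(9, dtype=torch.float32).reshape(3, 3) - 4
torch.linalg.norm(A)              # Frobenius norm of the matrix
torch.linalg.norm(A, ord='nuc')   # nuclear norm
torch.linalg.norm(A, dim=1)       # vector 2-norm of each row
```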

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42749

Reviewed By: ngimel

Differential Revision: D23336234

Pulled By: mruberry

fbshipit-source-id: f0aba3089a3a0bf856aa9c4215e673ff34228fac
2020-08-28 18:28:33 -07:00
kiyosora
a1eae6d158 Implementing NumPy-like function torch.heaviside() (#42523)
Summary:
- Related with https://github.com/pytorch/pytorch/issues/38349
- Implementing the NumPy-like function `torch.heaviside()`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42523

Reviewed By: glaringlee

Differential Revision: D23391941

Pulled By: mruberry

fbshipit-source-id: 7b942321a62567a5fc0a3679a289f4c4c19e6134
2020-08-28 18:11:20 -07:00
Nikita Shulga
3f0120edb4 Revert D23360705: [pytorch][PR] Add amax/amin
Test Plan: revert-hammer

Differential Revision:
D23360705 (bcec8cc3f9)

Original commit changeset: 5bdeb08a2465

fbshipit-source-id: 76a9e199823c7585e55328bad0778bcd8cd49381
2020-08-28 18:01:25 -07:00
Gao, Xiang
bcec8cc3f9 Add amax/amin (#43092)
Summary:
Add a max/min operator that only return values.

## Some important decisions to discuss
| **Question**                          | **Current State** |
|---------------------------------------|-------------------|
| Expose torch.max_values to python?    | No                |
| Remove max_values and only keep amax? | Yes               |
| Should amax support named tensors?    | Not in this PR    |

## Numpy compatibility

Reference: https://numpy.org/doc/stable/reference/generated/numpy.amax.html

| Parameter                                                                                                                                                                                                                                              | PyTorch Behavior                                                                  |
|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------|
| `axis`:  None or int or tuple of ints, optional. Axis or axes along which to operate. By default, flattened input is used. If this is a tuple of ints, the maximum is selected over multiple axes, instead of a single axis or all the axes as before. | Named `dim`, behavior same as `torch.sum` (https://github.com/pytorch/pytorch/issues/29137)                                |
| `out`: ndarray, optional. Alternative output array in which to place the result. Must be of the same shape and buffer length as the expected output.                                                                                                   | Same                                                                              |
| `keepdims`: bool, optional. If this is set to True, the axes which are reduced are left in the result as dimensions with size one. With this option, the result will broadcast correctly against the input array.                                      | implemented as `keepdim`                                                          |
| `initial`: scalar, optional. The minimum value of an output element. Must be present to allow computation on empty slice.                                                                                                                              | Not implemented in this PR. Better to implement for all reductions in the future. |
| `where`: array_like of bool, optional. Elements to compare for the maximum.                                                                                                                                                                            | Not implemented in this PR. Better to implement for all reductions in the future. |

**Note from numpy:**
> NaN values are propagated, that is if at least one item is NaN, the corresponding max value will be NaN as well. To ignore NaN values (MATLAB behavior), please use nanmax.

PyTorch has the same behavior
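
A short illustrative sketch of the behavior described above:
```python
import torch

x = torch.tensor([[1., 3.], [2., float('nan')]])
torch.amax(x, dim=0)                 # tensor([2., nan]) -- NaN propagates
torch.amax(x, dim=(0, 1))            # reduces over multiple dims, like np.amax
torch.amin(x, dim=1, keepdim=True)   # keepdim mirrors numpy's keepdims
```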

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43092

Reviewed By: ngimel

Differential Revision: D23360705

Pulled By: mruberry

fbshipit-source-id: 5bdeb08a2465836764a5a6fc1a6cc370ae1ec09d
2020-08-28 12:51:03 -07:00
Peter Bell
c177d25edf TensorIterator: Check for memory overlap in all nullary_ops (#43421)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43421

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23298654

Pulled By: zou3519

fbshipit-source-id: 71b401f6ea1e3b50b830fef650927cc5b3fb940f
2020-08-28 08:40:25 -07:00
Peter Bell
dc0722e9b7 TensorIterator: Check for memory overlap in all compare_ops (#43420)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43420

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D23298650

Pulled By: zou3519

fbshipit-source-id: 171cd17a3012880a5d248ffd0ea6942fbfb6606f
2020-08-28 08:40:22 -07:00
Peter Bell
065ebdb92f TensorIterator: Check for memory overlap in all binary_ops (#43419)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43419

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23298655

Pulled By: zou3519

fbshipit-source-id: 82e0ff308a6a7e46b4342d57ddb4c1d73745411a
2020-08-28 08:40:19 -07:00
kshitij12345
c7787f7fbf [numpy compatibility]Fix argmin/argmax when multiple max/min values (#42004)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41998
Fixes https://github.com/pytorch/pytorch/issues/22853

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42004

Reviewed By: ngimel

Differential Revision: D23049003

Pulled By: mruberry

fbshipit-source-id: a6fddbadfec4b8696730550859395ce4f0cf50d6
2020-08-28 06:42:42 -07:00
kshitij12345
01b5c06254 [fix] handle empty args in chain_matmul (#43553)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41817

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43553

Reviewed By: agolynski

Differential Revision: D23342586

Pulled By: mruberry

fbshipit-source-id: c6349f8fa9fcefcf03681d92c085a21265d1e690
2020-08-26 18:54:46 -07:00
Xiong Wei
033b7ae3ef implement NumPy-like functionality maximum, minimum (#42579)
Summary:
Related to https://github.com/pytorch/pytorch/issues/38349

Implement NumPy-like functions `maximum` and `minimum`.
The `maximum` and `minimum` functions compare input tensors element-wise, returning a new tensor with the element-wise maxima/minima.

If one of the elements being compared is a NaN, then that element is returned. Neither `maximum` nor `minimum` supports complex inputs.

This PR also updates the binary overloads of `torch.max` and `torch.min` by re-dispatching them to `torch.maximum` and `torch.minimum`.
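
A quick illustrative sketch of the NaN-propagating semantics:
```python
import torch

a = torch.tensor([1., float('nan'), 3.])
b = torch.tensor([2., 0., 2.])
torch.maximum(a, b)   # tensor([2., nan, 3.])
torch.minimum(a, b)   # tensor([1., nan, 2.])
```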

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42579

Reviewed By: mrshenli

Differential Revision: D23153081

Pulled By: mruberry

fbshipit-source-id: 803506c912440326d06faa1b71964ec06775eac1
2020-08-26 16:56:12 -07:00
Gao, Xiang
88e35fb8bd Skip SVD tests when no lapack (#43566)
Summary:
These tests are failing on one of my systems, which does not have LAPACK.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43566

Reviewed By: ZolotukhinM

Differential Revision: D23325378

Pulled By: mruberry

fbshipit-source-id: 5d795e460df0a2a06b37182d3d4084d8c5c8e751
2020-08-26 15:58:31 -07:00
Mike Ruberry
4dc8f3be8c Creates test_tensor_creation_ops.py test suite (#43104)
Summary:
As part of our continued refactoring of test_torch.py, this takes tests for tensor creation ops like torch.eye, torch.randint, and torch.ones_like and puts them in test_tensor_creation_ops.py. There are three test classes in the new test suite: TestTensorCreation, TestRandomTensorCreation, and TestLikeTensorCreation. TestViewOps and tests for construction of tensors from NumPy arrays have been left in test_torch.py. These might be refactored separately into test_view_ops.py and test_numpy_interop.py in the future.

Most of the tests ported from test_torch.py were left as is or received a signature change to make them nominally "device generic." Future work will need to review test coverage and update the tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43104

Reviewed By: ngimel

Differential Revision: D23280358

Pulled By: mruberry

fbshipit-source-id: 469325dd1a734509dd478cc7fe0413e276ffb192
2020-08-22 23:18:54 -07:00
XiaobingSuper
98307a2821 Fix bfloat16 erfinv get incorrect value problem for cpu path (#43399)
Summary:
Fix https://github.com/pytorch/pytorch/issues/43344

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43399

Reviewed By: albanD

Differential Revision: D23264789

Pulled By: pbelevich

fbshipit-source-id: 8b77c0f6ca44346e44599844fb1e172fdbd9df6c
2020-08-21 19:59:37 -07:00
Mike Ruberry
3aec1185e0 Enables bfloat16 x [float16, complex64, complex128] type promotion (#43324)
Summary:
Implements bfloat16 type promotion consistent with JAX (see https://jax.readthedocs.io/en/latest/type_promotion.html), addressing issue https://github.com/pytorch/pytorch/issues/43049.

- bfloat16 x float16 -> float32
- bfloat16 x complex64 -> complex64
- bfloat16 x complex128 -> complex128
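
A quick sketch of the resulting promotion rules (illustrative):
```python
import torch

torch.promote_types(torch.bfloat16, torch.float16)    # torch.float32
torch.promote_types(torch.bfloat16, torch.complex64)  # torch.complex64
torch.result_type(torch.ones(1, dtype=torch.bfloat16),
                  torch.ones(1, dtype=torch.float16)) # torch.float32
```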

Existing tests, after updates, are sufficient to validate the new behavior.

cc xuhdev

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43324

Reviewed By: albanD

Differential Revision: D23259823

Pulled By: mruberry

fbshipit-source-id: ca9c2c7d0325faced1f884f3c37edf8fa8c8b089
2020-08-21 10:48:04 -07:00
Mike Ruberry
c64594f5cc Extends test_unary_ufunc.py with numerics, contiguity, domain tests (#42965)
Summary:
This PR:

- ports the tests in TestTorchMathOps to test_unary_ufuncs.py
- removes duplicative tests for the tested unary ufuncs from test_torch.py
- adds a new test, test_reference_numerics, that validates the behavior of our unary ufuncs vs. reference implementations on empty, scalar, 1D, and 2D tensors that are contiguous, discontiguous, and that contain extremal values, for every dtype the unary ufunc supports
- adds support for skipping tests by regex, this behavior is used to make the test suite pass on Windows, MacOS, and ROCm builds, which have a variety of issues, and on Linux builds (see https://github.com/pytorch/pytorch/issues/42952)
- adds a new OpInfo helper, `supports_dtype`, to facilitate test writing
- extends unary ufunc op info to include reference, domain, and extremal value handling information
- adds OpInfos for `torch.acos` and `torch.sin`

These improvements reveal that our testing has been incomplete on several systems, especially with larger float values and complex values, and several TODOs have been added for follow-up investigations. Luckily when writing tests that cover many ops we can afford to spend additional time crafting the tests and ensuring coverage.

Follow-up PRs will:

- refactor TestTorchMathOps into test_unary_ufuncs.py
- continue porting tests from test_torch.py to test_unary_ufuncs.py (where appropriate)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42965

Reviewed By: pbelevich

Differential Revision: D23238083

Pulled By: mruberry

fbshipit-source-id: c6be317551453aaebae9d144f4ef472f0b3d08eb
2020-08-20 22:02:00 -07:00
Nikita Shulga
e10aa47615 Fix at::native::view_as_real() for ComplexHalf Tensors (#43279)
Summary:
Add a ComplexHalf case to toValueType, which fixes the logic for how view_as_real and view_as_complex slice a complex tensor into a floating-point one; this is used to generate tensors of random complex values, see:
018b4d7abb/aten/src/ATen/native/DistributionTemplates.h (L200)
Also add the ability to convert a Python complex object to `c10::complex<at::Half>`

Add `torch.half` and `torch.complex32` to the list of `test_randn` dtypes

Fixes https://github.com/pytorch/pytorch/issues/43143

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43279

Reviewed By: mrshenli

Differential Revision: D23230296

Pulled By: malfet

fbshipit-source-id: b4bb66c4c81dd867e72ab7c4563d73f6a4d80a44
2020-08-20 17:38:06 -07:00
Natalia Gimelshein
c8bc298d6c streamline stride propagation logic in TensorIterator (#42922)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41314 among other things.
This PR streamlines layout propagation logic in TensorIterator and removes almost all cases of channels-last hardcoding. The new rules and changes are as follows:
1) behavior of undefined `output` and defined output of the wrong (e.g. 0) size is always the same (before this PR the behavior was divergent)
2) in obvious cases (unary operation on memory-dense tensors, binary operations on memory-dense tensors with the same layout) strides are propagated (before propagation was inconsistent) (see footnote)
3) in other cases the output permutation is obtained as the inverse permutation of sorting the inputs by strides. Sorting is done with a comparator obeying the following rules: strides of broadcasted dimensions are set to 0, and 0 compares equal to anything. Strides of non-broadcasted dimensions (including dimensions of size `1`) participate in sorting. Precedence is given to the first input: in case of a tie in the first input, the corresponding dimensions of that input are considered first, and if that does not indicate that a swap is needed, strides of the same dimension in subsequent inputs are considered. See changes in `reorder_dimensions` and `compute_strides`. Note that first inspecting dimensions of the first input allows us to better recover its permutation (we select this behavior because it more reliably propagates channels-last strides), but in some rare cases it could result in a worse traversal order for the second tensor.

These rules are enough to recover previously hard-coded behavior related to channels last, so all existing tests are passing.
In general, these rules will produce intuitive results, and in most cases permutation of the full size input (in case of broadcasted operation) will be recovered, or permutation of the first input (in case of same sized inputs) will be recovered, including cases with trivial (1) dimensions. As an example of the latter, the following tensor
```
x=torch.randn(2,1,3).permute(1,0,2)
```
will produce output with the same stride (3,3,1) in binary operations with 1d tensor. Another example is a tensor of size N1H1 that has strides `H,H,1,1` when contiguous and `H, 1, 1, 1` when channels-last. The output retains these strides in binary operations when another 1d tensor is broadcasted on this one.

Footnote: for ambiguous cases where all inputs are memory dense and have the same physical layout that can nevertheless correspond to different permutations, such as NC11-sized physically contiguous tensors, a regular contiguous tensor is returned, and thus the permutation information of the input is lost (so for NC11 the channels-last input had strides `C, 1, C, C`, but the output will have strides `C, 1, 1, 1`). This behavior is unchanged from before and consistent with numpy, but it still makes sense to change it. The blocker for doing so currently is the performance of `empty_strided`. Once we make it on par with `empty` we should be able to propagate layouts in these cases. For now, to avoid slowing down the common contiguous case, we default to contiguous.
The table below shows how in some cases current behavior loses permutation/stride information, whereas new behavior propagates permutation.
| code                                                                                                                                                                                           | old                                                   | new                                                  |
|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------|------------------------------------------------------|
| #strided tensors<br>a=torch.randn(2,3,8)[:,:,::2].permute(2,0,1)<br>print(a.stride())<br>print(a.exp().stride())<br>print((a+a).stride())<br>out = torch.empty(0)<br>torch.add(a,a,out=out)<br>print(out.stride()) | (2, 24, 8) <br>(6, 3, 1) <br>(1, 12, 4) <br>(6, 3, 1) | (2, 24, 8)<br>(1, 12, 4)<br>(1, 12, 4)<br>(1, 12, 4) |
| #memory dense tensors<br>a=torch.randn(3,1,1).as_strided((3,1,1), (1,3,3))<br>print(a.stride(), (a+torch.randn(1)).stride())<br>a=torch.randn(2,3,4).permute(2,0,1)<br>print(a.stride())<br>print(a.exp().stride())<br>print((a+a).stride())<br>out = torch.empty(0)<br>torch.add(a,a,out=out)<br>print(out.stride())                                                                                                                                                                                               |  (1, 3, 3) (1, 1, 1)<br>(1, 12, 4)<br>(6, 3, 1)<br>(1, 12, 4)<br>(6, 3, 1)                                                       |  (1, 3, 3) (1, 3, 3)<br>(1, 12, 4)<br>(1, 12, 4)<br>(1, 12, 4)<br>(1, 12, 4) |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42922

Reviewed By: ezyang

Differential Revision: D23148204

Pulled By: ngimel

fbshipit-source-id: 670fb6188c7288e506e5ee488a0e11efc8442d1f
2020-08-20 10:50:35 -07:00
Nikita Vedeneev
888ae1b3d8 Introducing Matrix exponential (#40161)
Summary:
Implements (batched) matrix exponential. Fixes [https://github.com/pytorch/pytorch/issues/9983](https://github.com/pytorch/pytorch/issues/9983).

The algorithm follows:
```
 Bader, P.; Blanes, S.; Casas, F.
 Computing the Matrix Exponential with an Optimized Taylor Polynomial Approximation.
 Mathematics 2019, 7, 1174.
```
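
A minimal usage sketch (illustrative; for a diagonal matrix the result is just the elementwise exp on the diagonal):
```python
import torch

d = torch.tensor([1., 2., 3.])
torch.matrix_exp(torch.diag(d))          # equals torch.diag(torch.exp(d))
torch.matrix_exp(torch.randn(4, 3, 3))   # batched: one matrix exponential per 3x3 slice
```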

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40161

Reviewed By: zhangguanheng66

Differential Revision: D22951372

Pulled By: ezyang

fbshipit-source-id: aa068cb76d5cf71696b333d3e72cee287b3089e3
2020-08-18 14:15:10 -07:00
anjali411
aab66602c4 Add torch.dot for complex tensors (#42745)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42745

Test Plan: Imported from OSS

Reviewed By: izdeby

Differential Revision: D23056382

Pulled By: anjali411

fbshipit-source-id: c97f15e057095f78069844dbe0299c14104d2fce
2020-08-17 09:05:41 -07:00
Xiaomeng Yang
4ae832e106 Optimize SiLU (Swish) op in PyTorch (#42976)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42976

Optimize SiLU (Swish) op in PyTorch.

Some benchmark result

input = torch.rand(1024, 32768, dtype=torch.float, device="cpu")
forward: 221ms -> 133ms
backward: 600ms -> 170ms

input = torch.rand(1024, 32768, dtype=torch.double, device="cpu")
forward: 479ms -> 297ms
backward: 1438ms -> 387ms

input = torch.rand(8192, 32768, dtype=torch.float, device="cuda")
forward: 24.34ms -> 9.83ms
backward: 97.05ms -> 29.03ms

input = torch.rand(4096, 32768, dtype=torch.double, device="cuda")
forward: 44.24ms -> 30.15ms
backward: 126.21ms -> 49.68ms
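
For reference, a minimal sketch of the op being benchmarked (SiLU(x) = x * sigmoid(x)):
```python
import torch
import torch.nn.functional as F

x = torch.randn(4)
y = F.silu(x)
torch.allclose(y, x * torch.sigmoid(x))  # True
```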

Test Plan: buck test mode/dev-nosan //caffe2/test:nn -- "SiLU"

Reviewed By: houseroad

Differential Revision: D23093593

fbshipit-source-id: 1ba7b95d5926c4527216ed211a5ff1cefa3d3bfd
2020-08-16 13:21:57 -07:00
Muthu Arivoli
5bcf9b017a Implement hstack, vstack, dstack (#42799)
Summary:
Related to https://github.com/pytorch/pytorch/issues/38349
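
A quick illustrative sketch of the three new functions:
```python
import torch

a = torch.tensor([1, 2, 3])
b = torch.tensor([4, 5, 6])
torch.hstack((a, b)).shape   # torch.Size([6])
torch.vstack((a, b)).shape   # torch.Size([2, 3])
torch.dstack((a, b)).shape   # torch.Size([1, 3, 2])
```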

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42799

Reviewed By: izdeby

Differential Revision: D23140704

Pulled By: mruberry

fbshipit-source-id: 6a36363562c50d0abce87021b84b194bb32825fb
2020-08-15 20:39:14 -07:00
ita
91b090ceaf Add polygamma where n >= 2 (#42499)
Summary:
https://github.com/pytorch/pytorch/issues/40980

I had a few questions while implementing the polygamma function, so I made this PR before completing it.

1. Some code blocks were brought in from the Cephes library (and I did the same):
```
/*
 * The following function comes with the following copyright notice.
 * It has been released under the BSD license.
 *
 * Cephes Math Library Release 2.8:  June, 2000
 * Copyright 1984, 1987, 1992, 2000 by Stephen L. Moshier
 */
```
Is it okay for me to use the Cephes code with this same copyright notice (which is already in the PyTorch codebase)?

2. There is no linting for the internal ATen library (as far as I know; I read https://github.com/pytorch/pytorch/blob/master/CONTRIBUTING.md).
How do I make sure my code follows the appropriate guidelines of this library?

3. There are already digamma and trigamma functions.
digamma is still needed, but the trigamma function becomes redundant once polygamma is added.
Is it okay for trigamma to stay, or should it be removed?

By the way, the CPU version now works fine with the 3rd-order polygamma (which is what we need to play with variational inference with beta/gamma distributions), and I'm going to finish the GPU version soon.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42499

Reviewed By: gchanan

Differential Revision: D23110016

Pulled By: albanD

fbshipit-source-id: 246f4c2b755a99d9e18a15fcd1a24e3df5e0b53e
2020-08-14 17:00:24 -07:00
Muthu Arivoli
b8102b1550 Implement torch.nextafter (#42580)
Summary:
Related to https://github.com/pytorch/pytorch/issues/38349.
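
A minimal usage sketch (illustrative):
```python
import torch

torch.nextafter(torch.tensor([1.0]), torch.tensor([2.0]))  # next float after 1.0 toward 2.0 (1 + eps)
torch.nextafter(torch.tensor([1.0]), torch.tensor([0.0]))  # next float below 1.0
```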

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42580

Reviewed By: smessmer

Differential Revision: D23012260

Pulled By: mruberry

fbshipit-source-id: ce82a63c4ad407ec6ffea795f575ca7c58cd6137
2020-08-14 00:35:30 -07:00
Will Gan
e4373083a2 torch.complex and torch.polar (#39617)
Summary:
For https://github.com/pytorch/pytorch/issues/35312 and https://github.com/pytorch/pytorch/issues/38458#issuecomment-636066256.
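
A quick illustrative sketch of the two constructors:
```python
import math
import torch

torch.complex(torch.tensor([1.0, 0.0]), torch.tensor([0.0, 1.0]))        # tensor([1.+0.j, 0.+1.j])
torch.polar(torch.tensor([1.0, 2.0]), torch.tensor([0.0, math.pi / 2]))  # r * exp(i * theta)
```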

Pull Request resolved: https://github.com/pytorch/pytorch/pull/39617

Reviewed By: zhangguanheng66

Differential Revision: D23083926

Pulled By: anjali411

fbshipit-source-id: 1874378001efe2ff286096eaf1e92afe91c55b29
2020-08-14 00:30:11 -07:00
Natalia Gimelshein
f373cda021 Revert D22994446: [pytorch][PR] CUDA reduction: allow outputs to have different strides
Test Plan: revert-hammer

Differential Revision:
D22994446 (7f3f5020e6)

Original commit changeset: cc60beebad2e

fbshipit-source-id: f4635deac386db0c161f910760cace09f15a1ff9
2020-08-12 17:05:04 -07:00
Muthu Arivoli
92885ebe16 Implement hypot (#42291)
Summary:
Related to https://github.com/pytorch/pytorch/issues/38349
Closes https://github.com/pytorch/pytorch/issues/22764
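
A minimal usage sketch (illustrative):
```python
import torch

torch.hypot(torch.tensor([3.0]), torch.tensor([4.0]))  # tensor([5.])
```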

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42291

Reviewed By: malfet

Differential Revision: D22951859

Pulled By: mruberry

fbshipit-source-id: d0118f2b6437e5c3f775f699ec46e946a8da50f0
2020-08-12 13:18:26 -07:00
Heitor Schueroff de Souza
62bd2ddec7 Implemented non-named version of unflatten (#42563)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42563

Moved the logic for non-named unflatten from the Python nn module to aten/native so it can be reused by the nn module later. Fixed some inconsistencies between the docs and the code logic.
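
A quick illustrative sketch of the non-named variant:
```python
import torch

t = torch.randn(2, 6)
t.unflatten(1, (2, 3)).shape   # torch.Size([2, 2, 3])
```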

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D23030301

Pulled By: heitorschueroff

fbshipit-source-id: 7c804ed0baa5fca960a990211b8994b3efa7c415
2020-08-12 13:14:28 -07:00
Xiang Gao
7f3f5020e6 CUDA reduction: allow outputs to have different strides (#42649)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42364

Benchmark:
https://github.com/zasdfgbnm/things/blob/master/2020Q3/min-benchmark.ipynb
```python
import torch

print(torch.__version__)
print()

for i in range(100):
    torch.randn(1000, device='cuda')

for e in range(7, 15):
    N = 2 ** e
    input_ = torch.randn(N, N, device='cuda')
    torch.cuda.synchronize()
    %timeit input_.min(dim=0); torch.cuda.synchronize()
    input_ = torch.randn(N, N, device='cuda').t()
    torch.cuda.synchronize()
    %timeit input_.min(dim=0); torch.cuda.synchronize()
    print()
```
Before
```
1.7.0a0+5d7c3f9

21.7 µs ± 1.67 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
20.6 µs ± 773 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

22.5 µs ± 294 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
20.2 µs ± 250 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

26.4 µs ± 67 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
20.9 µs ± 316 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

33 µs ± 474 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
21.1 µs ± 218 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

84.2 µs ± 691 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
50.3 µs ± 105 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

181 µs ± 2.36 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
145 µs ± 149 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

542 µs ± 753 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
528 µs ± 10.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

2.04 ms ± 9.74 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.01 ms ± 22.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
After
```
1.7.0a0+9911817

21.4 µs ± 695 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
20.6 µs ± 989 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

22.4 µs ± 153 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
20.5 µs ± 58.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

26.6 µs ± 147 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
20.9 µs ± 675 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

35.4 µs ± 560 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
21.7 µs ± 1.17 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)

86.5 µs ± 1.99 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
52.2 µs ± 1.57 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

195 µs ± 2.97 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
153 µs ± 4.46 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

550 µs ± 7.72 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
527 µs ± 3.04 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

2.05 ms ± 7.87 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2 ms ± 4.93 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42649

Reviewed By: ezyang

Differential Revision: D22994446

Pulled By: ngimel

fbshipit-source-id: cc60beebad2e04c26ebf3ca702a6cb05846522c9
2020-08-12 13:09:36 -07:00
Kurt Mohler
2f1baf6c25 Fix coding style and safety issues in CuBLAS nondeterministic unit test (#42627)
Summary:
Addresses some comments that were left unaddressed after PR https://github.com/pytorch/pytorch/issues/41377 was merged:

* Use `check_output` instead of `Popen` to run each subprocess sequentially
* Use f-strings rather than old python format string style
* Provide environment variables to subprocess through the `env` kwarg
* Check for correct error behavior inside the subprocess, and raise another error if incorrect. Then the main process fails the test if any error is raised

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42627

Reviewed By: malfet

Differential Revision: D22969231

Pulled By: ezyang

fbshipit-source-id: 38d5f3f0d641c1590a93541a5e14d90c2e20acec
2020-08-12 08:54:28 -07:00
kshitij12345
ab0a04dc9c Add torch.nansum (#38628)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/38349
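
A minimal usage sketch (illustrative):
```python
import torch

x = torch.tensor([[1., float('nan')], [2., 3.]])
torch.nansum(x)          # tensor(6.) -- NaNs are treated as zero
torch.nansum(x, dim=1)   # tensor([1., 5.])
```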

Pull Request resolved: https://github.com/pytorch/pytorch/pull/38628

Reviewed By: VitalyFedyunin

Differential Revision: D22860549

Pulled By: mruberry

fbshipit-source-id: 87fcbfd096d83fc14b3b5622f2301073729ce710
2020-08-11 22:26:04 -07:00
Kurt Mohler
5edd9aa95a Fix manual seed to unpack unsigned long (#42206)
Summary:
`torch.manual_seed` was unpacking its argument as an `int64_t`. This fix changes it to a `uint64_t`.

Fixes https://github.com/pytorch/pytorch/issues/33546

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42206

Reviewed By: ezyang

Differential Revision: D22822098

Pulled By: albanD

fbshipit-source-id: 97c978139c5cb2d5b62cc2c963550c758ee994f7
2020-08-11 18:05:34 -07:00
Heitor Schueroff de Souza
c660d2a9ae Initial quantile operator implementation (#42755)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42755

Attempting to land quantile again after being landed here https://github.com/pytorch/pytorch/pull/39417 and reverted here https://github.com/pytorch/pytorch/pull/41616.

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D23030338

Pulled By: heitorschueroff

fbshipit-source-id: 124a86eea3aee1fdaa0aad718b04863935be26c7
2020-08-11 12:08:17 -07:00
Kurt Mohler
2c8cbd78bd Fix orgqr input size conditions (#42825)
Summary:
* Adds support for `n > k`
* Throw error if `m >= n >= k` is not true
* Updates existing error messages to match argument names shown in public docs
* Adds error tests

Fixes https://github.com/pytorch/pytorch/issues/41776

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42825

Reviewed By: smessmer

Differential Revision: D23038916

Pulled By: albanD

fbshipit-source-id: e9bec7b11557505e10e0568599d0a6cb7e12ab46
2020-08-11 10:17:39 -07:00
Kurt Mohler
42b4a7132e Raise error if at::native::embedding is given 0-D weight (#42550)
Summary:
Previously, `at::native::embedding` implicitly assumed that the `weight` argument would be 1-D or greater. Given a 0-D tensor, it would segfault. This change makes it throw a RuntimeError instead.

Fixes https://github.com/pytorch/pytorch/issues/41780

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42550

Reviewed By: smessmer

Differential Revision: D23040744

Pulled By: albanD

fbshipit-source-id: d3d315850a5ee2d2b6fcc0bdb30db2b76ffffb01
2020-08-11 08:26:45 -07:00
Mike Ruberry
87970b70a7 Adds 'clip' alias for clamp (#42770)
Summary:
Per title. Also updates our guidance for adding aliases to clarify interned_string and method_test requirements. The alias is tested by extending test_clamp to also test clip.
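
A quick sketch of the alias (illustrative):
```python
import torch

t = torch.arange(6, dtype=torch.float32)
torch.equal(torch.clip(t, 1, 4), torch.clamp(t, 1, 4))  # True
t.clip_(min=2)   # method and in-place variants exist as well
```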

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42770

Reviewed By: ngimel

Differential Revision: D23020655

Pulled By: mruberry

fbshipit-source-id: f1d8e751de9ac5f21a4f95d241b193730f07b5dc
2020-08-09 02:46:02 -07:00
Mike Ruberry
55b1706775 Skips some complex tests on ROCm (#42759)
Summary:
Fixes ROCm build on OSS master.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42759

Reviewed By: ngimel

Differential Revision: D23011560

Pulled By: mruberry

fbshipit-source-id: 3339ecbd5a0ca47aede6f7c3f84739af1ac820d5
2020-08-07 16:12:32 -07:00
anjali411
c9346ad3b8 [CPU] Added torch.bmm for complex tensors (#42383)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42383

Test Plan - Updated existing tests to run for complex dtypes as well.

Also added tests for `torch.addmm`, `torch.baddbmm`

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D22960339

Pulled By: anjali411

fbshipit-source-id: 0805f21caaa40f6e671cefb65cef83a980328b7d
2020-08-07 10:04:20 -07:00
Kurt Mohler
df7c059428 Throw error if torch.set_deterministic(True) is called with nondeterministic CuBLAS config (#41377)
Summary:
For CUDA >= 10.2, the `CUBLAS_WORKSPACE_CONFIG` environment variable must be set to either `:4096:8` or `:16:8` to ensure deterministic CUDA stream usage. This PR adds some logic inside `torch.set_deterministic()` to raise an error if this environment variable is not set properly and CUDA >= 10.2.
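
A minimal sketch of the expected usage, using the API named in this PR:
```python
import os
import torch

# For CUDA >= 10.2, cuBLAS needs one of these settings for deterministic behavior.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"   # or ":16:8"
torch.set_deterministic(True)   # raises if the cuBLAS config is not set properly
```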

Issue https://github.com/pytorch/pytorch/issues/15359

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41377

Reviewed By: malfet

Differential Revision: D22758459

Pulled By: ezyang

fbshipit-source-id: 4b96f1e9abf85d94ba79140fd927bbd0c05c4522
2020-08-05 12:42:24 -07:00
Ivan Yashchuk
b9e68e03c4 Fix the bug in THCTensor_(baddbmm) and ATen's addmm_cuda for strided views input (#42425)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42418.

The problem was that the non-contiguous batched matrices were passed to `gemmStridedBatched`.

The following code fails on master and works with the proposed patch:
```python
import torch
x = torch.tensor([[1., 2, 3], [4., 5, 6]], device='cuda:0')
c = torch.as_strided(x, size=[2, 2, 2], stride=[3, 1, 1])
torch.einsum('...ab,...bc->...ac', c, c)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42425

Reviewed By: glaringlee

Differential Revision: D22925266

Pulled By: ngimel

fbshipit-source-id: a72d56d26c7381b7793a047d76bcc5bd45a9602c
2020-08-04 16:11:07 -07:00
Natalia Gimelshein
ec898b1ab5 fix discontiguous inputs/outputs for cummin/cummax (#42507)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42363

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42507

Reviewed By: mruberry

Differential Revision: D22917876

Pulled By: ngimel

fbshipit-source-id: 05f3f4a55bcddf6a853552184c9fafcef8d36270
2020-08-04 10:12:07 -07:00
Nikita Shulga
d21e345ef0 Fix segfault in THPGenerator_dealloc (take 2) (#42510)
Summary:
Segfault happens when one tries to deallocate uninitialized generator.
Make `THPGenerator_dealloc` UBSAN-safe by moving implicit cast in the struct definition to reinterpret_cast

Add `TestTorch.test_invalid_generator_raises` that validates that Generator created on invalid device is handled correctly

Fixes https://github.com/pytorch/pytorch/issues/42281

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42510

Reviewed By: pbelevich

Differential Revision: D22917469

Pulled By: malfet

fbshipit-source-id: 5eaa68eef10d899ee3e210cb0e1e92f73be75712
2020-08-04 08:06:08 -07:00
Nikita Shulga
0cb86afd72 Revert D22908795: [pytorch][PR] Fix segfault in THPGenerator_dealloc
Test Plan: revert-hammer

Differential Revision:
D22908795 (d3acfe3ba8)

Original commit changeset: c5b6a35db381

fbshipit-source-id: c7559c382fced23cef683c8c90cff2d6012801ec
2020-08-03 21:03:44 -07:00
Natalia Gimelshein
7a5708832f fix masked_select for discontiguous outputs (#41841)
Summary:
This fixes https://github.com/pytorch/pytorch/issues/41473 for discontiguous input, mask and out. Tests to follow. Reverting https://github.com/pytorch/pytorch/issues/33269 is not a great solution because I'm told masked_select was needed for printing complex tensors.
cc gchanan , zou3519, ezyang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41841

Reviewed By: mruberry

Differential Revision: D22706943

Pulled By: ngimel

fbshipit-source-id: 413d7fd3f3308b184de04fd56b8a9aaabcad22fc
2020-08-03 18:43:45 -07:00
Nikita Shulga
d3acfe3ba8 Fix segfault in THPGenerator_dealloc (#42490)
Summary:
Segfault happens when one tries to deallocate an uninitialized generator

Add `TestTorch.test_invalid_generator_raises` that validates that Generator created on invalid device is handled correctly

Fixes https://github.com/pytorch/pytorch/issues/42281

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42490

Reviewed By: seemethere

Differential Revision: D22908795

Pulled By: malfet

fbshipit-source-id: c5b6a35db381738c0fc984aa54e5cab5ef2cbb76
2020-08-03 16:28:34 -07:00
Hong Xu
34025eb826 Vectorize arange (#38697)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38697

Benchmark (gcc 8.3, Debian Buster, turbo off, Release build, Intel(R)
Xeon(R) E-2136, Parallelization using OpenMP):

```python
import timeit
for dtype in ('torch.double', 'torch.float', 'torch.uint8', 'torch.int8', 'torch.int16', 'torch.int32', 'torch.int64'):
    for n, t in [(40_000, 50000),
                (400_000, 5000)]:
        print(f'torch.arange(0, {n}, dtype={dtype}) for {t} times')
        print(timeit.timeit(f'torch.arange(0, {n}, dtype={dtype})', setup=f'import torch', number=t))
```

Before:

```
torch.arange(0, 40000, dtype=torch.double) for 50000 times
1.587841397995362
torch.arange(0, 400000, dtype=torch.double) for 5000 times
0.47885190199303906
torch.arange(0, 40000, dtype=torch.float) for 50000 times
1.5519152240012772
torch.arange(0, 400000, dtype=torch.float) for 5000 times
0.4733216500026174
torch.arange(0, 40000, dtype=torch.uint8) for 50000 times
1.426058754004771
torch.arange(0, 400000, dtype=torch.uint8) for 5000 times
0.43596178699226584
torch.arange(0, 40000, dtype=torch.int8) for 50000 times
1.4289699140063021
torch.arange(0, 400000, dtype=torch.int8) for 5000 times
0.43451592899509706
torch.arange(0, 40000, dtype=torch.int16) for 50000 times
0.5714442400058033
torch.arange(0, 400000, dtype=torch.int16) for 5000 times
0.14837959500437137
torch.arange(0, 40000, dtype=torch.int32) for 50000 times
0.5964003179979045
torch.arange(0, 400000, dtype=torch.int32) for 5000 times
0.15676555599202402
torch.arange(0, 40000, dtype=torch.int64) for 50000 times
0.8390555799996946
torch.arange(0, 400000, dtype=torch.int64) for 5000 times
0.23184613398916554
```

After:

```
torch.arange(0, 40000, dtype=torch.double) for 50000 times
0.6895066159922862
torch.arange(0, 400000, dtype=torch.double) for 5000 times
0.16820953000569716
torch.arange(0, 40000, dtype=torch.float) for 50000 times
1.3640095089940587
torch.arange(0, 400000, dtype=torch.float) for 5000 times
0.39255041000433266
torch.arange(0, 40000, dtype=torch.uint8) for 50000 times
0.3422072059911443
torch.arange(0, 400000, dtype=torch.uint8) for 5000 times
0.0605111670010956
torch.arange(0, 40000, dtype=torch.int8) for 50000 times
0.3449254590086639
torch.arange(0, 400000, dtype=torch.int8) for 5000 times
0.06115841199061833
torch.arange(0, 40000, dtype=torch.int16) for 50000 times
0.7745441729930462
torch.arange(0, 400000, dtype=torch.int16) for 5000 times
0.22106765500211623
torch.arange(0, 40000, dtype=torch.int32) for 50000 times
0.720475220005028
torch.arange(0, 400000, dtype=torch.int32) for 5000 times
0.20230313099455088
torch.arange(0, 40000, dtype=torch.int64) for 50000 times
0.8144655400101328
torch.arange(0, 400000, dtype=torch.int64) for 5000 times
0.23762561299372464
```

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D22291236

Pulled By: VitalyFedyunin

fbshipit-source-id: 134dd08b77b11e631d914b5500ee4285b5d0591e
2020-08-03 11:14:57 -07:00
Hong Xu
91c80d122a torch.gcd: Do not use std::abs() because it does not have an unsigned integer overload (#42254)
Summary:
`std::abs` doesn't have an unsigned integer overload across all compilers, so applying `abs` to a uint8_t can be ambiguous: https://en.cppreference.com/w/cpp/numeric/math/abs

This may cause unexpected issues when the input is uint8 and greater
than 128. For example, on MSVC, applying `std::abs` to an unsigned char
variable

```c++
#include <cmath>

unsigned char a(unsigned char x) {
    return std::abs(x);
}
```

gives the following warning:

    warning C4244: 'return': conversion from 'int' to 'unsigned char',
    possible loss of data

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42254

Reviewed By: VitalyFedyunin

Differential Revision: D22860505

Pulled By: mruberry

fbshipit-source-id: 0076d327bb6141b2ee94917a1a21c22bd2b7f23a
2020-08-01 23:03:33 -07:00
Mike Ruberry
2912390662 Limits cpu scalar error message to where it's appropriate (#42360)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/40986.

TensorIterator's test for a CUDA kernel getting too many CPU scalar inputs was too permissive. This update limits the check to not consider outputs and to only be performed if the kernel can support CPU scalars.

A test is added to verify the appropriate error message is thrown in a case where the old error message was thrown previously.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42360

Reviewed By: ngimel

Differential Revision: D22868536

Pulled By: mruberry

fbshipit-source-id: 2bc8227978f8f6c0a197444ff0c607aeb51b0671
2020-08-01 02:04:30 -07:00
Kurt Mohler
206db5c127 Improve torch.norm functionality, errors, and tests (#41956)
Summary:
**BC-Breaking Note:**
BC breaking changes in the case where keepdim=True. Before this change, when calling `torch.norm` with keepdim=True and p='fro' or p=number, leaving all other optional arguments as their default values, the keepdim argument would be ignored. Also, any time `torch.norm` was called with p='nuc', the result would have one fewer dimension than the input, and the dimensions could be out of order depending on which dimensions were being reduced. After the change, for each of these cases, the result has the same number and order of dimensions as the input.

**PR Summary:**

* Fix keepdim behavior
* Throw descriptive errors for unsupported sparse norm args
* Increase unit test coverage for these cases and for complex inputs

These changes were taken from part of PR https://github.com/pytorch/pytorch/issues/40924. That PR is not going to be merged because it overrides `torch.norm`'s interface, which we want to avoid. But these improvements are still useful.

Issue https://github.com/pytorch/pytorch/issues/24802

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41956

Reviewed By: albanD

Differential Revision: D22837455

Pulled By: mruberry

fbshipit-source-id: 509ecabfa63b93737996f48a58c7188b005b7217
2020-08-01 01:55:12 -07:00
Mike Ruberry
2f840b1662 Warns when TensorIterator would resize its output (#42079)
Summary:
See https://github.com/pytorch/pytorch/issues/41027.

This adds a helper to resize output to ATen/native/Resize.* and updates TensorIterator to use it. The helper throws a warning if a tensor with one or more elements needs to be resized. This warning indicates that these resizes will become an error in a future PyTorch release.

 There are many functions in PyTorch that will resize their outputs and don't use TensorIterator. For example,

985fd970aa/aten/src/ATen/native/cuda/NaiveConvolutionTranspose2d.cu (L243)

And these functions will need to be updated to use this helper, too. This PR avoids their inclusion since the work is separable, and this should let us focus on the function and its behavior in review. A TODO appears in the code to reflect this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42079

Reviewed By: VitalyFedyunin

Differential Revision: D22846851

Pulled By: mruberry

fbshipit-source-id: d1a413efb97e30853923bce828513ba76e5a495d
2020-07-30 22:39:16 -07:00
Mike Ruberry
e54f268a7a Enables torch.full bool and integer type inference (#41912)
Summary:
After a deprecation in 1.5 and a runtime error in 1.6, we can now enable torch.full to infer its dtype when given bool and integer fill values. This PR enables that inference and updates the tests and docs to reflect this.
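
A quick sketch of the inferred dtypes (illustrative):
```python
import torch

torch.full((2,), True).dtype   # torch.bool
torch.full((2,), 7).dtype      # torch.int64
torch.full((2,), 7.0).dtype    # torch.float32 (default dtype), as before
```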

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41912

Reviewed By: albanD

Differential Revision: D22836802

Pulled By: mruberry

fbshipit-source-id: 33dfbe4d4067800c418b314b1f60fab8adcab4e7
2020-07-30 22:39:13 -07:00
kshitij12345
31d41f987a torch.where : Scalar Support (#40336)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/38349 #9190

TODO
* [x] Add Tests
* [x] Update Docs
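
A minimal sketch of the new scalar overload (illustrative):
```python
import torch

x = torch.randn(4)
torch.where(x > 0, x, 0.)   # scalar 'other' is now accepted
```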

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40336

Reviewed By: albanD

Differential Revision: D22813834

Pulled By: mruberry

fbshipit-source-id: 67c1693c059a301b249213afee3c25cea9f64fec
2020-07-30 22:36:53 -07:00
Hong Xu
344defc973 Let bfloat16 support promotion with other types (#41698)
Summary:
Fix https://github.com/pytorch/pytorch/issues/40580

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41698

Reviewed By: albanD

Differential Revision: D22824042

Pulled By: mruberry

fbshipit-source-id: 7dad9c12dc51d8f88c3ca963ae9c5f8aa2f72277
2020-07-30 12:28:09 -07:00
kiyosora
26d58503c2 Implementing NumPy-like function torch.signbit() (#41589)
Summary:
- Related with https://github.com/pytorch/pytorch/issues/38349
- Implementing the NumPy-like function `torch.signbit()`.
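
A minimal usage sketch (illustrative):
```python
import torch

torch.signbit(torch.tensor([-1.5, -0.0, 0.0, 2.0]))
# tensor([ True,  True, False, False]) -- the sign bit of -0.0 is set
```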

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41589

Reviewed By: albanD

Differential Revision: D22835249

Pulled By: mruberry

fbshipit-source-id: 7988f7fa8f591ce4b6a23ac884ee7b3aa718bcfd
2020-07-30 11:21:15 -07:00
Mike Ruberry
4b6e5f42a4 Creates spectral ops test suite (#42157)
Summary:
In preparation for creating the new torch.fft namespace and NumPy-like fft functions, as well as supporting our goal of refactoring and reducing the size of test_torch.py, this PR creates a test suite for our spectral ops.

The existing spectral op tests from test_torch.py and test_cuda.py are moved to test_spectral_ops.py and updated to run under the device generic test framework.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42157

Reviewed By: albanD

Differential Revision: D22811096

Pulled By: mruberry

fbshipit-source-id: e5c50f0016ea6bb8b093cd6df2dbcef6db9bb6b6
2020-07-29 11:36:18 -07:00
Alban Desmaison
460970483d Revert D22790718: [pytorch][PR] Enables torch.full bool and integer type inference
Test Plan: revert-hammer

Differential Revision:
D22790718 (6b3f335641)

Original commit changeset: 8d1eb01574b1

fbshipit-source-id: c321177cce129a6c83f1a7b26bd5ed94a343ac0f
2020-07-29 07:52:04 -07:00
Xiong Wei
90074bbfa6 implement numpy-like functionality isposinf, isneginf (#41588)
Summary:
Related https://github.com/pytorch/pytorch/issues/38349

Numpy-like functionalities `isposinf` and `isneginf` are implemented.
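
A quick illustrative sketch:
```python
import torch

x = torch.tensor([float('inf'), -float('inf'), 1.0, float('nan')])
torch.isposinf(x)   # tensor([ True, False, False, False])
torch.isneginf(x)   # tensor([False,  True, False, False])
```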

Test-Plan:
- pytest test/test_torch.py -k "test_isposinf_isneginf"

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41588

Reviewed By: ngimel

Differential Revision: D22770732

Pulled By: mruberry

fbshipit-source-id: 7448653e8fb8df6b9cd4604a4739fe18a1135578
2020-07-29 03:29:31 -07:00
Mike Ruberry
6b3f335641 Enables torch.full bool and integer type inference (#41912)
Summary:
After a deprecation in 1.5 and a runtime error in 1.6, we can now enable torch.full to infer its dtype when given bool and integer fill values. This PR enables that inference and updates the tests and docs to reflect this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41912

Reviewed By: pbelevich

Differential Revision: D22790718

Pulled By: mruberry

fbshipit-source-id: 8d1eb01574b1977f00bc0696974ac38ffdd40d9e
2020-07-28 23:11:08 -07:00
Hong Xu
2de549518e Make fmod work with zero divisors consistently (#41948)
Summary:
Currently `torch.tensor(1, dtype=torch.int).fmod(0)` crashes (floating point exception).

This PR should fix this issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41948

Reviewed By: ngimel

Differential Revision: D22771081

Pulled By: ezyang

fbshipit-source-id: a94dd35d6cd85daa2d51cae8362004e31f97989e
2020-07-28 08:58:39 -07:00
Natalia Gimelshein
6ca5421a8f Enable non-synchronizing cub scan for cum* operations (#42036)
Summary:
This uses cub for cum* operations, because, unlike thrust, cub is non-synchronizing.
Cub does not support more than `2**31` element tensors out of the box (in fact, due to cub bugs the cutoff point is even smaller)
so to support that I split the tensor into `2**30` element chunks, and modify the first value of the second and subsequent chunks to contain the cumsum result of the previous chunks. Since the modification is done in place on the source tensor, if something goes wrong and we error out before the source tensor is reverted back to its original state, the source tensor will be corrupted, but in most cases errors will invalidate the full CUDA context.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42036

Reviewed By: ajtulloch

Differential Revision: D22749945

Pulled By: ngimel

fbshipit-source-id: 9fc9b54d466df9c8885e79c4f4f8af81e3f224ef
2020-07-27 15:44:03 -07:00
Mike Ruberry
12cd083fd7 Updates torch.tensor, torch.as_tensor, and sparse ctors to use the device of inputs tensors they're given, by default (#41984)
Summary:
**BC-Breaking Note**

This PR changes the behavior of the torch.tensor, torch.as_tensor, and sparse constructors. When given a tensor as input and a device is not explicitly specified, these constructors now always infer their device from the tensor. Historically, if the optional dtype kwarg was provided then these constructors would not infer their device from tensor inputs. Additionally, for the sparse ctor a runtime error is now thrown if the indices and values tensors are on different devices and the device kwarg is not specified.

**PR Summary**
This PR's functional change is a single line:

```
auto device = device_opt.has_value() ? *device_opt : (type_inference ? var.device() : at::Device(computeDeviceType(dispatch_key)));
```
=>
```
auto device = device_opt.has_value() ? *device_opt : var.device();
```

in `internal_new_from_data`. This line entangled whether the function was performing type inference with whether it inferred its device from an input tensor, and in practice meant that

```
t = torch.tensor((1, 2, 3), device='cuda')
torch.tensor(t, dtype=torch.float64)
```

would return a tensor on the CPU, not the default CUDA device, while

```
t = torch.tensor((1, 2, 3), device='cuda')
torch.tensor(t)
```

would return a tensor on the device of `t`!

This behavior is niche and odd, but came up while aocsa was fixing https://github.com/pytorch/pytorch/issues/40648.

An additional side effect of this change is that the indices and values tensors given to a sparse constructor must be on the same device, or the sparse ctor must specify the device kwarg. The tests in test_sparse.py have been updated to reflect this behavior.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41984

Reviewed By: ngimel

Differential Revision: D22721426

Pulled By: mruberry

fbshipit-source-id: 909645124837fcdf3d339d7db539367209eccd48
2020-07-25 02:49:45 -07:00
Natalia Gimelshein
750d9dea49 move min/max tests to TestTorchDeviceType (#41908)
Summary:
so that testing _min_max on the different devices is easier, and min/max operations have better CUDA test coverage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41908

Reviewed By: mruberry

Differential Revision: D22697032

Pulled By: ngimel

fbshipit-source-id: a796638fdbed8cda90a23f7ff4ee167f45530914
2020-07-23 22:49:30 -07:00
Vishwak Srinivasan
77db93228b Temporary fix for determinant bug on CPU (#35136)
Summary:
Changelog:
- Make diagonal contiguous

Temporarily Fixes https://github.com/pytorch/pytorch/issues/34061

Pull Request resolved: https://github.com/pytorch/pytorch/pull/35136

Reviewed By: izdeby

Differential Revision: D22673153

Pulled By: ezyang

fbshipit-source-id: 850f537483f929fcb43bcdef9d4ec264a7c3d354
2020-07-23 10:12:06 -07:00
kshitij12345
266657182a Add torch.movedim (#41480)
Summary:
https://github.com/pytorch/pytorch/issues/38349 #36048

TODO:
* [x] Tests
* [x] Docs
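
A minimal usage sketch (illustrative):
```python
import torch

t = torch.randn(2, 3, 4)
torch.movedim(t, 0, -1).shape           # torch.Size([3, 4, 2])
torch.movedim(t, (0, 1), (1, 2)).shape  # torch.Size([4, 2, 3])
```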

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41480

Reviewed By: zhangguanheng66

Differential Revision: D22649917

Pulled By: zou3519

fbshipit-source-id: a7f3920a24bae16ecf2ad731698ca65ca3e8c1ce
2020-07-23 09:41:01 -07:00
ashishfarmer
586b7f991c Enable skipped tests from test_torch on ROCm (#41611)
Summary:
This pull request enables the following tests from test_torch, previously skipped on ROCm:
test_pow_-2_cuda_float32/float64
test_sum_noncontig_cuda_float64
test_conv_transposed_large

The first two tests experienced precision issues on earlier ROCm version, whereas the conv_transposed test was hitting a bug in MIOpen which is fixed with the version shipping with ROCm 3.5

ezyang jeffdaily

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41611

Reviewed By: xw285cornell

Differential Revision: D22672690

Pulled By: ezyang

fbshipit-source-id: 5585387c048f301a483c4c0566eb9665555ef874
2020-07-22 19:49:17 -07:00
Nikita Vedeneev
7fefa46820 scatter/gather - check that inputs are of the same dimensionality (#41672)
Summary:
As per title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41672

Reviewed By: malfet, ngimel

Differential Revision: D22678302

Pulled By: gchanan

fbshipit-source-id: 95a1bde81e660b8963e5914d5348fd4fbff1338e
2020-07-22 18:51:51 -07:00
Kurt Mohler
ec683299eb Reland Add non-deterministic alert to CUDA operations that use atomicAdd() (#41538)
Summary:
Reland PR https://github.com/pytorch/pytorch/issues/40056

A new overload of upsample_linear1d_backward_cuda was added in a recent commit, so I had to add the nondeterministic alert to it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41538

Reviewed By: zou3519

Differential Revision: D22608376

Pulled By: ezyang

fbshipit-source-id: 54a2aa127e069197471f1feede6ad8f8dc6a2f82
2020-07-22 13:12:29 -07:00
Gregory Chanan
71aad6ea66 Revert "port masked_select from TH to ATen and optimize perf on CPU (#33269)" (#41828)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41828

This reverts commit fe66bdb498.

This also touches THTensorEvenMoreMath because sumall was removed; see THTensor_wrap.

Test Plan: Imported from OSS

Reviewed By: orionr

Differential Revision: D22657473

Pulled By: malfet

fbshipit-source-id: 95a806cedf1a3f4df91e6a21de1678252b117489
2020-07-22 09:28:04 -07:00
Vasiliy Kuznetsov
302e566205 add max_and_min function and cpu kernel to speed up observers (#41570)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41570

For min/max based quantization observers, calculating min and max of a tensor
takes most of the runtime. Since the calculation of min and max is done
on the same tensor, we can speed this up by only reading the tensor
once, and reducing with two outputs.

One question I had is whether we should put this into the quantization
namespace, since the use case is pretty specific.

This PR implements the easier CPU path to get an initial validation.
There is some needed additional work in future PRs, which durumu will
take a look at:
* CUDA kernel and tests
* making this work per channel
* benchmarking on observer
* benchmarking impact on QAT overhead

Test Plan:
```
python test/test_torch.py TestTorch.test_min_and_max
```

quick bench (not representative of real world use case):
https://gist.github.com/vkuzo/7fce61c3456dbc488d432430cafd6eca
```
(pytorch) [vasiliy@devgpu108.ash6 ~/local/pytorch] OMP_NUM_THREADS=1 python ~/nfs/pytorch_scripts/observer_bench.py
tensor(5.0390) tensor(-5.4485) tensor([-5.4485,  5.0390])
min and max separate 11.90243935585022
min and max combined 6.353186368942261
% decrease 0.466228209277153
(pytorch) [vasiliy@devgpu108.ash6 ~/local/pytorch] OMP_NUM_THREADS=4 python ~/nfs/pytorch_scripts/observer_bench.py
tensor(5.5586) tensor(-5.3983) tensor([-5.3983,  5.5586])
min and max separate 3.468616485595703
min and max combined 1.8227086067199707
% decrease 0.4745142294372342
(pytorch) [vasiliy@devgpu108.ash6 ~/local/pytorch] OMP_NUM_THREADS=8 python ~/nfs/pytorch_scripts/observer_bench.py
tensor(5.2146) tensor(-5.2858) tensor([-5.2858,  5.2146])
min and max separate 1.5707778930664062
min and max combined 0.8645427227020264
% decrease 0.4496085496757899
```

Imported from OSS

Reviewed By: supriyar

Differential Revision: D22589349

fbshipit-source-id: c2e3f1b8b5c75a23372eb6e4c885f842904528ed
2020-07-21 18:16:22 -07:00
Wojciech Baranowski
48569cc330 Reland split (#41567)
Summary:
Take 3

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41567

Reviewed By: zou3519

Differential Revision: D22586331

Pulled By: albanD

fbshipit-source-id: ca08199da716d64a335455610edbce752fee224b
2020-07-21 08:06:27 -07:00
Alexander Grund
6769b850b2 Remove needless test duplication (#41583)
Summary:
The test loops over `upper` but does not use it, effectively running the same test twice, which increases test times for no gain.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41583

Reviewed By: soumith, seemethere, izdeby

Differential Revision: D22598475

Pulled By: zou3519

fbshipit-source-id: d100f20143293a116ff3ba08b0f4eaf0cc5a8099
2020-07-20 10:14:11 -07:00
Justin Huber
c6d0fdd215 torch.isreal (#41298)
Summary:
https://github.com/pytorch/pytorch/issues/38349

mruberry
Not entirely sure if all the changes are necessary, given how functions are added to PyTorch.

Should it throw an error when called with a non-complex tensor? NumPy allows non-complex arrays in its imag() function, which is used in its isreal() function, but PyTorch's imag() throws an error for non-complex arrays.
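
For reference, a sketch of how torch.isreal behaves today (real dtypes are accepted, mirroring NumPy):
```python
import torch

torch.isreal(torch.tensor([1 + 0j, 2 + 1j]))  # tensor([ True, False])
torch.isreal(torch.tensor([1.0, 2.0]))        # tensor([True, True]) -- real dtypes are trivially real
```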

Where does assertONNX() get its expected output to compare to?
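
For reference, a quick sketch of the NumPy-style semantics targeted here; the non-complex behavior shown is the current one and was still an open question at the time of this PR:

```
import torch

# For complex input, an element is "real" when its imaginary part is zero.
torch.isreal(torch.tensor([1 + 0j, 2 + 1j, 3 - 0j]))
# tensor([ True, False,  True])

# NumPy's isreal returns all-True for real-valued arrays; whether torch should
# instead raise for non-complex input is the open question above.
torch.isreal(torch.tensor([1.0, 2.0]))
# tensor([True, True])
```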

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41298

Reviewed By: ngimel

Differential Revision: D22610500

Pulled By: mruberry

fbshipit-source-id: 817d61f8b1c3670788b81690636bd41335788439
2020-07-17 22:07:24 -07:00
Heitor Schueroff de Souza
1734f24276 Revert D22525217: [pytorch][PR] Initial implementation of quantile operator
Test Plan: revert-hammer

Differential Revision:
D22525217 (c7798ddf7b)

Original commit changeset: 27a8bb23feee

fbshipit-source-id: 3beb3d4f8a4d558e993fbdfe977af12c7153afc8
2020-07-17 17:22:48 -07:00
Mike Ruberry
a874c1e584 Adds missing abs to lcm (#41552)
Summary:
lcm was missing an abs. This adds it plus extends the test for NumPy compliance. Also includes a few doc fixes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41552

Reviewed By: ngimel

Differential Revision: D22580997

Pulled By: mruberry

fbshipit-source-id: 5ce1db56f88df4355427e1b682fcf8877458ff4e
2020-07-17 12:29:50 -07:00
Natalia Gimelshein
324c18fcad fix division by low precision scalar (#41446)
Summary:
Before, the inverse for division by a scalar was calculated in the precision of the non-scalar operand, which can lead to underflow:
```
>>> x = torch.tensor([3388.]).half().to(0)
>>> scale = 524288.0
>>> x.div(scale)
tensor([0.], device='cuda:0', dtype=torch.float16)
>>> x.mul(1. / scale)
tensor([0.0065], device='cuda:0', dtype=torch.float16)
```
This PR makes results of multiplication by inverse and division the same.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41446

Reviewed By: ezyang

Differential Revision: D22542872

Pulled By: ngimel

fbshipit-source-id: b60e3244809573299c2c3030a006487a117606e9
2020-07-17 10:41:28 -07:00
Heitor Schueroff de Souza
c7798ddf7b Initial implementation of quantile operator (#39417)
Summary:
Implementing the quantile operator similar to [numpy.quantile](https://numpy.org/devdocs/reference/generated/numpy.quantile.html).

For this implementation I'm reducing it to existing torch operators to get a free CUDA implementation. It would be more efficient to implement a multiple-quickselect algorithm instead of sorting, but this can be addressed in a future PR.
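
A minimal usage sketch following the numpy.quantile semantics referenced above (keyword arguments may have evolved since this initial implementation):

```
import torch

t = torch.tensor([0., 1., 2., 3., 4.])
torch.quantile(t, 0.5)                         # tensor(2.)  -- the median
torch.quantile(t, torch.tensor([0.25, 0.75]))  # tensor([1., 3.])
```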

Pull Request resolved: https://github.com/pytorch/pytorch/pull/39417

Reviewed By: mruberry

Differential Revision: D22525217

Pulled By: heitorschueroff

fbshipit-source-id: 27a8bb23feee24fab7f8c228119d19edbb6cea33
2020-07-17 10:15:57 -07:00
kshitij12345
71fdf748e5 Add torch.atleast_{1d/2d/3d} (#41317)
Summary:
https://github.com/pytorch/pytorch/issues/38349

TODO:
 * [x] Docs
 * [x] Tests
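
A short usage sketch of the NumPy-style behavior these functions provide:

```
import torch

torch.atleast_1d(torch.tensor(5.))        # tensor([5.])
torch.atleast_2d(torch.tensor([1., 2.]))  # tensor([[1., 2.]])
torch.atleast_3d(torch.ones(2, 3)).shape  # torch.Size([2, 3, 1])
```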

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41317

Reviewed By: ngimel

Differential Revision: D22575456

Pulled By: mruberry

fbshipit-source-id: cc79f4cd2ca4164108ed731c33cf140a4d1c9dd8
2020-07-17 10:10:41 -07:00
Alban Desmaison
b1d4e33c8b Revert D22552377: [pytorch][PR] Reland split unsafe version
Test Plan: revert-hammer

Differential Revision:
D22552377 (5bba973afd)

Original commit changeset: 1d1b713d2429

fbshipit-source-id: 8194458f99bfd5f077b7daa46ca3e81b549adc1b
2020-07-16 15:24:19 -07:00
Mike Ruberry
fef30220fd Runs CUDA test_istft_of_sine on CUDA (#41523)
Summary:
The test was always running on the CPU. This actually caused it to throw an error on non-MKL builds, since the CUDA test (which ran on the CPU) tried to execute but the test requires MKL (a requirement only checked for the CPU variant of the test).

Fixes https://github.com/pytorch/pytorch/issues/41402.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41523

Reviewed By: ngimel

Differential Revision: D22569344

Pulled By: mruberry

fbshipit-source-id: e9908c0ed4b5e7b18cc7608879c6213fbf787da2
2020-07-16 10:43:51 -07:00
Mike Ruberry
b2b8af9645 Removes assertAlmostEqual (#41514)
Summary:
This test function is confusing since our `assertEqual` behavior allows for tolerance to be specified, and this is a redundant mechanism.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41514

Reviewed By: ngimel

Differential Revision: D22569348

Pulled By: mruberry

fbshipit-source-id: 2b2ff8aaa9625a51207941dfee8e07786181fe9f
2020-07-16 10:35:12 -07:00
Wojciech Baranowski
5bba973afd Reland split unsafe version (#41484)
Summary:
Reland of https://github.com/pytorch/pytorch/pull/39299

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41484

Reviewed By: glaringlee

Differential Revision: D22552377

Pulled By: albanD

fbshipit-source-id: 1d1b713d2429ae162e04bda845ef0838c52df789
2020-07-16 09:01:45 -07:00
Xiang Gao
23174ca71b [reland] Enable TF32 support for cuBLAS (#41498)
Summary:
fix rocm

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41498

Reviewed By: mruberry

Differential Revision: D22560572

Pulled By: ngimel

fbshipit-source-id: 5ee79e96cb29e70d9180830d058efb53d1c6c041
2020-07-15 21:00:55 -07:00
Aayush Naik
200c343184 Implement gcd, lcm (#40651)
Summary:
Resolves https://github.com/pytorch/pytorch/issues/40018.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40651

Reviewed By: ezyang

Differential Revision: D22511828

Pulled By: mruberry

fbshipit-source-id: 3ef251e45da4688b1b64c79f530fb6642feb63ab
2020-07-15 20:56:23 -07:00
Hong Xu
1770937c9c Restore the contiguity preprocessing of linspace (#41286)
Summary:
The contiguity preprocessing was mistakenly removed in
cd48fb5030. It causes erroneous output
when the output tensor is not contiguous. Here we restore this
preprocessing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41286

Reviewed By: zou3519

Differential Revision: D22550822

Pulled By: ezyang

fbshipit-source-id: ebad4e2ba83d2d808e3f958d4adc9a5513a95bec
2020-07-15 20:02:16 -07:00
Shen Li
954c260061 Revert D22480638: [pytorch][PR] Add non-deterministic alert to CUDA operations that use atomicAdd()
Test Plan: revert-hammer

Differential Revision:
D22480638 (6ff306b8b5)

Original commit changeset: 4cc913cb3ca6

fbshipit-source-id: e47fa14b5085bb2b74a479bd0830efc2d7604eea
2020-07-15 12:10:05 -07:00
Kurt Mohler
6ff306b8b5 Add non-deterministic alert to CUDA operations that use atomicAdd() (#40056)
Summary:
Issue https://github.com/pytorch/pytorch/issues/15359

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40056

Differential Revision: D22480638

Pulled By: ezyang

fbshipit-source-id: 4cc913cb3ca6d4206de80f4665bbc9031aa3ca01
2020-07-15 10:57:32 -07:00
Shen Li
3a63a939d4 Revert D22517785: [pytorch][PR] Enable TF32 support for cuBLAS
Test Plan: revert-hammer

Differential Revision:
D22517785 (288ece89e1)

Original commit changeset: 87334c893561

fbshipit-source-id: 0a0674f49c1bcfc98f7f88af5a8c7de93b76e458
2020-07-15 08:15:48 -07:00
Wojciech Baranowski
14f19ab833 Port index_select to ATen (CUDA) (#39946)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/24578

Pull Request resolved: https://github.com/pytorch/pytorch/pull/39946

Reviewed By: ngimel

Differential Revision: D22520160

Pulled By: mruberry

fbshipit-source-id: 7eb3029e3917e793f3c020359acb0989d5deb61e
2020-07-15 01:11:32 -07:00
Mike Ruberry
9552ec787c Revert D22516606: [pytorch][PR] Temporary fix for determinant bug on CPU
Test Plan: revert-hammer

Differential Revision:
D22516606 (fcd6d91045)

Original commit changeset: 7ea8299b9d2c

fbshipit-source-id: 41e19d5e1ba843cd70dce677869892f2e33fac09
2020-07-14 23:44:32 -07:00
vishwakftw
fcd6d91045 Temporary fix for determinant bug on CPU (#35136)
Summary:
Changelog:
- Make diagonal contiguous

Temporarily Fixes https://github.com/pytorch/pytorch/issues/34061

Pull Request resolved: https://github.com/pytorch/pytorch/pull/35136

Reviewed By: vincentqb

Differential Revision: D22516606

Pulled By: ezyang

fbshipit-source-id: 7ea8299b9d2c1c244995955b333a1dffb0cdff73
2020-07-14 21:20:50 -07:00
Qiao Tan
359cdc20e2 Revert D22432885: [pytorch][PR] unsafe_split, unsafe_split_with_sizes, unsafe_chunk operations
Test Plan: revert-hammer

Differential Revision:
D22432885 (c17670ac50)

Original commit changeset: 324aef091b32

fbshipit-source-id: 6b7c52bde46932e1cf77f61e7035d8a641b0beb6
2020-07-14 16:06:42 -07:00
Wojciech Baranowski
c17670ac50 unsafe_split, unsafe_split_with_sizes, unsafe_chunk operations (#39299)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/36403

Copy-paste of the issue description:

* Escape hatch: Introduce unsafe_* versions of the three functions above that have the current behavior (outputs not tracked as views). The documentation will explain in detail why they are unsafe and when it is safe to use them (basically, only the outputs OR the input can be modified in place, but not both; otherwise, you will get wrong gradients). A usage sketch follows the BC-breaking note below.
* Deprecation: Use the CreationMeta on views to track views created by these three ops and throw a warning when any of the views is modified in place, saying that this is deprecated and will raise an error soon. Users who really need to modify these views in place should look at the doc of the unsafe_* version to make sure their use case is valid:
  * If it is not, then PyTorch is computing wrong gradients for their use case and they should not modify in place anymore.
  * If it is, then they can use the unsafe_* version to keep the current behavior.
* Removal: Use the CreationMeta on views to prevent any inplace modification of these views (like we do for all other views coming from multi-output Nodes). Users will still be able to use the unsafe_ versions if they really need to do this.

Note about BC-breaking:
- This PR changes the behavior of the regular functions by making them return proper views now. This is a modification that the user will be able to see.
- We skip all the view logic for these views and so the code should behave the same as before (except the change in the `._is_view()` value).
- Even though the view logic is not performed, we do raise deprecation warnings for the cases where doing these ops would throw an error.
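
A minimal sketch of how the escape-hatch variants are meant to be used (the unsafe_* functions take the same arguments as their regular counterparts):

```
import torch

x = torch.randn(6, requires_grad=True).clone()  # non-leaf, so inplace writes are legal

# Regular split: outputs are now tracked as views of x, so the inplace
# restrictions described above apply to them.
a, b, c = torch.split(x, 2)

# Escape hatch: outputs are NOT tracked as views. Only safe if either the
# outputs or the input is modified in place, never both.
a2, b2, c2 = torch.unsafe_split(x, 2)
```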

Pull Request resolved: https://github.com/pytorch/pytorch/pull/39299

Differential Revision: D22432885

Pulled By: albanD

fbshipit-source-id: 324aef091b32ce69dd067fe9b13a3f17d85d0f12
2020-07-14 14:15:41 -07:00
Xiang Gao
288ece89e1 Enable TF32 support for cuBLAS (#40800)
Summary:
Benchmark on a fully connected network and torchvision models (time in seconds) on GA100:

| model              | batch size | forward(TF32) | forward(FP32) | backward(TF32) | backward(FP32) |
|--------------------|------------|---------------|---------------|----------------|----------------|
| FC 512-128-32-8    | 512        | 0.000211      | 0.000321      | 0.000499       | 0.000532       |
| alexnet            | 512        | 0.0184        | 0.0255        | 0.0486         | 0.0709         |
| densenet161        | 128        | 0.0665        | 0.204         | 0.108          | 0.437          |
| googlenet          | 256        | 0.0925        | 0.110         | 0.269          | 0.326          |
| inception_v3       | 256        | 0.155         | 0.214         | 0.391          | 0.510          |
| mnasnet1_0         | 512        | 0.108         | 0.137         | 0.298          | 0.312          |
| mobilenet_v2       | 512        | 0.114         | 0.294         | 0.133          | 0.303          |
| resnet18           | 512        | 0.0722        | 0.100         | 0.182          | 0.228          |
| resnext50_32x4d    | 256        | 0.170         | 0.237         | 0.373          | 0.479          |
| shufflenet_v2_x1_0 | 512        | 0.0463        | 0.0473        | 0.125          | 0.123          |
| squeezenet1_0      | 512        | 0.0870        | 0.0948        | 0.205          | 0.214          |
| vgg16              | 256        | 0.167         | 0.234         | 0.401          | 0.502          |
| wide_resnet50_2    | 512        | 0.186         | 0.310         | 0.415          | 0.638          |
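
For reference, a sketch of toggling TF32 for matmuls; the flag name below is the current public API and is assumed here rather than taken from this PR:

```
import torch

# When enabled, float32 matmuls use TF32 tensor cores on Ampere GPUs.
torch.backends.cuda.matmul.allow_tf32 = True

a = torch.randn(1024, 1024, device='cuda')
b = torch.randn(1024, 1024, device='cuda')
c = a @ b  # runs in TF32 when the hardware supports it
```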

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40800

Reviewed By: mruberry

Differential Revision: D22517785

Pulled By: ngimel

fbshipit-source-id: 87334c8935616f72a6af5abbd3ae69f76923dc3e
2020-07-14 13:21:10 -07:00
Xiaomeng Yang
80d5b3785b Add torch.logit function (#41062)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41062

Add torch.logit function
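
A short usage sketch (logit is the inverse of sigmoid):

```
import torch

p = torch.tensor([0.1, 0.5, 0.9])
torch.logit(p)            # log(p / (1 - p)) -> tensor([-2.1972,  0.0000,  2.1972])
torch.logit(p, eps=1e-6)  # clamps p to [eps, 1 - eps] before taking the logit
```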

Test Plan: buck test mode/dev-nosan //caffe2/test:torch -- "logit"

Reviewed By: hl475

Differential Revision: D22406912

fbshipit-source-id: b303374f4c68850eb7477eb0645546a24b844606
2020-07-13 19:33:20 -07:00
Peter Bell
cb6c3526c6 Migrate addmm, addbmm and THBlas_gemm to ATen (#40927)
Summary:
Resubmit #40927
Closes https://github.com/pytorch/pytorch/issues/24679, closes https://github.com/pytorch/pytorch/issues/24678

`addbmm` depends on `addmm`, so it needed to be ported at the same time. I also removed `THTensor_(baddbmm)`, which I noticed had already been ported and so was just dead code.

After having already written this code, I had to fix merge conflicts with https://github.com/pytorch/pytorch/issues/40354, which revealed there was already an established place for CPU BLAS routines in ATen. However, the version there doesn't make use of ATen's AVX dispatching, so I thought I'd wait for comment before migrating this into that style.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40927

Reviewed By: ezyang

Differential Revision: D22468490

Pulled By: ngimel

fbshipit-source-id: f8a22be3216f67629420939455e31a88af20201d
2020-07-10 14:30:55 -07:00
Natalia Gimelshein
e568b3fa2d test nan and inf in TestTorchMathOps (#41225)
Summary:
Per title. `lgamma` produces a different result for `-inf` compared to scipy, so that comparison is skipped.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41225

Differential Revision: D22473346

Pulled By: ngimel

fbshipit-source-id: e4ebda1b10e2a061bd4cef38d1d7b5bf0f581790
2020-07-10 09:46:46 -07:00
Heitor Schueroff de Souza
75a4862f63 Added SiLU activation function (#41034)
Summary:
Implemented the SiLU activation function as discussed in https://github.com/pytorch/pytorch/issues/3169.
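
A minimal usage sketch; the functional and module names below match the current API:

```
import torch
import torch.nn.functional as F

x = torch.tensor([-1.0, 0.0, 1.0])
F.silu(x)           # x * sigmoid(x) -> tensor([-0.2689,  0.0000,  0.7311])
torch.nn.SiLU()(x)  # module form, same result
```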

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41034

Reviewed By: glaringlee

Differential Revision: D22465203

Pulled By: heitorschueroff

fbshipit-source-id: b27d064529fc99600c586ad49b594b52b718b0d2
2020-07-10 07:37:30 -07:00
Thomas Viehmann
a318234eb0 Print raising warnings in Python rather than C++ if other error occurs (#41116)
Summary:
When we return to Python from C++ in PyTorch and have warnings and an error, we have the problem of what to do when the warnings throw, because we can only throw one error.
Previously, if we had an error, we punted all warnings to the C++ warning handler, which would write them to stderr (i.e. file descriptor 2) or pass them on to glog.

This has drawbacks if an error happened:
- Warnings are not handled through Python even if they don't raise,
- warnings are always printed with no way to suppress this,
- the printing bypasses sys.stderr, so Python modules wanting to
  modify this don't work (with the prominent example being Jupyter).

This patch does the following instead:
- Set the warning using standard Python extension mechanisms,
- if Python decides that this warning is an error and we have a
  PyTorch error, we print the warning through Python and clear
  the error state (from the warning).

This resolves the three drawbacks discussed above, in particular it fixes https://github.com/pytorch/pytorch/issues/37240 .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41116

Differential Revision: D22456393

Pulled By: albanD

fbshipit-source-id: c3376735723b092efe67319321a8a993402985c7
2020-07-09 11:38:07 -07:00
Edward Yang
7ff7c9738c Revert D22418756: [pytorch][PR] Migrate addmm, addbmm and THBlas_gemm to ATen
Test Plan: revert-hammer

Differential Revision:
D22418756 (6725c034b6)

Original commit changeset: 44e7bb596426

fbshipit-source-id: cbaaf3ad277648901700ef0e47715580e8f8e0dc
2020-07-09 07:47:19 -07:00
Natalia Gimelshein
155fb22e77 Run single-threaded gradgradcheck in testnn (#41147)
Summary:
Reland https://github.com/pytorch/pytorch/issues/40999

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41147

Reviewed By: mruberry

Differential Revision: D22450357

Pulled By: ngimel

fbshipit-source-id: 02b6e020af5e6ef52542266bd9752b9cfbec4159
2020-07-08 22:53:27 -07:00
Peter Bell
6725c034b6 Migrate addmm, addbmm and THBlas_gemm to ATen (#40927)
Summary:
Closes https://github.com/pytorch/pytorch/issues/24679, closes https://github.com/pytorch/pytorch/issues/24678

`addbmm` depends on `addmm`, so it needed to be ported at the same time. I also removed `THTensor_(baddbmm)`, which I noticed had already been ported and so was just dead code.

After having already written this code, I had to fix merge conflicts with https://github.com/pytorch/pytorch/issues/40354, which revealed there was already an established place for CPU BLAS routines in ATen. However, the version there doesn't make use of ATen's AVX dispatching, so I thought I'd wait for comment before migrating this into that style.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40927

Differential Revision: D22418756

Pulled By: ezyang

fbshipit-source-id: 44e7bb5964263d73ae8cc6adc5f6d4e966476ae6
2020-07-08 17:00:37 -07:00
Brian Vaughan
a04af4dccb Revert D22396896: [pytorch][PR] run single-threaded gradgradcheck in test_nn
Test Plan: revert-hammer

Differential Revision:
D22396896 (dac63a13cb)

Original commit changeset: 3b247caceb65

fbshipit-source-id: 90bbd71ca5128a7f07fe2907c061ee0922d16edf
2020-07-07 07:43:39 -07:00
Natalia Gimelshein
dac63a13cb run single-threaded gradgradcheck in test_nn (#40999)
Summary:
The most time-consuming tests in test_nn (taking about half the time) were gradgradchecks on Conv3d. Reduce their sizes and, most importantly, run gradgradcheck single-threaded, because that cuts the time of conv3d tests by an order of magnitude and barely affects other tests.
These changes bring test_nn time down from 1200 s to ~550 s on my machine.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40999

Differential Revision: D22396896

Pulled By: ngimel

fbshipit-source-id: 3b247caceb65d64be54499de1a55de377fdf9506
2020-07-06 17:21:25 -07:00
Xiao Wang
b7517a76ba rshift use default >> operator (#40545)
Summary:
Fix https://github.com/pytorch/pytorch/issues/40032
Also see https://github.com/pytorch/pytorch/pull/35339
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40545

Reviewed By: pbelevich

Differential Revision: D22362816

Pulled By: ngimel

fbshipit-source-id: 4bbf9212b21a4158badbfee8146b3b67e94d5a33
2020-07-02 15:13:12 -07:00
Hong Xu
2cf9fe2d92 Remove more error-exposing tests in exp that cannot be reliably reproduced (#40825)
Summary:
Continuing https://github.com/pytorch/pytorch/issues/40824

All CIs have been enabled (on a branch that starts with `ci-all/`)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40825

Differential Revision: D22328732

Pulled By: ezyang

fbshipit-source-id: 3e517d01a9183d95df0687b328fb268947ea5fb0
2020-06-30 22:14:32 -07:00
Hong Xu
29aef8f460 Skip some error-producing exp tests that cannot be reliably reproduced (#40824)
Summary:
This is to take care of additional master CI tests for https://github.com/pytorch/pytorch/issues/39087
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40824

Differential Revision: D22321429

Pulled By: ezyang

fbshipit-source-id: 607e284688b3e4ce24d803a030e31991e4e32fd7
2020-06-30 15:39:09 -07:00
anjali411
c648cd372f Fix complex printing for sci_mode=True (#40513)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40513

This PR makes the following changes:
1. Complex printing now uses print formatting for its real and imaginary values, and they are joined at the end.
2. Adding 1. naturally fixes the printing of complex tensors in sci_mode=True

```
>>> torch.tensor(float('inf')+float('inf')*1j)
tensor(nan+infj)
>>> torch.randn(2000, dtype=torch.cfloat)
tensor([ 0.3015-0.2502j, -1.1102+1.2218j, -0.6324+0.0640j,  ...,
        -1.0200-0.2302j,  0.6511-0.1889j, -0.1069+0.1702j])
>>> torch.tensor([1e-3, 3+4j, 1e-5j, 1e-2+3j, 5+1e-6j])
tensor([1.0000e-03+0.0000e+00j, 3.0000e+00+4.0000e+00j, 0.0000e+00+1.0000e-05j,
        1.0000e-02+3.0000e+00j, 5.0000e+00+1.0000e-06j])
>>> torch.randn(3, dtype=torch.cfloat)
tensor([ 1.0992-0.4459j,  1.1073+0.1202j, -0.2177-0.6342j])
>>> x = torch.tensor([1e2, 1e-2])
>>> torch.set_printoptions(sci_mode=False)
>>> x
tensor([  100.0000,     0.0100])
>>> x = torch.tensor([1e2, 1e-2j])
>>> x
tensor([100.+0.0000j,   0.+0.0100j])
```

Test Plan: Imported from OSS

Differential Revision: D22309294

Pulled By: anjali411

fbshipit-source-id: 20edf9e28063725aeff39f3a246a2d7f348ff1e8
2020-06-30 11:13:42 -07:00
Hong Xu
a303fd2ea6 Let exp support complex types on CUDA and enable device/dtype in complex tests (#39087)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39087

Differential Revision: D22169697

Pulled By: anjali411

fbshipit-source-id: 4866b7be6742508cc40540ed1ac811f005531d8b
2020-06-30 10:50:40 -07:00
kshitij12345
4104ab8b18 Add torch.count_nonzero (#39992)
Summary:
Reference https://github.com/pytorch/pytorch/issues/38349

TODO:

* [x] Add tests
* [x] Add docs (pending add to docs.rst)
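
A quick usage sketch:

```
import torch

t = torch.tensor([[0, 1, 2],
                  [3, 0, 0]])
torch.count_nonzero(t)         # tensor(3)
torch.count_nonzero(t, dim=0)  # tensor([1, 1, 1])
```
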
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39992

Reviewed By: ezyang

Differential Revision: D22236738

Pulled By: mruberry

fbshipit-source-id: 8520068b086b5ffc4de9e4939e746ff889293987
2020-06-30 06:39:13 -07:00
anjali411
9393ac011a [CUDA] addmm for complex (#40431)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40431

Test Plan: Imported from OSS

Differential Revision: D22285916

Pulled By: anjali411

fbshipit-source-id: 5863c713bdaa8e5b4f3d2b41fa59108502145a23
2020-06-29 17:41:46 -07:00
Sameer Deshmukh
9ca4a46bf8 Implement parallel scatter reductions for CPU (#36447)
Summary:
This PR implements gh-33389.

As a result of this PR, users can now specify various reduction modes for scatter operations. Currently, `add`, `subtract`, `multiply` and `divide` have been implemented, and adding new ones is not hard.

While we now allow dynamic runtime selection of reduction modes, the performance is the same as was the case for the `scatter_add_` method in the master branch. Proof can be seen in the graph below, which compares `scatter_add_` in the master branch (blue) and `scatter_(reduce="add")` from this PR (orange).
![scatter-regression py csv](https://user-images.githubusercontent.com/2629909/82671491-e5e22380-9c79-11ea-95d6-6344760c8578.png)

The script used for benchmarking is as follows:
``` python
import os
import sys
import torch
import time
import numpy
from IPython import get_ipython

Ms=256
Ns=512
dim = 0
top_power = 2
ipython = get_ipython()

plot_name = os.path.basename(__file__)
branch = sys.argv[1]
fname = open(plot_name + ".csv", "a+")

for pM in range(top_power):
    M = Ms * (2 ** pM)
    for pN in range(top_power):
        N = Ns * (2 ** pN)
        input_one = torch.rand(M, N)
        index = torch.tensor(numpy.random.randint(0, M, (M, N)))
        res = torch.randn(M, N)

        test_case = f"{M}x{N}"
        print(test_case)
        tobj = ipython.magic("timeit -o res.scatter_(dim, index, input_one, reduce=\"add\")")

        fname.write(f"{test_case},{branch},{tobj.average},{tobj.stdev}\n")

fname.close()
```

Additionally, one can see that various reduction modes take almost the same time to execute:
```
op: add
70.6 µs ± 27.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
26.1 µs ± 26.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
op: subtract
71 µs ± 20.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
26.4 µs ± 34.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
op: multiply
70.9 µs ± 31.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
27.4 µs ± 29.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
op: divide
164 µs ± 48.8 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
52.3 µs ± 132 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```
Script:
``` python
import torch
import time
import numpy
from IPython import get_ipython

ipython = get_ipython()

nrows = 3000
ncols = 10000
dims = [nrows, ncols]

res = torch.randint(5, 10, dims)
idx1 = torch.randint(dims[0], (1, dims[1])).long()
src1 = torch.randint(5, 10, (1, dims[1]))
idx2 = torch.randint(dims[1], (dims[0], 1)).long()
src2 = torch.randint(5, 10, (dims[0], 1))

for op in ["add", "subtract", "multiply", "divide"]:
    print(f"op: {op}")
    ipython.magic("timeit res.scatter_(0, idx1, src1, reduce=op)")
    ipython.magic("timeit res.scatter_(1, idx2, src2, reduce=op)")
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36447

Differential Revision: D22272631

Pulled By: ngimel

fbshipit-source-id: 3cdb46510f9bb0e135a5c03d6d4aa5de9402ee90
2020-06-29 15:52:11 -07:00
anjali411
11a74a58c8 Setter for real and imag tensor attributes (#39860)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39860

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D22163234

Pulled By: anjali411

fbshipit-source-id: 35b4aa16499341edff1a4be4076539ac7c74f5be
2020-06-29 15:44:55 -07:00
Mike Ruberry
cb26661fe4 Throws runtime error when torch.full would infer a float dtype from a bool or integral fill value (#40364)
Summary:
BC-breaking NOTE:

In PyTorch 1.6 bool and integral fill values given to torch.full must set the dtype or out keyword argument. In prior versions of PyTorch these fill values would return float tensors by default, but in PyTorch 1.7 they will return a bool or long tensor, respectively. The documentation for torch.full has been updated to reflect this.

PR NOTE:

This PR causes torch.full to throw a runtime error when it would have inferred a float dtype by being given a boolean or integer value. A versioned symbol for torch.full is added to preserve the behavior of already serialized Torchscript programs. Existing tests for the deprecated behavior have been updated to reflect that it is now unsupported, and a couple of new tests have been added to validate the versioned symbol behavior. The documentation of torch.full has also been updated to reflect this change.
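
A short sketch of the behavior described above:

```
import torch

# Bool/integral fill values must come with an explicit dtype (or out=) in 1.6.
torch.full((2,), 7, dtype=torch.long)     # tensor([7, 7])
torch.full((2,), True, dtype=torch.bool)  # tensor([True, True])

# Float fill values keep inferring a float tensor as before.
torch.full((2,), 7.0)                     # tensor([7., 7.])
```
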
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40364

Differential Revision: D22176640

Pulled By: mruberry

fbshipit-source-id: b20158ebbcb4f6bf269d05a688bcf4f6c853a965
2020-06-23 23:27:22 -07:00
Nikita Shulga
7e32e6048d Fix linspace step computation for large integral types (#40132)
Summary:
Convert start and end to `step_t` before computing the difference
Should fix `torch.linspace(-2147483647, 2147483647, 10, dtype=torch.int32)`

Closes https://github.com/pytorch/pytorch/issues/40118
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40132

Differential Revision: D22190095

Pulled By: malfet

fbshipit-source-id: 01cb158a30c505191df663d021804d411b697871
2020-06-23 16:59:59 -07:00
Kimish Patel
6a421d50ab Enabling concat fast path for channels last inputs (#39448)
Summary:
Updates concat kernel for contiguous input to support channels_last contig tensors.

This was tried on a squeezenet model on a Pixel 2 device. It improves model performance by about 25%.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39448

Test Plan: test_cat_in_channels_last

Differential Revision: D22160526

Pulled By: kimishpatel

fbshipit-source-id: 6eee6e74b8a5c66167828283d16a52022a16997f
2020-06-23 13:01:59 -07:00
anjali411
8ec2ae9a9f Add view_as_real, view_as_complex for complex tensors (#39099)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39099

Test Plan: Imported from OSS

Differential Revision: D22057886

Pulled By: anjali411

fbshipit-source-id: bad5ba7097ba0dd13f2c549b2463094dee9afa14
2020-06-22 15:15:27 -07:00
anjali411
c72ab19458 Add addmv for complex dtypes (#40238)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40238

Differential Revision: D22160528

Pulled By: anjali411

fbshipit-source-id: 04093e5929318a7acc9c9b502c76d0a8bf15d5e1
2020-06-22 10:54:35 -07:00
Hong Xu
3894de569e Reenable memory format test for some unary functions (#39102)
Summary:
Many of them have already been migrated to ATen
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39102

Differential Revision: D22162193

Pulled By: VitalyFedyunin

fbshipit-source-id: 80db9914fbd792cd610c4e8ab643ab97845fac9f
2020-06-22 10:46:28 -07:00
Edward Yang
e4766fb4d9 Meta tensors, but without code deduplication (#38490)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38490

A meta tensor is a lot like a normal tensor, except it doesn't
actually have any data associated with it. You can use meta tensors
to carry out shape/dtype computations without actually running the
code; for example, this could be used to do shape inference in a
JIT analysis pass.
Check out the description in DispatchKey.h for more information.

Meta tensors are part of a larger project to rationalize how we
write kernels so that we don't have to duplicate shape logic
in CPU kernel, CUDA kernel and meta kernel (this PR makes the
duplication problem worse!)  However, that infrastructure can
be built on top of this proof of concept, which just shows how
you can start writing meta kernels today even without this
infrastructure.

There are a lot of things that don't work:
- I special cased printing for dense tensors only; if you try to
  allocate a meta sparse / quantized tensor things aren't going
  to work.
- The printing formula implies that torch.tensor() can take an
  ellipsis, but I didn't add this.
- I wrote an example formula for binary operators, but it isn't
  even right!  (It doesn't do type promotion or memory layout
  correctly).  The most future-proof way to do it right is to
  factor the relevant computation out of TensorIterator,
  as it is quite involved.
- Nothing besides torch.add works right now
- Meta functions are ALWAYS included in mobile builds (selective
  build doesn't work on them).  This isn't a big deal for now
  but will become more pressing as more meta functions are added.

One reason I'm putting up this PR now is to check with Yinghai Lu
if we can unblock shape inference for accelerators, while we are
still working on a long term plan for how to unify all shape
computation across our kernels.
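
A minimal sketch of the intended usage; whether factory functions accepted device='meta' at this exact point is an assumption, and per the notes above only torch.add really works:

```
import torch

# Meta tensors carry shape/dtype/device but no storage.
a = torch.empty(2, 3, device='meta')
b = torch.empty(2, 3, device='meta')
c = torch.add(a, b)
print(c.shape, c.dtype, c.device)  # torch.Size([2, 3]) torch.float32 meta
```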

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D21935609

Pulled By: ezyang

fbshipit-source-id: f7d8636eeb8516b6bc296db99a16e56029972eee
2020-06-22 09:18:33 -07:00
rohithkrn
396087bfd8 [ROCm] Enable BFloat16 for pow, exp, erf ops on ROCm (#40236)
Summary:
Enable ops used in BERT which were missed in one of my earlier PRs.
ezyang jeffdaily
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40236

Differential Revision: D22143965

Pulled By: ezyang

fbshipit-source-id: 5464ed021687fec1485e1c061e5a7aba71687fc4
2020-06-22 08:22:17 -07:00
Natalia Gimelshein
3bbedb34b9 restore generic IndexToScatterGatherOffset specialization (#40349)
Summary:
https://github.com/pytorch/pytorch/issues/39963 erroneously removed template specialization to compute offsets, causing cases relying on this specialization (topk for 4d+ tensors with topk dimension >= 1024/2048 depending on the type) to produce bogus results.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40349

Differential Revision: D22153756

Pulled By: ngimel

fbshipit-source-id: cac04969acb6d7733a7da2c1784df7d30fda1606
2020-06-20 23:14:13 -07:00
Vitaly Fedyunin
a47fb57957 Change memory format promotion rules of point wise operators. (#37968)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37968

Modify memory format promotion rules to avoid promoting when one of the inputs is ambiguous. New rules are:
 Ambiguous + Contiguous = Contiguous
 Ambiguous + Channels Last = Channels Last
 Contiguous + Ambiguous ( NC11 ) = Contiguous
 Contiguous + Channels Last = Contiguous ( + Warning )  Before this PR: Channels Last
 Channels Last + Contiguous = Channels Last ( + Warning )
 Channels Last + Ambiguous = Channels Last
 Bias + Channels Last = Channels Last
 Channels Last + Bias = Channels Last

Test Plan: Imported from OSS

Differential Revision: D21819573

Pulled By: VitalyFedyunin

fbshipit-source-id: 7381aad11720b2419fb37a6da6ff4f54009c6532
2020-06-20 10:33:32 -07:00
Gregory Chanan
96057c0080 Fix missing deprecation warning for Tensor.nonzero(). (#40187)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40187

There were two issues:
1) The hand-written definition included an ambiguous default, which prevented the deprecated signature from being selected. This didn't match the handwritten torch.nonzero; now they match.
2) A parsing bug for empty argument lists meant the signature wasn't being marked as deprecated.

Test Plan: Imported from OSS

Differential Revision: D22118236

Pulled By: gchanan

fbshipit-source-id: a433ce9069fef28aea97cbd76f2adf5a285abd73
2020-06-19 09:24:48 -07:00
lixinyu
645d6c014c preserve output tensor's stride in TI's fast setup (#38895)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38895

Test Plan: Imported from OSS

Differential Revision: D21696586

Pulled By: glaringlee

fbshipit-source-id: c7206dbcf74d30998544e221cd0c998c4c25663a
2020-06-18 11:34:21 -07:00
Richard Zou
2ba5f98dd1 Revert D22068657: [pytorch][PR] Remove global CMAKE_INSTALL_RPATH_USE_LINK_PATH directive
Test Plan: revert-hammer

Differential Revision:
D22068657

Original commit changeset: b04c529572a9

fbshipit-source-id: d8227dfc12d9b6382f7bf2905686b6025034561c
2020-06-17 13:05:01 -07:00
mattip
49732f0450 Remove global CMAKE_INSTALL_RPATH_USE_LINK_PATH directive (#37737)
Summary:
Closes gh-35418,

PR gh-16414 added [the `CMAKE_INSTALL_RPATH_USE_LINK_PATH` directive](https://github.com/pytorch/pytorch/pull/16414/files#diff-dcf5891602b4162c36c2125c806639c5R16) which is non-standard and will cause CMake to write an `RPATH` entry for libraries outside the current build. Removing it leaves an RPATH entry for `$ORIGIN` but removes the entries for things like `/usr/local/cuda-10.2/lib64/stubs:/usr/local/cuda-10.2/lib64` for `libcaffe2_nvrtc.so` on Linux.

The added test fails before this PR and passes after. It is equivalent to checking `objdump -p torch/lib/libcaffe2_nvrtc.so | grep RPATH` for an external path to the directory where CUDA "lives".

I am not sure if it solves the `rpath/libc++.1.dylib` problem for `_C.cpython-37m-darwin.so` on macOS in issue gh-36941.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37737

Differential Revision: D22068657

Pulled By: ezyang

fbshipit-source-id: b04c529572a94363855f1e4dd3e93c9db3c85657
2020-06-16 11:18:39 -07:00
Peter Bell
ad86c94f14 Reduce memory requirement for test_argminmax_large_axis (#40036)
Summary:
Closes gh-39060

The `TensorIterator` splitting is based on `can_use_32bit_indexing`, which assumes 32-bit signed ints, so we can get away with just 2**31 as the axis length. I also tested on an old commit and could reproduce the test failure with just a 1d tensor, overall quartering the memory requirement for the test.

4c7d81f847/aten/src/ATen/native/TensorIterator.cpp (L879)

For reference, the test was first added in gh-33310.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40036

Differential Revision: D22068690

Pulled By: ezyang

fbshipit-source-id: 83199fd31647d1ef106b08f471c0e9517d3516e3
2020-06-16 10:19:10 -07:00
Mike Ruberry
ebd869153c Clarifies compare_with_numpy behavior (#40064)
Summary:
Currently compare_with_numpy requires a device and dtype, but these arguments are ignored if a tensor is provided. This PR updates the function to only take a device and dtype if a tensor-like (non-tensor) object is given. This should prevent the confusion where you could, for example, pass a CPU float tensor but provide a CUDA device and integer dtype.

Several tests are updated to reflect this behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40064

Differential Revision: D22058072

Pulled By: mruberry

fbshipit-source-id: b494bb759855977ce45b79ed3ffb0319a21c324c
2020-06-16 05:01:33 -07:00
Xiong Wei
51e341df4f [bernoulli_kernel] Replace CPU_tensor_apply functions with cpu_serial_kernel (#39711)
Summary:
Resolve https://github.com/pytorch/pytorch/issues/39556
Related https://github.com/pytorch/pytorch/issues/38558

Replace CPU_tensor_apply functions with cpu_serial_kernel in bernoulli_kernel, unifying bernoulli_kernel with all other kernels in `cpu/DistributionTemplates.h`.

Signed-off-by: Xiong Wei <xiongw.fnst@cn.fujitsu.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39711

Differential Revision: D22052374

Pulled By: pbelevich

fbshipit-source-id: 416334da50195b67f05a18a98971f370cba4fb0d
2020-06-15 14:11:41 -07:00
Kurt Mohler
db2b273d1f Reland: Fix CUDA device guard usage when first arg of kernel is scalar (#39956)
Summary:
Reland PR https://github.com/pytorch/pytorch/issues/39870

Closes https://github.com/pytorch/pytorch/issues/38889
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39956

Differential Revision: D22027956

Pulled By: ngimel

fbshipit-source-id: e6029f450e2da3782b2d05bcc2012c19b82291da
2020-06-12 21:41:53 -07:00
Kurt Mohler
124cdf2290 Add experimental deterministic flag (#38683)
Summary:
Adds `torch.experimental.deterministic` flag to enforce deterministic algorithms across all of pytorch.
Adds `torch.experimental.deterministic_error_level` to allow users to choose between error/warning/silent if determinism for an operation is not available.
Adds `torch.experimental.alert_not_deterministic()` which should be called within operations that are not deterministic.
Offers both Python and ATen interfaces

Issue https://github.com/pytorch/pytorch/issues/15359
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38683

Differential Revision: D21998093

Pulled By: ezyang

fbshipit-source-id: 23aabbddd20f6199d846f97764ff24d728163737
2020-06-12 08:44:06 -07:00
Alban Desmaison
52cc0c2c37 Revert D22011184: [pytorch][PR] Fix CUDA device guard usage when first arg of kernel is scalar
Test Plan: revert-hammer

Differential Revision:
D22011184

Original commit changeset: 427291c456e8

fbshipit-source-id: 7d4979e98bbd9294b91da255ecfc063615741630
2020-06-12 06:46:11 -07:00
Kurt Mohler
2cd27be5b5 Fix CUDA device guard usage when first arg of kernel is scalar (#39870)
Summary:
Add an OptionalDeviceGuard for second arg in gpu_kernel_with_scalars when first arg is scalar

Closes https://github.com/pytorch/pytorch/issues/38889
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39870

Differential Revision: D22011184

Pulled By: ngimel

fbshipit-source-id: 427291c456e879f25d15ab76a60b5d4ad61f3b3f
2020-06-11 20:08:43 -07:00
Xiang Gao
b10c53e9b8 Vectorize on output for reduction kernels (#37206)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37206

Benchmark on P100: https://github.com/zasdfgbnm/things/blob/master/2020Q2/reduction-benchmark-vectorize-output.ipynb

```python
import torch
print(torch.__version__)
print()

for i in range(1000):
    torch.arange(10000, device='cuda')

def benchmark(dtype, i):
    size0 = 2 ** (i // 2)
    size1 = 2 ** ((i + 1) // 2)
    a = torch.zeros(size0, size1, device='cuda', dtype=dtype)
    torch.cuda.synchronize()
    %timeit a.sum(dtype=dtype, dim=0); torch.cuda.synchronize()

for dtype in [torch.int8, torch.half, torch.float, torch.double]:
    print(dtype)
    for i in range(18, 30):
        benchmark(dtype, i)
    print()
```
Before
```
1.5.0a0+3bbb36e

torch.int8
24.5 µs ± 111 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
24.1 µs ± 216 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
26.1 µs ± 133 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
30.9 µs ± 132 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
39 µs ± 504 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
59.6 µs ± 244 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
111 µs ± 112 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
186 µs ± 300 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
397 µs ± 791 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
665 µs ± 1.06 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.45 ms ± 837 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
3.03 ms ± 2.79 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

torch.float16
24.2 µs ± 66.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
24.6 µs ± 255 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
27.2 µs ± 53.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
32 µs ± 91 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
48.1 µs ± 89.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
66.9 µs ± 66.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
121 µs ± 102 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
218 µs ± 384 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
431 µs ± 554 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
854 µs ± 1.23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.75 ms ± 1.05 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
3.63 ms ± 849 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)

torch.float32
24.2 µs ± 117 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
24.4 µs ± 237 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
29.3 µs ± 34.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
40.5 µs ± 36.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
57.4 µs ± 44.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
85.5 µs ± 41.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
158 µs ± 106 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
288 µs ± 181 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
557 µs ± 904 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1e+03 µs ± 1.27 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.98 ms ± 533 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
3.8 ms ± 1.98 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

torch.float64
25 µs ± 54.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
26.9 µs ± 320 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
37.1 µs ± 51.8 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
54.3 µs ± 45.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
84.9 µs ± 65.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
139 µs ± 68.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
275 µs ± 235 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
504 µs ± 702 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
987 µs ± 613 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.84 ms ± 1.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
3.64 ms ± 2.44 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
7.19 ms ± 1.19 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
After
```
1.5.0a0+3bbb36e

torch.int8
29.8 µs ± 213 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
30.7 µs ± 1.41 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
33.4 µs ± 4.48 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
32.5 µs ± 110 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
40.6 µs ± 94.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
53.7 µs ± 66.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
68 µs ± 69.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
98.2 µs ± 88.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
158 µs ± 116 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
283 µs ± 120 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
522 µs ± 563 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
967 µs ± 495 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

torch.float16
29.4 µs ± 68.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
29.2 µs ± 45.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
30.8 µs ± 41 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
35.3 µs ± 20.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
50.1 µs ± 133 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
70.4 µs ± 67.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
101 µs ± 325 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
157 µs ± 179 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
275 µs ± 791 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
486 µs ± 122 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
936 µs ± 211 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.85 ms ± 124 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

torch.float32
29.9 µs ± 36.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
29.5 µs ± 108 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
33 µs ± 93.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
46 µs ± 37.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
64 µs ± 73.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
99.4 µs ± 82.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
157 µs ± 74.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
265 µs ± 68.8 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
490 µs ± 319 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
960 µs ± 669 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.84 ms ± 632 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
3.6 ms ± 1.63 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

torch.float64
33.1 µs ± 74.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
36.7 µs ± 86.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
46.7 µs ± 39.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
61.6 µs ± 196 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
100 µs ± 23.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
158 µs ± 202 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
270 µs ± 332 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
491 µs ± 445 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
939 µs ± 339 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.88 ms ± 1.09 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
3.65 ms ± 5.18 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
7.3 ms ± 7.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

Test Plan: Imported from OSS

Differential Revision: D21233255

Pulled By: ngimel

fbshipit-source-id: d468fddbb228c0c13146dfc6344c470513f9e374
2020-06-11 19:44:17 -07:00
Natalia Gimelshein
f59e38974a fix multinomial for empty batch (#39873)
Summary:
Per title
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39873

Reviewed By: ailzhang

Differential Revision: D22004830

Pulled By: ngimel

fbshipit-source-id: 0274cd2ee40e84f06b34e7b53329e95d05a9ddd4
2020-06-11 17:26:39 -07:00
kshitij12345
97dfdaaad8 torch.multinomial : fast-path for replacement=False (#39742)
Summary:
Benchmark with same build settings on same system.
gcc : version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
CUDA : 10.1
GPU : 1050ti

```python
import time
import torch
import numpy as np

for n, t in [(500_000, 10),
             (1_000_000, 10)]:
    for dtype in (torch.half, torch.float, torch.double):
        # Input Setup
        p = torch.from_numpy(np.random.rand(n)).to(dtype)
        want = 1000
        print(f'torch.multinomial(a) a.numel() == {n} for {t} times {dtype}')
        start = time.time()
        # Iterate
        for _ in range(t):
            torch.multinomial(p, want, replacement=False)
        print(f'Took:', time.time() - start)

print('****' * 10)

for n, t in [(50_000, 100),
             (100_000, 100)]:
    for dtype in (torch.half, torch.float, torch.double):
        # Input Setup
        p = torch.rand(n, device='cuda', dtype=dtype)
        want = 1000
        print(f'torch.multinomial(a) a.numel() == {n} for {t} times {dtype}')
        start = time.time()
        # torch.cuda.synchronize()
        # Iterate
        for _ in range(t):
            torch.multinomial(p, want, replacement=False)
        # torch.cuda.synchronize()
        print(f'CUDA Took:', time.time() - start)
```

Before:

```
torch.multinomial(a) a.numel() == 500000 for 10 times torch.float16
Took: 80.64455389976501
torch.multinomial(a) a.numel() == 500000 for 10 times torch.float32
Took: 3.7778031826019287
torch.multinomial(a) a.numel() == 500000 for 10 times torch.float64
Took: 5.045570611953735
torch.multinomial(a) a.numel() == 1000000 for 10 times torch.float16
Took: 161.53191947937012
torch.multinomial(a) a.numel() == 1000000 for 10 times torch.float32
Took: 7.640851736068726
torch.multinomial(a) a.numel() == 1000000 for 10 times torch.float64
Took: 10.399673461914062
****************************************
torch.multinomial(a) a.numel() == 50000 for 100 times torch.float16
CUDA Took: 4.873984098434448
torch.multinomial(a) a.numel() == 50000 for 100 times torch.float32
CUDA Took: 4.713594436645508
torch.multinomial(a) a.numel() == 50000 for 100 times torch.float64
CUDA Took: 11.167185068130493
torch.multinomial(a) a.numel() == 100000 for 100 times torch.float16
CUDA Took: 7.195427417755127
torch.multinomial(a) a.numel() == 100000 for 100 times torch.float32
CUDA Took: 7.669712066650391
torch.multinomial(a) a.numel() == 100000 for 100 times torch.float64
CUDA Took: 20.20938801765442
```

After:

```
torch.multinomial(a) a.numel() == 500000 for 10 times torch.float16
Took: 81.09321522712708
torch.multinomial(a) a.numel() == 500000 for 10 times torch.float32
Took: 0.06062650680541992
torch.multinomial(a) a.numel() == 500000 for 10 times torch.float64
Took: 0.0862889289855957
torch.multinomial(a) a.numel() == 1000000 for 10 times torch.float16
Took: 161.85304307937622
torch.multinomial(a) a.numel() == 1000000 for 10 times torch.float32
Took: 0.13271093368530273
torch.multinomial(a) a.numel() == 1000000 for 10 times torch.float64
Took: 0.17215657234191895
****************************************
torch.multinomial(a) a.numel() == 50000 for 100 times torch.float16
CUDA Took: 0.035035133361816406
torch.multinomial(a) a.numel() == 50000 for 100 times torch.float32
CUDA Took: 0.03631949424743652
torch.multinomial(a) a.numel() == 50000 for 100 times torch.float64
CUDA Took: 0.05507040023803711
torch.multinomial(a) a.numel() == 100000 for 100 times torch.float16
CUDA Took: 0.05105161666870117
torch.multinomial(a) a.numel() == 100000 for 100 times torch.float32
CUDA Took: 0.05449223518371582
torch.multinomial(a) a.numel() == 100000 for 100 times torch.float64
CUDA Took: 0.09161853790283203
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39742

Differential Revision: D21976915

Pulled By: ngimel

fbshipit-source-id: 34431f814f31b6dfd6179a89f8e4fa574da7a306
2020-06-10 20:42:55 -07:00
Mike Ruberry
95489b590f Throws runtime error when performing integer division using torch.div (#38620)
Summary:
**1.6 Deprecation Note**

In PyTorch 1.6 attempting to divide two integer tensors or an integer tensor and an integer scalar will throw a runtime error. This behavior was deprecated with a warning in PyTorch 1.5. In PyTorch 1.7 torch.div and the division operator will always perform true division like Python3 and NumPy.

To divide integer values use either torch.true_divide, for true division, or torch.floor_divide (the // operator) for floor division.
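
A short sketch of the replacement APIs:

```
import torch

a = torch.tensor(7)
b = torch.tensor(2)

# a / b and torch.div(a, b) raise a RuntimeError in 1.6 for integer inputs.
torch.true_divide(a, b)   # tensor(3.5000) -- true division
torch.floor_divide(a, b)  # tensor(3)      -- floor division (the // operator)
```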

**PR Summary**

This PR turns the warning issued when performing integer division into a runtime error. Because some serialized Torchscript programs may rely on torch.div's historic behavior, it also implements a "versioned symbol" for div that lets those models retain their current behavior. Extensive tests of this behavior are the majority of this PR.

Note this change bumps the produced file format version to delineate which programs should have their historic div behavior preserved.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38620

Differential Revision: D21612598

Pulled By: mruberry

fbshipit-source-id: c9c33591abce2f7e97f67f0f859901f5b03ed47d
2020-06-10 13:59:34 -07:00
Mike Ruberry
0aecbbb762 Changes TensorIterator computation to not consider out kwarg, lets UnaryOps safe cast to out (#39655)
Summary:
**BC breaking note:**

In PyTorch 1.5 passing the out= kwarg to some functions, like torch.add, could affect the computation. That is,

```
out = torch.add(a, b)
```

could produce a different tensor than

```
torch.add(a, b, out=out)
```

This is because previously the out argument participated in the type promotion rules. For greater consistency with NumPy, Python, and C++, in PyTorch 1.6 the out argument no longer participates in type promotion, and has no effect on the computation performed.

**ORIGINAL PR NOTE**

This PR effectively rewrites Tensor Iterator's "compute_types" function to both clarify its behavior and change how our type promotion works to never consider the out argument when determining the iterator's "common dtype," AKA its "computation type." That is,

```
a = op(b, c)
```

should always produce the same result as

```
op(b, c, out=a)
```

This is consistent with NumPy and programming languages like Python and C++.

The conceptual model for this change is that a TensorIterator may have a "common computation type" that all inputs are cast to and its computation performed in. This common computation type, if it exists, is determined by applying our type promotion rules to the inputs.

A common computation type is natural for some classes of functions, like many binary elementwise functions (e.g. add, sub, mul, div...). (NumPy describes these as "universal functions.") Many functions, however, like indexing operations, don't have a natural common computation type. In the future we'll likely want to support setting the TensorIterator's common computation type explicitly to enable "floating ufuncs" like the sin function that promote integer types to the default scalar type. Logic like that is beyond the type promotion system, which can only review inputs.

Implementing this change in a readable and maintainable manner was challenging because compute_types() has had many small modifications from many authors over ~2 year period, and the existing logic was in some places outdated and in other places unnecessarily complicated. The existing "strategies" approach also painted with a broad brush, and two of them no longer made conceptual sense after this change. As a result, the new version of this function has a small set of flags to control its behavior. This has the positive effect of disentangling checks like all operands having the same device and their having the same dtype.

Additional changes in this PR:

- Unary operations now support out arguments with different dtypes. Like binary ops they check canCast(computation type, out dtype).
- The dtype checking for lerp was outdated and its error message included the wrong variable. It has been fixed.
- The check for whether all tensors are on the same device has been separated from other checks. TensorIterators used by copy disable this check.
- As a result of this change, the output dtype can be computed if only the input types are available.
- The "fast path" for checking if a common dtype computation is necessary has been updated and simplified to also handle zero-dim tensors.
- A couple helper functions for compute_types() have been inlined to improve readability.
- The confusingly named and no longer used promote_gpu_output_dtypes_ has been removed. This variable was intended to support casting fp16 reductions on GPU, but it has become a nullop. That logic is now implemented here: 856215509d/aten/src/ATen/native/ReduceOpsUtils.h (L207).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39655

Differential Revision: D21970878

Pulled By: mruberry

fbshipit-source-id: 5e6354c78240877ab5d6b1f7cfb351bd89049012
2020-06-10 09:04:13 -07:00
Gregory Chanan
18073ffca3 Add tests for mismatched dtypes in torch.gather. (#39689)
Summary:
https://github.com/pytorch/pytorch/pull/38646 added checks for this, but only added tests for the scatter functions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39689

Reviewed By: malfet

Differential Revision: D21945524

Pulled By: gchanan

fbshipit-source-id: 8b06856c06d6427b8cd929a1275422a5ed6e11cc
2020-06-09 08:05:40 -07:00
kshitij12345
9733390998 Add torch.flip{lr, ud} (#38599)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/38349

TODO:
* [x] Add Tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38599

Differential Revision: D21941884

Pulled By: mruberry

fbshipit-source-id: 7a442ff11051c2c868cf8e3c04e4bba0f1a1d426
2020-06-09 07:19:37 -07:00
Nikita Shulga
1790d35848 Skip test_minmax_illegal_dtype on XLA (#39693)
Summary:
It's better to have skipping logic explicitly defined in test decorators rather than in some hard-to-find blacklists
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39693

Differential Revision: D21947893

Pulled By: malfet

fbshipit-source-id: 3d0855eda7e10746ead80fccf84a8db8bf5a3ef1
2020-06-08 22:34:44 -07:00
Nikita Shulga
64192ca3da Skip unit tests relying on MKL if compiled without it (#39672)
Summary:
Also skip TestTorchDeviceTypeCPU.test_float_to_int_conversion_finite_cpu_uint8 on PowerPC.
See an example of the test failures at https://powerci.osuosl.org/job/pytorch-master-nightly-py3-linux-ppc64le/1099/console
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39672

Differential Revision: D21943588

Pulled By: malfet

fbshipit-source-id: 3da0d33597db5aa8728e682b8e27dd5f7f6765f4
2020-06-08 17:52:00 -07:00
Nik Ved
e4f9c74db3 add dtype checks for scatter/gather family of functions. (#38646)
Summary:
Adds additional dtype checks for scatter/gather family of functions, namely:
1. Checks whether `index` is of type `Long`
2. Checks whether `src.dtype == self.dtype`.

Fixes [https://github.com/pytorch/pytorch/issues/38554](https://github.com/pytorch/pytorch/issues/38554)
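
A minimal sketch of the two checks (assuming a recent PyTorch build); both calls are expected to raise a RuntimeError rather than silently proceed:

```python
import torch

x = torch.zeros(3)  # float32

try:
    x.scatter_(0, torch.tensor([0], dtype=torch.int32), 1.0)   # index must be int64 (Long)
except RuntimeError as e:
    print("index dtype check:", e)

try:
    x.scatter_(0, torch.tensor([0]), torch.tensor([1]))         # src dtype must match self.dtype
except RuntimeError as e:
    print("src dtype check:", e)
```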
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38646

Differential Revision: D21883033

Pulled By: gchanan

fbshipit-source-id: 4bbd48ec0706ddb002318742edba640871ec0162
2020-06-08 08:42:00 -07:00
William Gan
e41fe60867 Add error message when negative stride is passed to as_strided (#39508)
Summary:
Fixes this issue https://github.com/pytorch/pytorch/issues/33290.
Builds upon this PR https://github.com/pytorch/pytorch/pull/33392.
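
A minimal sketch of the new error path (the shape and stride values here are illustrative only):

```python
import torch

t = torch.arange(6.0)
try:
    t.as_strided((3,), (-1,))   # negative stride is now rejected with an explicit error message
except RuntimeError as e:
    print(e)
```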
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39508

Differential Revision: D21890557

Pulled By: zou3519

fbshipit-source-id: 8e1a9afb064a6e19551bf3ede3103dd3f023c660
2020-06-08 07:45:24 -07:00
xueht-fnst
faf0a3bd7a Move bernoulli_() to DistributionTemplates (#38558)
Summary:
Resolves the feature requested in https://github.com/pytorch/pytorch/issues/37373
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38558

Differential Revision: D21920685

Pulled By: pbelevich

fbshipit-source-id: 50c77d9aaa334b3276a2352afe6c4ad03f12be31
2020-06-07 07:18:30 -07:00
Shawn Zhong
2da5444221 [Resubmit] Fix argmin/max bug (#39576)
Summary:
Fix https://github.com/pytorch/pytorch/issues/38922

See previous PR: https://github.com/pytorch/pytorch/pull/38946

cc: ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39576

Differential Revision: D21906490

Pulled By: ngimel

fbshipit-source-id: f3bfb4e14c4cee60a1e3b80c049945ce85f9f494
2020-06-06 23:47:12 -07:00
Nikita Shulga
8811e4d00d Add/fix typing annotations to some functions (#39075)
Summary:
Add missing typing imports to some jit tests
Add typing annotations to `torch.testing._compare_scalars_internal` and `torch.testing._internal.assertTrue`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39075

Differential Revision: D21882468

Pulled By: malfet

fbshipit-source-id: dd9858eb8e11a38411544cc64daf36fced807d76
2020-06-04 13:40:04 -07:00
Xiong Wei
fe684679b0 Fix overflow issues when unpacking large numbers (#39140)
Summary:
Resolve https://github.com/pytorch/pytorch/issues/33111

relax the overflow and precision lost checks when unpacking doubles.

Signed-off-by: Xiong Wei <xiongw.fnst@cn.fujitsu.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39140

Differential Revision: D21885217

Pulled By: ezyang

fbshipit-source-id: e2bbe90d719443ea2e1c6b7b2c637f9a943fa5c0
2020-06-04 12:24:24 -07:00
krshrimali
335e4a1e3b Add arcosh, arcsinh and arctanh to unary ops (#38388)
Summary:
This PR aims to add `arcosh`, `arcsinh` and `arctanh` support. Please see issue https://github.com/pytorch/pytorch/issues/38349 for more details.

**TODOs:**

* [x] Add test cases for `arcosh`, `arcsinh` and `arctanh`. (need help)
* [x] Overload ops if `std::op` does not work with `thrust::complex` types (like for `sinh`, `cosh`).

Note: `std::acosh, std::asinh, std::atanh` do not support `thrust::complex` types. Added support for complex types for these 3 ops (`arccosh, arcsinh, arctanh`)
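
A minimal sketch of the round-trip behavior (assuming the ops surface under the acosh/asinh/atanh names, as in current releases):

```python
import torch

x = torch.tensor([0.25, 0.75])
print(torch.acosh(torch.cosh(x)))   # ~ x (cosh maps into acosh's domain)
print(torch.asinh(torch.sinh(x)))   # ~ x
print(torch.atanh(torch.tanh(x)))   # ~ x
```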

cc: mruberry
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38388

Differential Revision: D21882055

Pulled By: mruberry

fbshipit-source-id: d334590b47c5a89e491a002c3e41e6ffa89000e3
2020-06-04 11:40:55 -07:00
Aayush Naik
0829cadca3 Implement rad2deg, deg2rad (#38852)
Summary:
Resolves https://github.com/pytorch/pytorch/issues/38372.
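
A minimal sketch of the two ops (values chosen for illustration):

```python
import math
import torch

print(torch.rad2deg(torch.tensor([math.pi, math.pi / 2])))  # tensor([180.,  90.])
print(torch.deg2rad(torch.tensor([180.0, 90.0])))           # tensor([3.1416, 1.5708])
```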

cc mruberry
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38852

Differential Revision: D21868935

Pulled By: mruberry

fbshipit-source-id: ae6ded11b743c9d1cdc032984b4abe0a115290d6
2020-06-03 22:21:54 -07:00
anjali411
3370c045ae Remove copy_imag and copy_real methods (#39065)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39065

Test Plan: Imported from OSS

Differential Revision: D21803939

Pulled By: anjali411

fbshipit-source-id: c7313c527eb6b54d49ef46aa0a839a3418fa8d7e
2020-06-03 18:22:50 -07:00
ShawnZhong
cb530fcd3c Enable some test cases in test_memory_format_operators (#38648)
Summary:
Re-enable some test cases in `test_memory_format_operators` since their corresponding issue has been fixed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38648

Differential Revision: D21689085

Pulled By: VitalyFedyunin

fbshipit-source-id: 0aa09e0bf31ba98c8ad0191ac3afd31dda0f1d42
2020-06-03 16:02:49 -07:00
Mike Ruberry
9ed5efda47 Adds TestCase.compare_with_numpy (#39179)
Summary:
Cut from https://github.com/pytorch/pytorch/pull/38994.

This is a helper function for comparing torch and NumPy behavior. It updates the existing and increasingly popular _np_compare function and moves it to be a method on TestCase.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39179

Differential Revision: D21855082

Pulled By: mruberry

fbshipit-source-id: edca3b78ae392d32243b02bf61960898b6ba590f
2020-06-03 15:27:32 -07:00
JackCaoG
46447045ea Replace torch.allClose with self.assertEqual (#39424)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39424

Reviewed By: Krovatkin

Differential Revision: D21854870

Pulled By: ailzhang

fbshipit-source-id: eb68f1775596e4c963169033444d6d6f4f818d4f
2020-06-03 12:40:50 -07:00
kshitij12345
884e16b41a as_strided : add size and stride length check (#39301)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/39281
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39301

Differential Revision: D21849082

Pulled By: gchanan

fbshipit-source-id: 5d30ef10767c4d35c6cb59c5e6a9acbfe0270a40
2020-06-03 09:17:54 -07:00
Peter Bell
7417b4c66f Fix index overflow in ConvTranspose3d [attempt 2] (#39198)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/32866, resubmit of https://github.com/pytorch/pytorch/issues/38970

The memory error in the issue is caused by int overflowing in col2vol. This version, which uses mixed 32-bit and 64-bit indexing calculations, lifts the maximum possible indexing without compromising the performance of ConvTranspose3d, versus a 20-30% regression with pure 64-bit indexing.

This requires that input.numel() <= UINT_MAX, and channels * kernel.numel() <= UINT_MAX otherwise it raises an error. Previously, the code would crash or give incorrect results unless input.numel() * kernel.numel() <= INT_MAX.

Note that the test is a minimised reproducer for the issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39198

Differential Revision: D21817836

Pulled By: ezyang

fbshipit-source-id: b9adfe9f9dd00f04435be132966b33ac6b9efbef
2020-06-03 07:06:54 -07:00
kshitij12345
09bea13981 support flip and rot90 for complex dtype (#37826)
Summary:
Closes https://github.com/pytorch/pytorch/issues/37698
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37826

Differential Revision: D21657697

Pulled By: mruberry

fbshipit-source-id: 16a3899d5de280da692a52bd0ce85d5ebe14cc31
2020-06-02 13:03:14 -07:00
Xiang Gao
48e66859c1 Check illegal output dtype for torch.{min, max} (#38850)
Summary:
The test is currently only enabled for CPU, and it will be enabled for CUDA after the migration of `min` and `max` from THC to ATen is done.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38850

Differential Revision: D21819388

Pulled By: ngimel

fbshipit-source-id: 406343e96bccbf9139eb1f8f2d49ed530dd83d62
2020-06-01 16:09:39 -07:00
guol-fnst
7773a45c0d Division by zero crashes for fmod operator (#32699) (#38919)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38919

Differential Revision: D21791648

Pulled By: anjali411

fbshipit-source-id: 447ded74fa52377b04c1b2271a0b3eb5b8e4eeed
2020-06-01 07:48:52 -07:00
anjali411
a50d781c03 Added real and imag views as tensor attributes (#39033)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39033

Added `real` and `imag` views as tensor attributes. Right now, tensor.imag is disabled for real tensors. This is because if we returned a new tensor of zeros, the user would be able to update the tensor returned by tensor.imag, which should not be allowed: numpy returns a read-only array, and pytorch doesn't support read-only tensors yet.
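
A minimal sketch of the attribute-style views described above (assuming a build where this has landed): `.real` and `.imag` share storage with the complex tensor, and `.imag` on a real tensor raises.

```python
import torch

z = torch.tensor([1 + 2j, 3 - 4j])
z.real += 10                # writes through the view into z
print(z)                    # tensor([11.+2.j, 13.-4.j])

x = torch.ones(2)           # real tensor
try:
    x.imag                  # disabled for real tensors, as described above
except RuntimeError as e:
    print(e)
```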

TODO in follow-up PRs:
1. add a setter for `real` and `imag`
2. add special case in codegen for `real` and `imag` backward functions.
3. remove `copy_real` and `copy_imag` methods.

Test Plan: Imported from OSS

Differential Revision: D21767542

Pulled By: anjali411

fbshipit-source-id: 539febf01f01ff055e3fbc7e9ff01fd3fe729056
2020-05-29 12:31:51 -07:00
kshitij12345
10e2126b10 support complex types for cumsum, cumprod (#39063)
Summary:
Adds complex support to `cumsum`, `cumprod` and relevant test update in `test_torch::tensor_op_tests`
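
A minimal sketch of the new complex support (values chosen for illustration):

```python
import torch

z = torch.tensor([1 + 1j, 2 + 0j, 0 + 3j])
print(torch.cumsum(z, dim=0))    # tensor([1.+1.j, 3.+1.j, 3.+4.j])
print(torch.cumprod(z, dim=0))   # tensor([ 1.+1.j,  2.+2.j, -6.+6.j])
```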
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39063

Differential Revision: D21771186

Pulled By: anjali411

fbshipit-source-id: 632916d4bdbd1c0941001898ab8146be2b7884fc
2020-05-29 09:36:26 -07:00
Natalia Gimelshein
4b5e87f94a Revert D21751663: [pytorch][PR] Fix argmin/max bug
Test Plan: revert-hammer

Differential Revision:
D21751663

Original commit changeset: 6d55e4bb7834

fbshipit-source-id: 5473af5650b8a14f1da32d660be43ccf027513e1
2020-05-29 09:08:46 -07:00
ShawnZhong
f7a8851e9e Fix argmin/max bug (#38946)
Summary:
Fix https://github.com/pytorch/pytorch/issues/38922

# Reproduction

-  This is correct
```py
>>> torch.zeros(1, 32767).argmax(dim=0)
tensor([0, 0, 0,  ..., 0, 0, 0])
```

- But this is not
```py
>>> torch.zeros(1, 32768).argmax(dim=0)
tensor([    0,     0,     0,  ..., 31141, 31141, 31141])
```

- Only occurs when the size of the reduced dimension is 1

```py
>>> torch.zeros(2, 327680).argmax(dim=0)
tensor([1, 1, 1,  ..., 1, 1, 1])
>>> torch.zeros(3, 327680).argmax(dim=0)
tensor([2, 2, 2,  ..., 2, 2, 2])
```

- Has something to do with the rest of the dims
```py
>>> torch.zeros(1, 327680).argmax(dim=0)
tensor([     0,      0,      0,  ..., 311296, 311296, 311296])
```
```py
>>> torch.zeros(1, 32768, 10).argmax(dim=0)
tensor([[     0,      0,      0,  ...,      0,      0,      0],
        [     0,      0,      0,  ...,      0,      0,      0],
        [     0,      0,      0,  ...,      0,      0,      0],
        ...,
        [311296, 311296, 311296,  ..., 311296, 311296, 311296],
        [311296, 311296, 311296,  ..., 311296, 311296, 311296],
        [311296, 311296, 311296,  ..., 311296, 311296, 311296]])
```

# Reason

- `resize_outputs_` is set to `false` in `reduce_op`, but the dimension is still coalesced during `TensorIterator::build()`

899a075b25/aten/src/ATen/native/TensorIterator.cpp (L703-L715)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38946

Differential Revision: D21751663

Pulled By: ngimel

fbshipit-source-id: 6d55e4bb783423b4c2df09cd3e8b87147efcbfdb
2020-05-28 19:42:07 -07:00
Mike Ruberry
ee3bd10445 Moves angle/abs test to test_torch (#39154)
Summary:
Moves test (per request).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39154

Differential Revision: D21769706

Pulled By: mruberry

fbshipit-source-id: a09d0d0a47fbcf8f0e798d57230f2fe6a9ebf6b9
2020-05-28 14:55:40 -07:00
Mike Ruberry
5e975cf8d6 Stops cross-device data movement in tensor iterator (#38998)
Summary:
**BC-breaking note:**

In previous versions of PyTorch zero dimensional CUDA tensors could be moved across devices implicitly. For example,

```
torch.tensor(5, device='cuda:0') + torch.tensor((1, 1), device='cuda:1')
```

would work, even though the tensors are on different CUDA devices. This is a frequent source of user confusion, however, and PyTorch generally does not move data across devices without it being explicit. This functionality is removed in PyTorch 1.6.

**PR Summary:**

Today in PyTorch we allow implicit data movement of zero dimensional CUDA tensors. For example, we allow:

```
torch.tensor(5, device='cuda:0') + torch.tensor((1, 1), device='cuda:1')
```

and

```
torch.tensor(2, device='cuda') + torch.tensor((3, 5))
```

In both of these cases TensorIterator would move the zero dim CUDA tensor to the device of the non-scalar tensor (cuda:1 in the first snippet, the CPU in the second snippet).

One of PyTorch's fundamental rules, however, is that it does not perform implicit data movement like this, and this change causes these cases to throw an error. New tests for this behavior are added to test_torch.py, and tests of the old behavior are removed in test_torch.py and test_autograd.py. A cpp test in tensor_iterator_test.cpp is modified to account for the new behavior.

This addresses https://github.com/pytorch/pytorch/issues/36722.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38998

Differential Revision: D21757617

Pulled By: mruberry

fbshipit-source-id: 2498f07f4938d6de691fdbd5155ad2e881ff7fdb
2020-05-28 13:53:57 -07:00
Rohan Varma
5267b17a96 Revert D21748644: [pytorch][PR] Fix index overflow in ConvTranspose3d
Test Plan: revert-hammer

Differential Revision:
D21748644

Original commit changeset: 95060423219d

fbshipit-source-id: 73c53c8a27a29bc8edd5b9b8c80f0f938b04a845
2020-05-28 13:08:35 -07:00
Peter Bell
5702a28b26 Fix index overflow in ConvTranspose3d (#38970)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/32866

The memory error in the issue is caused by `int` overflowing in `col2vol`. This version, which uses mixed 32-bit and 64-bit indexing calculations, lifts the maximum possible indexing without compromising the performance of `ConvTranspose3d`, versus a 20-30% regression with pure 64-bit indexing.

This requires that `input.numel() <= UINT_MAX`, and `channels * kernel.numel() <= UINT_MAX` otherwise it raises an error. Previously, the code would crash or give incorrect results unless `input.numel() * kernel.numel() <= INT_MAX`.

Note that the test is a minimised reproducer for the issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38970

Differential Revision: D21748644

Pulled By: ezyang

fbshipit-source-id: 95060423219dc647595e1a24b3dcac520d3aecba
2020-05-28 07:28:15 -07:00
Nikita Shulga
f5bc91f851 Get rid of multiple inheritence in test_torch (#39110)
Summary:
`_TestTorchMixin` is a base class that is instantiated across multiple types.
It was inherited from `object` in order to hide it from the unittest test discovery mechanism.
But this approach makes it almost impossible to use a static code analyzer on the class.
This PR implements alternative approach by hiding base class into inner class, per https://stackoverflow.com/a/25695512

Change imported class access path in `test_cuda.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39110

Test Plan:
run `test_torch.py --discover-tests` and `test_cuda.py --discover-tests` before and after change:
```
$ python test_torch.py --discover-tests|md5sum
2ca437bb5d65700763ce04cdacf6de3e  -
$ python test_cuda.py --discover-tests|md5sum
b17df916fb0eeb6f0dd7222d7dae392c  -
```

Differential Revision: D21759265

Pulled By: malfet

fbshipit-source-id: b01b06111469e551f7b78387449975e5248f6b9e
2020-05-27 22:45:06 -07:00
Cloud Han
05f097b5bb Implement logaddexp (#38384)
Summary:
Resolve https://github.com/pytorch/pytorch/issues/38377
Related https://github.com/pytorch/pytorch/issues/38349

This op should be disambiguated from `logsumexp`, which performs a reduction on a tensor over a specific axis.
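
A minimal sketch of that distinction (assuming a recent PyTorch build): logaddexp is elementwise over two tensors, while logsumexp reduces a single tensor over an axis.

```python
import torch

a = torch.tensor([-1.0, 2.0, 3.0])
b = torch.tensor([0.5, 0.5, 0.5])

print(torch.logaddexp(a, b))       # elementwise log(exp(a) + exp(b)); same shape as the inputs
print(torch.logsumexp(a, dim=0))   # reduction over dim 0; returns a scalar
```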
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38384

Differential Revision: D21737336

Pulled By: mruberry

fbshipit-source-id: 7864d04ca304c0fb2937bb083583e3e3d6ef205d
2020-05-27 20:27:31 -07:00
Natalia Gimelshein
d92ef9268d Revert D21728402: Simplify precision-specification in tests.
Test Plan: revert-hammer

Differential Revision:
D21728402

Original commit changeset: 85f3daf63f1b

fbshipit-source-id: 4e2a36aca15cd8d842985173395b4e1cac7135d8
2020-05-27 17:34:28 -07:00
Ailing
20397285c6 Replace use of np.allclose in tests. (#34287)
Summary:
fixes https://github.com/pytorch/pytorch/issues/34096
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34287

Differential Revision: D21735525

Pulled By: ailzhang

fbshipit-source-id: 611da17cfc5a3fee77d482abccf8f9854f504263
2020-05-27 15:29:35 -07:00
Mike Ruberry
4239416c72 Throws runtime error on attempted addcdiv integer division (#38762)
Summary:
1.6 Deprecation Note:

In 1.6 attempting to perform integer division using addcdiv will throw a RuntimeError, and in 1.7 the behavior will change so that addcdiv always performs a true division of its tensor1 and tensor2 inputs. See the warning in torch.addcdiv's documentation for more information.

PR Summary:

This PR updates the warning that appears when addcdiv performs integer division to throw a RuntimeError. This is intended to prevent silent errors when torch.addcdiv's behavior is changed to always perform true division in 1.7. The documentation is updated (slightly) to reflect this, as are the addcdiv tests in test_torch and test_type_promotion.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38762

Differential Revision: D21657585

Pulled By: mruberry

fbshipit-source-id: c514b44409706f2bcfeca4473424b30cc48aafbc
2020-05-27 14:40:07 -07:00
chengjinfang
c835dedce9 Fix the issue that PyTorch doesn't construct bool tensors from non-bo… (#38392)
Summary:
Fixes the issue that PyTorch doesn't construct bool tensors from non-bool values correctly (https://github.com/pytorch/pytorch/issues/37398)

Signed-off-by: chengjinfang <chengjf@cn.fujitsu.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38392

Differential Revision: D21737009

Pulled By: mruberry

fbshipit-source-id: c77d8c940af95f5011fe008b48ea0d16c3f501d1
2020-05-27 13:59:28 -07:00
Brian
df4066bbb6 Simplify precision-specification in tests. (#37181)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37181

Now that assertEquals considers dtypes in determining tolerance, most
tests don't need explicitly set precision.

Those that do are a few half precision tests on cuda. In this PR, those
are broken out to be handled explicitly, though we may also want to
consider further loosening the tolerance on half-precision.

Test Plan: Imported from OSS

Differential Revision: D21728402

Pulled By: nairbv

fbshipit-source-id: 85f3daf63f1bdbb5101e8dea8c125f13448ca228
2020-05-27 12:05:33 -07:00
Mike Ruberry
13120bf677 Updates assertEqual to require atol and rtol, removes positional atol (#38872)
Summary:
This updates assertEqual and assertEqual-like functions to require that either both or neither of atol and rtol be specified. This should improve clarity around handling precision in the test suite, and it allows us to remove the legacy positional atol argument from assertEqual. In addition, the "message" kwarg is replaced with a kwarg-only "msg" argument whose name is consistent with unittest's assertEqual argument.

In the future we could make "msg" an optional third positional argument to be more consistent with unittest's assertEqual, but requiring it be specified should be clear, and we can easily update the signature to make "msg" an optional positional argument in the future, too.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38872

Differential Revision: D21740237

Pulled By: mruberry

fbshipit-source-id: acbc027aa1d7877a49664d94db9a5fff91a07042
2020-05-27 06:31:07 -07:00
Rohan Varma
63e545e0fe Revert D21717199: [pytorch][PR] Updates assertEqual to require atol and rtol, removes positional atol
Test Plan: revert-hammer

Differential Revision:
D21717199

Original commit changeset: 9feb856f94ee

fbshipit-source-id: bfde9c39a5ce99f0ca6183a7dde703c65b7c8259
2020-05-26 18:23:59 -07:00
ShawnZhong
12c219de54 Fix histc with empty tensor error (#38987)
Summary:
Fix https://github.com/pytorch/pytorch/issues/38979

The error in mentioned https://github.com/pytorch/pytorch/issues/38979 is a [`cudaErrorInvalidConfiguration` error](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html#group__CUDART__TYPES_1g3f51e3575c2178246db0a94a430e0038):
> This indicates that a kernel launch is requesting resources that can never be satisfied by the current device. Requesting more shared memory per block than the device supports will trigger this error, as will requesting too many threads or blocks. See cudaDeviceProp for more device limitations.

This is because we are trying to launch a kernel with block size 0.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38987

Differential Revision: D21722993

Pulled By: ezyang

fbshipit-source-id: 2c283e0a9f542b4acb96e895a43b991ccac808fe
2020-05-26 13:19:13 -07:00
Mike Ruberry
6ddca30b2d Updates assertEqual to require atol and rtol, removes positional atol (#38872)
Summary:
This updates assertEqual and assertEqual-like functions to require that either both or neither of atol and rtol be specified. This should improve clarity around handling precision in the test suite, and it allows us to remove the legacy positional atol argument from assertEqual. In addition, the "message" kwarg is replaced with a kwarg-only "msg" argument whose name is consistent with unittest's assertEqual argument.

In the future we could make "msg" an optional third positional argument to be more consistent with unittest's assertEqual, but requiring it be specified should be clear, and we can easily update the signature to make "msg" an optional positional argument in the future, too.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38872

Differential Revision: D21717199

Pulled By: mruberry

fbshipit-source-id: 9feb856f94eee911b44f6c7140a1d07c1b026d3a
2020-05-26 08:30:23 -07:00
Brian
389e16c33b torch.pow Add type promotion support and fix issue with __rpow__ (#37098)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37098

### **Cherry-picked from another stack:**
Some code review already occurred here: https://github.com/pytorch/pytorch/pull/32582

### Summary:

Fixes: https://github.com/pytorch/pytorch/issues/32436

The issue caused incorrect handling of dtypes for scalar ** tensor.
e.g. before this change:
```
>>> 5.5 ** torch.ones(5, dtype=torch.int32)
tensor([5, 5, 5, 5, 5], dtype=torch.int32)
```
should return a float tensor.

Also fixes a number of incorrect cases:
 * tensors to negative powers were giving incorrect results (1 instead
    of 0 or error)
 * Behavior wasn't consistent between cuda/cpu
 * large_value ** 1 in some cases gave a result not equal
    to large_value because of truncation in conversion to double and back.

BC-breaking:

Previously incorrect behavior (in 1.4):
```
>>> a
tensor([1, 1, 1, 1, 1], dtype=torch.int32)
>>> a.pow_(.5)
tensor([1, 1, 1, 1, 1], dtype=torch.int32)
```

After this change:
`RuntimeError: result type Float can't be cast to the desired output type Int`

Test Plan: Imported from OSS

Differential Revision: D21686207

Pulled By: nairbv

fbshipit-source-id: e797e7b195d224fa46404f668bb714e312ea78ac
2020-05-26 08:29:51 -07:00
Xiang Gao
7e6f6f522f [PATCH] Migrate min from THC to ATen and remove _min (#38440)
Summary:
Related issue: https://github.com/pytorch/pytorch/issues/36900

Since I feel this PR is already large enough, I didn't migrate max in this PR. Legacy code is not cleaned up either. All this remaining work will be done in later PRs after this is merged.

Benchmark on an extreme case
```python
import torch
print(torch.__version__)

t = torch.randn(100000, 2, device='cuda')

warmup = torch.arange(100000000)
torch.cuda.synchronize()

%timeit t.min(dim=0); torch.cuda.synchronize()
```
Before: 4ms; After: 24.5us.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38440

Differential Revision: D21560691

Pulled By: ngimel
2020-05-26 08:10:38 -07:00
kshitij12345
3487744821 Add torch.logcumsumexp (#36308)
Summary:
Creating a new PR as I am unable to push to pandeykartikey's branch, since I don't have the permissions.

Closes https://github.com/pytorch/pytorch/issues/26411

Based on https://github.com/pytorch/pytorch/issues/32876 Thanks pandeykartikey for starting this out.

Have addressed the comments.

anjali411 agadetsky albanD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36308

Differential Revision: D21648573

Pulled By: albanD

fbshipit-source-id: bc1a8fc4ab474a1148298117a1549b0e46f7c3ff
2020-05-21 09:12:31 -07:00
rohithkrn
1ea80b4234 [ROCm] Set correct tolerance values for bfloat16 div tests (#38823)
Summary:
This PR fixes the tolerance values for some of the bfloat16 div tests that were enabled on ROCm with incorrect tolerance values in the PR https://github.com/pytorch/pytorch/pull/38621

Also disabled (to unblock CI) `test_addcdiv*`, for which the error is large when the absolute values in the tensor are higher. This will have to be investigated further.

ezyang jeffdaily sunway513
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38823

Differential Revision: D21686290

Pulled By: ezyang

fbshipit-source-id: 85472680e1886bdc7c227ed2656e0b4fd5328e46
2020-05-21 07:29:49 -07:00
Nik Ved
f80df4ca79 port scatter_add to ATen (CUDA) (#38262)
Summary:
Fixes [https://github.com/pytorch/pytorch/issues/24622 ](https://github.com/pytorch/pytorch/issues/24622).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38262

Differential Revision: D21656729

Pulled By: ngimel

fbshipit-source-id: 63dcbf8eeaf59d8295bf4e5c8bb9d28ad165d4eb
2020-05-20 19:03:41 -07:00
kshitij12345
3b254acd99 support complex types for tanh_cuda and tanh_backward_cuda (#38786)
Summary:
Builds on https://github.com/pytorch/pytorch/issues/37791
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38786

Differential Revision: D21666138

Pulled By: anjali411

fbshipit-source-id: cbd313b8fd21109aadd614c60259b9dc505771a5
2020-05-20 12:57:40 -07:00
Mingfei Ma
fe66bdb498 port masked_select from TH to ATen and optimize perf on CPU (#33269)
Summary:
This PR ports `masked_select` from TH to ATen and optimize the performance on CPU with TensorIterator.

https://github.com/pytorch/pytorch/issues/33053

1. single socket run: up to **5.4x** speedup;
2. single core run: up to **1.16x** speedup.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33269

Differential Revision: D20922288

Pulled By: ngimel

fbshipit-source-id: 38e183a4e3599bba29bbbebe36264026abe1c50e
2020-05-20 11:36:29 -07:00
nuka137
c78691b4a6 [CPU] torch.gather for complex dtypes (#36430)
Summary:
This PR resolves https://github.com/pytorch/pytorch/issues/36340 .
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36430

Differential Revision: D21662139

Pulled By: anjali411

fbshipit-source-id: 361d064c1144b368afae3059c19f77abe26080a3
2020-05-20 09:15:14 -07:00
Mike Ruberry
7587188037 Skips test_float_to_int_conversion_finite on MacOS (#38753)
Summary:
See https://github.com/pytorch/pytorch/issues/38752.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38753

Differential Revision: D21656330

Pulled By: mruberry

fbshipit-source-id: f1f97228f31b8a0b0535b3168a7d209fefff2769
2020-05-19 21:56:48 -07:00
Mike Ruberry
64584573f9 Updates tests for integer division deprecation (#38621)
Summary:
Updates our tests in preparation for integer division using torch.div and torch.addcdiv throwing a runtime error, by avoiding integer division using torch.div. This creates a brief period where integer division using torch.div is untested, but that should be OK (since it will soon throw a runtime error).

These callsites were identified using https://github.com/pytorch/pytorch/issues/36897.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38621

Differential Revision: D21612823

Pulled By: mruberry

fbshipit-source-id: 749c03a69feae02590b4395335163d9bf047e162
2020-05-19 19:28:00 -07:00
Mike Ruberry
819da00b3d Fixes floordiv dunder registrations (#38695)
Summary:
floordiv was missing a couple of dunder registrations, which was causing __ifloordiv__ to not be called when it should. This adds the appropriate registrations and adds a test verifying that the in-place dunders are actually occurring in place.
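
A minimal sketch of the in-place behavior the new test is checking (assuming a recent PyTorch build):

```python
import torch

t = torch.tensor([7, 8, 9])
storage_before = t.data_ptr()
t //= 2                                  # __ifloordiv__ should modify t in place
print(t)                                 # tensor([3, 4, 4])
print(t.data_ptr() == storage_before)    # True -> the operation really happened in place
```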
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38695

Differential Revision: D21633980

Pulled By: mruberry

fbshipit-source-id: a423f5ec327cdc062fd6d9d56abd36fe44ac8198
2020-05-19 12:11:38 -07:00
Pavel Belevich
b14734d92e Add bfloat16 to CPU cauchy_kernel, log_normal_kernel, exponential_kernel (#38427)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38427

Test Plan: Imported from OSS

Differential Revision: D21640640

Pulled By: pbelevich

fbshipit-source-id: 9cff8f6b5c33b3b31753c76fc8033d329b218019
2020-05-19 10:21:36 -07:00
Pavel Belevich
35beff0b9f RNG infrastructure improvements (#37984)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37984

- `NumericUtils.h`
CUDA distribution kernels had two variants of transformation lambdas (`uniform`/`normal` -> `lognormal`/`exponential`/`cauchy`/`geometric`...): one for double precision and one optimized for CUDA single precision. This was done by using `::log`/`__logf`, `::exp`/`__expf` and `::tan`/`__tanf`. I moved them to `NumericUtils.h` and called them `at::exp`, `at::log` and `at::tan`. This allowed unifying the CPU/CUDA transformation templates in `TransformationHelper.h`.

- `DistributionsHelper.h`
Made `normal_distribution`, `geometric_distribution`, `exponential_distribution`, `cauchy_distribution`, `lognormal_distribution` C10_HOST_DEVICE compatible to reuse them in CPU/CUDA distribution kernels.
Replaced explicit math with transformations from `TransformationHelper.h`

- `TransformationHelper.h`
Renamed `*_transformation` to `transformation::*`
Added clear unified host/device transformations templates `normal`, `cauchy`, `exponential`, `geometric`, `log_normal` which are used by both CPU and CUDA distribution kernels and custom PRNG distribution kernels.

- `cpu/DistributionTemplates.h`
Unified `normal_kernel`, `cauchy_kernel`, `log_normal_kernel`, `geometric_kernel`, `exponential_kernel`.

- `cuda/DistributionTemplates.h`
Extracted `UNIFORM_AND_TRANSFORM` and `NORMAL_AND_TRANSFORM` macros to reuse code between distribution kernel templates.
Unified the transformation lambdas (`uniform`/`normal` -> `lognormal`/`exponential`/`cauchy`/`geometric`...)

- `test_torch.py`
Added `scipy.stats.kstest` [Kolmogorov–Smirnov](https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test) tests for the `uniform`/`normal`/`lognormal`/`exponential`/`cauchy` distributions and a [Chi-squared](https://en.wikipedia.org/wiki/Chi-squared_test) test for the `geometric` one, to make sure that our distributions are correct (see the sketch after this list).

- `cpu_rng_test.cpp`, `rng_test.h`
Fixed random_()'s from and to bounds issue for floating-point types, fixed cast/overflow warnings

- `THTensorRandom.h`, `THVector.h`
Moved unnecessary includes to `THTensorRandom.cpp`
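
A hedged sketch of the kind of statistical check mentioned in the `test_torch.py` item above (assumes scipy is installed); it is not the exact test added by this PR.

```python
import torch
from scipy import stats

torch.manual_seed(0)
samples = torch.empty(10000).normal_(mean=0.0, std=1.0).numpy()
statistic, pvalue = stats.kstest(samples, "norm")
print(pvalue > 1e-3)   # the standard-normal hypothesis should not be rejected
```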

Test Plan: Imported from OSS

Differential Revision: D21477955

Pulled By: pbelevich

fbshipit-source-id: 7b793d1761a7a921c4b4a4a7d21d5d6c48f03e72
2020-05-19 10:20:39 -07:00
kshitij12345
fc19747d64 handle grad with stride=0 on GPU MvBackward (#38321)
Summary:
References : https://github.com/pytorch/pytorch/issues/38315 ,  https://github.com/pytorch/pytorch/issues/29984

cuBlas expects strides to be greater than 0.
Cloning the `grad` allocates a new vector with
non-zero strides.

For CPU, we don't clone and allocate a new vector,
as the CPU implementation works with stride=0.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38321

Differential Revision: D21628966

Pulled By: ngimel

fbshipit-source-id: 390caf835af6d1d77ed537b7fcc113a22c3ec301
2020-05-18 20:53:36 -07:00
anjali411
f3048609d3 [CUDA] torch.roll for complex dtypes (#38664)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38664

Test Plan: Imported from OSS

Differential Revision: D21630498

Pulled By: anjali411

fbshipit-source-id: bf43a812f3d8dd984785256bad41131410435965
2020-05-18 18:19:22 -07:00
Xiang Gao
83df3beaca Add complex support for torch.sum (#38382)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38382

Test Plan: Imported from OSS

Differential Revision: D21600127

Pulled By: anjali411

fbshipit-source-id: c5338ab10bdcebe4a281b03f78e6f2063186bc32
2020-05-15 19:49:38 -07:00
Mike Ruberry
9cfc10d52e Updates assertEqual to use torch.isclose-like logic (#37294)
Summary:
Edit: this has been updated to reflect the PR's current status, which has changed after review.

This PR updates the behavior of the assertEqual, assertNotEqual, and assert_allclose to be consistent with each other and torch.isclose. It corrects several additional bugs in the current implementations and adds extensive testing and comments, too.

These updates follow from changes to assertEqual like https://github.com/pytorch/pytorch/pull/34258 and https://github.com/pytorch/pytorch/pull/37069, and from our discussion of torch.isclose for complex tensors (see https://github.com/pytorch/pytorch/issues/36462), where we decided to implement a NumPy-compatible mathematical notion of "closeness" for complex tensors that is not a great fit for our testing framework.

The detailed changelist is:

- New test framework functions for comparing tensors and scalars
  - Tensors are compared using isclose; the real and imaginary parts of complex tensors are compared independently
  - Scalars are compared using the same algorithm
  - assertEqual and assert_allclose now use this common comparison function, instead of each implementing their own with divergent behavior
  - assertEqual-like debug messages are now available for all tensor and scalar comparisons, with additional context when comparing the components of sparse, quantized, and complex tensors
- Extensive testing of the comparison behavior and debug messages
- Small Updates
  - assertEqual now takes an "exact_device" argument, analogous to "exact_dtype", which should be useful in multidevice tests
  - assertEqual now takes an "equal_nan" argument for argument consistency with torch.isclose
  - assertEqual no longer takes the "allow_inf" keyword, which misleadingly only applied to scalar comparisons, was only ever set (rarely) to true, and is not supported by torch.isclose
- Bug fixes:
  - the exact_dtype attribute has been removed (no longer needed after https://github.com/pytorch/pytorch/pull/38103)
  - message arguments passed to assertEqual are now handled correctly
  - bool x other dtype comparisons are now supported
  - uint8 and int8 tensor comparisons now function properly
  - rtol for integer comparisons is now supported (default is zero)
  - rtol and atol for scalar comparisons are now supported
  - complex scalar comparisons are now supported, analogous to complex tensor comparisons
  - assertNotEqual is now equivalent to the logical negation of assertEqual
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37294

Differential Revision: D21596830

Pulled By: mruberry

fbshipit-source-id: f2576669f7113a06f82581fc71883e6b772de19b
2020-05-15 16:24:03 -07:00
Gregory Chanan
70ef9f5124 Improve testing of logical_not. (#38505)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38505

This takes the testing of https://github.com/pytorch/pytorch/pull/38275, but doesn't include the kernel changes which are still being worked out.

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D21580574

Pulled By: gchanan

fbshipit-source-id: f12317259cb7373989f6c9ad345b19aaac524851
2020-05-15 10:51:35 -07:00
anjali411
242af6c078 Add tan_cuda for complex dtypes (#38400)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38400

* #38399 Added autograd tests, disabled jit autograd tests for complex and added a separate list for tests for complex dtype only

Test Plan: Imported from OSS

Differential Revision: D21572209

Pulled By: anjali411

fbshipit-source-id: 7036029e9f8336139f5d54e0dfff9759f3bf8376
2020-05-15 08:16:59 -07:00
Michael Carilli
25f918548d Allow GradScaler to be pickled (#38296)
Summary:
Should unblock https://github.com/PyTorchLightning/pytorch-lightning/issues/1782.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38296

Differential Revision: D21553296

Pulled By: albanD

fbshipit-source-id: 9041a72d7cf8833e4b01bc767fd2321f17c7c5f2
2020-05-14 09:14:28 -07:00
SsnL
ae392a77a6 Add better device idx parse checks (#37376)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/32079
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37376

Differential Revision: D21476036

Pulled By: zou3519

fbshipit-source-id: 86907083c23cbaf165b645307fb340f2656b814e
2020-05-14 09:07:12 -07:00
Peter Bell
0a159b0a3a Fix precision issues in CPU remainder (#38293)
Summary:
Together with https://github.com/pytorch/pytorch/issues/37758, this fixes https://github.com/pytorch/pytorch/issues/37743 and fixes https://github.com/pytorch/pytorch/issues/24861.

This follows the CUDA fix in https://github.com/pytorch/pytorch/issues/37758, vectorised using a `blendv` to replace the if conditionals.

Most of the complication is from `remainder` supporting `at::Half` where `fmod` doesn't. I've now got `fmod` working on `Vec256<at::Half>` as well as enabling half dispatch for `fmod` so it matches `remainder`.

I also added `fmod` support to `Vec256<at::BFloat16>` before realising that `remainder` doesn't support `BFloat16` anyway. I could also enable `BFloat16` if that's desirable. If not, I don't think `Vec256<BFloat16>` should be missing `fmod` anyway.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38293

Differential Revision: D21539801

Pulled By: ezyang

fbshipit-source-id: abac6a3ed2076932adc459174cd3d8d510f3e1d5
2020-05-14 08:54:32 -07:00
Cloud Han
8d94615c2b Migrate erfc from TH to ATen (CUDA) (#38373)
Summary:
Fixed https://github.com/pytorch/pytorch/issues/24559
Reference https://github.com/pytorch/pytorch/issues/24507
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38373

Differential Revision: D21549626

Pulled By: ezyang

fbshipit-source-id: 84c2cf58b071df3afc312ae0aef3b5ed6c014cc7
2020-05-13 21:19:03 -07:00
Hong Xu
336e1ec592 Clean up error handling in is_nonzero and where in TensorCompare.cpp (#38150)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38150

Differential Revision: D21539736

Pulled By: ezyang

fbshipit-source-id: e390c12f5948192a552d66dcd1bb89b2cb45f170
2020-05-13 20:19:40 -07:00
kshitij12345
d86de916a9 Migrate exp and exp_ from the TH to Aten (CUDA) (#36652)
Summary:
Closes https://github.com/pytorch/pytorch/issues/24561

Benchmark with same build settings on same system.
gcc : version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
CUDA : 10.1
GPU : 1050ti

```python
import timeit

for n, t in [(10_000, 20000),
             (100_000, 20000)]:
    for dtype in ('torch.half', 'torch.float', 'torch.double'):
        print(f'torch.exp(a) a.numel() == {n} for {t} times {dtype}')
        print(timeit.timeit(f'torch.exp(a); torch.cuda.synchronize()',
                              setup=f'import torch; a=torch.arange({n}, dtype={dtype}, device="cuda")',
                              number=t))
```

Before:

```
torch.exp(a) a.numel() == 10000 for 20000 times torch.half
0.3001665159999902
torch.exp(a) a.numel() == 10000 for 20000 times torch.float
0.28265794499998265
torch.exp(a) a.numel() == 10000 for 20000 times torch.double
0.3432170909998149
torch.exp(a) a.numel() == 100000 for 20000 times torch.half
0.32273333800003456
torch.exp(a) a.numel() == 100000 for 20000 times torch.float
0.31498759600003723
torch.exp(a) a.numel() == 100000 for 20000 times torch.double
1.079708754999956
```

After:

```
torch.exp(a) a.numel() == 10000 for 20000 times torch.half
0.27996097300092515
torch.exp(a) a.numel() == 10000 for 20000 times torch.float
0.2774473429999489
torch.exp(a) a.numel() == 10000 for 20000 times torch.double
0.33066844799941464
torch.exp(a) a.numel() == 100000 for 20000 times torch.half
0.27641824200145493
torch.exp(a) a.numel() == 100000 for 20000 times torch.float
0.27805968599932385
torch.exp(a) a.numel() == 100000 for 20000 times torch.double
1.0644143180015817
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36652

Differential Revision: D21164653

Pulled By: VitalyFedyunin

fbshipit-source-id: 42c7b24b0d85ff1d390231f1457968a8869b8db3
2020-05-13 10:06:51 -07:00
Natalia Gimelshein
3d968088e0 fix multinomial kernels to properly advance random states (#38046)
Summary:
Before, multinomial kernels did not advance random states enough, which led to the same sequence being generated over and over with a shift of 4. This PR fixes that.
Fixes https://github.com/pytorch/pytorch/issues/37403
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38046

Differential Revision: D21516542

Pulled By: ngimel

fbshipit-source-id: 23248a8c3a5c44316c4c35cd71a8c3b5f76c90f2
2020-05-12 22:33:11 -07:00
Pavel Belevich
70c6550cc9 Forgotten changes for Tensor.random_()'s from and to bounds for floating-point types (#38287)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38287

Test Plan: Imported from OSS

Differential Revision: D21534847

Pulled By: pbelevich

fbshipit-source-id: 6ea972186789347555efbbf68407b5f12960dae6
2020-05-12 19:09:37 -07:00
Emilio Castillo
f7e7a15a5d Fix NaN comparison in torch.median (#38216)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/38018

When `eq_with_nan(v, kValue)` is called with `v` and `kValue` both `nan`, it returns `false` when it should return `true`.
https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cuda/SortingKthValue.cu#L76

The implementation is using intrinsics such as `__double_as_longlong` and comparing their bit representations. But the values of the bits obtained for both nans are different.
`9221120237041090560` for `v`
`9223372036854775807` for `kValue`

Two different NaNs can have different bit representations, so we have to do additional comparisons to fix this.
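
A small sketch confirming the two bit patterns quoted above: both decode to NaN even though their 64-bit representations differ, so comparing raw bits is not a valid equality test for NaNs.

```python
import struct

for bits in (9221120237041090560, 9223372036854775807):
    (value,) = struct.unpack("<d", struct.pack("<Q", bits))
    print(hex(bits), value, value != value)   # both print nan, and NaN != NaN
```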

I changed this comparison and it seems to be working now.
However, when compared to the CPU implementation, the returned indices for the values seem to be random but valid.
This is probably an effect of the comparison order in the CUDA version.
I am not sure if this is OK, since all the indices point to valid elements.

For the snippet in the issue I get the following:

```
# CUDA Values
tensor([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
       device='cuda:0', dtype=torch.float64)
# CUDA indices
tensor([304, 400, 400, 528, 304, 304, 528, 336, 304, 432, 400, 280, 280, 336,
        304, 336, 400, 304, 336, 560], device='cuda:0')
```
```
# CPU values
tensor([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
       dtype=torch.float64)
# CPU indices
tensor([515, 515, 515, 515, 515, 515, 515, 515, 515, 515, 515, 515, 515, 515,
        515, 515, 515, 515, 515, 515])
```

Also, maybe it's better to change the `eq_with_nan` implementations to address this instead?
I am not sure if this will cause code to break in other places, though...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38216

Differential Revision: D21517617

Pulled By: ngimel

fbshipit-source-id: deeb7bb0ac519a03aa0c5f365005a9150e6404e6
2020-05-12 18:27:14 -07:00
Cloud Han
8ab6377273 Port atan from TH to ATen (#37991)
Summary:
Fixed https://github.com/pytorch/pytorch/issues/24538
Related https://github.com/pytorch/pytorch/issues/24507
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37991

Differential Revision: D21531741

Pulled By: VitalyFedyunin

fbshipit-source-id: c762cc80416d7fffbb1769c6cc5e0914ceaa8e2d
2020-05-12 14:22:26 -07:00
Ailing Zhang
7c13a07286 [Reland] Remove uses of type() part 2 (#38288)
Summary:
Reland of https://github.com/pytorch/pytorch/issues/38140. It got reverted since it broke slow tests which were only run on the master branch (thanks mruberry!). Enabling all CI tests in this PR to make sure they pass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38288

Reviewed By: mruberry

Differential Revision: D21524923

Pulled By: ailzhang

fbshipit-source-id: 3a9ecc7461781066499c677249112434b08d2783
2020-05-12 13:37:14 -07:00
Emilio Castillo
779abf7538 Implements torch.pow for complex on cuda and enables complex values as exponents for pow (#36793)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/36744

It also allows calling pow on the CPU with complex values as exponents, which was not possible before.

TODO: Add tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36793

Differential Revision: D21525514

Pulled By: anjali411

fbshipit-source-id: c4624c97b194cb1d942e5dd0ee9042adf7586ed3
2020-05-12 11:28:44 -07:00
Anjali Chourdia
ba0851326c Revert D21449462: [CUDA] addmv for complex tensors
Test Plan: revert-hammer

Differential Revision:
D21449462

Original commit changeset: 1f2dd5a7f8a4

fbshipit-source-id: 4f5f035668d1de4469d11ddeb08a77340eb52f98
2020-05-12 05:21:11 -07:00
anjali411
0d977e9223 [CUDA] addmv for complex tensors (#37940)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37940

Test Plan: Imported from OSS

Differential Revision: D21449462

Pulled By: anjali411

fbshipit-source-id: 1f2dd5a7f8a42d3ba92a1b1a286f35454392a06d
2020-05-11 21:46:52 -07:00
anjali411
375ddb01b5 Fix tensor printing (#38031)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38031

Test Plan: Imported from OSS

Differential Revision: D21502915

Pulled By: anjali411

fbshipit-source-id: 0cc3017a390da55af47ba81f651a883cd52b10da
2020-05-11 19:59:19 -07:00
kshitij12345
a37b865107 test_linspace : remove explicit for-loop (#38191)
Summary:
Reference : https://github.com/pytorch/pytorch/issues/38187

Benchmark with same build settings on same system.
gcc : version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
CPU : Intel® Core i5-8300H CPU @ 2.30GHz × 8
GPU : GTX 1050ti

Test Cmd : `pytest test/test_torch.py -k linspace_cpu_float`

Before :
```
test/test_torch.py ..                                                                                                                                                         [100%]

======================================================================== 2 passed, 5170 deselected in 24.43s ========================================================================
```

After :
```
test/test_torch.py ..                                                                                                                                                         [100%]

======================================================================== 2 passed, 5170 deselected in 9.20s =========================================================================
```

Test Cmd : `pytest test/test_torch.py -k linspace_cuda_float`

Before :
```
test/test_torch.py ......                                                                                                                                                     [100%]

=================================================================== 6 passed, 5166 deselected in 83.84s (0:01:23) ===================================================================
```

After :
```
test/test_torch.py ......                                                                                                                                                     [100%]

======================================================================== 6 passed, 5166 deselected in 40.18s ========================================================================
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38191

Differential Revision: D21494478

Pulled By: mruberry

fbshipit-source-id: fa58f727781425937a7b8212f9b63a739935eb86
2020-05-11 15:17:47 -07:00
Mike Ruberry
f6b1c046b6 Revert D21483808: [pytorch][PR] Remove uses of type() part 2
Test Plan: revert-hammer

Differential Revision:
D21483808

Original commit changeset: 12f5de6151ba

fbshipit-source-id: 2755fa97ae3f342ae88b1531acfa790772a27c17
2020-05-09 00:42:39 -07:00
Ailing Zhang
86d28706e0 Remove uses of type() part 2 (#38140)
Summary:
I'm mostly done with cleaning up the test/ folder. There are a bunch of remaining callsites, but they're "valid" in testing `type()` functionalities. We cannot remove them until it's fully deprecated.
The next PR will mainly focus on moving some callsites to an internal API.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38140

Differential Revision: D21483808

Pulled By: ailzhang

fbshipit-source-id: 12f5de6151bae59374cfa0372e827651de7e1c0f
2020-05-08 19:30:46 -07:00
Xiao Wang
63b1ae6983 Fix overflow in torch.remainder when dividend is very large (#37758)
Summary:
This will fix the GPU implementation in https://github.com/pytorch/pytorch/issues/37743 and https://github.com/pytorch/pytorch/issues/24861. Please also check my [comment](https://github.com/pytorch/pytorch/issues/37743#issuecomment-623285707).

The fixed `remainder_kernel` follows a similar implementation in numpy. See 79d7bc276a/numpy/core/src/npymath/npy_math_internal.h.src (L649-L658)

I also slightly update the doc for `torch.remainder`, to make it similar to `torch.fmod`.

I'm not sure how to modify the Vec256 code of CPU remainder_kernel, so I just leave it there.
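
A minimal sketch of the documented difference between the two ops (assuming a recent PyTorch build): remainder takes the sign of the divisor, like Python's %, while fmod takes the sign of the dividend.

```python
import torch

a = torch.tensor([-3.0, 3.0])
print(torch.remainder(a, 2))   # tensor([1., 1.])
print(torch.fmod(a, 2))        # tensor([-1., 1.])
```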
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37758

Differential Revision: D21388417

Pulled By: ngimel

fbshipit-source-id: 770ba5801cf34619b2b68b8b0cf95d8cfa52e6f6
2020-05-08 16:46:55 -07:00
Donna Choi
ca2206d071 Add documentation for FeatureAlphaDropout (#36295)
Summary:
These changes add documentation for FeatureAlphaDropout, based on a need raised in an issue by SsnL (Issue https://github.com/pytorch/pytorch/issues/9886).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36295

Differential Revision: D21478591

Pulled By: zou3519

fbshipit-source-id: a73c40bf1c7e3b1f301dc3347cef7b32e9842320
2020-05-08 15:09:01 -07:00
Ralf Gommers
726aa713d5 Replace torch.is_tensor usages with isinstance checks. (#38062)
Summary:
`is_tensor` doesn't really have a reason to exist anymore (other than
backwards compatibility) and is worse for typechecking with mypy (see
gh-32824). Given that it may not be obvious what the fix is once mypy
gives an error, make the change in a number of places at once, and add
a note on this to the `is_tensor` docstring.

Recommending an isinstance check instead has been done for quite a
while, e.g. https://github.com/pytorch/pytorch/pull/7769#discussion_r190458971
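
A minimal sketch of the recommended replacement (assuming a recent PyTorch build): the two checks agree, but isinstance also narrows the type for mypy.

```python
import torch

x = torch.zeros(3)
print(torch.is_tensor(x), isinstance(x, torch.Tensor))   # True True
```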
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38062

Differential Revision: D21470963

Pulled By: ezyang

fbshipit-source-id: 98dd60d32ca0650abd2de21910b541d32b0eea41
2020-05-08 10:10:11 -07:00
Chris Paulse
deeef50432 Check the _geev input matrix for NaNs and infs (#37642)
Summary:
If we don't do this we risk a segmentation fault from the Intel MKL.
Fixes https://github.com/pytorch/pytorch/issues/37499
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37642

Differential Revision: D21465181

Pulled By: pbelevich

fbshipit-source-id: 809dca11f11de91018d978578bc11737b879d6ec
2020-05-07 21:33:37 -07:00
Edward Yang
c2f787ce77 Give _VariableFunctions class a different name, so pickling works (#38033)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38033

Pickles require class names to be actually accessible from the module
in question.  _VariableFunction was not!  This fixes it.

Fixes https://github.com/pytorch/pytorch/issues/37703

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D21458068

Pulled By: ezyang

fbshipit-source-id: 2a5ac41f9d1972e300724981b9b4b84364ddc18c
2020-05-07 20:34:21 -07:00
Michael Carilli
35693e9b4b Give at::cuda::blas::gemv<at::Half> parity with <float> and <double>. Nature is healing. (#37569)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/37157 on my machine.

This was annoying to track down.  The essence is that cublas expects column major inputs and Pytorch tensors are usually row major.  Cublas lets you request that it act on transposed data, and the erroring `gemv` calls in https://github.com/pytorch/pytorch/issues/37157 make that request.  The problem is, [cublasSgemv and cublasDgemv](https://docs.nvidia.com/cuda/cublas/index.html#cublas-lt-t-gt-gemv) (called by [`gemv<float>`](091a1192d7/aten/src/ATen/cuda/CUDABlas.cpp (L318)) and `gemv<double>`) regard their `m, n` arguments values as _pre_-transpose sizes, while [cublasGemmEx](https://docs.nvidia.com/cuda/cublas/index.html#cublas-GemmEx) (called by `gemv<at::Half>`, see [here](091a1192d7/aten/src/ATen/cuda/CUDABlas.cpp (L342)) and [here](091a1192d7/aten/src/ATen/cuda/CUDABlas.cpp (L229))) regards its `m, k` argument values as _post_-transpose sizes.  This is inconsistent.  It turns out the `gemv<float>/<double>` calls are configured correctly and the `gemv<at::Half>` calls aren't.

Strikethrough text below is no longer accurate, ngimel suggested a better way to handle gemv->gemm forwarding.  [Comments in code](https://github.com/pytorch/pytorch/pull/37569/files#diff-686aa86335f96b4ecb9b37f562feed12R323-R348) provide an up-to-date explanation.

Keeping the out-of-date strikethrough text because I don't have the heart to delete it all and because it captures an intermediate state of my brain that will help orient me if I ever have to fix this again.

~~To convince myself this PR keeps `at::cuda::blas::gemv`'s external API consistent across dtypes, I need to think through what happens when a pytorch tensor input of size `(a,b)` multiples a vector of size `(b,)` for 4 cases:~~

### ~~1. input is row-major (needs cublas internal transpose)~~
#### ~~1a. input is float or double~~
~~`gemv<float>/<double>` call `cublasS/Dgemv`, forwarding `trans`,** `m`, and `n` directly.~~

~~`cublasS/Ggemv` expects "a m × n matrix stored in column-major format" (so m is the input's fast dim).  Input has size `(a, b)` in row-major format.  We can reinterpret it as a column-major matrix with size `(b, a)` without any memory movement.  So the gemv call should supply `m=b`, `n=a`.  However, we're not trying to multiply a matrix `(b, a)` x a vector `(b,)`, we're trying to sum across `b` for matrix and vector.  So we also request that cublas transpose the matrix internally by supplying `trans='t'` to `blas::gemv`, which becomes `trans=CUBLAS_OP_T` to the `cublasS/Ggemv`.~~

~~As long as the code calling `blas::gemv` thinks carefully and passes `trans='t'`, `m=b`, `n=a`, cublas carries out `(a, b) x (b,)` and all is well.~~

#### ~~1b. input is half or bfloat16~~
~~`blas::gemv<at::Half>` takes a different code path, calling `gemm<at::Half>` which calls `cublasGemmEx`.  The job of this PR is to make sure the exterior `blas::gemv` caller's carefully thought-out argument choices (`trans='t'`, `m=b`, `n=a`) remain correct.~~

~~`cublasGemmEx` takes args `transa, transb, m, n, k, ....others we don't care about` and carries out~~
```
C = α op(A) op(B) + β C
where α and β are scalars, and A, B and C are matrices stored in column-major format with
dimensions op(A) m × k, op(B) k × n and C m × n. Also, for matrix A:
op(A) = A    if transa == CUBLAS_OP_N
op(A) = A^T  if transa == CUBLAS_OP_T ...
```
~~`gemv<at::Half>` hacks a gemv by calling gemm such that the raw gemm's `m` is the output dim, `k` is the summed dim, and `n=1`, .  Reasonable, as long as we get the values right, given that we also need to transpose the input.~~

~~To conform with cublas docs we interpret input as column-major with size `(b, a)`.  As for the `<float>/<double>` gemv we want cublas to carry out input (interpreted as column major), internally transposed, times vector of size `(b,)`.  In other words we want cublas to apply `op(A) x B`, where op is transpose and `A` is input interpreted as column major.  Docs define `m` and `k` by saying `op(A)` has dims `m x k` **(`m` and `k` are _post_-`op` sizes)**.  `A` was `(b, a)`, `op(A)` is `(a, b)`, so the correct thing is to supply `m=a`, `k=b` to the underlying gemm.  **For the `<float>/<double>` gemv, we passed `m=b`, not `m=a`, to the raw `cublasS/Dgemv`.**~~

~~The exterior `blas::gemv` must have been called with `trans='t'`, `m=b`, `n=a` (as required by the `<float>/<double>` versions).  So when gemv is about to call gemm, **we [swap](https://github.com/pytorch/pytorch/pull/37569/files#diff-686aa86335f96b4ecb9b37f562feed12R330) the local values of `m` and `n` so that `m=a`, `n=b`,** then put `m (=a)` in the gemm's `m` spot, 1 in the gemm's `n` spot, and `n (=b)` in the gemm's `k` spot.  All is well (we made the right gemm call after ingesting the same arg values as `blas::gemv<float>/<double>`).~~

### ~~2. input is column-major (doesn't need cublas transpose)~~
#### ~~2a. input is float or double~~
~~input is `(a,b)`, already column-major with strides `(1,a)`.  Code calling `blas::gemv` supplies `trans='n'` (which becomes `CUBLAS_OP_N`, no internal transpose), `m=a`, `n=b`.~~

#### ~~2b. input is half or bfloat16~~
~~`blas::gemv` should pass `transa='n'`, `m=a`, `n=1`, `k=b` to the underlying gemm. The exterior `blas::gemv` must have been called with `trans='n'`, `m=a`, `n=b` (as required by the `<float>/<double>` versions). So **in this case we _don't_ swap `blas::gemv`'s local values of `m` and `n`.** We directly put `m (=a)` in the gemm's `m` spot, 1 in the gemm's `n` spot, and `n (=b)` in the gemm's `k` spot. All is well (we made the right gemm call after ingesting the same arg values as `blas::gemv<float>/<double>`).~~

~~** `trans` is a string `t` or `n` in the `at::cuda::blas::gemv` API, which gets [converted](091a1192d7/aten/src/ATen/cuda/CUDABlas.cpp (L314)) to a corresponding cublas enum value `CUBLAS_OP_T` (do transpose internally) or `CUBLAS_OP_N` (don't transpose internally) just before the raw cublas call.~~
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37569

Differential Revision: D21405955

Pulled By: ngimel

fbshipit-source-id: e831414bbf54860fb7a4dd8d5666ef8081acd3ee
2020-05-06 18:19:30 -07:00
anjali411
4c4816ad07 [CPU] addmv for complex tensors (#37924)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37924

Test Plan: Imported from OSS

Differential Revision: D21429384

Pulled By: anjali411

fbshipit-source-id: 8b1b76ed13d2e5785a4d552aedb2e6f58d304c46
2020-05-06 14:13:05 -07:00
Gao, Xiang
b57b596f20 Reduction should not coalesce_dimensions when splitting for 32bit indexing (#37788)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/37583
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37788

Differential Revision: D21387325

Pulled By: ngimel

fbshipit-source-id: dbd0f5a23e06d8c4cc68cd21b09b4b0221c4bba7
2020-05-05 23:44:00 -07:00
rohithkrn
e3934dfae8 [ROCm] Enable bfloat16 for ops in BERT model (#37634)
Summary:
Enables bfloat16 type for ops present in BERT model.
Enabled relevant unit tests.

ezyang jeffdaily
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37634

Differential Revision: D21413957

Pulled By: ezyang

fbshipit-source-id: 19309fe46b4a2f07922bf5b32fee2066df514aeb
2020-05-05 21:24:56 -07:00
Hong Xu
3b97723f08 Let >> and << support half on CUDA (#37670)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37670

Differential Revision: D21395325

Pulled By: ngimel

fbshipit-source-id: fcb02f3bee488717cdc1ffc05204970b907d3c3f
2020-05-05 10:10:37 -07:00
kshitij12345
145560f499 Migrate erf and erf_ from the TH to Aten (CUDA) : Closes #24558 (#36724)
Summary:
Closes https://github.com/pytorch/pytorch/issues/24558
Benchmark with same build settings on same system.
gcc : version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
CUDA : 10.1
GPU : 1050ti

```python
import timeit

for n, t in [(10_000, 20000),
             (100_000, 20000)]:
    for dtype in ('torch.half', 'torch.float', 'torch.double'):
        print(f'torch.erf(a) a.numel() == {n} for {t} times {dtype}')
        print(timeit.timeit(f'torch.erf(a); torch.cuda.synchronize()',
                              setup=f'import torch; a=torch.arange({n}, dtype={dtype}, device="cuda")',
                              number=t))
```

Before:

```
torch.erf(a) a.numel() == 10000 for 20000 times torch.half
0.29057903600187274
torch.erf(a) a.numel() == 10000 for 20000 times torch.float
0.2836507789979805
torch.erf(a) a.numel() == 10000 for 20000 times torch.double
0.44974555500084534
torch.erf(a) a.numel() == 100000 for 20000 times torch.half
0.31807255600142526
torch.erf(a) a.numel() == 100000 for 20000 times torch.float
0.3216503109979385
torch.erf(a) a.numel() == 100000 for 20000 times torch.double
2.0413486910001666
```

After:

```
torch.erf(a) a.numel() == 10000 for 20000 times torch.half
0.2867302739996376
torch.erf(a) a.numel() == 10000 for 20000 times torch.float
0.28851128199858067
torch.erf(a) a.numel() == 10000 for 20000 times torch.double
0.4592030350013374
torch.erf(a) a.numel() == 100000 for 20000 times torch.half
0.28704102400115517
torch.erf(a) a.numel() == 100000 for 20000 times torch.float
0.29036039400125446
torch.erf(a) a.numel() == 100000 for 20000 times torch.double
2.04035638699861
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36724

Differential Revision: D21164626

Pulled By: VitalyFedyunin

fbshipit-source-id: e6f3390b2bbb6e8d21e18ffe15f5d49a170fae83
2020-05-05 09:22:54 -07:00
Pavel Belevich
812a3fa03d Show warning if Tensor.random_()'s from and to are not in [-(2^digits), 2^digits] bounds for floating-point types (#37537)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37537

The documentation states that `random_()` samples "from the discrete uniform distribution". Floating-point types can support _discrete_ _uniform_ distribution only within range [-(2^digits), 2^digits], where `digits = std::numeric_limits<fp_type>::digits`, or

- [-(2^53), 2^53] for double
- [-(2^24), 2^24] for float
- [-(2^11), 2^11] for half
- [-(2^8), 2^8] for bfloat16

The worst scenario is when the floating-point type can not represent numbers between `from` and `to`. E.g.
```
torch.empty(10, dtype=torch.float).random_(16777217, 16777218)
tensor([16777216., 16777216., 16777216., 16777216., 16777216., 16777216.,
        16777216., 16777216., 16777216., 16777216.])
```
Because 16777217 cannot be represented in float.
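
For reference, a quick check of that boundary (an illustrative snippet, not part of the PR):

```python
import torch

# float32 has a 24-bit significand, so 2**24 + 1 rounds back to 2**24:
print(torch.tensor(2**24, dtype=torch.float).item())      # 16777216.0
print(torch.tensor(2**24 + 1, dtype=torch.float).item())  # 16777216.0
print(torch.tensor(2**24 + 2, dtype=torch.float).item())  # 16777218.0
```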

Test Plan: Imported from OSS

Differential Revision: D21380387

Pulled By: pbelevich

fbshipit-source-id: 80d77a5b592fff9ab35155a63045b71dcc8db2fd
2020-05-04 10:36:04 -07:00
ashishfarmer
bcdff7eb67 Fix for tests on ROCm (#37616)
Summary:
This pull request fixes and re-enables two of the tests disabled in https://github.com/pytorch/pytorch/issues/37427
1. `test_sparse_add_out_bfloat16` in test_sparse.py fixed to use updated `atol` argument instead of `prec` for `assertEqual`
2. The conversion of `flt_min` to `int64` is divergent on HIP compared to numpy. The change removes that conversion from the `test_float_to_int_conversion_finite` test case in test_torch.py

cc: ezyang jeffdaily
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37616

Differential Revision: D21379876

Pulled By: ezyang

fbshipit-source-id: 2bfb41d67874383a01330c5d540ee516b3b07dcc
2020-05-04 07:16:54 -07:00
Pavel Belevich
b1790794f6 Enforce Tensor.random_ check that from and to are in tensor dtype bounds (#37507)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37507

Replace `TORCH_WARN` with `TORCH_CHECK` if `Tensor.random_()`'s `from` or `to-1` is out of bounds for the tensor's dtype. The previous warning said "This warning will become an error in version 1.6 release, please fix the code in advance", so the time has come.

Related to #33106
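
A small sketch of the new behavior (illustrative; the exact error message may differ):

```python
import torch

t = torch.empty(3, dtype=torch.uint8)
t.random_(0, 256)       # fine: to-1 == 255 is within uint8 bounds
try:
    t.random_(0, 1000)  # to-1 == 999 is out of bounds for uint8; now a hard error
except RuntimeError as e:
    print(e)
```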

Test Plan: Imported from OSS

Differential Revision: D21349413

Pulled By: pbelevich

fbshipit-source-id: ac7c196a48fc58634611e427e65429a948119e40
2020-05-01 12:58:45 -07:00
anjali411
1f09f7ea44 Python API for Complex Storage and storage copy logic (#35771)
Summary:
Following up on this: https://github.com/pytorch/pytorch/pull/35851 cross dtype storage copy is not being used internally, so I have not included cross dtype copy for complex.
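
A minimal sketch of the resulting Python surface (same-dtype copy only, as noted above; illustrative, not taken from the PR):

```python
import torch

src = torch.tensor([1 + 2j, 3 - 4j], dtype=torch.complex64)
dst = torch.zeros(2, dtype=torch.complex64)

# Same-dtype storage copy; dst is backed by the storage being written into.
dst.storage().copy_(src.storage())
print(dst)  # tensor([1.+2.j, 3.-4.j])
```
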
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35771

Differential Revision: D21319650

Pulled By: anjali411

fbshipit-source-id: 07c72996ee598eba0cf401ad61534494d6f5b5b3
2020-05-01 11:47:22 -07:00
kshitij12345
22708be5af Migrate tan from TH to ATen (CUDA) (#36906)
Summary:
Closes https://github.com/pytorch/pytorch/issues/24641

Benchmark with same build settings on same system.
gcc : version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
CUDA : 10.1
GPU : 1050ti

```python
import timeit

for n, t in [(10_000, 20000),
             (100_000, 20000)]:
    for dtype in ('torch.half', 'torch.float', 'torch.double'):
        print(f'torch.tan(a) a.numel() == {n} for {t} times {dtype}')
        print(timeit.timeit(f'torch.tan(a); torch.cuda.synchronize()',
                              setup=f'import torch; a=torch.arange({n}, dtype={dtype}, device="cuda")',
                              number=t))
```

Before:

```
torch.tan(a) a.numel() == 10000 for 20000 times torch.half
0.28325206200003095
torch.tan(a) a.numel() == 10000 for 20000 times torch.float
0.28363607099998944
torch.tan(a) a.numel() == 10000 for 20000 times torch.double
0.43924326799998425
torch.tan(a) a.numel() == 100000 for 20000 times torch.half
0.3754699589999859
torch.tan(a) a.numel() == 100000 for 20000 times torch.float
0.38143782899999223
torch.tan(a) a.numel() == 100000 for 20000 times torch.double
1.7672172019999834
```

After:

```
torch.tan(a) a.numel() == 10000 for 20000 times torch.half
0.28982524599996395
torch.tan(a) a.numel() == 10000 for 20000 times torch.float
0.29121579000002384
torch.tan(a) a.numel() == 10000 for 20000 times torch.double
0.4599610559998837
torch.tan(a) a.numel() == 100000 for 20000 times torch.half
0.3557764019997194
torch.tan(a) a.numel() == 100000 for 20000 times torch.float
0.34793807599999127
torch.tan(a) a.numel() == 100000 for 20000 times torch.double
1.7564662459999454
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36906

Differential Revision: D21335320

Pulled By: VitalyFedyunin

fbshipit-source-id: efab9c175c60fb09223105380d48b93a81994fb0
2020-05-01 10:17:19 -07:00
Hong Xu
cd48fb5030 Vectorize linspace on CPU. (#27957)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27957

Benchmark (gcc 8.3, Debian Buster, turbo off, Release build, Intel(R) Xeon(R) E-2136):

```python
import timeit
for dtype in ('torch.double', 'torch.float', 'torch.uint8', 'torch.int8', 'torch.int16', 'torch.int32', 'torch.int64'):
    for n, t in [(40_000, 50000),
                (400_000, 5000)]:
        print(f'torch.linspace(0, 10, {n}, dtype={dtype}) for {t} times')
        print(timeit.timeit(f'torch.linspace(0, 10, {n}, dtype={dtype})', setup=f'import torch', number=t))
```

Before:

```
torch.linspace(0, 10, 40000, dtype=torch.double) for 50000 times
1.3964195849839598
torch.linspace(0, 10, 400000, dtype=torch.double) for 5000 times
1.2374563289922662
torch.linspace(0, 10, 40000, dtype=torch.float) for 50000 times
1.8631796519621275
torch.linspace(0, 10, 400000, dtype=torch.float) for 5000 times
1.6991038109990768
torch.linspace(0, 10, 40000, dtype=torch.uint8) for 50000 times
1.8358083459897898
torch.linspace(0, 10, 400000, dtype=torch.uint8) for 5000 times
1.7214750979910605
torch.linspace(0, 10, 40000, dtype=torch.int8) for 50000 times
1.8356257299892604
torch.linspace(0, 10, 400000, dtype=torch.int8) for 5000 times
1.706238206999842
torch.linspace(0, 10, 40000, dtype=torch.int16) for 50000 times
1.7463878280250356
torch.linspace(0, 10, 400000, dtype=torch.int16) for 5000 times
1.6172360889613628
torch.linspace(0, 10, 40000, dtype=torch.int32) for 50000 times
1.8656846070080064
torch.linspace(0, 10, 400000, dtype=torch.int32) for 5000 times
1.714238062966615
torch.linspace(0, 10, 40000, dtype=torch.int64) for 50000 times
1.8272205490502529
torch.linspace(0, 10, 400000, dtype=torch.int64) for 5000 times
1.6409171230043285
```

After:

```
torch.linspace(0, 10, 40000, dtype=torch.double) for 50000 times
1.0077099470072426
torch.linspace(0, 10, 400000, dtype=torch.double) for 5000 times
0.8227124120458029
torch.linspace(0, 10, 40000, dtype=torch.float) for 50000 times
1.0058343949494883
torch.linspace(0, 10, 400000, dtype=torch.float) for 5000 times
0.8376779520185664
torch.linspace(0, 10, 40000, dtype=torch.uint8) for 50000 times
1.903041019977536
torch.linspace(0, 10, 400000, dtype=torch.uint8) for 5000 times
1.7576498500420712
torch.linspace(0, 10, 40000, dtype=torch.int8) for 50000 times
1.7628699769848026
torch.linspace(0, 10, 400000, dtype=torch.int8) for 5000 times
1.6204477970022708
torch.linspace(0, 10, 40000, dtype=torch.int16) for 50000 times
2.0970272019621916
torch.linspace(0, 10, 400000, dtype=torch.int16) for 5000 times
1.9493417189805768
torch.linspace(0, 10, 40000, dtype=torch.int32) for 50000 times
2.29020385700278
torch.linspace(0, 10, 400000, dtype=torch.int32) for 5000 times
2.1212510910118
torch.linspace(0, 10, 40000, dtype=torch.int64) for 50000 times
2.3479344319785014
torch.linspace(0, 10, 400000, dtype=torch.int64) for 5000 times
2.156775983981788
```

Test Plan: Imported from OSS

Differential Revision: D20773454

Pulled By: VitalyFedyunin

fbshipit-source-id: ebeef59a90edde581669cc2afcc3d65929c8ac79
2020-04-30 14:26:24 -07:00
kshitij12345
7e9cc4df85 Migrate cos and cos_ from TH to ATen (CUDA) (#36653)
Summary:
Benchmark with same build settings on same system.

Closes https://github.com/pytorch/pytorch/issues/24545
gcc : version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
CUDA : 10.1
GPU : 1050ti

```python
import timeit

for n, t in [(10_000, 20000),
             (100_000, 20000)]:
    for dtype in ('torch.half', 'torch.float', 'torch.double'):
        print(f'torch.cos(a) a.numel() == {n} for {t} times {dtype}')
        print(timeit.timeit(f'torch.cos(a); torch.cuda.synchronize()',
                             setup=f'import torch; a=torch.arange({n}, dtype={dtype}, device="cuda")',
                             number=t))
```

Before:

```
torch.cos(a) a.numel() == 10000 for 20000 times torch.half
0.2797315450006863
torch.cos(a) a.numel() == 10000 for 20000 times torch.float
0.283109110998339
torch.cos(a) a.numel() == 10000 for 20000 times torch.double
0.3648525129974587
torch.cos(a) a.numel() == 100000 for 20000 times torch.half
0.34239949499897193
torch.cos(a) a.numel() == 100000 for 20000 times torch.float
0.33680364199972246
torch.cos(a) a.numel() == 100000 for 20000 times torch.double
1.0512770260102116
```

After:

```
torch.cos(a) a.numel() == 10000 for 20000 times torch.half
0.285825898999974
torch.cos(a) a.numel() == 10000 for 20000 times torch.float
0.2781305120001889
torch.cos(a) a.numel() == 10000 for 20000 times torch.double
0.34188826099989456
torch.cos(a) a.numel() == 100000 for 20000 times torch.half
0.29040409300023384
torch.cos(a) a.numel() == 100000 for 20000 times torch.float
0.28678944200009937
torch.cos(a) a.numel() == 100000 for 20000 times torch.double
1.065477349000048
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36653

Differential Revision: D21164675

Pulled By: VitalyFedyunin

fbshipit-source-id: 5dd5d3af47c2a5527e1f4ab7669c2ed9a2293cee
2020-04-29 15:52:24 -07:00
Jesse Brizzi
bca82801e7 add support for generating Vandermonde matrices (#36725)
Summary:
Adds support for generating Vandermonde matrices, based on the NumPy implementation found [here](https://github.com/numpy/numpy/blob/v1.17.0/numpy/lib/twodim_base.py#L475-L563).

Adds tests to ensure the generated matrix matches the NumPy implementation. Note the tests are limited to torch.long and torch.double due to differences in how PyTorch and NumPy deal with type promotion.
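
A short usage sketch (assuming the NumPy-style signature `vander(x, N=None, increasing=False)`):

```python
import torch

x = torch.tensor([1, 2, 3, 5])
print(torch.vander(x))
# tensor([[  1,   1,   1,   1],
#         [  8,   4,   2,   1],
#         [ 27,   9,   3,   1],
#         [125,  25,   5,   1]])
print(torch.vander(x, N=3, increasing=True))
# tensor([[ 1,  1,  1],
#         [ 1,  2,  4],
#         [ 1,  3,  9],
#         [ 1,  5, 25]])
```
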
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36725

Differential Revision: D21075138

Pulled By: jessebrizzi

fbshipit-source-id: 6bb1559e8247945714469b0e2b07c6f4d5fd1fd0
2020-04-29 13:16:26 -07:00
Nikita Shulga
1bb66a0cd4 Extend some of the basic ops to kHalf (#37121)
Summary:
Added enough operators to make sure that all unit tests from ATen/basic are passing, except for MM and IntArrayRefExpansion
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37121

Test Plan: `./bin/basic --gtest_filter=--gtest_filter=BasicTest.BasicTestHalfCPU` + `python -c "import torch; x = torch.tensor([2], dtype=torch.half); print(torch.isfinite(x+x))"`

Differential Revision: D21296863

Pulled By: malfet

fbshipit-source-id: e03d7a6939df11f611a9b317543bac52403cd009
2020-04-29 10:49:16 -07:00
ashishfarmer
bbd2350c99 Disable tests failing on test2 in ROCm CI (#37427)
Summary:
This pull request disables the unit tests that were observed to be failing once `test2` was enabled. These tests will be looked at and fixed one by one as soon as possible, but until then they are disabled to unblock `test2`.
The pull request also disables fftPlanDestroy for rocFFT to avoid double-freeing FFT handles.

cc: ezyang jeffdaily
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37427

Differential Revision: D21302909

Pulled By: ezyang

fbshipit-source-id: ecadda3778e65b7f4f97e24b932b96b9ce928616
2020-04-29 09:56:28 -07:00
Pavel Belevich
ec8517b6df Move exponential_() to DistributionTemplates (#37456)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37456

Fixes #37370

Test Plan: Imported from OSS

Differential Revision: D21290781

Pulled By: pbelevich

fbshipit-source-id: 2f516b5112b9ce1c9ba8967b3758decf86d65676
2020-04-29 08:07:35 -07:00
Pavel Belevich
06168bf17d Move geometric_() to DistributionTemplates (#37418)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37418

Fixes #37369

Test Plan: Imported from OSS

Differential Revision: D21290757

Pulled By: pbelevich

fbshipit-source-id: 42133f35edcbe716a07987bef2e68a4cdc27236a
2020-04-29 08:07:30 -07:00
Pavel Belevich
ce6077d7a8 Move log_normal_() to DistributionTemplates (#37392)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37392

Fixes #37368

Test Plan: Imported from OSS

Differential Revision: D21290740

Pulled By: pbelevich

fbshipit-source-id: 15a76b2625d2ca8187c25333a86eecd111a259c6
2020-04-29 08:06:05 -07:00
kshitij12345
4e3dc34c47 add complex support to reciprocal_cuda kernel (#36749)
Summary:
dylanbespalko anjali411

Not sure if the test should be added to `test_torch` or `test_complex`.
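
A quick illustrative example (assuming a CUDA build with complex support):

```python
import torch

z = torch.tensor([1 + 1j], device='cuda')
print(torch.reciprocal(z))  # 1/(1+1j) == 0.5-0.5j
```
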
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36749

Differential Revision: D21290529

Pulled By: anjali411

fbshipit-source-id: 07bc282e4c9480cd015ec5db104e79728437cd90
2020-04-28 21:51:46 -07:00
Emilio Castillo
273c464145 Fix TensorIterator::view_offsets_ size (#37214)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/37084

There are 3 alternatives for this design.

This PR and the first one.
When a tensor is a scalar (`ndim==0`), accessing `view_offsets_[0]` during reductions yields an invalid offset for the index that is the output of `argmax` and `argmin`.

fba9b9a023/aten/src/ATen/native/cpu/Reduce.h (L217)

This also happens in cuda code:
fba9b9a023/aten/src/ATen/native/cuda/Reduce.cuh (L797)

The second alternative is to check the size of `view_offsets` before accessing it. But this introduces some burden.

The third alternative is related to the way that inputs are treated in `argmax` and `argmin`
depending on the `dim` argument value.

fba9b9a023/aten/src/ATen/native/ReduceOps.cpp (L775-L780)

If `dim` is not specified, then the scalar gets reshaped into a 1-dim tensor and everything works properly, since now `view_offsets` has an actual entry.
If `dim` is specified, then the input remains a scalar, causing the issue we see here.

This PR tries to solve it in a generic way for every case, so I went with option 1. I am willing to discuss it and change it if you think the other alternatives are better.
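
A small reproduction sketch of the failing case described above (my reconstruction, not copied from the issue):

```python
import torch

t = torch.tensor(5.)    # 0-dim (scalar) tensor
print(t.argmax())       # tensor(0): dim unspecified, the input is reshaped to 1-d
print(t.argmax(dim=0))  # expected tensor(0); previously this path could read an invalid offset
```
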
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37214

Differential Revision: D21258320

Pulled By: ngimel

fbshipit-source-id: 46223412187bbba4bfa7337e3f1d2518db72dea2
2020-04-28 18:08:51 -07:00
anjali411
b8ec165c0d Fix failing test in test_torch.py (#37362)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37362

Differential Revision: D21264829

Pulled By: anjali411

fbshipit-source-id: cec6af84630378f03cb3863c85e161776af236cd
2020-04-27 16:42:11 -07:00
Mike Ruberry
b64fc3c4b5 Changes warnings generated in cpp to show point of Python origination (#36052)
Summary:
Today in PyTorch, warnings triggered in C++ are printed to Python users like this:

`../aten/src/ATen/native/BinaryOps.cpp:81: UserWarning: Integer division of tensors using div or / is deprecated, and in a future release div will perform true division as in Python 3. Use true_divide or floor_divide (// in Python) instead.`

This may be unhelpful to Python users, who have complained it's difficult to relate these messages back to their programs. After this PR, warnings that go through the PyWarningHandler and allow it to add context print like this:

```
test/test_torch.py:16463: UserWarning: Integer division of tensors using div or / is deprecated, and in a future release div will perform true division as in Python 3. Use true_divide or floor_divide (// in Python) instead. (Triggered internally at  ../aten/src/ATen/native/BinaryOps.cpp:81.)
  cpu_result = getattr(cpu_tensor, op_str)(*cpu_args)
```

This relates the warning back to the user's program. The information about the cpp file and line number is preserved in the body of the warning message.

Some warnings, like those generated in the JIT, already account for a user's Python context, and so they specify that they should be printed verbatim and are unaffected by this change. Warnings originating in Python and warnings that go through c10's warning handler, which prints to cerr, are also unaffected.

A test is added to test_torch.py for this behavior. The test relies on uint8 indexing being deprecated and its warning originating from its current header file, which is an unfortunate dependency. We could implement a `torch.warn` function, instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36052

Differential Revision: D20887740

Pulled By: mruberry

fbshipit-source-id: d3515c6658a387acb7fccaf83f23dbb452f02847
2020-04-25 21:18:58 -07:00
Xiang Gao
d7f7c290e3 addmv migration [resubmit] (#37236)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37236

Differential Revision: D21232988

Pulled By: anjali411

fbshipit-source-id: ac6c0ee018aef3c841b039d76e6e1fbb3cd0292d
2020-04-25 07:43:27 -07:00
anjali411
4f3946a89b Added complex dtypes to get_all_math_dtypes, complex acc type for cpu, fixed rdiv and pow for complex (#37193)
Summary:
Resolves https://github.com/pytorch/pytorch/issues/36730 https://github.com/pytorch/pytorch/issues/36057
Partially resolves: https://github.com/pytorch/pytorch/issues/36671
```
>>> 2j / torch.tensor([4], dtype = torch.complex64)
tensor([(0.0000+0.5000j)], dtype=torch.complex64)
>>> 1 / torch.tensor(3+4j)
tensor((0.1200-0.1600j), dtype=torch.complex64)
```
rdiv is more generally broken for all dtypes because it doesn't promote the types properly
eg.
```
>>> 1 / torch.tensor(2)
tensor(0)
>>> 2j / torch.tensor(4)
tensor(0)
```
so that issue should be fixed in a separate PR

Adding CPU acc types for complex
Added cumsum, cumprod for complex dtypes

Added complex dtypes to get_all_math_dtypes to expand testing for complex dtypes

Old PR - https://github.com/pytorch/pytorch/pull/36747
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37193

Differential Revision: D21229373

Pulled By: anjali411

fbshipit-source-id: 8a086136d8c10dabe62358d276331e3f22bb2342
2020-04-24 15:05:50 -07:00
Alexander Fix
2baff9476e Test test_is_nonzero make expected exception inline (#37128)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37128

In certain build modes (in fbcode, building a .par) the mechanism to get test output "expect" files doesn't work.
All other tests in test_torch.py already had assertExpectedInline instead of assertExpected, with the expected result inline in the file.
There was no equivalent for assertExpectedRaises, so I added one, and changed the tests for test_is_nonzero (the only test using this)

Test Plan: CI, specifically the test test_is_nonzero should pass

Reviewed By: malfet

Differential Revision: D21197651

fbshipit-source-id: 2a07079efdcf1f0b0abe60e92cadcf55d81d4b13
2020-04-24 13:12:31 -07:00
moto
5a27ec09b8 Add Inverse Short Time Fourier Transform in ATen native (#35569)
Summary:
Ported `torchaudio`'s implementation (test, and documentation as well) to ATen.

Note
 - Batch packing/unpacking is performed in Python. The ATen implementation expects a 4D input tensor.
 - `hop_length` is initialized in the same way as in the `stft` implementation. [The torchaudio version tried to mimic the same behavior but is slightly different](7da61a4bee/torchaudio/functional.py (L152-L157)).

Closes https://github.com/pytorch/pytorch/issues/34827
Relates https://github.com/pytorch/pytorch/issues/3775
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35569

Differential Revision: D21178090

Pulled By: mthrok

fbshipit-source-id: 2701a8b241a36a6fb1b740c2fb2b07cb938185d4
2020-04-24 12:14:55 -07:00
kshitij12345
e98cdfa26f Migrate tanh from TH to ATen (CUDA) (#36995)
Summary:
Closes https://github.com/pytorch/pytorch/issues/24642

Benchmark with same build settings on same system.
gcc : version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
CUDA : 10.1
GPU : 1050ti

```python
import timeit

for n, t in [(10_000, 20000),
             (100_000, 20000)]:
    for dtype in ('torch.half', 'torch.float', 'torch.double'):
        print(f'torch.tanh(a) a.numel() == {n} for {t} times {dtype}')
        print(timeit.timeit(f'torch.tanh(a); torch.cuda.synchronize()',
                              setup=f'import torch; a=torch.arange({n}, dtype={dtype}, device="cuda")',
                              number=t))
```

Before:

```
torch.tanh(a) a.numel() == 10000 for 20000 times torch.half
0.2816318240002147
torch.tanh(a) a.numel() == 10000 for 20000 times torch.float
0.2728829070001666
torch.tanh(a) a.numel() == 10000 for 20000 times torch.double
0.39797203200214426
torch.tanh(a) a.numel() == 100000 for 20000 times torch.half
0.3228214350019698
torch.tanh(a) a.numel() == 100000 for 20000 times torch.float
0.31780802399953245
torch.tanh(a) a.numel() == 100000 for 20000 times torch.double
1.3745740449994628
```

After:

```
torch.tanh(a) a.numel() == 10000 for 20000 times torch.half
0.27825374500025646
torch.tanh(a) a.numel() == 10000 for 20000 times torch.float
0.27764024499992956
torch.tanh(a) a.numel() == 10000 for 20000 times torch.double
0.3771585260001302
torch.tanh(a) a.numel() == 100000 for 20000 times torch.half
0.2995866400015075
torch.tanh(a) a.numel() == 100000 for 20000 times torch.float
0.28355561699936516
torch.tanh(a) a.numel() == 100000 for 20000 times torch.double
1.393811182002537
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36995

Differential Revision: D21163353

Pulled By: ngimel

fbshipit-source-id: e2216ff62cdfdd13b6a56daa63d4ef1440d991d4
2020-04-23 12:29:27 -07:00
Taylor Robie
7aec364bdf extend gather shape check to handle incorrectly sized outputs (#37102)
Summary:
Fixes a safety issue (Nonsense values and segfaults) introduced by https://github.com/pytorch/pytorch/pull/36875 when in-place gather tries to use incorrect shapes.

Consider the following block of code:
```
k0 = 8
k1 = 8
m = 100

x = torch.rand((k0, k1))
ind = torch.randint(0, k0, (m, k1))
output = torch.empty((m, k1))

print(torch.gather(x, 0, ind, out=output))
print(torch.gather(x, 1, ind, out=output))
```

The first gather is legal, the second is not (`ind` and `output` need to be transposed). Previously this was caught when the kernel tried to restride inputs for TensorIterator, but we can no longer rely on those checks and must test explicitly. If `m` is small, the second gather returns gibberish; if it is large enough to push the read out of the memory block, the program segfaults.
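
For completeness, a corrected version of the second call (a sketch with the same shapes as above): for `dim=1`, the index and output are transposed so that `index.size(0) <= x.size(0)`.

```python
import torch

k0 = k1 = 8
m = 100
x = torch.rand((k0, k1))

ind_t = torch.randint(0, k1, (k0, m))        # indices along dim 1 must be < k1
out_t = torch.empty((k0, m))
print(torch.gather(x, 1, ind_t, out=out_t))  # index/output transposed: now legal
```
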
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37102

Differential Revision: D21190580

Pulled By: robieta

fbshipit-source-id: 80175620d24ad3380d78995f7ec7dbf2627d2998
2020-04-23 11:47:01 -07:00
Anjali Chourdia
c306f2ed08 Revert D20660338: [pytorch][PR] Migrate addmv and mv from legacy to ATen native (CUDA & CPU)
Test Plan: revert-hammer

Differential Revision:
D20660338

Original commit changeset: db1f521f1241

fbshipit-source-id: 8616ddd7bbd8f00351cfc45331a09b0bc9aa28ea
2020-04-23 10:46:45 -07:00
Gao, Xiang
a38c6e0454 Migrate addmv and mv from legacy to ATen native (CUDA & CPU) (#30898)
Summary:
Fixes: https://github.com/pytorch/pytorch/issues/24605 https://github.com/pytorch/pytorch/issues/24535 https://github.com/pytorch/pytorch/issues/24739 https://github.com/pytorch/pytorch/issues/24680 https://github.com/pytorch/pytorch/issues/30986

This does not fix https://github.com/pytorch/pytorch/issues/29984, it will be fixed in later PR.

Most of this PR is just following the same logic inside TH and THC, except for the handling of n-dimensional zero-sized tensors, specifically the case:
```
(m,).addmv((m, 0), (0,), beta, alpha)
```

#  Legacy code bugs and how this PR deals with them

The above case is one where BLAS often has a mismatch of semantics with PyTorch: for BLAS and cuBLAS, the above is a no-op, but for PyTorch, it is a scalar-vector multiplication `output = beta * input`. The handling of this case in the legacy code is already very poor and it is poorly tested:

For the CPU implementation, there are two code paths:
- Path 1: when dtype is float or double and `USE_BLAS`, then use BLAS
- Path 2: when other dtypes or not `USE_BLAS`, use a fallback kernel in PyTorch

For the CUDA implementation, there are also two code paths:
- Path 1: when float or double, then use `cublasSgemv` or `cublasDgemv` in cuBlas
- Path 2: when half, dispatch to `addmm`

`test_blas_alpha_beta_empty` is supposed to cover all cases, but unfortunately, it only tests the Path 1 of CUDA and Path 1 of CPU, and both uncovered paths (path 2 for CPU and path 2 for CUDA) are buggy in legacy code. In this PR, I expanded the coverage of `test_blas_alpha_beta_empty`, but unfortunately, I have to skip the `half` dtype on CUDA 9. See the description below for detail:

## Bug on CPU implementation

For the CPU implementation, the fallback kernel in path 2 already has the same semantics as PyTorch, not BLAS. But the code that tries to correct BLAS semantics to match PyTorch also runs on this case, leading to double correction, that is, `output = beta * input` now becomes `output = beta * beta * input`.

This leads to the issue https://github.com/pytorch/pytorch/issues/30986 I just opened, and it is fixed in this PR.
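
A minimal sketch of the intended PyTorch semantics for the empty case (per the description above, the buggy path-2 fallback returned `beta * beta * input` instead):

```python
import torch

m = 3
inp = torch.ones(m)
mat = torch.empty(m, 0)
vec = torch.empty(0)

# mat @ vec reduces over an empty dimension, so the result is beta * input.
print(torch.addmv(inp, mat, vec, beta=2, alpha=1))  # tensor([2., 2., 2.])
```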

## Bug on CUDA implementation

For the CUDA implementation, path 2 dispatches to
```
(m, 1).addmm((m, 0), (0, 1), beta, alpha)
```
But unfortunately, for some old CUDA versions on old GPUs with the half dtype, the above is also a no-op, which is definitely not correct.

But from what I see, on newer CUDA versions or newer GPUs, this is not a problem. This is a bug of PyTorch in `addmm`, so I opened a new issue https://github.com/pytorch/pytorch/issues/31006 to track this problem. But this is highly likely a dependency bug for PyTorch originating from cuBLAS, and it only affects a rarely used edge case on old hardware and software, so this issue would be a `won't_fix` unless some real requirements strongly indicate that this should be fixed.

This issue already exists in the legacy code, and this PR does not make it worse. To prevent this issue from bothering us, I disable the test of the `half` dtype for CUDA 9 when expanding the coverage of `test_blas_alpha_beta_empty`.

I promote a CircleCI CUDA 10.1 test to `XImportant` so that it runs on PRs, because path 2 of the CUDA implementation is only covered by this configuration. Let me know if I should revert this change.

## An additional problem

In the legacy code for `addmv`, dtype `bfloat16` is enabled and dispatched to `addmm`, but `addmm` does not support `bfloat16` from what I tested. I do the same thing in the new code. Let me know if I should do it differently.

# Benchmark

Code:
```python
import torch
print(torch.__version__)

for i in range(1000):
    torch.arange(i, device='cuda')

print('cpu')
for i in 10, 100, 1000, 10000:
    a = torch.randn((i,))
    b = torch.randn((i, i))
    c = torch.randn((i,))
    %timeit a.addmv(b, c, alpha=1, beta=2)

print('cuda')
for i in 10, 100, 1000, 10000:
    a = torch.randn((i,)).cuda()
    b = torch.randn((i, i)).cuda()
    c = torch.randn((i,)).cuda()
    torch.cuda.synchronize()
    %timeit a.addmv(b, c, alpha=1, beta=2); torch.cuda.synchronize()
```

Before:
```
1.5.0a0+2b45368
cpu
2.74 µs ± 30.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
8.5 µs ± 85.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
686 µs ± 2.95 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
74 ms ± 410 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
cuda
The slowest run took 4.81 times longer than the fastest. This could mean that an intermediate result is being cached.
27.6 µs ± 23 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
17.3 µs ± 151 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
20.5 µs ± 369 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
756 µs ± 6.81 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```

After:
```
1.5.0a0+66b4034
cpu
3.29 µs ± 20 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
9.09 µs ± 7.41 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
687 µs ± 7.01 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
73.8 ms ± 453 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
cuda
18.2 µs ± 478 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
17.7 µs ± 299 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
21.5 µs ± 2.38 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
751 µs ± 35.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30898

Differential Revision: D20660338

Pulled By: anjali411

fbshipit-source-id: db1f521f124198f63545064026f93fcb16b68f18
2020-04-23 06:56:49 -07:00
Alexander Fix
b889e0da8a [torch] Excluding test_fft_input_modification without MKL (#36680)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36680

If torch compiled without MKL, this test fails with torch.fft requiring MKL support

Test Plan: CI

Reviewed By: malfet

Differential Revision: D21051362

fbshipit-source-id: dd2e2c7d323622c1c25fc4c817b85d83d2241b3a
2020-04-22 21:58:02 -07:00
Ailing Zhang
efcbcca454 Revert D21138687: [pytorch][PR] Added complex dtypes to get_all_math_dtypes, complex acc type for cpu, fixed rdiv and pow for complex
Test Plan: revert-hammer

Differential Revision:
D21138687

Original commit changeset: ad3602ccf86c

fbshipit-source-id: 69eb031c1a7c3d5e4b9f4241fbdada8d5980535d
2020-04-22 14:49:45 -07:00
Emilio Castillo
5fc391a646 Enforce type promotion in torch.cat (#35030)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/35014

The CUDA `cat` implementation doesn't use `TensorIterator`, so some checks need to be done manually in the code.
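
An illustrative example of the promotion now enforced on CUDA as well (assuming a CUDA build):

```python
import torch

a = torch.ones(2, dtype=torch.float32, device='cuda')
b = torch.ones(2, dtype=torch.float64, device='cuda')
print(torch.cat([a, b]).dtype)  # torch.float64, matching the CPU behavior
```
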
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35030

Differential Revision: D21155853

Pulled By: nairbv

fbshipit-source-id: 9e78bb7591f806734e12555831157061c925ff40
2020-04-22 13:35:07 -07:00
kshitij12345
a00d6758b8 Migrate cosh and cosh_ from TH to ATen (CUDA) (#36654)
Summary:
Closes https://github.com/pytorch/pytorch/issues/24546

Benchmark with same build settings on same system.
gcc : version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
CUDA : 10.1
GPU : 1050ti

```python
import timeit

for n, t in [(10_000, 20000),
             (100_000, 20000)]:
    for dtype in ('torch.half', 'torch.float', 'torch.double'):
        print(f'torch.cosh(a) a.numel() == {n} for {t} times {dtype}')
        print(timeit.timeit(f'torch.cosh(a); torch.cuda.synchronize()',
                              setup=f'import torch; a=torch.arange({n}, dtype={dtype}, device="cuda")',
                              number=t))
```

Before:

```
torch.cosh(a) a.numel() == 10000 for 20000 times torch.half
0.2813017509997735
torch.cosh(a) a.numel() == 10000 for 20000 times torch.float
0.28355878599904827
torch.cosh(a) a.numel() == 10000 for 20000 times torch.double
0.27810572300040803
torch.cosh(a) a.numel() == 100000 for 20000 times torch.half
0.3239932899996347
torch.cosh(a) a.numel() == 100000 for 20000 times torch.float
0.321233343998756
torch.cosh(a) a.numel() == 100000 for 20000 times torch.double
0.5546665399997437
```

After:

```
torch.cosh(a) a.numel() == 10000 for 20000 times torch.half
0.2905335750001541
torch.cosh(a) a.numel() == 10000 for 20000 times torch.float
0.27596429500044906
torch.cosh(a) a.numel() == 10000 for 20000 times torch.double
0.30358699899989006
torch.cosh(a) a.numel() == 100000 for 20000 times torch.half
0.30139567500009434
torch.cosh(a) a.numel() == 100000 for 20000 times torch.float
0.30246640400036995
torch.cosh(a) a.numel() == 100000 for 20000 times torch.double
0.5403946970000106

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36654

Differential Revision: D21164606

Pulled By: VitalyFedyunin

fbshipit-source-id: 55e88f94044957f81599ae3c12cda38a3e2c985c
2020-04-22 10:16:24 -07:00
David Reiss
e75fb4356b Remove (most) Python 2 support from Python code (#35615)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35615

Python 2 has reached end-of-life and is no longer supported by PyTorch.
Now we can clean up a lot of cruft that we put in place to support it.
These changes were all done manually, and I skipped anything that seemed
like it would take more than a few seconds, so I think it makes sense to
review it manually as well (though using side-by-side view and ignoring
whitespace change might be helpful).

Test Plan: CI

Differential Revision: D20842886

Pulled By: dreiss

fbshipit-source-id: 8cad4e87c45895e7ce3938a88e61157a79504aed
2020-04-22 09:23:14 -07:00
anjali411
25eb250d77 Added complex dtypes to get_all_math_dtypes, complex acc type for cpu, fixed rdiv and pow for complex (#36747)
Summary:
Resolves https://github.com/pytorch/pytorch/issues/36730 https://github.com/pytorch/pytorch/issues/36057
Partially resolves: https://github.com/pytorch/pytorch/issues/36671
```
>>> 2j / torch.tensor([4], dtype = torch.complex64)
tensor([(0.0000+0.5000j)], dtype=torch.complex64)
>>> 1 / torch.tensor(3+4j)
tensor((0.1200-0.1600j), dtype=torch.complex64)
```
rdiv is more generally broken for all dtypes because it doesn't promote the types properly
eg.
```
>>> 1 / torch.tensor(2)
tensor(0)
>>> 2j / torch.tensor(4)
tensor(0)
```
so that issue should be fixed in a separate PR

Adding CPU acc types for complex
Added cumsum, cumprod for complex dtypes

Added complex dtypes to get_all_math_dtypes to expand testing for complex dtypes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36747

Differential Revision: D21138687

Pulled By: anjali411

fbshipit-source-id: ad3602ccf86c70294a6e71e564cb0d46c393dfab
2020-04-22 08:52:41 -07:00
Mike Ruberry
4a2372bc90 Implements torch.isclose for complex tensors (#36456)
Summary:
Previously torch.isclose would raise a RuntimeError when called on complex tensors. This PR updates torch.isclose to run on complex tensors and be consistent with [NumPy](https://numpy.org/doc/1.18/reference/generated/numpy.isclose.html). However, NumPy's handling of NaN, -inf, and inf values is odd, so I adopted Python's [cmath.isclose](https://docs.python.org/3/library/cmath.html) behavior when dealing with them. See https://github.com/numpy/numpy/issues/15959 for more on NumPy's behavior.

While implementing complex isclose I also simplified the isclose algorithm to:

- A is close to B if A and B are equal; if equal_nan is true, then NaN is equal to NaN
- If A and B are finite, then A is close to B if `abs(a - b) <= (atol + abs(rtol * b))`
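
A scalar sketch of this algorithm in plain Python (illustrative only, not the actual kernel):

```python
import cmath

def isclose_sketch(a, b, rtol=1e-05, atol=1e-08, equal_nan=False):
    if a == b:                                    # also covers matching infinities
        return True
    if equal_nan and cmath.isnan(a) and cmath.isnan(b):
        return True
    if not (cmath.isfinite(a) and cmath.isfinite(b)):
        return False                              # one side is inf or nan
    return abs(a - b) <= atol + abs(rtol * b)

print(isclose_sketch(1.0, 1.0 + 1e-9))            # True
print(isclose_sketch(1 + 1j, 1 + 1.0000001j))     # True
print(isclose_sketch(float('inf'), float('inf'))) # True (equal)
```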

This PR also documents torch.isclose, since it was undocumented, and adds multiple tests for its behavior to test_torch.py since it had no dedicated tests.

The PR leaves equal_nan=True with complex inputs an error for now, pending the outcome of https://github.com/numpy/numpy/issues/15959.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36456

Differential Revision: D21159853

Pulled By: mruberry

fbshipit-source-id: fb18fa7048e6104cc24f5ce308fdfb0ba5e4bb30
2020-04-21 19:53:55 -07:00
Mike Ruberry
a850d8a526 Fixes exponential with lambda=0 (#36837)
Summary:
Resolves https://github.com/pytorch/pytorch/issues/36798.

In the future more thorough testing would be nice.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36837

Differential Revision: D21102342

Pulled By: mruberry

fbshipit-source-id: 4fae45677e54b403296033720dfb13abca47f3a4
2020-04-21 17:34:07 -07:00
Jesse Brizzi
28f439d4f4 add absolute alias for abs (#36597)
Summary:
Adds an absolute alias for the abs function to match Numpy's use of both:
https://docs.scipy.org/doc/numpy/reference/generated/numpy.absolute.html

Adds a test to ensure the output from abs and absolute is the same.
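
A quick check of the alias:

```python
import torch

t = torch.tensor([-1.5, 0.0, 2.0])
print(torch.absolute(t))                             # tensor([1.5000, 0.0000, 2.0000])
print(torch.equal(torch.absolute(t), torch.abs(t)))  # True
```
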
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36597

Differential Revision: D21024458

Pulled By: jessebrizzi

fbshipit-source-id: 4f2987e7bc7cde444d0a93e833a0350844b48d44
2020-04-20 14:49:51 -07:00
Mike Ruberry
0f0d69009e Makes CUDA -float->uint8 cast consistent with CPU (#36832)
Summary:
Addresses https://github.com/pytorch/pytorch/issues/36807. Also updates the cast testing to catch issues like this better.

In the future a more constexpr based approach to casting would be nice.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36832

Differential Revision: D21120822

Pulled By: mruberry

fbshipit-source-id: 9504ddd36cfe6d9f9f545fc277fef36855c1b221
2020-04-19 23:33:38 -07:00
Natalia Gimelshein
1b3741aa7f [WIP] reenable bfloat16 masked_select (#36859)
Summary:
Try reenabling bfloat16 masked_select, see it windows tests pass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36859

Differential Revision: D21109535

Pulled By: ngimel

fbshipit-source-id: ca260943e6575d8e788e9fd87161a0d40d3d44fb
2020-04-19 15:41:32 -07:00
Brian Vaughan
54ed6fd3ee Use both absolute and relative tolerance in testing (#34258)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34258

This PR allows both atol and rtol to be specified, uses defaults based on the prior analysis (spreadsheet attached to https://github.com/pytorch/pytorch/pull/32538), but retains the absolute tolerance behavior in cases where precision was previously specified explicitly.

Test Plan: Imported from OSS

Differential Revision: D21110255

Pulled By: nairbv

fbshipit-source-id: 57b3a004c7d5ac1be80ee765f03668b1b13f4a7e
2020-04-19 06:16:49 -07:00
Xiang Gao
6ba734bae9 Vectorize reduction when reducing on fastest striding dimension [resubmit] (#36873)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36873

Differential Revision: D21109194

Pulled By: ngimel

fbshipit-source-id: eb18c6b4394f19a6c5eca45ef4ce97d623e051bd
2020-04-18 16:27:00 -07:00
Yuxin Wu
a64ea8ea04 Back out "Vectorize reduction when reducing on fastest striding dimension" (#36854)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36854

Original commit changeset: ea3f7f29709c

Test Plan: n/a

Differential Revision: D21103684

fbshipit-source-id: e4862b32bf9815486e5fa7e05b9816550e9b0263
2020-04-17 19:53:30 -07:00
Xiang Gao
d92005ff73 Vectorize reduction when reducing on fastest striding dimension (#36709)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36709

Test Plan: Imported from OSS

Differential Revision: D21083393

Pulled By: ngimel

fbshipit-source-id: ea3f7f29709c9a6e5b3ec45ba809cb2cf6c5e0c8
2020-04-17 10:12:49 -07:00
Mike Ruberry
d7fabfd5df Implements complex isfinite and isinf (#36648)
Summary:
Implements complex isfinite and isinf, consistent with NumPy.

A complex value is finite if and only if both its real and imaginary parts are finite.

A complex value is infinite if and only if its real or imaginary part is infinite.
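
An illustrative example of the definitions above:

```python
import torch

z = torch.tensor([complex(1, 2),
                  complex(float('inf'), 1),
                  complex(float('nan'), 0)])
print(torch.isfinite(z))  # tensor([ True, False, False])
print(torch.isinf(z))     # tensor([False,  True, False])
print(torch.isnan(z))     # tensor([False, False,  True])
```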

Old isfinite, isinf, and isnan tests are modernized and instead of fixtures the torch results are compared with NumPy. A new test is added for complex isfinite, isinf, and isnan. The docs for each function are updated to clarify what finite, infinite, and NaN values are.

The new tests rely on a new helper, _np_compare, that we'll likely want to generalize in the near future and use in more tests.

Addresses part of the complex support tasks. See https://github.com/pytorch/pytorch/issues/33152.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36648

Differential Revision: D21054766

Pulled By: mruberry

fbshipit-source-id: d947707c5437385775c82f4e6c722349ca5a2174
2020-04-16 09:09:02 -07:00
anjali411
9e016f77a8 Added complex types to get_all_dtypes and turned on masked_fill for complex (#36335)
Summary:
1. Added complex dtypes to get_all_dtypes to unify testing for complex dtypes with other dtypes so that they don't get out of sync with behavior supported for other dtypes.
2. resolves https://github.com/pytorch/pytorch/issues/36322, https://github.com/pytorch/pytorch/issues/36327
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36335

Differential Revision: D21045603

Pulled By: anjali411

fbshipit-source-id: 5089306b66fdc18148e831f56298da5de673be67
2020-04-16 08:24:45 -07:00
lixinyu
1e7155caa5 Bucketization (#7284) (#34577)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34577

Test Plan: Imported from OSS

Differential Revision: D20380975

Pulled By: glaringlee

fbshipit-source-id: d75939bc54d98675f88d7037491a8420ac20847a
2020-04-15 10:32:51 -07:00
Vasiliy Kuznetsov
16e90eba59 hardsigmoid: add cuda kernels (#36351)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36351

Adds CUDA kernels for hardsigmoid, to enable its use in training.

Note: the update to the cpu backward pass is to keep the cpu vs cuda
logic consistent, no change in functionality.
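
A small training-style sketch (assuming `torch.nn.functional.hardsigmoid` and a CUDA build; hardsigmoid(x) is 0 for x <= -3, 1 for x >= 3, and x/6 + 1/2 in between):

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-4, 4, 9, device='cuda', requires_grad=True)
y = F.hardsigmoid(x)  # elementwise hard sigmoid on the GPU
y.sum().backward()    # exercises the new CUDA backward kernel
print(y)
print(x.grad)         # 1/6 inside (-3, 3), 0 outside
```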

Test Plan:
add CI for the forward pass
run this for the backward pass:
https://gist.github.com/vkuzo/95957d365600f9ad10d25bd20f58cc1a

Imported from OSS

Differential Revision: D20955589

fbshipit-source-id: dc198aa6a58e1a7996e1831f1e479c398ffcbc90
2020-04-15 10:15:49 -07:00