1594 Commits

Author SHA1 Message Date
Ivan Yashchuk
260daf088d Added linalg.cholesky (#46083)
Summary:
This PR adds `torch.linalg.cholesky` function that matches `numpy.linalg.cholesky`.

Fixed `lda` argument to `lapackCholesky` calls.
Added `random_hermitian_pd_matrix` helper function for tests.
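
For readers unfamiliar with the factorization itself, here is a minimal pure-Python sketch of a lower-triangular Cholesky decomposition. It is not the LAPACK-backed routine this PR calls: the helper name `cholesky_lower` is hypothetical, and the sketch assumes a real symmetric positive-definite input rather than the general Hermitian case.

```python
import math


def cholesky_lower(a):
    """Return lower-triangular L such that L @ L^T == a.

    Hypothetical helper: `a` is a real symmetric positive-definite
    matrix given as a list of lists.
    """
    n = len(a)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            # Dot product of the already-computed parts of rows i and j.
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                diag = a[i][i] - partial
                if diag <= 0.0:
                    raise ValueError("matrix is not positive-definite")
                lower[i][j] = math.sqrt(diag)
            else:
                lower[i][j] = (a[i][j] - partial) / lower[j][j]
    return lower


a = [[4.0, 12.0, -16.0],
     [12.0, 37.0, -43.0],
     [-16.0, -43.0, 98.0]]
factor = cholesky_lower(a)
# factor is [[2, 0, 0], [6, 1, 0], [-8, 5, 3]] (as floats)
```

The result is lower triangular, which is also the default convention of `numpy.linalg.cholesky`.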

Ref https://github.com/pytorch/pytorch/issues/42666.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46083

Reviewed By: ailzhang

Differential Revision: D24861752

Pulled By: mruberry

fbshipit-source-id: 214dbceb4e8a2c589df209493efd843962d25593
2020-11-13 16:50:40 -08:00
Richard Zou
1c7c612af0 Revert D24543682: [pytorch][PR] Added support for complex input for torch.lu_solve
Test Plan: revert-hammer

Differential Revision:
D24543682 (ffd0003022)

Original commit changeset: 165bde39ef95

fbshipit-source-id: 790b4157fdbc7149aaf0748555efe6daed7e1a23
2020-11-13 08:24:53 -08:00
Ivan Yashchuk
ffd0003022 Added support for complex input for torch.lu_solve (#46862)
Summary:
`torch.lu_solve` now works for complex inputs both on CPU and GPU.
I moved the existing tests to `test_linalg.py` and modified them to test complex dtypes, but I didn't modify/improve the body of the tests.

Ref. https://github.com/pytorch/pytorch/issues/33152

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46862

Reviewed By: nikithamalgifb

Differential Revision: D24543682

Pulled By: anjali411

fbshipit-source-id: 165bde39ef95cafebf976c5ba4b487297efe8433
2020-11-13 02:35:31 -08:00
Gao, Xiang
0652d755d3 Fix some flaky tests in test_torch.py and test_nn.py (#46941)
Summary:
Fixed tests:
- `test_is_nonzero`: this asserted an exact match, which is flaky when `TORCH_SHOW_CPP_STACKTRACES=1`, so I changed it to a non-exact assert
- `test_pinverse` TF32
- `test_symeig` TF32
- `test_triangular_solve_batched_many_batches_cpu_float64` precision on CPU BLAS
- `test_qr` TF32; also, the tensor factory was missing a `dtype=dtype`
- `test_lu` TF32
- `ConvTranspose2d` TF32
- `Conv3d_1x1x1_no_bias` TF32
- `Transformer*` TF32

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46941

Reviewed By: heitorschueroff

Differential Revision: D24852725

Pulled By: mruberry

fbshipit-source-id: ccd4740cc643476178d81059d1c78da34e5082ed
2020-11-12 22:35:42 -08:00
kshitij12345
3649a2c170 [numpy] torch.sqrt : promote integer inputs to float (#47293)
Summary:
Reference https://github.com/pytorch/pytorch/issues/42515

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47293

Reviewed By: malfet

Differential Revision: D24855994

Pulled By: mruberry

fbshipit-source-id: 1e6752f2eeba6d638dea0bdea0c650cf722718c9
2020-11-12 16:16:09 -08:00
Ivan Yashchuk
149190c014 Added CUDA support for complex input for torch.solve (#47045)
Summary:
`torch.solve` now works for complex inputs on GPU.
I moved the existing tests to `test_linalg.py` and modified them to test complex and float32 dtypes.
Differentiation also works correctly with complex inputs.

Fixes https://github.com/pytorch/pytorch/issues/41084
Ref. https://github.com/pytorch/pytorch/issues/33152

anjali411 I hope you don't mind that I took over https://github.com/pytorch/pytorch/pull/42737

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47045

Reviewed By: nikithamalgifb

Differential Revision: D24921503

Pulled By: anjali411

fbshipit-source-id: 4c3fc4f193a84b6e28c43c08672d480715000923
2020-11-12 12:22:59 -08:00
Gregory Chanan
b6cb2caa68 Revert "Fixed einsum compatibility/performance issues (#46398)" (#47821)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47821

This reverts commit a5c65b86ce.

 Conflicts:
	test/test_linalg.py

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D24909923

Pulled By: gchanan

fbshipit-source-id: 9dcf98e7c4a3c7e5aaffe475867fa086f3bb6ff2
2020-11-12 08:11:40 -08:00
anjali411
e1ee3bfc0e Port bmm and baddbmm from TH to ATen (#42553)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42553

Ports `torch.bmm` and `torch.baddbmm` from TH to ATen and adds support for complex dtypes. Also removes dead TH code for Level 2 functions.

Closes #24539

Test Plan: Imported from OSS

Reviewed By: ansley

Differential Revision: D24893511

Pulled By: anjali411

fbshipit-source-id: 0eba3f2aec99c48b3018a5264ee7789279cfab58
2020-11-12 07:57:42 -08:00
Ivan Yashchuk
52ec8b9340 Added CUDA support for complex input for torch.triangular_solve (#46916)
Summary:
`torch.triangular_solve` now works for complex inputs on GPU.
I moved the existing tests to `test_linalg.py` and modified them to test complex and float32 dtypes.

Ref. https://github.com/pytorch/pytorch/issues/33152

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46916

Reviewed By: navahgar, agolynski

Differential Revision: D24706647

Pulled By: anjali411

fbshipit-source-id: fe780eac93d2ae1b2549539bb385e5fac25213b3
2020-11-11 16:08:11 -08:00
Ivan Yashchuk
a1db5b0f2b Added CUDA support for complex input for torch.inverse #2 (#47595)
Summary:
`torch.inverse` now works for complex inputs on GPU.
Opening a new PR here. The previous PR was merged and reverted due to a bug in tests marked with `slowTest`.
Previous PR https://github.com/pytorch/pytorch/pull/45034

Ref. https://github.com/pytorch/pytorch/issues/33152

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47595

Reviewed By: navahgar

Differential Revision: D24840955

Pulled By: anjali411

fbshipit-source-id: ec49fffdc4b3cb4ae7507270fa24e127be14f59b
2020-11-11 11:06:08 -08:00
Heitor Schueroff
a5c65b86ce Fixed einsum compatibility/performance issues (#46398)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46398

This PR makes torch.einsum compatible with numpy.einsum, except for the sublist input option, as requested in https://github.com/pytorch/pytorch/issues/21412. It also fixes the two performance issues linked below and adds a check for reducing to torch.dot instead of torch.bmm, which is faster in some cases.

fixes #45854, #37628, #30194, #15671

fixes #41467 with benchmark below
```python
import torch
from torch.utils.benchmark import Timer

a = torch.randn(10000, 100, 101, device='cuda')
b = torch.randn(10000, 101, 3, device='cuda')

c = torch.randn(10000, 100, 1, device='cuda')
d = torch.randn(10000, 100, 1, 3, device='cuda')

print(Timer(
    stmt='torch.einsum("bij,bjf->bif", a, b)',
    globals={'a': a, 'b': b}
).blocked_autorange())

print()

print(Timer(
    stmt='torch.einsum("bic,bicf->bif", c, d)',
    globals={'c': c, 'd': d}
).blocked_autorange())
```
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7fa37c413850>
torch.einsum("bij,bjf->bif", a, b)
  Median: 4.53 ms
  IQR:    0.00 ms (4.53 to 4.53)
  45 measurements, 1 runs per measurement, 1 thread

<torch.utils.benchmark.utils.common.Measurement object at 0x7fa37c413700>
torch.einsum("bic,bicf->bif", c, d)
  Median: 63.86 us
  IQR:    1.52 us (63.22 to 64.73)
  4 measurements, 1000 runs per measurement, 1 thread
```

fixes #32591 with benchmark below
```python
import torch
from torch.utils.benchmark import Timer

a = torch.rand(1, 1, 16, 2, 16, 2, 16, 2, 2, 2, 2, device="cuda")
b = torch.rand(729, 1, 1, 2, 1, 2, 1, 2, 2, 2, 2, device="cuda")

print(Timer(
    stmt='(a * b).sum(dim = (-3, -2, -1))',
    globals={'a': a, 'b': b}
).blocked_autorange())

print()

print(Timer(
    stmt='torch.einsum("...ijk, ...ijk -> ...", a, b)',
    globals={'a': a, 'b': b}
).blocked_autorange())
```
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7efe0de28850>
(a * b).sum(dim = (-3, -2, -1))
  Median: 17.86 ms
  2 measurements, 10 runs per measurement, 1 thread

<torch.utils.benchmark.utils.common.Measurement object at 0x7efe0de286a0>
torch.einsum("...ijk, ...ijk -> ...", a, b)
  Median: 296.11 us
  IQR:    1.38 us (295.42 to 296.81)
  662 measurements, 1 runs per measurement, 1 thread
```

TODO

- [x] add support for ellipsis broadcasting
- [x] fix corner case issues with sumproduct_pair
- [x] update docs and add more comments
- [x] add tests for error cases

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D24860367

Pulled By: heitorschueroff

fbshipit-source-id: 31110ee598fd598a43acccf07929b67daee160f9
2020-11-10 19:38:43 -08:00
Heitor Schueroff
bf6a156f64 Fix kthvalue error for scalar input (#47600)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47600

fixes https://github.com/pytorch/pytorch/issues/30818

Note that the median case was already fixed by https://github.com/pytorch/pytorch/pull/45847

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D24860337

Pulled By: heitorschueroff

fbshipit-source-id: 69ccbbb6c7c86671e5712b1c2056c012d898b4f2
2020-11-10 17:21:52 -08:00
kshitij12345
6575e674ce [numpy] torch.{all, any} : Extend Dtype Support (#44790)
Summary:
Reference https://github.com/pytorch/pytorch/issues/44779

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44790

Reviewed By: bdhirsh

Differential Revision: D24393119

Pulled By: heitorschueroff

fbshipit-source-id: a9b88e9d06b3c282f2e5360b6eaea4ae8ef77c1d
2020-11-10 17:11:39 -08:00
Natalia Gimelshein
c9d37675b2 Back out "[pytorch][PR] The dimension being reduced should not be coalesced by TensorIterator" (#47642)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47642

Original commit changeset: 02bb2b15694c

Test Plan: Covered by CI tests

Reviewed By: anjali411

Differential Revision: D24849072

fbshipit-source-id: a8790cbf46936aee7a6f504dac8595997175fc65
2020-11-10 16:31:33 -08:00
Radhakrishnan Venkataramani
163adb9fa7 Add HalfToFloat + FloatToHalf operators to PyTorch (#45092)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45092

Adding two operators
1. at::float_to_half -> Converts FP32 tensor to FP16 tensor
2. at::half_to_float -> Converts FP16 tensor to FP32 tensor.

These operators internally use the kernel provided by FBGeMM. Both C2 and PT will use the same FBGeMM kernel underneath.
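
To illustrate what an FP32 to FP16 conversion does to a value, here is a small stdlib-only sketch. It does not use the FBGeMM kernel or the new ATen operators; the helper names are hypothetical, and it relies on the `e` (IEEE 754 binary16) format code of Python's `struct` module.

```python
import struct


def float_to_half_bits(value):
    """Round a Python float to binary16 and return its 16-bit pattern."""
    return struct.unpack("<H", struct.pack("<e", value))[0]


def half_bits_to_float(bits):
    """Expand a 16-bit binary16 pattern back to a Python float."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


one_bits = float_to_half_bits(1.0)            # 0x3C00
round_trip = half_bits_to_float(float_to_half_bits(0.1))
# 0.1 is not representable in binary16, so round_trip differs slightly from 0.1.
error = abs(round_trip - 0.1)
```

Values that fit exactly in binary16, such as 1.0 or 0.5, survive the round trip unchanged; others are rounded to the nearest representable half-precision value.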

Test Plan:
buck test //caffe2/test:torch -- .*test_half_tensor.*

Run benchmark locally using

```
buck run //caffe2/benchmarks/operator_benchmark/pt:tensor_to_test
```

AI Bench results are pending. I don't expect them to finish soon, as there is a large queue with jobs pending for 2+ days.

Benchmark for a 512x512 tensor with the FBGeMM implementation

```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: FloatToHalfTensorConversionBenchmark
# Mode: Eager
# Name: FloatToHalfTensorConversionBenchmark_M512_N512_cpu
# Input: M: 512, N: 512, device: cpu
Forward Execution Time (us) : 1246.332

# Benchmarking PyTorch: HalfToFloatTensorConversionBenchmark
# Mode: Eager
# Name: HalfToFloatTensorConversionBenchmark_M512_N512_cpu
# Input: M: 512, N: 512, device: cpu
Forward Execution Time (us) : 1734.304
```

Benchmark for a 512x512 tensor on trunk without FBGeMM integration

```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: FloatToHalfTensorConversionBenchmark
# Mode: Eager
# Name: FloatToHalfTensorConversionBenchmark_M512_N512_cpu
# Input: M: 512, N: 512, device: cpu
Forward Execution Time (us) : 169045.724

# Benchmarking PyTorch: HalfToFloatTensorConversionBenchmark
# Mode: Eager
# Name: HalfToFloatTensorConversionBenchmark_M512_N512_cpu
# Input: M: 512, N: 512, device: cpu
Forward Execution Time (us) : 152382.494
```

Reviewed By: ngimel

Differential Revision: D23824869

fbshipit-source-id: ef044459b6c8c6e5ddded72080204c6a0ab4582c
2020-11-10 12:00:53 -08:00
Gregory Chanan
65a72cae2c Fix type promotion for trace on CPU. (#47305)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47305

Fixes https://github.com/pytorch/pytorch/issues/47127.

Ideally this would just use diag and sum (as the CUDA implementation does), but that seems to have performance problems, which I'll link in the github PR.

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D24729627

Pulled By: gchanan

fbshipit-source-id: 151b786b53e7b958f0929c803dbf8e95981c6884
2020-11-10 07:46:03 -08:00
John Kilpatrick
8aca85dbcd Add diagflat complex support (#47564)
Summary:
Adds complex number support for `torch.diagflat`
``` python
>>> import torch
>>> a = torch.ones(2, dtype=torch.complex128)
>>> torch.diagflat(a)
tensor([[1.+0.j, 0.+0.j],
        [0.+0.j, 1.+0.j]], dtype=torch.complex128)
>>> b = a.cuda()
>>> torch.diagflat(b)
tensor([[1.+0.j, 0.+0.j],
        [0.+0.j, 1.+0.j]], device='cuda:0', dtype=torch.complex128)
```

Note that automatic differentiation isn't implemented:
``` python
>>> d = torch.ones(1, dtype=torch.complex128, requires_grad=True)
>>> torch.diagflat(d)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: diag does not support automatic differentiation for outputs with complex dtype.
```

Fixes https://github.com/pytorch/pytorch/issues/47499

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47564

Reviewed By: heitorschueroff

Differential Revision: D24844467

Pulled By: anjali411

fbshipit-source-id: 9c8cb795d52880b7dcffab0c059b0f6c2e5ef151
2020-11-09 20:28:23 -08:00
Xiang Gao
f23a2a1115 The dimension being reduced should not be coalesced by TensorIterator (#47237)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/37583#issuecomment-720172838

Also adds an overload of `<<` to make debugging easier.

This PR is tested by `test_reduction_split_cuda` which was added in https://github.com/pytorch/pytorch/pull/37788.

Reproduce
```python
import torch

a = torch.zeros(8, 1, 128, 1024, 1024)
a.cuda().sum(1)
```

Before

```
TensorIterator @ 0x7ffd05b10ba0 {
  ntensors() = 2
  noutputs() = 1
  shape() = [1073741824]
  strides(*) = {
    (0) = [4]
    (1) = [4]
  }
  dtype(*) = {
    (0) = Float
    (1) = Float
  }
  is_reduction_ = 1
}
```

After

```
TensorIterator @ 0x7fffc9051010 {
  ntensors() = 2
  noutputs() = 1
  shape() = [1, 1073741824]
  strides(*) = {
    (0) = [0, 4]
    (1) = [536870912, 4]
  }
  dtype(*) = {
    (0) = Float
    (1) = Float
  }
  is_reduction_ = 1
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47237

Reviewed By: ejguan

Differential Revision: D24734763

Pulled By: ngimel

fbshipit-source-id: 02bb2b15694c68f96434f55033b63b6e5ff7085b
2020-11-07 01:30:24 -08:00
Xiong Wei
f90da88d8f Add complex support for torch.mean [CUDA] (#47048)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46982

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47048

Reviewed By: heitorschueroff

Differential Revision: D24729895

Pulled By: anjali411

fbshipit-source-id: 8e948480eb87c37de810207edf909375c0380772
2020-11-06 21:29:19 -08:00
Howard Huang
451e7d3db4 Enable diag for bool Tensors (#47455)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47455

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D24772483

Pulled By: H-Huang

fbshipit-source-id: 08ea4af4352972617db3c6475943b326f36b3049
2020-11-06 21:29:17 -08:00
Howard Huang
3253ccbd9f Add bool tensor support for where (#47454)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47454

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D24772482

Pulled By: H-Huang

fbshipit-source-id: ea488aae5bf64ac20f7a5d001e8edf55eed16eaf
2020-11-06 21:26:24 -08:00
Rong Rong
5614f72534 Suppress test issues in test_torch running in sandcastle (#47474)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47474

After enabling GPU/Re, some issues were specific to those runs

Test Plan:
```
buck test -c test.external_runner=tpx mode/opt //caffe2/test:torch_cuda -- --use-remote-execution --force-tpx --run-disabled
```

Reviewed By: malfet, janeyx99

Differential Revision: D24771578

fbshipit-source-id: 1ada79dae12c8cb6f795a0d261c60f038eee2dfb
2020-11-06 10:34:28 -08:00
Edward Yang
1aeefcdaa6 Revert D24730264: [pytorch][PR] Added CUDA support for complex input for torch.inverse
Test Plan: revert-hammer

Differential Revision:
D24730264 (33acbedace)

Original commit changeset: b9c94ec46301

fbshipit-source-id: beb9263700e9bc92685f74c37c46aa33f3b595b9
2020-11-06 07:28:14 -08:00
Ivan Yashchuk
33acbedace Added CUDA support for complex input for torch.inverse (#45034)
Summary:
`torch.inverse` now works for complex inputs on GPU.
Test cases with complex matrices are xfailed for now. For example, batched matmul does not work with complex yet.

Ref. https://github.com/pytorch/pytorch/issues/33152

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45034

Reviewed By: zou3519

Differential Revision: D24730264

Pulled By: anjali411

fbshipit-source-id: b9c94ec463012913c117278a884adeee96ea02aa
2020-11-05 16:30:11 -08:00
Heitor Schueroff
a4ba018e57 Updated docs/test for dot and vdot (#47242)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47242

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D24733771

Pulled By: heitorschueroff

fbshipit-source-id: 92e3b0e28e0565918335fa85d52abe5db9eeff57
2020-11-05 06:27:50 -08:00
Xiang Gao
f19637e6ee Expand the test of torch.addbmm and torch.baddbmm (#47079)
Summary:
This is to satisfy the request at https://github.com/pytorch/pytorch/pull/42553#issuecomment-673673914. See also https://github.com/pytorch/pytorch/pull/47124

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47079

Reviewed By: ejguan

Differential Revision: D24735356

Pulled By: ngimel

fbshipit-source-id: 122fceb4902658f350c2fd6f92455adadd0ec2a4
2020-11-04 21:11:26 -08:00
Xiang Gao
030caa190f Expand the test of torch.bmm on CUDA (#47124)
Summary:
basically https://github.com/pytorch/pytorch/pull/47070, enabled on all CI with `ci-all`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47124

Reviewed By: ejguan

Differential Revision: D24735130

Pulled By: ngimel

fbshipit-source-id: c2124562a9f9d1caf24686e5d8a1106c79366233
2020-11-04 17:29:34 -08:00
Brian Hirsh
fe17269e75 Revert "Revert D24335982: explicitly error out in comparison ops when the types don't match" (#47288)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47288

This reverts commit b3eb0c86cf.

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D24706531

Pulled By: bdhirsh

fbshipit-source-id: f3bf34ddba7882932155819251b6c7dcb5c6b56c
2020-11-04 09:27:47 -08:00
Erjia Guan
f1ac63d324 Implement copysign (#46396)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46396

Related #38349

[numpy](https://numpy.org/doc/stable/reference/generated/numpy.copysign.html?highlight=copysign#numpy.copysign)
- No in-place function
- No method
- Optional output
- Available: byte, char, bool, int, short, long, float, double, half
- Integral promoted to float
- Not available: float/double complex

`c = np.copysign(a, b)`
|  a |  b |  c | a.grad |
|----|----|----|--------|
| -1 | -1 | -1 |  1 |
| -0 | -1 | -0 |  0 |
|  0 | -1 | -0 |  0 |
|  1 | -1 | -1 | -1 |
| -1 | -0 | -1 |  1 |
| -0 | -0 | -0 |  0 |
|  0 | -0 | -0 |  0 |
|  1 | -0 | -1 | -1 |
| -1 |  0 |  1 | -1 |
| -0 |  0 |  0 |  0 |
|  0 |  0 |  0 |  0 |
|  1 |  0 |  1 |  1 |
| -1 |  1 |  1 | -1 |
| -0 |  1 |  0 |  0 |
|  0 |  1 |  0 |  0 |
|  1 |  1 |  1 |  1 |

This function becomes **non-differentiable** at `a=0` for any `b`. So, in my opinion, we may set the gradient for `a=0` to 0.
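
The gradient convention proposed here can be sketched with the standard library. This is an illustration of the rule only, not the kernel added by the PR; `copysign_with_grad` is a hypothetical helper, and it treats the sign of `b` as its sign bit, so `-0.0` counts as negative.

```python
import math


def copysign_with_grad(a, b):
    """Return (copysign(a, b), d/da copysign(a, b)).

    The gradient at a == 0 is set to 0 by convention, because the
    function is not differentiable there.
    """
    c = math.copysign(a, b)
    if a == 0.0:
        grad = 0.0
    else:
        # c == |a| * sign(b), so dc/da == sign(a) * sign(b).
        grad = math.copysign(1.0, a) * math.copysign(1.0, b)
    return c, grad


same_sign = copysign_with_grad(-1.0, -1.0)   # (-1.0, 1.0)
flipped = copysign_with_grad(1.0, -0.0)      # (-1.0, -1.0)
at_zero = copysign_with_grad(0.0, 1.0)       # (0.0, 0.0)
```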

TODO:
- [x] test (cpu/gpu)
- [x] doc
- [x] ~kernel_vec~

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D24401366

Pulled By: ejguan

fbshipit-source-id: 3621c5ff74b185376a3705589983bb5197ab896d
2020-11-04 08:08:57 -08:00
Qi Zhou
0ec717c830 Support int32 indices and offsets in nn.EmbeddingBag (#46758)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46758

It's in general helpful to support int32 indices and offsets, especially when such tensors are large and need to be transferred to accelerator backends. Since it may not be very useful to support the combination of int32 indices and int64 offsets, here we enforce that these two must have the same type.

Test Plan: unit tests

Reviewed By: ngimel

Differential Revision: D24470808

fbshipit-source-id: 94b8a1d0b7fc9fe3d128247aa042c04d7c227f0b
2020-11-03 23:33:50 -08:00
Howard Huang
a8ef4d3f0b Provide 'out' parameter for 'tensordot' (#47278)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42102

Added an optional out parameter to the tensordot operation to allow using buffers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47278

Test Plan: pytest test/test_torch.py -k tensordot -v

Reviewed By: agolynski

Differential Revision: D24706258

Pulled By: H-Huang

fbshipit-source-id: eb4bcd114795f67de3a670291034107d2826ea69
2020-11-03 15:56:00 -08:00
Xiao Wang
774b638eb6 Change largeCUDATensorTest to largeTensorTest+onlyCUDA; add a buffer to large cuda tensor test (#45332)
Summary:
Effectively, `largeCUDATensorTest` = `largeTensorTest` + `onlyCUDA`.

A user hit an OOM in a `largeCUDATensorTest('16GB')` test on a 16GB V100. The decorator checked the total memory of the GPU device, but in most cases we can't allocate all of the memory that a GPU has. It is therefore useful to have a buffer on the `largeTensorTest` check for CUDA, so I added a 10% buffer to it.
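
The effect of such a buffer can be sketched without a GPU. The code below is an assumption-laden illustration, not the contents of `common_device_type.py`: the names `parse_size` and `has_sufficient_memory` are hypothetical, and it assumes the buffer is applied by requiring 10% more memory than the test asks for.

```python
_UNITS = {"KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}


def parse_size(size):
    """Convert a string such as '16GB' to bytes; integers pass through."""
    if isinstance(size, int):
        return size
    for suffix, factor in _UNITS.items():
        if size.upper().endswith(suffix):
            return int(float(size[: -len(suffix)]) * factor)
    raise ValueError(f"unrecognized size: {size!r}")


def has_sufficient_memory(required, total_bytes, buffer_fraction=0.10):
    """Return True if the device can hold `required` plus a safety buffer."""
    return total_bytes >= parse_size(required) * (1.0 + buffer_fraction)


v100_16gb = 16 * 1024 ** 3
runs_16gb_test = has_sufficient_memory("16GB", v100_16gb)   # False: skip the test
runs_12gb_test = has_sufficient_memory("12GB", v100_16gb)   # True
```

Under this assumption, a test that requests exactly the device's total memory is skipped instead of failing with an OOM.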

Definition of `largeTensorTest`

d22dd80128/torch/testing/_internal/common_device_type.py (L560-L578)

`_has_sufficient_memory`

d22dd80128/torch/testing/_internal/common_device_type.py (L535-L557)

`largeCUDATensorTest`

d22dd80128/torch/testing/_internal/common_device_type.py (L526-L532)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45332

Reviewed By: ngimel

Differential Revision: D24698690

Pulled By: mruberry

fbshipit-source-id: a77544478e45ce271f6639ea04e87700574ae307
2020-11-03 11:43:49 -08:00
Richard Zou
86151da19e Port CPU Trace from TH to ATen (#47126)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47126

Context
-------
This PR is a rebase of shihongzhi's https://github.com/pytorch/pytorch/pull/35360.
I forgot to merge it back when it was submitted so I rebased it and ran new benchmarks on it.

Benchmarks
----------

TL;DR: The op has more overhead than the TH version but for larger shapes the overhead disappears.

```
import torch

shapes = [
    [1, 1],
    [100, 100],
    [1000, 1000],
    [10000, 10000],
    [100000, 100000],
]

for shape in shapes:
    x = torch.ones(shape)
    %timeit x.trace()

Before:
1.83 µs ± 42.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
1.98 µs ± 48.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
3.19 µs ± 10.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
85.2 µs ± 700 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
1.23 ms ± 4.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

After:
2.16 µs ± 325 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
2.08 µs ± 275 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
4.45 µs ± 19.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
81.8 µs ± 766 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
1.27 ms ± 6.75 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```

Future work
-----------
Things that can be done after this PR:
- add complex tensor support
- Fix the type promotion discrepancy between CPU and CUDA

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D24683259

Pulled By: zou3519

fbshipit-source-id: f92b566ad0d58b72663ab64899d209c96edb78eb
2020-11-02 16:03:22 -08:00
Richard Zou
8054ae3e77 Add test for trace (#47125)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47125

We didn't actually have any tests for torch.trace. The tests expose a
discrepancy between the behavior of torch.trace on CPU and CUDA that
I'll file an issue for.

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D24683260

Pulled By: zou3519

fbshipit-source-id: 71dd3af62bc98c6b9b0ba2bf2923cb6d44daa640
2020-11-02 16:00:33 -08:00
Brian Hirsh
b3eb0c86cf Revert D24335982: explicitly error out in comparison ops when the types don't match
Test Plan: revert-hammer

Differential Revision:
D24335982 (60fea510a1)

Original commit changeset: 3dfb02bcb403

fbshipit-source-id: 00072f1b00e228bbbe295053091cf4a7a46f4668
2020-11-02 14:08:01 -08:00
Xiong Wei
22b3d414de Enhance the torch.pow testcase for the complex scalar base (#47101)
Summary:
Related https://github.com/pytorch/pytorch/issues/45259

This PR addresses https://github.com/pytorch/pytorch/pull/45259#discussion_r514390664

- leverage the `make_tensor` function to generate a random tensor as the exponent, avoiding an all-zeros tensor for the integer exponent.
- add some special cases for zero exponents and the `1 + 0j` base.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47101

Reviewed By: mruberry

Differential Revision: D24682430

Pulled By: zou3519

fbshipit-source-id: f559dc0ba08f37ae070036fb25a52ede17a24149
2020-11-02 13:13:15 -08:00
Brian Hirsh
60fea510a1 explicitly error out in comparison ops when the types don't match (#46399)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46399

Explicitly error out in comparison/logical ops when the dtypes of the various input/output tensors don't match. See [this comment](https://github.com/pytorch/pytorch/pull/46399#discussion_r505686406) for more details.

fixes #42660

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D24335982

Pulled By: bdhirsh

fbshipit-source-id: 3dfb02bcb403dda5bcbf5ed3eae543354ad698b2
2020-11-02 11:42:32 -08:00
Nikita Shulga
edac4060d7 Fix mul cuda for bool (#47031)
Summary:
Also, add tests for tensor by scalar multiplication / division

Fixes https://github.com/pytorch/pytorch/issues/47007

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47031

Reviewed By: walterddr

Differential Revision: D24608874

Pulled By: malfet

fbshipit-source-id: 4e15179904814d6e67228276d3d11ff1b5d15d0d
2020-10-30 10:38:32 -07:00
Heitor Schueroff
ddeacf1565 Fix median bug on discontiguous tensors (#46917)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46917

fixes https://github.com/pytorch/pytorch/issues/46814

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D24633412

Pulled By: heitorschueroff

fbshipit-source-id: 54732671b298bdc2b04b13ab3a373892ee0933c3
2020-10-29 17:12:22 -07:00
Xiong Wei
74d730c0b5 implement NumPy-like functionality column_stack, row_stack (#46313)
Summary:
Related https://github.com/pytorch/pytorch/issues/38349

This PR implements `column_stack` as a composite of `torch.reshape` and `torch.hstack`, and makes `row_stack` an alias of `torch.vstack`.
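
The reshape-then-hstack composition can be sketched on plain nested lists. This is a list-based illustration of the idea under simplifying assumptions (inputs are 1-D or 2-D lists), not the ATen implementation; `as_column`, `hstack`, and `column_stack` below are hypothetical stand-ins for the tensor operations.

```python
def as_column(x):
    """Reshape a 1-D list of n items to shape (n, 1); keep 2-D input as is."""
    if x and isinstance(x[0], list):
        return x
    return [[v] for v in x]


def hstack(mats):
    """Concatenate 2-D lists with equal row counts along the column axis."""
    rows = len(mats[0])
    if any(len(m) != rows for m in mats):
        raise ValueError("all inputs must have the same number of rows")
    return [[v for m in mats for v in m[r]] for r in range(rows)]


def column_stack(seqs):
    """Stack 1-D inputs as columns, then join everything horizontally."""
    return hstack([as_column(s) for s in seqs])


stacked = column_stack([[1, 2, 3], [4, 5, 6]])
# stacked == [[1, 4], [2, 5], [3, 6]]
mixed = column_stack([[1, 2, 3], [[4, 5], [6, 7], [8, 9]]])
# mixed == [[1, 4, 5], [2, 6, 7], [3, 8, 9]]
```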

Todo

- [x] docs
- [x] alias pattern for `row_stack`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46313

Reviewed By: ngimel

Differential Revision: D24585471

Pulled By: mruberry

fbshipit-source-id: 62fc0ffd43d051dc3ecf386a3e9c0b89086c1d1c
2020-10-29 12:14:39 -07:00
mfkasim91
6eaa324c9f Implement torch.igamma (#46183)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41637
This is the regularized lower incomplete gamma function, equivalent to SciPy's `gammainc` and TensorFlow's `igamma`.
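
As a reference for what the function computes, here is a minimal series-based sketch of the regularized lower incomplete gamma function P(a, x). It is not the kernel added by this PR; the name `igamma_series` and its tolerance parameters are assumptions, and the plain power series converges slowly when `x` is much larger than `a`.

```python
import math


def igamma_series(a, x, tol=1e-15, max_terms=10_000):
    """Regularized lower incomplete gamma P(a, x) for a > 0, x >= 0.

    Uses P(a, x) = x^a e^(-x) / Gamma(a) * sum_{n>=0} x^n / (a (a+1) ... (a+n)).
    """
    if a <= 0.0 or x < 0.0:
        raise ValueError("requires a > 0 and x >= 0")
    if x == 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    for n in range(1, max_terms):
        term *= x / (a + n)
        total += term
        if abs(term) < tol * abs(total):
            break
    # Combine the prefactor in log space to avoid overflow in x**a.
    return total * math.exp(a * math.log(x) - x - math.lgamma(a))


p_exp = igamma_series(1.0, 2.0)    # equals 1 - exp(-2) up to rounding
p_erf = igamma_series(0.5, 1.0)    # equals erf(1) up to rounding
```

The two sample values use the closed forms P(1, x) = 1 - e^(-x) and P(1/2, x) = erf(sqrt(x)), which make convenient sanity checks.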

cc fritzo mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46183

Reviewed By: gchanan

Differential Revision: D24479126

Pulled By: mruberry

fbshipit-source-id: fdf8ea289fe4ca1b408810732192411e948fcdfe
2020-10-29 11:40:18 -07:00
Sameer Deshmukh
2249a293b7 Fix segfault with torch.orgqr. (#46700)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41768

The fault was that a NULL `tau` would get passed to the LAPACK function. This PR fixes that by checking at the beginning of the function whether `tau` contains 0 elements.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46700

Reviewed By: albanD

Differential Revision: D24616427

Pulled By: mruberry

fbshipit-source-id: 92e8f1489b113c0ceeca6e54dea8b810a51a63c3
2020-10-29 10:34:39 -07:00
Kurt Mohler
b75b961934 Fix requires_grad arg for new_full, new_empty, new_zeros (#46486)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/36455

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46486

Reviewed By: gchanan

Differential Revision: D24497034

Pulled By: ezyang

fbshipit-source-id: 769a7f00f9a8f7cb77273a1193173a837ae7e32f
2020-10-28 09:34:53 -07:00
kiyosora
53839ac9d7 Fix internal assert for torch.heaviside with cuda tensor and cpu scalar tensor (#46831)
Summary:
Fixed https://github.com/pytorch/pytorch/issues/46681

```
>>> x = torch.randn(10, device='cuda')
>>> y = torch.tensor(1.)
>>> torch.heaviside(x, y)
tensor([0., 1., 0., 1., 1., 0., 1., 1., 1., 0.], device='cuda:0')
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46831

Reviewed By: navahgar

Differential Revision: D24567953

Pulled By: izdeby

fbshipit-source-id: e5fcf4355b27ce0bdf434963d01863d3b24d0bea
2020-10-27 16:47:33 -07:00
Hong Xu
bcbb6baccf Add a warning message that torch.sign would not support complex numbers (#43280)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43280

Test Plan: Imported from OSS

Reviewed By: ansley

Differential Revision: D24538769

Pulled By: anjali411

fbshipit-source-id: ab2d5283501e4c1d7d401d508e32f685add7ebb1
2020-10-26 21:13:12 -07:00
Xiang Gao
7731370e71 CUDA BFloat16 gelu, hardswish, hardsigmoid (#44997)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44997

Reviewed By: izdeby

Differential Revision: D24547748

Pulled By: ngimel

fbshipit-source-id: 34639dfe6ca41c3f59fd2af861e5e3b1bb86757a
2020-10-26 16:01:22 -07:00
Xiang Gao
99cf3b1ce4 CUDA BFloat16 signal windows (#45155)
Summary:
Looks like this op is never tested for the support of different dtypes?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45155

Reviewed By: zou3519

Differential Revision: D24438839

Pulled By: ngimel

fbshipit-source-id: 103ff609e11811a0705d04520c2b97c456b623ef
2020-10-26 15:53:30 -07:00
Alexander Grund
93719440b8 Replace map(lambda constructs (#46462)
Summary:
Follow-up of https://github.com/pytorch/pytorch/issues/46461 with a similar goal

Makes them more readable and possibly faster. Care has to be taken because `map` applies the function immediately while `(x for x in xs)` is a generator expression which gets evaluated later. This is a benefit in some cases where it is not required to actually create the list of values in memory (e.g. when passing to `tuple` or `extend` or `join`)
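
The deferred evaluation of a generator expression can be seen in a small sketch. The helper `record` and the variable names are hypothetical; the point is only that nothing runs until a consumer such as `tuple` or `join` iterates over the generator.

```python
calls = []


def record(x):
    """Double x and remember that we were called."""
    calls.append(x)
    return x * 2


# Creating the generator expression does not call record() yet.
gen = (record(x) for x in [1, 2, 3])
calls_before = list(calls)

# Consuming it with tuple() evaluates each element on demand,
# without building an intermediate list first.
result = tuple(gen)
calls_after = list(calls)

# join() can likewise consume a generator expression directly.
joined = ", ".join(str(x) for x in result)
```

Because evaluation is deferred, any variables the expression depends on must still hold the intended values at the time the generator is consumed.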

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46462

Reviewed By: zou3519

Differential Revision: D24422343

Pulled By: ezyang

fbshipit-source-id: 252e33499c92ac0b15238f2df32681dbbda2b237
2020-10-22 09:50:22 -07:00
Pearu Peterson
905ed3c840 Revised sparse tensor documentation. (#45400)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44635.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45400

Reviewed By: ezyang

Differential Revision: D24359410

Pulled By: mruberry

fbshipit-source-id: 37c691a49a7b0042c7a298e0ed1226702b097c8b
2020-10-22 02:07:54 -07:00
Xiao Wang
fe4f90c40b Cusolver inverse check info (#46625)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46557

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46625

Reviewed By: zou3519

Differential Revision: D24438577

Pulled By: ngimel

fbshipit-source-id: d00e6eb2eae4aa39ca6ecf5914fe9cf37c24b906
2020-10-21 21:46:33 -07:00
lixinyu
a651b876a7 preserve non-dense or overlapping tensor's layout in *_like functions (#46046)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46046

*_like functions are used in PyTorch to create a new tensor with the same shape as the input tensor, but we don’t always preserve the layout permutation of the tensor. Currently, the layout permutation is preserved only for a dense and non-overlapping tensor. For example, passing a channels-last contiguous tensor t with ‘shape/stride’ (2, 4, 3, 2)/(24, 1, 8, 4) to empty_like(t) creates a new tensor with exactly the same ‘shape/stride’ as the input tensor t. However, if the input tensor is non-dense or has overlap, we simply create a contiguous tensor based on the input tensor’s shape, so the tensor layout permutation is lost.

This PR preserves the layout permutation for non-dense or overlapping tensors. The stride propagation rule used in this PR is exactly the same as the one used in TensorIterator. The behavior changes are listed below:

| code | old | new |
|---|---|---|
| #strided tensors<br>a=torch.randn(2,3,8)[:,:,::2].permute(2,0,1)<br>print(a.stride())<br>print(a.exp().stride())<br>print((a+a).stride())<br>out = torch.empty(0)<br>torch.add(a,a,out=out)<br>print(out.stride()) | (2, 24, 8)<br>(6, 3, 1)<br>(1, 12, 4)<br>(6, 3, 1) | (2, 24, 8)<br>(1, 12, 4)<br>(1, 12, 4)<br>(1, 12, 4) |
| #memory dense tensors<br>a=torch.randn(3,1,1).as_strided((3,1,1), (1,3,3))<br>print(a.stride(), (a+torch.randn(1)).stride())<br>a=torch.randn(2,3,4).permute(2,0,1)<br>print(a.stride())<br>print(a.exp().stride())<br>print((a+a).stride())<br>out = torch.empty(0)<br>torch.add(a,a,out=out)<br>print(out.stride()) | (1, 3, 3) (1, 1, 1)<br>(1, 12, 4)<br>(6, 3, 1)<br>(1, 12, 4)<br>(6, 3, 1) | (1, 3, 3) (1, 3, 3)<br>(1, 12, 4)<br>(1, 12, 4)<br>(1, 12, 4)<br>(1, 12, 4) |

This is to solve the non-dense tensor layout problem in #45505
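
The idea of preserving a layout permutation can be sketched in plain Python. This is a simplified illustration, not the TensorIterator code: the function name `like_strides` is hypothetical, and the sketch assumes that ordering dimensions by ascending stride (with a stable sort for ties) is enough, which ignores the special handling the real rule applies to broadcast and ambiguous dimensions.

```python
def like_strides(shape, strides):
    """Return dense strides that keep the input's dimension ordering.

    Simplified sketch: dimensions are visited from the fastest-varying
    (smallest stride) to the slowest, and each one receives the next
    dense step. sorted() is stable, so tied strides keep their order.
    """
    order = sorted(range(len(shape)), key=lambda d: strides[d])
    new_strides = [0] * len(shape)
    step = 1
    for d in order:
        new_strides[d] = step
        step *= shape[d]
    return tuple(new_strides)


# Non-dense input from the table: randn(2,3,8)[:,:,::2].permute(2,0,1)
non_dense = like_strides((4, 2, 3), (2, 24, 8))

# Channels-last contiguous input from the description above.
channels_last = like_strides((2, 4, 3, 2), (24, 1, 8, 4))
```

Under these assumptions the non-dense input yields `(1, 12, 4)`, matching the "new" column of the table, and the channels-last input keeps its strides unchanged.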

TODO:
- [x] Fix all the BC broken test cases in pytorch
- [ ] Investigate if any fb internal tests are broken

This change will cover all kinds of non-dense tensors.

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D24288970

Pulled By: glaringlee

fbshipit-source-id: 320fd4e0d1a810a12abfb1441472298c983a368d
2020-10-20 19:49:49 -07:00
Kurt Mohler
e6ed887908 Add view test for tensor_split (#46427)
Summary:
Fulfills Mike's suggestion here: https://github.com/pytorch/pytorch/pull/44868#discussion_r505095018

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46427

Reviewed By: ezyang

Differential Revision: D24355107

Pulled By: mruberry

fbshipit-source-id: bddef2f9c2c41b5c5ac47a17d5ecdda580072e99
2020-10-20 09:56:37 -07:00
Alexander Grund
5b0f400488 Replace list(map(...)) constructs by list comprehensions (#46461)
Summary:
As discussed in https://github.com/pytorch/pytorch/issues/46392 this makes the code more readable and possibly more performant.

It also fixes a bug detected by this where the argument order of `map` was confused: 030a24906e (diff-5bb26bd3a23ee3bb540aeadcc0385df2a4e48de39f87ed9ea76b21990738fe98L1537-R1537)

Fixes https://github.com/pytorch/pytorch/issues/46392

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46461

Reviewed By: ailzhang

Differential Revision: D24367015

Pulled By: ezyang

fbshipit-source-id: d55a67933cc22346b00544c9671f09982ad920e7
2020-10-19 18:42:49 -07:00
Ailing Zhang
8c629ecc9a [WIP] Move catchAll to Math (#45939)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45939

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D24165890

Pulled By: ailzhang

fbshipit-source-id: 72fe71ea95a738251b2fafc9eea4ab3831cf426b
2020-10-16 16:17:16 -07:00
Nikita Vedeneev
9300a27702 Make torch.lu support complex input on CUDA. (#45898)
Summary:
As per title. LU decomposition is used for computing determinants, and I need this functionality to implement the matrix square root. Next PR on my list is to enable `torch.det` on CUDA with complex input.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45898

Reviewed By: heitorschueroff

Differential Revision: D24306951

Pulled By: anjali411

fbshipit-source-id: 168f578fe65ae1b978617a66741aa27e72b2172b
2020-10-16 10:29:39 -07:00
Jane Xu
c99378af1b Fixing pow for special case between cuda tensors and cpu tensors and reframed test cases a tiny bit (#46320)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46037

I now isolated the special case to be only between cuda tensor bases and cpu tensor exponents. My previous fix was not a complete fix--it fixed some stuff but broke others. The current fix is a more complete fix:
```
In [1]: import torch
In [2]: a=torch.randn(3)
In [3]: b=torch.tensor(2, device="cuda")
In [4]: torch.pow(a,b) #should not work and throws exception now!

In [5]: a=torch.tensor(3, device="cuda")
In [6]: b=torch.tensor(2)
In [7]: torch.pow(a,b) #should work, and now does

In [8]: a=torch.randn(3, device="cuda")
In [9]: torch.pow(a,b) # yeah, that one is fixed and still works
```

To add a test case to reflect the change, I had to modify the existing setup a little bit. I think it is an improvement but would appreciate any tips on how to make it better!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46320

Reviewed By: malfet

Differential Revision: D24306610

Pulled By: janeyx99

fbshipit-source-id: cc74c61373d1adc2892a7a31226f38895b83066a
2020-10-15 13:43:47 -07:00
Ivan Yashchuk
c1141b6f68 Added support for complex torch.pinverse (#45819)
Summary:
This PR adds support for complex-valued input for `torch.pinverse`.
Fixed cuda SVD implementation to return singular values with real dtype.

Fixes https://github.com/pytorch/pytorch/issues/45385.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45819

Reviewed By: heitorschueroff

Differential Revision: D24306539

Pulled By: anjali411

fbshipit-source-id: 2fe19bc630de528e0643132689e1bc5ffeaa162a
2020-10-15 12:28:22 -07:00
Xiang Gao
5ce46fbbca BFloat16 support for torch.sign (#45244)
Summary:
Added BF16 support for torch.sign on CUDA

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45244

Reviewed By: zou3519

Differential Revision: D23932304

Pulled By: izdeby

fbshipit-source-id: e50b9510ecf2337ec0288392d6950046116b2599
2020-10-15 12:23:14 -07:00
Jane Xu
ad376f1a62 trying to make pow work for tensor raised to the power of a scalar (#46185)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46037

I'm not sure this is the most performant solution, but this works:

torch.pow(cuda_tensor, 5) should work and worked before.
torch.pow(cuda_tensor, torch.tensor(5)), should work **and works now!**
torch.pow(cuda_tensor, torch.tensor((5,))), should NOT work and complain the tensors are on different devices and indeed continues to complain.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46185

Reviewed By: glaringlee, malfet

Differential Revision: D24257687

Pulled By: janeyx99

fbshipit-source-id: 2daf235d62ec5886d7c153da05445c2ec71dec98
2020-10-13 10:14:36 -07:00
Erjia Guan
bed3b40523 Implement ravel (#46098)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46098

Doc:
![image](https://user-images.githubusercontent.com/68879799/95611323-ae5cf380-0a2f-11eb-9b8e-56bf79ce68af.png)

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D24253213

Pulled By: ejguan

fbshipit-source-id: 42a866c902272cbe3743a9d0cb3afb9165d51c0b
2020-10-12 16:00:44 -07:00
kshitij12345
a814231616 [fix] torch.kthvalue : handle non-contiguous CUDA tensor (#45802)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45721

TODO
* [x] Test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45802

Reviewed By: ngimel

Differential Revision: D24236706

Pulled By: mruberry

fbshipit-source-id: 5a51049233efa710f9500a6f7d099c90d43062c9
2020-10-11 20:13:08 -07:00
Kurt Mohler
a0a8bc8870 Fix mistakes and increase clarity of norm documentation (#42696)
Summary:
* Removes incorrect statement that "the vector norm will be applied to the last dimension".
* More clearly describe each different combination of `p`, `ord`, and input size.
* Moves norm tests from `test/test_torch.py` to `test/test_linalg.py`
* Adds test ensuring that `p='fro'` and `p=2` give same results for mutually valid inputs

Fixes https://github.com/pytorch/pytorch/issues/41388

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42696

Reviewed By: bwasti

Differential Revision: D23876862

Pulled By: mruberry

fbshipit-source-id: 36f33ccb6706d5fe13f6acf3de8ae14d7fbdff85
2020-10-10 14:12:43 -07:00
Nikita Shulga
f363a2e106 Mark top 3 slowest tests as slow (#46068)
Summary:
`TCPStoreTest.test_numkeys_delkeys` takes 5+ min (mostly in idle wait for socket timeout)
`TestDataLoader.test_proper_exit` and `TestDataLoaderPersistentWorkers.test_proper_exit` take 2.5 min each
`TestXNNPACKConv1dTransformPass.test_conv1d_with_relu_fc` takes 2 min to finish

Add option to skip reporting test classes that run for less than a second to `print_test_stats.py` and speed up `TestTorchDeviceTypeCUDA.test_matmul_45724_cuda`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46068

Reviewed By: mruberry

Differential Revision: D24208660

Pulled By: malfet

fbshipit-source-id: 780e0d8be4f0cf69ea28de79e423291a1f3349b7
2020-10-08 21:10:03 -07:00
Ivan Yashchuk
f010df35e5 Added CUDA support for complex input for QR decomposition (#45032)
Summary:
QR decomposition now works for complex inputs on GPU.

Ref. https://github.com/pytorch/pytorch/issues/33152

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45032

Reviewed By: ailzhang

Differential Revision: D24199105

Pulled By: anjali411

fbshipit-source-id: 249552b31fd713446e609b66e508ac54b817b98e
2020-10-08 13:24:21 -07:00
Heitor Schueroff de Souza
636eb18029 Fixed median nan propagation and implemented nanmedian (#45847)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45847

Original PR here https://github.com/pytorch/pytorch/pull/45084. Created this one because I was having problems with ghstack.

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D24136629

Pulled By: heitorschueroff

fbshipit-source-id: dd7c7540a33f6a19e1ad70ba2479d5de44abbdf9
2020-10-08 11:20:21 -07:00
Kurt Mohler
ef4817fe5a Add tensor_split function, based on numpy.array_split (#45168)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/9382

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45168

Reviewed By: ngimel

Differential Revision: D24166164

Pulled By: mruberry

fbshipit-source-id: 795459821e52885bc99623a01a2abec060995ce6
2020-10-07 23:14:48 -07:00
Xiang Gao
b2bff9e431 Workaround for cublas bug for 45724 (#46001)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45724

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46001

Reviewed By: mruberry

Differential Revision: D24184058

Pulled By: ngimel

fbshipit-source-id: 7d2bab3206ddbc10a7cae3efd9b5e253f38400a9
2020-10-07 22:38:19 -07:00
Your Name
c59c4b0d77 Fix cholesky TF32 tests (#45492)
Summary:
This test was changed one day before the TF32 tests PR landed, so the fix for it was not included in that PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45492

Reviewed By: ezyang

Differential Revision: D24101876

Pulled By: ngimel

fbshipit-source-id: cb3615b2fb8acf17abe54cd18b1faec26582d6b6
2020-10-07 20:42:06 -07:00
Xiang Gao
903acc6b83 CUDA BFloat16 support of clamp, remainder, lshift, rshift (#45247)
Summary:
Add CUDA BFloat16 support of clamp, remainder, lshift, rshift

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45247

Reviewed By: dzhulgakov

Differential Revision: D24174258

Pulled By: ngimel

fbshipit-source-id: bfcd2d1b3746bb0527d590533f3c38b9c4d0a638
2020-10-07 20:37:06 -07:00
Vaidotas Simkus
e154b36685 Standardized clamp kernels to Numpy-like implementation (#43288)
Summary:
**BC-breaking note**

For ease of exposition let a_min be the value of the "min" argument to clamp, and a_max be the value of the "max" argument to clamp.

This PR changes the behavior of torch.clamp to always compute min(max(a, a_min), a_max). torch.clamp currently computes this in its vectorized CPU specializations:

78b95b6204/aten/src/ATen/cpu/vec256/vec256_double.h (L304)

but in other places it clamps differently:

78b95b6204/aten/src/ATen/cpu/vec256/vec256_base.h (L624)

78b95b6204/aten/src/ATen/native/cuda/UnaryOpsKernel.cu (L160)

These implementations are the same when a_min < a_max, but divergent when a_min > a_max. This divergence is easily triggered:

```
t = torch.arange(200).to(torch.float)
torch.clamp(t, 4, 2)[0]
: tensor(2.)

torch.clamp(t.cuda(), 4, 2)[0]
: tensor(4., device='cuda:0')

torch.clamp(torch.tensor(0), 4, 2)
: tensor(4)
```

This PR makes the behavior consistent with NumPy's clip. C++'s std::clamp's behavior is undefined when a_min > a_max, but Clang's std::clamp will return 10 in this case (although the program, per the above comment, is in error). Python has no standard clamp implementation.

**PR Summary**

Fixes the discrepancy between the AVX, CUDA, and base vector implementations of clamp, so that all implementations are consistent and use the `min(max_vec, max(min_vec, x))` formula, making them equivalent to numpy.clip.
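
The difference between the two orderings can be sketched with scalar Python values. This is only an illustration of the formulas named above, not the kernel code; the function names are hypothetical.

```python
def clamp_min_of_max(a, a_min, a_max):
    """Ordering adopted by this PR: min(max(a, a_min), a_max)."""
    return min(max(a, a_min), a_max)


def clamp_max_of_min(a, a_min, a_max):
    """The other ordering: max(min(a, a_max), a_min)."""
    return max(min(a, a_max), a_min)


# Both orderings agree whenever a_min <= a_max.
agree = [clamp_min_of_max(v, 2, 4) == clamp_max_of_min(v, 2, 4) for v in range(7)]

# They diverge when a_min > a_max, as in the clamp(t, 4, 2) example above.
new_result = clamp_min_of_max(0, 4, 2)
other_result = clamp_max_of_min(0, 4, 2)
```

With `a_min=4` and `a_max=2`, the first ordering gives 2 and the second gives 4, which mirrors the CPU/CUDA mismatch shown in the reproduction snippet.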

The same fix as in https://github.com/pytorch/pytorch/issues/32587 but isolated to the kernel change only, so that the internal team can benchmark.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43288

Reviewed By: colesbury

Differential Revision: D24079453

Pulled By: mruberry

fbshipit-source-id: 67f30d2f2c86bbd3e87080b32f00e8fb131a53f7
2020-10-06 13:42:08 -07:00
KyleCZH
a9a9d0b181 Rocm skip test cases (#45782)
Summary:
Skip the following test cases for rocm (When PYTORCH_TEST_WITH_ROCM=1):
- test_reference_numerics_tan_cuda_float64 (__main__.TestUnaryUfuncsCUDA)
- test_addmv_cuda_float16 (__main__.TestTorchDeviceTypeCUDA)
- test_logspace_cuda_float64 (__main__.TestTensorCreationCUDA)
- test_gloo_backend_2gpu_module (__main__.DistributedDataParallelTest)
jeffdaily
pruthvistony

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45782

Reviewed By: VitalyFedyunin

Differential Revision: D24115581

Pulled By: xw285cornell

fbshipit-source-id: 4043a9fa19e242301b5007813c15b6b3873889c5
2020-10-05 15:12:25 -07:00
Xiang Gao
e1ff46b6e5 CUDA BFloat16 TopK (#44755)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44755

Reviewed By: mruberry

Differential Revision: D23741680

Pulled By: ngimel

fbshipit-source-id: 8fce92a26663336bcb831c72202fe2623a2ddaf0
2020-10-04 11:38:00 -07:00
Nikita Shulga
3a27fc966a Test torch.svd using complex float and double numbers (take 2) (#45795)
Summary:
Adds support for magmaSvd for complex numbers

Fixes use-after-free error in `apply_symeig`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45795

Reviewed By: ezyang

Differential Revision: D24096955

Pulled By: malfet

fbshipit-source-id: 0d8d8492f89fe722bbd5aed3528f244245b496d0
2020-10-03 11:33:28 -07:00
Nikita Shulga
5a47a2126d Revert D24018160: [pytorch][PR] Test torch.svd using complex float and double numbers
Test Plan: revert-hammer

Differential Revision:
D24018160 (888f3c12e7)

Original commit changeset: 1b6103f5af94

fbshipit-source-id: 3040250db25995fc0d41fd0f497550dded43cad9
2020-10-02 13:33:11 -07:00
Nikita Shulga
888f3c12e7 Test torch.svd using complex float and double numbers (#45572)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45572

Reviewed By: anjali411

Differential Revision: D24018160

Pulled By: malfet

fbshipit-source-id: 1b6103f5af94e9f74b73ed23aa02c0236b199b34
2020-10-02 08:29:14 -07:00
Ivan Yashchuk
77cd8e006b Added support for complex torch.symeig (#45121)
Summary:
This PR adds support for complex-valued input for `torch.symeig`.

TODO:
- [ ] complex cuda tests raise `RuntimeError: _th_bmm_out not supported on CUDAType for ComplexFloat`
Update: Added xfailing tests for complex dtypes on CUDA. Once support for complex `bmm` is added these tests will work.

Fixes https://github.com/pytorch/pytorch/issues/45061.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45121

Reviewed By: mrshenli

Differential Revision: D24049649

Pulled By: anjali411

fbshipit-source-id: 2cd11f0e47d37c6ad96ec786762f2da57f25dac5
2020-10-01 08:57:13 -07:00
Nikita Shulga
c87ff2cb90 Enable transposed tensor copy for complex types (#45487)
Summary:
This enables a special copy operator for transposed tensors with more than 360 elements:
417e3f85e5/aten/src/ATen/native/Copy.cpp (L19)

Steps to repro: python -c "import torch; print(torch.svd(torch.randn(61, 61, dtype=torch.complex64)))"

Fixes https://github.com/pytorch/pytorch/issues/45269

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45487

Reviewed By: anjali411

Differential Revision: D23984441

Pulled By: malfet

fbshipit-source-id: 10ce1d5f4425fb6de78e96adffd119e545b6624f
2020-09-29 19:22:05 -07:00
Mike Ruberry
b66ac1e928 Updates nonzero's as_tuple behavior to no longer warn. (#45413)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44284.

[torch.nonzero](https://pytorch.org/docs/master/generated/torch.nonzero.html?highlight=nonzero#torch.nonzero) is distinct from [numpy.nonzero](https://numpy.org/doc/1.18/reference/generated/numpy.nonzero.html?highlight=nonzero#numpy.nonzero). The former returns a tensor by default, and the latter returns a tuple of arrays. The `as_tuple` argument was added as part of an intended deprecation process to make torch.nonzero consistent with numpy.nonzero, but this was a confusing change for users. A better deprecation path would be to offer torch.argwhere consistent with [numpy.argwhere](https://numpy.org/doc/stable/reference/generated/numpy.argwhere.html?highlight=argwhere#numpy.argwhere), which is equivalent to the default torch.nonzero behavior. Once this is offered, a change to torch.nonzero should be more straightforward with less user disruption, if we decide that's the correct change to pursue.
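
The two result layouts being discussed can be sketched on a nested list, without either library. The function names below are hypothetical stand-ins: `argwhere_layout` mimics one row of indices per nonzero element (the default `torch.nonzero` and `numpy.argwhere` layout), and `tuple_layout` mimics one index sequence per dimension (the `numpy.nonzero` and `as_tuple=True` layout).

```python
def argwhere_layout(matrix):
    """One (row, col) index pair per nonzero element."""
    return [
        (i, j)
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
        if value != 0
    ]


def tuple_layout(matrix):
    """One sequence of indices per dimension."""
    pairs = argwhere_layout(matrix)
    if not pairs:
        return ((), ())
    return tuple(zip(*pairs))


m = [[0, 1, 0],
     [2, 0, 3]]

per_element = argwhere_layout(m)
per_dimension = tuple_layout(m)
```

Here `per_element` is `[(0, 1), (1, 0), (1, 2)]` and `per_dimension` is `((0, 1, 1), (1, 0, 2))`: the same information, transposed.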

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45413

Reviewed By: ngimel

Differential Revision: D23975015

Pulled By: mruberry

fbshipit-source-id: b59237d0d8c2df984e952b62d0a7c247b49d84dc
2020-09-29 12:16:59 -07:00
Mike Ruberry
b2925671b6 Updates deterministic flag to throw a warning, makes docs consistent (#45410)
Summary:
Per feedback in the recent design review. Also tweaks the documentation to clarify what "deterministic" means and adds a test for the behavior.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45410

Reviewed By: ngimel

Differential Revision: D23974988

Pulled By: mruberry

fbshipit-source-id: e48307da9c90418fc6834fbd67b963ba2fe0ba9d
2020-09-29 11:17:33 -07:00
Hong Xu
15f85eea18 Support bfloat16 and complex dtypes for logical_not (#43537)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43537

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23751950

Pulled By: mruberry

fbshipit-source-id: d07ecd9aae263eb8e00928d4fc981e0d66066fbb
2020-09-29 11:00:05 -07:00
Mike Ruberry
6d37126a10 Makes rdiv consistent with div (#45407)
Summary:
In addition to making rdiv consistent with div, this PR significantly expands division testing, accounting for floor_divide actually performing truncation division, too.
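
The distinction between truncation and floor division mentioned above only shows up for negative quotients. A scalar sketch using the standard library (the helper names are hypothetical, and this is not the tensor implementation):

```python
import math


def trunc_divide(a, b):
    """Round the quotient toward zero (truncation division)."""
    return math.trunc(a / b)


def floor_divide(a, b):
    """Round the quotient toward negative infinity (true floor division)."""
    return math.floor(a / b)


positive_case = (trunc_divide(7, 2), floor_divide(7, 2))
negative_case = (trunc_divide(-7, 2), floor_divide(-7, 2))
```

For `7 / 2` both give 3, but for `-7 / 2` truncation gives -3 while floor gives -4, so tests that only use positive inputs cannot tell the two apart.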

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45407

Reviewed By: ngimel

Differential Revision: D23974967

Pulled By: mruberry

fbshipit-source-id: 82b46b07615603f161ab7cd1d3afaa6d886bfe95
2020-09-29 08:34:01 -07:00
Himangshu
7cde662f08 Add check for Complex Type to allow non integral alpha. (#45200)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45184

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45200

Reviewed By: gchanan

Differential Revision: D23940134

Pulled By: anjali411

fbshipit-source-id: cce7b1efc22ec189ba6c83e31ce712bb34997139
2020-09-29 07:36:46 -07:00
anjali411
534f2ae582 Disable inplace abs for complex tensors (#45069)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45069

`torch.abs` is a `C -> R` function for complex input. Following the general semantics in torch, the in-place version of abs should be disabled for complex input.

Test Plan: Imported from OSS

Reviewed By: glaringlee, malfet

Differential Revision: D23818397

Pulled By: anjali411

fbshipit-source-id: b23b8d0981c53ba0557018824d42ed37ec13d4e2
2020-09-28 20:33:35 -07:00
Xiong Wei
0c8a6008ac Fix torch.pow when the scalar base is a complex number (#45259)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43829

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45259

Reviewed By: gchanan

Differential Revision: D23962073

Pulled By: anjali411

fbshipit-source-id: 1b16afbb98f33fa7bc53c6ca296c5ddfcbdd2b72
2020-09-28 18:25:53 -07:00
Xiang Gao
36c3fbc9e3 CUDA BFloat Conv (non-cuDNN) (#45007)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45007

Reviewed By: zou3519

Differential Revision: D23933174

Pulled By: ngimel

fbshipit-source-id: 84eb028f09c9197993fb9981c0efb535014e5f78
2020-09-28 11:42:42 -07:00
Mike Ruberry
8bdbedd4ee Revert "Updates and simplifies nonzero as_tuple behavior"
This reverts commit 8b143771d0.
2020-09-27 20:58:42 -07:00
Mike Ruberry
8b143771d0 Updates and simplifies nonzero as_tuple behavior 2020-09-27 20:56:30 -07:00
Xiong Wei
241afc9188 Migrate addr from the TH to Aten (CPU) (#44364)
Summary:
Related https://github.com/pytorch/pytorch/issues/24507
Fixes https://github.com/pytorch/pytorch/issues/24666

This PR is to modernize the CPU implementation of the vector `outer product`.
The existing TH implementation of `torch.addr` is migrated to `aten`, as `torch.ger` uses the `addr` functions to calculate the outer product.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44364

Reviewed By: ezyang

Differential Revision: D23866733

Pulled By: mruberry

fbshipit-source-id: 5159ea22f0e3c991123fe7c19cc9beb6ad00301e
2020-09-25 01:18:09 -07:00
Gao, Xiang
3f5eee666c Adjust TF32 tests (#44240)
Summary:
- The thresholds of some tests are bumped up. Depending on the random generator, sometimes these tests fail with things like 0.0059 is not smaller than 0.005. I ran `test_nn.py` and `test_torch.py` for 10+ times to check these are no longer flaky.
- Add `tf32_on_and_off` to new `matrix_exp` tests.
- Disable TF32 on test suites other than `test_nn.py` and `test_torch.py`

cc: ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44240

Reviewed By: mruberry

Differential Revision: D23882498

Pulled By: ngimel

fbshipit-source-id: 44a9ec08802c93a2efaf4e01d7487222478b6df8
2020-09-24 10:25:58 -07:00
Hong Xu
b470fa4500 Add complex number support for binary logical operators (#43174)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43174

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23684425

Pulled By: mruberry

fbshipit-source-id: 4857b16e18ec4c65327136badd7f04c74e32d330
2020-09-23 23:03:00 -07:00
kshitij12345
0b6b735863 [fix] type promotion atan2 (#43466)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43360

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43466

Reviewed By: malfet

Differential Revision: D23834928

Pulled By: mruberry

fbshipit-source-id: 2e7e0b4fcf1a846efc171c275d65a6daffd3c631
2020-09-23 22:23:05 -07:00
Ailing Zhang
9db3871288 Update true_divide_out to use at::. (#45079)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45079

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D23821701

Pulled By: ailzhang

fbshipit-source-id: 562eac10faba7a503eda0029a0b026c1fb85fe1e
2020-09-23 10:50:48 -07:00
Ivan Yashchuk
5b20bf4fd9 Added support for complex input for Cholesky decomposition (#44895)
Summary:
Cholesky decomposition now works for complex inputs.

Fixes https://github.com/pytorch/pytorch/issues/44637.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44895

Reviewed By: ailzhang

Differential Revision: D23841583

Pulled By: anjali411

fbshipit-source-id: 3b1f34a7af17827884540696f8771a0d5b1df478
2020-09-23 08:25:56 -07:00
Xiang Gao
144dacd8d9 CUDA BFloat16 batched gemm (#45167)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45167

Reviewed By: mruberry

Differential Revision: D23860458

Pulled By: ngimel

fbshipit-source-id: 698de424a046963a30017b58d227fa510f85bf3f
2020-09-22 22:43:52 -07:00
Hong Xu
e2b40ce793 Support BFloat16 for binary logical operators on CUDA (#42485)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42485

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23684423

Pulled By: mruberry

fbshipit-source-id: edc2b46b726361d4c8bf8a4bf4e4a09197b20428
2020-09-22 11:42:34 -07:00
anjali411
58b6ab69e5 torch.sgn for complex tensors (#39955)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39955

resolves https://github.com/pytorch/pytorch/issues/36323 by adding `torch.sgn` for complex tensors.
`torch.sgn` returns `x/abs(x)` for `x != 0` and returns `0 + 0j` for `x==0`
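
The definition above can be checked on Python's built-in complex scalars. This is a sketch of the formula only, assuming nothing about the tensor kernel; the function name `sgn` here is a local helper, not `torch.sgn`.

```python
def sgn(z: complex) -> complex:
    """Return z / |z| for nonzero z, and 0 + 0j for z == 0."""
    if z == 0:
        return 0j
    return z / abs(z)


unit = sgn(3 + 4j)   # |3 + 4j| is 5, so this is approximately 0.6 + 0.8j
zero = sgn(0j)
```

For any nonzero input the result lies on the unit circle, which generalizes the real-valued sign to complex numbers.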

This PR doesn't test the correctness of the gradients. That will be done as part of a future audit of all the ops, once we decide on the autograd behavior (JAX vs TF) and add gradcheck.

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D23460526

Pulled By: anjali411

fbshipit-source-id: 70fc4e14e4d66196e27cf188e0422a335fc42f92
2020-09-22 08:24:53 -07:00
Gao, Xiang
dfb8f2d51f CUDA BFloat16 addmm, addmv (#44986)
Summary:
This PR was originally authored by slayton58. I took his implementation and added some tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44986

Reviewed By: mruberry

Differential Revision: D23806039

Pulled By: ngimel

fbshipit-source-id: 305d66029b426d8039fab3c3e011faf2bf87aead
2020-09-21 14:28:27 -07:00
Xiang Gao
581a364437 CUDA BFloat16 unary ops part 1 (#44813)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44813

Reviewed By: mruberry

Differential Revision: D23805816

Pulled By: ngimel

fbshipit-source-id: 28c645dc31f094c8b6c3d3803f0b4152f0475a64
2020-09-21 14:22:31 -07:00
Hong Xu
49db7b59e0 For logical tests, use the dtypes decorator (#42483)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42483

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23684424

Pulled By: mruberry

fbshipit-source-id: ba7ab5c3a6eaa0c16975728200f27d164ed4f852
2020-09-19 19:01:49 -07:00
Xiao Wang
d75c402755 Add cusolver to build, rewrite MAGMA inverse with cusolver (#42403)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42265

This PR adds cusolver to the pytorch build, and enables the use of cusolver/cublas library functions on GPU `torch.inverse` on certain tensor shapes.

Specifically, when

* the tensor is two dimensional (single batch), or
* has >2 dimensions (multiple batches) and `batch_size <= 2`, or
* magma is not linked,

cusolver/cublas will be used. In other conditions, the current implementation of MAGMA will still be used.
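
The dispatch conditions listed above can be restated as a small predicate. This is a hedged paraphrase of the prose, not the code in `BatchLinearAlgebra.cu`: the function and parameter names are hypothetical, and the sketch leaves out the CUDA 9.2 and non-NVIDIA exclusions described further down.

```python
def uses_cusolver_for_inverse(ndim: int, batch_size: int, magma_linked: bool) -> bool:
    """Return True when the description above routes torch.inverse to cusolver/cublas."""
    if not magma_linked:
        return True   # no MAGMA available, so cusolver/cublas is the only path
    if ndim == 2:
        return True   # single matrix
    return batch_size <= 2  # small batches; larger ones stay on MAGMA


single_matrix = uses_cusolver_for_inverse(ndim=2, batch_size=1, magma_linked=True)
small_batch = uses_cusolver_for_inverse(ndim=3, batch_size=2, magma_linked=True)
large_batch = uses_cusolver_for_inverse(ndim=3, batch_size=4, magma_linked=True)
no_magma = uses_cusolver_for_inverse(ndim=3, batch_size=4, magma_linked=False)
```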

8c0949ae45/aten/src/ATen/native/cuda/BatchLinearAlgebra.cu (L742-L752)

The reason for this is that for tensors with large batch_size, `cublasXgetrfBatched` and `cublasXgetriBatched` don't perform very well. For `batch_size > 1`, we launch cusolver functions in multiple streams. This lets the cusolver functions run in parallel and can greatly increase performance. When `batch_size > 2`, the parallel-launched cusolver functions are slightly slower than the current MAGMA implementation, so we still use the current MAGMA implementation.

On CUDA 9.2, there were some numerical issues detected, so cusolver impl will not be used. The cusolver impl will also not be used on platforms other than Nvidia CUDA.

060769feaf/aten/src/ATen/native/cuda/BatchLinearAlgebraLib.h (L10-L13)

Note that there is a new heuristic used before cusolver/cublas calls here:

8c0949ae45/aten/src/ATen/native/cuda/MiscUtils.h (L113-L121)

where `use_loop_launch = true` means launching single-batch cusolver functions in parallel, and `use_loop_launch = false` means using the cublas_X_batched functions. When magma is enabled (only `batch_size <= 2` will be dispatched to cusolver/cublas), the heuristic will always return `true`, and the cusolver calls are faster than small batch_size magma calls. When magma is disabled, this adds the functionality of `torch.inverse`, which was disabled before for all shapes (though cublas performance for large batch_size may not be as good as magma's).

Checklist:
- [X] Add benchmark, cpu, gpu-before (magma), gpu-after (cusolver)
- [X] Rewrite single inverse (ndim == 2) with cusolver
- [X] Rewrite batched inverse (ndim > 2) with cublas
- [X] Add cusolver to build
- [x] Clean up functions related to `USE_MAGMA` define guard
- [x] Workaround for non-cuda platform
- [x] Workaround for cuda 9.2
- [x] Add zero size check
- [x] Add tests

Next step:

If cusolver doesn't cause any problems in the pytorch build, and no major performance regressions are reported after this PR is merged, I will start porting other cusolver/cublas functions for linear algebra to improve performance.

<details>
<summary> benchmark 73499c6 </summary>

benchmark code: https://github.com/xwang233/code-snippet/blob/master/torch.inverse/inverse-cusolver.ipynb

shape meaning:

* `[] 2 torch.float32 -> torch.randn(2, 2, dtype=torch.float32)`
* `[2] 4 torch.float32 -> torch.randn(2, 4, 4, dtype=torch.float32)`

| shape | cpu_time (ms) | gpu_time_before (magma) (ms) | gpu_time_after (ms) |
| --- | --- | --- | --- |
| [] 2 torch.float32 |  0.095 |  7.534 |  0.129  |
| [] 4 torch.float32 |  0.009 |  7.522 |  0.129  |
| [] 8 torch.float32 |  0.011 |  7.647 |  0.138  |
| [] 16 torch.float32 |  0.075 |  7.582 |  0.135  |
| [] 32 torch.float32 |  0.073 |  7.573 |  0.191  |
| [] 64 torch.float32 |  0.134 |  7.694 |  0.288  |
| [] 128 torch.float32 |  0.398 |  8.073 |  0.491  |
| [] 256 torch.float32 |  1.054 |  11.860 |  1.074  |
| [] 512 torch.float32 |  5.218 |  14.130 |  2.582  |
| [] 1024 torch.float32 |  19.010 |  18.780 |  6.936  |
| [1] 2 torch.float32 |  0.009 |  0.113 |  0.128 ***regressed |
| [1] 4 torch.float32 |  0.009 |  0.113 |  0.131 ***regressed |
| [1] 8 torch.float32 |  0.011 |  0.116 |  0.129 ***regressed |
| [1] 16 torch.float32 |  0.015 |  0.122 |  0.135 ***regressed |
| [1] 32 torch.float32 |  0.032 |  0.177 |  0.178 ***regressed |
| [1] 64 torch.float32 |  0.070 |  0.420 |  0.281  |
| [1] 128 torch.float32 |  0.328 |  0.816 |  0.490  |
| [1] 256 torch.float32 |  1.125 |  1.690 |  1.084  |
| [1] 512 torch.float32 |  4.344 |  4.305 |  2.576  |
| [1] 1024 torch.float32 |  16.510 |  16.340 |  6.928  |
| [2] 2 torch.float32 |  0.009 |  0.113 |  0.186 ***regressed |
| [2] 4 torch.float32 |  0.011 |  0.115 |  0.184 ***regressed |
| [2] 8 torch.float32 |  0.012 |  0.114 |  0.184 ***regressed |
| [2] 16 torch.float32 |  0.019 |  0.119 |  0.173 ***regressed |
| [2] 32 torch.float32 |  0.050 |  0.170 |  0.240 ***regressed |
| [2] 64 torch.float32 |  0.120 |  0.429 |  0.375  |
| [2] 128 torch.float32 |  0.576 |  0.830 |  0.675  |
| [2] 256 torch.float32 |  2.021 |  1.748 |  1.451  |
| [2] 512 torch.float32 |  9.070 |  4.749 |  3.539  |
| [2] 1024 torch.float32 |  33.655 |  18.240 |  12.220  |
| [4] 2 torch.float32 |  0.009 |  0.112 |  0.318 ***regressed |
| [4] 4 torch.float32 |  0.010 |  0.115 |  0.319 ***regressed |
| [4] 8 torch.float32 |  0.013 |  0.115 |  0.320 ***regressed |
| [4] 16 torch.float32 |  0.027 |  0.120 |  0.331 ***regressed |
| [4] 32 torch.float32 |  0.085 |  0.173 |  0.385 ***regressed |
| [4] 64 torch.float32 |  0.221 |  0.431 |  0.646 ***regressed |
| [4] 128 torch.float32 |  1.102 |  0.834 |  1.055 ***regressed |
| [4] 256 torch.float32 |  4.042 |  1.811 |  2.054 ***regressed |
| [4] 512 torch.float32 |  18.390 |  4.884 |  5.087 ***regressed |
| [4] 1024 torch.float32 |  69.025 |  19.840 |  20.000 ***regressed |

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42403

Reviewed By: ailzhang, mruberry

Differential Revision: D23717984

Pulled By: ngimel

fbshipit-source-id: 54cbd9ea72a97989cff4127089938e8a8e29a72b
2020-09-18 20:43:29 -07:00