Commit Graph

1169 Commits

ashishfarmer
bcdff7eb67 Fix for tests on ROCm (#37616)
Summary:
This pull request fixes and re-enables two of the tests disabled in https://github.com/pytorch/pytorch/issues/37427
1. `test_sparse_add_out_bfloat16` in test_sparse.py is fixed to use the updated `atol` argument instead of `prec` for `assertEqual`.
2. The conversion of `flt_min` to `int64` diverges on HIP compared to numpy, so the change removes that conversion from the `test_float_to_int_conversion_finite` test case in test_torch.py.

cc: ezyang jeffdaily
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37616

Differential Revision: D21379876

Pulled By: ezyang

fbshipit-source-id: 2bfb41d67874383a01330c5d540ee516b3b07dcc
2020-05-04 07:16:54 -07:00
Pavel Belevich
b1790794f6 Enforce Tensor.random_ check that from and to are in tensor dtype bounds (#37507)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37507

Replace `TORCH_WARN` with `TORCH_CHECK` if `Tensor.random_()`'s `from` or `to-1` is out of bounds for the tensor's dtype. The previous warning said "This warning will become an error in version 1.6 release, please fix the code in advance", so the time has come.

Related to #33106
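
For illustration, a minimal sketch of the enforced behavior on a `uint8` tensor (whose dtype can only hold values in [0, 255]); the exact error text may differ:

```python
import torch

t = torch.empty(4, dtype=torch.uint8)
t.random_(0, 256)        # fine: to - 1 == 255 still fits in uint8

try:
    t.random_(0, 1000)   # to - 1 == 999 is out of bounds for uint8
except RuntimeError as e:
    print(e)             # previously only a warning; now a hard error
```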

Test Plan: Imported from OSS

Differential Revision: D21349413

Pulled By: pbelevich

fbshipit-source-id: ac7c196a48fc58634611e427e65429a948119e40
2020-05-01 12:58:45 -07:00
anjali411
1f09f7ea44 Python API for Complex Storage and storage copy logic (#35771)
Summary:
Following up on https://github.com/pytorch/pytorch/pull/35851: cross-dtype storage copy is not used internally, so I have not included cross-dtype copy for complex.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35771

Differential Revision: D21319650

Pulled By: anjali411

fbshipit-source-id: 07c72996ee598eba0cf401ad61534494d6f5b5b3
2020-05-01 11:47:22 -07:00
kshitij12345
22708be5af Migrate tan from TH to ATen (CUDA) (#36906)
Summary:
Closes https://github.com/pytorch/pytorch/issues/24641

Benchmark with same build settings on same system.
gcc : version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
CUDA : 10.1
GPU : 1050ti

```python
import timeit

for n, t in [(10_000, 20000),
             (100_000, 20000)]:
    for dtype in ('torch.half', 'torch.float', 'torch.double'):
        print(f'torch.tan(a) a.numel() == {n} for {t} times {dtype}')
        print(timeit.timeit(f'torch.tan(a); torch.cuda.synchronize()',
                              setup=f'import torch; a=torch.arange({n}, dtype={dtype}, device="cuda")',
                              number=t))
```

Before:

```
torch.tan(a) a.numel() == 10000 for 20000 times torch.half
0.28325206200003095
torch.tan(a) a.numel() == 10000 for 20000 times torch.float
0.28363607099998944
torch.tan(a) a.numel() == 10000 for 20000 times torch.double
0.43924326799998425
torch.tan(a) a.numel() == 100000 for 20000 times torch.half
0.3754699589999859
torch.tan(a) a.numel() == 100000 for 20000 times torch.float
0.38143782899999223
torch.tan(a) a.numel() == 100000 for 20000 times torch.double
1.7672172019999834
```

After:

```
torch.tan(a) a.numel() == 10000 for 20000 times torch.half
0.28982524599996395
torch.tan(a) a.numel() == 10000 for 20000 times torch.float
0.29121579000002384
torch.tan(a) a.numel() == 10000 for 20000 times torch.double
0.4599610559998837
torch.tan(a) a.numel() == 100000 for 20000 times torch.half
0.3557764019997194
torch.tan(a) a.numel() == 100000 for 20000 times torch.float
0.34793807599999127
torch.tan(a) a.numel() == 100000 for 20000 times torch.double
1.7564662459999454
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36906

Differential Revision: D21335320

Pulled By: VitalyFedyunin

fbshipit-source-id: efab9c175c60fb09223105380d48b93a81994fb0
2020-05-01 10:17:19 -07:00
Hong Xu
cd48fb5030 Vectorize linspace on CPU. (#27957)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27957

Benchmark (gcc 8.3, Debian Buster, turbo off, Release build, Intel(R) Xeon(R) E-2136):

```python
import timeit
for dtype in ('torch.double', 'torch.float', 'torch.uint8', 'torch.int8', 'torch.int16', 'torch.int32', 'torch.int64'):
    for n, t in [(40_000, 50000),
                (400_000, 5000)]:
        print(f'torch.linspace(0, 10, {n}, dtype={dtype}) for {t} times')
        print(timeit.timeit(f'torch.linspace(0, 10, {n}, dtype={dtype})', setup=f'import torch', number=t))
```

Before:

```
torch.linspace(0, 10, 40000, dtype=torch.double) for 50000 times
1.3964195849839598
torch.linspace(0, 10, 400000, dtype=torch.double) for 5000 times
1.2374563289922662
torch.linspace(0, 10, 40000, dtype=torch.float) for 50000 times
1.8631796519621275
torch.linspace(0, 10, 400000, dtype=torch.float) for 5000 times
1.6991038109990768
torch.linspace(0, 10, 40000, dtype=torch.uint8) for 50000 times
1.8358083459897898
torch.linspace(0, 10, 400000, dtype=torch.uint8) for 5000 times
1.7214750979910605
torch.linspace(0, 10, 40000, dtype=torch.int8) for 50000 times
1.8356257299892604
torch.linspace(0, 10, 400000, dtype=torch.int8) for 5000 times
1.706238206999842
torch.linspace(0, 10, 40000, dtype=torch.int16) for 50000 times
1.7463878280250356
torch.linspace(0, 10, 400000, dtype=torch.int16) for 5000 times
1.6172360889613628
torch.linspace(0, 10, 40000, dtype=torch.int32) for 50000 times
1.8656846070080064
torch.linspace(0, 10, 400000, dtype=torch.int32) for 5000 times
1.714238062966615
torch.linspace(0, 10, 40000, dtype=torch.int64) for 50000 times
1.8272205490502529
torch.linspace(0, 10, 400000, dtype=torch.int64) for 5000 times
1.6409171230043285
```

After:

```
torch.linspace(0, 10, 40000, dtype=torch.double) for 50000 times
1.0077099470072426
torch.linspace(0, 10, 400000, dtype=torch.double) for 5000 times
0.8227124120458029
torch.linspace(0, 10, 40000, dtype=torch.float) for 50000 times
1.0058343949494883
torch.linspace(0, 10, 400000, dtype=torch.float) for 5000 times
0.8376779520185664
torch.linspace(0, 10, 40000, dtype=torch.uint8) for 50000 times
1.903041019977536
torch.linspace(0, 10, 400000, dtype=torch.uint8) for 5000 times
1.7576498500420712
torch.linspace(0, 10, 40000, dtype=torch.int8) for 50000 times
1.7628699769848026
torch.linspace(0, 10, 400000, dtype=torch.int8) for 5000 times
1.6204477970022708
torch.linspace(0, 10, 40000, dtype=torch.int16) for 50000 times
2.0970272019621916
torch.linspace(0, 10, 400000, dtype=torch.int16) for 5000 times
1.9493417189805768
torch.linspace(0, 10, 40000, dtype=torch.int32) for 50000 times
2.29020385700278
torch.linspace(0, 10, 400000, dtype=torch.int32) for 5000 times
2.1212510910118
torch.linspace(0, 10, 40000, dtype=torch.int64) for 50000 times
2.3479344319785014
torch.linspace(0, 10, 400000, dtype=torch.int64) for 5000 times
2.156775983981788
```

Test Plan: Imported from OSS

Differential Revision: D20773454

Pulled By: VitalyFedyunin

fbshipit-source-id: ebeef59a90edde581669cc2afcc3d65929c8ac79
2020-04-30 14:26:24 -07:00
kshitij12345
7e9cc4df85 Migrate cos and cos_ from TH to ATen (CUDA) (#36653)
Summary:
Benchmark with same build settings on same system.

Closes https://github.com/pytorch/pytorch/issues/24545
gcc : version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
CUDA : 10.1
GPU : 1050ti

```python
import timeit

for n, t in [(10_000, 20000),
             (100_000, 20000)]:
    for dtype in ('torch.half', 'torch.float', 'torch.double'):
        print(f'torch.cos(a) a.numel() == {n} for {t} times {dtype}')
        print(timeit.timeit(f'torch.cos(a); torch.cuda.synchronize()',
                             setup=f'import torch; a=torch.arange({n}, dtype={dtype}, device="cuda")',
                             number=t))
```

Before:

```
torch.cos(a) a.numel() == 10000 for 20000 times torch.half
0.2797315450006863
torch.cos(a) a.numel() == 10000 for 20000 times torch.float
0.283109110998339
torch.cos(a) a.numel() == 10000 for 20000 times torch.double
0.3648525129974587
torch.cos(a) a.numel() == 100000 for 20000 times torch.half
0.34239949499897193
torch.cos(a) a.numel() == 100000 for 20000 times torch.float
0.33680364199972246
torch.cos(a) a.numel() == 100000 for 20000 times torch.double
1.0512770260102116
```

After:

```
torch.cos(a) a.numel() == 10000 for 20000 times torch.half
0.285825898999974
torch.cos(a) a.numel() == 10000 for 20000 times torch.float
0.2781305120001889
torch.cos(a) a.numel() == 10000 for 20000 times torch.double
0.34188826099989456
torch.cos(a) a.numel() == 100000 for 20000 times torch.half
0.29040409300023384
torch.cos(a) a.numel() == 100000 for 20000 times torch.float
0.28678944200009937
torch.cos(a) a.numel() == 100000 for 20000 times torch.double
1.065477349000048
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36653

Differential Revision: D21164675

Pulled By: VitalyFedyunin

fbshipit-source-id: 5dd5d3af47c2a5527e1f4ab7669c2ed9a2293cee
2020-04-29 15:52:24 -07:00
Jesse Brizzi
bca82801e7 add support for generating Vandermonde matrices (#36725)
Summary:
Adds support for generating Vandermonde matrices based on the NumPy implementation found [here](https://github.com/numpy/numpy/blob/v1.17.0/numpy/lib/twodim_base.py#L475-L563).

Adds a test to ensure the generated matrix matches the NumPy implementation. Note that the tests are limited to torch.long and torch.double due to differences in how PyTorch and NumPy handle type promotion.
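
A short usage sketch, assuming the new op is exposed as `torch.vander` and mirrors `numpy.vander`'s `N`/`increasing` arguments:

```python
import torch
import numpy as np

x = torch.tensor([1, 2, 3, 5], dtype=torch.long)
print(torch.vander(x, N=3, increasing=True))
# Should match the NumPy reference implementation:
print(np.vander(x.numpy(), N=3, increasing=True))
```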
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36725

Differential Revision: D21075138

Pulled By: jessebrizzi

fbshipit-source-id: 6bb1559e8247945714469b0e2b07c6f4d5fd1fd0
2020-04-29 13:16:26 -07:00
Nikita Shulga
1bb66a0cd4 Extend some of the basic ops to kHalf (#37121)
Summary:
Added enough operators to make sure that all unit tests from ATen/basic are passing, except for MM and IntArrayRefExpansion
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37121

Test Plan: `./bin/basic --gtest_filter=BasicTest.BasicTestHalfCPU` + `python -c "import torch; x = torch.tensor([2], dtype=torch.half); print(torch.isfinite(x+x))"`

Differential Revision: D21296863

Pulled By: malfet

fbshipit-source-id: e03d7a6939df11f611a9b317543bac52403cd009
2020-04-29 10:49:16 -07:00
ashishfarmer
bbd2350c99 Disable tests failing on test2 in ROCm CI (#37427)
Summary:
This pull request disables the unit tests that were observed to be failing once `test2` was enabled. These tests will be looked at and fixed one by one as soon as possible, but until then they are disabled to unblock `test2`.
The pull request also disables fftPlanDestroy for rocFFT to avoid double-freeing FFT handles

cc: ezyang jeffdaily
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37427

Differential Revision: D21302909

Pulled By: ezyang

fbshipit-source-id: ecadda3778e65b7f4f97e24b932b96b9ce928616
2020-04-29 09:56:28 -07:00
Pavel Belevich
ec8517b6df Move exponential_() to DistributionTemplates (#37456)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37456

Fixes #37370

Test Plan: Imported from OSS

Differential Revision: D21290781

Pulled By: pbelevich

fbshipit-source-id: 2f516b5112b9ce1c9ba8967b3758decf86d65676
2020-04-29 08:07:35 -07:00
Pavel Belevich
06168bf17d Move geometric_() to DistributionTemplates (#37418)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37418

Fixes #37369

Test Plan: Imported from OSS

Differential Revision: D21290757

Pulled By: pbelevich

fbshipit-source-id: 42133f35edcbe716a07987bef2e68a4cdc27236a
2020-04-29 08:07:30 -07:00
Pavel Belevich
ce6077d7a8 Move log_normal_() to DistributionTemplates (#37392)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37392

Fixes #37368

Test Plan: Imported from OSS

Differential Revision: D21290740

Pulled By: pbelevich

fbshipit-source-id: 15a76b2625d2ca8187c25333a86eecd111a259c6
2020-04-29 08:06:05 -07:00
kshitij12345
4e3dc34c47 add complex support to reciprocal_cuda kernel (#36749)
Summary:
dylanbespalko anjali411

Not sure if the test should be added to `test_torch` or `test_complex`.
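
For reference, a minimal sketch of what the kernel enables (requires a CUDA device; printed formatting may vary):

```python
import torch

z = torch.tensor([1 + 1j, 2j], device='cuda')   # complex64 by default
print(torch.reciprocal(z))                       # ~ tensor([0.5-0.5j, 0.0-0.5j], device='cuda:0')
```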
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36749

Differential Revision: D21290529

Pulled By: anjali411

fbshipit-source-id: 07bc282e4c9480cd015ec5db104e79728437cd90
2020-04-28 21:51:46 -07:00
Emilio Castillo
273c464145 Fix TensorIterator::view_offsets_ size (#37214)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/37084

There are three alternatives for this design; this PR implements the first one.

When a tensor is a scalar (`ndim==0`), accessing `view_offsets_[0]` during reductions yields an invalid offset for the index that is the output of `argmax` and `argmin`.

fba9b9a023/aten/src/ATen/native/cpu/Reduce.h (L217)

This also happens in cuda code:
fba9b9a023/aten/src/ATen/native/cuda/Reduce.cuh (L797)

The second alternative is to check the size of `view_offsets` before accessing it. But this introduces some burden.

The third alternative is related to the way that inputs are treated in `argmax` and `argmin`
depending on the `dim` argument value.

fba9b9a023/aten/src/ATen/native/ReduceOps.cpp (L775-L780)

If `dim` is not specified, then the scalar gets reshaped into a 1-dim tensor and everything works properly, since now `view_offsets` has an actual entry.
If `dim` is specified, the input remains a scalar, causing the issue we see here.

This PR tries to solve it in a generic way for every case, so I went with option 1. I am willing to discuss it and change the approach if you think one of the other alternatives is better.
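
For reference, a minimal repro sketch of the failure mode in https://github.com/pytorch/pytorch/issues/37084 (a 0-dim tensor reduced with an explicit `dim`):

```python
import torch

t = torch.tensor(5.0)     # 0-dim (scalar) tensor, t.ndim == 0
print(t.argmax())         # dim omitted: the scalar is reshaped to 1-d internally -> tensor(0)
print(t.argmax(dim=0))    # dim given: the input stays 0-dim; this path read an invalid view offset
```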
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37214

Differential Revision: D21258320

Pulled By: ngimel

fbshipit-source-id: 46223412187bbba4bfa7337e3f1d2518db72dea2
2020-04-28 18:08:51 -07:00
anjali411
b8ec165c0d Fix failing test in test_torch.py (#37362)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37362

Differential Revision: D21264829

Pulled By: anjali411

fbshipit-source-id: cec6af84630378f03cb3863c85e161776af236cd
2020-04-27 16:42:11 -07:00
Mike Ruberry
b64fc3c4b5 Changes warnings generated in cpp to show point of Python origination (#36052)
Summary:
Today in PyTorch, warnings triggered in C++ are printed to Python users like this:

`../aten/src/ATen/native/BinaryOps.cpp:81: UserWarning: Integer division of tensors using div or / is deprecated, and in a future release div will perform true division as in Python 3. Use true_divide or floor_divide (// in Python) instead.`

This may be unhelpful to Python users, who have complained it's difficult to relate these messages back to their programs. After this PR, warnings that go through the PyWarningHandler and allow it to add context print like this:

```
test/test_torch.py:16463: UserWarning: Integer division of tensors using div or / is deprecated, and in a future release div will perform true division as in Python 3. Use true_divide or floor_divide (// in Python) instead. (Triggered internally at  ../aten/src/ATen/native/BinaryOps.cpp:81.)
  cpu_result = getattr(cpu_tensor, op_str)(*cpu_args)
```

This relates the warning back to the user's program. The information about the cpp file and line number is preserved in the body of the warning message.

Some warnings, like those generated in the JIT, already account for a user's Python context, and so they specify that they should be printed verbatim and are unaffected by this change. Warnings originating in Python and warnings that go through c10's warning handler, which prints to cerr, are also unaffected.

A test is added to test_torch.py for this behavior. The test relies on uint8 indexing being deprecated and its warning originating from its current header file, which is an unfortunate dependency. We could implement a `torch.warn` function, instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36052

Differential Revision: D20887740

Pulled By: mruberry

fbshipit-source-id: d3515c6658a387acb7fccaf83f23dbb452f02847
2020-04-25 21:18:58 -07:00
Xiang Gao
d7f7c290e3 addmv migration [resubmit] (#37236)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37236

Differential Revision: D21232988

Pulled By: anjali411

fbshipit-source-id: ac6c0ee018aef3c841b039d76e6e1fbb3cd0292d
2020-04-25 07:43:27 -07:00
anjali411
4f3946a89b Added complex dtypes to get_all_math_dtypes, complex acc type for cpu, fixed rdiv and pow for complex (#37193)
Summary:
Resolves https://github.com/pytorch/pytorch/issues/36730 https://github.com/pytorch/pytorch/issues/36057
Partially resolves: https://github.com/pytorch/pytorch/issues/36671
```
>>> 2j / torch.tensor([4], dtype = torch.complex64)
tensor([(0.0000+0.5000j)], dtype=torch.complex64)
>>> 1 / torch.tensor(3+4j)
tensor((0.1200-0.1600j), dtype=torch.complex64)
```
rdiv is more generally broken for all dtypes because it doesn't promote the types properly
eg.
```
>>> 1 / torch.tensor(2)
tensor(0)
>>> 2j / torch.tensor(4)
tensor(0)
```
so that issue should be fixed in a separate PR

Adding CPU acc types for complex
Added cumsum, cumprod for complex dtypes

Added complex dtypes to get_all_math_dtypes to expand testing for complex dtypes

Old PR - https://github.com/pytorch/pytorch/pull/36747
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37193

Differential Revision: D21229373

Pulled By: anjali411

fbshipit-source-id: 8a086136d8c10dabe62358d276331e3f22bb2342
2020-04-24 15:05:50 -07:00
Alexander Fix
2baff9476e Test test_is_nonzero make expected exception inline (#37128)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37128

In certain build modes (in fbcode, building a .par) the mechanism to get test output "expect" files doesn't work.
All other tests in test_torch.py already had assertExpectedInline instead of assertExpected, with the expected result inline in the file.
There was no equivalent for assertExpectedRaises, so I added one, and changed the tests for test_is_nonzero (the only test using this)
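
A hedged sketch of the resulting inline-expect pattern (the helper and import path are PyTorch's internal test utilities; exact names and signatures may differ across versions):

```python
import torch
from torch.testing._internal.common_utils import TestCase, run_tests

class TestIsNonzeroInline(TestCase):
    def test_is_nonzero_empty(self):
        # the expected exception text lives inline in the test file
        # instead of a separate .expect file
        self.assertExpectedRaisesInline(
            RuntimeError,
            lambda: torch.tensor([]).is_nonzero(),
            """Boolean value of Tensor with no values is ambiguous""",
        )

if __name__ == "__main__":
    run_tests()
```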

Test Plan: CI, specifically the test test_is_nonzero should pass

Reviewed By: malfet

Differential Revision: D21197651

fbshipit-source-id: 2a07079efdcf1f0b0abe60e92cadcf55d81d4b13
2020-04-24 13:12:31 -07:00
moto
5a27ec09b8 Add Inverse Short Time Fourier Transform in ATen native (#35569)
Summary:
Ported `torchaudio`'s implementation (tests and documentation as well) to ATen.

Note
 - Batch packing/unpacking is performed in Python. The ATen implementation expects a 4D input tensor.
 - `hop_length` is initialized the same way as in the `stft` implementation. [Torchaudio's version tried to mimic the same behavior but is slightly different](7da61a4bee/torchaudio/functional.py (L152-L157)).

Closes https://github.com/pytorch/pytorch/issues/34827
Relates https://github.com/pytorch/pytorch/issues/3775
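
A round-trip usage sketch (written against the current `torch.stft`/`torch.istft` signatures, which may differ slightly from the API at the time of this commit):

```python
import torch

x = torch.randn(1, 16000)                 # batch of one 1-second signal at 16 kHz
window = torch.hann_window(400)
spec = torch.stft(x, n_fft=400, hop_length=160, window=window, return_complex=True)
y = torch.istft(spec, n_fft=400, hop_length=160, window=window, length=x.shape[-1])
print(torch.allclose(x, y, atol=1e-4))    # True: istft reconstructs the signal up to numerical error
```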
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35569

Differential Revision: D21178090

Pulled By: mthrok

fbshipit-source-id: 2701a8b241a36a6fb1b740c2fb2b07cb938185d4
2020-04-24 12:14:55 -07:00
kshitij12345
e98cdfa26f Migrate tanh from TH to ATen (CUDA) (#36995)
Summary:
Closes https://github.com/pytorch/pytorch/issues/24642

Benchmark with same build settings on same system.
gcc : version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
CUDA : 10.1
GPU : 1050ti

```python
import timeit

for n, t in [(10_000, 20000),
             (100_000, 20000)]:
    for dtype in ('torch.half', 'torch.float', 'torch.double'):
        print(f'torch.tanh(a) a.numel() == {n} for {t} times {dtype}')
        print(timeit.timeit(f'torch.tanh(a); torch.cuda.synchronize()',
                              setup=f'import torch; a=torch.arange({n}, dtype={dtype}, device="cuda")',
                              number=t))
```

Before:

```
torch.tanh(a) a.numel() == 10000 for 20000 times torch.half
0.2816318240002147
torch.tanh(a) a.numel() == 10000 for 20000 times torch.float
0.2728829070001666
torch.tanh(a) a.numel() == 10000 for 20000 times torch.double
0.39797203200214426
torch.tanh(a) a.numel() == 100000 for 20000 times torch.half
0.3228214350019698
torch.tanh(a) a.numel() == 100000 for 20000 times torch.float
0.31780802399953245
torch.tanh(a) a.numel() == 100000 for 20000 times torch.double
1.3745740449994628
```

After:

```
torch.tanh(a) a.numel() == 10000 for 20000 times torch.half
0.27825374500025646
torch.tanh(a) a.numel() == 10000 for 20000 times torch.float
0.27764024499992956
torch.tanh(a) a.numel() == 10000 for 20000 times torch.double
0.3771585260001302
torch.tanh(a) a.numel() == 100000 for 20000 times torch.half
0.2995866400015075
torch.tanh(a) a.numel() == 100000 for 20000 times torch.float
0.28355561699936516
torch.tanh(a) a.numel() == 100000 for 20000 times torch.double
1.393811182002537
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36995

Differential Revision: D21163353

Pulled By: ngimel

fbshipit-source-id: e2216ff62cdfdd13b6a56daa63d4ef1440d991d4
2020-04-23 12:29:27 -07:00
Taylor Robie
7aec364bdf extend gather shape check to handle incorrectly sized outputs (#37102)
Summary:
Fixes a safety issue (Nonsense values and segfaults) introduced by https://github.com/pytorch/pytorch/pull/36875 when in-place gather tries to use incorrect shapes.

Consider the following block of code:
```python
import torch

k0 = 8
k1 = 8
m = 100

x = torch.rand((k0, k1))
ind = torch.randint(0, k0, (m, k1))
output = torch.empty((m, k1))

print(torch.gather(x, 0, ind, out=output))
print(torch.gather(x, 1, ind, out=output))
```

The first gather is legal, the second is not (`ind` and `output` would need to be transposed). Previously this was caught when the kernel tried to restride inputs for TensorIterator, but we can no longer rely on those checks and must test explicitly. If `m` is small, the second gather returns gibberish; if it is large enough to push the read out of the memory block, the program segfaults.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37102

Differential Revision: D21190580

Pulled By: robieta

fbshipit-source-id: 80175620d24ad3380d78995f7ec7dbf2627d2998
2020-04-23 11:47:01 -07:00
Anjali Chourdia
c306f2ed08 Revert D20660338: [pytorch][PR] Migrate addmv and mv from legacy to ATen native (CUDA & CPU)
Test Plan: revert-hammer

Differential Revision:
D20660338

Original commit changeset: db1f521f1241

fbshipit-source-id: 8616ddd7bbd8f00351cfc45331a09b0bc9aa28ea
2020-04-23 10:46:45 -07:00
Gao, Xiang
a38c6e0454 Migrate addmv and mv from legacy to ATen native (CUDA & CPU) (#30898)
Summary:
Fixes: https://github.com/pytorch/pytorch/issues/24605 https://github.com/pytorch/pytorch/issues/24535 https://github.com/pytorch/pytorch/issues/24739 https://github.com/pytorch/pytorch/issues/24680 https://github.com/pytorch/pytorch/issues/30986

This does not fix https://github.com/pytorch/pytorch/issues/29984, it will be fixed in later PR.

Most of this PR just follows the same logic as TH and THC, except for the handling of n-dimensional zero-sized tensors, specifically the case:
```
(m,).addmv((m, 0), (0,), beta, alpha)
```

# Legacy code bugs and how this PR deals with them

The above is a case where BLAS semantics often mismatch PyTorch's: for BLAS and cuBLAS, the above is a no-op, but for PyTorch it is a scalar-vector multiplication `output = beta * input`. The handling of this case is already very poor in the legacy code, and it is poorly tested:

For the CPU implementation, there are two code paths:
- Path 1: when dtype is float or double and `USE_BLAS`, then use BLAS
- Path 2: when other dtypes or not `USE_BLAS`, use a fallback kernel in PyTorch

For the CUDA implementation, there are also two code paths:
- Path 1: when float or double, then use `cublasSgemv` or `cublasDgemv` in cuBlas
- Path 2: when half, dispatch to `addmm`

`test_blas_alpha_beta_empty` is supposed to cover all cases, but unfortunately it only tests Path 1 of CUDA and Path 1 of CPU, and both uncovered paths (Path 2 for CPU and Path 2 for CUDA) are buggy in the legacy code. In this PR, I expanded the coverage of `test_blas_alpha_beta_empty`, but unfortunately I had to skip the `half` dtype on CUDA 9. See the description below for details:

## Bug in the CPU implementation

For the CPU implementation, the fallback kernel in Path 2 already has the same semantics as PyTorch, not BLAS. But the code that tries to correct BLAS semantics to match PyTorch also runs on this case, leading to a double correction: `output = beta * input` becomes `output = beta * beta * input`.

This leads to the issue https://github.com/pytorch/pytorch/issues/30986 I just opened, and it is fixed in this PR.

## Bug in the CUDA implementation

For the CUDA implementation, path 2 dispatches to
```
(m, 1).addmm((m, 0), (0, 1), beta, alpha)
```
But unfortunately, on some old CUDA versions with old GPUs and the half dtype, the above is also a no-op, which is definitely not correct.

But from what I see, on newer CUDA versions or newer GPUs this is not a problem. This is a PyTorch bug in `addmm`, so I opened a new issue https://github.com/pytorch/pytorch/issues/31006 to track it. It is most likely a dependency bug originating from cuBLAS, and it only affects a rarely used edge case on old hardware and software, so it would be a `won't_fix` unless real requirements strongly indicate that it should be fixed.

This issue already exists in the legacy code, and this PR does not make it worse. To prevent it from bothering us, I disabled the test of the `half` dtype for CUDA 9 when expanding the coverage of `test_blas_alpha_beta_empty`.

I promoted a CircleCI CUDA 10.1 test to `XImportant` so that it runs on PRs, because Path 2 of the CUDA implementation is only covered by this configuration. Let me know if I should revert this change.

## An additional problem

In the legacy `addmv` code, the `bfloat16` dtype is enabled and dispatched to `addmm`, but from what I tested `addmm` does not support `bfloat16`. I do the same thing in the new code. Let me know if I should do it differently.

# Benchmark

Code:
```python
import torch
print(torch.__version__)

for i in range(1000):
    torch.arange(i, device='cuda')

print('cpu')
for i in 10, 100, 1000, 10000:
    a = torch.randn((i,))
    b = torch.randn((i, i))
    c = torch.randn((i,))
    %timeit a.addmv(b, c, alpha=1, beta=2)

print('cuda')
for i in 10, 100, 1000, 10000:
    a = torch.randn((i,)).cuda()
    b = torch.randn((i, i)).cuda()
    c = torch.randn((i,)).cuda()
    torch.cuda.synchronize()
    %timeit a.addmv(b, c, alpha=1, beta=2); torch.cuda.synchronize()
```

Before:
```
1.5.0a0+2b45368
cpu
2.74 µs ± 30.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
8.5 µs ± 85.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
686 µs ± 2.95 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
74 ms ± 410 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
cuda
The slowest run took 4.81 times longer than the fastest. This could mean that an intermediate result is being cached.
27.6 µs ± 23 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
17.3 µs ± 151 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
20.5 µs ± 369 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
756 µs ± 6.81 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```

After:
```
1.5.0a0+66b4034
cpu
3.29 µs ± 20 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
9.09 µs ± 7.41 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
687 µs ± 7.01 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
73.8 ms ± 453 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
cuda
18.2 µs ± 478 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
17.7 µs ± 299 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
21.5 µs ± 2.38 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
751 µs ± 35.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30898

Differential Revision: D20660338

Pulled By: anjali411

fbshipit-source-id: db1f521f124198f63545064026f93fcb16b68f18
2020-04-23 06:56:49 -07:00
Alexander Fix
b889e0da8a [torch] Excluding test_fft_input_modification without MKL (#36680)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36680

If torch is compiled without MKL, this test fails because torch.fft requires MKL support.

Test Plan: CI

Reviewed By: malfet

Differential Revision: D21051362

fbshipit-source-id: dd2e2c7d323622c1c25fc4c817b85d83d2241b3a
2020-04-22 21:58:02 -07:00
Ailing Zhang
efcbcca454 Revert D21138687: [pytorch][PR] Added complex dtypes to get_all_math_dtypes, complex acc type for cpu, fixed rdiv and pow for complex
Test Plan: revert-hammer

Differential Revision:
D21138687

Original commit changeset: ad3602ccf86c

fbshipit-source-id: 69eb031c1a7c3d5e4b9f4241fbdada8d5980535d
2020-04-22 14:49:45 -07:00
Emilio Castillo
5fc391a646 Enforce type promotion in torch.cat (#35030)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/35014

The CUDA `cat` implementation doesn't use `TensorIterator`, so some checks need to be done manually in the code.
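
For reference, a small sketch of the promotion being enforced (shown on CPU for portability; the fix targets the CUDA path, so substitute `device='cuda'` to exercise it):

```python
import torch

a = torch.ones(2, dtype=torch.float32)
b = torch.ones(2, dtype=torch.float64)
print(torch.cat([a, b]).dtype)   # torch.float64: inputs are promoted to a common dtype
```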
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35030

Differential Revision: D21155853

Pulled By: nairbv

fbshipit-source-id: 9e78bb7591f806734e12555831157061c925ff40
2020-04-22 13:35:07 -07:00
kshitij12345
a00d6758b8 Migrate cosh and cosh_ from TH to ATen (CUDA) (#36654)
Summary:
Closes https://github.com/pytorch/pytorch/issues/24546

Benchmark with same build settings on same system.
gcc : version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
CUDA : 10.1
GPU : 1050ti

```python
import timeit

for n, t in [(10_000, 20000),
             (100_000, 20000)]:
    for dtype in ('torch.half', 'torch.float', 'torch.double'):
        print(f'torch.cosh(a) a.numel() == {n} for {t} times {dtype}')
        print(timeit.timeit(f'torch.cosh(a); torch.cuda.synchronize()',
                              setup=f'import torch; a=torch.arange({n}, dtype={dtype}, device="cuda")',
                              number=t))
```

Before:

```
torch.cosh(a) a.numel() == 10000 for 20000 times torch.half
0.2813017509997735
torch.cosh(a) a.numel() == 10000 for 20000 times torch.float
0.28355878599904827
torch.cosh(a) a.numel() == 10000 for 20000 times torch.double
0.27810572300040803
torch.cosh(a) a.numel() == 100000 for 20000 times torch.half
0.3239932899996347
torch.cosh(a) a.numel() == 100000 for 20000 times torch.float
0.321233343998756
torch.cosh(a) a.numel() == 100000 for 20000 times torch.double
0.5546665399997437
```

After:

```
torch.cosh(a) a.numel() == 10000 for 20000 times torch.half
0.2905335750001541
torch.cosh(a) a.numel() == 10000 for 20000 times torch.float
0.27596429500044906
torch.cosh(a) a.numel() == 10000 for 20000 times torch.double
0.30358699899989006
torch.cosh(a) a.numel() == 100000 for 20000 times torch.half
0.30139567500009434
torch.cosh(a) a.numel() == 100000 for 20000 times torch.float
0.30246640400036995
torch.cosh(a) a.numel() == 100000 for 20000 times torch.double
0.5403946970000106

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36654

Differential Revision: D21164606

Pulled By: VitalyFedyunin

fbshipit-source-id: 55e88f94044957f81599ae3c12cda38a3e2c985c
2020-04-22 10:16:24 -07:00
David Reiss
e75fb4356b Remove (most) Python 2 support from Python code (#35615)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35615

Python 2 has reached end-of-life and is no longer supported by PyTorch.
Now we can clean up a lot of cruft that we put in place to support it.
These changes were all done manually, and I skipped anything that seemed
like it would take more than a few seconds, so I think it makes sense to
review it manually as well (though using side-by-side view and ignoring
whitespace change might be helpful).

Test Plan: CI

Differential Revision: D20842886

Pulled By: dreiss

fbshipit-source-id: 8cad4e87c45895e7ce3938a88e61157a79504aed
2020-04-22 09:23:14 -07:00
anjali411
25eb250d77 Added complex dtypes to get_all_math_dtypes, complex acc type for cpu, fixed rdiv and pow for complex (#36747)
Summary:
Resolves https://github.com/pytorch/pytorch/issues/36730 https://github.com/pytorch/pytorch/issues/36057
Partially resolves: https://github.com/pytorch/pytorch/issues/36671
```
>>> 2j / torch.tensor([4], dtype = torch.complex64)
tensor([(0.0000+0.5000j)], dtype=torch.complex64)
>>> 1 / torch.tensor(3+4j)
tensor((0.1200-0.1600j), dtype=torch.complex64)
```
rdiv is more generally broken for all dtypes because it doesn't promote the types properly
eg.
```
>>> 1 / torch.tensor(2)
tensor(0)
>>> 2j / torch.tensor(4)
tensor(0)
```
so that issue should be fixed in a separate PR

Adding CPU acc types for complex
Added cumsum, cumprod for complex dtypes

Added complex dtypes to get_all_math_dtypes to expand testing for complex dtypes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36747

Differential Revision: D21138687

Pulled By: anjali411

fbshipit-source-id: ad3602ccf86c70294a6e71e564cb0d46c393dfab
2020-04-22 08:52:41 -07:00
Mike Ruberry
4a2372bc90 Implements torch.isclose for complex tensors (#36456)
Summary:
Previously torch.isclose would raise a RuntimeError when called on complex tensors. This update makes torch.isclose run on complex tensors and be consistent with [NumPy](https://numpy.org/doc/1.18/reference/generated/numpy.isclose.html). However, NumPy's handling of NaN, -inf, and inf values is odd, so I adopted Python's [cmath.isclose](https://docs.python.org/3/library/cmath.html) behavior when dealing with them. See https://github.com/numpy/numpy/issues/15959 for more on NumPy's behavior.

While implementing complex isclose I also simplified the isclose algorithm to:

- A is close to B if A and B are equal, if equal_nan is true then NaN is equal to NaN
- If A and B are finite, then A is close to B if `abs(a - b) <= (atol + abs(rtol * b))`

This PR also documents torch.isclose, since it was undocumented, and adds multiple tests for its behavior to test_torch.py since it had no dedicated tests.

The PR leaves equal_nan=True with complex inputs an error for now, pending the outcome of https://github.com/numpy/numpy/issues/15959.
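
A short sketch of the resulting complex behavior (default `rtol=1e-5`, `atol=1e-8`; closeness is evaluated on the magnitude of the complex difference):

```python
import torch

a = torch.tensor([1.0 + 1.0j, 2.0 + 2.0j])
b = torch.tensor([1.0 + 1.0j, 2.0 + 2.000001j])
print(torch.isclose(a, b))   # tensor([True, True]): abs(a - b) <= atol + abs(rtol * b) elementwise
```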
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36456

Differential Revision: D21159853

Pulled By: mruberry

fbshipit-source-id: fb18fa7048e6104cc24f5ce308fdfb0ba5e4bb30
2020-04-21 19:53:55 -07:00
Mike Ruberry
a850d8a526 Fixes exponential with lambda=0 (#36837)
Summary:
Resolves https://github.com/pytorch/pytorch/issues/36798.

In the future more thorough testing would be nice.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36837

Differential Revision: D21102342

Pulled By: mruberry

fbshipit-source-id: 4fae45677e54b403296033720dfb13abca47f3a4
2020-04-21 17:34:07 -07:00
Jesse Brizzi
28f439d4f4 add absolute alias for abs (#36597)
Summary:
Adds an `absolute` alias for the `abs` function to match NumPy's support of both:
https://docs.scipy.org/doc/numpy/reference/generated/numpy.absolute.html

Adds a test to ensure the outputs from `abs` and `absolute` are the same.
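
A tiny usage sketch of the alias:

```python
import torch

x = torch.tensor([-1.5, 0.0, 2.0])
print(torch.equal(torch.abs(x), torch.absolute(x)))   # True: absolute is an alias of abs
```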
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36597

Differential Revision: D21024458

Pulled By: jessebrizzi

fbshipit-source-id: 4f2987e7bc7cde444d0a93e833a0350844b48d44
2020-04-20 14:49:51 -07:00
Mike Ruberry
0f0d69009e Makes CUDA -float->uint8 cast consistent with CPU (#36832)
Summary:
Addresses https://github.com/pytorch/pytorch/issues/36807. Also updates the cast testing to catch issues like this better.

In the future a more constexpr based approach to casting would be nice.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36832

Differential Revision: D21120822

Pulled By: mruberry

fbshipit-source-id: 9504ddd36cfe6d9f9f545fc277fef36855c1b221
2020-04-19 23:33:38 -07:00
Natalia Gimelshein
1b3741aa7f [WIP] reenable bfloat16 masked_select (#36859)
Summary:
Try reenabling bfloat16 masked_select and see if the Windows tests pass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36859

Differential Revision: D21109535

Pulled By: ngimel

fbshipit-source-id: ca260943e6575d8e788e9fd87161a0d40d3d44fb
2020-04-19 15:41:32 -07:00
Brian Vaughan
54ed6fd3ee Use both absolute and relative tolerance in testing (#34258)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34258

This PR allows both atol and rtol to be specified, uses defaults based on the prior analysis (spreadsheet attached to https://github.com/pytorch/pytorch/pull/32538), but retains the absolute tolerance behavior in cases where precision was previously specified explicitly.
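
For reference, a sketch of the combined check this enables, in the same form used by `torch.isclose` (illustrative values only):

```python
import torch

actual = torch.tensor([1.0000, 2.0001])
expected = torch.tensor([1.0000, 2.0000])
atol, rtol = 1e-5, 1e-4
# close when |actual - expected| <= atol + rtol * |expected|, elementwise
print((actual - expected).abs() <= atol + rtol * expected.abs())   # tensor([True, True])
```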

Test Plan: Imported from OSS

Differential Revision: D21110255

Pulled By: nairbv

fbshipit-source-id: 57b3a004c7d5ac1be80ee765f03668b1b13f4a7e
2020-04-19 06:16:49 -07:00
Xiang Gao
6ba734bae9 Vectorize reduction when reducing on fastest striding dimension [resubmit] (#36873)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36873

Differential Revision: D21109194

Pulled By: ngimel

fbshipit-source-id: eb18c6b4394f19a6c5eca45ef4ce97d623e051bd
2020-04-18 16:27:00 -07:00
Yuxin Wu
a64ea8ea04 Back out "Vectorize reduction when reducing on fastest striding dimension" (#36854)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36854

Original commit changeset: ea3f7f29709c

Test Plan: n/a

Differential Revision: D21103684

fbshipit-source-id: e4862b32bf9815486e5fa7e05b9816550e9b0263
2020-04-17 19:53:30 -07:00
Xiang Gao
d92005ff73 Vectorize reduction when reducing on fastest striding dimension (#36709)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36709

Test Plan: Imported from OSS

Differential Revision: D21083393

Pulled By: ngimel

fbshipit-source-id: ea3f7f29709c9a6e5b3ec45ba809cb2cf6c5e0c8
2020-04-17 10:12:49 -07:00
Mike Ruberry
d7fabfd5df Implements complex isfinite and isinf (#36648)
Summary:
Implements complex isfinite and isinf, consistent with NumPy.

A complex value is finite if and only if both its real and imaginary part are finite.

A complex value is infinite if and only if its real or imaginary part are infinite.

Old isfinite, isinf, and isnan tests are modernized: instead of fixtures, the torch results are compared with NumPy. A new test is added for complex isfinite, isinf, and isnan. The docs for each function are updated to clarify what finite, infinite, and NaN values are.

The new tests rely on a new helper, _np_compare, that we'll likely want to generalize in the near future and use in more tests.

Addresses part of the complex support tasks. See https://github.com/pytorch/pytorch/issues/33152.
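
A small sketch of these rules on a complex tensor:

```python
import torch

z = torch.tensor([complex(1, 2), complex(float('inf'), 0), complex(0, float('nan'))])
print(torch.isfinite(z))   # tensor([ True, False, False]): finite iff both parts are finite
print(torch.isinf(z))      # tensor([False,  True, False]): infinite iff either part is infinite
print(torch.isnan(z))      # tensor([False, False,  True]): NaN iff either part is NaN
```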
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36648

Differential Revision: D21054766

Pulled By: mruberry

fbshipit-source-id: d947707c5437385775c82f4e6c722349ca5a2174
2020-04-16 09:09:02 -07:00
anjali411
9e016f77a8 Added complex types to get_all_dtypes and turned on masked_fill for complex (#36335)
Summary:
1. Added complex dtypes to get_all_dtypes to unify testing of complex dtypes with the other dtypes, so that complex behavior doesn't get out of sync with what is supported for other dtypes.
2. resolves https://github.com/pytorch/pytorch/issues/36322, https://github.com/pytorch/pytorch/issues/36327
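
For reference, a minimal sketch of the newly enabled `masked_fill` on a complex tensor:

```python
import torch

x = torch.zeros(3, dtype=torch.complex64)
mask = torch.tensor([True, False, True])
x.masked_fill_(mask, 1 + 2j)
print(x)   # tensor([1.+2.j, 0.+0.j, 1.+2.j])
```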
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36335

Differential Revision: D21045603

Pulled By: anjali411

fbshipit-source-id: 5089306b66fdc18148e831f56298da5de673be67
2020-04-16 08:24:45 -07:00
lixinyu
1e7155caa5 Bucketization (#7284) (#34577)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34577

Test Plan: Imported from OSS

Differential Revision: D20380975

Pulled By: glaringlee

fbshipit-source-id: d75939bc54d98675f88d7037491a8420ac20847a
2020-04-15 10:32:51 -07:00
Vasiliy Kuznetsov
16e90eba59 hardsigmoid: add cuda kernels (#36351)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36351

Adds CUDA kernels for hardsigmoid, to enable its use in training.

Note: the update to the cpu backward pass is to keep the cpu vs cuda
logic consistent, no change in functionality.
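
A minimal training-style sketch (requires a CUDA device; uses the functional form `torch.nn.functional.hardsigmoid`):

```python
import torch
import torch.nn.functional as F

x = torch.randn(8, device='cuda', requires_grad=True)
y = F.hardsigmoid(x)    # forward now runs a dedicated CUDA kernel
y.sum().backward()      # backward runs on CUDA as well, enabling training
print(x.grad)
```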

Test Plan:
add CI for the forward pass
run this for the backward pass:
https://gist.github.com/vkuzo/95957d365600f9ad10d25bd20f58cc1a

Imported from OSS

Differential Revision: D20955589

fbshipit-source-id: dc198aa6a58e1a7996e1831f1e479c398ffcbc90
2020-04-15 10:15:49 -07:00
xiaobingsuper
1a0b95e7e4 bfloat16: enable basic math function (#35172)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35172

Test Plan: Imported from OSS

Differential Revision: D20721146

Pulled By: ngimel

fbshipit-source-id: 25b2176d0a431706c51a7086e0642aff814d7148
2020-04-14 17:18:21 -07:00
Kurt Mohler
ce3555a635 Relanding masked_select cuda port from TH to ATen (#36539)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/33054
Relanding PR https://github.com/pytorch/pytorch/issues/35429
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36539

Differential Revision: D21007226

Pulled By: ngimel

fbshipit-source-id: 3c66ad073ff8e767ad120bc94120379d40346018
2020-04-14 14:03:59 -07:00
Natalia Gimelshein
f3f640d479 move test_abs to device-generic tests (#36465)
Summary:
Per title. test_abs used to be marked as a slow test and run on CPU only. Conceptually similar tests are done in TestTorchMathOps, so it's a matter of adding an `abs` test there. Two remaining checks (correct abs for large-valued long tensors, and correct abs for signed zeros) are factored into separate tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36465

Differential Revision: D21000248

Pulled By: ngimel

fbshipit-source-id: 8bc8b0da936b1c10fe016ff2f0dbb5ea428e7e61
2020-04-14 09:48:08 -07:00
Wanchao Liang
3526627f46 Use unittest assertWarns instead (#36411)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36411

This PR removes the PyTorch-specific assertWarns definition and uses the unittest one; it also formats some tests.
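
For reference, a minimal sketch of the standard unittest pattern the tests now rely on:

```python
import unittest
import warnings

class Example(unittest.TestCase):
    def test_warns(self):
        # unittest's built-in context manager replaces the custom helper
        with self.assertWarns(UserWarning):
            warnings.warn("deprecated behavior", UserWarning)

if __name__ == "__main__":
    unittest.main()
```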

Test Plan: Imported from OSS

Differential Revision: D20998159

Pulled By: wanchaol

fbshipit-source-id: 1280ecff2dd293b95a639d13cc7417fc819c2201
2020-04-13 15:56:42 -07:00
Kurt Mohler
2bc49a4b85 block_diag dense (#33449)
Summary:
Add block_diag function for dense tensors, based on scipy.linalg.block_diag

Closes https://github.com/pytorch/pytorch/issues/31932
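
A short usage sketch (mirroring `scipy.linalg.block_diag`):

```python
import torch

A = torch.tensor([[1, 2], [3, 4]])
B = torch.tensor([[5]])
print(torch.block_diag(A, B))
# tensor([[1, 2, 0],
#         [3, 4, 0],
#         [0, 0, 5]])
```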
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33449

Differential Revision: D20943099

Pulled By: zou3519

fbshipit-source-id: 8b5c9476fb5af959aafa4169612c660396d9b717
2020-04-13 10:04:55 -07:00
Max Balandat
379e4d9cad [pytorch] Make behavior of SobolEngine consistent w/ other RNG functions (#36427)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36427

Addresses https://github.com/pytorch/pytorch/issues/36341

Test Plan: unit tests

Reviewed By: ldworkin

Differential Revision: D20952703

fbshipit-source-id: 28055f4c4c0f8012c2d96e473b822fa455dd833c
2020-04-13 07:53:33 -07:00
Mike Ruberry
b92f8d9b7e Revert D20950587: [pytorch][PR] Added complex types to get_all_dtypes and turned on masked_fill for complex
Test Plan: revert-hammer

Differential Revision:
D20950587

Original commit changeset: ba7c372a28f0

fbshipit-source-id: 487ac59a971b1ecefd20fd446385ba12334d9695
2020-04-12 21:33:17 -07:00