Commit Graph

1776 Commits

Author SHA1 Message Date
Rong Rong
58c13cf685 Back out "Revert D25375885: [pytorch][PR] Reenable some BF16 tests on CUDA"
Summary: Revert D25397144 69829f3fff4d4a2d1a71bb52e90d3c7f16b27fa3

Test Plan: Revert Hammer

Reviewed By: janeyx99

Differential Revision: D25397572

fbshipit-source-id: 625ca2a32e4558ae4582a15697b6e1cc57cc1573
2020-12-08 07:52:59 -08:00
Rong Rong
39445f718c Revert D25375885: [pytorch][PR] Reenable some BF16 tests on CUDA
Test Plan: revert-hammer

Differential Revision:
D25375885 (e3893b867f)

Original commit changeset: 2e19fe725ae9

fbshipit-source-id: 69829f3fff4d4a2d1a71bb52e90d3c7f16b27fa3
2020-12-08 07:05:33 -08:00
Xiang Gao
e3893b867f Reenable some BF16 tests on CUDA (#48805)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48805

Reviewed By: agolynski

Differential Revision: D25375885

Pulled By: ailzhang

fbshipit-source-id: 2e19fe725ae9450bd1a2bc4e2d308c59b9f94fac
2020-12-07 16:16:07 -08:00
Gao, Xiang
a39398b9e5 CUDA BF16 norm (#48806)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48806

Reviewed By: mruberry

Differential Revision: D25358465

Pulled By: ngimel

fbshipit-source-id: 1a2afd86f39e96db0754d04bf81de045b1e1235c
2020-12-06 23:41:05 -08:00
Kurt Mohler
2cb9204159 Add nondeterministic alert to index_copy, median CUDA and kthvalue CUDA (#46942)
Summary:
Also fixes an issue where skipped tests did not properly restore the deterministic flag.

Fixes https://github.com/pytorch/pytorch/issues/46743

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46942

Reviewed By: heitorschueroff

Differential Revision: D25298020

Pulled By: mruberry

fbshipit-source-id: 14b1680e1fa536ec72018d0cdb0a3cf83b098767
2020-12-03 11:03:07 -08:00
Edward Yang
f9a0abfc43 Fix code review from #48659 and #48116 (#48731)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48731

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D25278034

Pulled By: ezyang

fbshipit-source-id: 73652311b48d8d80c06e9385b7ff18ef3a158ae8
2020-12-03 08:26:17 -08:00
kshitij12345
90a3049a9a [fix] repr(torch.device) (#48655)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/48585

In commit 4c9eb57914, the type of `DeviceIndex` was changed from `uint16_t` to `uint8_t`.
`uint8_t` is treated as an ASCII char by `std::cout` and other stream operators, hence the broken `repr`.

Stackoverflow Reference: https://stackoverflow.com/questions/19562103/uint8-t-cant-be-printed-with-cout
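
A minimal check of the fixed behavior (the index should print as a number, not as a stray control character):

```python
import torch

d = torch.device('cuda', 0)
print(repr(d))  # device(type='cuda', index=0)
```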

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48655

Reviewed By: bdhirsh

Differential Revision: D25272289

Pulled By: ezyang

fbshipit-source-id: a1549f5f8d417138cf38795e4c373e3a487d3691
2020-12-02 15:48:17 -08:00
Erjia Guan
c98c98d77d Migrate fmod and fmod_ from TH to ATen (CUDA) (#47323)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47323

Fixes #24565

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D24763086

Pulled By: ejguan

fbshipit-source-id: fa004baea19bbbdbeb44814903db29226805ef0e
2020-12-02 09:38:29 -08:00
Edward Yang
b4f5efa7b2 Structured kernels generate Meta registrations (#48116)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48116

If you port kernels to be structured, you get Meta kernels automatically
generated for you.  This is one payoff of structured kernels.

Code generation was mercifully simple, although at risk of
"swiss cheese" syndrome: there are two new conditionals in the codegen
to tweak behavior when generating for meta keys.  It's not too bad
right now but there's a risk of things getting out of hand.  One
way to rationalize the logic here would be to transmit "TensorMeta-ness"
inside the TensorOptions (so tensor_from_meta can deal with it); then
the "Meta" kernel magic would literally just be generating empty
out_impls to call after all the scaffolding is done.  But I didn't
do this because it seemed like it would be more annoying short term.

Also had to teach resize_ to work on meta tensors, since we use them
to implement the out kernels.
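
On builds that expose the meta device in the public API (an assumption; the public `device='meta'` spelling landed later than this commit's internal plumbing), the resize_ behavior can be sketched as:

```python
import torch

# assumes device='meta' is available in the public API
t = torch.empty(2, 3, device='meta')  # carries shape/dtype metadata, no storage
t.resize_(4, 5)                       # resizing only updates the metadata
print(t.shape)                        # torch.Size([4, 5])
```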

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: bhosmer, ailzhang

Differential Revision: D25056640

Pulled By: ezyang

fbshipit-source-id: f8fcfa0dbb58a94d9b4196748f56e155f83b1521
2020-12-02 07:54:48 -08:00
kshitij12345
bcc85a363e [numpy] torch.sigmoid : promote integer inputs to float (#47551)
Summary:
Reference https://github.com/pytorch/pytorch/issues/42515
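
A quick sketch of the promoted behavior:

```python
import torch

torch.sigmoid(torch.tensor([0, 1]))  # integer input is promoted to float
# tensor([0.5000, 0.7311])
```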

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47551

Reviewed By: ngimel

Differential Revision: D25211953

Pulled By: mruberry

fbshipit-source-id: 9174cda401aeba0fd585a4c9bda166dbcf64f42f
2020-12-01 23:28:57 -08:00
Taylor Robie
27905dfe9c Expose CXX_FLAGS through __config__ (#47861)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47861

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D25199263

Pulled By: robieta

fbshipit-source-id: 3cfdb0485d686a03a68dd0907d1733634857963f
2020-12-01 19:58:29 -08:00
Mike Ruberry
36c87f1243 Refactors test_torch.py to be fewer than 10k lines (#47356)
Summary:
Creates multiple new test suites to have fewer tests in test_torch.py, consistent with previous test suite creation like test_unary_ufuncs.py and test_linalg.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47356

Reviewed By: ngimel

Differential Revision: D25202268

Pulled By: mruberry

fbshipit-source-id: 75fde3ca76545d1b32b86d432a5cb7a5ba8f5bb6
2020-11-28 20:11:40 -08:00
kiyosora
272f4db043 Implement NumPy-like function torch.float_power() (#44937)
Summary:
- Related with https://github.com/pytorch/pytorch/issues/38349
- Implementing the NumPy-like function `torch.float_power()` .
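
For example, `float_power` always computes in double precision:

```python
import torch

torch.float_power(torch.tensor([1, 2, 3]), 2)
# tensor([1., 4., 9.], dtype=torch.float64)
```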

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44937

Reviewed By: ngimel

Differential Revision: D25192119

Pulled By: mruberry

fbshipit-source-id: 2e446b8e0c2825f045fe057e30c9419335557a05
2020-11-27 18:01:42 -08:00
Antonio Cuni
344918576c Migrate eig from the TH to Aten (CUDA) (#44105)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/24553

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44105

Reviewed By: ngimel

Differential Revision: D25192116

Pulled By: mruberry

fbshipit-source-id: 87f1ba4924b9174bfe0d9e2ab14bbe1c6bae879c
2020-11-27 15:15:48 -08:00
elfringham
db1b0b06c4 Flake8 fixes (#48453)
Summary:
Quiet errors from flake8. Only a couple of code changes for deprecated Python syntax from before 2.4. The rest is just adding noqa markers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48453

Reviewed By: mruberry

Differential Revision: D25181871

Pulled By: ngimel

fbshipit-source-id: f8d7298aae783b1bce2a46827b088fc390970641
2020-11-25 19:09:50 -08:00
Xiao Wang
4ab2055857 Re-enable only cuda tests wrongly disabled before (#48429)
Summary:
Close https://github.com/pytorch/pytorch/issues/46536

Re-enable only cuda tests wrongly disabled in https://github.com/pytorch/pytorch/pull/45332

See discussions https://github.com/pytorch/pytorch/issues/46536#issuecomment-721386038 and https://github.com/pytorch/pytorch/pull/45332#issuecomment-721350987

~~See also https://github.com/pytorch/pytorch/pull/47237 and https://github.com/pytorch/pytorch/pull/47642~~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48429

Reviewed By: ngimel

Differential Revision: D25176368

Pulled By: mruberry

fbshipit-source-id: 3822f5a45e58c0e387624e70ea272d16218901a9
2020-11-25 13:26:35 -08:00
kshitij12345
9ecaeb0962 [numpy] Add unary-ufunc tests for erf variants (#47155)
Summary:
Adding Unary Ufunc Test entry for `erf` variants.

We use SciPy functions as the reference implementation.

We can update the tests later, once these functions promote integer inputs to float.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47155

Reviewed By: ngimel

Differential Revision: D25176654

Pulled By: mruberry

fbshipit-source-id: cb08efed1468b27650cec4f87a9a34e999ebd810
2020-11-25 13:20:14 -08:00
Fayçal Arbai
2e0a8b75d8 An implementation of torch.tile as requested in pytorch/pytorch#38349 (#47974)
Summary:
The approach is to simply reuse `torch.repeat`, adding one extra piece of functionality to tile: 1's are prepended to the reps array when the tensor has more dimensions than the reps given in the input. Thus, for a tensor of shape (64, 3, 24, 24), reps of (2, 2) become (1, 1, 2, 2), which is what NumPy does.

I've encountered some instability with the test on my end, where I could get a random failure (due to, sometimes, a random value of `self.dim()`, and sometimes, segfaults). I'd appreciate any feedback on the test or an explanation for this instability so I can fix it.
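
A minimal sketch of the padding behavior described above:

```python
import torch

x = torch.randn(64, 3, 24, 24)
torch.tile(x, (2, 2)).shape  # reps (2, 2) is treated as (1, 1, 2, 2)
# torch.Size([64, 3, 48, 48])
```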

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47974

Reviewed By: ngimel

Differential Revision: D25148963

Pulled By: mruberry

fbshipit-source-id: bf63b72c6fe3d3998a682822e669666f7cc97c58
2020-11-24 18:07:25 -08:00
Kurt Mohler
b6654906c7 Fix assertEqual's handling of numpy array inputs (#48217)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/47948

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48217

Reviewed By: mrshenli

Differential Revision: D25119607

Pulled By: mruberry

fbshipit-source-id: efe84380d3797d242c2aa7d43d2209bcba89cee0
2020-11-22 00:13:42 -08:00
Nikita Shulga
dc843fe197 Fix test_ldexp on Windows (#48335)
Summary:
Force `torch.randint` to generate tensor of int32 rather than tensor of int64
Delete unneeded copies

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48335

Reviewed By: ranman

Differential Revision: D25133312

Pulled By: malfet

fbshipit-source-id: 70bfcb6b7ff3bea611c4277e6634dc7473541288
2020-11-20 15:41:59 -08:00
Randall Hunt
562d4c3bc5 Add basic ldexp operator for numpy compatibility (#45370)
Summary:
Adds ldexp operator for https://github.com/pytorch/pytorch/issues/38349

I'm not entirely sure the changes to `NamedRegistrations.cpp` were needed, but I saw other operators in there so I added ldexp as well.

Normally the ldexp operator is used along with frexp to construct and deconstruct floating-point values. This is useful for performing operations on either the mantissa or exponent portion of a floating-point value.

Sleef, std math.h, and CUDA support both ldexp and frexp, but not for all data types. I wasn't able to figure out how to get the iterators to play nicely with a vectorized kernel, so I have left this with just the normal CPU kernel for now.

This is the first operator I'm adding so please review with an eye for errors.
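
For reference, `ldexp(x, n)` computes `x * 2**n`:

```python
import torch

torch.ldexp(torch.tensor([1.0, 2.0]), torch.tensor([2, 3]))
# tensor([ 4., 16.])
```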

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45370

Reviewed By: mruberry

Differential Revision: D24333516

Pulled By: ranman

fbshipit-source-id: 2df78088f00aa9789aae1124eda399771e120d3f
2020-11-20 04:09:39 -08:00
kiyosora
008f840e7a Implement in-place method torch.cumsum_ and torch.cumprod_ (#47651)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/47193
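
A quick sketch of the new in-place variants:

```python
import torch

t = torch.tensor([1., 2., 3.])
t.cumsum_(dim=0)   # in-place: t becomes tensor([1., 3., 6.])
t.cumprod_(dim=0)  # in-place cumulative product on the same tensor
```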

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47651

Reviewed By: zou3519

Differential Revision: D24992438

Pulled By: ezyang

fbshipit-source-id: c38bea55f4af1fc92be780eaa8e1d462316e6192
2020-11-19 11:20:12 -08:00
mfkasim91
8819bad86c Implement igammac (3rd PR) (#48171)
Summary:
Related: https://github.com/pytorch/pytorch/issues/46183 (torch.igamma)
This is the regularized upper incomplete gamma function.
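
For example, `igammac(a, x)` equals `1 - igamma(a, x)`:

```python
import torch

a, x = torch.tensor([1.0]), torch.tensor([1.0])
torch.igammac(a, x)  # tensor([0.3679]), i.e. exp(-1) for a = 1
```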

This is supposed to be exactly the same as https://github.com/pytorch/pytorch/issues/47463, but after rebasing the `viable/strict` branch.

cc: mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48171

Reviewed By: zhangguanheng66

Differential Revision: D25060107

Pulled By: mruberry

fbshipit-source-id: 89780dea21dbb2141cbc4f7f18192cb78a769b17
2020-11-18 23:44:32 -08:00
Edward Yang
a97d059614 Get TestTorch.test_empty_meta working again (#48113)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48113

Fix is simple: just treat Meta as a backend covered by AutogradOther.
This semantically makes sense, since meta kernels are just like regular
CPU/CUDA kernels; they just don't do any compute.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: zhangguanheng66

Differential Revision: D25056641

Pulled By: ezyang

fbshipit-source-id: 7b68911982352b3e0ee8616b38cd9c70bd58a740
2020-11-18 19:50:27 -08:00
Scott Wolchok
4c9eb57914 [PyTorch] Narrow Device to 2 bytes by narrowing DeviceType and DeviceIndex (#47023)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47023

DeviceType pretty clearly only needs 1 byte. DeviceIndex only needs 1 byte given that machines don't have anywhere near 255 GPUs in them as far as I know.
ghstack-source-id: 116901430

Test Plan: Existing tests, added assertion to catch if my assumption about DeviceIndex is incorrect

Reviewed By: dzhulgakov

Differential Revision: D24605460

fbshipit-source-id: 7c9a89027fcf8eebd623b7cdbf6302162c981cd2
2020-11-18 19:39:40 -08:00
Mike Ruberry
ea1e78a0c5 Revert D24853669: [pytorch][PR] Migrate eig from the TH to Aten (CUDA)
Test Plan: revert-hammer

Differential Revision:
D24853669 (866f8591be)

Original commit changeset: a513242dc7f4

fbshipit-source-id: a0c8c424b61b1e627d9102de6b4c6d0717a6c06d
2020-11-18 16:53:18 -08:00
Antonio Cuni
866f8591be Migrate eig from the TH to Aten (CUDA) (#44105)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/24553

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44105

Reviewed By: heitorschueroff

Differential Revision: D24853669

Pulled By: mruberry

fbshipit-source-id: a513242dc7f49f55dbc6046c18d8a9d9aa2aaf8d
2020-11-18 12:10:18 -08:00
kshitij12345
68a3a3f3b5 Add torch.swapdims and torch.swapaxes (#46041)
Summary:
Reference https://github.com/pytorch/pytorch/issues/38349

Delegates to `torch.transpose` (not sure of the best way to alias); see the sketch after the TODO list.

TODO:
* [x] Add test
* [x] Add documentation
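
A quick sanity check of the aliases:

```python
import torch

x = torch.randn(2, 3, 4)
torch.swapaxes(x, 0, 2).shape  # torch.Size([4, 3, 2])
torch.swapdims(x, 0, 2).shape  # same result; both delegate to torch.transpose
```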

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46041

Reviewed By: gchanan

Differential Revision: D25022816

Pulled By: mruberry

fbshipit-source-id: c80223d081cef84f523ef9b23fbedeb2f8c1efc5
2020-11-18 11:35:53 -08:00
Ivan Yashchuk
81b1673a21 Enable complex tests that depend on batched matmul on CUDA (#47910)
Summary:
Now that https://github.com/pytorch/pytorch/pull/42553 is merged, we can delete a bit of code from the tests and enable some of the skipped complex tests.

Unfortunately, `test_pinverse_complex_xfailed` and `test_symeig_complex_xfailed` had bugs, and it wasn't caught automatically that these tests xpass (pass unexpectedly). We need to be careful next time with `unittest.expectedFailure`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47910

Reviewed By: zhangguanheng66

Differential Revision: D25052130

Pulled By: mruberry

fbshipit-source-id: 29512995c024b882f9cb78b7bede77733d5762d0
2020-11-18 10:44:47 -08:00
Heitor Schueroff
2ff748a680 Move kthvalue scalar test to separate method for XLA (#48042)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48042

Moving the scalar test to a separate method so the XLA team can continue to test the other cases without failing. Requested here: https://github.com/pytorch/xla/issues/2620#issuecomment-725696108

Test Plan: Imported from OSS

Reviewed By: zhangguanheng66

Differential Revision: D25055677

Pulled By: heitorschueroff

fbshipit-source-id: 5da66bac78ea197821fee0b9b8a213ff2dc19c67
2020-11-18 07:49:14 -08:00
Xiang Gao
d293413b3e Batched matmul dtypes (#47873)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47873

Reviewed By: navahgar

Differential Revision: D24928256

Pulled By: anjali411

fbshipit-source-id: a26aef7a15a13fc0b5716e905971265d8b1cea61
2020-11-14 22:45:48 -08:00
anjali411
db1f217d8d Add complex support for torch.addcmul and torch.addcdiv (#46639)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46639

Resolves: https://github.com/pytorch/pytorch/issues/46546#issuecomment-713122245

Test Plan: Imported from OSS

Reviewed By: izdeby, ansley

Differential Revision: D24879099

Pulled By: anjali411

fbshipit-source-id: 76131dc68ac964e67a633f62e07f7c799df4463e
2020-11-14 21:27:34 -08:00
Ivan Yashchuk
260daf088d Added linalg.cholesky (#46083)
Summary:
This PR adds `torch.linalg.cholesky` function that matches `numpy.linalg.cholesky`.

Fixed the `lda` argument to `lapackCholesky` calls.
Added `random_hermitian_pd_matrix` helper function for tests.

Ref https://github.com/pytorch/pytorch/issues/42666.
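
A minimal sketch with a complex Hermitian positive-definite input:

```python
import torch

a = torch.randn(3, 3, dtype=torch.complex128)
a = a @ a.conj().transpose(-2, -1) + torch.eye(3)   # Hermitian positive-definite
l = torch.linalg.cholesky(a)
torch.allclose(l @ l.conj().transpose(-2, -1), a)   # True
```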

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46083

Reviewed By: ailzhang

Differential Revision: D24861752

Pulled By: mruberry

fbshipit-source-id: 214dbceb4e8a2c589df209493efd843962d25593
2020-11-13 16:50:40 -08:00
Richard Zou
1c7c612af0 Revert D24543682: [pytorch][PR] Added support for complex input for torch.lu_solve
Test Plan: revert-hammer

Differential Revision:
D24543682 (ffd0003022)

Original commit changeset: 165bde39ef95

fbshipit-source-id: 790b4157fdbc7149aaf0748555efe6daed7e1a23
2020-11-13 08:24:53 -08:00
Ivan Yashchuk
ffd0003022 Added support for complex input for torch.lu_solve (#46862)
Summary:
`torch.lu_solve` now works for complex inputs both on CPU and GPU.
I moved the existing tests to `test_linalg.py` and modified them to test complex dtypes, but I didn't modify/improve the body of the tests.

Ref. https://github.com/pytorch/pytorch/issues/33152

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46862

Reviewed By: nikithamalgifb

Differential Revision: D24543682

Pulled By: anjali411

fbshipit-source-id: 165bde39ef95cafebf976c5ba4b487297efe8433
2020-11-13 02:35:31 -08:00
Gao, Xiang
0652d755d3 Fix some flaky tests in test_torch.py and test_nn.py (#46941)
Summary:
Fixed test:
- `test_is_nonzero`: this asserted an exact match, which is flaky when `TORCH_SHOW_CPP_STACKTRACES=1`; I changed it to a non-exact assert
- `test_pinverse` TF32
- `test_symeig` TF32
- `test_triangular_solve_batched_many_batches_cpu_float64` precision on CPU BLAS
- `test_qr` TF32, as well as a tensor factory that forgot a `dtype=dtype`
- `test_lu` TF32
- `ConvTranspose2d` TF32
- `Conv3d_1x1x1_no_bias` TF32
- `Transformer*` TF32

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46941

Reviewed By: heitorschueroff

Differential Revision: D24852725

Pulled By: mruberry

fbshipit-source-id: ccd4740cc643476178d81059d1c78da34e5082ed
2020-11-12 22:35:42 -08:00
kshitij12345
3649a2c170 [numpy] torch.sqrt : promote integer inputs to float (#47293)
Summary:
Reference https://github.com/pytorch/pytorch/issues/42515
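
For example:

```python
import torch

torch.sqrt(torch.tensor([4, 9]))  # integer input is promoted to float
# tensor([2., 3.])
```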

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47293

Reviewed By: malfet

Differential Revision: D24855994

Pulled By: mruberry

fbshipit-source-id: 1e6752f2eeba6d638dea0bdea0c650cf722718c9
2020-11-12 16:16:09 -08:00
Ivan Yashchuk
149190c014 Added CUDA support for complex input for torch.solve (#47045)
Summary:
`torch.solve` now works for complex inputs on GPU.
I moved the existing tests to `test_linalg.py` and modified them to test complex and float32 dtypes.
Differentiation also works correctly with complex inputs.

Fixes https://github.com/pytorch/pytorch/issues/41084
Ref. https://github.com/pytorch/pytorch/issues/33152

anjali411 I hope you don't mind that I took over https://github.com/pytorch/pytorch/pull/42737
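
A minimal sketch using the `torch.solve` API of this era (since superseded by `torch.linalg.solve`); assumes a CUDA device is available:

```python
import torch

A = torch.randn(3, 3, dtype=torch.complex128, device='cuda')
b = torch.randn(3, 1, dtype=torch.complex128, device='cuda')
x, lu = torch.solve(b, A)   # legacy signature: torch.solve(B, A)
torch.allclose(A @ x, b)    # True, up to numerical tolerance
```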

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47045

Reviewed By: nikithamalgifb

Differential Revision: D24921503

Pulled By: anjali411

fbshipit-source-id: 4c3fc4f193a84b6e28c43c08672d480715000923
2020-11-12 12:22:59 -08:00
Gregory Chanan
b6cb2caa68 Revert "Fixed einsum compatibility/performance issues (#46398)" (#47821)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47821

This reverts commit a5c65b86ce.

 Conflicts:
	test/test_linalg.py

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D24909923

Pulled By: gchanan

fbshipit-source-id: 9dcf98e7c4a3c7e5aaffe475867fa086f3bb6ff2
2020-11-12 08:11:40 -08:00
anjali411
e1ee3bfc0e Port bmm and baddbmm from TH to ATen (#42553)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42553

Ports `torch.bmm` and `torch.baddbmm` from TH to ATen and adds support for complex dtypes. Also removes dead TH code for Level 2 functions.

Closes #24539

Test Plan: Imported from OSS

Reviewed By: ansley

Differential Revision: D24893511

Pulled By: anjali411

fbshipit-source-id: 0eba3f2aec99c48b3018a5264ee7789279cfab58
2020-11-12 07:57:42 -08:00
Ivan Yashchuk
52ec8b9340 Added CUDA support for complex input for torch.triangular_solve (#46916)
Summary:
`torch.triangular_solve` now works for complex inputs on GPU.
I moved the existing tests to `test_linalg.py` and modified them to test complex and float32 dtypes.

Ref. https://github.com/pytorch/pytorch/issues/33152

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46916

Reviewed By: navahgar, agolynski

Differential Revision: D24706647

Pulled By: anjali411

fbshipit-source-id: fe780eac93d2ae1b2549539bb385e5fac25213b3
2020-11-11 16:08:11 -08:00
Ivan Yashchuk
a1db5b0f2b Added CUDA support for complex input for torch.inverse #2 (#47595)
Summary:
`torch.inverse` now works for complex inputs on GPU.
Opening a new PR here. The previous PR was merged and reverted due to a bug in tests marked with `slowTest`.
Previous PR https://github.com/pytorch/pytorch/pull/45034

Ref. https://github.com/pytorch/pytorch/issues/33152

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47595

Reviewed By: navahgar

Differential Revision: D24840955

Pulled By: anjali411

fbshipit-source-id: ec49fffdc4b3cb4ae7507270fa24e127be14f59b
2020-11-11 11:06:08 -08:00
Heitor Schueroff
a5c65b86ce Fixed einsum compatibility/performance issues (#46398)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46398

This PR makes torch.einsum compatible with numpy.einsum, except for the sublist input option, as requested here: https://github.com/pytorch/pytorch/issues/21412. It also fixes two performance issues linked below and adds a check for reducing to torch.dot instead of torch.bmm, which is faster in some cases.

fixes #45854, #37628, #30194, #15671

fixes #41467 with benchmark below
```python
import torch
from torch.utils.benchmark import Timer

a = torch.randn(10000, 100, 101, device='cuda')
b = torch.randn(10000, 101, 3, device='cuda')

c = torch.randn(10000, 100, 1, device='cuda')
d = torch.randn(10000, 100, 1, 3, device='cuda')

print(Timer(
    stmt='torch.einsum("bij,bjf->bif", a, b)',
    globals={'a': a, 'b': b}
).blocked_autorange())

print()

print(Timer(
    stmt='torch.einsum("bic,bicf->bif", c, d)',
    globals={'c': c, 'd': d}
).blocked_autorange())
```
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7fa37c413850>
torch.einsum("bij,bjf->bif", a, b)
  Median: 4.53 ms
  IQR:    0.00 ms (4.53 to 4.53)
  45 measurements, 1 runs per measurement, 1 thread

<torch.utils.benchmark.utils.common.Measurement object at 0x7fa37c413700>
torch.einsum("bic,bicf->bif", c, d)
  Median: 63.86 us
  IQR:    1.52 us (63.22 to 64.73)
  4 measurements, 1000 runs per measurement, 1 thread
```

fixes #32591 with benchmark below
```python
import torch
from torch.utils.benchmark import Timer

a = torch.rand(1, 1, 16, 2, 16, 2, 16, 2, 2, 2, 2, device="cuda")
b = torch.rand(729, 1, 1, 2, 1, 2, 1, 2, 2, 2, 2, device="cuda")

print(Timer(
    stmt='(a * b).sum(dim = (-3, -2, -1))',
    globals={'a': a, 'b': b}
).blocked_autorange())

print()

print(Timer(
    stmt='torch.einsum("...ijk, ...ijk -> ...", a, b)',
    globals={'a': a, 'b': b}
).blocked_autorange())
```
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7efe0de28850>
(a * b).sum(dim = (-3, -2, -1))
  Median: 17.86 ms
  2 measurements, 10 runs per measurement, 1 thread

<torch.utils.benchmark.utils.common.Measurement object at 0x7efe0de286a0>
torch.einsum("...ijk, ...ijk -> ...", a, b)
  Median: 296.11 us
  IQR:    1.38 us (295.42 to 296.81)
  662 measurements, 1 runs per measurement, 1 thread
```

TODO

- [x] add support for ellipsis broadcasting
- [x] fix corner case issues with sumproduct_pair
- [x] update docs and add more comments
- [x] add tests for error cases

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D24860367

Pulled By: heitorschueroff

fbshipit-source-id: 31110ee598fd598a43acccf07929b67daee160f9
2020-11-10 19:38:43 -08:00
Heitor Schueroff
bf6a156f64 Fix kthvalue error for scalar input (#47600)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47600

fixes https://github.com/pytorch/pytorch/issues/30818

Note that the median case was already fixed by https://github.com/pytorch/pytorch/pull/45847

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D24860337

Pulled By: heitorschueroff

fbshipit-source-id: 69ccbbb6c7c86671e5712b1c2056c012d898b4f2
2020-11-10 17:21:52 -08:00
kshitij12345
6575e674ce [numpy] torch.{all, any} : Extend Dtype Support (#44790)
Summary:
Reference https://github.com/pytorch/pytorch/issues/44779
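
A quick sketch of the extended dtype support:

```python
import torch

torch.all(torch.tensor([1.0, 0.5]))  # tensor(True): float dtypes now work
torch.any(torch.tensor([0, 0, 2]))   # tensor(True): integer dtypes too
```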

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44790

Reviewed By: bdhirsh

Differential Revision: D24393119

Pulled By: heitorschueroff

fbshipit-source-id: a9b88e9d06b3c282f2e5360b6eaea4ae8ef77c1d
2020-11-10 17:11:39 -08:00
Natalia Gimelshein
c9d37675b2 Back out "[pytorch][PR] The dimension being reduced should not be coalesced by TensorIterator" (#47642)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47642

Original commit changeset: 02bb2b15694c

Test Plan: Covered by CI tests

Reviewed By: anjali411

Differential Revision: D24849072

fbshipit-source-id: a8790cbf46936aee7a6f504dac8595997175fc65
2020-11-10 16:31:33 -08:00
Radhakrishnan Venkataramani
163adb9fa7 Add HalfToFloat + FloatToHalf operators to PyTorch (#45092)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45092

Adding two operators
1. at::float_to_half -> Converts FP32 tensor to FP16 tensor
2. at::half_to_float -> Converts FP16 tensor to FP32 tensor.

These operators internally use the kernel provided by FBGEMM. Both C2 and PT will use the same FBGEMM kernel underneath.

Test Plan:
buck test //caffe2/test:torch -- .*test_half_tensor.*

Run benchmark locally using

```
buck run //caffe2/benchmarks/operator_benchmark/pt:tensor_to_test
```

AI Bench results are pending. I expect they will not finish soon, as we have a large queue with jobs pending for 2+ days.

Benchmark for 512x512 tensor with FbGeMM implementation

```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: FloatToHalfTensorConversionBenchmark
# Mode: Eager
# Name: FloatToHalfTensorConversionBenchmark_M512_N512_cpu
# Input: M: 512, N: 512, device: cpu
Forward Execution Time (us) : 1246.332

# Benchmarking PyTorch: HalfToFloatTensorConversionBenchmark
# Mode: Eager
# Name: HalfToFloatTensorConversionBenchmark_M512_N512_cpu
# Input: M: 512, N: 512, device: cpu
Forward Execution Time (us) : 1734.304
```

Benchmark for 512x512 tensor trunk with no FbGeMM integration.

```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: FloatToHalfTensorConversionBenchmark
# Mode: Eager
# Name: FloatToHalfTensorConversionBenchmark_M512_N512_cpu
# Input: M: 512, N: 512, device: cpu
Forward Execution Time (us) : 169045.724

# Benchmarking PyTorch: HalfToFloatTensorConversionBenchmark
# Mode: Eager
# Name: HalfToFloatTensorConversionBenchmark_M512_N512_cpu
# Input: M: 512, N: 512, device: cpu
Forward Execution Time (us) : 152382.494
```

Reviewed By: ngimel

Differential Revision: D23824869

fbshipit-source-id: ef044459b6c8c6e5ddded72080204c6a0ab4582c
2020-11-10 12:00:53 -08:00
Gregory Chanan
65a72cae2c Fix type promotion for trace on CPU. (#47305)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47305

Fixes https://github.com/pytorch/pytorch/issues/47127.

Ideally this would just use diag and sum (as the CUDA implementation does), but that seems to have performance problems, which I'll link in the github PR.

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D24729627

Pulled By: gchanan

fbshipit-source-id: 151b786b53e7b958f0929c803dbf8e95981c6884
2020-11-10 07:46:03 -08:00
John Kilpatrick
8aca85dbcd Add diagflat complex support (#47564)
Summary:
Adds complex numbers support for `torch.diag`
``` python
>>> import torch
>>> a = torch.ones(2, dtype=torch.complex128)
>>> torch.diagflat(a)
tensor([[1.+0.j, 0.+0.j],
        [0.+0.j, 1.+0.j]], dtype=torch.complex128)
>>> b = a.cuda()
>>> torch.diagflat(b)
tensor([[1.+0.j, 0.+0.j],
        [0.+0.j, 1.+0.j]], device='cuda:0', dtype=torch.complex128)
```

Note that automatic differentiation isn't implemented:
``` python
>>> d = torch.ones(1, dtype=torch.complex128, requires_grad=True)
>>> torch.diagflat(d)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: diag does not support automatic differentiation for outputs with complex dtype.
```

Fixes https://github.com/pytorch/pytorch/issues/47499

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47564

Reviewed By: heitorschueroff

Differential Revision: D24844467

Pulled By: anjali411

fbshipit-source-id: 9c8cb795d52880b7dcffab0c059b0f6c2e5ef151
2020-11-09 20:28:23 -08:00
Xiang Gao
f23a2a1115 The dimension being reduced should not be coalesced by TensorIterator (#47237)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/37583#issuecomment-720172838

Also add overload of `<<` for convenience of debugging.

This PR is tested by `test_reduction_split_cuda` which was added in https://github.com/pytorch/pytorch/pull/37788.

Reproduce
```python
import torch

a = torch.zeros(8, 1, 128, 1024, 1024)
a.cuda().sum(1)
```

Before

```
TensorIterator @ 0x7ffd05b10ba0 {
  ntensors() = 2
  noutputs() = 1
  shape() = [1073741824]
  strides(*) = {
    (0) = [4]
    (1) = [4]
  }
  dtype(*) = {
    (0) = Float
    (1) = Float
  }
  is_reduction_ = 1
}
```

After

```
TensorIterator @ 0x7fffc9051010 {
  ntensors() = 2
  noutputs() = 1
  shape() = [1, 1073741824]
  strides(*) = {
    (0) = [0, 4]
    (1) = [536870912, 4]
  }
  dtype(*) = {
    (0) = Float
    (1) = Float
  }
  is_reduction_ = 1
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47237

Reviewed By: ejguan

Differential Revision: D24734763

Pulled By: ngimel

fbshipit-source-id: 02bb2b15694c68f96434f55033b63b6e5ff7085b
2020-11-07 01:30:24 -08:00
Xiong Wei
f90da88d8f Add complex support for torch.mean [CUDA] (#47048)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46982

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47048

Reviewed By: heitorschueroff

Differential Revision: D24729895

Pulled By: anjali411

fbshipit-source-id: 8e948480eb87c37de810207edf909375c0380772
2020-11-06 21:29:19 -08:00
Howard Huang
451e7d3db4 Enable diag for bool Tensors (#47455)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47455

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D24772483

Pulled By: H-Huang

fbshipit-source-id: 08ea4af4352972617db3c6475943b326f36b3049
2020-11-06 21:29:17 -08:00
Howard Huang
3253ccbd9f Add bool tensor support for where (#47454)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47454

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D24772482

Pulled By: H-Huang

fbshipit-source-id: ea488aae5bf64ac20f7a5d001e8edf55eed16eaf
2020-11-06 21:26:24 -08:00
Rong Rong
5614f72534 Suppres test issues in test_torch running in sandcastle (#47474)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47474

After enabling GPU/RE, some issues turned out to be specific to those runs.

Test Plan:
```
buck test -c test.external_runner=tpx mode/opt //caffe2/test:torch_cuda -- --use-remote-execution --force-tpx --run-disabled
```

Reviewed By: malfet, janeyx99

Differential Revision: D24771578

fbshipit-source-id: 1ada79dae12c8cb6f795a0d261c60f038eee2dfb
2020-11-06 10:34:28 -08:00
Edward Yang
1aeefcdaa6 Revert D24730264: [pytorch][PR] Added CUDA support for complex input for torch.inverse
Test Plan: revert-hammer

Differential Revision:
D24730264 (33acbedace)

Original commit changeset: b9c94ec46301

fbshipit-source-id: beb9263700e9bc92685f74c37c46aa33f3b595b9
2020-11-06 07:28:14 -08:00
Ivan Yashchuk
33acbedace Added CUDA support for complex input for torch.inverse (#45034)
Summary:
`torch.inverse` now works for complex inputs on GPU.
Test cases with complex matrices are xfailed for now. For example, batched matmul does not work with complex yet.

Ref. https://github.com/pytorch/pytorch/issues/33152

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45034

Reviewed By: zou3519

Differential Revision: D24730264

Pulled By: anjali411

fbshipit-source-id: b9c94ec463012913c117278a884adeee96ea02aa
2020-11-05 16:30:11 -08:00
Heitor Schueroff
a4ba018e57 Updated docs/test for dot and vdot (#47242)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47242

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D24733771

Pulled By: heitorschueroff

fbshipit-source-id: 92e3b0e28e0565918335fa85d52abe5db9eeff57
2020-11-05 06:27:50 -08:00
Xiang Gao
f19637e6ee Expand the test of torch.addbmm and torch.baddbmm (#47079)
Summary:
This is to satisfy the request at https://github.com/pytorch/pytorch/pull/42553#issuecomment-673673914. See also https://github.com/pytorch/pytorch/pull/47124

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47079

Reviewed By: ejguan

Differential Revision: D24735356

Pulled By: ngimel

fbshipit-source-id: 122fceb4902658f350c2fd6f92455adadd0ec2a4
2020-11-04 21:11:26 -08:00
Xiang Gao
030caa190f Expand the test of torch.bmm on CUDA (#47124)
Summary:
basically https://github.com/pytorch/pytorch/pull/47070, enabled on all CI with `ci-all`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47124

Reviewed By: ejguan

Differential Revision: D24735130

Pulled By: ngimel

fbshipit-source-id: c2124562a9f9d1caf24686e5d8a1106c79366233
2020-11-04 17:29:34 -08:00
Brian Hirsh
fe17269e75 Revert "Revert D24335982: explicitly error out in comparison ops when the types don't match" (#47288)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47288

This reverts commit b3eb0c86cf.

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D24706531

Pulled By: bdhirsh

fbshipit-source-id: f3bf34ddba7882932155819251b6c7dcb5c6b56c
2020-11-04 09:27:47 -08:00
Erjia Guan
f1ac63d324 Implement copysign (#46396)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46396

Related #38349

[numpy](https://numpy.org/doc/stable/reference/generated/numpy.copysign.html?highlight=copysign#numpy.copysign)
- No in-place function
- No method
- Optional output
- Available: byte, char, bool, int, short, long, float, double, half
- Integral promoted to float
- Not available: float/double complex

`c = np.copysign(a, b)`
|  a |  b |  c | a.grad |
|----|----|----|--------|
| -1 | -1 | -1 |  1 |
| -0 | -1 | -0 |  0 |
|  0 | -1 | -0 |  0 |
|  1 | -1 | -1 | -1 |
| -1 | -0 | -1 |  1 |
| -0 | -0 |  0 |  0 |
|  0 | -0 |  0 |  0 |
|  1 | -0 | -1 | -1 |
| -1 |  0 |  1 | -1 |
| -0 |  0 |  0 |  0 |
|  0 |  0 |  0 |  0 |
|  1 |  0 |  1 |  1 |
| -1 |  1 |  1 | -1 |
| -0 |  1 |  0 |  0 |
|  0 |  1 |  0 |  0 |
|  1 |  1 |  1 |  1 |

This function becomes **non-differentiable** at `a=0` for any `b`. So, in my opinion, we may set the gradient for `a=0` to 0.
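
For example:

```python
import torch

torch.copysign(torch.tensor([1.0, -2.0, 3.0]), torch.tensor([-1.0, 1.0, -0.0]))
# tensor([-1.,  2., -3.])  magnitude from the first argument, sign from the second
```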

TODO:
- [x] test (cpu/gpu)
- [x] doc
- [x] ~kernel_vec~

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D24401366

Pulled By: ejguan

fbshipit-source-id: 3621c5ff74b185376a3705589983bb5197ab896d
2020-11-04 08:08:57 -08:00
Qi Zhou
0ec717c830 Support int32 indices and offsets in nn.EmbeddingBag (#46758)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46758

It's in general helpful to support int32 indices and offsets, especially when such tensors are large and need to be transferred to accelerator backends. Since it may not be very useful to support the combination of int32 indices and int64 offsets, here we enforce that these two must have the same type.
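
A minimal sketch of the int32 path (indices and offsets must share a dtype):

```python
import torch

bag = torch.nn.EmbeddingBag(num_embeddings=10, embedding_dim=3)
indices = torch.tensor([1, 2, 4, 5], dtype=torch.int32)
offsets = torch.tensor([0, 2], dtype=torch.int32)  # same dtype as indices
out = bag(indices, offsets)  # shape (2, 3), one pooled row per bag
```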

Test Plan: unit tests

Reviewed By: ngimel

Differential Revision: D24470808

fbshipit-source-id: 94b8a1d0b7fc9fe3d128247aa042c04d7c227f0b
2020-11-03 23:33:50 -08:00
Howard Huang
a8ef4d3f0b Provide 'out' parameter for 'tensordot' (#47278)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42102

Added an optional out parameter to the tensordot operation to allow using buffers.
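
For example:

```python
import torch

a = torch.randn(3, 4)
b = torch.randn(4, 5)
out = torch.empty(3, 5)
torch.tensordot(a, b, dims=1, out=out)  # result written into the preallocated buffer
```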

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47278

Test Plan: pytest test/test_torch.py -k tensordot -v

Reviewed By: agolynski

Differential Revision: D24706258

Pulled By: H-Huang

fbshipit-source-id: eb4bcd114795f67de3a670291034107d2826ea69
2020-11-03 15:56:00 -08:00
Xiao Wang
774b638eb6 Change largeCUDATensorTest to largeTensorTest+onlyCUDA; add a buffer to large cuda tensor test (#45332)
Summary:
Effectively, `largeCUDATensorTest` = `largeTensorTest` + `onlyCUDA`.

There was a problem where a user got an OOM for a `largeCUDATensorTest('16GB')` on a 16GB V100. This decorator was checking the total memory of a GPU device; however, in most cases we can't allocate all of the memory that a GPU has. So it would be beneficial to have a buffer on the `largeTensorTest` check for CUDA; I added a 10% buffer to it.

Definition of `largeTensorTest`

d22dd80128/torch/testing/_internal/common_device_type.py (L560-L578)

`_has_sufficient_memory`

d22dd80128/torch/testing/_internal/common_device_type.py (L535-L557)

`largeCUDATensorTest`

d22dd80128/torch/testing/_internal/common_device_type.py (L526-L532)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45332

Reviewed By: ngimel

Differential Revision: D24698690

Pulled By: mruberry

fbshipit-source-id: a77544478e45ce271f6639ea04e87700574ae307
2020-11-03 11:43:49 -08:00
Richard Zou
86151da19e Port CPU Trace from TH to ATen (#47126)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47126

Context
-------
This PR is a rebase of shihongzhi's https://github.com/pytorch/pytorch/pull/35360.
I forgot to merge it back when it was submitted so I rebased it and ran new benchmarks on it.

Benchmarks
----------

TL;DR: The op has more overhead than the TH version but for larger shapes the overhead disappears.

```
import torch

shapes = [
    [1, 1],
    [100, 100],
    [1000, 1000],
    [10000, 10000],
    [100000, 100000],
]

for shape in shapes:
    x = torch.ones(shape)
    %timeit x.trace()

Before:
1.83 µs ± 42.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
1.98 µs ± 48.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
3.19 µs ± 10.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
85.2 µs ± 700 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
1.23 ms ± 4.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

After:
2.16 µs ± 325 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
2.08 µs ± 275 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
4.45 µs ± 19.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
81.8 µs ± 766 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
1.27 ms ± 6.75 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```

Future work
-----------
Things that can be done after this PR:
- add complex tensor support
- Fix the type promotion discrepancy between CPU and CUDA

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D24683259

Pulled By: zou3519

fbshipit-source-id: f92b566ad0d58b72663ab64899d209c96edb78eb
2020-11-02 16:03:22 -08:00
Richard Zou
8054ae3e77 Add test for trace (#47125)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47125

We didn't actually have any tests for torch.trace. The tests expose a
discrepancy between the behavior of torch.trace on CPU and CUDA that
I'll file an issue for.

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D24683260

Pulled By: zou3519

fbshipit-source-id: 71dd3af62bc98c6b9b0ba2bf2923cb6d44daa640
2020-11-02 16:00:33 -08:00
Brian Hirsh
b3eb0c86cf Revert D24335982: explicitly error out in comparison ops when the types don't match
Test Plan: revert-hammer

Differential Revision:
D24335982 (60fea510a1)

Original commit changeset: 3dfb02bcb403

fbshipit-source-id: 00072f1b00e228bbbe295053091cf4a7a46f4668
2020-11-02 14:08:01 -08:00
Xiong Wei
22b3d414de Enhance the torch.pow testcase for the complex scalar base (#47101)
Summary:
Related https://github.com/pytorch/pytorch/issues/45259

This PR is to address the https://github.com/pytorch/pytorch/pull/45259#discussion_r514390664

- leverage the `make_tensor` function to generate a random tensor as the exponent, preventing an all-zeros integer exponent.
- add some special cases for the zero exponents and the `1 + 0j` base.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47101

Reviewed By: mruberry

Differential Revision: D24682430

Pulled By: zou3519

fbshipit-source-id: f559dc0ba08f37ae070036fb25a52ede17a24149
2020-11-02 13:13:15 -08:00
Brian Hirsh
60fea510a1 explicitly error out in comparison ops when the types don't match (#46399)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46399

Explicitly error out in comparison/logical ops when the dtypes of the various input/output tensors don't match. See [this comment](https://github.com/pytorch/pytorch/pull/46399#discussion_r505686406) for more details.

fixes #42660

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D24335982

Pulled By: bdhirsh

fbshipit-source-id: 3dfb02bcb403dda5bcbf5ed3eae543354ad698b2
2020-11-02 11:42:32 -08:00
Nikita Shulga
edac4060d7 Fix mul cuda for bool (#47031)
Summary:
Also, add tests for tensor by scalar multiplication / division

Fixes https://github.com/pytorch/pytorch/issues/47007

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47031

Reviewed By: walterddr

Differential Revision: D24608874

Pulled By: malfet

fbshipit-source-id: 4e15179904814d6e67228276d3d11ff1b5d15d0d
2020-10-30 10:38:32 -07:00
Heitor Schueroff
ddeacf1565 Fix median bug on discontigous tensors (#46917)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46917

fixes https://github.com/pytorch/pytorch/issues/46814

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D24633412

Pulled By: heitorschueroff

fbshipit-source-id: 54732671b298bdc2b04b13ab3a373892ee0933c3
2020-10-29 17:12:22 -07:00
Xiong Wei
74d730c0b5 implement NumPy-like functionality column_stack, row_stack (#46313)
Summary:
Related https://github.com/pytorch/pytorch/issues/38349

This PR implements `column_stack` as a composite of `torch.reshape` and `torch.hstack`, and makes `row_stack` an alias of `torch.vstack` (see the sketch after the todo list).

Todo

- [x] docs
- [x] alias pattern for `row_stack`
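
A minimal sketch of both new functions:

```python
import torch

a = torch.tensor([1, 2, 3])
b = torch.tensor([4, 5, 6])
torch.column_stack((a, b))  # tensor([[1, 4], [2, 5], [3, 6]])
torch.row_stack((a, b))     # tensor([[1, 2, 3], [4, 5, 6]]), alias of torch.vstack
```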

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46313

Reviewed By: ngimel

Differential Revision: D24585471

Pulled By: mruberry

fbshipit-source-id: 62fc0ffd43d051dc3ecf386a3e9c0b89086c1d1c
2020-10-29 12:14:39 -07:00
mfkasim91
6eaa324c9f Implement torch.igamma (#46183)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41637
This is the regularized lower incomplete gamma function, equivalent to SciPy's `gammainc` and TensorFlow's `igamma`.

cc fritzo mruberry
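
For example, matching `scipy.special.gammainc`:

```python
import torch

torch.igamma(torch.tensor([1.0]), torch.tensor([1.0]))
# tensor([0.6321]), i.e. 1 - exp(-1) for a = 1
```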

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46183

Reviewed By: gchanan

Differential Revision: D24479126

Pulled By: mruberry

fbshipit-source-id: fdf8ea289fe4ca1b408810732192411e948fcdfe
2020-10-29 11:40:18 -07:00
Sameer Deshmukh
2249a293b7 Fix segfault with torch.orgqr. (#46700)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41768

The fault was that a NULL `tau` would get passed to the LAPACK function. This PR fixes that by checking whether `tau` contains zero elements at the beginning of the function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46700

Reviewed By: albanD

Differential Revision: D24616427

Pulled By: mruberry

fbshipit-source-id: 92e8f1489b113c0ceeca6e54dea8b810a51a63c3
2020-10-29 10:34:39 -07:00
Kurt Mohler
b75b961934 Fix requires_grad arg for new_full, new_empty, new_zeros (#46486)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/36455
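
A quick sketch of the fixed behavior:

```python
import torch

base = torch.zeros(2)
t = base.new_full((2, 2), 3.0, requires_grad=True)
assert t.requires_grad  # the flag is now honored (likewise for new_empty/new_zeros)
```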

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46486

Reviewed By: gchanan

Differential Revision: D24497034

Pulled By: ezyang

fbshipit-source-id: 769a7f00f9a8f7cb77273a1193173a837ae7e32f
2020-10-28 09:34:53 -07:00
kiyosora
53839ac9d7 Fix internal assert for torch.heaviside with cuda tensor and cpu scalar tensor (#46831)
Summary:
Fixed https://github.com/pytorch/pytorch/issues/46681

```
>>> x = torch.randn(10, device='cuda')
>>> y = torch.tensor(1.)
>>> torch.heaviside(x, y)
tensor([0., 1., 0., 1., 1., 0., 1., 1., 1., 0.], device='cuda:0')
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46831

Reviewed By: navahgar

Differential Revision: D24567953

Pulled By: izdeby

fbshipit-source-id: e5fcf4355b27ce0bdf434963d01863d3b24d0bea
2020-10-27 16:47:33 -07:00
Hong Xu
bcbb6baccf Add a warning message that torch.sign would not support complex numbers (#43280)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43280

Test Plan: Imported from OSS

Reviewed By: ansley

Differential Revision: D24538769

Pulled By: anjali411

fbshipit-source-id: ab2d5283501e4c1d7d401d508e32f685add7ebb1
2020-10-26 21:13:12 -07:00
Xiang Gao
7731370e71 CUDA BFloat16 gelu, hardswish, hardsigmoid (#44997)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44997

Reviewed By: izdeby

Differential Revision: D24547748

Pulled By: ngimel

fbshipit-source-id: 34639dfe6ca41c3f59fd2af861e5e3b1bb86757a
2020-10-26 16:01:22 -07:00
Xiang Gao
99cf3b1ce4 CUDA BFloat16 signal windows (#45155)
Summary:
Looks like this op was never tested for support of different dtypes?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45155

Reviewed By: zou3519

Differential Revision: D24438839

Pulled By: ngimel

fbshipit-source-id: 103ff609e11811a0705d04520c2b97c456b623ef
2020-10-26 15:53:30 -07:00
Alexander Grund
93719440b8 Replace map(lambda constructs (#46462)
Summary:
Follow-up of https://github.com/pytorch/pytorch/issues/46461 with a similar goal

Makes them more readable and possibly faster. Care has to be taken because `list(map(...))` builds the whole list immediately, while `(x for x in xs)` is a generator expression that is evaluated lazily. This laziness is a benefit in cases where it is not required to actually create the list of values in memory (e.g. when passing to `tuple`, `extend`, or `join`).
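
For example:

```python
# list(map(...)) builds the whole list up front; a comprehension reads more clearly
squares = [x * x for x in range(5)]   # instead of list(map(lambda x: x * x, range(5)))
total = sum(x * x for x in range(5))  # generator expression: no intermediate list needed
```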

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46462

Reviewed By: zou3519

Differential Revision: D24422343

Pulled By: ezyang

fbshipit-source-id: 252e33499c92ac0b15238f2df32681dbbda2b237
2020-10-22 09:50:22 -07:00
Pearu Peterson
905ed3c840 Revised sparse tensor documentation. (#45400)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44635.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45400

Reviewed By: ezyang

Differential Revision: D24359410

Pulled By: mruberry

fbshipit-source-id: 37c691a49a7b0042c7a298e0ed1226702b097c8b
2020-10-22 02:07:54 -07:00
Xiao Wang
fe4f90c40b Cusolver inverse check info (#46625)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46557

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46625

Reviewed By: zou3519

Differential Revision: D24438577

Pulled By: ngimel

fbshipit-source-id: d00e6eb2eae4aa39ca6ecf5914fe9cf37c24b906
2020-10-21 21:46:33 -07:00
lixinyu
a651b876a7 preserve non-dense or overlapping tensor's layout in *_like functions (#46046)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46046

*_like functions are used in PyTorch to create a new tensor with the same shape as the input tensor, but we don't always preserve the layout permutation of the tensor. The current behavior is that, for a dense and non-overlapping tensor, its layout permutation is preserved. For example, passing a channels-last contiguous tensor t with shape/stride (2, 4, 3, 2)/(24, 1, 8, 4) to empty_like(t) will create a new tensor with exactly the same shape/stride as the input tensor t. However, if the input tensor is non-dense or has overlap, we simply create a contiguous tensor based on the input tensor's shape, so the layout permutation is lost.

This PR preserves the layout permutation for non-dense or overlapping tensor. The strides propagation rule that used in this PR is exactly the same as what is being used in TensorIterator.  The behavior changes are listed below:

| code                                                                                                                                                                                           | old                                                   | new                                                  |
|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------|------------------------------------------------------|
| #strided tensors<br>a=torch.randn(2,3,8)[:,:,::2].permute(2,0,1)<br>print(a.stride())<br>print(a.exp().stride())<br>print((a+a).stride())<br>out = torch.empty(0)<br>torch.add(a,a,out=out)<br>print(out.stride()) | (2, 24, 8) <br>(6, 3, 1) <br>(1, 12, 4) <br>(6, 3, 1) | (2, 24, 8)<br>(1, 12, 4)<br>(1, 12, 4)<br>(1, 12, 4) |
| #memory dense tensors<br>a=torch.randn(3,1,1).as_strided((3,1,1), (1,3,3))<br>print(a.stride(), (a+torch.randn(1)).stride())<br>a=torch.randn(2,3,4).permute(2,0,1)<br>print(a.stride())<br>print(a.exp().stride())<br>print((a+a).stride())<br>out = torch.empty(0)<br>torch.add(a,a,out=out)<br>print(out.stride())                                                                                                                                                                                               |  (1, 3, 3) (1, 1, 1)<br>(1, 12, 4)<br>(6, 3, 1)<br>(1, 12, 4)<br>(6, 3, 1)                                                       |  (1, 3, 3) (1, 3, 3)<br>(1, 12, 4)<br>(1, 12, 4)<br>(1, 12, 4)<br>(1, 12, 4) |

This is to solve the non-dense tensor layout problem in #45505

TODO:
- [x] Fix all the BC broken test cases in pytorch
- [ ] Investigate if any fb internal tests are broken

This change will cover all kinds of non-dense tensors.

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D24288970

Pulled By: glaringlee

fbshipit-source-id: 320fd4e0d1a810a12abfb1441472298c983a368d
2020-10-20 19:49:49 -07:00
Kurt Mohler
e6ed887908 Add view test for tensor_split (#46427)
Summary:
Fulfills Mike's suggestion here: https://github.com/pytorch/pytorch/pull/44868#discussion_r505095018

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46427

Reviewed By: ezyang

Differential Revision: D24355107

Pulled By: mruberry

fbshipit-source-id: bddef2f9c2c41b5c5ac47a17d5ecdda580072e99
2020-10-20 09:56:37 -07:00
Alexander Grund
5b0f400488 Replace list(map(...)) constructs by list comprehensions (#46461)
Summary:
As discussed in https://github.com/pytorch/pytorch/issues/46392 this makes the code more readable and possibly more performant.

It also fixes a bug detected by this where the argument order of `map` was confused: 030a24906e (diff-5bb26bd3a23ee3bb540aeadcc0385df2a4e48de39f87ed9ea76b21990738fe98L1537-R1537)

Fixes https://github.com/pytorch/pytorch/issues/46392

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46461

Reviewed By: ailzhang

Differential Revision: D24367015

Pulled By: ezyang

fbshipit-source-id: d55a67933cc22346b00544c9671f09982ad920e7
2020-10-19 18:42:49 -07:00
Ailing Zhang
8c629ecc9a [WIP] Move catchAll to Math (#45939)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45939

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D24165890

Pulled By: ailzhang

fbshipit-source-id: 72fe71ea95a738251b2fafc9eea4ab3831cf426b
2020-10-16 16:17:16 -07:00
Nikita Vedeneev
9300a27702 Make torch.lu support complex input on CUDA. (#45898)
Summary:
As per title. LU decomposition is used for computing determinants, and I need this functionality to implement the matrix square root. The next PR on my list is to enable `torch.det` on CUDA with complex input.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45898

Reviewed By: heitorschueroff

Differential Revision: D24306951

Pulled By: anjali411

fbshipit-source-id: 168f578fe65ae1b978617a66741aa27e72b2172b
2020-10-16 10:29:39 -07:00
Jane Xu
c99378af1b Fixing pow for special case between cuda tensors and cpu tensors and reframed test cases a tiny bit (#46320)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46037

I have now isolated the special case to be only between CUDA tensor bases and CPU tensor exponents. My previous fix was not complete: it fixed some cases but broke others. The current fix is more complete:
```
In [1]: import torch
In [2]: a=torch.randn(3)
In [3]: b=torch.tensor(2, device="cuda")
In [4]: torch.pow(a,b) #should not work and throws exception now!

In [5]: a=torch.tensor(3, device="cuda")
In [6]: b=torch.tensor(2)
In [7]: torch.pow(a,b) #should work, and now does

In [8]: a=torch.randn(3, device="cuda")
In [9]: torch.pow(a,b) # yeah, that one is fixed and still works
```

To add a test case reflecting the change, I had to modify the existing setup a little. I think it is an improvement, but I would appreciate any tips on how to make it better!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46320

Reviewed By: malfet

Differential Revision: D24306610

Pulled By: janeyx99

fbshipit-source-id: cc74c61373d1adc2892a7a31226f38895b83066a
2020-10-15 13:43:47 -07:00
Ivan Yashchuk
c1141b6f68 Added support for complex torch.pinverse (#45819)
Summary:
This PR adds support for complex-valued input for `torch.pinverse`.
Fixed cuda SVD implementation to return singular values with real dtype.
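
A minimal sketch checking the Moore-Penrose identity on a complex input:

```python
import torch

a = torch.randn(3, 2, dtype=torch.complex128)
p = torch.pinverse(a)
torch.allclose(a @ p @ a, a)  # True
```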

Fixes https://github.com/pytorch/pytorch/issues/45385.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45819

Reviewed By: heitorschueroff

Differential Revision: D24306539

Pulled By: anjali411

fbshipit-source-id: 2fe19bc630de528e0643132689e1bc5ffeaa162a
2020-10-15 12:28:22 -07:00
Xiang Gao
5ce46fbbca BFloat16 support for torch.sign (#45244)
Summary:
Added BF16 support for torch.sign on CUDA

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45244

Reviewed By: zou3519

Differential Revision: D23932304

Pulled By: izdeby

fbshipit-source-id: e50b9510ecf2337ec0288392d6950046116b2599
2020-10-15 12:23:14 -07:00
Jane Xu
ad376f1a62 trying to make pow work for tensor raised to the power of a scalar (#46185)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46037

I'm not sure this is the most performant solution, but this works:

torch.pow(cuda_tensor, 5) should work and worked before.
torch.pow(cuda_tensor, torch.tensor(5)) should work, **and now it does!**
torch.pow(cuda_tensor, torch.tensor((5,))) should NOT work; it should complain that the tensors are on different devices, and it indeed continues to complain.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46185

Reviewed By: glaringlee, malfet

Differential Revision: D24257687

Pulled By: janeyx99

fbshipit-source-id: 2daf235d62ec5886d7c153da05445c2ec71dec98
2020-10-13 10:14:36 -07:00
Erjia Guan
bed3b40523 Implement ravel (#46098)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46098

Doc:
![image](https://user-images.githubusercontent.com/68879799/95611323-ae5cf380-0a2f-11eb-9b8e-56bf79ce68af.png)
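
For example:

```python
import torch

t = torch.tensor([[1, 2], [3, 4]])
torch.ravel(t)  # tensor([1, 2, 3, 4]); copies only if the input is not contiguous
```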

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D24253213

Pulled By: ejguan

fbshipit-source-id: 42a866c902272cbe3743a9d0cb3afb9165d51c0b
2020-10-12 16:00:44 -07:00
kshitij12345
a814231616 [fix] torch.kthvalue : handle non-contiguous CUDA tensor (#45802)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45721

TODO
* [x] Test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45802

Reviewed By: ngimel

Differential Revision: D24236706

Pulled By: mruberry

fbshipit-source-id: 5a51049233efa710f9500a6f7d099c90d43062c9
2020-10-11 20:13:08 -07:00
Kurt Mohler
a0a8bc8870 Fix mistakes and increase clarity of norm documentation (#42696)
Summary:
* Removes incorrect statement that "the vector norm will be applied to the last dimension".
* More clearly describe each different combination of `p`, `ord`, and input size.
* Moves norm tests from `test/test_torch.py` to `test/test_linalg.py`
* Adds test ensuring that `p='fro'` and `p=2` give same results for mutually valid inputs (see the sketch below)
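
A minimal sketch of that equivalence (for a matrix, the default `p=2` norm over all elements equals the Frobenius norm):
```python
import torch

m = torch.randn(4, 4)
print(torch.allclose(torch.norm(m, p='fro'), torch.norm(m, p=2)))  # True
```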

Fixes https://github.com/pytorch/pytorch/issues/41388

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42696

Reviewed By: bwasti

Differential Revision: D23876862

Pulled By: mruberry

fbshipit-source-id: 36f33ccb6706d5fe13f6acf3de8ae14d7fbdff85
2020-10-10 14:12:43 -07:00
Nikita Shulga
f363a2e106 Mark top 3 slowest tests as slow (#46068)
Summary:
`TCPStoreTest.test_numkeys_delkeys` takes 5+ min (mostly in idle wait for socket timeout)
`TestDataLoader.test_proper_exit` and `TestDataLoaderPersistentWorkers.test_proper_exit` take 2.5 min each
`TestXNNPACKConv1dTransformPass.test_conv1d_with_relu_fc` takes 2 min to finish

Add option to skip reporting test classes that run for less than a second to `print_test_stats.py` and speed up `TestTorchDeviceTypeCUDA.test_matmul_45724_cuda`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46068

Reviewed By: mruberry

Differential Revision: D24208660

Pulled By: malfet

fbshipit-source-id: 780e0d8be4f0cf69ea28de79e423291a1f3349b7
2020-10-08 21:10:03 -07:00
Ivan Yashchuk
f010df35e5 Added CUDA support for complex input for QR decomposition (#45032)
Summary:
QR decomposition now works for complex inputs on GPU.
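
A minimal sketch (assuming a CUDA build with MAGMA and complex support):
```python
import torch

a = torch.randn(4, 4, dtype=torch.complex64, device="cuda")
q, r = torch.qr(a)  # complex QR on GPU now supported
print(((q.cpu() @ r.cpu()) - a.cpu()).abs().max() < 1e-3)  # True
```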

Ref. https://github.com/pytorch/pytorch/issues/33152

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45032

Reviewed By: ailzhang

Differential Revision: D24199105

Pulled By: anjali411

fbshipit-source-id: 249552b31fd713446e609b66e508ac54b817b98e
2020-10-08 13:24:21 -07:00
Heitor Schueroff de Souza
636eb18029 Fixed median nan propagation and implemented nanmedian (#45847)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45847

Original PR here https://github.com/pytorch/pytorch/pull/45084. Created this one because I was having problems with ghstack.
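
A sketch of the behavior described in the title:
```python
import torch

t = torch.tensor([1.0, float('nan'), 3.0, 2.0])
print(torch.median(t))     # tensor(nan) -- NaN now propagates
print(torch.nanmedian(t))  # tensor(2.) -- NaN values are ignored
```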

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D24136629

Pulled By: heitorschueroff

fbshipit-source-id: dd7c7540a33f6a19e1ad70ba2479d5de44abbdf9
2020-10-08 11:20:21 -07:00
Kurt Mohler
ef4817fe5a Add tensor_split function, based on numpy.array_split (#45168)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/9382
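
A quick sketch of the numpy.array_split-style semantics (uneven splits are allowed):
```python
import torch

t = torch.arange(7)
print(torch.tensor_split(t, 3))
# (tensor([0, 1, 2]), tensor([3, 4]), tensor([5, 6]))
```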

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45168

Reviewed By: ngimel

Differential Revision: D24166164

Pulled By: mruberry

fbshipit-source-id: 795459821e52885bc99623a01a2abec060995ce6
2020-10-07 23:14:48 -07:00
Xiang Gao
b2bff9e431 Workaround for cublas bug for 45724 (#46001)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45724

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46001

Reviewed By: mruberry

Differential Revision: D24184058

Pulled By: ngimel

fbshipit-source-id: 7d2bab3206ddbc10a7cae3efd9b5e253f38400a9
2020-10-07 22:38:19 -07:00
Your Name
c59c4b0d77 Fix cholesky TF32 tests (#45492)
Summary:
This test was changed one day before the TF32 tests PR landed; therefore, the fix for it was not included in that PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45492

Reviewed By: ezyang

Differential Revision: D24101876

Pulled By: ngimel

fbshipit-source-id: cb3615b2fb8acf17abe54cd18b1faec26582d6b6
2020-10-07 20:42:06 -07:00
Xiang Gao
903acc6b83 CUDA BFloat16 support of clamp, remainder, lshift, rshift (#45247)
Summary:
Add CUDA BFloat16 support for clamp, remainder, lshift, and rshift

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45247

Reviewed By: dzhulgakov

Differential Revision: D24174258

Pulled By: ngimel

fbshipit-source-id: bfcd2d1b3746bb0527d590533f3c38b9c4d0a638
2020-10-07 20:37:06 -07:00
Vaidotas Simkus
e154b36685 Standardized clamp kernels to Numpy-like implementation (#43288)
Summary:
**BC-breaking note**

For ease of exposition let a_min be the value of the "min" argument to clamp, and a_max be the value of the "max" argument to clamp.

This PR changes the behavior of torch.clamp to always compute min(max(a, a_min), a_max). torch.clamp currently computes this in its vectorized CPU specializations:

78b95b6204/aten/src/ATen/cpu/vec256/vec256_double.h (L304)

but in other places it clamps differently:

78b95b6204/aten/src/ATen/cpu/vec256/vec256_base.h (L624)

78b95b6204/aten/src/ATen/native/cuda/UnaryOpsKernel.cu (L160)

These implementations are the same when a_min < a_max, but divergent when a_min > a_max. This divergence is easily triggered:

```
t = torch.arange(200).to(torch.float)
torch.clamp(t, 4, 2)[0]
: tensor(2.)

torch.clamp(t.cuda(), 4, 2)[0]
: tensor(4., device='cuda:0')

torch.clamp(torch.tensor(0), 4, 2)
: tensor(4)
```

This PR makes the behavior consistent with NumPy's clip. C++'s std::clamp's behavior is undefined when a_min > a_max, but Clang's std::clamp will return 10 in this case (although the program, per the above comment, is in error). Python has no standard clamp implementation.
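
A sketch of the now-consistent behavior:
```python
import torch

# min(max(0, 4), 2) = 2 on every backend, matching numpy.clip(0, 4, 2)
print(torch.clamp(torch.tensor(0), 4, 2))  # tensor(2)
```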

**PR Summary**

Fixes the discrepancy between the AVX, CUDA, and base vector implementations of clamp, such that all implementations are consistent and use the min(max_vec, max(min_vec, x)) formula, making clamp equivalent to numpy.clip in all implementations.

The same fix as in https://github.com/pytorch/pytorch/issues/32587 but isolated to the kernel change only, so that the internal team can benchmark.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43288

Reviewed By: colesbury

Differential Revision: D24079453

Pulled By: mruberry

fbshipit-source-id: 67f30d2f2c86bbd3e87080b32f00e8fb131a53f7
2020-10-06 13:42:08 -07:00
KyleCZH
a9a9d0b181 Rocm skip test cases (#45782)
Summary:
Skip the following test cases for rocm (When PYTORCH_TEST_WITH_ROCM=1):
- test_reference_numerics_tan_cuda_float64 (__main__.TestUnaryUfuncsCUDA)
- test_addmv_cuda_float16 (__main__.TestTorchDeviceTypeCUDA)
- test_logspace_cuda_float64 (__main__.TestTensorCreationCUDA)
- test_gloo_backend_2gpu_module (__main__.DistributedDataParallelTest)
jeffdaily
pruthvistony

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45782

Reviewed By: VitalyFedyunin

Differential Revision: D24115581

Pulled By: xw285cornell

fbshipit-source-id: 4043a9fa19e242301b5007813c15b6b3873889c5
2020-10-05 15:12:25 -07:00
Xiang Gao
e1ff46b6e5 CUDA BFloat16 TopK (#44755)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44755

Reviewed By: mruberry

Differential Revision: D23741680

Pulled By: ngimel

fbshipit-source-id: 8fce92a26663336bcb831c72202fe2623a2ddaf0
2020-10-04 11:38:00 -07:00
Nikita Shulga
3a27fc966a Test torch.svd using complex float and double numbers (take 2) (#45795)
Summary:
Adds support for magmaSvd for complex numbers

Fixes use-after-free error in `apply_symeig`
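
A minimal sketch of the enabled decomposition (CPU shown; the CUDA path goes through MAGMA):
```python
import torch

a = torch.randn(5, 3, dtype=torch.complex128)
u, s, v = torch.svd(a)          # s has a real dtype (torch.float64)
recon = (u * s) @ v.conj().t()  # equivalent to u @ diag(s) @ v^H
print((recon - a).abs().max() < 1e-10)  # True
```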

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45795

Reviewed By: ezyang

Differential Revision: D24096955

Pulled By: malfet

fbshipit-source-id: 0d8d8492f89fe722bbd5aed3528f244245b496d0
2020-10-03 11:33:28 -07:00
Nikita Shulga
5a47a2126d Revert D24018160: [pytorch][PR] Test torch.svd using complex float and double numbers
Test Plan: revert-hammer

Differential Revision:
D24018160 (888f3c12e7)

Original commit changeset: 1b6103f5af94

fbshipit-source-id: 3040250db25995fc0d41fd0f497550dded43cad9
2020-10-02 13:33:11 -07:00
Nikita Shulga
888f3c12e7 Test torch.svd using complex float and double numbers (#45572)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45572

Reviewed By: anjali411

Differential Revision: D24018160

Pulled By: malfet

fbshipit-source-id: 1b6103f5af94e9f74b73ed23aa02c0236b199b34
2020-10-02 08:29:14 -07:00
Ivan Yashchuk
77cd8e006b Added support for complex torch.symeig (#45121)
Summary:
This PR adds support for complex-valued input for `torch.symeig`.

TODO:
- [ ] complex cuda tests raise `RuntimeError: _th_bmm_out not supported on CUDAType for ComplexFloat`
Update: Added xfailing tests for complex dtypes on CUDA. Once support for complex `bmm` is added these tests will work.

Fixes https://github.com/pytorch/pytorch/issues/45061.
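
A sketch for a Hermitian input (CPU; the CUDA tests remain xfail until complex `bmm` lands):
```python
import torch

a = torch.randn(4, 4, dtype=torch.complex64)
a = a + a.conj().t()                       # make the matrix Hermitian
w, v = torch.symeig(a, eigenvectors=True)  # eigenvalues w have a real dtype
print(w.dtype)                             # torch.float32
```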

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45121

Reviewed By: mrshenli

Differential Revision: D24049649

Pulled By: anjali411

fbshipit-source-id: 2cd11f0e47d37c6ad96ec786762f2da57f25dac5
2020-10-01 08:57:13 -07:00
Nikita Shulga
c87ff2cb90 Enable transposed tensor copy for complex types (#45487)
Summary:
This enables a special copy operator for transposed tensors with more than 3600 (60x60) elements:
417e3f85e5/aten/src/ATen/native/Copy.cpp (L19)

Steps to repro: python -c "import torch; print(torch.svd(torch.randn(61, 61, dtype=torch.complex64)))"

Fixes https://github.com/pytorch/pytorch/issues/45269

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45487

Reviewed By: anjali411

Differential Revision: D23984441

Pulled By: malfet

fbshipit-source-id: 10ce1d5f4425fb6de78e96adffd119e545b6624f
2020-09-29 19:22:05 -07:00
Mike Ruberry
b66ac1e928 Updates nonzero's as_tuple behavior to no longer warn. (#45413)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44284.

[torch.nonzero](https://pytorch.org/docs/master/generated/torch.nonzero.html?highlight=nonzero#torch.nonzero) is distinct from [numpy.nonzero](https://numpy.org/doc/1.18/reference/generated/numpy.nonzero.html?highlight=nonzero#numpy.nonzero). The former returns a tensor by default, and the latter returns a tuple of arrays. The `as_tuple` argument was added as part of an intended deprecation process to make torch.nonzero consistent with numpy.nonzero, but this was a confusing change for users. A better deprecation path would be to offer torch.argwhere consistent with [numpy.argwhere](https://numpy.org/doc/stable/reference/generated/numpy.argwhere.html?highlight=argwhere#numpy.argwhere), which is equivalent to the default torch.nonzero behavior. Once this is offered, a change to torch.nonzero should be more straightforward with less user disruption, if we decide that's the correct change to pursue.
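
For reference, the two behaviors side by side:
```python
import torch

t = torch.tensor([[0, 1], [2, 0]])
print(torch.nonzero(t))                 # tensor([[0, 1], [1, 0]]) -- 2D tensor of indices
print(torch.nonzero(t, as_tuple=True))  # (tensor([0, 1]), tensor([1, 0])) -- numpy-style tuple
```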

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45413

Reviewed By: ngimel

Differential Revision: D23975015

Pulled By: mruberry

fbshipit-source-id: b59237d0d8c2df984e952b62d0a7c247b49d84dc
2020-09-29 12:16:59 -07:00
Mike Ruberry
b2925671b6 Updates deterministic flag to throw a warning, makes docs consistent (#45410)
Summary:
Per feedback in the recent design review. Also tweaks the documentation to clarify what "deterministic" means and adds a test for the behavior.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45410

Reviewed By: ngimel

Differential Revision: D23974988

Pulled By: mruberry

fbshipit-source-id: e48307da9c90418fc6834fbd67b963ba2fe0ba9d
2020-09-29 11:17:33 -07:00
Hong Xu
15f85eea18 Support bfloat16 and complex dtypes for logical_not (#43537)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43537

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23751950

Pulled By: mruberry

fbshipit-source-id: d07ecd9aae263eb8e00928d4fc981e0d66066fbb
2020-09-29 11:00:05 -07:00
Mike Ruberry
6d37126a10 Makes rdiv consistent with div (#45407)
Summary:
In addition to making rdiv consistent with div, this PR significantly expands division testing, accounting for floor_divide actually performing truncation division, too.
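
A minimal sketch of rdiv after this change:
```python
import torch

t = torch.tensor([1, 2])
print(5 / t)                          # __rdiv__: true division -> tensor([5.0000, 2.5000])
print(torch.div(torch.tensor(5), t))  # matches div's true division
```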

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45407

Reviewed By: ngimel

Differential Revision: D23974967

Pulled By: mruberry

fbshipit-source-id: 82b46b07615603f161ab7cd1d3afaa6d886bfe95
2020-09-29 08:34:01 -07:00
Himangshu
7cde662f08 Add check for Complex Type to allow non integral alpha. (#45200)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45184
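
A sketch of what the check now permits:
```python
import torch

a = torch.tensor([1 + 1j, 2 + 2j])
b = torch.ones(2, dtype=torch.complex64)
print(torch.add(a, b, alpha=0.5))  # non-integral alpha is allowed for complex tensors
```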

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45200

Reviewed By: gchanan

Differential Revision: D23940134

Pulled By: anjali411

fbshipit-source-id: cce7b1efc22ec189ba6c83e31ce712bb34997139
2020-09-29 07:36:46 -07:00
anjali411
534f2ae582 Disable inplace abs for complex tensors (#45069)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45069

`torch.abs` is a `C -> R` function for complex input. Following the general semantics in torch, the in-place version of abs should be disabled for complex input.
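
A sketch of the new restriction:
```python
import torch

t = torch.tensor([3 + 4j])
print(torch.abs(t))  # tensor([5.]) -- real-valued result
try:
    t.abs_()         # in-place abs would change dtype C -> R, so it now raises
except RuntimeError as e:
    print(e)
```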

Test Plan: Imported from OSS

Reviewed By: glaringlee, malfet

Differential Revision: D23818397

Pulled By: anjali411

fbshipit-source-id: b23b8d0981c53ba0557018824d42ed37ec13d4e2
2020-09-28 20:33:35 -07:00
Xiong Wei
0c8a6008ac Fix torch.pow when the scalar base is a complex number (#45259)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43829

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45259

Reviewed By: gchanan

Differential Revision: D23962073

Pulled By: anjali411

fbshipit-source-id: 1b16afbb98f33fa7bc53c6ca296c5ddfcbdd2b72
2020-09-28 18:25:53 -07:00
Xiang Gao
36c3fbc9e3 CUDA BFloat Conv (non-cuDNN) (#45007)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45007

Reviewed By: zou3519

Differential Revision: D23933174

Pulled By: ngimel

fbshipit-source-id: 84eb028f09c9197993fb9981c0efb535014e5f78
2020-09-28 11:42:42 -07:00
Mike Ruberry
8bdbedd4ee Revert "Updates and simplifies nonzero as_tuple behavior"
This reverts commit 8b143771d0.
2020-09-27 20:58:42 -07:00
Mike Ruberry
8b143771d0 Updates and simplifies nonzero as_tuple behavior 2020-09-27 20:56:30 -07:00
Xiong Wei
241afc9188 Migrate addr from the TH to Aten (CPU) (#44364)
Summary:
Related https://github.com/pytorch/pytorch/issues/24507
Fixes https://github.com/pytorch/pytorch/issues/24666

This PR modernizes the CPU implementation of the vector outer product.
The existing TH implementation of `torch.addr` is migrated to ATen; `torch.ger` uses the `addr` functions to calculate the outer product.
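
For reference, `addr` computes `beta * M + alpha * outer(vec1, vec2)`; a quick sketch:
```python
import torch

M = torch.zeros(2, 3)
v1 = torch.tensor([1., 2.])
v2 = torch.tensor([1., 2., 3.])
print(torch.addr(M, v1, v2))  # the outer product of v1 and v2, since M is zero
# tensor([[1., 2., 3.],
#         [2., 4., 6.]])
```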

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44364

Reviewed By: ezyang

Differential Revision: D23866733

Pulled By: mruberry

fbshipit-source-id: 5159ea22f0e3c991123fe7c19cc9beb6ad00301e
2020-09-25 01:18:09 -07:00
Gao, Xiang
3f5eee666c Adjust TF32 tests (#44240)
Summary:
- The thresholds of some tests are bumped up. Depending on the random generator, these tests sometimes fail with errors like "0.0059 is not smaller than 0.005". I ran `test_nn.py` and `test_torch.py` 10+ times to check that these are no longer flaky.
- Add `tf32_on_and_off` to new `matrix_exp` tests.
- Disable TF32 on test suites other than `test_nn.py` and `test_torch.py`

cc: ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44240

Reviewed By: mruberry

Differential Revision: D23882498

Pulled By: ngimel

fbshipit-source-id: 44a9ec08802c93a2efaf4e01d7487222478b6df8
2020-09-24 10:25:58 -07:00
Hong Xu
b470fa4500 Add complex number support for binary logical operators (#43174)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43174

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23684425

Pulled By: mruberry

fbshipit-source-id: 4857b16e18ec4c65327136badd7f04c74e32d330
2020-09-23 23:03:00 -07:00
kshitij12345
0b6b735863 [fix] type promotion atan2 (#43466)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43360

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43466

Reviewed By: malfet

Differential Revision: D23834928

Pulled By: mruberry

fbshipit-source-id: 2e7e0b4fcf1a846efc171c275d65a6daffd3c631
2020-09-23 22:23:05 -07:00
Ailing Zhang
9db3871288 Update true_divide_out to use at::. (#45079)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45079

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D23821701

Pulled By: ailzhang

fbshipit-source-id: 562eac10faba7a503eda0029a0b026c1fb85fe1e
2020-09-23 10:50:48 -07:00
Ivan Yashchuk
5b20bf4fd9 Added support for complex input for Cholesky decomposition (#44895)
Summary:
Cholesky decomposition now works for complex inputs.
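
A minimal sketch of the enabled case:
```python
import torch

a = torch.randn(3, 3, dtype=torch.complex128)
a = a @ a.conj().t() + 3 * torch.eye(3, dtype=torch.complex128)  # Hermitian positive-definite
l = torch.cholesky(a)
print(((l @ l.conj().t()) - a).abs().max() < 1e-10)  # True
```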

Fixes https://github.com/pytorch/pytorch/issues/44637.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44895

Reviewed By: ailzhang

Differential Revision: D23841583

Pulled By: anjali411

fbshipit-source-id: 3b1f34a7af17827884540696f8771a0d5b1df478
2020-09-23 08:25:56 -07:00
Xiang Gao
144dacd8d9 CUDA BFloat16 batched gemm (#45167)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45167

Reviewed By: mruberry

Differential Revision: D23860458

Pulled By: ngimel

fbshipit-source-id: 698de424a046963a30017b58d227fa510f85bf3f
2020-09-22 22:43:52 -07:00
Hong Xu
e2b40ce793 Support BFloat16 for binary logical operators on CUDA (#42485)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42485

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23684423

Pulled By: mruberry

fbshipit-source-id: edc2b46b726361d4c8bf8a4bf4e4a09197b20428
2020-09-22 11:42:34 -07:00
anjali411
58b6ab69e5 torch.sgn for complex tensors (#39955)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39955

resolves https://github.com/pytorch/pytorch/issues/36323 by adding `torch.sgn` for complex tensors.
`torch.sgn` returns `x/abs(x)` for `x != 0` and returns `0 + 0j` for `x==0`
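
A quick sketch of that definition:
```python
import torch

t = torch.tensor([3 + 4j, 0j])
print(torch.sgn(t))  # tensor([0.6000+0.8000j, 0.0000+0.0000j])
```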

This PR doesn't test the correctness of the gradients. It will be done as a part of auditing all the ops in future once we decide the autograd behavior (JAX vs TF) and add gradchek.

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D23460526

Pulled By: anjali411

fbshipit-source-id: 70fc4e14e4d66196e27cf188e0422a335fc42f92
2020-09-22 08:24:53 -07:00
Gao, Xiang
dfb8f2d51f CUDA BFloat16 addmm, addmv (#44986)
Summary:
This PR was originally authored by slayton58. I stole his implementation and added some tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44986

Reviewed By: mruberry

Differential Revision: D23806039

Pulled By: ngimel

fbshipit-source-id: 305d66029b426d8039fab3c3e011faf2bf87aead
2020-09-21 14:28:27 -07:00
Xiang Gao
581a364437 CUDA BFloat16 unary ops part 1 (#44813)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44813

Reviewed By: mruberry

Differential Revision: D23805816

Pulled By: ngimel

fbshipit-source-id: 28c645dc31f094c8b6c3d3803f0b4152f0475a64
2020-09-21 14:22:31 -07:00
Hong Xu
49db7b59e0 For logical tests, use the dtypes decorator (#42483)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42483

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23684424

Pulled By: mruberry

fbshipit-source-id: ba7ab5c3a6eaa0c16975728200f27d164ed4f852
2020-09-19 19:01:49 -07:00
Xiao Wang
d75c402755 Add cusolver to build, rewrite MAGMA inverse with cusolver (#42403)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42265

This PR adds cusolver to the PyTorch build and enables the use of cusolver/cublas library functions for GPU `torch.inverse` on certain tensor shapes.

Specifically, when

* the tensor is two dimensional (single batch), or
* has >2 dimensions (multiple batches) and `batch_size <= 2`, or
* magma is not linked,

cusolver/cublas will be used. In other conditions, the current implementation of MAGMA will still be used.
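
A hypothetical Python sketch of this dispatch rule (names assumed; the actual logic lives in BatchLinearAlgebra.cu, linked below):
```python
def use_cusolver(t, magma_linked: bool) -> bool:
    # single batch, tiny batch, or no MAGMA available -> cusolver/cublas path
    if t.dim() == 2:
        return True
    batch_size = t.numel() // (t.size(-2) * t.size(-1))
    return batch_size <= 2 or not magma_linked
```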

8c0949ae45/aten/src/ATen/native/cuda/BatchLinearAlgebra.cu (L742-L752)

The reason for this is that for tensors with large batch_size, `cublasXgetrfBatched` and `cublasXgetriBatched` don't perform very well. For `batch_size > 1`, we launch cusolver functions in multiple streams. This lets the cusolver functions run in parallel and can greatly increase performance. When `batch_size > 2`, the parallel-launched cusolver functions are slightly slower than the current magma implementation, so we still use the magma impl there.

On CUDA 9.2, there were some numerical issues detected, so cusolver impl will not be used. The cusolver impl will also not be used on platforms other than Nvidia CUDA.

060769feaf/aten/src/ATen/native/cuda/BatchLinearAlgebraLib.h (L10-L13)

Note that there is a new heuristic used before cusolver/cublas calls here:

8c0949ae45/aten/src/ATen/native/cuda/MiscUtils.h (L113-L121)

where `use_loop_launch = true` means launching single-batch cusolver functions in parallel, and `use_loop_launch = false` means using the cublas_X_batched functions. When magma is enabled (only `batch_size <= 2` is dispatched to cusolver/cublas), the heuristic always returns `true`, and the cusolver calls are faster than small-batch_size magma calls. When magma is disabled, this adds the functionality of `torch.inverse`, which was previously disabled for all shapes (though large-batch_size cublas performance may not be as good as magma's).

Checklist:
- [X] Add benchmark, cpu, gpu-before (magma), gpu-after (cusolver)
- [X] Rewrite single inverse (ndim == 2) with cusolver
- [X] Rewrite batched inverse (ndim > 2) with cublas
- [X] Add cusolver to build
- [x] Clean up functions related to `USE_MAGMA` define guard
- [x] Workaround for non-cuda platform
- [x] Workaround for cuda 9.2
- [x] Add zero size check
- [x] Add tests

Next step:

If cusolver doesn't cause any problems in the PyTorch build, and there are no major performance regressions reported after this PR is merged, I will start porting other cusolver/cublas functions for linear algebra to improve the performance.

<details>
<summary> benchmark 73499c6 </summary>

benchmark code: https://github.com/xwang233/code-snippet/blob/master/torch.inverse/inverse-cusolver.ipynb

shape meaning:

* `[] 2 torch.float32 -> torch.randn(2, 2, dtype=torch.float32)`
* `[2] 4 torch.float32 -> torch.randn(2, 4, 4, dtype=torch.float32)`

| shape | cpu_time (ms) | gpu_time_before (magma) (ms) | gpu_time_after (ms) |
| --- | --- | --- | --- |
| [] 2 torch.float32 |  0.095 |  7.534 |  0.129  |
| [] 4 torch.float32 |  0.009 |  7.522 |  0.129  |
| [] 8 torch.float32 |  0.011 |  7.647 |  0.138  |
| [] 16 torch.float32 |  0.075 |  7.582 |  0.135  |
| [] 32 torch.float32 |  0.073 |  7.573 |  0.191  |
| [] 64 torch.float32 |  0.134 |  7.694 |  0.288  |
| [] 128 torch.float32 |  0.398 |  8.073 |  0.491  |
| [] 256 torch.float32 |  1.054 |  11.860 |  1.074  |
| [] 512 torch.float32 |  5.218 |  14.130 |  2.582  |
| [] 1024 torch.float32 |  19.010 |  18.780 |  6.936  |
| [1] 2 torch.float32 |  0.009 |  0.113 |  0.128 ***regressed |
| [1] 4 torch.float32 |  0.009 |  0.113 |  0.131 ***regressed |
| [1] 8 torch.float32 |  0.011 |  0.116 |  0.129 ***regressed |
| [1] 16 torch.float32 |  0.015 |  0.122 |  0.135 ***regressed |
| [1] 32 torch.float32 |  0.032 |  0.177 |  0.178 ***regressed |
| [1] 64 torch.float32 |  0.070 |  0.420 |  0.281  |
| [1] 128 torch.float32 |  0.328 |  0.816 |  0.490  |
| [1] 256 torch.float32 |  1.125 |  1.690 |  1.084  |
| [1] 512 torch.float32 |  4.344 |  4.305 |  2.576  |
| [1] 1024 torch.float32 |  16.510 |  16.340 |  6.928  |
| [2] 2 torch.float32 |  0.009 |  0.113 |  0.186 ***regressed |
| [2] 4 torch.float32 |  0.011 |  0.115 |  0.184 ***regressed |
| [2] 8 torch.float32 |  0.012 |  0.114 |  0.184 ***regressed |
| [2] 16 torch.float32 |  0.019 |  0.119 |  0.173 ***regressed |
| [2] 32 torch.float32 |  0.050 |  0.170 |  0.240 ***regressed |
| [2] 64 torch.float32 |  0.120 |  0.429 |  0.375  |
| [2] 128 torch.float32 |  0.576 |  0.830 |  0.675  |
| [2] 256 torch.float32 |  2.021 |  1.748 |  1.451  |
| [2] 512 torch.float32 |  9.070 |  4.749 |  3.539  |
| [2] 1024 torch.float32 |  33.655 |  18.240 |  12.220  |
| [4] 2 torch.float32 |  0.009 |  0.112 |  0.318 ***regressed |
| [4] 4 torch.float32 |  0.010 |  0.115 |  0.319 ***regressed |
| [4] 8 torch.float32 |  0.013 |  0.115 |  0.320 ***regressed |
| [4] 16 torch.float32 |  0.027 |  0.120 |  0.331 ***regressed |
| [4] 32 torch.float32 |  0.085 |  0.173 |  0.385 ***regressed |
| [4] 64 torch.float32 |  0.221 |  0.431 |  0.646 ***regressed |
| [4] 128 torch.float32 |  1.102 |  0.834 |  1.055 ***regressed |
| [4] 256 torch.float32 |  4.042 |  1.811 |  2.054 ***regressed |
| [4] 512 torch.float32 |  18.390 |  4.884 |  5.087 ***regressed |
| [4] 1024 torch.float32 |  69.025 |  19.840 |  20.000 ***regressed |

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42403

Reviewed By: ailzhang, mruberry

Differential Revision: D23717984

Pulled By: ngimel

fbshipit-source-id: 54cbd9ea72a97989cff4127089938e8a8e29a72b
2020-09-18 20:43:29 -07:00
Gao, Xiang
e255a4e1fd Enable bfloat16 random kernels on Windows (#44918)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/33793

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44918

Reviewed By: pbelevich

Differential Revision: D23777548

Pulled By: ngimel

fbshipit-source-id: 9cf13166d7deba17bc72e402b82ed0afe347cb9b
2020-09-18 15:55:32 -07:00
Xiang Gao
7bd8a6913d CUDA BFloat div, addcdiv, addcmul, mean, var (#44758)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44758

Reviewed By: mruberry

Differential Revision: D23752317

Pulled By: ngimel

fbshipit-source-id: 77992cf991f4e2b4b6839de73ea7e6ce2e1061c6
2020-09-18 11:51:11 -07:00
Xiang Gao
f5440a448a CUDA BFloat16 i0 support (#44750)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44750

Reviewed By: glaringlee

Differential Revision: D23764383

Pulled By: ngimel

fbshipit-source-id: d0e784d89241e8028f97766fdac51fe1ab4c188c
2020-09-17 13:30:10 -07:00
Xiang Gao
c189328e5d CUDA BFloat16 unary ops part 2 (#44824)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44824

Reviewed By: mruberry

Differential Revision: D23752360

Pulled By: ngimel

fbshipit-source-id: 3aadaf9db9d4e4937aa38671e8589ecbeece709d
2020-09-17 10:57:43 -07:00
vfdev
24df3b7373 torch.empty_like and torch.zeros_like raise error if any memory format is provided with sparse input (#43699) (#44058)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43699

- Changed the order of `TORCH_CHECK` and `if (options.layout() == kSparse && self.is_sparse())`
inside `empty_like` method.

- [x] Added tests

EDIT:

More details on this, and why we cannot take the zeros_like approach.
Python code :
```python
res = torch.zeros_like(input_coalesced, memory_format=torch.preserve_format)
```
is routed to
```c++
// TensorFactories.cpp
Tensor zeros_like(
    const Tensor& self,
    const TensorOptions& options,
    c10::optional<c10::MemoryFormat> optional_memory_format) {
  if (options.layout() == kSparse && self.is_sparse()) {
    auto res = at::empty({0}, options); // to be resized
    res.sparse_resize_and_clear_(
        self.sizes(), self.sparse_dim(), self.dense_dim());
    return res;
  }
  auto result = at::empty_like(self, options, optional_memory_format);
  return result.zero_();
}
```
and control flow goes straight to the `if (options.layout() == kSparse && self.is_sparse())` branch.

When we call in Python
```python
res = torch.empty_like(input_coalesced, memory_format=torch.preserve_format)
```
it is routed to
```c++
Tensor empty_like(
    const Tensor& self,
    const TensorOptions& options_,
    c10::optional<c10::MemoryFormat> optional_memory_format) {
  TORCH_CHECK(
    !(options_.has_memory_format() && optional_memory_format.has_value()),
    "Cannot set memory_format both in TensorOptions and explicit argument; please delete "
    "the redundant setter.");
  TensorOptions options =
      self.options()
          .merge_in(options_)
          .merge_in(TensorOptions().memory_format(optional_memory_format));
  TORCH_CHECK(
      !(options.layout() != kStrided &&
          optional_memory_format.has_value()),
      "memory format option is only supported by strided tensors");
  if (options.layout() == kSparse && self.is_sparse()) {
    auto result = at::empty({0}, options); // to be resized
    result.sparse_resize_and_clear_(
        self.sizes(), self.sparse_dim(), self.dense_dim());
    return result;
  }
```

cc pearu

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44058

Reviewed By: albanD

Differential Revision: D23672494

Pulled By: mruberry

fbshipit-source-id: af232274dd2b516dd6e875fc986e3090fa285658
2020-09-17 10:25:31 -07:00
Heitor Schueroff de Souza
28085cbd39 Fixed quantile nan propagation and implemented nanquantile (#44393)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44393

torch.quantile now correctly propagates NaN, and torch.nanquantile is implemented similarly to numpy.nanquantile.
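
A sketch of both behaviors:
```python
import torch

t = torch.tensor([1.0, float('nan'), 3.0])
print(torch.quantile(t, 0.5))     # tensor(nan) -- NaN propagates
print(torch.nanquantile(t, 0.5))  # tensor(2.) -- NaN is ignored
```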

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D23649613

Pulled By: heitorschueroff

fbshipit-source-id: 5201d076745ae1237cedc7631c28cf446be99936
2020-09-17 05:53:25 -07:00
Sameer Deshmukh
e18a2219dd Implement scatter reductions (CUDA), remove divide/subtract (#41977)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/33394 .

This PR does two things:
1. Implement CUDA scatter reductions with revamped GPU atomic operations.
2. Remove support for divide and subtract for CPU reduction as was discussed with ngimel .

I've also updated the docs to reflect the existence of only multiply and add.
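
A sketch of the reduction API (only 'add' and 'multiply' are supported):
```python
import torch

src = torch.ones(2, 4, device="cuda")
index = torch.tensor([[0, 1, 0, 1], [0, 1, 2, 3]], device="cuda")
out = torch.zeros(2, 4, device="cuda")
out.scatter_(1, index, src, reduce='add')  # accumulate instead of overwrite
```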

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41977

Reviewed By: mruberry

Differential Revision: D23748888

Pulled By: ngimel

fbshipit-source-id: ea643c0da03c9058e433de96db02b503514c4e9c
2020-09-16 23:25:21 -07:00
Muthu Arivoli
b61d3d8be8 Implement torch.kaiser_window (#44271)
Summary:
Related to https://github.com/pytorch/pytorch/issues/38349
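
The new window function in a one-line sketch:
```python
import torch

print(torch.kaiser_window(5, periodic=False, beta=12.0))
```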

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44271

Reviewed By: ngimel

Differential Revision: D23727972

Pulled By: mruberry

fbshipit-source-id: b4c931b2eb3a536231ad6d6c3cb66e52a13286ac
2020-09-16 20:41:31 -07:00
Xiang Gao
34331b0e0f CUDA BFloat16 and other improvements on abs (#44804)
Summary:
Not sure if ROCm supports `std::abs` today, let's see the CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44804

Reviewed By: mruberry

Differential Revision: D23748837

Pulled By: ngimel

fbshipit-source-id: ccf4e63279f3e5927a85d8d8f70ba4b8c334156b
2020-09-16 20:37:07 -07:00
Ivan Yashchuk
07d9cc80a4 Fix error code checks for triangular_solve (CPU) (#44720)
Summary:
Added missing error checks for the CPU version of `triangular_solve`.
Fixes https://github.com/pytorch/pytorch/issues/43141.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44720

Reviewed By: mruberry

Differential Revision: D23733400

Pulled By: ngimel

fbshipit-source-id: 9837e01b04a6bfd9181e08d46bf96329f292cae0
2020-09-16 13:54:45 -07:00
Natalia Gimelshein
e6101f5507 fixes lda condition for blas functions, fixes bug with beta=0 in addmv slow path (#44681)
Summary:
Per the title. If `beta=0` and the slow path was taken, `nan` and `inf` in the result were not masked, as is done in other linear algebra functions. Similarly, since `mv` is implemented as `addmv` with `beta=0`, wrong results were sometimes produced by the `mv` slow path.
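
A sketch of the fixed semantics (with `beta=0` the input must not be read, so its `nan`s must not leak into the result):
```python
import torch

m = torch.randn(2, 3)
v = torch.randn(3)
inp = torch.full((2,), float('nan'))
out = torch.addmv(inp, m, v, beta=0)  # NaNs in inp are masked; out == m @ v
print(torch.allclose(out, m @ v))     # True
```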

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44681

Reviewed By: mruberry

Differential Revision: D23708653

Pulled By: ngimel

fbshipit-source-id: e2d5d3e6f69b194eb29b327e1c6f70035f3b231c
2020-09-16 11:47:56 -07:00
Xiang Gao
ee493e1a91 CUDA bfloat compare ops (#44748)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44748

Reviewed By: mruberry

Differential Revision: D23725997

Pulled By: ngimel

fbshipit-source-id: 4f89dce3a8b8f1295ced522011b59e60d756e749
2020-09-16 11:32:14 -07:00
Xiang Gao
06036f76b6 CUDA BFloat16 pow (#44760)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44760

Reviewed By: ngimel

Differential Revision: D23727936

Pulled By: mruberry

fbshipit-source-id: 8aa89e989294347d7f593b1a63ce4a1dbfdf783e
2020-09-16 10:01:21 -07:00
Mike Ruberry
686e281bcf Updates div to perform true division (#42907)
Summary:
This PR:

- updates div to perform true division
- makes torch.true_divide an alias of torch.div

This follows on work in previous PyTorch releases that first deprecated div performing "integer" or "floor" division, then prevented it by throwing a runtime error.
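
A sketch of the new behavior:
```python
import torch

print(torch.div(torch.tensor(5), torch.tensor(2)))          # tensor(2.5000)
print(torch.true_divide(torch.tensor(5), torch.tensor(2)))  # alias of div, same result
```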

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42907

Reviewed By: ngimel

Differential Revision: D23622114

Pulled By: mruberry

fbshipit-source-id: 414c7e3c1a662a6c3c731ad99cc942507d843927
2020-09-14 15:50:38 -07:00
kshitij12345
c68a99bd61 [numpy] Add torch.exp2 (#44184)
Summary:
Reference https://github.com/pytorch/pytorch/issues/42515
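
The new op in a one-line sketch:
```python
import torch

print(torch.exp2(torch.tensor([0., 1., 3.])))  # tensor([1., 2., 8.])
```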

TODO
* [x] Add tests
* [x] Add docs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44184

Reviewed By: ngimel

Differential Revision: D23674237

Pulled By: mruberry

fbshipit-source-id: 7f4fb1900fad3051cd7fc9d3d7f6d985c5fb093c
2020-09-14 04:05:37 -07:00
kshitij12345
42f9f2f38f [fix] ReduceOps throw error if dim is repeated (#44281)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44273
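
A sketch of the new error:
```python
import torch

t = torch.randn(2, 3, 4)
try:
    t.sum(dim=(0, 0))  # repeated dim now raises instead of silently reducing
except RuntimeError as e:
    print(e)
```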

TODO

* [x] Add test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44281

Reviewed By: zhangguanheng66

Differential Revision: D23569004

Pulled By: ezyang

fbshipit-source-id: 1ca6523fef168c8ce252aeb7ca418be346b297bf
2020-09-11 15:34:06 -07:00
guol-fnst
b6b1c01adf torch.view_as_complex fails with segfault for a zero dimensional tensor (#44175)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44061

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44175

Reviewed By: colesbury

Differential Revision: D23628103

Pulled By: anjali411

fbshipit-source-id: 6f70b5824150121a1617c0757499832923ae02b5
2020-09-11 08:35:49 -07:00
Xiao Wang
b5d75dddd9 Enable lerp on half type; fix output memory format (#43541)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43541

Reviewed By: zou3519

Differential Revision: D23499592

Pulled By: ezyang

fbshipit-source-id: 9efdd6cbf0a334ec035ddd467667ba874b892549
2020-09-10 21:50:35 -07:00
Peter Bell
129d52aef2 Fix uniqueness check in movedim (#44307)
Summary:
Noticed this bug in `torch.movedim` (https://github.com/pytorch/pytorch/issues/41480). [`std::unique`](https://en.cppreference.com/w/cpp/algorithm/unique) only guarantees uniqueness for _sorted_ inputs. The current check lets through non-unique values when they aren't adjacent to each other in the list, e.g. `(0, 1, 0)` wouldn't raise an exception and instead the algorithm fails later with an internal assert.
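
A sketch of the now-caught case:
```python
import torch

t = torch.randn(2, 3, 4)
try:
    torch.movedim(t, (0, 1, 0), (0, 1, 2))  # non-adjacent repeat of 0 now raises cleanly
except RuntimeError as e:
    print(e)
```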

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44307

Reviewed By: mrshenli

Differential Revision: D23598311

Pulled By: zou3519

fbshipit-source-id: fd6cc43877c42bb243cfa85341c564b6c758a1bf
2020-09-10 17:41:07 -07:00
Mike Ruberry
c48f511c7e Moves some of TestTorchMathOps to OpInfos (#44277)
Summary:
This PR fixes three OpInfo-related bugs and moves some functions from TestTorchMathOps to be tested using the OpInfo pattern. The bugs are:

- A skip test path in test_ops.py incorrectly formatted its string argument
- Decorating the tests in common_device_type.py was incorrectly always applying decorators to the original test, not the op-specific variant of the test. This could cause the same decorator to be applied multiple times, overriding past applications.
- make_tensor was incorrectly constructing tensors in some cases

The functions moved are:

- asin
- asinh
- sinh
- acosh
- tan
- atan
- atanh
- tanh
- log
- log10
- log1p
- log2

In a follow-up PR, most or all of the remaining functions in TestTorchMathOps will be refactored as OpInfo-based tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44277

Reviewed By: mrshenli, ngimel

Differential Revision: D23617361

Pulled By: mruberry

fbshipit-source-id: edb292947769967de9383f6a84eb327f027509e0
2020-09-10 17:31:50 -07:00
Kurt Mohler
28a23fce4c Deprecate torch.norm and torch.functional.norm (#44321)
Summary:
Part of https://github.com/pytorch/pytorch/issues/24802

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44321

Reviewed By: mrshenli

Differential Revision: D23617273

Pulled By: mruberry

fbshipit-source-id: 6f88b5cb097fd0acb9cf0e415172c5a86f94e9f2
2020-09-10 01:16:41 -07:00
Elias Ellison
e0c65abd38 Revert D23568330: [pytorch][PR] Moves some of TestTorchMathOps to OpInfos
Test Plan: revert-hammer

Differential Revision:
D23568330 (a953a825cc)

Original commit changeset: 03e69fccdbfd

fbshipit-source-id: 04ec6843c5eb3c84ddf226dad0088172d9bed84d
2020-09-09 15:48:56 -07:00
mattip
758c2b96f5 BUG: make cholesky_solve_out do broadcast, error checking (#43137)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42695

Adds a test and fixes `cholesky_solve_out` to use the error checking and broadcasting from `cholesky_solve`. The test segfaults before and passes after the fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43137

Reviewed By: izdeby

Differential Revision: D23568589

Pulled By: malfet

fbshipit-source-id: 41b67ba964b55e59f1897eef0d96e0f6e1725bef
2020-09-09 11:38:36 -07:00
Mike Ruberry
a953a825cc Moves some of TestTorchMathOps to OpInfos (#44277)
Summary:
This PR fixes three OpInfo-related bugs and moves some functions from TestTorchMathOps to be tested using the OpInfo pattern. The bugs are:

- A skip test path in test_ops.py incorrectly formatted its string argument
- Decorating the tests in common_device_type.py was incorrectly always applying decorators to the original test, not the op-specific variant of the test. This could cause the same decorator to be applied multiple times, overriding past applications.
- make_tensor was incorrectly constructing tensors in some cases

The functions moved are:

- asin
- asinh
- sinh
- acosh
- tan
- atan
- atanh
- tanh
- log
- log10
- log1p
- log2

In a follow-up PR, most or all of the remaining functions in TestTorchMathOps will be refactored as OpInfo-based tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44277

Reviewed By: ngimel

Differential Revision: D23568330

Pulled By: mruberry

fbshipit-source-id: 03e69fccdbfd560217c34ce4e9a5f20e10d05a5e
2020-09-09 09:41:03 -07:00
Natalia Gimelshein
ecc6358dbe Port nonzero cuda from THC to ATen (#44259)
Summary:
1) Ports nonzero from THC to ATen
2) replaces most thrust uses with cub, to avoid synchronization and to improve performance. There is still one necessary synchronization point, communicating number of nonzero elements from GPU to CPU
3) slightly changes algorithm, now we first compute the number of nonzeros, and then allocate correct-sized output, instead of allocating full-sized output as was done before, to account for possibly all elements being non-zero
4) unfortunately, since the last transforms are still done with thrust, 2) is slightly beside the point, however it is a step towards a future without thrust
5) hard limits the number of elements in the input tensor to MAX_INT. The previous implementation allocated a Long tensor of size ndim*nelements, which would be at least 16 GB for a tensor with MAX_INT elements. It is reasonable to say that larger tensors could not be used anyway.

Benchmarking is done for tensors with approximately half non-zeros
<details><summary>Benchmarking script</summary>
<p>

```
import torch
from torch.utils._benchmark import Timer
from torch.utils._benchmark import Compare
import sys

device = "cuda"
results = []
for numel in (1024 * 128,):#, 1024 * 1024, 1024 * 1024 * 128):
    inp = torch.randint(2, (numel,), device="cuda", dtype=torch.float)
    for ndim in range(2,3):#(1,4):
        if ndim == 1:
            shape = (numel,)
        elif ndim == 2:
            shape = (1024, numel // 1024)
        else:
            shape = (1024, 128, numel // 1024 // 128)
        inp = inp.reshape(shape)
        repeats = 3
        timer = Timer(stmt="torch.nonzero(inp, as_tuple=False)", label="Nonzero", sub_label=f"number of elts {numel}",
        description = f"ndim {ndim}", globals=globals())
        for i in range(repeats):
            results.append(timer.blocked_autorange())
        print(f"\rnumel {numel} ndim {ndim}", end="")
        sys.stdout.flush()

comparison = Compare(results)
comparison.print()
```
</p>
</details>

### Results
Before:
```
[--------------------------- Nonzero ---------------------------]
                                 |  ndim 1  |   ndim 2  |   ndim 3
 1 threads: ------------------------------------------------------
       number of elts 131072     |    55.2  |     71.7  |     90.5
       number of elts 1048576    |   113.2  |    250.7  |    497.0
       number of elts 134217728  |  8353.7  |  23809.2  |  54602.3

 Times are in microseconds (us).
```
After:
```
[-------------------------- Nonzero --------------------------]
                                |  ndim 1  |  ndim 2  |  ndim 3
1 threads: ----------------------------------------------------
      number of elts 131072     |    48.6  |    79.1  |    90.2
      number of elts 1048576    |    64.7  |   134.2  |   161.1
      number of elts 134217728  |  3748.8  |  7881.3  |  9953.7

Times are in microseconds (us).

```
There's a real regression for smallish 2D tensors due to the added work of computing the number of nonzero elements; however, for other sizes there are significant gains, and memory requirements are drastically lower. Perf gains would be even larger for tensors with fewer nonzeros.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44259

Reviewed By: izdeby

Differential Revision: D23581955

Pulled By: ngimel

fbshipit-source-id: 0b99a767fd60d674003d83f0848dc550d7a363dc
2020-09-08 20:52:51 -07:00
Mike Ruberry
bb861e1d69 Ports CUDA var and std reduce all (with no out argument) to ATen, fixes var docs (#43858)
Summary:
When var and std are called without args (other than unbiased) they currently call into TH or THC. This PR:

- Removes the THC var_all and std_all functions and updates CUDA var and std to use the ATen reduction
- Fixes var's docs, which listed its arguments in the incorrect order
- Adds new tests comparing var and std with their NumPy counterparts

Performance appears to have improved as a result of this change. I ran experiments on 1D tensors, 1D tensors with every other element viewed ([::2]), 2D tensors and 2D transposed tensors. Some notable datapoints:

- torch.randn((8000, 8000))
  - var measured 0.0022215843200683594s on CUDA before the change
  - var measured 0.0020322799682617188s on CUDA after the change
- torch.randn((8000, 8000)).T
  - var measured .015128850936889648 on CUDA before the change
  - var measured 0.001912832260131836 on CUDA after the change
- torch.randn(8000 ** 2)
  - std measured 0.11031460762023926 on CUDA before the change
  - std measured 0.0017833709716796875 on CUDA after the change

Timings for var and std are, as expected, similar.

On the CPU, however, the performance change from making the analogous update was more complicated, and ngimel and I decided not to remove CPU var_all and std_all. ngimel wrote the following script that showcases how single-threaded CPU inference would suffer from this change:

```
import torch
import numpy as np
from torch.utils._benchmark import Timer
from torch.utils._benchmark import Compare
import sys
base = 8
multiplier = 1

def stdfn(a):
    meanv = a.mean()
    ac = a-meanv
    return torch.sqrt(((ac*ac).sum())/a.numel())

results = []
num_threads=1
for _ in range(7):
    size = base*multiplier
    input = torch.randn(size)

    tasks = [("torch.var(input)", "torch_var"),
             ("torch.var(input, dim=0)", "torch_var0"),
             ("stdfn(input)", "stdfn"),
             ("torch.sum(input, dim=0)", "torch_sum0")
            ]
    timers = [Timer(stmt=stmt, num_threads=num_threads, label="Index", sub_label=f"{size}",
    description=label, globals=globals()) for stmt, label in tasks]
    repeats = 3

    for i, timer in enumerate(timers * repeats):
        results.append(
            timer.blocked_autorange()
        )
        print(f"\r{i + 1} / {len(timers) * repeats}", end="")
        sys.stdout.flush()
    multiplier *=10
print()

comparison = Compare(results)

comparison.print()
```

The TH timings using this script on my devfair are:

```
[------------------------------ Index ------------------------------]
               |  torch_var  |  torch_var0  |   stdfn   |  torch_sum0
1 threads: ----------------------------------------------------------
      8        |      16.0   |       5.6    |     40.9  |       5.0
      80       |      15.9   |       6.1    |     41.6  |       4.9
      800      |      16.7   |      12.0    |     42.3  |       5.0
      8000     |      27.2   |      72.7    |     51.5  |       6.2
      80000    |     129.0   |     715.0    |    133.0  |      18.0
      800000   |    1099.8   |    6961.2    |    842.0  |     112.6
      8000000  |   11879.8   |   68948.5    |  20138.4  |    1750.3
```

and the ATen timings are:

```
[------------------------------ Index ------------------------------]
               |  torch_var  |  torch_var0  |   stdfn   |  torch_sum0
1 threads: ----------------------------------------------------------
      8              |       4.3   |       5.4    |     41.4  |       5.4
      80            |       4.9   |       5.7    |     42.6  |       5.4
      800          |      10.7   |      11.7    |     43.3  |       5.5
      8000        |      69.3   |      72.2    |     52.8  |       6.6
      80000      |     679.1   |     676.3    |    129.5  |      18.1
      800000    |    6770.8   |    6728.8    |    819.8  |     109.7
      8000000  |   65928.2   |   65538.7    |  19408.7  |    1699.4
```

which demonstrates that performance is analogous to calling the existing var and std with `dim=0` on a 1D tensor. This would be a significant performance hit. Another simple script shows the performance is mixed when using multiple threads, too:

```
import torch
import time

# Benchmarking var and std, 1D with varying sizes
base = 8
multiplier = 1

op = torch.var
reps = 1000

for _ in range(7):
    size = base * multiplier
    t = torch.randn(size)
    elapsed = 0
    for _ in range(reps):
        start = time.time()
        op(t)
        end = time.time()
        elapsed += end - start
    multiplier *= 10

    print("Size: ", size)
    print("Avg. elapsed time: ", elapsed / reps)
```

```
var cpu TH vs ATen timings

Size:  8
Avg. elapsed time:  1.7853736877441406e-05 vs 4.9788951873779295e-06 (ATen wins)
Size:  80
Avg. elapsed time:  1.7803430557250977e-05 vs 6.156444549560547e-06 (ATen wins)
Size:  800
Avg. elapsed time:  1.8569469451904296e-05 vs 1.2302875518798827e-05 (ATen wins)
Size:  8000
Avg. elapsed time:  2.8756141662597655e-05 vs. 6.97789192199707e-05 (TH wins)
Size:  80000
Avg. elapsed time:  0.00026622867584228516 vs. 0.0002447957992553711 (ATen wins)
Size:  800000
Avg. elapsed time:  0.0010556647777557374 vs 0.00030616092681884767 (ATen wins)
Size:  8000000
Avg. elapsed time:  0.009990205764770508 vs 0.002938544034957886 (ATen wins)

std cpu TH vs ATen timings

Size:  8
Avg. elapsed time:  1.6681909561157225e-05 vs. 4.659652709960938e-06 (ATen wins)
Size:  80
Avg. elapsed time:  1.699185371398926e-05 vs. 5.431413650512695e-06 (ATen wins)
Size:  800
Avg. elapsed time:  1.768803596496582e-05 vs. 1.1279821395874023e-05 (ATen wins)
Size:  8000
Avg. elapsed time:  2.7791500091552735e-05  vs 7.031106948852539e-05 (TH wins)
Size:  80000
Avg. elapsed time:  0.00018650460243225096 vs 0.00024368906021118164 (TH wins)
Size:  800000
Avg. elapsed time:  0.0010522041320800782 vs 0.0003039860725402832 (ATen wins)
Size:  8000000
Avg. elapsed time:  0.009976618766784668 vs. 0.0029211788177490234 (ATen wins)
```

These results show the TH solution still performs better than the ATen solution with default threading for some sizes.

It seems like removing CPU var_all and std_all will require an improvement in ATen reductions. https://github.com/pytorch/pytorch/issues/40570 has been updated with this information.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43858

Reviewed By: zou3519

Differential Revision: D23498981

Pulled By: mruberry

fbshipit-source-id: 34bee046c4872d11c3f2ffa1b5beee8968b22050
2020-09-06 09:40:54 -07:00
Muthu Arivoli
719d29dab5 Implement torch.i0 and torch.kaiser_window (#43132)
Summary:
Related to https://github.com/pytorch/pytorch/issues/38349
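
A quick sketch of `torch.i0`, the zeroth-order modified Bessel function of the first kind:
```python
import torch

print(torch.i0(torch.tensor([0., 1.])))  # tensor([1.0000, 1.2661])
```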

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43132

Reviewed By: smessmer

Differential Revision: D23479072

Pulled By: mruberry

fbshipit-source-id: 4fb1de44830771c6a7222cf19f7728d9ac7c043b
2020-09-05 23:11:47 -07:00
Gao, Xiang
5a0d65b06b Further expand coverage of addmm/addmv, fix 0 stride (#43980)
Summary:
- test beta=0, self=nan
- test transposes
- fixes broadcasting of addmv
- not supporting tf32 yet, will do it in future PR together with other testing fixes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43980

Reviewed By: mruberry

Differential Revision: D23507559

Pulled By: ngimel

fbshipit-source-id: 14ee39d1a0e13b9482932bede3fccb61fe6d086d
2020-09-04 23:03:23 -07:00
yangu
6cecf7ec68 Enable test_cublas_config_deterministic_error for windows (#42796)
Summary:
test_cublas_config_deterministic_error can pass on Windows, so enable it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42796

Reviewed By: seemethere

Differential Revision: D23520002

Pulled By: malfet

fbshipit-source-id: eccedbbf202b1cada795071a34e266b2c635c2cf
2020-09-04 09:52:57 -07:00
Xiang Gao
bc45c47aa3 Expand the coverage of test_addmm and test_addmm_sizes (#43831)
Summary:
- This test is very fast and very important, so it makes no sense to mark it as slowTest
- This test should also run on CUDA
- This test should check alpha and beta support
- This test should check `out=` support
- manual computation should use list instead of index_put because list is much faster
- precision for TF32 needs to be fixed. Will do it in future PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43831

Reviewed By: ailzhang

Differential Revision: D23435032

Pulled By: ngimel

fbshipit-source-id: d1b8350addf1e2fe180fdf3df243f38d95aa3f5a
2020-09-02 20:51:49 -07:00
Vasiliy Kuznetsov
6a6552576d rename _min_max to _aminmax (#44001)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44001

This is to align with the naming in numpy and in
https://github.com/pytorch/pytorch/pull/43092

Test Plan:
```
python test/test_torch.py TestTorchDeviceTypeCPU.test_aminmax_cpu_float32
python test/test_torch.py TestTorchDeviceTypeCUDA.test_aminmax_cuda_float32
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23465298

fbshipit-source-id: b599035507156cefa53942db05f93242a21c8d06
2020-09-02 18:07:55 -07:00
Vasiliy Kuznetsov
486a9fdab2 _min_max.dim: CUDA implementation (#42943)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42943

Adds a CUDA kernel for _min_max_val.dim

Test Plan:
correctness:
```
python test/test_torch.py TestTorchDeviceTypeCUDA.test_minmax_cuda_float32
```

performance: ~50% savings on a tensor representative of quantization workloads: https://gist.github.com/vkuzo/3e16c645e07a79dd66bcd50629ff5db0

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23086797

fbshipit-source-id: 04a2d310f64a388d48ab8131538dbd287900ca4a
2020-09-02 18:07:51 -07:00
Vasiliy Kuznetsov
834279f4ab _min_max_val.dim: CPU implementation (#42894)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42894

Continuing the min_max kernel implementation, this PR adds the
CPU path when a dim is specified.  Next PR will replicate for CUDA.

Note: after a discussion with ngimel, we are taking the fast path
of calculating the values only and not the indices, since that is what
is needed for quantization, and calculating indices would require support
for reductions on 4 outputs which is additional work.  So, the API
doesn't fully match `min.dim` and `max.dim`.

Flexible on the name, let me know if something else is better.

Test Plan:
correctness:
```
python test/test_torch.py TestTorchDeviceTypeCPU.test_minmax_cpu_float32
```

performance: seeing a 49% speedup on a min+max tensor with similar shapes
to what we care about for quantization observers (bench:
https://gist.github.com/vkuzo/b3f24d67060e916128a51777f9b89326). For
other shapes (more dims, different dim sizes, etc), I've noticed a
speedup as low as 20%, but we don't have a good use case to optimize
that so perhaps we can save that for a future PR.

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23086798

fbshipit-source-id: b24ce827d179191c30eccf31ab0b2b76139b0ad5
2020-09-02 18:07:47 -07:00
Vasiliy Kuznetsov
78994d165f min_max kernel: add CUDA (#42868)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42868

Adds a CUDA kernel for the _min_max function.

Note: this is a re-submit of https://github.com/pytorch/pytorch/pull/41805,
was faster to resubmit than to ressurect that one.  Thanks to durumu
for writing the original implementation!

Future PRs will add index support, docs, and hook this up to observers.

Test Plan:
```
python test/test_torch.py TestTorchDeviceTypeCUDA.test_minmax_cuda_float32
```

Basic benchmarking shows a 50% reduction in time to calculate min + max:
https://gist.github.com/vkuzo/b7dd91196345ad8bce77f2e700f10cf9

TODO

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23057766

fbshipit-source-id: 70644d2471cf5dae0a69343fba614fb486bb0891
2020-09-02 18:06:03 -07:00
anjali411
129f406062 Make torch.conj() a composite function and return self for real tensors (#43270)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43270

`torch.conj` is a very commonly used operator for complex tensors, but it's mathematically a no op for real tensors. Switching to tensorflow gradients for complex tensors (as discussed in #41857) would involve adding `torch.conj()` to the backward definitions for a lot of operators. In order to preserve autograd performance for real tensors and maintain numpy compatibility for `torch.conj`, this PR updates `torch.conj()` which behaves the same for complex tensors but performs a view/returns `self` tensor for tensors of non-complex dtypes. The documentation states that the returned tensor for a real input shouldn't be mutated. We could perhaps return an immutable tensor for this case in future when that functionality is available (zdevito ezyang ).
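
A sketch of the two paths:
```python
import torch

r = torch.randn(3)
print(torch.conj(r).data_ptr() == r.data_ptr())  # True -- no copy for real dtypes
c = torch.tensor([1 + 2j])
print(torch.conj(c))                             # tensor([1.-2.j])
```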

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D23460493

Pulled By: anjali411

fbshipit-source-id: 3b3bf0af55423b77ff2d0e29f5d2c160291ae3d9
2020-09-02 17:06:04 -07:00
kshitij12345
b6b5ebc345 Add torch.vdot (#43004)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42747
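
A quick sketch (`vdot` conjugates its first argument, unlike `dot`):
```python
import torch

a = torch.tensor([1 + 2j, 3 + 4j])
b = torch.tensor([1 + 1j, 2 + 2j])
print(torch.vdot(a, b))  # sum(conj(a) * b) = tensor(17.-3.j)
```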

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43004

Reviewed By: mruberry

Differential Revision: D23318935

Pulled By: anjali411

fbshipit-source-id: 12d4824b7cb42bb9ca703172c54ec5c663d9e325
2020-09-02 09:00:30 -07:00
Peter Bell
c88ac25679 Check for internal memory overlap in some indexing-type functions (#43423)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43423

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23298652

Pulled By: zou3519

fbshipit-source-id: c13c59aec0c6967ef0d6365d782c1f4c98c04227
2020-09-02 08:51:50 -07:00
Peter Bell
5807bb92d3 TensorIteratorConfig: Check memory overlap by default (#43422)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43422

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D23298653

Pulled By: zou3519

fbshipit-source-id: a7b66a8a828f4b35e31e8be0c07e7fe9339181f2
2020-09-02 08:50:29 -07:00
Hong Xu
4bb5d33076 is_numpy_scalar should also consider bool and complex types (#43644)
Summary:
Before this PR,

```python
import torch
import numpy as np

a = torch.tensor([1, 2], dtype=torch.bool)
c = np.array([1, 2], dtype=np.bool)
print(a[0] == c[0])

a = torch.tensor([1, 2], dtype=torch.complex64)
c = np.array([1, 2], dtype=np.complex64)
print(a[0] == c[0])

 # This case is still broken
a = torch.tensor([1 + 1j, 2 + 2j], dtype=torch.complex64)
c = np.array([1 + 1j, 2 + 2j], dtype=np.complex64)
print(a[0] == c[0])
```

outputs

```
False
False
False
```

After this PR, it outputs:

```
tensor(True)
/home/user/src/pytorch/torch/tensor.py:25: ComplexWarning: Casting complex values to real discards the imaginary part
  return f(*args, **kwargs)
tensor(True)
tensor(False)
```

Related issue: https://github.com/pytorch/pytorch/issues/43579

cc anjali411 mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43644

Reviewed By: ailzhang

Differential Revision: D23425569

Pulled By: anjali411

fbshipit-source-id: a868209376b30cea601295e54015c47803923054
2020-09-02 07:41:50 -07:00
Xiang Gao
b1f19c20d6 Run function check and out check in TestTensorDeviceOps (#43830)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43830

Reviewed By: ailzhang

Differential Revision: D23438101

Pulled By: mruberry

fbshipit-source-id: b581ce779ea2f50ea8dfec51d5469031ec7a0a67
2020-09-01 08:21:53 -07:00
kiyosora
3682df77db Implementing NumPy-like function torch.heaviside() (#42523)
Summary:
- Related with https://github.com/pytorch/pytorch/issues/38349
- Implementing the NumPy-like function `torch.heaviside()`.
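
A minimal usage sketch (the semantics follow `numpy.heaviside`: 0 below zero, `values` at zero, 1 above zero):

```python
import torch

x = torch.tensor([-1.5, 0.0, 2.0])
values = torch.tensor([0.5])
print(torch.heaviside(x, values))  # tensor([0.0000, 0.5000, 1.0000])
```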

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42523

Reviewed By: ngimel

Differential Revision: D23416743

Pulled By: mruberry

fbshipit-source-id: 9975bd9c9fa73bd0958fe9879f79a692aeb722d5
2020-08-31 15:54:56 -07:00
kshitij12345
0394c5a283 [fix] torch.multinomial : fix for 0 size dim (#43775)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43768

TO-DO:
* [x] Add test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43775

Reviewed By: ZolotukhinM

Differential Revision: D23421979

Pulled By: ngimel

fbshipit-source-id: 949fcdd30f18d17ae1c372fa6ca6a0b8d0d538ce
2020-08-31 11:57:42 -07:00
Xiang Gao
4ef12be900 Add __complex__ (#43844)
Summary:
fixes https://github.com/pytorch/pytorch/issues/43833
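
A quick sketch of what the new `__complex__` method enables:

```python
import torch

t = torch.tensor(3 + 4j)
print(complex(t))  # (3+4j) -- conversion via the new __complex__ method
```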

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43844

Reviewed By: ZolotukhinM

Differential Revision: D23422000

Pulled By: ngimel

fbshipit-source-id: ebc6a27a9b04c77c3977e6c184cefce9e817cc2f
2020-08-31 11:39:41 -07:00
Gao, Xiang
c5d0f091b2 addmm/addmv should accept complex alpha and beta (#43827)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43827

Reviewed By: malfet

Differential Revision: D23415869

Pulled By: ngimel

fbshipit-source-id: a47b76df5fb751f76d36697f5fd95c69dd3a6efe
2020-08-31 11:35:58 -07:00
Xiang Gao
a860be898e [resubmit] Add amax/amin (#43819)
Summary:
Resubmit for landing next week.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43819

Reviewed By: ngimel

Differential Revision: D23421906

Pulled By: mruberry

fbshipit-source-id: 23dd60d1e365bb1197d660c3bfad7ee07ba3e97f
2020-08-31 04:54:48 -07:00
Jeff Daily
8fb7c50250 Enable complex blas for ROCm. (#43744)
Summary:
Revert "Skips some complex tests on ROCm (https://github.com/pytorch/pytorch/issues/42759)".  This reverts commit 55b1706775.

Use new cuda_to_hip_mappings.py from https://github.com/pytorch/pytorch/issues/43004.

Fixes https://github.com/pytorch/pytorch/pull/42383#issuecomment-670771922

CC sunway513

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43744

Reviewed By: glaringlee

Differential Revision: D23391263

Pulled By: ngimel

fbshipit-source-id: ddf734cea3ba69c24f0d79cf1b87c05cdb45ec3d
2020-08-30 22:43:54 -07:00
Xiang Gao
550fb2fd52 Expand the coverage of test_blas_empty (#43822)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43822

Reviewed By: mruberry

Differential Revision: D23413359

Pulled By: ngimel

fbshipit-source-id: fcdb337e32ed2d1c791fa0762d5233b346b26d14
2020-08-29 12:13:15 -07:00
Nikita Shulga
d10056652b Enable torch.half for lt and masked_select (#43704)
Summary:
Enable testing of those options in `TestTorchDeviceTypeCPU.test_logical_cpu` and `TestTorchDeviceTypeCPU.test_masked_select_cpu_float16`
Add `view_as_real` testing for `torch.complex32` type

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43704

Reviewed By: albanD

Differential Revision: D23373070

Pulled By: malfet

fbshipit-source-id: 00f17f23b48513379a414227aea91e2d3c0dd5f9
2020-08-29 02:37:26 -07:00
Nikita Shulga
64906497cd Revert D23391941: [pytorch][PR] Implementing NumPy-like function torch.heaviside()
Test Plan: revert-hammer

Differential Revision:
D23391941 (a1eae6d158)

Original commit changeset: 7b942321a625

fbshipit-source-id: c2a7418a1fedaa9493300945c30e2392fc0d08ee
2020-08-28 19:16:58 -07:00
Kurt Mohler
68b9daa9bf Add torch.linalg.norm (#42749)
Summary:
Adds `torch.linalg.norm` function that matches the behavior of `numpy.linalg.norm`.

Additional changes:
* Add support for dimension wrapping in `frobenius_norm` and `nuclear_norm`
* Fix `out` argument behavior for `nuclear_norm`
* Fix issue where `frobenius_norm` allowed duplicates in `dim` argument
* Add `_norm_matrix`

Closes https://github.com/pytorch/pytorch/issues/24802
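
A short usage sketch of the new function (mirroring `numpy.linalg.norm`):

```python
import torch

A = torch.arange(9, dtype=torch.float).reshape(3, 3)
print(torch.linalg.norm(A))             # 2-norm of the flattened input (Frobenius norm here)
print(torch.linalg.norm(A, ord='nuc'))  # nuclear norm
print(torch.linalg.norm(A, dim=1))      # vector 2-norm of each row
```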

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42749

Reviewed By: ngimel

Differential Revision: D23336234

Pulled By: mruberry

fbshipit-source-id: f0aba3089a3a0bf856aa9c4215e673ff34228fac
2020-08-28 18:28:33 -07:00
kiyosora
a1eae6d158 Implementing NumPy-like function torch.heaviside() (#42523)
Summary:
- Related with https://github.com/pytorch/pytorch/issues/38349
- Implementing the NumPy-like function `torch.heaviside()` .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42523

Reviewed By: glaringlee

Differential Revision: D23391941

Pulled By: mruberry

fbshipit-source-id: 7b942321a62567a5fc0a3679a289f4c4c19e6134
2020-08-28 18:11:20 -07:00
Nikita Shulga
3f0120edb4 Revert D23360705: [pytorch][PR] Add amax/amin
Test Plan: revert-hammer

Differential Revision:
D23360705 (bcec8cc3f9)

Original commit changeset: 5bdeb08a2465

fbshipit-source-id: 76a9e199823c7585e55328bad0778bcd8cd49381
2020-08-28 18:01:25 -07:00
Gao, Xiang
bcec8cc3f9 Add amax/amin (#43092)
Summary:
Add a max/min operator that only return values.

## Some important decisions to discuss
| **Question**                          | **Current State** |
|---------------------------------------|-------------------|
| Expose torch.max_values to python?    | No                |
| Remove max_values and only keep amax? | Yes               |
| Should amax support named tensors?    | Not in this PR    |

## Numpy compatibility

Reference: https://numpy.org/doc/stable/reference/generated/numpy.amax.html

| Parameter                                                                                                                                                                                                                                              | PyTorch Behavior                                                                  |
|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------|
| `axis`:  None or int or tuple of ints, optional. Axis or axes along which to operate. By default, flattened input is used. If this is a tuple of ints, the maximum is selected over multiple axes, instead of a single axis or all the axes as before. | Named `dim`, behavior same as `torch.sum` (https://github.com/pytorch/pytorch/issues/29137)                                |
| `out`: ndarray, optional. Alternative output array in which to place the result. Must be of the same shape and buffer length as the expected output.                                                                                                   | Same                                                                              |
| `keepdims`: bool, optional. If this is set to True, the axes which are reduced are left in the result as dimensions with size one. With this option, the result will broadcast correctly against the input array.                                      | implemented as `keepdim`                                                          |
| `initial`: scalar, optional. The minimum value of an output element. Must be present to allow computation on empty slice.                                                                                                                              | Not implemented in this PR. Better to implement for all reductions in the future. |
| `where`: array_like of bool, optional. Elements to compare for the maximum.                                                                                                                                                                            | Not implemented in this PR. Better to implement for all reductions in the future. |

**Note from numpy:**
> NaN values are propagated, that is if at least one item is NaN, the corresponding max value will be NaN as well. To ignore NaN values (MATLAB behavior), please use nanmax.

PyTorch has the same behavior
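
To illustrate the decisions above, a minimal sketch of the values-only reduction:

```python
import torch

t = torch.tensor([[1., 4.], [3., 2.]])
print(torch.amax(t, dim=0))       # tensor([3., 4.]) -- values only, no indices
print(torch.amax(t, dim=(0, 1)))  # multi-dim reduction, like torch.sum
print(torch.amax(torch.tensor([1., float('nan')]), dim=0))  # tensor(nan) -- NaN propagates
```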

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43092

Reviewed By: ngimel

Differential Revision: D23360705

Pulled By: mruberry

fbshipit-source-id: 5bdeb08a2465836764a5a6fc1a6cc370ae1ec09d
2020-08-28 12:51:03 -07:00
Peter Bell
c177d25edf TensorIterator: Check for memory overlap in all nullary_ops (#43421)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43421

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23298654

Pulled By: zou3519

fbshipit-source-id: 71b401f6ea1e3b50b830fef650927cc5b3fb940f
2020-08-28 08:40:25 -07:00
Peter Bell
dc0722e9b7 TensorIterator: Check for memory overlap in all compare_ops (#43420)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43420

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D23298650

Pulled By: zou3519

fbshipit-source-id: 171cd17a3012880a5d248ffd0ea6942fbfb6606f
2020-08-28 08:40:22 -07:00
Peter Bell
065ebdb92f TensorIterator: Check for memory overlap in all binary_ops (#43419)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43419

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23298655

Pulled By: zou3519

fbshipit-source-id: 82e0ff308a6a7e46b4342d57ddb4c1d73745411a
2020-08-28 08:40:19 -07:00
kshitij12345
c7787f7fbf [numpy compatibility]Fix argmin/argmax when multiple max/min values (#42004)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41998
Fixes https://github.com/pytorch/pytorch/issues/22853

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42004

Reviewed By: ngimel

Differential Revision: D23049003

Pulled By: mruberry

fbshipit-source-id: a6fddbadfec4b8696730550859395ce4f0cf50d6
2020-08-28 06:42:42 -07:00
kshitij12345
01b5c06254 [fix] handle empty args in chain_matmul (#43553)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41817

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43553

Reviewed By: agolynski

Differential Revision: D23342586

Pulled By: mruberry

fbshipit-source-id: c6349f8fa9fcefcf03681d92c085a21265d1e690
2020-08-26 18:54:46 -07:00
Xiong Wei
033b7ae3ef implement NumPy-like functionality maximum, minimum (#42579)
Summary:
Related to https://github.com/pytorch/pytorch/issues/38349

Implement NumPy-like functions `maximum` and `minimum`.
The `maximum` and `minimum` functions compare input tensors element-wise, returning a new tensor with the element-wise maxima/minima.

If one of the elements being compared is a NaN, then that element is returned. Neither `maximum` nor `minimum` supports complex inputs.

This PR also re-dispatches the binary overloads of `torch.max` and `torch.min` to `torch.maximum` and `torch.minimum`.
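
A minimal sketch of the NaN-propagating, element-wise semantics described above:

```python
import torch

a = torch.tensor([1., float('nan'), 3.])
b = torch.tensor([2., 0., 1.])
print(torch.maximum(a, b))  # tensor([2., nan, 3.]) -- NaN propagates
print(torch.minimum(a, b))  # tensor([1., nan, 1.])
```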

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42579

Reviewed By: mrshenli

Differential Revision: D23153081

Pulled By: mruberry

fbshipit-source-id: 803506c912440326d06faa1b71964ec06775eac1
2020-08-26 16:56:12 -07:00
Gao, Xiang
88e35fb8bd Skip SVD tests when no lapack (#43566)
Summary:
These tests are failing on one of my systems, which does not have LAPACK.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43566

Reviewed By: ZolotukhinM

Differential Revision: D23325378

Pulled By: mruberry

fbshipit-source-id: 5d795e460df0a2a06b37182d3d4084d8c5c8e751
2020-08-26 15:58:31 -07:00
Mike Ruberry
4dc8f3be8c Creates test_tensor_creation_ops.py test suite (#43104)
Summary:
As part of our continued refactoring of test_torch.py, this takes tests for tensor creation ops like torch.eye, torch.randint, and torch.ones_like and puts them in test_tensor_creation_ops.py. There are three test classes in the new test suite: TestTensorCreation, TestRandomTensorCreation, TestLikeTensorCreation. TestViewOps and tests for construction of tensors from NumPy arrays have been left in test_torch.py. These might be refactored separately into test_view_ops.py and test_numpy_interop.py in the future.

Most of the tests ported from test_torch.py were left as is or received a signature change to make them nominally "device generic." Future work will need to review test coverage and update the tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43104

Reviewed By: ngimel

Differential Revision: D23280358

Pulled By: mruberry

fbshipit-source-id: 469325dd1a734509dd478cc7fe0413e276ffb192
2020-08-22 23:18:54 -07:00
XiaobingSuper
98307a2821 Fix bfloat16 erfinv get incorrect value problem for cpu path (#43399)
Summary:
Fix https://github.com/pytorch/pytorch/issues/43344

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43399

Reviewed By: albanD

Differential Revision: D23264789

Pulled By: pbelevich

fbshipit-source-id: 8b77c0f6ca44346e44599844fb1e172fdbd9df6c
2020-08-21 19:59:37 -07:00
Mike Ruberry
3aec1185e0 Enables bfloat16 x [float16, complex64, complex128] type promotion (#43324)
Summary:
Implements bfloat16 type promotion consistent with JAX (see https://jax.readthedocs.io/en/latest/type_promotion.html), addressing issue https://github.com/pytorch/pytorch/issues/43049.

- bfloat16 x float16 -> float32
- bfloat16 x complex64 -> complex64
- bfloat16 x complex128 -> complex128

Existing tests, after updates, are sufficient to validate the new behavior.
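
The new rules can be checked directly with `torch.result_type`, e.g.:

```python
import torch

bf16 = torch.tensor(1., dtype=torch.bfloat16)
print(torch.result_type(bf16, torch.tensor(1., dtype=torch.float16)))    # torch.float32
print(torch.result_type(bf16, torch.tensor(1j, dtype=torch.complex64)))  # torch.complex64
```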

cc xuhdev

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43324

Reviewed By: albanD

Differential Revision: D23259823

Pulled By: mruberry

fbshipit-source-id: ca9c2c7d0325faced1f884f3c37edf8fa8c8b089
2020-08-21 10:48:04 -07:00
Mike Ruberry
c64594f5cc Extends test_unary_ufunc.py with numerics, contiguity, domain tests (#42965)
Summary:
This PR:

- ports the tests in TestTorchMathOps to test_unary_ufuncs.py
- removes duplicative tests for the tested unary ufuncs from test_torch.py
- adds a new test, test_reference_numerics, that validates the behavior of our unary ufuncs vs. reference implementations on empty, scalar, 1D, and 2D tensors that are contiguous, discontiguous, and that contain extremal values, for every dtype the unary ufunc supports
- adds support for skipping tests by regex; this behavior is used to make the test suite pass on Windows, MacOS, and ROCm builds, which have a variety of issues, and on Linux builds (see https://github.com/pytorch/pytorch/issues/42952)
- adds a new OpInfo helper, `supports_dtype`, to facilitate test writing
- extends unary ufunc op info to include reference, domain, and extremal value handling information
- adds OpInfos for `torch.acos` and `torch.sin`

These improvements reveal that our testing has been incomplete on several systems, especially with larger float values and complex values, and several TODOs have been added for follow-up investigations. Luckily when writing tests that cover many ops we can afford to spend additional time crafting the tests and ensuring coverage.

Follow-up PRs will:

- refactor TestTorchMathOps into test_unary_ufuncs.py
- continue porting tests from test_torch.py to test_unary_ufuncs.py (where appropriate)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42965

Reviewed By: pbelevich

Differential Revision: D23238083

Pulled By: mruberry

fbshipit-source-id: c6be317551453aaebae9d144f4ef472f0b3d08eb
2020-08-20 22:02:00 -07:00
Nikita Shulga
e10aa47615 Fix at::native::view_as_real() for ComplexHalf Tensors (#43279)
Summary:
Add a ComplexHalf case to toValueType, which fixes the logic for how view_as_real and view_as_complex slice a complex tensor into a floating-point one, as it is used to generate tensors of random complex values; see:
018b4d7abb/aten/src/ATen/native/DistributionTemplates.h (L200)
Also add the ability to convert a Python complex object to `c10::complex<at::Half>`

Add `torch.half` and `torch.complex32` to the list of `test_randn` dtypes

Fixes https://github.com/pytorch/pytorch/issues/43143

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43279

Reviewed By: mrshenli

Differential Revision: D23230296

Pulled By: malfet

fbshipit-source-id: b4bb66c4c81dd867e72ab7c4563d73f6a4d80a44
2020-08-20 17:38:06 -07:00
Natalia Gimelshein
c8bc298d6c streamline stride propagation logic in TensorIterator (#42922)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41314 among other things.
This PR streamlines layout propagation logic in TensorIterator and removes almost all cases of channels-last hardcoding. The new rules and changes are as follows:
1) behavior of undefined `output` and defined output of the wrong (e.g. 0) size is always the same (before this PR the behavior was divergent)
2) in obvious cases (unary operation on memory-dense tensors, binary operations on memory-dense tensors with the same layout) strides are propagated (before propagation was inconsistent) (see footnote)
3) in other cases the output permutation is obtained as the inverse of the permutation that sorts the inputs by strides. Sorting is done with a comparator obeying the following rules: strides of broadcasted dimensions are set to 0, and 0 compares equal to anything. Strides of non-broadcasted dimensions (including dimensions of size `1`) participate in sorting. Precedence is given to the first input; in case of a tie in the first input, first the corresponding dimensions are considered, and if that does not indicate that a swap is needed, strides of the same dimension in subsequent inputs are considered. See changes in `reorder_dimensions` and `compute_strides`. Note that first inspecting dimensions of the first input allows us to better recover its permutation (and we select this behavior because it more reliably propagates channels-last strides), but in some rare cases it could result in a worse traversal order for the second tensor.

These rules are enough to recover previously hard-coded behavior related to channels last, so all existing tests are passing.
In general, these rules will produce intuitive results, and in most cases permutation of the full size input (in case of broadcasted operation) will be recovered, or permutation of the first input (in case of same sized inputs) will be recovered, including cases with trivial (1) dimensions. As an example of the latter, the following tensor
```
x=torch.randn(2,1,3).permute(1,0,2)
```
will produce output with the same stride (3,3,1) in binary operations with 1d tensor. Another example is a tensor of size N1H1 that has strides `H,H,1,1` when contiguous and `H, 1, 1, 1` when channels-last. The output retains these strides in binary operations when another 1d tensor is broadcasted on this one.

Footnote: for ambiguous cases where all inputs are memory dense and have the same physical layout that nevertheless can correspond to different permutations, such as e.g. NC11-sized physically contiguous tensors, regular contiguous tensor is returned, and thus permutation information of the input is lost (so for NC11 channels-last input had the strides `C, 1, C, C`, but output will have the strides `C, 1, 1, 1`). This behavior is unchanged from before and consistent with numpy, but it still makes sense to change it. The blocker for doing it currently is performance of `empty_strided`. Once we make it on par with `empty` we should be able to propagate layouts in these cases. For now, to not slow down common contiguous case, we default to contiguous.
The table below shows how in some cases current behavior loses permutation/stride information, whereas new behavior propagates permutation.
| code                                                                                                                                                                                           | old                                                   | new                                                  |
|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------|------------------------------------------------------|
| #strided tensors<br>a=torch.randn(2,3,8)[:,:,::2].permute(2,0,1)<br>print(a.stride())<br>print(a.exp().stride())<br>print((a+a).stride())<br>out = torch.empty(0)<br>torch.add(a,a,out=out)<br>print(out.stride()) | (2, 24, 8) <br>(6, 3, 1) <br>(1, 12, 4) <br>(6, 3, 1) | (2, 24, 8)<br>(1, 12, 4)<br>(1, 12, 4)<br>(1, 12, 4) |
| #memory dense tensors<br>a=torch.randn(3,1,1).as_strided((3,1,1), (1,3,3))<br>print(a.stride(), (a+torch.randn(1)).stride())<br>a=torch.randn(2,3,4).permute(2,0,1)<br>print(a.stride())<br>print(a.exp().stride())<br>print((a+a).stride())<br>out = torch.empty(0)<br>torch.add(a,a,out=out)<br>print(out.stride())                                                                                                                                                                                               |  (1, 3, 3) (1, 1, 1)<br>(1, 12, 4)<br>(6, 3, 1)<br>(1, 12, 4)<br>(6, 3, 1)                                                       |  (1, 3, 3) (1, 3, 3)<br>(1, 12, 4)<br>(1, 12, 4)<br>(1, 12, 4)<br>(1, 12, 4) |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42922

Reviewed By: ezyang

Differential Revision: D23148204

Pulled By: ngimel

fbshipit-source-id: 670fb6188c7288e506e5ee488a0e11efc8442d1f
2020-08-20 10:50:35 -07:00
Nikita Vedeneev
888ae1b3d8 Introducing Matrix exponential (#40161)
Summary:
Implements (batched) matrix exponential. Fixes [https://github.com/pytorch/pytorch/issues/9983](https://github.com/pytorch/pytorch/issues/9983).

The algorithm follows:
```
 Bader, P.; Blanes, S.; Casas, F.
 Computing the Matrix Exponential with an Optimized Taylor Polynomial Approximation.
 Mathematics 2019, 7, 1174.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40161

Reviewed By: zhangguanheng66

Differential Revision: D22951372

Pulled By: ezyang

fbshipit-source-id: aa068cb76d5cf71696b333d3e72cee287b3089e3
2020-08-18 14:15:10 -07:00
anjali411
aab66602c4 Add torch.dot for complex tensors (#42745)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42745

Test Plan: Imported from OSS

Reviewed By: izdeby

Differential Revision: D23056382

Pulled By: anjali411

fbshipit-source-id: c97f15e057095f78069844dbe0299c14104d2fce
2020-08-17 09:05:41 -07:00
Xiaomeng Yang
4ae832e106 Optimize SiLU (Swish) op in PyTorch (#42976)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42976

Optimize SiLU (Swish) op in PyTorch.

Some benchmark result

input = torch.rand(1024, 32768, dtype=torch.float, device="cpu")
forward: 221ms -> 133ms
backward: 600ms -> 170ms

input = torch.rand(1024, 32768, dtype=torch.double, device="cpu")
forward: 479ms -> 297ms
backward: 1438ms -> 387ms

input = torch.rand(8192, 32768, dtype=torch.float, device="cuda")
forward: 24.34ms -> 9.83ms
backward: 97.05ms -> 29.03ms

input = torch.rand(4096, 32768, dtype=torch.double, device="cuda")
forward: 44.24ms -> 30.15ms
backward: 126.21ms -> 49.68ms

Test Plan: buck test mode/dev-nosan //caffe2/test:nn -- "SiLU"

Reviewed By: houseroad

Differential Revision: D23093593

fbshipit-source-id: 1ba7b95d5926c4527216ed211a5ff1cefa3d3bfd
2020-08-16 13:21:57 -07:00
Muthu Arivoli
5bcf9b017a Implement hstack, vstack, dstack (#42799)
Summary:
Related to https://github.com/pytorch/pytorch/issues/38349
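
A short sketch of the three new ops (semantics follow their NumPy counterparts):

```python
import torch

a = torch.tensor([1, 2, 3])
b = torch.tensor([4, 5, 6])
print(torch.hstack((a, b)))        # tensor([1, 2, 3, 4, 5, 6])
print(torch.vstack((a, b)).shape)  # torch.Size([2, 3])
print(torch.dstack((a, b)).shape)  # torch.Size([1, 3, 2])
```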

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42799

Reviewed By: izdeby

Differential Revision: D23140704

Pulled By: mruberry

fbshipit-source-id: 6a36363562c50d0abce87021b84b194bb32825fb
2020-08-15 20:39:14 -07:00
ita
91b090ceaf Add polygamma where n >= 2 (#42499)
Summary:
https://github.com/pytorch/pytorch/issues/40980

I had a few questions while implementing the polygamma function,
so I am making this PR prior to completing it.

1. Some code blocks are brought in from the cephes library (and I did the same):
```
/*
 * The following function comes with the following copyright notice.
 * It has been released under the BSD license.
 *
 * Cephes Math Library Release 2.8:  June, 2000
 * Copyright 1984, 1987, 1992, 2000 by Stephen L. Moshier
 */
```
Is it okay for me to use cephes code with this same copyright notice (which is already in the PyTorch codebase)?

2. There is no linting in the internal ATen library (as far as I know; I read https://github.com/pytorch/pytorch/blob/master/CONTRIBUTING.md).
How do I make sure my code follows the appropriate guidelines of this library?

3. Actually, there are already digamma and trigamma functions.
digamma is still needed; however, trigamma becomes redundant once polygamma is added.
Is it okay for trigamma to stay, or should it be removed?

By the way, the CPU version now works fine with third-order polygamma (which is what we need to play with variational inference on beta/gamma distributions), and I'm going to finish the GPU version soon.
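
A small usage sketch of an order that was previously unsupported:

```python
import torch

x = torch.tensor([1.0])
# polygamma(n, x) is the n-th derivative of the digamma function;
# psi''(1) = -2 * zeta(3) ~ -2.4041
print(torch.polygamma(2, x))  # tensor([-2.4041])
```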

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42499

Reviewed By: gchanan

Differential Revision: D23110016

Pulled By: albanD

fbshipit-source-id: 246f4c2b755a99d9e18a15fcd1a24e3df5e0b53e
2020-08-14 17:00:24 -07:00
Muthu Arivoli
b8102b1550 Implement torch.nextafter (#42580)
Summary:
Related to https://github.com/pytorch/pytorch/issues/38349.
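
A minimal sketch of the new op (the next representable float after 1.0, toward 2.0, differs from 1.0 by machine epsilon):

```python
import torch

out = torch.nextafter(torch.tensor([1.0]), torch.tensor([2.0]))
print(out - 1.0)                       # tensor([1.1921e-07])
print(torch.finfo(torch.float32).eps)  # 1.1920928955078125e-07
```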

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42580

Reviewed By: smessmer

Differential Revision: D23012260

Pulled By: mruberry

fbshipit-source-id: ce82a63c4ad407ec6ffea795f575ca7c58cd6137
2020-08-14 00:35:30 -07:00
Will Gan
e4373083a2 torch.complex and torch.polar (#39617)
Summary:
For https://github.com/pytorch/pytorch/issues/35312 and https://github.com/pytorch/pytorch/issues/38458#issuecomment-636066256.
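
A short sketch of the two new factory functions:

```python
import math
import torch

real = torch.tensor([1., 2.])
imag = torch.tensor([3., 4.])
print(torch.complex(real, imag))  # tensor([1.+3.j, 2.+4.j])

# polar builds abs * exp(i * angle)
print(torch.polar(torch.tensor([1., 2.]), torch.tensor([math.pi / 2, math.pi])))
# ~tensor([0.+1.j, -2.+0.j]) (up to floating-point error)
```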

Pull Request resolved: https://github.com/pytorch/pytorch/pull/39617

Reviewed By: zhangguanheng66

Differential Revision: D23083926

Pulled By: anjali411

fbshipit-source-id: 1874378001efe2ff286096eaf1e92afe91c55b29
2020-08-14 00:30:11 -07:00
Natalia Gimelshein
f373cda021 Revert D22994446: [pytorch][PR] CUDA reduction: allow outputs to have different strides
Test Plan: revert-hammer

Differential Revision:
D22994446 (7f3f5020e6)

Original commit changeset: cc60beebad2e

fbshipit-source-id: f4635deac386db0c161f910760cace09f15a1ff9
2020-08-12 17:05:04 -07:00
Muthu Arivoli
92885ebe16 Implement hypot (#42291)
Summary:
Related to https://github.com/pytorch/pytorch/issues/38349
Closes https://github.com/pytorch/pytorch/issues/22764
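
A minimal usage sketch (element-wise Euclidean distance, sqrt(x**2 + y**2)):

```python
import torch

print(torch.hypot(torch.tensor([3., 5.]), torch.tensor([4., 12.])))  # tensor([ 5., 13.])
```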

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42291

Reviewed By: malfet

Differential Revision: D22951859

Pulled By: mruberry

fbshipit-source-id: d0118f2b6437e5c3f775f699ec46e946a8da50f0
2020-08-12 13:18:26 -07:00
Heitor Schueroff de Souza
62bd2ddec7 Implemented non-named version of unflatten (#42563)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42563

Moved logic for non-named unflatten from python nn module to aten/native to be reused by the nn module later. Fixed some inconsistencies with doc and code logic.

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D23030301

Pulled By: heitorschueroff

fbshipit-source-id: 7c804ed0baa5fca960a990211b8994b3efa7c415
2020-08-12 13:14:28 -07:00
Xiang Gao
7f3f5020e6 CUDA reduction: allow outputs to have different strides (#42649)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42364

Benchmark:
https://github.com/zasdfgbnm/things/blob/master/2020Q3/min-benchmark.ipynb
```python
import torch

print(torch.__version__)
print()

for i in range(100):
    torch.randn(1000, device='cuda')

for e in range(7, 15):
    N = 2 ** e
    input_ = torch.randn(N, N, device='cuda')
    torch.cuda.synchronize()
    %timeit input_.min(dim=0); torch.cuda.synchronize()
    input_ = torch.randn(N, N, device='cuda').t()
    torch.cuda.synchronize()
    %timeit input_.min(dim=0); torch.cuda.synchronize()
    print()
```
Before
```
1.7.0a0+5d7c3f9

21.7 µs ± 1.67 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
20.6 µs ± 773 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

22.5 µs ± 294 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
20.2 µs ± 250 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

26.4 µs ± 67 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
20.9 µs ± 316 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

33 µs ± 474 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
21.1 µs ± 218 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

84.2 µs ± 691 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
50.3 µs ± 105 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

181 µs ± 2.36 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
145 µs ± 149 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

542 µs ± 753 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
528 µs ± 10.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

2.04 ms ± 9.74 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.01 ms ± 22.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
After
```
1.7.0a0+9911817

21.4 µs ± 695 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
20.6 µs ± 989 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

22.4 µs ± 153 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
20.5 µs ± 58.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

26.6 µs ± 147 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
20.9 µs ± 675 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

35.4 µs ± 560 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
21.7 µs ± 1.17 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)

86.5 µs ± 1.99 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
52.2 µs ± 1.57 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

195 µs ± 2.97 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
153 µs ± 4.46 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

550 µs ± 7.72 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
527 µs ± 3.04 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

2.05 ms ± 7.87 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2 ms ± 4.93 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42649

Reviewed By: ezyang

Differential Revision: D22994446

Pulled By: ngimel

fbshipit-source-id: cc60beebad2e04c26ebf3ca702a6cb05846522c9
2020-08-12 13:09:36 -07:00
Kurt Mohler
2f1baf6c25 Fix coding style and safety issues in CuBLAS nondeterministic unit test (#42627)
Summary:
Addresses some comments that were left unaddressed after PR https://github.com/pytorch/pytorch/issues/41377 was merged:

* Use `check_output` instead of `Popen` to run each subprocess sequentially
* Use f-strings rather than old python format string style
* Provide environment variables to subprocess through the `env` kwarg
* Check for correct error behavior inside the subprocess, and raise another error if incorrect. Then the main process fails the test if any error is raised

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42627

Reviewed By: malfet

Differential Revision: D22969231

Pulled By: ezyang

fbshipit-source-id: 38d5f3f0d641c1590a93541a5e14d90c2e20acec
2020-08-12 08:54:28 -07:00
kshitij12345
ab0a04dc9c Add torch.nansum (#38628)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/38349
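
A minimal sketch of the new reduction (NaNs are treated as zero):

```python
import torch

t = torch.tensor([1., 2., float('nan'), 4.])
print(torch.nansum(t))  # tensor(7.)
print(torch.sum(t))     # tensor(nan), for comparison
```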

Pull Request resolved: https://github.com/pytorch/pytorch/pull/38628

Reviewed By: VitalyFedyunin

Differential Revision: D22860549

Pulled By: mruberry

fbshipit-source-id: 87fcbfd096d83fc14b3b5622f2301073729ce710
2020-08-11 22:26:04 -07:00
Kurt Mohler
5edd9aa95a Fix manual seed to unpack unsigned long (#42206)
Summary:
`torch.manual_seed` was unpacking its argument as an `int64_t`. This fix changes it to a `uint64_t`.
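
A quick sketch of what the fix allows (a hedged example: seeds are now unpacked as unsigned 64-bit values):

```python
import torch

# Previously this overflowed the signed int64 unpack; it is now accepted.
torch.manual_seed(2**64 - 1)
```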

Fixes https://github.com/pytorch/pytorch/issues/33546

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42206

Reviewed By: ezyang

Differential Revision: D22822098

Pulled By: albanD

fbshipit-source-id: 97c978139c5cb2d5b62cc2c963550c758ee994f7
2020-08-11 18:05:34 -07:00
Heitor Schueroff de Souza
c660d2a9ae Initial quantile operator implementation (#42755)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42755

Attempting to land quantile again after being landed here https://github.com/pytorch/pytorch/pull/39417 and reverted here https://github.com/pytorch/pytorch/pull/41616.

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D23030338

Pulled By: heitorschueroff

fbshipit-source-id: 124a86eea3aee1fdaa0aad718b04863935be26c7
2020-08-11 12:08:17 -07:00
Kurt Mohler
2c8cbd78bd Fix orgqr input size conditions (#42825)
Summary:
* Adds support for `n > k`
* Throw error if `m >= n >= k` is not true
* Updates existing error messages to match argument names shown in public docs
* Adds error tests

Fixes https://github.com/pytorch/pytorch/issues/41776

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42825

Reviewed By: smessmer

Differential Revision: D23038916

Pulled By: albanD

fbshipit-source-id: e9bec7b11557505e10e0568599d0a6cb7e12ab46
2020-08-11 10:17:39 -07:00
Kurt Mohler
42b4a7132e Raise error if at::native::embedding is given 0-D weight (#42550)
Summary:
Previously, `at::native::embedding` implicitly assumed that the `weight` argument would be 1-D or greater. Given a 0-D tensor, it would segfault. This change makes it throw a RuntimeError instead.

Fixes https://github.com/pytorch/pytorch/issues/41780

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42550

Reviewed By: smessmer

Differential Revision: D23040744

Pulled By: albanD

fbshipit-source-id: d3d315850a5ee2d2b6fcc0bdb30db2b76ffffb01
2020-08-11 08:26:45 -07:00
Mike Ruberry
87970b70a7 Adds 'clip' alias for clamp (#42770)
Summary:
Per title. Also updates our guidance for adding aliases to clarify interned_string and method_test requirements. The alias is tested by extending test_clamp to also test clip.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42770

Reviewed By: ngimel

Differential Revision: D23020655

Pulled By: mruberry

fbshipit-source-id: f1d8e751de9ac5f21a4f95d241b193730f07b5dc
2020-08-09 02:46:02 -07:00
Mike Ruberry
55b1706775 Skips some complex tests on ROCm (#42759)
Summary:
Fixes ROCm build on OSS master.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42759

Reviewed By: ngimel

Differential Revision: D23011560

Pulled By: mruberry

fbshipit-source-id: 3339ecbd5a0ca47aede6f7c3f84739af1ac820d5
2020-08-07 16:12:32 -07:00
anjali411
c9346ad3b8 [CPU] Added torch.bmm for complex tensors (#42383)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42383

Test Plan - Updated existing tests to run for complex dtypes as well.

Also added tests for `torch.addmm`, `torch.badmm`

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D22960339

Pulled By: anjali411

fbshipit-source-id: 0805f21caaa40f6e671cefb65cef83a980328b7d
2020-08-07 10:04:20 -07:00
Kurt Mohler
df7c059428 Throw error if torch.set_deterministic(True) is called with nondeterministic CuBLAS config (#41377)
Summary:
For CUDA >= 10.2, the `CUBLAS_WORKSPACE_CONFIG` environment variable must be set to either `:4096:8` or `:16:8` to ensure deterministic CUDA stream usage. This PR adds some logic inside `torch.set_deterministic()` to raise an error if this environment variable is not set properly and CUDA >= 10.2.

Issue https://github.com/pytorch/pytorch/issues/15359

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41377

Reviewed By: malfet

Differential Revision: D22758459

Pulled By: ezyang

fbshipit-source-id: 4b96f1e9abf85d94ba79140fd927bbd0c05c4522
2020-08-05 12:42:24 -07:00
Ivan Yashchuk
b9e68e03c4 Fix the bug in THCTensor_(baddbmm) and ATen's addmm_cuda for strided views input (#42425)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42418.

The problem was that the non-contiguous batched matrices were passed to `gemmStridedBatched`.

The following code fails on master and works with the proposed patch:
```python
import torch
x = torch.tensor([[1., 2, 3], [4., 5, 6]], device='cuda:0')
c = torch.as_strided(x, size=[2, 2, 2], stride=[3, 1, 1])
torch.einsum('...ab,...bc->...ac', c, c)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42425

Reviewed By: glaringlee

Differential Revision: D22925266

Pulled By: ngimel

fbshipit-source-id: a72d56d26c7381b7793a047d76bcc5bd45a9602c
2020-08-04 16:11:07 -07:00
Natalia Gimelshein
ec898b1ab5 fix discontiguous inputs/outputs for cummin/cummax (#42507)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42363

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42507

Reviewed By: mruberry

Differential Revision: D22917876

Pulled By: ngimel

fbshipit-source-id: 05f3f4a55bcddf6a853552184c9fafcef8d36270
2020-08-04 10:12:07 -07:00
Nikita Shulga
d21e345ef0 Fix segfault in THPGenerator_dealloc (take 2) (#42510)
Summary:
A segfault happens when one tries to deallocate an uninitialized generator.
Make `THPGenerator_dealloc` UBSAN-safe by moving the implicit cast in the struct definition to a reinterpret_cast.

Add `TestTorch.test_invalid_generator_raises` that validates that Generator created on invalid device is handled correctly

Fixes https://github.com/pytorch/pytorch/issues/42281

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42510

Reviewed By: pbelevich

Differential Revision: D22917469

Pulled By: malfet

fbshipit-source-id: 5eaa68eef10d899ee3e210cb0e1e92f73be75712
2020-08-04 08:06:08 -07:00
Nikita Shulga
0cb86afd72 Revert D22908795: [pytorch][PR] Fix segfault in THPGenerator_dealloc
Test Plan: revert-hammer

Differential Revision:
D22908795 (d3acfe3ba8)

Original commit changeset: c5b6a35db381

fbshipit-source-id: c7559c382fced23cef683c8c90cff2d6012801ec
2020-08-03 21:03:44 -07:00
Natalia Gimelshein
7a5708832f fix masked_select for discontiguous outputs (#41841)
Summary:
This fixes https://github.com/pytorch/pytorch/issues/41473 for discontiguous input, mask and out. Tests to follow. Reverting https://github.com/pytorch/pytorch/issues/33269 is not a great solution because I'm told masked_select was needed for printing complex tensors.
cc gchanan , zou3519, ezyang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41841

Reviewed By: mruberry

Differential Revision: D22706943

Pulled By: ngimel

fbshipit-source-id: 413d7fd3f3308b184de04fd56b8a9aaabcad22fc
2020-08-03 18:43:45 -07:00
Nikita Shulga
d3acfe3ba8 Fix segfault in THPGenerator_dealloc (#42490)
Summary:
A segfault happens when one tries to deallocate an uninitialized generator.

Add `TestTorch.test_invalid_generator_raises` that validates that Generator created on invalid device is handled correctly

Fixes https://github.com/pytorch/pytorch/issues/42281

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42490

Reviewed By: seemethere

Differential Revision: D22908795

Pulled By: malfet

fbshipit-source-id: c5b6a35db381738c0fc984aa54e5cab5ef2cbb76
2020-08-03 16:28:34 -07:00
Hong Xu
34025eb826 Vectorize arange (#38697)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38697

Benchmark (gcc 8.3, Debian Buster, turbo off, Release build, Intel(R)
Xeon(R) E-2136, Parallelization using OpenMP):

```python
import timeit
for dtype in ('torch.double', 'torch.float', 'torch.uint8', 'torch.int8', 'torch.int16', 'torch.int32', 'torch.int64'):
    for n, t in [(40_000, 50000),
                (400_000, 5000)]:
        print(f'torch.arange(0, {n}, dtype={dtype}) for {t} times')
        print(timeit.timeit(f'torch.arange(0, {n}, dtype={dtype})', setup=f'import torch', number=t))
```

Before:

```
torch.arange(0, 40000, dtype=torch.double) for 50000 times
1.587841397995362
torch.arange(0, 400000, dtype=torch.double) for 5000 times
0.47885190199303906
torch.arange(0, 40000, dtype=torch.float) for 50000 times
1.5519152240012772
torch.arange(0, 400000, dtype=torch.float) for 5000 times
0.4733216500026174
torch.arange(0, 40000, dtype=torch.uint8) for 50000 times
1.426058754004771
torch.arange(0, 400000, dtype=torch.uint8) for 5000 times
0.43596178699226584
torch.arange(0, 40000, dtype=torch.int8) for 50000 times
1.4289699140063021
torch.arange(0, 400000, dtype=torch.int8) for 5000 times
0.43451592899509706
torch.arange(0, 40000, dtype=torch.int16) for 50000 times
0.5714442400058033
torch.arange(0, 400000, dtype=torch.int16) for 5000 times
0.14837959500437137
torch.arange(0, 40000, dtype=torch.int32) for 50000 times
0.5964003179979045
torch.arange(0, 400000, dtype=torch.int32) for 5000 times
0.15676555599202402
torch.arange(0, 40000, dtype=torch.int64) for 50000 times
0.8390555799996946
torch.arange(0, 400000, dtype=torch.int64) for 5000 times
0.23184613398916554
```

After:

```
torch.arange(0, 40000, dtype=torch.double) for 50000 times
0.6895066159922862
torch.arange(0, 400000, dtype=torch.double) for 5000 times
0.16820953000569716
torch.arange(0, 40000, dtype=torch.float) for 50000 times
1.3640095089940587
torch.arange(0, 400000, dtype=torch.float) for 5000 times
0.39255041000433266
torch.arange(0, 40000, dtype=torch.uint8) for 50000 times
0.3422072059911443
torch.arange(0, 400000, dtype=torch.uint8) for 5000 times
0.0605111670010956
torch.arange(0, 40000, dtype=torch.int8) for 50000 times
0.3449254590086639
torch.arange(0, 400000, dtype=torch.int8) for 5000 times
0.06115841199061833
torch.arange(0, 40000, dtype=torch.int16) for 50000 times
0.7745441729930462
torch.arange(0, 400000, dtype=torch.int16) for 5000 times
0.22106765500211623
torch.arange(0, 40000, dtype=torch.int32) for 50000 times
0.720475220005028
torch.arange(0, 400000, dtype=torch.int32) for 5000 times
0.20230313099455088
torch.arange(0, 40000, dtype=torch.int64) for 50000 times
0.8144655400101328
torch.arange(0, 400000, dtype=torch.int64) for 5000 times
0.23762561299372464
```

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D22291236

Pulled By: VitalyFedyunin

fbshipit-source-id: 134dd08b77b11e631d914b5500ee4285b5d0591e
2020-08-03 11:14:57 -07:00
Hong Xu
91c80d122a torch.gcd: Do not use std::abs() because it does not have an unsigned integer overload (#42254)
Summary:
`abs` doesn't have an unsigned integer overload across all compilers, so applying abs to uint8_t can be ambiguous: https://en.cppreference.com/w/cpp/numeric/math/abs

This may cause unexpected issues when the input is uint8 and is greater
than 128. For example, on MSVC, applying `std::abs` to an unsigned char
variable

```c++
#include <cmath>

unsigned char a(unsigned char x) {
    return std::abs(x);
}
```

gives the following warning:

    warning C4244: 'return': conversion from 'int' to 'unsigned char',
    possible loss of data

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42254

Reviewed By: VitalyFedyunin

Differential Revision: D22860505

Pulled By: mruberry

fbshipit-source-id: 0076d327bb6141b2ee94917a1a21c22bd2b7f23a
2020-08-01 23:03:33 -07:00
Mike Ruberry
2912390662 Limits cpu scalar error message to where it's appropriate (#42360)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/40986.

TensorIterator's test for a CUDA kernel getting too many CPU scalar inputs was too permissive. This update limits the check to not consider outputs and to only be performed if the kernel can support CPU scalars.

A test is added to verify the appropriate error message is thrown in a case where the old error message was thrown previously.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42360

Reviewed By: ngimel

Differential Revision: D22868536

Pulled By: mruberry

fbshipit-source-id: 2bc8227978f8f6c0a197444ff0c607aeb51b0671
2020-08-01 02:04:30 -07:00
Kurt Mohler
206db5c127 Improve torch.norm functionality, errors, and tests (#41956)
Summary:
**BC-Breaking Note:**
BC breaking changes in the case where keepdim=True. Before this change, when calling `torch.norm` with keepdim=True and p='fro' or p=number, leaving all other optional arguments as their default values, the keepdim argument would be ignored. Also, any time `torch.norm` was called with p='nuc', the result would have one fewer dimension than the input, and the dimensions could be out of order depending on which dimensions were being reduced. After the change, for each of these cases, the result has the same number and order of dimensions as the input.

**PR Summary:**

* Fix keepdim behavior
* Throw descriptive errors for unsupported sparse norm args
* Increase unit test coverage for these cases and for complex inputs

These changes were taken from part of PR https://github.com/pytorch/pytorch/issues/40924. That PR is not going to be merged because it overrides `torch.norm`'s interface, which we want to avoid. But these improvements are still useful.

Issue https://github.com/pytorch/pytorch/issues/24802

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41956

Reviewed By: albanD

Differential Revision: D22837455

Pulled By: mruberry

fbshipit-source-id: 509ecabfa63b93737996f48a58c7188b005b7217
2020-08-01 01:55:12 -07:00
Mike Ruberry
2f840b1662 Warns when TensorIterator would resize its output (#42079)
Summary:
See https://github.com/pytorch/pytorch/issues/41027.

This adds a helper to resize output to ATen/native/Resize.* and updates TensorIterator to use it. The helper throws a warning if a tensor with one or more elements needs to be resized. This warning indicates that these resizes will become an error in a future PyTorch release.

There are many functions in PyTorch that will resize their outputs and don't use TensorIterator. For example,

985fd970aa/aten/src/ATen/native/cuda/NaiveConvolutionTranspose2d.cu (L243)

And these functions will need to be updated to use this helper, too. This PR avoids their inclusion since the work is separable, and this should let us focus on the function and its behavior in review. A TODO appears in the code to reflect this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42079

Reviewed By: VitalyFedyunin

Differential Revision: D22846851

Pulled By: mruberry

fbshipit-source-id: d1a413efb97e30853923bce828513ba76e5a495d
2020-07-30 22:39:16 -07:00
Mike Ruberry
e54f268a7a Enables torch.full bool and integer type inference (#41912)
Summary:
This behavior was deprecated in 1.5 and made a runtime error in 1.6, so we can now enable torch.full to infer its dtype when given bool and integer fill values. This PR enables that inference and updates the tests and docs to reflect this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41912

Reviewed By: albanD

Differential Revision: D22836802

Pulled By: mruberry

fbshipit-source-id: 33dfbe4d4067800c418b314b1f60fab8adcab4e7
2020-07-30 22:39:13 -07:00
kshitij12345
31d41f987a torch.where : Scalar Support (#40336)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/38349 #9190

TODO
* [x] Add Tests
* [x] Update Docs
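
A minimal sketch of the new scalar overloads (dtypes are kept matching here, since a Python float is a double):

```python
import torch

x = torch.randn(3, dtype=torch.double)
# A scalar can now stand in for a full tensor in either branch.
print(torch.where(x > 0, x, 0.0))
```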

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40336

Reviewed By: albanD

Differential Revision: D22813834

Pulled By: mruberry

fbshipit-source-id: 67c1693c059a301b249213afee3c25cea9f64fec
2020-07-30 22:36:53 -07:00
Hong Xu
344defc973 Let bfloat16 support promotion with other types (#41698)
Summary:
Fix https://github.com/pytorch/pytorch/issues/40580

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41698

Reviewed By: albanD

Differential Revision: D22824042

Pulled By: mruberry

fbshipit-source-id: 7dad9c12dc51d8f88c3ca963ae9c5f8aa2f72277
2020-07-30 12:28:09 -07:00
kiyosora
26d58503c2 Implementing NumPy-like function torch.signbit() (#41589)
Summary:
- Related with https://github.com/pytorch/pytorch/issues/38349
- Implementing the NumPy-like function `torch.signbit()`.
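
A minimal usage sketch (note that negative zero has its sign bit set):

```python
import torch

print(torch.signbit(torch.tensor([-2., 0., -0., 3.])))
# tensor([ True, False,  True, False])
```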

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41589

Reviewed By: albanD

Differential Revision: D22835249

Pulled By: mruberry

fbshipit-source-id: 7988f7fa8f591ce4b6a23ac884ee7b3aa718bcfd
2020-07-30 11:21:15 -07:00
Mike Ruberry
4b6e5f42a4 Creates spectral ops test suite (#42157)
Summary:
In preparation for creating the new torch.fft namespace and NumPy-like fft functions, as well as supporting our goal of refactoring and reducing the size of test_torch.py, this PR creates a test suite for our spectral ops.

The existing spectral op tests from test_torch.py and test_cuda.py are moved to test_spectral_ops.py and updated to run under the device generic test framework.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42157

Reviewed By: albanD

Differential Revision: D22811096

Pulled By: mruberry

fbshipit-source-id: e5c50f0016ea6bb8b093cd6df2dbcef6db9bb6b6
2020-07-29 11:36:18 -07:00
Alban Desmaison
460970483d Revert D22790718: [pytorch][PR] Enables torch.full bool and integer type inference
Test Plan: revert-hammer

Differential Revision:
D22790718 (6b3f335641)

Original commit changeset: 8d1eb01574b1

fbshipit-source-id: c321177cce129a6c83f1a7b26bd5ed94a343ac0f
2020-07-29 07:52:04 -07:00
Xiong Wei
90074bbfa6 implement numpy-like functionality isposinf, isneginf (#41588)
Summary:
Related https://github.com/pytorch/pytorch/issues/38349

Numpy-like functionalities `isposinf` and `isneginf` are implemented.
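
A minimal sketch of the two new predicates:

```python
import torch

t = torch.tensor([float('inf'), -float('inf'), 1.0, float('nan')])
print(torch.isposinf(t))  # tensor([ True, False, False, False])
print(torch.isneginf(t))  # tensor([False,  True, False, False])
```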

Test-Plan:
- pytest test/test_torch.py -k "test_isposinf_isneginf"

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41588

Reviewed By: ngimel

Differential Revision: D22770732

Pulled By: mruberry

fbshipit-source-id: 7448653e8fb8df6b9cd4604a4739fe18a1135578
2020-07-29 03:29:31 -07:00
Mike Ruberry
6b3f335641 Enables torch.full bool and integer type inference (#41912)
Summary:
This behavior was deprecated in 1.5 and made a runtime error in 1.6, so we can now enable torch.full to infer its dtype when given bool and integer fill values. This PR enables that inference and updates the tests and docs to reflect this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41912

Reviewed By: pbelevich

Differential Revision: D22790718

Pulled By: mruberry

fbshipit-source-id: 8d1eb01574b1977f00bc0696974ac38ffdd40d9e
2020-07-28 23:11:08 -07:00
Hong Xu
2de549518e Make fmod work with zero divisors consistently (#41948)
Summary:
Currently `torch.tensor(1, dtype=torch.int).fmod(0)` crashes (floating point exception).

This PR should fix this issue.
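
A sketch of the intended consistent behavior (the floating-point case follows IEEE fmod; the integral-dtype outcome shown here is my reading of the fix, not asserted by the PR text):

```python
import torch

# Floating-point fmod by zero follows IEEE semantics and returns NaN.
print(torch.tensor([1.0]).fmod(0.0))  # tensor([nan])

# Integral fmod by zero no longer crashes the process with a floating
# point exception; on CPU it raises a RuntimeError instead (assumption).
try:
    torch.tensor(1, dtype=torch.int).fmod(0)
except RuntimeError as e:
    print("integral fmod by zero raised:", e)
```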

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41948

Reviewed By: ngimel

Differential Revision: D22771081

Pulled By: ezyang

fbshipit-source-id: a94dd35d6cd85daa2d51cae8362004e31f97989e
2020-07-28 08:58:39 -07:00
Natalia Gimelshein
6ca5421a8f Enable non-synchronizing cub scan for cum* operations (#42036)
Summary:
This uses cub for cum* operations because, unlike thrust, cub is non-synchronizing.
Cub does not support tensors with more than `2**31` elements out of the box (in fact, due to cub bugs the cutoff point is even smaller),
so to support that I split the tensor into `2**30`-element chunks and modify the first value of the second and subsequent chunks to contain the cumsum result of the previous chunks. Since the modification is done in place on the source tensor, if something goes wrong and we error out before the source tensor is reverted to its original state, the source tensor will be corrupted, but in most cases errors will invalidate the full CUDA context anyway.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42036

Reviewed By: ajtulloch

Differential Revision: D22749945

Pulled By: ngimel

fbshipit-source-id: 9fc9b54d466df9c8885e79c4f4f8af81e3f224ef
2020-07-27 15:44:03 -07:00
Mike Ruberry
12cd083fd7 Updates torch.tensor, torch.as_tensor, and sparse ctors to use the device of inputs tensors they're given, by default (#41984)
Summary:
**BC-Breaking Note**

This PR changes the behavior of the torch.tensor, torch.as_tensor, and sparse constructors. When given a tensor as input and a device is not explicitly specified, these constructors now always infer their device from the tensor. Historically, if the optional dtype kwarg was provided then these constructors would not infer their device from tensor inputs. Additionally, for the sparse ctor a runtime error is now thrown if the indices and values tensors are on different devices and the device kwarg is not specified.

**PR Summary**
This PR's functional change is a single line:

```
auto device = device_opt.has_value() ? *device_opt : (type_inference ? var.device() : at::Device(computeDeviceType(dispatch_key)));
```
=>
```
auto device = device_opt.has_value() ? *device_opt : var.device();
```

in `internal_new_from_data`. This line entangled whether the function was performing type inference with whether it inferred its device from an input tensor, and in practice meant that

```
t = torch.tensor((1, 2, 3), device='cuda')
torch.tensor(t, dtype=torch.float64)
```

would return a tensor on the CPU, not the default CUDA device, while

```
t = torch.tensor((1, 2, 3), device='cuda')
torch.tensor(t)
```

would return a tensor on the device of `t`!

This behavior is niche and odd, but came up while aocsa was fixing https://github.com/pytorch/pytorch/issues/40648.

An additional side effect of this change is that the indices and values tensors given to a sparse constructor must be on the same device, or the sparse ctor must specify the device kwarg. The tests in test_sparse.py have been updated to reflect this behavior.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41984

Reviewed By: ngimel

Differential Revision: D22721426

Pulled By: mruberry

fbshipit-source-id: 909645124837fcdf3d339d7db539367209eccd48
2020-07-25 02:49:45 -07:00
Natalia Gimelshein
750d9dea49 move min/max tests to TestTorchDeviceType (#41908)
Summary:
This makes testing _min_max on different devices easier and gives min/max operations better CUDA test coverage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41908

Reviewed By: mruberry

Differential Revision: D22697032

Pulled By: ngimel

fbshipit-source-id: a796638fdbed8cda90a23f7ff4ee167f45530914
2020-07-23 22:49:30 -07:00
Vishwak Srinivasan
77db93228b Temporary fix for determinant bug on CPU (#35136)
Summary:
Changelog:
- Make diagonal contiguous

Temporarily Fixes https://github.com/pytorch/pytorch/issues/34061

Pull Request resolved: https://github.com/pytorch/pytorch/pull/35136

Reviewed By: izdeby

Differential Revision: D22673153

Pulled By: ezyang

fbshipit-source-id: 850f537483f929fcb43bcdef9d4ec264a7c3d354
2020-07-23 10:12:06 -07:00
kshitij12345
266657182a Add torch.movedim (#41480)
Summary:
https://github.com/pytorch/pytorch/issues/38349 #36048

TODO:
* [x] Tests
* [x] Docs
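
A minimal usage sketch (source/destination may be ints or tuples):

```python
import torch

t = torch.randn(2, 3, 4)
print(torch.movedim(t, 0, -1).shape)           # torch.Size([3, 4, 2])
print(torch.movedim(t, (0, 1), (1, 0)).shape)  # torch.Size([3, 2, 4])
```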

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41480

Reviewed By: zhangguanheng66

Differential Revision: D22649917

Pulled By: zou3519

fbshipit-source-id: a7f3920a24bae16ecf2ad731698ca65ca3e8c1ce
2020-07-23 09:41:01 -07:00
ashishfarmer
586b7f991c Enable skipped tests from test_torch on ROCm (#41611)
Summary:
This pull request enables the following tests from test_torch, previously skipped on ROCm:
test_pow_-2_cuda_float32/float64
test_sum_noncontig_cuda_float64
test_conv_transposed_large

The first two tests experienced precision issues on earlier ROCm versions, whereas the conv_transposed test was hitting a bug in MIOpen that is fixed in the version shipping with ROCm 3.5.

ezyang jeffdaily

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41611

Reviewed By: xw285cornell

Differential Revision: D22672690

Pulled By: ezyang

fbshipit-source-id: 5585387c048f301a483c4c0566eb9665555ef874
2020-07-22 19:49:17 -07:00
Nikita Vedeneev
7fefa46820 scatter/gather - check that inputs are of the same dimensionality (#41672)
Summary:
As per title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41672

Reviewed By: malfet, ngimel

Differential Revision: D22678302

Pulled By: gchanan

fbshipit-source-id: 95a1bde81e660b8963e5914d5348fd4fbff1338e
2020-07-22 18:51:51 -07:00
Kurt Mohler
ec683299eb Reland Add non-deterministic alert to CUDA operations that use atomicAdd() (#41538)
Summary:
Reland PR https://github.com/pytorch/pytorch/issues/40056

A new overload of upsample_linear1d_backward_cuda was added in a recent commit, so I had to add the nondeterministic alert to it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41538

Reviewed By: zou3519

Differential Revision: D22608376

Pulled By: ezyang

fbshipit-source-id: 54a2aa127e069197471f1feede6ad8f8dc6a2f82
2020-07-22 13:12:29 -07:00
Gregory Chanan
71aad6ea66 Revert "port masked_select from TH to ATen and optimize perf on CPU (#33269)" (#41828)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41828

This reverts commit fe66bdb498.

This also makes a change to THTensorEvenMoreMath because sumall was removed; see THTensor_wrap.

Test Plan: Imported from OSS

Reviewed By: orionr

Differential Revision: D22657473

Pulled By: malfet

fbshipit-source-id: 95a806cedf1a3f4df91e6a21de1678252b117489
2020-07-22 09:28:04 -07:00
Vasiliy Kuznetsov
302e566205 add max_and_min function and cpu kernel to speed up observers (#41570)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41570

For min/max based quantization observers, calculating min and max of a tensor
takes most of the runtime. Since the calculation of min and max is done
on the same tensor, we can speed this up by only reading the tensor
once, and reducing with two outputs.

One question I had is whether we should put this into the quantization
namespace, since the use case is pretty specific.

This PR implements the easier CPU path to get an initial validation.
There is some needed additional work in future PRs, which durumu will
take a look at:
* CUDA kernel and tests
* making this work per channel
* benchmarking on observer
* benchmarking impact on QAT overhead

Test Plan:
```
python test/test_torch.py TestTorch.test_min_and_max
```

quick bench (not representative of real world use case):
https://gist.github.com/vkuzo/7fce61c3456dbc488d432430cafd6eca
```
(pytorch) [vasiliy@devgpu108.ash6 ~/local/pytorch] OMP_NUM_THREADS=1 python ~/nfs/pytorch_scripts/observer_bench.py
tensor(5.0390) tensor(-5.4485) tensor([-5.4485,  5.0390])
min and max separate 11.90243935585022
min and max combined 6.353186368942261
% decrease 0.466228209277153
(pytorch) [vasiliy@devgpu108.ash6 ~/local/pytorch] OMP_NUM_THREADS=4 python ~/nfs/pytorch_scripts/observer_bench.py
tensor(5.5586) tensor(-5.3983) tensor([-5.3983,  5.5586])
min and max separate 3.468616485595703
min and max combined 1.8227086067199707
% decrease 0.4745142294372342
(pytorch) [vasiliy@devgpu108.ash6 ~/local/pytorch] OMP_NUM_THREADS=8 python ~/nfs/pytorch_scripts/observer_bench.py
tensor(5.2146) tensor(-5.2858) tensor([-5.2858,  5.2146])
min and max separate 1.5707778930664062
min and max combined 0.8645427227020264
% decrease 0.4496085496757899
```

Imported from OSS

Reviewed By: supriyar

Differential Revision: D22589349

fbshipit-source-id: c2e3f1b8b5c75a23372eb6e4c885f842904528ed
2020-07-21 18:16:22 -07:00
Wojciech Baranowski
48569cc330 Reland split (#41567)
Summary:
Take 3

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41567

Reviewed By: zou3519

Differential Revision: D22586331

Pulled By: albanD

fbshipit-source-id: ca08199da716d64a335455610edbce752fee224b
2020-07-21 08:06:27 -07:00