Commit Graph

999 Commits

Author SHA1 Message Date
Vitaly Fedyunin
02917dd1f4 Add memory format support to randint_like operator (#27889)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27889

Adds memory_format keyword argument (positional for cpp).

'Preserve' behavior now follows these rules:
1) If the tensor is non-overlapping and dense, the output tensor will have the same strides as the input tensor.
2) If (1) does not hold and the tensor is stored in the channels-last format, the output tensor will have the channels-last format.
3) The output tensor will be contiguous in all other cases.

 ---
A dense tensor is a tensor that stores its values in a contiguous block of memory.
A non-overlapping tensor is a tensor in which each element occupies its own, non-repeated memory location.
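
A short illustration of the rules above, assuming the memory_format keyword added by this PR (the random values are irrelevant; only the layout matters):

```python
import torch

x = torch.empty(2, 3, 4, 4).contiguous(memory_format=torch.channels_last)
y = torch.randint_like(x, 10, memory_format=torch.preserve_format)
print(y.stride() == x.stride())   # True: dense channels-last input keeps its strides (rule 1)

s = torch.empty(4, 4)[::2]        # non-dense, non-channels-last view
t = torch.randint_like(s, 10, memory_format=torch.preserve_format)
print(t.stride())                 # (4, 1): falls back to a contiguous output (rule 3)
```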

Test Plan: Imported from OSS

Differential Revision: D17980307

Pulled By: VitalyFedyunin

fbshipit-source-id: f1766c2bcb015ef870bfb92c16b4cd363b3cbc14
2019-10-25 07:29:20 -07:00
Vitaly Fedyunin
c258cd039a Add memory format support to zeros_like operator (#27562)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27562

Adds memory_format keyword argument (positional for cpp).

'Preserve' behavior now follows these rules:
1) If the tensor is non-overlapping and dense, the output tensor will have the same strides as the input tensor.
2) If (1) does not hold and the tensor is stored in the channels-last format, the output tensor will have the channels-last format.
3) The output tensor will be contiguous in all other cases.

 ---
A dense tensor is a tensor that stores its values in a contiguous block of memory.
A non-overlapping tensor is a tensor in which each element occupies its own, non-repeated memory location.

Test Plan: Imported from OSS

Differential Revision: D17980313

Pulled By: VitalyFedyunin

fbshipit-source-id: 9ca8453dc1a554ceea93c6949e01263cc576384b
2019-10-25 07:29:16 -07:00
Vitaly Fedyunin
04f5325583 Add memory format support to rand_like operator (#27561)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27561

Adds memory_format keyword argument (positional for cpp).

'Preserve' behavior now follows these rules:
1) If the tensor is non-overlapping and dense, the output tensor will have the same strides as the input tensor.
2) If (1) does not hold and the tensor is stored in the channels-last format, the output tensor will have the channels-last format.
3) The output tensor will be contiguous in all other cases.

 ---
A dense tensor is a tensor that stores its values in a contiguous block of memory.
A non-overlapping tensor is a tensor in which each element occupies its own, non-repeated memory location.

Test Plan: Imported from OSS

Differential Revision: D17980316

Pulled By: VitalyFedyunin

fbshipit-source-id: 2a1d47571268673de0c6f5ae1b6d4f9110962ab0
2019-10-25 07:29:12 -07:00
Vitaly Fedyunin
2c339a24ec Add memory format support to ones_like operator (#27270)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27270

Adds memory_format keyword argument (positional for cpp).

'Preserve' behavior now follows these rules:
1) If the tensor is non-overlapping and dense, the output tensor will have the same strides as the input tensor.
2) If (1) does not hold and the tensor is stored in the channels-last format, the output tensor will have the channels-last format.
3) The output tensor will be contiguous in all other cases.

 ---
A dense tensor is a tensor that stores its values in a contiguous block of memory.
A non-overlapping tensor is a tensor in which each element occupies its own, non-repeated memory location.

Test Plan: Imported from OSS

Differential Revision: D17980312

Pulled By: VitalyFedyunin

fbshipit-source-id: 5da9530f6b239306dbb66d1dfeefe88237f13bbd
2019-10-25 07:29:08 -07:00
Vitaly Fedyunin
85d5aee863 Add memory format support to full_like operator (#27262)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27262

Adds memory_format keyword argument (positional for cpp).

'Preserve' behavior now follows these rules:
1) If the tensor is non-overlapping and dense, the output tensor will have the same strides as the input tensor.
2) If (1) does not hold and the tensor is stored in the channels-last format, the output tensor will have the channels-last format.
3) The output tensor will be contiguous in all other cases.

 ---
A dense tensor is a tensor that stores its values in a contiguous block of memory.
A non-overlapping tensor is a tensor in which each element occupies its own, non-repeated memory location.

Test Plan: Imported from OSS

Differential Revision: D17980309

Pulled By: VitalyFedyunin

fbshipit-source-id: 1761a9939aa7c5ab23e927b897e25e225089a8e7
2019-10-25 07:29:04 -07:00
Vitaly Fedyunin
baf8488dbd Add memory format support to empty_like operator (#27244)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27244

Adds memory_format keyword argument (positional for cpp).

'Preserve' behavior now follows these rules:
1) If the tensor is non-overlapping and dense, the output tensor will have the same strides as the input tensor.
2) If (1) does not hold and the tensor is stored in the channels-last format, the output tensor will have the channels-last format.
3) The output tensor will be contiguous in all other cases.

 ---
A dense tensor is a tensor that stores its values in a contiguous block of memory.
A non-overlapping tensor is a tensor in which each element occupies its own, non-repeated memory location.

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D17980310

Pulled By: VitalyFedyunin

fbshipit-source-id: 00a39b40daa4b8ee63c32e60d920222f8be2d6a1
2019-10-25 07:29:00 -07:00
Gregory Chanan
2793d41a9c Fix scalar handling of unfold. (#28462)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28462

Unfold is implemented in TH (as _th_unfold) and uses the standard scalar checks. That means that even though torch.tensor(5).unfold(dim=0, size=1, step=1) should produce torch.tensor([5]), it actually produces torch.tensor(5), because scalar_check infers that the result is a scalar.

We can fix this by simply turning off the scalar_check.
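
For illustration, the expected behavior after the fix, based on the example in the summary:

```python
import torch

t = torch.tensor(5)
u = t.unfold(0, 1, 1)      # dimension=0, size=1, step=1
print(u, u.dim())          # tensor([5]) 1 -- no longer collapsed back to a 0-dim scalar
```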

Test Plan: Imported from OSS

Differential Revision: D18074671

Pulled By: gchanan

fbshipit-source-id: 5db09d614692830d66d6e6d8aba799ebe8144cf5
2019-10-25 07:12:18 -07:00
Hong Xu
5cf644157c Speed up fill for half and bfloat16 on CPU. (#28397)
Summary:
This is done by replacing Vec<uint16_t> with Vec<int16_t>, which has all sorts of AVX optimizations available.

Benchmark (Debian 10, Release build, gcc 8.3, no turbo, Intel(R) Xeon(R) E-2136):

```python
import timeit
for dtype in ('torch.bfloat16', 'torch.half'):
    for n, t in [(40_000, 600000),
                (400_000, 60000)]:
        print(f'a.fill_(10) for {t} times, a=torch.empty({n}, dtype={dtype})')
        print(timeit.timeit(f'a.fill_(10)', setup=f'import torch; a=torch.empty({n}, dtype={dtype})', number=t))
```

Before:

```
a.fill_(10) for 600000 times, a=torch.empty(40000, dtype=torch.bfloat16)
11.064065577999827
a.fill_(10) for 60000 times, a=torch.empty(400000, dtype=torch.bfloat16)
10.618151295000189
a.fill_(10) for 600000 times, a=torch.empty(40000, dtype=torch.half)
10.989039544000207
a.fill_(10) for 60000 times, a=torch.empty(400000, dtype=torch.half)
10.602233665999847
```

After:

```
a.fill_(10) for 600000 times, a=torch.empty(40000, dtype=torch.bfloat16)
1.530125006000162
a.fill_(10) for 60000 times, a=torch.empty(400000, dtype=torch.bfloat16)
1.4807136570002513
a.fill_(10) for 600000 times, a=torch.empty(40000, dtype=torch.half)
1.3946152990001792
a.fill_(10) for 60000 times, a=torch.empty(400000, dtype=torch.half)
1.457788402999995
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28397

Differential Revision: D18125171

Pulled By: ezyang

fbshipit-source-id: bfb2da13f10bc582e9848073e428af9e36656b13
2019-10-24 15:03:11 -07:00
Anjali Chourdia
136bb07a93 torch.histc: added a finite range check to resolve segfaults if the tensor has inf; also added checks for nan values and min>max (#27712)
Summary:
https://github.com/pytorch/pytorch/issues/27464
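
A hedged sketch of what the added checks are meant to guarantee (the exact error message is illustrative, not quoted from the implementation):

```python
import torch

t = torch.tensor([1.0, float('inf')])
try:
    torch.histc(t, bins=4)
except RuntimeError as err:
    # With the new checks, a non-finite range raises instead of segfaulting.
    print("histc rejected non-finite input:", err)
```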
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27712

Differential Revision: D18064544

Pulled By: anjali411

fbshipit-source-id: c9c6d8eb4d55f2b5320409ba238bf44b0be8902e
2019-10-24 09:28:45 -07:00
Gregory Chanan
4f0a3504e1 Port is_set_to from TH/THC to ATen.
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/28425

Test Plan: Imported from OSS

Differential Revision: D18063328

Pulled By: gchanan

fbshipit-source-id: 86af01a630d88c30947b8c85d1fac86dd7b40585
2019-10-24 09:15:03 -07:00
Gregory Chanan
bee4aca259 is_set_to: unify TH/THC implementation and genericize test_is_set_to. (#28422)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28422

The TH implementation had two differences:
1) It explicitly checked for null storages; this isn't supported anymore, so the check can be removed.
2) It collapsed all empty tensors to the same shape for the purpose of checking.  This was introduced to keep BC when we introduced N-dimensional empty tensors,
but since it's been quite a long time since we've had N-dimensional empty tensors and the CUDA implementation didn't support this, we should get rid of it.
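
For context, a small illustration of is_set_to semantics (it is true only when storage, sizes, strides, and storage offset all match):

```python
import torch

a = torch.empty(3, 4)
alias = a[:]                    # same storage, same sizes/strides/offset
view = a.view(12)               # same storage, different sizes/strides
print(a.is_set_to(alias))       # True
print(a.is_set_to(view))        # False
print(a.is_set_to(a.clone()))   # False: different storage
```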

Test Plan: Imported from OSS

Differential Revision: D18061916

Pulled By: gchanan

fbshipit-source-id: 1a54cf9ea4fcb35b358a9ab57f84eff059ff1e7b
2019-10-23 13:46:52 -07:00
vishwakftw
657430e1f0 Return 0-numel empty tensor from symeig when eigenvectors=False (#28338)
Summary:
Changelog:
- Return a 0-numel empty tensor for the eigenvectors when eigenvectors=False instead of a zero tensor, matching the behavior of torch.eig (a brief illustration follows)
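
```python
import torch

a = torch.randn(3, 3)
a = a + a.t()                              # make the input symmetric
w, v = torch.symeig(a, eigenvectors=False)
print(w.shape, v.numel())                  # torch.Size([3]) 0 -- v is now an empty tensor
```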
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28338

Test Plan: - test_symeig has been modified appropriately for this change

Differential Revision: D18085280

Pulled By: ezyang

fbshipit-source-id: 43129a96dd01743997157974100e5a7270742b46
2019-10-23 11:44:57 -07:00
Hong Xu
19aeb472aa Move the CUDA implementation of log1p to ATen. (#26923)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26923

Fix #24588

Test Plan: Imported from OSS

Differential Revision: D17984184

Pulled By: VitalyFedyunin

fbshipit-source-id: 3bc2be4f08e800b1de274940f2bd3d5b418b45ee
2019-10-22 14:00:59 -07:00
iurii zdebskyi
5e73e1fff8 Enabled torch.unique for bool tensors (#28374)
Summary:
Enabled torch.unique for bool tensors.
Tested via unit tests

[issue](https://github.com/pytorch/pytorch/issues/27691)
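
For illustration, the newly supported usage:

```python
import torch

x = torch.tensor([True, False, True, True])
print(torch.unique(x))                              # tensor([False,  True])
vals, counts = torch.unique(x, return_counts=True)
print(counts)                                       # tensor([1, 3])
```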
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28374

Differential Revision: D18043413

Pulled By: izdeby

fbshipit-source-id: 295ff03b9b61d33bbd2e05e6211c4f35a0ee23ea
2019-10-22 09:09:46 -07:00
Hong Xu
9cb003a94f Add type checking of alpha for torch.sub and clean up code.
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/28298

Differential Revision: D18017923

Pulled By: nairbv

fbshipit-source-id: 2c4b3f96eb005dfb70e1b7ff87d28eb79b9300dd
2019-10-18 14:49:42 -07:00
Igor Fedan
12dde7f58a cdist performance improvement for euclidean distance (#25799)
Summary:
jacobrgardner proposed a way to speed up Euclidean distance calculation in https://github.com/pytorch/pytorch/issues/15253#issuecomment-491467128. This PR implements that solution for both the normal and batch versions.

simonepri also provided performance metrics: https://github.com/pytorch/pytorch/issues/15253#issuecomment-502363581
![image](https://user-images.githubusercontent.com/12058312/64460756-44a24580-d0c9-11e9-9f7f-a5942f4c832d.png)

The current implementation also shows a speedup compared to jacobrgardner's approach:
![image](https://user-images.githubusercontent.com/12058312/64461495-5553bb00-d0cb-11e9-87e6-302b8cc7e12b.png)
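
For reference, a minimal sketch of the quadratic-expansion idea referenced above (||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y^T), which trades explicit pairwise differences for a single matmul; this illustrates the approach, not the exact kernel in this PR:

```python
import torch

def cdist_mm(x, y):
    x_sq = x.pow(2).sum(dim=-1, keepdim=True)        # (n, 1)
    y_sq = y.pow(2).sum(dim=-1, keepdim=True).t()    # (1, m)
    d2 = x_sq + y_sq - 2.0 * x @ y.t()
    return d2.clamp_min(0).sqrt()                    # clamp guards tiny negative round-off

x, y = torch.randn(100, 16), torch.randn(50, 16)
print((cdist_mm(x, y) - torch.cdist(x, y)).abs().max())   # small numerical error only
```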
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25799

Differential Revision: D17964982

Pulled By: ifedan

fbshipit-source-id: bf7bd0dbfca51fd39e667da55139347480f30a2f
2019-10-17 14:56:54 -07:00
Vitaly Fedyunin
951dd03037 Add memory format support to typecasting shortcuts byte,char,double,bool,half,int,long,short,float,bfloat16 (#27228)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27228

Adds memory_format keyword argument (positional for cpp).

'Preserve' behavior now follows these rules:
1) If the tensor is non-overlapping and dense, the output tensor will have the same strides as the input tensor.
2) If (1) does not hold and the tensor is stored in the channels-last format, the output tensor will have the channels-last format.
3) The output tensor will be contiguous in all other cases.

 ---
A dense tensor is a tensor that stores its values in a contiguous block of memory.
A non-overlapping tensor is a tensor in which each element occupies its own, non-repeated memory location.

Test Plan: Imported from OSS

Differential Revision: D17980315

Pulled By: VitalyFedyunin

fbshipit-source-id: fd5615621bc4968aa4ef2a26430c492c552ed671
2019-10-17 09:16:25 -07:00
Vitaly Fedyunin
15df371934 Add memory format support to typecasting shortcuts byte,char,double,bool,half,int,long,short,float,bfloat16 (#27228)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27228

Adds memory_format keyword argument (positional for cpp).

'Preserve' behavior now follows these rules:
1) If the tensor is non-overlapping and dense, the output tensor will have the same strides as the input tensor.
2) If (1) does not hold and the tensor is stored in the channels-last format, the output tensor will have the channels-last format.
3) The output tensor will be contiguous in all other cases.

 ---
A dense tensor is a tensor that stores its values in a contiguous block of memory.
A non-overlapping tensor is a tensor in which each element occupies its own, non-repeated memory location.

Test Plan: Imported from OSS

Differential Revision: D17980128

Pulled By: VitalyFedyunin

fbshipit-source-id: b2646bab72c4475b7a82bb271d204a9d96d28bd4
2019-10-17 09:16:21 -07:00
Hong Xu
4a69d048e0 Move the CUDA implementation of log2 to ATen. (#26769)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26769

Fix #24589

Test Plan: Imported from OSS

Differential Revision: D17960122

Pulled By: VitalyFedyunin

fbshipit-source-id: 58dff236886bbf3a0a152d7422aa8a5c478ee1de
2019-10-17 07:27:55 -07:00
Mike Ruberry
edc28676ef Adds @overridePrecision decorator (#28131)
Summary:
Adds the overridePrecision decorator, which allows device generic tests to specify per-dtype precision overrides.

Precision is overridden on the test class instance itself, and so is thread-local (so that running multiple tests in parallel will not conflict). It can be accessed directly from a test with self.precision, as before.
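
A minimal, self-contained sketch of the idea (an assumption for illustration, not the actual test-framework code): the decorator records per-dtype tolerances on the test function, and the harness swaps self.precision before running the test.

```python
import torch

def overridePrecision(overrides):
    def decorator(fn):
        fn.precision_overrides = overrides   # recorded on the test function
        return fn
    return decorator

class ExampleTest:
    precision = 1e-5                         # default tolerance

    @overridePrecision({torch.half: 1e-2, torch.bfloat16: 1e-1})
    def test_op(self, dtype):
        # Harness behavior, inlined here for illustration only.
        self.precision = type(self).test_op.precision_overrides.get(dtype, type(self).precision)
        print(dtype, self.precision)

ExampleTest().test_op(torch.half)    # torch.float16 0.01
ExampleTest().test_op(torch.float)   # torch.float32 1e-05
```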
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28131

Differential Revision: D17969774

Pulled By: mruberry

fbshipit-source-id: c4e0b71afac6bdc7cbf4e799f3054922de764820
2019-10-16 19:47:55 -07:00
Natalia Gimelshein
97257e257e clean up test_cat_empty (#28115)
Summary:
Remove spurious parts from test_cat_empty
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28115

Test Plan: no additional tests needed.

Differential Revision: D17956669

Pulled By: ngimel

fbshipit-source-id: cffcfa9e5b50afba62c6dbc8ca5d9de95d0c020e
2019-10-16 14:42:14 -07:00
Hong Xu
cbb4c87d43 Improve the doc and test of logical_xor (#28031)
Summary:
Following up on https://github.com/pytorch/pytorch/issues/27248, per a suggestion by gchanan
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28031

Differential Revision: D17962226

Pulled By: gchanan

fbshipit-source-id: 788e4e1fc78b1cfc7915aedaa10c8656b19edc4d
2019-10-16 13:57:53 -07:00
Vitaly Fedyunin
d39ab0312a Add memory_format support to the `to` and `type` operators (#27107)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27107

Adds memory_format keyword argument (positional for cpp).

'Preserve' behavior now follows these rules:
1) If the tensor is non-overlapping and dense, the output tensor will have the same strides as the input tensor.
2) If (1) does not hold and the tensor is stored in the channels-last format, the output tensor will have the channels-last format.
3) The output tensor will be contiguous in all other cases.

 ---
A dense tensor is a tensor that stores its values in a contiguous block of memory.
A non-overlapping tensor is a tensor in which each element occupies its own, non-repeated memory location.

Test Plan: Imported from OSS

Differential Revision: D17931062

Pulled By: VitalyFedyunin

fbshipit-source-id: 2c5dd3dd05bf58a9a29f25562cd45190b009c3f9
2019-10-15 12:55:56 -07:00
Hong Xu
e6a71405a0 Let logical_xor support non-bool tensors (again) (#27248)
Summary:
f362a5a04b reverted 5ca612b55e due to build-time concerns (also see https://github.com/pytorch/pytorch/issues/25254). Now we come back to this by reusing the underlying code of the comparison operators: logical operators on non-bool variables are essentially comparison operators that semantically output bool values. Compared with the previous implementation, we compromise by always applying XOR on the same input type, while the output can be either the input type or the bool type.
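
A short illustration consistent with the summary (the accepted out= dtypes are as described above):

```python
import torch

a = torch.tensor([0, 1, 2])
b = torch.tensor([0, 3, 0])
print(torch.logical_xor(a, b))          # tensor([False, False,  True])

out = torch.empty(3, dtype=torch.int64)
torch.logical_xor(a, b, out=out)        # out= may keep the input dtype
print(out)                              # tensor([0, 0, 1])
```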
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27248

Differential Revision: D17929356

Pulled By: ezyang

fbshipit-source-id: dbac08c7614b36f05d24c69104fee9df9ca523d5
2019-10-15 10:56:32 -07:00
vishwakftw
ad47788647 Add Polygamma to the docs (#27696)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/25347
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27696

Differential Revision: D17916790

Pulled By: ezyang

fbshipit-source-id: ac2635a300b1ef0ab437e3ffac152239754fe828
2019-10-15 07:00:57 -07:00
vishwakftw
82a69a690f Add documentation for torch.lgamma (#27812)
Summary:
Changelog:
- Add doc string in _torch_docs.py, _tensor_docs.py
- Expose in docs/source/torch.rst, docs/source/tensors.rst
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27812

Test Plan:
- Remove `lgamma`, `lgamma_` from the blacklist

Fixes https://github.com/pytorch/pytorch/issues/27783

Differential Revision: D17907630

Pulled By: ezyang

fbshipit-source-id: 14e662a4e5262126889a437e5c4bfb21936730e8
2019-10-14 08:47:04 -07:00
Mike Ruberry
fab48eb200 Makes some CPU-only tests in test_torch generic (#27688)
Summary:
Per title. Also testing putting test_advancedindex back on the default stream.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27688

Differential Revision: D17888351

Pulled By: mruberry

fbshipit-source-id: af8adeca89f575fc276921b39049b07135ed9776
2019-10-11 17:13:41 -07:00
t-kuha
b6fea4f77f Removes floating_dtype decorator from test_torch and test_cuda (#27599)
Summary:
Per title. Also makes a few test_torch tests generic.

This PR removes ~half the floating_dtype decorators. Follow-up will remove the rest.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27599

Differential Revision: D17840056

Pulled By: mruberry

fbshipit-source-id: 428bb5498c452083e3608325e0b548b1d75baf2d
2019-10-09 16:10:26 -07:00
zou3519
59b14a7620 Documentation for named tensors (#27173)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27173

`docs/source/named_tensor.rst` is the entry point; most users will land
either here or the named tensor tutorial when looking to use named
tensors. We should strive to make this as readable, concise, and understandable
as possible.

`docs/source/name_inference.rst` lists all of the name inference rules.
It should be clear but it's hard to make it concise.

Please let me know if anything doesn't make sense and please propose
alternative wordings and/or restructuring to improve the documentation.
This should ultimately get cherry-picked into the 1.3 branch as one
monolithic commit so it would be good to get all necessary changes made
in this PR and not have any follow ups.

Test Plan: - built and reviewed locally with `cd docs/ && make html`.

Differential Revision: D17763046

Pulled By: zou3519

fbshipit-source-id: c7872184fc4b189d405b18dad77cad6899ae1522
2019-10-08 22:22:30 -07:00
Mike Ruberry
7f183a978f Stops common_utils.py from setting the default tensor type (to torch.DoubleTensor) (#27444)
Summary:
This PR stops common_utils.py from setting the default tensor type when it is imported. See issue https://github.com/pytorch/pytorch/issues/27355. This is a frequent source of confusion for test writers.

Many tests relied on this setting (whether they knew it or not), and this PR also updates the test suite to pass without common_utils.py setting the default tensor type. Some larger test files now set the default floating dtype themselves, however. These test files are:

- test_autograd.py
- test_distributions.py
- test_jit.py
- test_nn.py

This is still a significant improvement over today, however. First, these files set the default floating dtype much more clearly than importing it from common_utils. Second, the rest of the test suite no longer sets this globally. Third, this PR is a springboard to updating those tests, too. In particular, as tests are made generic they can be moved away from relying on this global setting (see the short illustration after the list below).

Notable technical changes in this PR are:

- Significant updates to test_torch.py to make it pass without setting the default floating dtype globally.
- The default_floating_dtype decorator is now defined in common_utils, a couple versions of this operator were defined in test files previously.
- test_torch-specific parts of common_utils were refactored into test_torch.
- tensor creation methods in common_utils were updated to accept an optional dtype and device.
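
For illustration, the pattern described above, where a test file opts into double precision explicitly rather than inheriting it from common_utils on import:

```python
import torch

torch.set_default_dtype(torch.double)
print(torch.empty(2).dtype)            # torch.float64
torch.set_default_dtype(torch.float)   # restore the usual default
```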
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27444

Differential Revision: D17795235

Pulled By: mruberry

fbshipit-source-id: 7f77271c0c836e69f183ad9057a2c4b29f09d2e1
2019-10-08 09:52:44 -07:00
Mike Ruberry
a7de545c63 Makes test_cuda.py's generated tensor op tests generic (#27210)
Summary:
- The tensor op tests generated in test_cuda.py are now generic and appear in test_torch.py
- Data previously held in auxiliary data structures and files, like test_cuda_ignores.txt, is inlined

Previously the tensor op tests used several auxiliary data structures, a file, and exception handling to filter the test suite. If a function wasn't implemented, for example, that exception would be caught. This let functions like trigamma, which isn't callable, appear to be tested. See https://github.com/pytorch/pytorch/issues/27230. Filtering from additional data stores is error prone, too. It requires developers understand what data stores are used and how they're used. The existing sources are also sometimes incorrect. The txt file claims that dist_ doesn't work on half tensors, for example, but the updated tests verify it does.

In addition to making these tests generic, this PR removes those auxiliary data structures and does not catch any exceptions. Exceptions are errors. (This also means that if something implemented breaks it will now report as an error. Previously the test suite would have reported a pass.) The test infrastructure was also simplified to not perform computations with CPU half tensors since they do not support many operations. This introduces a float<->half conversion quirk but eliminates awkward functions that would first convert cpu tensors to float, perform an operation, and convert them back.

With this change test_cuda.py is almost entirely CUDA-specific.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27210

Differential Revision: D17757907

Pulled By: mruberry

fbshipit-source-id: b3c191c379667b1a7d5361087bdf82f397f77f65
2019-10-04 02:40:59 -07:00
Junjie Bai
76f847546b Enable Python3.6 PyTorch ROCm CI
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27353

Differential Revision: D17758495

Pulled By: bddppq

fbshipit-source-id: 95e329bc30f092e4093a33c408f1647b803d9983
2019-10-04 00:23:37 -07:00
Hong Xu
2e62318243 Move the CUDA implementation of log10 to ATen. (#26733)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26733

Close #24587

Test Plan: Imported from OSS

Differential Revision: D17606981

Pulled By: VitalyFedyunin

fbshipit-source-id: 732f07b981287da3ca235b272b7b6f78144f8ebe
2019-10-03 14:54:20 -07:00
Vitaly Fedyunin
7b2e8c323c Add memory format argument to the clone operator (#27106)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27106

Adds memory_format option to the `clone` operator.

Introduces new `clone` behavior when used as `input_t.clone(memory_format=torch.preserve_format)`:
1) If the tensor is non-overlapping and dense, the output tensor will have the same strides as the input tensor.
2) If (1) does not hold and the tensor is stored in the channels-last format, the output tensor will have the channels-last format.
3) The output tensor will be contiguous in all other cases.

 ---
A dense tensor is a tensor that stores its values in a contiguous block of memory.
A non-overlapping tensor is a tensor in which each element occupies its own, non-repeated memory location.
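
A short, hedged illustration of the new clone behavior described above:

```python
import torch

a = torch.randn(3, 4).t()                            # dense but non-contiguous
b = a.clone(memory_format=torch.preserve_format)
print(b.stride() == a.stride(), torch.equal(a, b))   # True True  (rule 1)
c = a.clone(memory_format=torch.contiguous_format)
print(c.is_contiguous())                             # True
```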

Test Plan: Imported from OSS

Differential Revision: D17699357

Pulled By: VitalyFedyunin

fbshipit-source-id: 5ae1537c2aca1abf0bf1eec4416846129c156f66
2019-10-03 12:08:47 -07:00
Mike Ruberry
b45f1b9601 Makes more of test_cuda.py generic and updates test_torch tests (#27135)
Summary:
- Makes more of test_cuda generic, including some serialization tests
- Updates some tests in test_torch to use latest extensibility points and patterns

Most remaining tests in test_cuda.py are either generated (to be moved in a follow-up PR) or deal with CUDA-specific features like streams, events, and querying CUDA devices.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27135

Differential Revision: D17696478

Pulled By: mruberry

fbshipit-source-id: 51ae424c8a72e725556a2f2bc92ad9a87244b3c0
2019-10-01 19:18:56 -07:00
peter
ec07d144ba Fixed seek offset size to 64bit. (#27125)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/26998.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27125

Differential Revision: D17687154

Pulled By: ezyang

fbshipit-source-id: 6784f4fd799130ac72a25884f120a0ba96bd4f51
2019-10-01 08:50:32 -07:00
Mike Ruberry
ea414e4990 Adds Device Generic Precision Tests to test_torch.py (#26762)
Summary:
- Lets device generic classes be instantiated for all available device types EXCEPT those specified
- Creates TestDevicePrecision in test_torch.py, letting devices compare their results to the CPU's
- Moves 4 functions from test_cuda.py to TestDevicePrecision
- polygamma and digamma functions were cleaned up

The polygamma and digamma tests always ran with double tensors and will fail when using float tensors, despite former comments and code to the contrary. Notes were added to each function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26762

Differential Revision: D17677859

Pulled By: mruberry

fbshipit-source-id: 7cbe7d05ee0bc9b622c9127be36ced02f9c4506a
2019-09-30 19:09:21 -07:00
Mike Ruberry
ec7913afbd Cuts test_torch.py runtime in half by marking four tests as slow (#26789)
Summary:
- Adds slowTest to four tests

On my devfair running test_torch.py takes ~200 seconds with slow tests enabled. Running with the current slowTest annotations takes ~145s. Running with these four additional annotations takes ~64s.

test_sum_dim, for example, takes 30s but was not marked as slow.
test_det_logdet_slogdet takes 17s on CPU and 22s on CUDA for a total of 39s!
test_einsum takes 7s.
test_triu_tril takes 5 seconds on CPU and 9s on CUDA for a total of 14s.

Several of the current slowTests are faster than this. test_cholesky_solve_batched_many_batches, for example, takes ~3 seconds on CPU and ~4.5s on CUDA, for a total of 7.5s across both devices.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26789

Differential Revision: D17574282

Pulled By: mruberry

fbshipit-source-id: 3e5e505244c09b0ae23bd8c0145828119326719b
2019-09-30 17:25:30 -07:00
Edward Yang
b16358b251 Revert D17666050: [pytorch][PR] Fixed seek offset size to 64bit.
Test Plan: revert-hammer

Differential Revision:
D17666050

Original commit changeset: f02ebd5320ae

fbshipit-source-id: 6bc8fe583e350e2b573f767af85d1287dd048d1f
2019-09-30 11:07:35 -07:00
Yoshiaki Nakamura
1afe3fc01e Fixed seek offset size to 64bit. (#27047)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/26998
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27047

Differential Revision: D17666050

Pulled By: ezyang

fbshipit-source-id: f02ebd5320ae25f8949be20d0744fe3cd3e2fee9
2019-09-30 07:52:15 -07:00
Vitaly Fedyunin
275e0c1c8f Make nonzero non differentiable as it supposed to be (#26980)
Summary:
Fixes: https://github.com/pytorch/pytorch/issues/26038

Somewhere between v1.1 and master, `nonzero` became `abstract` and was marked as differentiable (by mistake); we need to put it into the TH section of `tools/autograd/derivatives.yaml` to fix this.
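
A small check consistent with the fix (nonzero returns integer indices, which must not require grad):

```python
import torch

x = torch.randn(5, requires_grad=True)
idx = x.nonzero()
print(idx.dtype, idx.requires_grad)   # torch.int64 False
```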
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26980

Differential Revision: D17632276

Pulled By: VitalyFedyunin

fbshipit-source-id: d6cabcc53348af6148cea5a1bd1af2ef12547373
2019-09-30 07:33:58 -07:00
Igor Fedan
ee2c79d699 Migrate le/gt/ge/eq/ne from the TH to Aten. Added support of type promotion. (#27017)
Summary:
https://github.com/pytorch/pytorch/pull/26981
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27017

Differential Revision: D17651454

Pulled By: ifedan

fbshipit-source-id: c6313caa11598a0ef160e1c6d2f3c33d03ce80c5
2019-09-28 15:08:41 -07:00
Mike Ruberry
8858f42aa4 Revert D17635651: [pytorch][PR] Migrate le/gt/ge/eq/ne from the TH to Aten. Added support of type promotion.
Test Plan: revert-hammer

Differential Revision:
D17635651

Original commit changeset: 6ec7615207f5

fbshipit-source-id: 1bd5d01856aabd01ff6b472dfa636bcea91c60a5
2019-09-27 21:09:26 -07:00
Igor Fedan
541de7e140 Migrate le/gt/ge/eq/ne from the TH to Aten. Added support of type promotion. (#26981)
Summary:
https://github.com/pytorch/pytorch/issues/24606 Migrate ne and ne_ from the TH to Aten (CUDA)
https://github.com/pytorch/pytorch/issues/24740 Migrate ne and ne_ from the TH to Aten (CPU)
https://github.com/pytorch/pytorch/issues/24573 Migrate gt and gt_ from the TH to Aten (CUDA)
https://github.com/pytorch/pytorch/issues/24709 Migrate gt and gt_ from the TH to Aten (CPU)
https://github.com/pytorch/pytorch/issues/24556 Migrate eq and eq_ from the TH to Aten (CUDA)
https://github.com/pytorch/pytorch/issues/24696 Migrate eq and eq_ from the TH to Aten (CPU)
https://github.com/pytorch/pytorch/issues/24568 Migrate ge and ge_ from the TH to Aten (CUDA)
https://github.com/pytorch/pytorch/issues/24703 Migrate ge and ge_ from the TH to Aten (CPU)
https://github.com/pytorch/pytorch/issues/24582 Migrate le and le_ from the TH to Aten (CUDA)
https://github.com/pytorch/pytorch/issues/24719 Migrate le and le_ from the TH to Aten (CPU)

Performance characteristics are similar to https://github.com/pytorch/pytorch/issues/25998

This PR migrates comparison ops from TH to ATen and adds type promotion in the same way as in https://github.com/pytorch/pytorch/issues/25998
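
For illustration, the type promotion these ops now share with the lt/lt_ port:

```python
import torch

a = torch.tensor([1, 2, 3], dtype=torch.int32)
b = torch.tensor([0.5, 2.0, 3.5], dtype=torch.double)
print(a >= b)   # tensor([ True,  True, False]) -- dtypes promoted, bool result
```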
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26981

Differential Revision: D17635651

Pulled By: ifedan

fbshipit-source-id: 6ec7615207f5c248a6dd85fc54c25bd5e6d328e6
2019-09-27 17:28:56 -07:00
Dmytro Dzhulgakov
764bf826e3 Remove fbgemm_is_cpu_supported in favor of torch.backends.quantized.supported_qengines (#26840)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26840

Cleaning up top-level namespace. Also cosmetic changes to torch.backends.quantized

Test Plan: Imported from OSS

Differential Revision: D17604403

Pulled By: dzhulgakov

fbshipit-source-id: c55af277ea7319d962a82a6120f65ccd47a60abc
2019-09-27 13:45:15 -07:00
Igor Fedan
f99bc714c7 Migrate lt and lt_ from the TH to Aten (#25998)
Summary:
https://github.com/pytorch/pytorch/issues/24593
https://github.com/pytorch/pytorch/issues/24727

**torch.lt(Tensor a, Tensor b)**
will compute the common dtype (highest) based on the inputs and then compare the values. The result will be a Bool tensor
```
>>> x = torch.tensor([0], dtype=torch.int)
>>> y = torch.tensor([0.5], dtype=torch.double)
>>> x < y
tensor([True])
```
Previously it was impossible to compare two tensors with different dtypes.

**torch.lt(Tensor a, Tensor b, out=c)**
will compute the common dtype (highest) based on the inputs and then compare the values. The result can only be populated into a Bool tensor
```
>>> x = torch.tensor([0], dtype=torch.int)
>>> y = torch.tensor([0.5], dtype=torch.double)
>>> z = torch.empty([1], dtype=torch.bool)
>>> torch.lt(x, y, out=z)
tensor([True])
```
Previously it was impossible to compare two tensors with different dtypes. Also, the result dtype could previously be Bool or Byte (deprecated); now only a Bool result is accepted.

**a.lt_(Tensor b)**
Expects that a and b have the same dtype, otherwise it's possible to get an overflow (example: 'a' is uint8 and 'b' is float32; 'a' would be promoted to float32 and the result would also be float32, which is then cast back to uint8, with the potential for overflow). Does not compute a common dtype. The result will have the dtype of a.
```
>>> x = torch.tensor([0], dtype=torch.double)
>>> y = torch.tensor([0.5], dtype=torch.double)
>>> x < y
tensor([True])
```
Works similarly to the previous implementation.

**torch.lt(Tensor a, Scalar b)**
will check that there is no overflow when converting b to the same type as a, then compute the common dtype and compare.
```
>>> x = torch.tensor([0], dtype=torch.double)
>>> x < 0.5
tensor([True])

>>> x = torch.tensor([0], dtype=torch.int)
>>> x < 0.5
tensor([True])
```
Fix https://github.com/pytorch/pytorch/issues/22301.

**torch.lt(Tensor a, Scalar b, out=c)**
will check that there is no overflow when converting b to the same type as a, then compute the common dtype and compare. The result can only be populated into a Bool tensor
```
>>> x = torch.tensor([0], dtype=torch.double)
>>> torch.lt(x, 0.5, out=z)
tensor([True])
```
Previously the result dtype could be Bool or Byte (deprecated); now only a Bool result is accepted. The rest works similarly to the previous implementation.

**torch.lt_(Tensor a, Scalar b)**
will check that there is no overflow when converting b to the same type as a, then compute the common dtype and compare. The result will have the dtype of a.
```
>>> x = torch.tensor([0], dtype=torch.int)
>>> x.lt_(1)
tensor([1], dtype=torch.int32)
>>> x = torch.tensor([0], dtype=torch.int)
>>> x.lt_(1.0)
tensor([1], dtype=torch.int32)
```
Works similarly to the previous implementation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25998

Differential Revision: D17431853

Pulled By: ifedan

fbshipit-source-id: b5effc6a5d9b32da379395b32abc628b604faaf7
2019-09-26 16:05:27 -07:00
Hong Xu
9dd8a129de Fix Vec256<T>::abs() for floating point when applied on -0.0 (#26422)
Summary:
Currently, when a Vec256<T> (base) object contains -0.0, Vec256<T>::abs() does not produce 0.0, but -0.0 instead. This commit fixes that. The bug mostly affects CPUs without AVX support, such as ARM, PowerPC, and older Intel models.
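
A quick way to observe the fix from Python (dividing by the result distinguishes the two zeros, since 1/+0.0 is inf while 1/-0.0 is -inf):

```python
import torch

x = torch.tensor([-0.0])
print(1.0 / x.abs())   # tensor([inf]) after the fix; previously tensor([-inf]) on non-AVX CPUs
```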
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26422

Differential Revision: D17607346

fbshipit-source-id: e8d4595f0e88ad93018a61f89b9e3dcada485358
2019-09-26 15:55:55 -07:00
Ethan Steinberg
bf1d957dc8 Fix the Bernoulli distribution sampler (#26864)
Summary:
The current Bernoulli distribution sampler is slightly off in that it returns true slightly too often. This is most obvious at very low p values, like p = 0, although it theoretically occurs at every probability. See  https://github.com/pytorch/pytorch/issues/26807.
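
A simple sanity check of the fix described above:

```python
import torch

p = torch.zeros(1_000_000)
print(torch.bernoulli(p).sum().item())   # 0.0 -- probability 0 must never yield a 1
```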
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26864

Differential Revision: D17610459

Pulled By: ezyang

fbshipit-source-id: 28215ff820a6046822513f284793e7b850d38438
2019-09-26 14:14:57 -07:00
Hong Xu
91549ef6c8 Move the CUDA implementation of log to ATen. (#26494)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26494

Close #24586

Test Plan: Imported from OSS

Differential Revision: D17572497

Pulled By: VitalyFedyunin

fbshipit-source-id: e1bcd33021464eaa4affd4c6d3283c8403069945
2019-09-25 17:04:08 -07:00
nmilosev
5fc52482cf torch.load default encoding change to 'utf-8' (#26421)
Summary:
Change the default encoding used by torch.load to 'utf-8'.

This commit provides changes for cases where a user tries to torch.load
a pickled module with non-ASCII characters in the docstring, as
discussed in https://github.com/pytorch/pytorch/issues/21743. The default encoding was changed from 'ascii'
to 'utf-8'. The documentation for `torch.load` was updated, and two tests
(loading a py2 unicode module containing unicode; an error being raised when the
user explicitly sets the wrong encoding) were written.

~~This commit provides changes for better error handling in cases
where user tries to `torch.load` a pickled module with non-ASCII
characters in the docstring as discussed in https://github.com/pytorch/pytorch/issues/21743.~~

Ping ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26421

Differential Revision: D17581633

Pulled By: yf225

fbshipit-source-id: f8e77dcf7907092771149aad8ede6cfb73c21620
2019-09-25 14:59:02 -07:00
vishwakftw
aaf30cdf36 Port CUDA implementation of expm1 to ATen (#26598)
Summary:
Closes https://github.com/pytorch/pytorch/issues/24562
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26598

Differential Revision: D17531503

Pulled By: VitalyFedyunin

fbshipit-source-id: 8119c796e142f073ad4e274dda1ad99344215c48
2019-09-25 11:11:58 -07:00
Mike Ruberry
25cd3c6b7d Lets generic tests use multiple devices (#26594)
Summary:
- Separates device type from default (test) device
- Adds multidevice decorator
- Updates generic tests to use multidevice decorator where applicable

TorchXLA wants to change the default test device based on the test environment. Separating the device type and the default (test) device enables that functionality.

Additionally, many existing tests only run on multiple devices and are required, as a consequence, to make CUDA-specific API calls. The multidevice decorator simplifies the existing code and limits the CUDA dependency. Eventually this should let us run multidevice tests on multiple device types.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26594

Test Plan: tests were manually run with the CUDA test device set to 'cuda:1'.

Differential Revision: D17568910

Pulled By: mruberry

fbshipit-source-id: c442f748a31a970be8c21deb12a67c3b315c1128
2019-09-25 10:16:22 -07:00
Hong Xu
ae0732cde3 Speed up an integer to the power of a positive integer on CPU (#26020)
Summary:
Currently, integer scalar exponents are always cast to double. This commit avoids the cast when the tensor is also integral and the scalar is positive, in order to speed things up.

Benchmark (Debian Buster, g++ 8, Intel(R) Xeon(R) E-2136 CPU @ 3.30GHz, Debug build, Turbo turned off):

```python
import timeit

for n, t in [(1000, 13000),
            (10_000, 1300)]:
    for e in (2, 3, 4):
        for dtype in ('torch.int16', 'torch.int32', 'torch.int64'):
            print(f'a.pow({e}) (a.numel() == {n}) for {t} times')
            print(f'dtype {dtype}, {t} times', end='\t\t')
            print(timeit.timeit(f'a.pow({e})',
                                setup=f'import torch; a = torch.arange({n}, device="cpu", dtype={dtype})',
                                number=t))
```

Before:

```
a.pow(2) (a.numel() == 1000) for 13000 times
dtype torch.int16, 13000 times		1.6958350749996498
a.pow(2) (a.numel() == 1000) for 13000 times
dtype torch.int32, 13000 times		0.7989626339999631
a.pow(2) (a.numel() == 1000) for 13000 times
dtype torch.int64, 13000 times		0.7973162800003593
a.pow(3) (a.numel() == 1000) for 13000 times
dtype torch.int16, 13000 times		1.8660746679997828
a.pow(3) (a.numel() == 1000) for 13000 times
dtype torch.int32, 13000 times		0.8101709959996697
a.pow(3) (a.numel() == 1000) for 13000 times
dtype torch.int64, 13000 times		0.8135280149999744
a.pow(4) (a.numel() == 1000) for 13000 times
dtype torch.int16, 13000 times		5.010833072999958
a.pow(4) (a.numel() == 1000) for 13000 times
dtype torch.int32, 13000 times		4.801007671999741
a.pow(4) (a.numel() == 1000) for 13000 times
dtype torch.int64, 13000 times		3.963344578000033
a.pow(2) (a.numel() == 10000) for 1300 times
dtype torch.int16, 1300 times		1.6216251330001796
a.pow(2) (a.numel() == 10000) for 1300 times
dtype torch.int32, 1300 times		0.5672429639998882
a.pow(2) (a.numel() == 10000) for 1300 times
dtype torch.int64, 1300 times		0.5544572270000572
a.pow(3) (a.numel() == 10000) for 1300 times
dtype torch.int16, 1300 times		1.656308512999658
a.pow(3) (a.numel() == 10000) for 1300 times
dtype torch.int32, 1300 times		1.502670819999821
a.pow(3) (a.numel() == 10000) for 1300 times
dtype torch.int64, 1300 times		0.5757876879997639
a.pow(4) (a.numel() == 10000) for 1300 times
dtype torch.int16, 1300 times		4.775718216999849
a.pow(4) (a.numel() == 10000) for 1300 times
dtype torch.int32, 1300 times		4.754745475000163
a.pow(4) (a.numel() == 10000) for 1300 times
dtype torch.int64, 1300 times		3.737249878000057
```

After:

```
a.pow(2) (a.numel() == 1000) for 13000 times
dtype torch.int16, 13000 times		1.1006453190002503
a.pow(2) (a.numel() == 1000) for 13000 times
dtype torch.int32, 13000 times		1.0849009019998448
a.pow(2) (a.numel() == 1000) for 13000 times
dtype torch.int64, 13000 times		1.093259106000005
a.pow(3) (a.numel() == 1000) for 13000 times
dtype torch.int16, 13000 times		1.0859826279997833
a.pow(3) (a.numel() == 1000) for 13000 times
dtype torch.int32, 13000 times		1.1076840900000207
a.pow(3) (a.numel() == 1000) for 13000 times
dtype torch.int64, 13000 times		1.0755480369998622
a.pow(4) (a.numel() == 1000) for 13000 times
dtype torch.int16, 13000 times		1.918211066999902
a.pow(4) (a.numel() == 1000) for 13000 times
dtype torch.int32, 13000 times		1.9183043200000611
a.pow(4) (a.numel() == 1000) for 13000 times
dtype torch.int64, 13000 times		1.930021430999659
a.pow(2) (a.numel() == 10000) for 1300 times
dtype torch.int16, 1300 times		0.7271483560002707
a.pow(2) (a.numel() == 10000) for 1300 times
dtype torch.int32, 1300 times		0.7289002070001516
a.pow(2) (a.numel() == 10000) for 1300 times
dtype torch.int64, 1300 times		0.7267536800000016
a.pow(3) (a.numel() == 10000) for 1300 times
dtype torch.int16, 1300 times		0.7301799359997858
a.pow(3) (a.numel() == 10000) for 1300 times
dtype torch.int32, 1300 times		0.7289195180001116
a.pow(3) (a.numel() == 10000) for 1300 times
dtype torch.int64, 1300 times		0.7270008230002531
a.pow(4) (a.numel() == 10000) for 1300 times
dtype torch.int16, 1300 times		1.5354506029998447
a.pow(4) (a.numel() == 10000) for 1300 times
dtype torch.int32, 1300 times		1.528263066999898
a.pow(4) (a.numel() == 10000) for 1300 times
dtype torch.int64, 1300 times		1.5369428439998956
```

 ---

Best viewed with whitespace changes turned off
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26020

Differential Revision: D17485400

Pulled By: VitalyFedyunin

fbshipit-source-id: 3a16b074825a5aab0f7e7af3d8100f9e4b7011a3
2019-09-24 09:17:09 -07:00
Hong Xu
7bdc0c138a Move the CUDA implementation of trunc to ATen. (#25423)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25423

Fix #24650

Test Plan: Imported from OSS

Differential Revision: D17397489

Pulled By: VitalyFedyunin

fbshipit-source-id: 933f915a44ff9b7803ddb2708bf0e723433ee0b6
2019-09-24 07:08:55 -07:00
Supriya Rao
45391ccecb Update qengine flag in python to string (#26620)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26620

This change updates torch.backends.quantized.engine to accept a string ("fbgemm"/"qnnpack"/"none" for now).
set_qengine and get_qengine return an int which represents the at::QEngine enum.
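
For illustration, the string-based API described above (engine names depend on the build; this sketch assumes an x86 build where "fbgemm" is available, and supported_qengines comes from the earlier commit in this series):

```python
import torch

print(torch.backends.quantized.supported_qengines)
torch.backends.quantized.engine = "fbgemm"   # set by string
print(torch.backends.quantized.engine)       # 'fbgemm'
```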

Test Plan:
python test/test_torch.py

Imported from OSS

Differential Revision: D17533582

fbshipit-source-id: 5103263d0d59ff37d43dec27243cb76ba8ba633f
2019-09-23 17:56:50 -07:00
Edward Yang
fdf2bdef0c Revert D17450502: [pytorch][PR] [WIP] Enabled bfloat16 dtype on CUDA
Test Plan: revert-hammer

Differential Revision:
D17450502

Original commit changeset: 0a5acc5fe1b1

fbshipit-source-id: 6360e750e9805dc9c7c6ca8a9c16256ecd749416
2019-09-23 12:11:52 -07:00
Iurii Zdebskyi
76697a3bfc Enabled bfloat16 dtype on CUDA
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/26407

Differential Revision: D17450502

Pulled By: izdeby

fbshipit-source-id: 0a5acc5fe1b1555c61ebe038aee9eaaae9dac228
2019-09-23 09:19:04 -07:00
Richard Zou
4fada96218 Renames tensor.renamed -> rename, tensor.names_ -> rename_ (#26548)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26548

This makes the naming more consistent with PyTorch's API. The original
concern was that `tensor.rename` might make the operation seem like it
is in-place. However, we have many "verb" APIs: `tensor.add(other)`, for
example, doesn't add other to tensor in-place, but `tensor.add_(other)`
does.

`tensor.rename_` does exactly the same thing as `tensor.rename`, but in-place.
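
For illustration:

```python
import torch

x = torch.zeros(2, 3)
y = x.rename('N', 'C')     # out-of-place: returns a renamed tensor, x is unchanged
print(y.names, x.names)    # ('N', 'C') (None, None)
x.rename_('N', 'C')        # in-place variant
print(x.names)             # ('N', 'C')
```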

Test Plan: - [namedtensor ci]

Differential Revision: D17502021

Pulled By: zou3519

fbshipit-source-id: 6a5b93136a820075013cd1e30fb8fc6b9d77d7d9
2019-09-22 15:38:26 -07:00
Jerry Zhang
2667493f4c Expose supportedQEngines to python (#26474)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26474

att

Test Plan:
python test/test_torch.py

Imported from OSS

Differential Revision: D17517373

fbshipit-source-id: af931761d6ee31a88808d05f686002a83b6b25af
2019-09-21 10:36:13 -07:00
Hong Xu
9ed6074827 Correct the test of a big number (2 ^ 31) (#26491)
Summary:
In Python, 2 ^ 31 is 29 (^ is bitwise XOR), which is not a big number. Corrected to 2 ** 31.
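
The one-line demonstration of the mix-up:

```python
print(2 ^ 31)    # 29 -- ^ is bitwise XOR in Python
print(2 ** 31)   # 2147483648
```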
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26491

Differential Revision: D17494296

fbshipit-source-id: 83d320e8fb6d1b7df41e4474933a98107c8e4129
2019-09-20 19:14:55 -07:00
Vitaly Fedyunin
f55a9da00e Move the CUDA implementation of floor to ATen. (#25372)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25372

Close #24617

Test Plan: Imported from OSS

Differential Revision: D17397478

fbshipit-source-id: 11a515235391ae796e2f84cde1913e56561c41bc
2019-09-20 13:15:29 -07:00
Edward Yang
9b7011c5c2 Implement multiple dispatch (#26468) (#26501)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26501

Instead of considering only the TensorTypeSet of the first argument, we collect all Tensor and TensorList arguments and union them together before computing the dispatch type id.

XLA companion patch at https://github.com/pytorch/xla/pull/1031

Billing of changes:
* ATenDispatch fallback code (i.e., what gets run if there is no entry for a function in the table) now lives out-of-line in a function `getFallbackOp`. This gave me an opportunity to write a more detailed error message, providing information about what registrations were available. There is a TODO in the fallback code, suggesting that we could automatically redispatch in the event that there is no handler for the key. But this is a bit of a design question, because it's not clear if automatic redispatch would cover up errors in the dispatch table (i.e., there *should* have been something registered at some key, but there wasn't.)
* Collection of Tensor/TensorList arguments is done using the trusty old IterArgs helper class. A minor bit of refactoring I had to do to get here was move the IterArgs functionality in torch/csrc/utils/variadic.h into ATen/core.  There's some refactoring due on that file too (it has copies of some C++ helper pieces which already live in c10--you can't actually move the whole thing because it is literally incompatible with other code in the codebase). So instead of calling `type_set()` to get the type set of the dispatch argument, now we just call `at::detail::multi_dispatch_tensor_type_set` on all of the tensor/tensor list arguments.
* The code generator is adjusted to codegen collection of arguments as needed. There is a little bit of a hack in the code generator to turn 'self' arguments into '*this'.  I think this may be duplicated with some logic somewhere else but I have to double check.

The new generated code looks like this:

```
inline Tensor & Tensor::copy_(const Tensor & src, bool non_blocking) const {
    static auto table = globalATenDispatch().getOpTable("aten::copy_(Tensor(a!) self, Tensor src, bool non_blocking=False) -> Tensor(a!)");
    return table->getOp<Tensor & (Tensor &, const Tensor &, bool)>(at::detail::multi_dispatch_tensor_type_set(*this, src))(const_cast<Tensor&>(*this), src, non_blocking);
}
```

The key difference is that previously we wrote `type_set()` as argument to getOp; now it is a call to `multi_dispatch_tensor_type_set` which collects the type ids together.

After turning on multi-dispatch, I had to refactor existing code which previously dispatched to one place, but now dispatches somewhere else. The primary component affected by this is sparse.

* Binary operations (add/sub/mul/div/addmm) now dispatch to sparse kernels even if you did add(dense, sparse). So I delete all the sparse handling code from dense kernels, and bulk up the sparse error handling to handle when the first argument is dense. In the case of addmm, I can eliminate the bridge code entirely (well, not quite: more on this below). I also updated the dispatch on sparse to actually point at sparse kernels. Pay special attention to the handling of `div_` by scalar: previously this logic lived in the "dense" `div_` implementation, but there is actually not any sparse kernel we dispatch to. I solved this particular problem by making a redispatch, but another valid approach would have been to add specific dispatches for sparse div on scalar. This codepath is poorly tested because it is only exercised from C++.
* One minor annoyance is that because I now want separate dispatch for dense and sparse, I also need to replicate the `add`, `add_`, `add_out` trifecta on the sparse side. I opted for a compromise here: I wrote a new `add_sparse` trifecta, but reused the implementation between CPU and CUDA. This means that I have to do another dispatch once I get to `add_out`. The alternative would have been to do twice as many copies for CPU and CUDA (thereby eliminating the extra dispatch), but that seemed distinctly not worth it.
* A lot of kernels in sparse assumed that the dispatch argument must be sparse. This is no longer true with dispatch, so I converted the asserts into plain error checking. This also means that we've perturbed the error message in the case of TestSparseOneOff.test_cuda_sparse_cpu_dense_add (I just updated the saved error message)
* `addmm` is a little bit even more special: the bridge code also handled broadcasting. I replicated the broadcasting logic between CPU and CUDA implementations to avoid an extra dispatch.
* `_sparse_addmm` gave me a bit of trouble, because I had forgotten why we had `torch.sparse.addmm` in the first place. But in the end, its changes followed along with the structural changes I made in addmm. I opted for an extra dispatch here for simplicity.
* c10d has some Variable-Tensor confusion in its sparse code. I've worked around it by judiciously inserting "no variable type" guards, but a more correct fix would be to just solve the confusion entirely.

Benchmark:

Apply the following patch to the base commit and this commit:

```
 diff --git a/aten/src/ATen/native/Const.cpp b/aten/src/ATen/native/Const.cpp
new file mode 100644
index 0000000000..b66f4d3ece
 --- /dev/null
+++ b/aten/src/ATen/native/Const.cpp
@@ -0,0 +1,10 @@
+#include <ATen/ATen.h>
+
+namespace at {
+namespace native {
+
+Tensor _const5(const Tensor& self, const Tensor& second, const Tensor& third, const Tensor& fourth, const Tensor& fifth) {
+  return self;
+}
+
+}} // namespace at::native
 diff --git a/aten/src/ATen/native/native_functions.yaml b/aten/src/ATen/native/native_functions.yaml
index b494ed7950..fddae638bb 100644
 --- a/aten/src/ATen/native/native_functions.yaml
+++ b/aten/src/ATen/native/native_functions.yaml
@@ -5878,3 +5878,9 @@
   dispatch:
     CPU: im2col_backward_cpu
     CUDA: im2col_backward_cuda
+
+# For benchmarking
+- func: _const5(Tensor self, Tensor second, Tensor third, Tensor fourth, Tensor fifth) -> Tensor
+  variants: function
+  dispatch:
+    CPU: _const5
```

Comparisons with timeit:

One-argument, representative case:

Before:

```
In [6]: %timeit x.reshape(1, 1)
1.46 µs ± 1.38 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [7]: %timeit x.reshape(1, 1)
1.48 µs ± 29.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [8]: %timeit x.reshape(1, 1)
1.52 µs ± 61.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

After:

```
In [3]: %timeit x.reshape(1, 1)
1.42 µs ± 1.34 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [4]: %timeit x.reshape(1, 1)
1.43 µs ± 1.01 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [5]: %timeit x.reshape(1, 1)
1.42 µs ± 0.982 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

Five-argument, synthetic case (we expect, with enough Tensor arguments, for there to be a slowdown, as we scale `O(n)` with number of arguments, compared to old dispatcher which is `O(1)` with number of arguments):

Before:

```
In [1]: import torch

In [2]: x = torch.zeros(1)

In [3]: %timeit torch._const5(x, x, x, x, x)
949 ns ± 1.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [4]: %timeit torch._const5(x, x, x, x, x)
954 ns ± 1.96 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [5]: %timeit torch._const5(x, x, x, x, x)
947 ns ± 0.601 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

After:

```
In [3]: %timeit torch._const5(x, x, x, x, x)
985 ns ± 9.11 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [4]: %timeit torch._const5(x, x, x, x, x)
984 ns ± 1.17 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [5]: %timeit torch._const5(x, x, x, x, x)
988 ns ± 0.555 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D17499154

Pulled By: ezyang

fbshipit-source-id: 8ea237c2e935134b0f4f8d6cfd89c6a93037c02c
2019-09-20 10:12:04 -07:00
Richard Zou
e2515a4d6d Allocate empty tensor instead of empty_like in binary ops, fix pow (#26498)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26498

We should allocate an empty tensor as a result tensor when performing
binary ops. Currently some ops use `empty_like(self)` as the initial
result tensor before passing it into TensorIterator. This is not very
efficient because TensorIterator may resize the tensor due to
broadcasting, causing more memory allocation. By using an empty tensor
as the result tensor, we only need to allocate/resize memory once as
opposed to twice.

Also fixes https://github.com/pytorch/pytorch/issues/26495. The bug
there is that the implementation of `pow` is missing a resize in one
case.
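
For illustration, the broadcasting out= case the fix covers (per the summary, the out tensor is resized once to the broadcast shape):

```python
import torch

a = torch.randn(3, 1)
b = torch.randn(1, 4)
out = torch.empty(0)
torch.pow(a, b, out=out)
print(out.shape)           # torch.Size([3, 4])
```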

Test Plan:
- new test
- run tests

Differential Revision: D17500025

Pulled By: zou3519

fbshipit-source-id: bff4949af5e75541c04669b961bcf2e1ec456faf
2019-09-20 07:38:08 -07:00
Michael Suo
5304358859 Revert D17481256: Implement multiple dispatch
Test Plan: revert-hammer

Differential Revision:
D17481256

Original commit changeset: b3206936b4ca

fbshipit-source-id: a162c42168c17e24b5eaff83a7aae48beef3d2c2
2019-09-19 14:53:40 -07:00
Edward Yang
0705f759a3 Implement multiple dispatch (#26468)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26468

Instead of considering only the TensorTypeSet of the first argument, we collect all Tensor and TensorList arguments and union them together before computing the dispatch type id.

XLA companion patch at https://github.com/pytorch/xla/pull/1031

Billing of changes:
* ATenDispatch fallback code (i.e., what gets run if there is no entry for a function in the table) now lives out-of-line in a function `getFallbackOp`. This gave me an opportunity to write a more detailed error message, providing information about what registrations were available. There is a TODO in the fallback code, suggesting that we could automatically redispatch in the event that there is no handler for the key. But this is a bit of a design question, because it's not clear if automatic redispatch would cover up errors in the dispatch table (i.e., there *should* have been something registered at some key, but there wasn't.)
* Collection of Tensor/TensorList arguments is done using the trusty old IterArgs helper class. A minor bit of refactoring I had to do to get here was move the IterArgs functionality in torch/csrc/utils/variadic.h into ATen/core.  There's some refactoring due on that file too (it has copies of some C++ helper pieces which already live in c10--you can't actually move the whole thing because it is literally incompatible with other code in the codebase). So instead of calling `type_set()` to get the type set of the dispatch argument, now we just call `at::detail::multi_dispatch_tensor_type_set` on all of the tensor/tensor list arguments.
* The code generator is adjusted to codegen collection of arguments as needed. There is a little bit of a hack in the code generator to turn 'self' arguments into '*this'.  I think this may be duplicated with some logic somewhere else but I have to double check.

The new generated code looks like this:

```
inline Tensor & Tensor::copy_(const Tensor & src, bool non_blocking) const {
    static auto table = globalATenDispatch().getOpTable("aten::copy_(Tensor(a!) self, Tensor src, bool non_blocking=False) -> Tensor(a!)");
    return table->getOp<Tensor & (Tensor &, const Tensor &, bool)>(at::detail::multi_dispatch_tensor_type_set(*this, src))(const_cast<Tensor&>(*this), src, non_blocking);
}
```

The key difference is that previously we wrote `type_set()` as argument to getOp; now it is a call to `multi_dispatch_tensor_type_set` which collects the type ids together.
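
As a rough illustration of the idea (plain Python with made-up names, not the real C++ API), the dispatch key set is now the union over all tensor arguments:

```
def multi_dispatch_type_set(*per_argument_type_sets):
    # Union the type ids contributed by every Tensor/TensorList argument,
    # rather than using only the first argument's set.
    merged = set()
    for type_set in per_argument_type_sets:
        merged |= type_set
    return merged

# e.g. add(dense_cpu_tensor, sparse_cpu_tensor) now sees both ids and
# therefore lands on a sparse kernel:
print(multi_dispatch_type_set({"CPUTensorId"}, {"SparseCPUTensorId"}))
```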

After turning on multi-dispatch, I had to refactor existing code which previously dispatched one place, but now dispatches somewhere else. The primary component affected by this is sparse.

* Binary operations (add/sub/mul/div/addmm) now dispatch to sparse kernels even if you did add(dense, sparse). So I delete all the sparse handling code from dense kernels, and bulk up the sparse error handling to handle when the first argument is dense. In the case of addmm, I can eliminate the bridge code entirely (well, not quite: more on this below). I also updated the dispatch on sparse to actually point at sparse kernels. Pay special attention to the handling of `div_` by scalar: previously this logic lived in the "dense" `div_` implementation, but there is actually not any sparse kernel we dispatch to. I solved this particular problem by making a redispatch, but another valid approach would have been to add specific dispatches for sparse div on scalar. This codepath is poorly tested because it is only exercised from C++.
* One minor annoyance is that because I now want separate dispatch for dense and sparse, I also need to replicate the `add`, `add_`, `add_out` trifecta on the sparse side. I opted for a compromise here: I wrote a new `add_sparse` trifecta, but reused the implementation between CPU and CUDA. This means that I have to do another dispatch once I get to `add_out`. The alternative would have been to do twice as many copies for CPU and CUDA (thereby eliminating the extra dispatch) but that seemed distinctly not worth it.
* A lot of kernels in sparse assumed that the dispatch argument must be sparse. This is no longer true with dispatch, so I converted the asserts into plain error checking. This also means that we've perturbed the error message in the case of TestSparseOneOff.test_cuda_sparse_cpu_dense_add (I just updated the saved error message)
* `addmm` is even a little more special: the bridge code also handled broadcasting. I replicated the broadcasting logic between the CPU and CUDA implementations to avoid an extra dispatch.
* `_sparse_addmm` gave me a bit of trouble, because I had forgotten why we had `torch.sparse.addmm` in the first place. But in the end, its changes followed along with the structural changes I made in addmm. I opted for an extra dispatch here for simplicity.
* c10d has some Variable-Tensor confusion in its sparse code. I've worked around it by judiciously inserting "no variable type" guards, but a more correct fix would be to just solve the confusion entirely.

Benchmark:

Apply the following patch to the base commit and this commit:

```
 diff --git a/aten/src/ATen/native/Const.cpp b/aten/src/ATen/native/Const.cpp
new file mode 100644
index 0000000000..b66f4d3ece
 --- /dev/null
+++ b/aten/src/ATen/native/Const.cpp
@@ -0,0 +1,10 @@
+#include <ATen/ATen.h>
+
+namespace at {
+namespace native {
+
+Tensor _const5(const Tensor& self, const Tensor& second, const Tensor& third, const Tensor& fourth, const Tensor& fifth) {
+  return self;
+}
+
+}} // namespace at::native
 diff --git a/aten/src/ATen/native/native_functions.yaml b/aten/src/ATen/native/native_functions.yaml
index b494ed7950..fddae638bb 100644
 --- a/aten/src/ATen/native/native_functions.yaml
+++ b/aten/src/ATen/native/native_functions.yaml
@@ -5878,3 +5878,9 @@
   dispatch:
     CPU: im2col_backward_cpu
     CUDA: im2col_backward_cuda
+
+# For benchmarking
+- func: _const5(Tensor self, Tensor second, Tensor third, Tensor fourth, Tensor fifth) -> Tensor
+  variants: function
+  dispatch:
+    CPU: _const5
```

Comparisons with timeit:

One-argument, representative case:

Before:

```
In [6]: %timeit x.reshape(1, 1)
1.46 µs ± 1.38 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [7]: %timeit x.reshape(1, 1)
1.48 µs ± 29.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [8]: %timeit x.reshape(1, 1)
1.52 µs ± 61.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

After:

```
In [3]: %timeit x.reshape(1, 1)
1.42 µs ± 1.34 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [4]: %timeit x.reshape(1, 1)
1.43 µs ± 1.01 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [5]: %timeit x.reshape(1, 1)
1.42 µs ± 0.982 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

Five-argument, synthetic case (with enough Tensor arguments we expect a slowdown, since the new dispatcher scales `O(n)` with the number of arguments, whereas the old dispatcher was `O(1)`):

Before:

```
In [1]: import torch

In [2]: x = torch.zeros(1)

In [3]: %timeit torch._const5(x, x, x, x, x)
949 ns ± 1.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [4]: %timeit torch._const5(x, x, x, x, x)
954 ns ± 1.96 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [5]: %timeit torch._const5(x, x, x, x, x)
947 ns ± 0.601 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

After:

```
In [3]: %timeit torch._const5(x, x, x, x, x)
985 ns ± 9.11 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [4]: %timeit torch._const5(x, x, x, x, x)
984 ns ± 1.17 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [5]: %timeit torch._const5(x, x, x, x, x)
988 ns ± 0.555 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: bddppq

Differential Revision: D17481256

Pulled By: ezyang

fbshipit-source-id: b3206936b4ca8938d45ea90fd71422e0d80b5f96
2019-09-19 14:29:38 -07:00
iurii zdebskyi
f673def92d Enabled where for bool tensor on CUDA (#26430)
Summary:
Enabled "where_cuda" for bool tensors on CUDA
Fixing https://github.com/pytorch/pytorch/issues/26247
Tested via unit tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26430

Differential Revision: D17464181

Pulled By: izdeby

fbshipit-source-id: cbb09925753b2e6f35e7400da3243d4d3fc86b69
2019-09-19 12:29:31 -07:00
Junjie Bai
07bd76988e Revert D17265918: Implement multiple dispatch
Test Plan: revert-hammer

Differential Revision:
D17265918

Original commit changeset: 221efe4e86a4

fbshipit-source-id: f0ab90fa1201080e0d62fd140faf0fcdfd56601b
2019-09-19 09:50:17 -07:00
Edward Yang
ece14ff473 Implement multiple dispatch (#25653)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25653

Instead of considering only the TensorTypeSet of the first argument, we collect all Tensor and TensorList arguments and union them together before computing the dispatch type id.

Billing of changes:
* ATenDispatch fallback code (i.e., what gets run if there is no entry for a function in the table) now lives out-of-line in a function `getFallbackOp`. This gave me an opportunity to write a more detailed error message, providing information about what registrations were available. There is a TODO in the fallback code, suggesting that we could automatically redispatch in the event that there is no handler for the key. But this is a bit of a design question, because it's not clear if automatic redispatch would cover up errors in the dispatch table (i.e., there *should* have been something registered at some key, but there wasn't.)
* Collection of Tensor/TensorList arguments is done using the trusty old IterArgs helper class. A minor bit of refactoring I had to do to get here was move the IterArgs functionality in torch/csrc/utils/variadic.h into ATen/core.  There's some refactoring due on that file too (it has copies of some C++ helper pieces which already live in c10--you can't actually move the whole thing because it is literally incompatible with other code in the codebase). So instead of calling `type_set()` to get the type set of the dispatch argument, now we just call `at::detail::multi_dispatch_tensor_type_set` on all of the tensor/tensor list arguments.
* The code generator is adjusted to codegen collection of arguments as needed. There is a little bit of a hack in the code generator to turn 'self' arguments into '*this'.  I think this may be duplicated with some logic somewhere else but I have to double check.

After turning on multi-dispatch, I had to refactor existing code which previously dispatched one place, but now dispatches somewhere else. The primary component affected by this is sparse.

* Binary operations (add/sub/mul/div/addmm) now dispatch to sparse kernels even if you did add(dense, sparse). So I delete all the sparse handling code from dense kernels, and bulk up the sparse error handling to handle when the first argument is dense. In the case of addmm, I can eliminate the bridge code entirely (well, not quite: more on this below). I also updated the dispatch on sparse to actually point at sparse kernels. Pay special attention to the handling of `div_` by scalar: previously this logic lived in the "dense" `div_` implementation, but there is actually not any sparse kernel we dispatch to. I solved this particular problem by making a redispatch, but another valid approach would have been to add specific dispatches for sparse div on scalar. This codepath is poorly tested because it is only exercised from C++.
* One minor annoyance is that because I now want separate dispatch for dense and sparse, I also need to replicate the `add`, `add_`, `add_out` trifecta on the sparse side. I opted for a compromise here: I wrote a new `add_sparse` trifecta, but reused the implementation between CPU and CUDA. This means that I have to do another dispatch once I get to `add_out`. The alternative would have been to do twice as many copies for CPU and CUDA (thereby eliminating the extra dispatch) but that seemed distinctly not worth it.
* A lot of kernels in sparse assumed that the dispatch argument must be sparse. This is no longer true with dispatch, so I converted the asserts into plain error checking. This also means that we've perturbed the error message in the case of TestSparseOneOff.test_cuda_sparse_cpu_dense_add (I just updated the saved error message)
* `addmm` is even a little more special: the bridge code also handled broadcasting. I replicated the broadcasting logic between the CPU and CUDA implementations to avoid an extra dispatch.
* `_sparse_addmm` gave me a bit of trouble, because I had forgotten why we had `torch.sparse.addmm` in the first place. But in the end, its changes followed along with the structural changes I made in addmm. I opted for an extra dispatch here for simplicity.
* c10d has some Variable-Tensor confusion in its sparse code. I've worked around it by judiciously inserting "no variable type" guards, but a more correct fix would be to just solve the confusion entirely.

Benchmark:

Apply the following patch to the base commit and this commit:

```
 diff --git a/aten/src/ATen/native/Const.cpp b/aten/src/ATen/native/Const.cpp
new file mode 100644
index 0000000000..b66f4d3ece
 --- /dev/null
+++ b/aten/src/ATen/native/Const.cpp
@@ -0,0 +1,10 @@
+#include <ATen/ATen.h>
+
+namespace at {
+namespace native {
+
+Tensor _const5(const Tensor& self, const Tensor& second, const Tensor& third, const Tensor& fourth, const Tensor& fifth) {
+  return self;
+}
+
+}} // namespace at::native
 diff --git a/aten/src/ATen/native/native_functions.yaml b/aten/src/ATen/native/native_functions.yaml
index b494ed7950..fddae638bb 100644
 --- a/aten/src/ATen/native/native_functions.yaml
+++ b/aten/src/ATen/native/native_functions.yaml
@@ -5878,3 +5878,9 @@
   dispatch:
     CPU: im2col_backward_cpu
     CUDA: im2col_backward_cuda
+
+# For benchmarking
+- func: _const5(Tensor self, Tensor second, Tensor third, Tensor fourth, Tensor fifth) -> Tensor
+  variants: function
+  dispatch:
+    CPU: _const5
```

Comparisons with timeit:

One-argument, representative case:

Before:

```
In [6]: %timeit x.reshape(1, 1)
1.46 µs ± 1.38 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [7]: %timeit x.reshape(1, 1)
1.48 µs ± 29.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [8]: %timeit x.reshape(1, 1)
1.52 µs ± 61.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

After:

```
In [3]: %timeit x.reshape(1, 1)
1.42 µs ± 1.34 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [4]: %timeit x.reshape(1, 1)
1.43 µs ± 1.01 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [5]: %timeit x.reshape(1, 1)
1.42 µs ± 0.982 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

Five-argument, synthetic case (with enough Tensor arguments we expect a slowdown, since the new dispatcher scales `O(n)` with the number of arguments, whereas the old dispatcher was `O(1)`):

Before:

```
In [1]: import torch

In [2]: x = torch.zeros(1)

In [3]: %timeit torch._const5(x, x, x, x, x)
949 ns ± 1.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [4]: %timeit torch._const5(x, x, x, x, x)
954 ns ± 1.96 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [5]: %timeit torch._const5(x, x, x, x, x)
947 ns ± 0.601 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

After:

```
In [3]: %timeit torch._const5(x, x, x, x, x)
985 ns ± 9.11 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [4]: %timeit torch._const5(x, x, x, x, x)
984 ns ± 1.17 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [5]: %timeit torch._const5(x, x, x, x, x)
988 ns ± 0.555 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D17265918

Pulled By: ezyang

fbshipit-source-id: 221efe4e86a40f36abc81e2ebceaa7e251c90b3d
2019-09-19 09:30:40 -07:00
Mike Ruberry
d9ab78b3f0 Moves more tests to TestTorchDeviceType (#26435)
Summary:
- Moves all ROCm-requiring test_torch tests to TestTorchDeviceType
- Moves test_stft and test_lu from test_cuda
- Moves many CUDA-only test_torch tests to TestTorchDeviceType
- Combines several test_torch CPU tests with their CUDA variants
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26435

Differential Revision: D17470469

Pulled By: mruberry

fbshipit-source-id: 90bb7fc09465c53eb2ab8da52eb2c2509775c16f
2019-09-19 01:49:34 -07:00
Vitaly Fedyunin
36ade9aa23 Move the CUDA implementation of rsqrt to ATen. (#25285)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25285

Fix #24620

Test Plan: Imported from OSS

Differential Revision: D17397459

fbshipit-source-id: 024dc0da8085df85513fde5f1d1e0141f734b284
2019-09-18 18:17:52 -07:00
Mike Ruberry
248d5857ae Adds dtypes decorators to and allows helper methods in device generic test classes (#26375)
Summary:
- Adds dtypes, dtypesIfCPU, and dtypesIfCUDA decorators.
- Eliminates the need for nontest members to be defined in an inherited base.
- Updates one test to use the decorators and updates TestTorchDeviceType with helpers.

This PR appears to be hanging the ROCm build, which is not entirely surprising. See https://github.com/pytorch/pytorch/issues/26394, which demonstrates that the ROCm build can be hung by commenting out a Python test that was never run on ROCm.

gchanan - what type list, if any, do you want to expose? I imagine most test suites will define their own lists like today. SCALAR_TYPES, QUANTIZED_TYPES, and ALL_TYPES seem reasonable to me. DOCUMENTED_TENSOR_TYPES will be removed, of course.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26375

Test Plan: Edit is to tests themselves.

Differential Revision: D17462294

Pulled By: mruberry

fbshipit-source-id: f8259ec66709749b1bf8077efc737676af901436
2019-09-18 15:35:52 -07:00
Mike Ruberry
388cfdf2ac Removes torchtest, expands generic device testing (#26374)
Summary:
- Removes torchtest
- <s>Moves test_torch tests skipped on ROCm to generic device test class</s>
- Creates test_nn generic device test class

Next: adding dtypes to generic device testing framework.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26374

Test Plan: Change is to tests themselves.

Differential Revision: D17442218

Pulled By: mruberry

fbshipit-source-id: d7e4451d09fc9049478b35a7efb8bb580071e8c8
2019-09-18 10:24:50 -07:00
Richard Zou
0038111019 Implement named tensor unflatten(dim, namedshape). (#25658)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25658

This unflattens `dim` according to the shape specified in `namedshape`.
`namedshape` may be either an OrderedDict or an iterable of (name, size)
tuples.
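
For illustration, a minimal usage sketch based on the description above (argument forms as described in this PR):

```
import torch

t = torch.randn(2, 6, names=('N', 'F'))
# Split dimension 1 ('F') into named dimensions ('C', 'H') of sizes 2 and 3.
out = t.unflatten(1, (('C', 2), ('H', 3)))
print(out.names, out.shape)  # ('N', 'C', 'H') torch.Size([2, 2, 3])
```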

Future:
- It is possible to make it take a dict in Python >= 3.6 because those are
ordered by default, but I'll leave that task for the future.

Test Plan: - new tests [namedtensor ci]

Differential Revision: D17192655

Pulled By: zou3519

fbshipit-source-id: fd9bd2f462c23a4df1c23d66f2aa95076ff1b160
2019-09-17 21:24:25 -07:00
Michael Suo
a76403f609 Revert D17367016: [pytorch][PR] Enabled bfloat16 dtype on CUDA
Test Plan: revert-hammer

Differential Revision:
D17367016

Original commit changeset: 7e6ae7c6aa4e

fbshipit-source-id: 6ca4e1dec5357232e224bf6d6f957ac80005c77c
2019-09-17 10:39:59 -07:00
Iurii Zdebskyi
1accc38b75 Enabled bfloat16 dtype on CUDA (#26148)
Summary:
Enabled basic functionality for bfloat16 dtype on CUDA.
Tested via unit tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26148

Differential Revision: D17367016

Pulled By: izdeby

fbshipit-source-id: 7e6ae7c6aa4e21f076d8b70b91e26b50063c6875
2019-09-17 08:17:36 -07:00
vishwakftw
2dac673861 Enable batching for pinverse (#26095)
Summary:
Changelog:
- Modify existing implementation of pinverse to support batching on inputs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26095
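
For illustration, a minimal sketch of the batched behavior:

```
import torch

A = torch.randn(2, 3, 4, 5)
A_pinv = torch.pinverse(A)   # pseudo-inverse of each 4x5 matrix in the batch
print(A_pinv.shape)          # torch.Size([2, 3, 5, 4])
```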

Test Plan: - Added tests in test_pinverse to test batched implementation

Differential Revision: D17408092

Pulled By: soumith

fbshipit-source-id: bba95eb193ce33a94ecfaf74da270d34b435e4af
2019-09-16 23:19:16 -07:00
Hong Xu
81d7675301 Ensure that n is non-negative in polygamma.
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/26294

Differential Revision: D17416847

Pulled By: soumith

fbshipit-source-id: 17d5576e019e31e85c0308fb956524484e526cf6
2019-09-16 23:16:11 -07:00
Mike Ruberry
226ee7a889 Adds generic device tests to test_autograd.py (#26248)
Summary:
- Adds new decorators for skipping on ROCm, skipping on MKL, running only on the CPU and running only on CUDA
- Makes decorator skip semantics consistent
- Adds CUDA default stream requirement to MAGMA decorator
- Creates TestAutogradDeviceType

Note this PR originally moved test_cdist, but moving it caused failures in CI. There may be an undiagnosed issue with cdist or the test. The issue does not reproduce locally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26248

Test Plan: Change is to tests themselves.

Differential Revision: D17410386

Pulled By: mruberry

fbshipit-source-id: 8459df44f2a00f0e71680fbe713587a01d4b0300
2019-09-16 20:25:25 -07:00
Hong Xu
c92ed8dd44 Move the CUDA implementation of round to ATen. (#25041)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25041

Fix #24617

Pull Request resolved: https://github.com/pytorch/pytorch/pull/25041

Test Plan: Imported from OSS

Differential Revision: D17114368

Pulled By: VitalyFedyunin

fbshipit-source-id: 6ec6ef99b4451acd7e93491fd4b44fca9ce1809d
2019-09-16 09:54:30 -07:00
Mike Ruberry
31139b5f9a Back out "[pytorch][PR] Refines test_torch.py generic device testing" (#26252)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26252

Original commit changeset: 1375774f24c2

Testing to see if this is somehow the source of hangs on ROCm builds.

Test Plan: Change is to tests themselves. This diff is for testing the ROCm hang, however.

Differential Revision: D17390575

fbshipit-source-id: a6ffd5eb1df3971b99b6d42271a8d3d501ac79c6
2019-09-15 13:42:25 -07:00
Mike Ruberry
b6b2b4c18f Refines test_torch.py generic device testing (#26244)
Summary:
- Adds skipCUDAIfRocm and skipCPUIfNoMkl decorators, ports corresponding tests
- Changes "SkipIf" input semantics for consistency
- Removes torchtest, which has been replaced with this new generic framework
- Refactors some common parts out of CUDA tests to TestTorchDeviceType
- Ensures all MAGMA tests run on default stream by putting the skipCUDANonDefaultStreamIf in the skipCUDAIfNoMagma decorator.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26244

Differential Revision: D17389060

Pulled By: mruberry

fbshipit-source-id: 1375774f24c2266049e6d4b899e7300ddf32eac8
2019-09-15 03:35:23 -07:00
Mike Ruberry
b4b8f53a5d Ports most of test_torch.py to generic device type framework (#26232)
Summary:
This PR moves many tests in test_torch.py to the generic device type framework. This means that many CUDA tests now run in test_torch.py and there is greater consistency in how tests for many device types are written.

One change is that all MAGMA tests are run on the default stream due to intermittent instability running MAGMA on the non-default stream. This is a known issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26232

Test Plan:
While this PR edits the tests itself, it was validated using two independent methods:

(1) The code was reviewed and it was verified that all deleted functions were actually moved.
(2) The output of the TestTorch CI was reviewed and test outputs were matched before and after this PR.

Differential Revision: D17386370

Pulled By: mruberry

fbshipit-source-id: 843d14911bbd52e8aac6861c0d9bc3d0d9418219
2019-09-14 17:10:47 -07:00
Mike Ruberry
fbf991d062 Creates generic device type testing framework (#25967)
Summary:
This PR addresses https://github.com/pytorch/pytorch/issues/24851 by...

1. lets device types easily register themselves for testing
2. lets tests be written to run on multiple devices and with multiple dtypes
3. provides a mechanism to instantiate those tests so they are discoverable and filterable by unittest and pytest

It refactors three tests from test_torch.py to demonstrate how to use it.

`test_diagonal` is the simplest example. Most tests just need to be modified to accept 'device' as an argument. The framework will then instantiate `test_diagonal_cpu` and `test_diagonal_cuda` (when CUDA is available) which call `test_diagonal` with the appropriate 'device' argument.

`test_neg` also has dtype variants. It accepts both 'device' and 'dtype' as arguments, and the dtypes it runs with are specified with the 'dtypes' decorator. Dtypes can be specified for all device types and particular device types. The framework instantiates tests like `test_neg_cpu_torch.float`.

`test_inverse` has device-specific dependencies. These dependencies are expressed with the sugary 'skipCUDAIfNoMagma' and 'skipCPUIfNoLapack' decorators. These decorators are device-specific so CPU testing is not skipped if Magma is not installed, and their conditions may be checked before or after the test case has been initialized. This means that skipCUDAIfNoMagma does not initialize CUDA. In fact, CUDA is only initialized if a CUDA test is run.

These instantiated tests may be run as usual and with pytest filtering it's easy to run one test on all device types, run all the tests for a particular device type, or run a device type and dtype combination.

See the note "Generic Device-Type Testing" for more detail.
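
For illustration, a minimal sketch of a test written against this framework (module paths and helper names as described in this PR; they may differ in later versions):

```
import torch
from common_utils import TestCase, run_tests
from common_device_type import instantiate_device_type_tests, dtypes

class TestExample(TestCase):
    @dtypes(torch.float, torch.double)
    def test_neg(self, device, dtype):
        x = torch.tensor([1.0, -2.0], device=device, dtype=dtype)
        self.assertEqual(torch.neg(x), -x)

# Instantiates test_neg_cpu_torch.float, test_neg_cuda_torch.double, etc.,
# so unittest/pytest can discover and filter them per device and dtype.
instantiate_device_type_tests(TestExample, globals())

if __name__ == '__main__':
    run_tests()
```
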
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25967

Differential Revision: D17381987

Pulled By: mruberry

fbshipit-source-id: 4a639641130f0a59d22da0efe0951b24b5bc4bfb
2019-09-13 23:34:28 -07:00
Geovanni Zhang
e293c4ea73 Fix 'in' return true incorrectly (#24156)
Summary:
Because of `return NotImplemented`, `__contains__` returns True when the element is not a number: `bool(NotImplemented) == True`.
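
A minimal, PyTorch-free illustration of why this happens: Python truth-tests whatever `__contains__` returns, and `bool(NotImplemented)` is truthy.

```
class Box:
    def __contains__(self, item):
        return NotImplemented  # what the old tensor __contains__ effectively did for unsupported types

print("anything" in Box())  # True, even though no real membership check happened
```
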
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24156

Differential Revision: D16829895

Pulled By: zou3519

fbshipit-source-id: 9d3d58025b2b78b33a26fdfcfa6029d0d049f11f
2019-09-13 09:27:58 -07:00
Richard Zou
5e2d25af34 Implement tensor.align_as(other), change tensor.align_to(names) (#25843)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25843

`tensor.align_to(*names)` permutes the dimensions of `tensor` and adds
additional 1-sized dimensions such that the output tensor has dimensions
in the same order as `names`. All dimensions of `tensor` must be
present in `names`, in addition, this function requires that all dims of
`tensor` be named.

`tensor.align_as(other)` is equivalent to
`tensor.align_to(*other.names)`.

I'm planning on changing `torch.align_tensors(*tensors)` to align closer
to these semantics because there didn't seem to be a clear use case for the old
semantics that preserve unnamed dimensions. That will come in a future
change.
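
For illustration, a minimal sketch based on the semantics above:

```
import torch

t = torch.randn(2, 3, names=('N', 'C'))

# Permute to the requested order and insert 1-sized dims for missing names.
aligned = t.align_to('C', 'H', 'N')
print(aligned.names, aligned.shape)   # ('C', 'H', 'N') torch.Size([3, 1, 2])

other = torch.randn(3, 4, 2, names=('C', 'H', 'N'))
print(t.align_as(other).shape)        # torch.Size([3, 1, 2])
```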

Test Plan: - new tests [namedtensor ci]

Differential Revision: D17255549

Pulled By: zou3519

fbshipit-source-id: 1e437ad81e9359b4d5bd0e7e64c3a1be441fc3e3
2019-09-12 22:53:44 -07:00
Richard Zou
e544f88590 Implement tensor.refine_names (#25842)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25842

`tensor.refine_names(*names)` takes `tensor` and attempts to name its
dimensions `names` out-of-place. If a dimension `i` already had a name,
then it cannot be changed (so tensor.names[i] must equal names[i]);
if the original dimension did not have a name, then the new name
(names[i]) can be anything.

`tensor.refine_names(*names)` also accepts a glob '*' that greedily selects
names from `tensor`. Here are some examples:

- `Tensor[None].refine_names('N') -> Tensor[N]`
- `Tensor[N].refine_names('N') -> Tensor[N]`
- `Tensor[N].refine_names('D') -> Error!`
- `Tensor[N].refine_names(None) -> Error!`
- `Tensor[None, None].refine_names('*', D) -> Tensor[None, D]`
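
For illustration, a minimal runnable version of the examples above (the glob form is omitted):

```
import torch

t = torch.randn(2, 3)               # names == (None, None)
named = t.refine_names('N', 'C')    # unnamed dims may take any name
print(named.names)                  # ('N', 'C')

named.refine_names('N', 'C')        # matching names are accepted
# named.refine_names('N', 'D')      # error: a dim already named 'C' cannot become 'D'
```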

Test Plan: - new tests [namedtensor ci]

Differential Revision: D17255548

Pulled By: zou3519

fbshipit-source-id: fdbdb3a12f24fbe37ce1e53ed09dc8a42589d928
2019-09-12 22:53:40 -07:00
vishwakftw
eee58f8284 Refactor torch.*solve tests (#25733)
Summary:
Changelog:
- De-duplicate the code in tests for torch.solve, torch.cholesky_solve, torch.triangular_solve
- Skip tests explicitly if requirements aren't met for e.g., if NumPy / SciPy aren't available in the environment
- Add generic helpers for these tests in test/common_utils.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25733

Test Plan:
- All tests should pass to confirm that the change is not erroneous

Clears one point specified in the discussion in https://github.com/pytorch/pytorch/issues/24333.

Differential Revision: D17315330

Pulled By: zou3519

fbshipit-source-id: c72a793e89af7e2cdb163521816d56747fd70a0e
2019-09-11 14:30:00 -07:00
Pavel Belevich
a14e884546 Migrate pow from TH to Aten (CUDA) (#25517)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/24613

```
DEBUG = 0
OMP_NUM_THREADS = 1
Tesla M40

import torch

base = torch.randn(1000000, device='cuda:1')
exp  = torch.randn(1000000, device='cuda:1')
out  = torch.empty_like(base)

timeit base.pow(0)
old 53.1 µs ± 22.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 18.7 µs ± 15 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

timeit base.pow(1/3)
old 53.3 µs ± 20.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 51.1 µs ± 101 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit base.pow(-1/3)
old 53.3 µs ± 55.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 51.1 µs ± 29.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit base.pow(1/2)
old 53.2 µs ± 38.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 34.8 µs ± 40.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit base.pow(-1/2)
old 53.3 µs ± 54.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 42 µs ± 32.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit base.pow(1)
old 38.3 µs ± 53.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 40.1 µs ± 41.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit base.pow(-1)
old 38.4 µs ± 29 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 35 µs ± 143 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit base.pow(2)
old 38.1 µs ± 20.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 34.8 µs ± 90.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit base.pow(-2)
old 38.3 µs ± 11.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 35.2 µs ± 54.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit base.pow(3)
old 38.3 µs ± 164 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 34.9 µs ± 46.8 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit base.pow(-3)
old 53.3 µs ± 89.8 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 51.4 µs ± 31.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit base.pow(123456.789)
old 53.3 µs ± 12.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 51.2 µs ± 24.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit base.pow(-123456.789)
old 53.5 µs ± 152 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 51.3 µs ± 66.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit base.pow(exp)
old 58.2 µs ± 25.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 54.5 µs ± 25.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit torch.pow(0, exp)
old 49.1 µs ± 89.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 58.7 µs ± 125 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit torch.pow(1, exp)
old 48.7 µs ± 26.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 18.7 µs ± 88.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

timeit torch.pow(-1, exp)
old 50.7 µs ± 104 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 59.8 µs ± 100 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit torch.pow(42, exp)
old 49.4 µs ± 98 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 58.6 µs ± 26.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit torch.pow(-42, exp)
old 50.4 µs ± 131 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 59.8 µs ± 48.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit torch.pow(0, exp, out=out)
old 49 µs ± 13 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 59.2 µs ± 169 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit torch.pow(1, exp, out=out)
old 49.3 µs ± 122 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 18.8 µs ± 45.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

timeit torch.pow(-1, exp, out=out)
old 50.4 µs ± 167 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 60.2 µs ± 71.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit torch.pow(42, exp, out=out)
old 49.2 µs ± 293 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 58.9 µs ± 193 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit torch.pow(-42, exp, out=out)
old 50.5 µs ± 150 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 60.1 µs ± 89.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

base = (torch.rand(1000000, device='cuda:1') * 10).to(int)
exp  = (torch.rand(1000000, device='cuda:1') * 10).to(int)
out  = torch.empty_like(base)

timeit base.pow(0)
old 75.5 µs ± 10.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 33.8 µs ± 84.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit base.pow(1/3)
old 75.5 µs ± 78.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 842 µs ± 449 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit base.pow(-1/3)
old 75.5 µs ± 24.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 843 µs ± 231 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit base.pow(1/2)
old 75.7 µs ± 141 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 123 µs ± 71.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit base.pow(-1/2)
old 76 µs ± 162 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 180 µs ± 55.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit base.pow(1)
old 74.1 µs ± 25.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 72.3 µs ± 32.8 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit base.pow(-1.0)
old Integers to negative integer powers are not allowed.
new 86.9 µs ± 84.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit base.pow(2)
old 74.2 µs ± 15.8 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 66.5 µs ± 28.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit base.pow(-2.0)
old Integers to negative integer powers are not allowed.
new 87.3 µs ± 25.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit base.pow(3)
old 74.3 µs ± 23.8 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 66.5 µs ± 43.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit base.pow(-3.0)
old Integers to negative integer powers are not allowed.
new 861 µs ± 372 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit base.pow(123456.789)
old 256 µs ± 115 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 863 µs ± 64.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit base.pow(-123456.789)
old Integers to negative integer powers are not allowed.
new 863 µs ± 57.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit base.pow(exp)
old 111 µs ± 14.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 98.8 µs ± 16 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit torch.pow(0, exp)
old 81.9 µs ± 23.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 92.9 µs ± 14.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit torch.pow(1, exp)
old 81.9 µs ± 25.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 33.6 µs ± 56.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit torch.pow(-1, exp)
old 82.2 µs ± 15.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 93.6 µs ± 161 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit torch.pow(42, exp)
old 82.1 µs ± 10.8 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 93.8 µs ± 75.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit torch.pow(-42, exp)
old 82.3 µs ± 18.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 94 µs ± 68.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit torch.pow(0, exp, out=out)
old 81.6 µs ± 115 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 93.8 µs ± 83.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit torch.pow(1, exp, out=out)
old 81.6 µs ± 26.8 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 33.7 µs ± 36.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit torch.pow(-1, exp, out=out)
old 82.7 µs ± 119 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 93.9 µs ± 116 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit torch.pow(42, exp, out=out)
old 82.6 µs ± 216 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 93.7 µs ± 144 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit torch.pow(-42, exp, out=out)
old 82.5 µs ± 214 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 94 µs ± 55.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25517

Differential Revision: D17251364

Pulled By: pbelevich

fbshipit-source-id: 20904c073c311e76285eaa1b68e67e67ea3c62d8
2019-09-10 13:46:22 -07:00
Ailing Zhang
26f67e7aa7 fix scatter CPU kernel when (input size, src size) > index size (#25839)
Summary:
fixes https://github.com/pytorch/pytorch/issues/25836
According to the docs (https://pytorch.org/docs/stable/tensors.html#torch.Tensor.scatter_), `index` must have the smallest size, so we should iterate over `index` instead of `tensor`.
cc: dlibenzi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25839
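
For illustration, a minimal sketch where `index` is smaller than both `self` and `src`, so the kernel must iterate over `index`:

```
import torch

x = torch.zeros(3, 5)
src = torch.arange(15, dtype=torch.float).reshape(3, 5)
index = torch.tensor([[0, 1, 2]])   # smaller than x and src in both dims

x.scatter_(0, index, src)
print(x[0, 0], x[1, 1], x[2, 2])    # tensor(0.) tensor(1.) tensor(2.)
```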

Differential Revision: D17269116

Pulled By: ailzhang

fbshipit-source-id: 0e8569fed6c0d2dd70e4e3ec5d29d8730cd2ae8f
2019-09-10 11:41:41 -07:00
Hong Xu
57b23c61c5 In the CUDA implementation of erfinv, erfinv() should be used for double (#25337)
Summary:
This best preserves accuracy, while erfinvf() should be used for half and float.

This is also consistent with the implementation before the migration: https://github.com/pytorch/pytorch/issues/24943
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25337

Differential Revision: D17102333

Pulled By: zou3519

fbshipit-source-id: 5178cff534cf5f10d86ab04d4b6c1779ffedf49e
2019-09-10 06:30:33 -07:00
Igor Fedan
bf04c2ca2f Make torch checks same for both CPU and CUDA multinomial (#25595)
Summary:
Currently we have different checks for multinomial method on CPU and CUDA. This PR will make them consistent.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25595

Differential Revision: D17236163

Pulled By: ifedan

fbshipit-source-id: 7718173bdaf216e8eb636c2a5b9c5939b975325b
2019-09-10 05:29:58 -07:00
vishwakftw
36bdde255e Fix test_det_logdet_slogdet_batched on PowerPC (#25773)
Summary:
Changelog:
- Simplify generation of singular matrices to just constructing a constant matrix instead of a random singular matrix using random_square_matrix_of_rank, which is susceptible to numerical issues
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25773

Test Plan:
- test_det_logdet_slogdet_batched should pass

Fixes https://github.com/pytorch/pytorch/issues/25172

cc: branfosj hartb

Apologies for the delay.

Differential Revision: D17261059

Pulled By: soumith

fbshipit-source-id: 8f991e2cb8c0e9dccad363d4785075213088e58a
2019-09-09 19:23:42 -07:00
Richard Zou
7970e5720b Rename tensor.view_names -> tensor.renamed (#25711)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25711

This function renames the dimensions of a tensor out-of-place. Because
of that, I think `tensor.renamed(...)` is a clearer name: `view_names`
has the connotation that we can use names to `view` our tensors with a
"different shape", but what this function really does is let us rename a
tensor no matter the previous names.

`tensor.names_`, the in-place version of this, is unchanged for now.
However, we might delete this or not advertise it if it has no use case
and also because its naming is a little inconsistent with `tensor.renamed`.

Test Plan: - [namedtensor ci]

Differential Revision: D17206515

Pulled By: zou3519

fbshipit-source-id: 67053951fcc8130c84566b5ebbdce35ef619c90d
2019-09-06 11:28:04 -07:00
Brian Vaughan
88e4cee3e7 Improve handling of mixed-type tensor operations (#22273)
Summary:
Improve handling of mixed-type tensor operations.

This PR affects the arithmetic (add, sub, mul, and div) operators implemented via TensorIterator (so dense but not sparse tensor ops).

For these operators, we will now promote to reasonable types where possible, following the rules defined in https://github.com/pytorch/pytorch/issues/9515, and error in cases where the cast would require floating point -> integral or non-boolean to boolean downcasts.

The details of the promotion rules are described here:
https://github.com/nairbv/pytorch/blob/promote_types_strict/docs/source/tensor_attributes.rst

Some specific backwards incompatible examples:
* now `int_tensor * float` will result in a float tensor, whereas previously the floating point operand was first cast to an int. Previously `torch.tensor(10) * 1.9` => `tensor(10)` because the 1.9 was downcast to `1`. Now the result will be the more intuitive `tensor(19)`
* Now `int_tensor *= float` will error, since the floating point result of this operation can't be cast into the in-place integral type result.

See more examples/detail in the original issue (https://github.com/pytorch/pytorch/issues/9515), in the above linked tensor_attributes.rst doc, or in the test_type_promotion.py tests added in this PR:
https://github.com/nairbv/pytorch/blob/promote_types_strict/test/test_type_promotion.py
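
For illustration, a minimal sketch of the new behavior (the dtypes shown are what the promotion rules above imply):

```
import torch

int_tensor = torch.tensor([10])      # int64
print((int_tensor * 1.9).dtype)      # torch.float32: the float operand is no longer downcast

try:
    int_tensor *= 1.9                # in-place: the float result can't be cast back to int64
except RuntimeError as e:
    print("error:", e)
```
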
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22273

Reviewed By: gchanan

Differential Revision: D16582230

Pulled By: nairbv

fbshipit-source-id: 4029cca891908cdbf4253e4513c617bba7306cb3
2019-09-05 18:26:09 -07:00
Johannes M Dieterich
c6dd4036f5 Enable two tests that were skipped b/c of rocThrust bugs fixed in ROCm 2.7
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/25724

Differential Revision: D17212373

Pulled By: bddppq

fbshipit-source-id: 2978bc13cdcd0e96a82c0019a08b589f67c0fe1d
2019-09-05 16:10:56 -07:00
iotamudelta
4fe857187c switch to rocThrust for thrust/cub APIs (#25620)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25620

Pull Request resolved: https://github.com/pytorch/pytorch/pull/25602

Enable rocThrust with hipCUB and rocPRIM for ROCm. They are the ROCm implementations of the thrust and cub APIs and replace the older hip-thrust and cub-hip packages going forward. ROCm 2.5 is the first release to contain the new packages as an option, as of 2.6 they will be the only available option.

Add hipification rules to correctly hipify thrust::cuda to thrust::hip and cub:: to hipcub:: going forward. Add hipification rules to hipify specific cub headers to the general hipcub header.

Infrastructure work to correctly find, include and link against the new packages. Add the macro definition to choose the HIP backend to Thrust.

Since include chains are now a little different from CUDA's Thrust, add includes for functionality used where applicable.

Skip four tests that fail with the new rocThrust for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21864

Reviewed By: xw285cornell

Differential Revision: D16940768

Pulled By: bddppq

fbshipit-source-id: 3dba8a8f1763dd23d89eb0dd26d1db109973dbe5
2019-09-03 22:16:30 -07:00
vishwakftw
d1e079e2e0 Enable torch.cholesky for batches > 262140 (#24438)
Summary:
Changelog:
- Iterate over mini batches of 262140 matrices (maximum)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24438

Test Plan:
- Added slow tests to test the behavior in test_torch and test_cuda

Fixes https://github.com/pytorch/pytorch/issues/24403

Differential Revision: D17175603

Pulled By: soumith

fbshipit-source-id: 1abb0a1e92494cf43ef4ba9efb54a919cd18bfef
2019-09-03 17:35:37 -07:00
vishwakftw
1e4832ffad Enable broadcasting of batch dimensions RHS and LHS tensors for lu_solve (#24333)
Summary:
Changelog:
- Enable broadcasting of RHS and LHS tensors for lu_solve. This means that you can now have RHS with size `3 x 2` and LHS with size `4 x 3 x 3` for instance
- Remove deprecated behavior of having 2D tensors for RHS. Now all tensors have to have a last dimension which equals the number of right hand sides
- Modified docs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24333
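
For illustration, a minimal sketch of the broadcasting described above (using `torch.lu` to produce the factorization, as available at the time):

```
import torch

A = torch.randn(4, 3, 3)             # batch of LHS matrices
b = torch.randn(3, 2)                # single RHS with 2 right-hand sides
LU, pivots = torch.lu(A)

x = torch.lu_solve(b, LU, pivots)    # b is broadcast across the batch of A
print(x.shape)                       # torch.Size([4, 3, 2])
```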

Test Plan: - Add tests for new behavior in test_torch.py with a port to test_cuda.py

Differential Revision: D17165463

Pulled By: zou3519

fbshipit-source-id: cda5d5496ddb29ed0182bab250b5d90f8f454aa6
2019-09-03 15:14:48 -07:00
Richard Zou
5c4cc1e8f3 Prepare to add some Dimname/DimnameList overloads (#25405)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25405

This PR adds schemas to native_functions.yaml, core/Tensor.h, and
core/TensorMethods.h for Dimname/DimnameList overloads for the following
functions:
- min, max, max_values, min_values
- mean, median
- logsumexp, std, var, norm

The actual implementations will come in a later PR. I am accumulating
all the additional schemas and changes to core/{Tensor|TensorMethods}.h
in this PR so that there is only one point of failure for potential
merge conflicts.

Test Plan: - Check that all pytorch builds still build. [namedtensor ci]

Differential Revision: D17116333

Pulled By: zou3519

fbshipit-source-id: fd666d60109a311767169261afbec0fd85cc00c8
2019-09-03 10:55:47 -07:00
davidriazati
7a921ba17d Manually implement is_zipfile (#25279)
Summary:
The default implementation is lenient in that it recognizes a zipfile if the magic number appears anywhere in the archive. So if someone has the bytes `PK\x03\x04` in a tensor, it gets recognized as a zipfile. See https://bugs.python.org/issue28494

This implementation only checks the first 4 bytes of the file for the zip magic number. We could also copy https://github.com/python/cpython/pull/5053's fix, but that seems like overkill.
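
For illustration, a minimal sketch of the stricter check (the helper name is illustrative, not the actual function in torch.serialization):

```
import io

def starts_with_zip_magic(f):
    start = f.tell()
    magic = f.read(4)
    f.seek(start)
    return magic == b'PK\x03\x04'

print(starts_with_zip_magic(io.BytesIO(b'PK\x03\x04...')))      # True
print(starts_with_zip_magic(io.BytesIO(b'tensor PK\x03\x04')))  # False: magic not at offset 0
```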

Fixes #25214
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25279

Pulled By: driazati

Differential Revision: D17102516

fbshipit-source-id: 4d09645bd97e9ff7136a2229fba1d9a1bce5665a
2019-08-30 16:47:50 -07:00
CamiWilliams
329757a907 Torch.flatten() returns a 1-dim tensor on a 0-dim tensor (#25406)
Summary:
PR for `torch.flatten()` to return a 1-dim tensor on a 0-dim tensor

> torch.tensor(123).shape -> torch.Size([])
> torch.tensor(123).flatten() -> torch.tensor([123])
> torch.tensor(123).flatten().shape -> torch.Size([1])

resolve https://github.com/pytorch/pytorch/issues/22963
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25406

Differential Revision: D17120464

Pulled By: CamiWilliams

fbshipit-source-id: efbecd61f0aefd82f2ab417ca6bb467488ff99de
2019-08-30 08:53:08 -07:00
Brian Vaughan
f0c6021846 fix bug in assertNotEqual for int tensors (#25412)
Summary:
re-apply: https://github.com/pytorch/pytorch/pull/25199
but without a failing quantized test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25412

Differential Revision: D17131303

Pulled By: nairbv

fbshipit-source-id: edf7736af3ede5e809eded72be9514e922e70db4
2019-08-30 06:52:30 -07:00
Jerry Zhang
e231bd16fb Revert D17112656: [pytorch][PR] fix bug in assertNotEqual for int tensors
Test Plan: revert-hammer

Differential Revision:
D17112656

Original commit changeset: 43e0e7da6d58

fbshipit-source-id: 0a0f7b8b125f24a45023ddb46fe144f21499b723
2019-08-29 10:36:56 -07:00
Hong Xu
2e1c37c95c Move the CUDA implementation of ceil to ATen. (#24866)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24866

Fix #24542

Test Plan: Imported from OSS

Differential Revision: D16965903

Pulled By: VitalyFedyunin

fbshipit-source-id: b9decaa58bec813a23d369b5e1eec627599f41da
2019-08-29 08:48:31 -07:00
Brian Vaughan
1e2b19db6d fix bug in assertNotEqual for int tensors (#25199)
Summary:
assertNotEqual was failing to detect differences in int tensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25199

Differential Revision: D17112656

Pulled By: nairbv

fbshipit-source-id: 43e0e7da6d58eb1c837a508d462a748b2065bdd9
2019-08-29 07:32:50 -07:00
Gregory Chanan
f362a5a04b Revert "Let logical_xor support non-bool tensors." (#25269)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25269

This reverts commit 5ca612b55e.

Test Plan: Imported from OSS

Differential Revision: D17080088

fbshipit-source-id: e6b6215b713910c448e9a6b831b08f28b849c64a
2019-08-28 15:41:51 -07:00
SsnL
6100de9b1b implement bool_tensor.bernoulli_ (#25076)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/25072
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25076

Differential Revision: D17073453

Pulled By: ezyang

fbshipit-source-id: 42410da8c9911c1d7b3543bde740c7e66ae0cc1c
2019-08-28 12:25:27 -07:00
Igor Fedan
afb7a162fb Migrate erfinv and erfinv_ from the TH to Aten (CUDA) (#24943)
Summary:
https://github.com/pytorch/pytorch/issues/24560
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24943

Differential Revision: D16996434

Pulled By: ifedan

fbshipit-source-id: 77111a4e47bb2b20f65225d48e7213cd77ddae19
2019-08-28 09:30:08 -07:00
Pavel Belevich
112f249446 Port pow operator from the TH code to Aten (#23492)
Summary:
Fixing https://github.com/pytorch/pytorch/issues/24750
```
DEBUG = 0
OMP_NUM_THREADS = 1

import torch

base = torch.randn(1000000)
exp  = torch.randn(1000000)
out  = torch.empty_like(base)

timeit base.pow(0)							+30x
old 6.26 ms ± 35.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
new 213 µs ± 3.38 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

timeit base.pow(1/3)						+6x
old 56 ms ± 911 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
new 9.41 ms ± 237 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

timeit base.pow(-1/3)						+6x
old 57 ms ± 1.65 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
new 9.49 ms ± 293 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

timeit base.pow(1/2)						+6x
old 4.04 ms ± 14.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
new 620 µs ± 3.35 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

timeit base.pow(-1/2)						+5x
old 6.56 ms ± 43 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
new 1.24 ms ± 19.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

timeit base.pow(1)							no diff
old 322 µs ± 4.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
new 331 µs ± 7.26 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

timeit base.pow(-1)							+3.5x
old 2.48 ms ± 15.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
new 717 µs ± 130 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

timeit base.pow(2)							no diff
old 328 µs ± 7.42 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
new 324 µs ± 4.93 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

timeit base.pow(-2)							+3.5x
old 2.45 ms ± 11.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
new 662 µs ± 3.83 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

timeit base.pow(3)							+7x
old 2.39 ms ± 60.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
new 334 µs ± 7.26 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

timeit base.pow(-3)							+9x
old 93.7 ms ± 5.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
new 10.3 ms ± 666 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

timeit base.pow(123456.789)					+5x
old 46.5 ms ± 418 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
new 9.68 ms ± 325 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

timeit base.pow(-123456.789)				+5x
old 46.5 ms ± 784 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
new 10 ms ± 541 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

timeit base.pow(exp)						+6x
old 60.6 ms ± 4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
new 9.7 ms ± 379 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

timeit torch.pow(0, exp)					no diff
old 18.3 ms ± 859 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
new 21.2 ms ± 333 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

timeit torch.pow(1, exp)					+30x
old 6.01 ms ± 81.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
new 203 µs ± 1.08 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

timeit torch.pow(-1, exp)					+3x
old 30.8 ms ± 5.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
new 9.67 ms ± 441 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

timeit torch.pow(42, exp)					+8x
old 80.1 ms ± 1.57 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
new 9.51 ms ± 103 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

timeit torch.pow(-42, exp)					+2x
old 21.8 ms ± 4.37 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
new 9.5 ms ± 89.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

timeit torch.pow(0, exp, out=out)			no diff
old 20.2 ms ± 3.04 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
new 22.1 ms ± 648 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

timeit torch.pow(1, exp, out=out)			+30x
old 6.7 ms ± 397 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
new 203 µs ± 4.64 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

timeit torch.pow(-1, exp, out=out)			+3x
old 32.5 ms ± 3.61 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
new 9.4 ms ± 99.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

timeit torch.pow(42, exp, out=out)			+10x
old 91 ms ± 7.45 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
new 9.64 ms ± 291 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

timeit torch.pow(-42, exp, out=out)			+2.5x
old 25.9 ms ± 5.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
new 10.1 ms ± 698 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

```

BC: enforce stronger shape requirements on the output tensor (out= keyword argument) and do not allow output tensor to be resized if it is also used as one of the inputs.
BC: enforce stronger integer tensor base power integer exponent requirement on CPU and CUDA: `Integers to negative integer powers are not allowed.`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23492

Differential Revision: D16731583

Pulled By: pbelevich

fbshipit-source-id: 4e5bf689357fe82a19371e42d48abbb7b4c1c3ca
2019-08-28 09:11:50 -07:00
Igor Fedan
9b1097958e Migrate digamma\digamma_\polygamma\polygamma_ from the TH to Aten (CPU) (#25048)
Summary:
https://github.com/pytorch/pytorch/issues/24612
https://github.com/pytorch/pytorch/issues/24550
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25048

Differential Revision: D16996440

Pulled By: ifedan

fbshipit-source-id: 0d76588d179d4c932e3fc284cb399dcfc77bc622
2019-08-28 08:29:13 -07:00
Patrick Donnelly
883628cb5c Added documentation for nn.functional.bilinear (#24951)
Summary:
Adds documentation for `nn.functional.bilinear`, as requested in https://github.com/pytorch/pytorch/issues/9886.

The format follows that of `nn.functional.linear`, and borrows from `nn.bilinear` in its description of `Tensor` shapes.

I am happy to add more extensive documentation (e.g. "Args," "Example(s)"). From what I gather, the format of comments is inconsistent across functions in `nn.functional.py` and between modules (e.g. `nn.functional` and `nn`). It's my first PR, so guidance for contributing documentation and other code would be greatly appreciated!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24951

Differential Revision: D17091261

Pulled By: soumith

fbshipit-source-id: efe2ad764700dfd6f30eedc03de4e1cd0d10ac72
2019-08-28 08:19:25 -07:00
Funtowicz Morgan
2c22076342 Moving sign function to ATen (#22861)
Summary:
This PR linked to https://github.com/pytorch/pytorch/issues/22806 moving sign function to ATen.

sign(x) supports bool,  and vectorized operation on CPU.
sign(NaN) is defined to return 0.
sign(bool) is a no-op; the resulting tensor will hold the same values as the input one.

- [x] CPU Backend
- [x] CUDA Backend
- [x] Bring support for bool dtype
- [x] Bring support for Half dtype
- [x] Add test for NaN
- [x] Add test for bool dtype
- [x] Delete legacy implementation in THTensorMoreMath.cpp

Performances:
```python
timeit -s 'import torch; x = torch.randn((1000, 1000))' -n 1000 'torch.sign(x)'
timeit -s 'import torch; x = torch.randn((1000, 1000), device="cuda")' -n 1000 'torch.sign(x); torch.cuda.synchronize()'
```

| device |  before  | after |
| :-------------: | :-------------: | :-----: |
| CPU    | 1.24 msec | 33.9 usec |
| GPU    | 680 usec | 7.13 usec  |
| CPU (1 thread) | 0.82 msec | 0.73 msec |
| GPU (1 thread) | 16.1 usec | 15.9 usec |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22861

Differential Revision: D16503452

Pulled By: VitalyFedyunin

fbshipit-source-id: a87ce7fff139642ef4ed791f15873074ad0d53af
2019-08-27 19:01:34 -07:00
Pavel Belevich
30bc65271d torch.from_numpy fix for np.int (#25139)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/22615
Because of different sizeof(long) we have the following relations between NPY_TYPES and NPY_INTXX aliases:
```
int value	Enum			Unix		Windows
1		NPY_BYTE		NPY_INT8	NPY_INT8
3		NPY_SHORT		NPY_INT16	NPY_INT16
5		NPY_INT			NPY_INT32	-
7		NPY_LONG		NPY_INT64	NPY_INT32
9		NPY_LONGLONG		-		NPY_INT64
```
I suggest the following fix for `numpy_dtype_to_aten` method:
```
if (dtype == NPY_INT || dtype == NPY_INT32) {
	return kInt;
} else if (dtype == NPY_LONGLONG || dtype == NPY_INT64) {
	return kLong;
}
```
On Unix it will be replaced with:
```
if (dtype == 5 || dtype == 5) {
	return kInt;
} else if (dtype == 9 || dtype == 7) {
	return kLong;
}
```
and on Windows with:
```
if (dtype == 5 || dtype == 7) {
	return kInt;
} else if (dtype == 9 || dtype == 9) {
	return kLong;
}
```
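
A quick illustrative check of the intended result (assumes NumPy is installed; dtype mapping should be the same on Unix and Windows):

```python
import numpy as np
import torch

a = np.arange(5, dtype=np.int32)
b = np.arange(5, dtype=np.int64)
print(torch.from_numpy(a).dtype)   # torch.int32 on both Unix and Windows
print(torch.from_numpy(b).dtype)   # torch.int64 on both Unix and Windows
```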
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25139

Differential Revision: D17048443

Pulled By: pbelevich

fbshipit-source-id: 9f2c27ff2829b893a35d3d57f176a58e7749a468
2019-08-26 05:07:22 -07:00
Kexuan Sun
4b3ea92787 Test if descriptions of args are in the template (#24161)
Summary:
As in https://github.com/pytorch/pytorch/issues/23439, some descriptions of arguments in `_torch_docs.py` have been replaced by `common_args`; it would be helpful to check whether any descriptions can be replaced for new docs in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24161

Differential Revision: D16889293

Pulled By: ezyang

fbshipit-source-id: bf6f581494482d6eb32e634f73e84a4586766230
2019-08-20 16:34:50 -07:00
Max Balandat
d33623f7c1 Make SobolEngine use random seed if not specified (#24884)
Summary:
Addresses https://github.com/pytorch/pytorch/issues/24881. Makes behavior consistent with the rest of the random functions.
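
A small usage sketch; with `scramble=True` and no `seed` given, the engine now draws a random seed like other torch random functions:

```python
import torch

engine = torch.quasirandom.SobolEngine(dimension=3, scramble=True)  # seed omitted: random seed
print(engine.draw(4))   # a (4, 3) batch of scrambled Sobol points in [0, 1)
```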
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24884

Test Plan: Unit tests

Reviewed By: sdsingh

Differential Revision: D16912036

Pulled By: Balandat

fbshipit-source-id: eff00cca989926a5d9e20d8846a8674f7cd270cb
2019-08-20 09:22:41 -07:00
Pavel Belevich
6100205eb8 TensorIterator::binary_op input-output overlap check (#24058)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/8212

This fix is based on the idea that in-place ops (e.g. add_(...)) and out ops (e.g. tensor.add(..., out=...)) must check that the output tensor does not partially overlap with any of its input tensors; otherwise the result of such an op is unexpected to the user. Since TensorIterator is a common backend for such ops and is already used to check output self-overlap, the fix is implemented there. A minimal sketch of the rejected partial-overlap case follows the enum list below.

The MemOverlapStatus enum class is introduced to model the overlap state of two tensors:

- TOO_HARD if at least one of them is not contiguous
- FULL if both are contiguous and share exactly the same memory array [data(), data() + numel() * itemsize()]
- PARTIAL if both are contiguous but the underlying memory is shared only partially, i.e. the memory arrays overlap but are not identical
- NO if both are contiguous and have independent, non-overlapping memory arrays
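
A minimal sketch of the partial-overlap case this check rejects (exact error wording may vary):

```python
import torch

base = torch.zeros(10)
a = base[:6]   # contiguous view over elements 0..5
b = base[4:]   # contiguous view over elements 4..9 -> PARTIAL overlap with a
try:
    torch.add(a, 1, out=b)   # output partially overlaps an input
except RuntimeError as e:
    print(e)                 # the iterator now rejects this instead of silently corrupting data
```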

Performance test of clone/addcmul_/addcdiv_ with check_mem_overlaps:

a = torch.empty(10000000, device='cpu')
b = torch.randn(10000000, device='cpu')
timeit a.copy_(b)
master: 10.3 ms ± 429 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
branch: 10.2 ms ± 946 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

a = torch.empty(10000000, device='cuda')
b = torch.randn(10000000, device='cuda')
timeit a.copy_(b)
master: 373 µs ± 97.9 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
branch: 373 µs ± 120 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

a = torch.randn(1000000, device='cpu')
b = torch.randn(1000000, device='cpu')
c = torch.randn(1000000, device='cpu')
timeit a.addcmul_(b, c)
master: 2.02 ms ± 212 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
branch: 2.11 ms ± 200 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

a = torch.randn(1000000, device='cuda')
b = torch.randn(1000000, device='cuda')
c = torch.randn(1000000, device='cuda')
timeit a.addcmul_(b, c)
master: 72.6 µs ± 627 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
branch:	72.4 µs ± 18.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

a = torch.randn(1000000, device='cpu')
b = torch.randn(1000000, device='cpu')
c = torch.randn(1000000, device='cpu')
timeit a.addcdiv_(b, c)
master: 2.19 ms ± 583 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
branch:	1.97 ms ± 125 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

a = torch.randn(1000000, device='cuda')
b = torch.randn(1000000, device='cuda')
c = torch.randn(1000000, device='cuda')
timeit a.addcdiv_(b, c)
master: 71.3 µs ± 1.98 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
branch:	71.7 µs ± 3.96 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

a = torch.empty(100, device='cpu')
b = torch.randn(100, device='cpu')
timeit a.copy_(b)
master: 12.1 µs ± 1.11 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
branch:	11.1 µs ± 61.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

a = torch.empty(100, device='cuda')
b = torch.randn(100, device='cuda')
timeit a.copy_(b)
master: 20.9 µs ± 1.62 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
branch:	22.8 µs ± 2.63 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

a = torch.randn(100, device='cpu')
b = torch.randn(100, device='cpu')
c = torch.randn(100, device='cpu')
timeit a.addcmul_(b, c)
master: 24.1 µs ± 2.7 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
branch:	24 µs ± 91.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

a = torch.randn(100, device='cuda')
b = torch.randn(100, device='cuda')
c = torch.randn(100, device='cuda')
timeit a.addcmul_(b, c)
master: 34.5 µs ± 4.82 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
branch:	29.8 µs ± 496 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

a = torch.randn(100, device='cpu')
b = torch.randn(100, device='cpu')
c = torch.randn(100, device='cpu')
timeit a.addcdiv_(b, c)
master: 21.3 µs ± 210 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
branch:	23.8 µs ± 403 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

a = torch.randn(100, device='cuda')
b = torch.randn(100, device='cuda')
c = torch.randn(100, device='cuda')
timeit a.addcdiv_(b, c)
master: 30.3 µs ± 257 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
branch:	31.8 µs ± 214 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24058

Differential Revision: D16767892

Pulled By: pbelevich

fbshipit-source-id: 0cdaaa471d003a2886b1736f8985842226b8493a
2019-08-19 15:06:04 -07:00
Vishwak Srinivasan
4358cbe01b Allow torch.tril / triu to handle bool and half inputs (#24163)
Summary:
Changelog:
- Enable torch.tril / triu for bool and float16 dtypes
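
For example, a boolean causal mask can now be built directly (a minimal sketch):

```python
import torch

mask = torch.tril(torch.ones(4, 4, dtype=torch.bool))
print(mask)   # lower-triangular True/False mask, no dtype workaround needed
```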
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24163

Test Plan:
- Tests added in test_torch.py for all devices and dtypes (except bfloat16)

Fixes https://github.com/pytorch/pytorch/issues/24035

Differential Revision: D16793315

Pulled By: ezyang

fbshipit-source-id: 2bbc51ce567405a7cb2d8ab567eee6c2e40aa76a
2019-08-19 15:02:53 -07:00
vishwakftw
f849ebf1fe Enable torch.eye for bool and half (#24148)
Summary:
Changelog:
- Enable torch.eye for bool and float16 dtypes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24148

Test Plan:
- Tests added in test_torch.py for all available devices and dtypes (except torch.bfloat16)

Fixes https://github.com/pytorch/pytorch/issues/24088

Differential Revision: D16891048

Pulled By: ezyang

fbshipit-source-id: 3e86fe271bd434300c396e63f82c1a1f3adac2b4
2019-08-19 14:59:37 -07:00
Iurii Zdebskyi
eee3e92936 Enabled torch.mm and torch.mv for bfloat16
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/24224

Test Plan: Imported from OSS

Differential Revision: D16779996

Pulled By: izdeby

fbshipit-source-id: c859d8945a564edfa3f8a1430f140ae30d484d19
2019-08-16 15:46:15 -07:00
Heungsub Hans Lee
e166811598 Documentation for Tensor.record_stream() (#24078)
Summary:
This patch writes documentation for `Tensor.record_stream()`, which is not a documented API currently. I've discussed publishing it with colesbury in https://github.com/pytorch/pytorch/issues/23729.

The documentation is based on [the introduction at `CUDACachingAllocator.cpp`](25d1496d58/c10/cuda/CUDACachingAllocator.cpp (L47-L50)). ~~I didn't explain full details of the life cycle of memory blocks or stream awareness of the allocator for the consistent level of details with other documentations.~~ I explained about the stream awareness in a note block.
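
A minimal usage sketch (assumes a CUDA device is available; variable names are illustrative):

```python
import torch

side = torch.cuda.Stream()
x = torch.randn(1 << 20, device='cuda')   # allocated on the current (default) stream
with torch.cuda.stream(side):
    y = x * 2                             # x is consumed on the side stream
x.record_stream(side)   # mark x as in use on `side` so the caching allocator
                        # does not hand its memory out again too early
```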
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24078

Differential Revision: D16743526

Pulled By: zou3519

fbshipit-source-id: 05819c3cc96733e2ba93c0a7c0ca06933acb22f3
2019-08-16 08:07:33 -07:00
Hong Xu
5ca612b55e Let logical_xor support non-bool tensors.
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/23978

Test Plan: Imported from OSS

Differential Revision: D16719299

Pulled By: gchanan

fbshipit-source-id: 2fe170be6090733e20410db7cf99266543299c58
2019-08-15 12:21:31 -07:00
Hong Xu
00e4870001 Let logical_not support non-bool tensors. (#23916)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23916

Test Plan: Imported from OSS

Differential Revision: D16719300

Pulled By: gchanan

fbshipit-source-id: 5be6aeea9a38cc40ad59d0449d25a25f7dfa2787
2019-08-15 12:21:27 -07:00
Iurii Zdebskyi
52b4221bfa Enabled masked methods for bfloat16 (#24183)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24183

-----------
Fix: Enabled masked select/scatter/fill  for BFloat16 on CPU
Test: via unit tests

Test Plan: Imported from OSS

Differential Revision: D16763461

Pulled By: izdeby

fbshipit-source-id: fe733635a2064e5a088a108ff77c2a1a1487a27c
2019-08-15 08:45:24 -07:00
Hong Xu
bc92ce9e07 Recommend logical_not() instead of bitwise_not() when applying sub and neg on bool tensors. (#23860)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23860

Close #23836

Test Plan: Imported from OSS

Differential Revision: D16678299

Pulled By: gchanan

fbshipit-source-id: b08e77f6a41c3994240849985caaff7c559d3f83
2019-08-15 08:40:29 -07:00
Hong Xu
338f9c860f Add logical_xor operator (#23847)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23847

Related to #23836

Test Plan: Imported from OSS

Differential Revision: D16678300

Pulled By: gchanan

fbshipit-source-id: 67020aca5830b6bec2f561105954e0a8c2ee37e0
2019-08-15 08:40:25 -07:00
Hong Xu
1f4c73618c Add logical_not operator. (#23839)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23839

Close #23836

Test Plan: Imported from OSS

Differential Revision: D16678301

Pulled By: gchanan

fbshipit-source-id: 54e7b3f3b04c577e239b88493247e1c036266774
2019-08-15 08:40:21 -07:00
Jie
064d156511 (#23574)
Summary:
Assert that there are no multiple writes to a single memory location, which
caused corrupted output.
Fixed the batched matrix tril/triu logic, which relied on the previous copy behavior to
support tensors with stride 0 in the leading dimension.

This fixes the issue proposed at: https://github.com/pytorch/pytorch/issues/23063
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23574

Differential Revision: D16600717

Pulled By: ezyang

fbshipit-source-id: e41e14f03eccf97398b64ba43647110beb1529e6
2019-08-14 21:12:07 -07:00
Kexuan Sun
e2a6212912 Resolve unused variables in tests (#24075)
Summary:
Variables such as `device` and `sparse` in for loops should be used in tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24075

Differential Revision: D16763073

Pulled By: ezyang

fbshipit-source-id: 8735cbc8d9ed695db8489cfc949c895180a7b826
2019-08-14 21:02:52 -07:00
Vitaly Fedyunin
a5872a16a0 Rename torchtest.test_all_device_types to torchtest.for_all_device_types (#24337)
Summary:
Rename the decorator to `for_all_device_types`, as a `test_`-prefixed name is recognized as a test in some environments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24337

Differential Revision: D16806807

Pulled By: VitalyFedyunin

fbshipit-source-id: 3132366046e183329ba5838a4bc29441fdb5bd4e
2019-08-14 12:09:51 -07:00
Richard Zou
f996f8d61d Update tensor.view_names / tensor.names_ API (#23973)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23973

Without loss of generality, I describe the API for `tensor.view_names`.
`tensor.names_` has an analogous API.

`tensor.view_names(*names)` returns a view on tensor with named dims `names`.
`names` must be of length `tensor.dim()`; alternatively, if '*' is in `names`,
then it (known as the "glob") is expanded greedily to stand in for the
corresponding names from `tensor.names`.

For example,
```
>>> x = torch.empty(2, 3, 5, 7, names=('N', 'C', 'H', 'W'))
>>> x.view_names('*', 'height', 'width').names
('N', 'C', 'height', 'width')

>>> x.view_names('batch', '*', 'width').names
('batch', 'C', 'H', 'width')
```

tensor.view_names(**rename_map) returns a view on tensor that has
renamed dims as specified in the mapping `rename_map`.

For example,
```
>>> x = torch.empty(2, 3, 5, 7, names=('N', 'C', 'H', 'W'))
>>> x.view_names(W='width', H='height').names
('N', 'C', 'height', 'width')
```

These are different(!!!) from the C++ API, which only allows the
following:
- tensor.view_names(optional<DimnameList>)

C++ API parity for named tensors is not important right now; I am
punting that to the future.

Test Plan: - [namedtensor ci]

Differential Revision: D16710916

Pulled By: zou3519

fbshipit-source-id: 7cb8056c0fb4c97b04c3a2d1dd0f737e0a67ce34
2019-08-14 09:40:35 -07:00
Richard Zou
2fcdb3a1f3 Rename set_names -> view_names, set_names_ -> names_ (#23962)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23962

This change should make the semantics clearer.

`tensor.names_(names)` sets tensor.names to be `names`.

`tensor.view_names(names)` returns a view of the tensor with names
`names`.

Test Plan
- [namedtensor ci]

Test Plan: Imported from OSS

Differential Revision: D16710915

Pulled By: zou3519

fbshipit-source-id: c82fa9812624d03c86f7be84b0a460e3c047aaa0
2019-08-14 09:40:31 -07:00
Richard Zou
7030f2c623 Implement tensor.align_to(names), torch.align_tensors(*tensors) (#23804)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23804

`output = tensor.align_to(names)` returns a view of `tensor` such that
`output.names = names`. Dimensions with the same names in `tensor` and
`output` have the same sizes; dimensions with new names have size 1.

The following must be true for this operation to succeed:
1) tensor.names must be a subsequence (not necessarily contiguous) of `names`
2) Aligning tensor.names to names must not change the absolute position from the
   right of any unnamed dimension.

In practice, these constraints mean that aligning cannot transpose
names.

Some examples:
- Tensor[C].align_to(C) -> Tensor[C]
- Tensor[N].align_to([N, C]) -> Tensor[N, C]
- Tensor[H, W].align_to([N, H, W, C]) -> Tensor[N, H, W, C]
- Tensor[None].align_to([N, None]) -> Tensor[N, None]
- Tensor[N].align_to([N, None, None]) -> Tensor[N, None, None]

Examples of error cases:
- Tensor[W, H].align_to([N, H, W, C]) -> Error (not a subsequence)
- Tensor[None, H].align_to([None, H, W]) -> Error (would change the
absolute position from the right of a None dimension)

`torch.align_tensors(*tensors)` aligns the named dimensions of each
tensor according to the alignment rules so that they can be used in an
operation. More concretely, it aligns each tensor to the
longest names among the names of the tensors in `tensors`.

This allows users to emulate "broadcasting by names", which is one of
the things named tensors tries to enable. Here is an example:

```
imgs: Tensor[N, C, H, W]
scale: Tensor[N]

// Doesn't work because we do broadcasting by alignment by default
imgs * scale

// Does work
imgs, scale = torch.align_tensors(imgs, scale)
imgs * scale
```
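
A runnable sketch of the pattern above, assuming the prototype named tensor API as it later shipped (dimension names passed as strings):

```python
import torch

imgs = torch.randn(2, 3, 4, 4, names=('N', 'C', 'H', 'W'))
scale = torch.randn(2, names=('N',))

imgs, scale = torch.align_tensors(imgs, scale)  # scale viewed as Tensor[N, C, H, W] with size-1 dims
out = imgs * scale                              # "broadcasting by names"
print(out.names, out.shape)                     # ('N', 'C', 'H', 'W') torch.Size([2, 3, 4, 4])
```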

Future:
- Consider allowing broadcasting by names by default.

Test Plan:
- The diff looks pretty large but more than half of it is testing.
- new tests [namedtensor ci]

Differential Revision: D16657927

Pulled By: zou3519

fbshipit-source-id: e2f958bf5146c8ee3b694aba57d21b08e928a4e6
2019-08-14 09:40:27 -07:00
Richard Zou
98a3b3d565 Add name propagation for at::alias, add tensor.set_names (#24202)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24202

tensor.set_names(names) is the out-of-place variant of
tensor.set_names_(names). This naming is probably confusing so I am
taking any and all suggestions.

Test Plan: - run tests [namedtensor ci]

Differential Revision: D16773014

Pulled By: zou3519

fbshipit-source-id: 61024303c1a34db631cc4cb2c53757345e40d72c
2019-08-13 17:01:18 -07:00
Iurii Zdebskyi
0ea8f22951 Enabled comparison ops for bfloat16 dtype on CPU (#24182)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24182

-----
Fix: Enabled comparison operations for BFloat16 on CPU
Test: via unit tests

Test Plan: Imported from OSS

Differential Revision: D16763460

Pulled By: izdeby

fbshipit-source-id: 885ff9006d3bd60bb945147c3b86f97cd0d26f7b
2019-08-13 15:44:24 -07:00
Vitaly Fedyunin
6d14f7a214 Simplify tests that should cover all possible devices (#23824)
Summary:
This PR introduces the `pytorchtest.test_all_device_types()` decorator, which helps write CPU and CUDA tests faster by iterating a single test over all available devices.

Simple `test_var_mean_some_dims`  becomes
```
test_var_mean_some_dims (__main__.TestTorch) ... ok
test_var_mean_some_dims_cpu (__main__.TestTorch) ... ok
test_var_mean_some_dims_cuda (__main__.TestTorch) ... ok
```

```python

class pytorchtest():
    """Allows to generate and run per-device unittests.

    This decorator class allows to generate and run per-device unittest.

    Example:

    class _TestTorchMixin(pytorchtest):

        @pytorchtest.test_all_device_types()
        def test_zeros_like(self, device):
            expected = torch.zeros((100, 100,), device=device)

    Will execute:

        test_zeros_like (__main__.TestTorch) ... skipped 'Look at test_zeros_like_cpu, test_zeros_like_cuda results.'
        test_zeros_like_cpu (__main__.TestTorch) ... ok
        test_zeros_like_cuda (__main__.TestTorch) ... ok

    To work properly, test class should be inherited from the `pytorchtest`.
    test_all_device_types decorator does not guarantee proper functionality in
    combination with other decorators.

    Please do not extend this decorator to support other cases (such as dtype,
    layouts, etc) without consulting with bigger group. Devices is the special
    case as build flags control additions/removals (see
    https://github.com/pytorch/pytorch/pull/23824 for the reference).
    """
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23824

Differential Revision: D16716959

Pulled By: VitalyFedyunin

fbshipit-source-id: ba39af0f9bce2c4a64da421bbc24d6a1c1d9139d
2019-08-13 09:36:31 -07:00
Brian Vaughan
465b4de9d4 add function name to error messages generated by checked_tensor_unwrap (#24187)
Summary:
Improve error messages by showing the relevant function call that failed.

Before:
```
>>> torch.ones(1, dtype=torch.float) < torch.ones(1, dtype=torch.double)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: Expected object of scalar type Float but got scalar type Double for argument #2 'other'
```

After:

```
>>> torch.ones(1, dtype=torch.float) < torch.ones(1, dtype=torch.double)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: Expected object of scalar type Float but got scalar type Double for argument #2 'other' in call to _th_lt
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24187

Differential Revision: D16769167

Pulled By: nairbv

fbshipit-source-id: 4992eb4e86bdac2ab8805cc5356f7f92c63e1255
2019-08-12 14:02:22 -07:00
Richard Zou
75db368031 Revert D16763388: Add name propagation for at::alias, add tensor.set_names
Differential Revision:
D16763388

Original commit changeset: 4b2fb3acc051

fbshipit-source-id: 5be35bdcc2e7c71378af9e34be19305bdd4ba0d1
2019-08-12 13:42:43 -07:00
Richard Zou
1108fa1acb Add name propagation for at::alias, add tensor.set_names (#24105)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24105

tensor.set_names(names) is the out-of-place variant of
tensor.set_names_(names). This naming is probably confusing so I am
taking any and all suggestions.

Test Plan: - run tests [namedtensor ci]

Differential Revision: D16763388

Pulled By: zou3519

fbshipit-source-id: 4b2fb3acc0514515e7ca805dbc5c3d4a9bd96317
2019-08-12 12:44:56 -07:00
Igor Fedan
1d3d92e770 Port addcdiv operator from the TH code to Aten
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/24086

Differential Revision: D16733306

Pulled By: ifedan

fbshipit-source-id: c103bc44e0bb42dff0229252e1a12ce9b4e5aeae
2019-08-09 13:48:01 -07:00
Gregory Chanan
e81f296807 Fixed Bool in IsIntegralType bug (plus review comments) (#23942)
Summary:
Same as https://github.com/pytorch/pytorch/pull/23887, but also includes review comments, so we can kick off a build.

Original PR:
This [PR](https://github.com/pytorch/pytorch/pull/23346) caused [this](https://github.com/pytorch/pytorch/issues/23882) bug.

Fix:
- Deprecate the old isIntegralType and add an overload that takes a boolean flag indicating whether torch.bool should be included in the integral types.

Testing:
- Added extra test cases
- Tested via running unit tests locally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23942

Differential Revision: D16688056

Pulled By: gchanan

fbshipit-source-id: eff457e27b13e116c05ffd022b2fb0495abe0e97
2019-08-09 12:25:27 -07:00
Richard Zou
0bba302da5 Revert D16621830: Add name propagation for at::alias, add tensor.set_names
Differential Revision:
D16621830

Original commit changeset: f8a3837d3a37

fbshipit-source-id: 801ab858a0741d98b0b9d56763fa70a9010fe75e
2019-08-09 10:55:18 -07:00
Richard Zou
78f3b883f0 Add name propagation for at::alias, add tensor.set_names (#23624)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23624

tensor.set_names(names) is the out-of-place variant of
tensor.set_names_(names). This naming is probably confusing so I am
taking any and all suggestions.

Test Plan:
- run tests [namedtensor ci]

gh-metadata: pytorch pytorch 23624 gh/zou3519/86/head

Differential Revision: D16621830

Pulled By: zou3519

fbshipit-source-id: f8a3837d3a370b41210e938369348dcbb4aee53a
2019-08-09 09:17:31 -07:00
Edward Yang
21ea0a115c Revert D16627924: [pytorch][PR] Port addcdiv operator from the TH code to Aten
Differential Revision:
D16627924

Original commit changeset: 960856d30fd3

fbshipit-source-id: a375a3ede5ef956a07fb55c7b4a5d4fc34c96ddb
2019-08-09 08:33:44 -07:00
Hong Xu
2e8557778b Refactor randperm test (#23526)
Summary:
CPU and CUDA testing code are largely the same.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23526

Reviewed By: ezyang

Differential Revision: D16586271

Pulled By: VitalyFedyunin

fbshipit-source-id: 91c70c05789120fde4718ce955de243087a8c993
2019-08-09 08:33:35 -07:00
Stefan Krah
478c793065 Remove numpy assert that fails on Windows (older numpy versions). (#24012)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/24001.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24012

Differential Revision: D16732191

Pulled By: ezyang

fbshipit-source-id: 36660a6635ab64d2f63278b1616deb1282dea037
2019-08-09 07:55:02 -07:00
Igor Fedan
fb77f14054 Port addcdiv operator from the TH code to Aten (#23683)
Summary:
https://github.com/pytorch/pytorch/issues/22796
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23683

Differential Revision: D16627924

Pulled By: ifedan

fbshipit-source-id: 960856d30fd3f79394925eddd0152cc5e27b39b3
2019-08-09 07:44:57 -07:00
Igor Fedan
9114089d70 port atan2 from TH to ATen (#23558)
Summary:
https://github.com/pytorch/pytorch/issues/22799
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23558

Differential Revision: D16591638

Pulled By: ifedan

fbshipit-source-id: d12d4c8229337a22a3278f0c7a8bbc9a86d4c9b7
2019-08-09 07:44:53 -07:00
Iurii Zdebskyi
5b9f55f33f Enable Add, sub, mul, and div on CPU for bfloat16 type. (#22851)
Summary:
Enable Add, sub, mul, and div on CPU for bfloat16 type.
Tested via unit tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22851

Differential Revision: D16256757

Pulled By: izdeby

fbshipit-source-id: 8b62f7581fc0ca0d2cff48ab40d877a9fcf70a5b
2019-08-08 12:34:25 -07:00
Igor Fedan
341d5934b7 Move addcmul to Aten(CUDA) (#23814)
Summary:
https://github.com/pytorch/pytorch/issues/22797
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23814

Differential Revision: D16712381

Pulled By: ifedan

fbshipit-source-id: aeca4fdb9b10143932f195900b1f424ef6d26c89
2019-08-08 12:34:21 -07:00
Ojas Ahuja
10b1254edd fix crash on torch.Tensor.repeat() for 0 repeats (#23766)
Summary:
This PR fixes https://github.com/pytorch/pytorch/issues/23603
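
A quick sketch of the previously crashing case (per the title, a repeat count of 0):

```python
import torch

x = torch.tensor([1, 2, 3])
print(x.repeat(0))   # empty tensor instead of a crash when the repeat count is 0
print(x.repeat(2))   # tensor([1, 2, 3, 1, 2, 3])
```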
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23766

Differential Revision: D16644866

Pulled By: soumith

fbshipit-source-id: ee7d368afdfe874133d0bd90f4d03a191ee22b13
2019-08-07 09:16:00 -07:00