Commit Graph

828 Commits

Author SHA1 Message Date
Mike Ruberry
388cfdf2ac Removes torchtest, expands generic device testing (#26374)
Summary:
- Removes torchtest
- ~~Moves test_torch tests skipped on ROCm to generic device test class~~
- Creates test_nn generic device test class

Next: adding dtypes to generic device testing framework.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26374

Test Plan: Change is to tests themselves.

Differential Revision: D17442218

Pulled By: mruberry

fbshipit-source-id: d7e4451d09fc9049478b35a7efb8bb580071e8c8
2019-09-18 10:24:50 -07:00
Richard Zou
0038111019 Implement named tensor unflatten(dim, namedshape). (#25658)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25658

This unflattens `dim` according to the shape specified in `namedshape`.
`namedshape` may be either an OrderedDict or an iterable of (name, size)
tuples.
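
For illustration, a minimal sketch of the call described above (named tensors were a prototype feature at this point, so the exact surface may differ):

```python
import torch

# Unflatten dim 'C' into three named dims using (name, size) tuples.
t = torch.randn(2, 12, names=('N', 'C'))
out = t.unflatten('C', (('C', 3), ('H', 2), ('W', 2)))
print(out.names)   # ('N', 'C', 'H', 'W')
print(out.shape)   # torch.Size([2, 3, 2, 2])
```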

Future:
- It is possible to make it take a dict in Python >= 3.6 because those are
ordered by default, but I'll leave that task for the future.

Test Plan: - new tests [namedtensor ci]

Differential Revision: D17192655

Pulled By: zou3519

fbshipit-source-id: fd9bd2f462c23a4df1c23d66f2aa95076ff1b160
2019-09-17 21:24:25 -07:00
Michael Suo
a76403f609 Revert D17367016: [pytorch][PR] Enabled bfloat16 dtype on CUDA
Test Plan: revert-hammer

Differential Revision:
D17367016

Original commit changeset: 7e6ae7c6aa4e

fbshipit-source-id: 6ca4e1dec5357232e224bf6d6f957ac80005c77c
2019-09-17 10:39:59 -07:00
Iurii Zdebskyi
1accc38b75 Enabled bfloat16 dtype on CUDA (#26148)
Summary:
Enabled basic functionality for bfloat16 dtype on CUDA.
Tested via unit tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26148

Differential Revision: D17367016

Pulled By: izdeby

fbshipit-source-id: 7e6ae7c6aa4e21f076d8b70b91e26b50063c6875
2019-09-17 08:17:36 -07:00
vishwakftw
2dac673861 Enable batching for pinverse (#26095)
Summary:
Changelog:
- Modify existing implementation of pinverse to support batching on inputs
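
A quick sketch of the batched behavior (shapes chosen for illustration):

```python
import torch

A = torch.randn(4, 5, 3)       # batch of 4 matrices, each 5 x 3
A_pinv = torch.pinverse(A)     # batched pseudo-inverse, shape (4, 3, 5)
print(A_pinv.shape)
```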
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26095

Test Plan: - Added tests in test_pinverse to test batched implementation

Differential Revision: D17408092

Pulled By: soumith

fbshipit-source-id: bba95eb193ce33a94ecfaf74da270d34b435e4af
2019-09-16 23:19:16 -07:00
Hong Xu
81d7675301 Ensure that n is non-negative in polygamma.
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/26294

Differential Revision: D17416847

Pulled By: soumith

fbshipit-source-id: 17d5576e019e31e85c0308fb956524484e526cf6
2019-09-16 23:16:11 -07:00
Mike Ruberry
226ee7a889 Adds generic device tests to test_autograd.py (#26248)
Summary:
- Adds new decorators for skipping on ROCm, skipping on MKL, running only on the CPU and running only on CUDA
- Makes decorator skip semantics consistent
- Adds CUDA default stream requirement to MAGMA decorator
- Creates TestAutogradDeviceType

Note this PR originally moved test_cdist, but moving it caused failures in CI. There may be an undiagnosed issue with cdist or the test. The issue does not reproduce locally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26248

Test Plan: Change is to tests themselves.

Differential Revision: D17410386

Pulled By: mruberry

fbshipit-source-id: 8459df44f2a00f0e71680fbe713587a01d4b0300
2019-09-16 20:25:25 -07:00
Hong Xu
c92ed8dd44 Move the CUDA implementation of round to ATen. (#25041)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25041

Fix #24617

Test Plan: Imported from OSS

Differential Revision: D17114368

Pulled By: VitalyFedyunin

fbshipit-source-id: 6ec6ef99b4451acd7e93491fd4b44fca9ce1809d
2019-09-16 09:54:30 -07:00
Mike Ruberry
31139b5f9a Back out "[pytorch][PR] Refines test_torch.py generic device testing" (#26252)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26252

Original commit changeset: 1375774f24c2

Testing to see if this is somehow the source of hangs on ROCm builds.

Test Plan: Change is to tests themselves. This diff is for testing the ROCm hang, however.

Differential Revision: D17390575

fbshipit-source-id: a6ffd5eb1df3971b99b6d42271a8d3d501ac79c6
2019-09-15 13:42:25 -07:00
Mike Ruberry
b6b2b4c18f Refines test_torch.py generic device testing (#26244)
Summary:
- Adds skipCUDAIfRocm and skipCPUIfNoMkl decorators, ports corresponding tests
- Changes "skipIf" input semantics for consistency
- Removes torchtest, which has been replaced with this new generic framework
- Refactors some common parts out of CUDA tests to TestTorchDeviceType
- Ensures all MAGMA tests run on default stream by putting the skipCUDANonDefaultStreamIf in the skipCUDAIfNoMagma decorator.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26244

Differential Revision: D17389060

Pulled By: mruberry

fbshipit-source-id: 1375774f24c2266049e6d4b899e7300ddf32eac8
2019-09-15 03:35:23 -07:00
Mike Ruberry
b4b8f53a5d Ports most of test_torch.py to generic device type framework (#26232)
Summary:
This PR moves many tests in test_torch.py to the generic device type framework. This means that many CUDA tests now run in test_torch.py and there is greater consistency in how tests for many device types are written.

One change is that all MAGMA tests are run on the default stream due to intermittent instability running MAGMA on the non-default stream. This is a known issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26232

Test Plan:
While this PR edits the tests itself, it was validated using two independent methods:

(1) The code was reviewed and it was verified that all deleted functions were actually moved.
(2) The output of the TestTorch CI was reviewed and test outputs were matched before and after this PR.

Differential Revision: D17386370

Pulled By: mruberry

fbshipit-source-id: 843d14911bbd52e8aac6861c0d9bc3d0d9418219
2019-09-14 17:10:47 -07:00
Mike Ruberry
fbf991d062 Creates generic device type testing framework (#25967)
Summary:
This PR addresses https://github.com/pytorch/pytorch/issues/24851 by...

1. letting device types easily register themselves for testing
2. letting tests be written to run on multiple devices and with multiple dtypes
3. providing a mechanism to instantiate those tests so they are discoverable and filterable by unittest and pytest

It refactors three tests from test_torch.py to demonstrate how to use it.

`test_diagonal` is the simplest example. Most tests just need to be modified to accept 'device' as an argument. The framework will then instantiate `test_diagonal_cpu` and `test_diagonal_cuda` (when CUDA is available) which call `test_diagonal` with the appropriate 'device' argument.

`test_neg` also has dtype variants. It accepts both 'device' and 'dtype' as arguments, and the dtypes it runs with are specified with the 'dtypes' decorator. Dtypes can be specified for all device types and particular device types. The framework instantiates tests like `test_neg_cpu_torch.float`.

`test_inverse` has device-specific dependencies. These dependencies are expressed with the sugary 'skipCUDAIfNoMagma' and 'skipCPUIfNoLapack' decorators. These decorators are device-specific, so CPU testing is not skipped if MAGMA is not installed, and their conditions may be checked before or after the test case has been initialized. This means that skipCUDAIfNoMagma does not initialize CUDA. In fact, CUDA is only initialized if a CUDA test is run.

These instantiated tests may be run as usual and with pytest filtering it's easy to run one test on all device types, run all the tests for a particular device type, or run a device type and dtype combination.

See the note "Generic Device-Type Testing" for more detail.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25967

Differential Revision: D17381987

Pulled By: mruberry

fbshipit-source-id: 4a639641130f0a59d22da0efe0951b24b5bc4bfb
2019-09-13 23:34:28 -07:00
Geovanni Zhang
e293c4ea73 Fix 'in' return true incorrectly (#24156)
Summary:
Because of 'return NotImplemented', __contains__ returned True when the element is not a number, since bool(NotImplemented) == True.
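
A minimal illustration of the failure mode (not PyTorch's actual code):

```python
class Box:
    def __contains__(self, item):
        if not isinstance(item, (int, float)):
            return NotImplemented   # truthy, so `x in box` evaluates to True
        return item == 42

print("hello" in Box())  # True -- the surprising result this commit fixes
```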
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24156

Differential Revision: D16829895

Pulled By: zou3519

fbshipit-source-id: 9d3d58025b2b78b33a26fdfcfa6029d0d049f11f
2019-09-13 09:27:58 -07:00
Richard Zou
5e2d25af34 Implement tensor.align_as(other), change tensor.align_to(names) (#25843)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25843

`tensor.align_to(*names)` permutes the dimensions of `tensor` and adds
additional 1-sized dimensions such that the output tensor has dimensions
in the same order as `names`. All dimensions of `tensor` must be
present in `names`; in addition, this function requires that all dims of
`tensor` be named.

`tensor.align_as(other)` is equivalent to
`tensor.align_to(*other.names)`.
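
A small sketch of both calls under the semantics described above (named tensor prototype API):

```python
import torch

x = torch.randn(2, 3, names=('N', 'C'))
y = x.align_to('N', 'H', 'C')              # inserts a 1-sized 'H' dim -> shape (2, 1, 3)
template = torch.empty(0, 0, 0, names=('N', 'H', 'C'))
z = x.align_as(template)                   # same as x.align_to(*template.names)
print(y.shape, z.shape)                    # torch.Size([2, 1, 3]) twice
```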

I'm planning on changing `torch.align_tensors(*tensors)` to align closer
to these semantics because there didn't seem to be a clear use case for the old
semantics that preserve unnamed dimensions. That will come in a future
change.

Test Plan: - new tests [namedtensor ci]

Differential Revision: D17255549

Pulled By: zou3519

fbshipit-source-id: 1e437ad81e9359b4d5bd0e7e64c3a1be441fc3e3
2019-09-12 22:53:44 -07:00
Richard Zou
e544f88590 Implement tensor.refine_names (#25842)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25842

`tensor.refine_names(*names)` takes `tensor` and attempts to name its
dimensions `names` out-of-place. If a dimension `i` already had a name,
then it cannot be changed (so tensor.names[i] must equal names[i]);
if the original dimension did not have a name, then the new name
(names[i]) can be anything.

`tensor.refine_names(*names)` also accepts a glob '*' that greedily selects
names from `tensor`. Here are some examples:

- `Tensor[None].refine_names('N') -> Tensor[N]`
- `Tensor[N].refine_names('N') -> Tensor[N]`
- `Tensor[N].refine_names('D') -> Error!`
- `Tensor[N].refine_names(None) -> Error!`
- `Tensor[None, None].refine_names('*', D) -> Tensor[None, D]`
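
The same semantics as runnable code (a sketch; the glob spelling follows the description above):

```python
import torch

x = torch.randn(2, 3)                    # names: (None, None)
a = x.refine_names('N', 'C')             # unnamed dims may take any name
b = a.refine_names('N', 'C')             # existing names must match exactly
c = torch.randn(2, 3, 4).refine_names('*', 'W')   # glob greedily keeps leading names
print(a.names, c.names)
```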

Test Plan: - new tests [namedtensor ci]

Differential Revision: D17255548

Pulled By: zou3519

fbshipit-source-id: fdbdb3a12f24fbe37ce1e53ed09dc8a42589d928
2019-09-12 22:53:40 -07:00
vishwakftw
eee58f8284 Refactor torch.*solve tests (#25733)
Summary:
Changelog:
- De-duplicate the code in tests for torch.solve, torch.cholesky_solve, torch.triangular_solve
- Skip tests explicitly if requirements aren't met, e.g., if NumPy / SciPy aren't available in the environment
- Add generic helpers for these tests in test/common_utils.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25733

Test Plan:
- All tests should pass to confirm that the change is not erroneous

Clears one point specified in the discussion in https://github.com/pytorch/pytorch/issues/24333.

Differential Revision: D17315330

Pulled By: zou3519

fbshipit-source-id: c72a793e89af7e2cdb163521816d56747fd70a0e
2019-09-11 14:30:00 -07:00
Pavel Belevich
a14e884546 Migrate pow from TH to Aten (CUDA) (#25517)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/24613

```
DEBUG = 0
OMP_NUM_THREADS = 1
Tesla M40

import torch

base = torch.randn(1000000, device='cuda:1')
exp  = torch.randn(1000000, device='cuda:1')
out  = torch.empty_like(base)

timeit base.pow(0)
old 53.1 µs ± 22.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 18.7 µs ± 15 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

timeit base.pow(1/3)
old 53.3 µs ± 20.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 51.1 µs ± 101 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit base.pow(-1/3)
old 53.3 µs ± 55.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 51.1 µs ± 29.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit base.pow(1/2)
old 53.2 µs ± 38.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 34.8 µs ± 40.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit base.pow(-1/2)
old 53.3 µs ± 54.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 42 µs ± 32.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit base.pow(1)
old 38.3 µs ± 53.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 40.1 µs ± 41.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit base.pow(-1)
old 38.4 µs ± 29 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 35 µs ± 143 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit base.pow(2)
old 38.1 µs ± 20.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 34.8 µs ± 90.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit base.pow(-2)
old 38.3 µs ± 11.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 35.2 µs ± 54.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit base.pow(3)
old 38.3 µs ± 164 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 34.9 µs ± 46.8 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit base.pow(-3)
old 53.3 µs ± 89.8 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 51.4 µs ± 31.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit base.pow(123456.789)
old 53.3 µs ± 12.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 51.2 µs ± 24.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit base.pow(-123456.789)
old 53.5 µs ± 152 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 51.3 µs ± 66.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit base.pow(exp)
old 58.2 µs ± 25.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 54.5 µs ± 25.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit torch.pow(0, exp)
old 49.1 µs ± 89.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 58.7 µs ± 125 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit torch.pow(1, exp)
old 48.7 µs ± 26.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 18.7 µs ± 88.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

timeit torch.pow(-1, exp)
old 50.7 µs ± 104 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 59.8 µs ± 100 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit torch.pow(42, exp)
old 49.4 µs ± 98 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 58.6 µs ± 26.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit torch.pow(-42, exp)
old 50.4 µs ± 131 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 59.8 µs ± 48.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit torch.pow(0, exp, out=out)
old 49 µs ± 13 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 59.2 µs ± 169 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit torch.pow(1, exp, out=out)
old 49.3 µs ± 122 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 18.8 µs ± 45.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

timeit torch.pow(-1, exp, out=out)
old 50.4 µs ± 167 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 60.2 µs ± 71.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit torch.pow(42, exp, out=out)
old 49.2 µs ± 293 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 58.9 µs ± 193 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit torch.pow(-42, exp, out=out)
old 50.5 µs ± 150 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 60.1 µs ± 89.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

base = (torch.rand(1000000, device='cuda:1') * 10).to(int)
exp  = (torch.rand(1000000, device='cuda:1') * 10).to(int)
out  = torch.empty_like(base)

timeit base.pow(0)
old 75.5 µs ± 10.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 33.8 µs ± 84.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit base.pow(1/3)
old 75.5 µs ± 78.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 842 µs ± 449 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit base.pow(-1/3)
old 75.5 µs ± 24.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 843 µs ± 231 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit base.pow(1/2)
old 75.7 µs ± 141 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 123 µs ± 71.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit base.pow(-1/2)
old 76 µs ± 162 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 180 µs ± 55.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit base.pow(1)
old 74.1 µs ± 25.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 72.3 µs ± 32.8 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit base.pow(-1.0)
old Integers to negative integer powers are not allowed.
new 86.9 µs ± 84.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit base.pow(2)
old 74.2 µs ± 15.8 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 66.5 µs ± 28.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit base.pow(-2.0)
old Integers to negative integer powers are not allowed.
new 87.3 µs ± 25.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit base.pow(3)
old 74.3 µs ± 23.8 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 66.5 µs ± 43.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit base.pow(-3.0)
old Integers to negative integer powers are not allowed.
new 861 µs ± 372 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit base.pow(123456.789)
old 256 µs ± 115 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 863 µs ± 64.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit base.pow(-123456.789)
old Integers to negative integer powers are not allowed.
new 863 µs ± 57.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit base.pow(exp)
old 111 µs ± 14.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 98.8 µs ± 16 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit torch.pow(0, exp)
old 81.9 µs ± 23.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 92.9 µs ± 14.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit torch.pow(1, exp)
old 81.9 µs ± 25.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 33.6 µs ± 56.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit torch.pow(-1, exp)
old 82.2 µs ± 15.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 93.6 µs ± 161 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit torch.pow(42, exp)
old 82.1 µs ± 10.8 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 93.8 µs ± 75.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit torch.pow(-42, exp)
old 82.3 µs ± 18.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 94 µs ± 68.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit torch.pow(0, exp, out=out)
old 81.6 µs ± 115 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 93.8 µs ± 83.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit torch.pow(1, exp, out=out)
old 81.6 µs ± 26.8 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 33.7 µs ± 36.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit torch.pow(-1, exp, out=out)
old 82.7 µs ± 119 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 93.9 µs ± 116 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit torch.pow(42, exp, out=out)
old 82.6 µs ± 216 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 93.7 µs ± 144 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit torch.pow(-42, exp, out=out)
old 82.5 µs ± 214 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
new 94 µs ± 55.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25517

Differential Revision: D17251364

Pulled By: pbelevich

fbshipit-source-id: 20904c073c311e76285eaa1b68e67e67ea3c62d8
2019-09-10 13:46:22 -07:00
Ailing Zhang
26f67e7aa7 fix scatter CPU kernel when (input size, src size) > index size (#25839)
Summary:
fixes https://github.com/pytorch/pytorch/issues/25836
According to doc, https://pytorch.org/docs/stable/tensors.html#torch.Tensor.scatter_ `index` must have the smallest size and we should iterate over `index` instead of `tensor`.
cc: dlibenzi
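
A small sketch of the shape rule described above:

```python
import torch

self_t = torch.zeros(3, 5)
src    = torch.ones(3, 5)
index  = torch.tensor([[0, 1], [2, 0]])   # strictly smaller than self_t and src
out = self_t.scatter(0, index, src)       # only index.numel() positions are written
print(out)
```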
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25839

Differential Revision: D17269116

Pulled By: ailzhang

fbshipit-source-id: 0e8569fed6c0d2dd70e4e3ec5d29d8730cd2ae8f
2019-09-10 11:41:41 -07:00
Hong Xu
57b23c61c5 In the CUDA implementation of erfinv, erfinv() should be used for double (#25337)
Summary:
This best preserves accuracy, while erfinvf() should be used for half and float.

This is also consistent with the implementation before the migration: https://github.com/pytorch/pytorch/issues/24943
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25337

Differential Revision: D17102333

Pulled By: zou3519

fbshipit-source-id: 5178cff534cf5f10d86ab04d4b6c1779ffedf49e
2019-09-10 06:30:33 -07:00
Igor Fedan
bf04c2ca2f Make torch checks same for both CPU and CUDA multinomial (#25595)
Summary:
Currently we have different checks for multinomial method on CPU and CUDA. This PR will make them consistent.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25595

Differential Revision: D17236163

Pulled By: ifedan

fbshipit-source-id: 7718173bdaf216e8eb636c2a5b9c5939b975325b
2019-09-10 05:29:58 -07:00
vishwakftw
36bdde255e Fix test_det_logdet_slogdet_batched on PowerPC (#25773)
Summary:
Changelog:
- Simplify generation of singular matrices to just constructing a constant matrix instead of a random singular matrix using random_square_matrix_of_rank, which is susceptible to numerical issues
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25773

Test Plan:
- test_det_logdet_slogdet_batched should pass

Fixes https://github.com/pytorch/pytorch/issues/25172

cc: branfosj hartb

Apologies for the delay.

Differential Revision: D17261059

Pulled By: soumith

fbshipit-source-id: 8f991e2cb8c0e9dccad363d4785075213088e58a
2019-09-09 19:23:42 -07:00
Richard Zou
7970e5720b Rename tensor.view_names -> tensor.renamed (#25711)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25711

This function renames the dimensions of a tensor out-of-place. Because
of that, I think `tensor.renamed(...)` is a clearer name: `view_names`
has the connotation that we can use names to `view` our tensors with a
"different shape", but what this function really does is let us rename a
tensor no matter the previous names.

`tensor.names_`, the in-place version of this, is unchanged for now.
However, we might delete this or not advertise it if it has no use case
and also because its naming is a little inconsistent with `tensor.renamed`.
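
A minimal sketch of the out-of-place rename described above (named tensor prototype; the method name here follows this commit):

```python
import torch

x = torch.randn(2, 3, names=('N', 'C'))
y = x.renamed('batch', 'channels')   # rename regardless of the previous names
print(x.names, y.names)              # ('N', 'C') ('batch', 'channels')
```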

Test Plan: - [namedtensor ci]

Differential Revision: D17206515

Pulled By: zou3519

fbshipit-source-id: 67053951fcc8130c84566b5ebbdce35ef619c90d
2019-09-06 11:28:04 -07:00
Brian Vaughan
88e4cee3e7 Improve handling of mixed-type tensor operations (#22273)
Summary:
Improve handling of mixed-type tensor operations.

This PR affects the arithmetic (add, sub, mul, and div) operators implemented via TensorIterator (so dense but not sparse tensor ops).

For these operators, we will now promote to reasonable types where possible, following the rules defined in https://github.com/pytorch/pytorch/issues/9515, and error in cases where the cast would require floating point -> integral or non-boolean to boolean downcasts.

The details of the promotion rules are described here:
https://github.com/nairbv/pytorch/blob/promote_types_strict/docs/source/tensor_attributes.rst

Some specific backwards incompatible examples:
* now `int_tensor * float` will result in a float tensor, whereas previously the floating point operand was first cast to an int. Previously `torch.tensor(10) * 1.9` => `tensor(10)` because the 1.9 was downcast to `1`. Now the result will be the more intuitive `tensor(19)`
* Now `int_tensor *= float` will error, since the floating point result of this operation can't be cast into the in-place integral type result.
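
Both cases as a quick check:

```python
import torch

int_tensor = torch.tensor(10)
print(int_tensor * 1.9)              # tensor(19.) -- promoted to a float tensor

t = torch.ones(3, dtype=torch.int64)
try:
    t *= 1.5                         # float result cannot be cast back to int64 in-place
except RuntimeError as e:
    print('error:', e)
```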

See more examples/detail in the original issue (https://github.com/pytorch/pytorch/issues/9515), in the above linked tensor_attributes.rst doc, or in the test_type_promotion.py tests added in this PR:
https://github.com/nairbv/pytorch/blob/promote_types_strict/test/test_type_promotion.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22273

Reviewed By: gchanan

Differential Revision: D16582230

Pulled By: nairbv

fbshipit-source-id: 4029cca891908cdbf4253e4513c617bba7306cb3
2019-09-05 18:26:09 -07:00
Johannes M Dieterich
c6dd4036f5 Enable two tests that were skipped b/c of rocThrust bugs fixed in ROCm 2.7
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/25724

Differential Revision: D17212373

Pulled By: bddppq

fbshipit-source-id: 2978bc13cdcd0e96a82c0019a08b589f67c0fe1d
2019-09-05 16:10:56 -07:00
iotamudelta
4fe857187c switch to rocThrust for thrust/cub APIs (#25620)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25620

Pull Request resolved: https://github.com/pytorch/pytorch/pull/25602

Enable rocThrust with hipCUB and rocPRIM for ROCm. They are the ROCm implementations of the thrust and cub APIs and replace the older hip-thrust and cub-hip packages going forward. ROCm 2.5 is the first release to contain the new packages as an option; as of 2.6 they will be the only available option.

Add hipification rules to correctly hipify thrust::cuda to thrust::hip and cub:: to hipcub:: going forward. Add hipification rules to hipify specific cub headers to the general hipcub header.

Infrastructure work to correctly find, include and link against the new packages. Add the macro definition to choose the HIP backend to Thrust.

Since include chains are now a little different from CUDA's Thrust, add includes for functionality used where applicable.

Skip four tests that fail with the new rocThrust for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21864

Reviewed By: xw285cornell

Differential Revision: D16940768

Pulled By: bddppq

fbshipit-source-id: 3dba8a8f1763dd23d89eb0dd26d1db109973dbe5
2019-09-03 22:16:30 -07:00
vishwakftw
d1e079e2e0 Enable torch.cholesky for batches > 262140 (#24438)
Summary:
Changelog:
- Iterate over mini batches of 262140 matrices (maximum)
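
A small batched example (the chunking over mini batches happens internally; sizes here are illustrative):

```python
import torch

A = torch.randn(8, 3, 3)
A = A @ A.transpose(-1, -2) + 1e-3 * torch.eye(3)   # make each matrix SPD
L = torch.cholesky(A)                               # works for arbitrarily large batches
print(L.shape)
```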
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24438

Test Plan:
- Added slow tests to test the behavior in test_torch and test_cuda

Fixes https://github.com/pytorch/pytorch/issues/24403

Differential Revision: D17175603

Pulled By: soumith

fbshipit-source-id: 1abb0a1e92494cf43ef4ba9efb54a919cd18bfef
2019-09-03 17:35:37 -07:00
vishwakftw
1e4832ffad Enable broadcasting of batch dimensions RHS and LHS tensors for lu_solve (#24333)
Summary:
Changelog:
- Enable broadcasting of RHS and LHS tensors for lu_solve. This means that you can now have RHS with size `3 x 2` and LHS with size `4 x 3 x 3` for instance
- Remove deprecated behavior of having 2D tensors for RHS. Now all tensors have to have a last dimension which equals the number of right hand sides
- Modified docs
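
A sketch of the broadcasting described above:

```python
import torch

A = torch.randn(4, 3, 3)              # batch of LHS matrices
b = torch.randn(3, 2)                 # single RHS, broadcast across the batch
LU, pivots = torch.lu(A)
x = torch.lu_solve(b, LU, pivots)     # shape (4, 3, 2)
print(x.shape)
```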
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24333

Test Plan: - Add tests for new behavior in test_torch.py with a port to test_cuda.py

Differential Revision: D17165463

Pulled By: zou3519

fbshipit-source-id: cda5d5496ddb29ed0182bab250b5d90f8f454aa6
2019-09-03 15:14:48 -07:00
Richard Zou
5c4cc1e8f3 Prepare to add some Dimname/DimnameList overloads (#25405)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25405

This PR adds schemas to native_functions.yaml, core/Tensor.h, and
core/TensorMethods.h for Dimname/DimnameList overloads for the following
functions:
- min, max, max_values, min_values
- mean, median
- logsumexp, std, var, norm

The actual implementations will come in a later PR. I am accumulating
all the additional schemas and changes to core/{Tensor|TensorMethods}.h
in this PR so that there is only one point of failure for potential
merge conflicts.

Test Plan: - Check that all pytorch builds still build. [namedtensor ci]

Differential Revision: D17116333

Pulled By: zou3519

fbshipit-source-id: fd666d60109a311767169261afbec0fd85cc00c8
2019-09-03 10:55:47 -07:00
davidriazati
7a921ba17d Manually implement is_zipfile (#25279)
Summary:
The default implementation is lenient in that it recognizes a zipfile if the magic number appears anywhere in the archive. So if someone has the bytes `PK\x03\x04` in a tensor, it gets recognized as a zipfile. See https://bugs.python.org/issue28494

This implementation only checks the first 4 bytes of the file for the zip magic number. We could also copy https://github.com/python/cpython/pull/5053's fix, but that seems like overkill.
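
A minimal sketch of such a strict check (hypothetical helper, not the actual torch.serialization code):

```python
def looks_like_zip(f):
    """Return True only if the stream starts with the ZIP local-file-header magic."""
    start = f.tell()
    magic = f.read(4)
    f.seek(start)
    return magic == b'PK\x03\x04'
```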

Fixes #25214
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25279

Pulled By: driazati

Differential Revision: D17102516

fbshipit-source-id: 4d09645bd97e9ff7136a2229fba1d9a1bce5665a
2019-08-30 16:47:50 -07:00
CamiWilliams
329757a907 Torch.flatten() returns a 1-dim tensor on a 0-dim tensor (#25406)
Summary:
PR for `torch.flatten()` to return a 1-dim tensor on a 0-dim tensor

> torch.tensor(123).shape -> torch.Size([])
> torch.tensor(123).flatten() -> torch.tensor([123])
> torch.tensor(123).flatten().shape -> torch.Size([1])

resolve https://github.com/pytorch/pytorch/issues/22963
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25406

Differential Revision: D17120464

Pulled By: CamiWilliams

fbshipit-source-id: efbecd61f0aefd82f2ab417ca6bb467488ff99de
2019-08-30 08:53:08 -07:00
Brian Vaughan
f0c6021846 fix bug in assertNotEqual for int tensors (#25412)
Summary:
re-apply: https://github.com/pytorch/pytorch/pull/25199
but without a failing quantized test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25412

Differential Revision: D17131303

Pulled By: nairbv

fbshipit-source-id: edf7736af3ede5e809eded72be9514e922e70db4
2019-08-30 06:52:30 -07:00
Jerry Zhang
e231bd16fb Revert D17112656: [pytorch][PR] fix bug in assertNotEqual for int tensors
Test Plan: revert-hammer

Differential Revision:
D17112656

Original commit changeset: 43e0e7da6d58

fbshipit-source-id: 0a0f7b8b125f24a45023ddb46fe144f21499b723
2019-08-29 10:36:56 -07:00
Hong Xu
2e1c37c95c Move the CUDA implementation of ceil to ATen. (#24866)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24866

Fix #24542

Test Plan: Imported from OSS

Differential Revision: D16965903

Pulled By: VitalyFedyunin

fbshipit-source-id: b9decaa58bec813a23d369b5e1eec627599f41da
2019-08-29 08:48:31 -07:00
Brian Vaughan
1e2b19db6d fix bug in assertNotEqual for int tensors (#25199)
Summary:
assertNotEqual was failing to detect differences in int tensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25199

Differential Revision: D17112656

Pulled By: nairbv

fbshipit-source-id: 43e0e7da6d58eb1c837a508d462a748b2065bdd9
2019-08-29 07:32:50 -07:00
Gregory Chanan
f362a5a04b Revert "Let logical_xor support non-bool tensors." (#25269)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25269

This reverts commit 5ca612b55e.

Test Plan: Imported from OSS

Differential Revision: D17080088

fbshipit-source-id: e6b6215b713910c448e9a6b831b08f28b849c64a
2019-08-28 15:41:51 -07:00
SsnL
6100de9b1b implement bool_tensor.bernoulli_ (#25076)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/25072
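
Usage sketch:

```python
import torch

mask = torch.empty(5, dtype=torch.bool)
mask.bernoulli_(0.5)      # in-place fill with random True/False
print(mask)
```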
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25076

Differential Revision: D17073453

Pulled By: ezyang

fbshipit-source-id: 42410da8c9911c1d7b3543bde740c7e66ae0cc1c
2019-08-28 12:25:27 -07:00
Igor Fedan
afb7a162fb Migrate erfinv and erfinv_ from the TH to Aten (CUDA) (#24943)
Summary:
https://github.com/pytorch/pytorch/issues/24560
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24943

Differential Revision: D16996434

Pulled By: ifedan

fbshipit-source-id: 77111a4e47bb2b20f65225d48e7213cd77ddae19
2019-08-28 09:30:08 -07:00
Pavel Belevich
112f249446 Port pow operator from the TH code to Aten (#23492)
Summary:
Fixing https://github.com/pytorch/pytorch/issues/24750
```
DEBUG = 0
OMP_NUM_THREADS = 1

import torch

base = torch.randn(1000000)
exp  = torch.randn(1000000)
out  = torch.empty_like(base)

timeit base.pow(0)							+30x
old 6.26 ms ± 35.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
new 213 µs ± 3.38 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

timeit base.pow(1/3)						+6x
old 56 ms ± 911 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
new 9.41 ms ± 237 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

timeit base.pow(-1/3)						+6x
old 57 ms ± 1.65 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
new 9.49 ms ± 293 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

timeit base.pow(1/2)						+6x
old 4.04 ms ± 14.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
new 620 µs ± 3.35 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

timeit base.pow(-1/2)						+5x
old 6.56 ms ± 43 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
new 1.24 ms ± 19.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

timeit base.pow(1)							no diff
old 322 µs ± 4.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
new 331 µs ± 7.26 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

timeit base.pow(-1)							+3.5x
old 2.48 ms ± 15.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
new 717 µs ± 130 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

timeit base.pow(2)							no diff
old 328 µs ± 7.42 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
new 324 µs ± 4.93 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

timeit base.pow(-2)							+3.5x
old 2.45 ms ± 11.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
new 662 µs ± 3.83 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

timeit base.pow(3)							+7x
old 2.39 ms ± 60.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
new 334 µs ± 7.26 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

timeit base.pow(-3)							+9x
old 93.7 ms ± 5.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
new 10.3 ms ± 666 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

timeit base.pow(123456.789)					+5x
old 46.5 ms ± 418 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
new 9.68 ms ± 325 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

timeit base.pow(-123456.789)				+5x
old 46.5 ms ± 784 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
new 10 ms ± 541 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

timeit base.pow(exp)						+6x
old 60.6 ms ± 4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
new 9.7 ms ± 379 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

timeit torch.pow(0, exp)					no diff
old 18.3 ms ± 859 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
new 21.2 ms ± 333 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

timeit torch.pow(1, exp)					+30x
old 6.01 ms ± 81.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
new 203 µs ± 1.08 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

timeit torch.pow(-1, exp)					+3x
old 30.8 ms ± 5.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
new 9.67 ms ± 441 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

timeit torch.pow(42, exp)					+8x
old 80.1 ms ± 1.57 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
new 9.51 ms ± 103 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

timeit torch.pow(-42, exp)					+2x
old 21.8 ms ± 4.37 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
new 9.5 ms ± 89.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

timeit torch.pow(0, exp, out=out)			no diff
old 20.2 ms ± 3.04 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
new 22.1 ms ± 648 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

timeit torch.pow(1, exp, out=out)			+30x
old 6.7 ms ± 397 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
new 203 µs ± 4.64 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

timeit torch.pow(-1, exp, out=out)			+3x
old 32.5 ms ± 3.61 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
new 9.4 ms ± 99.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

timeit torch.pow(42, exp, out=out)			+10x
old 91 ms ± 7.45 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
new 9.64 ms ± 291 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

timeit torch.pow(-42, exp, out=out)			+2.5x
old 25.9 ms ± 5.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
new 10.1 ms ± 698 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

```

BC: enforce stronger shape requirements on the output tensor (out= keyword argument) and do not allow output tensor to be resized if it is also used as one of the inputs.
BC: enforce the stronger requirement for integer tensor bases raised to integer exponents on CPU and CUDA: `Integers to negative integer powers are not allowed.`
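
A quick illustration of the integer-power restriction (error text quoted from above):

```python
import torch

base = torch.arange(1, 5)      # integer tensor
print(base.pow(2))             # fine: tensor([ 1,  4,  9, 16])
try:
    base.pow(-2)
except RuntimeError as e:
    print(e)                   # "Integers to negative integer powers are not allowed."
```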
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23492

Differential Revision: D16731583

Pulled By: pbelevich

fbshipit-source-id: 4e5bf689357fe82a19371e42d48abbb7b4c1c3ca
2019-08-28 09:11:50 -07:00
Igor Fedan
9b1097958e Migrate digamma\digamma_\polygamma\polygamma_ from the TH to Aten (CPU) (#25048)
Summary:
https://github.com/pytorch/pytorch/issues/24612
https://github.com/pytorch/pytorch/issues/24550
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25048

Differential Revision: D16996440

Pulled By: ifedan

fbshipit-source-id: 0d76588d179d4c932e3fc284cb399dcfc77bc622
2019-08-28 08:29:13 -07:00
Patrick Donnelly
883628cb5c Added documentation for nn.functional.bilinear (#24951)
Summary:
Adds documentation for `nn.functional.bilinear`, as requested in https://github.com/pytorch/pytorch/issues/9886.

The format follows that of `nn.functional.linear`, and borrows from `nn.bilinear` in its description of `Tensor` shapes.

I am happy to add more extensive documentation (e.g. "Args," "Example(s)"). From what I gather, the format of comments is inconsistent across functions in `nn.functional.py` and between modules (e.g. `nn.functional` and `nn`). It's my first PR, so guidance for contributing documentation and other code would be greatly appreciated!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24951

Differential Revision: D17091261

Pulled By: soumith

fbshipit-source-id: efe2ad764700dfd6f30eedc03de4e1cd0d10ac72
2019-08-28 08:19:25 -07:00
Funtowicz Morgan
2c22076342 Moving sign function to ATen (#22861)
Summary:
This PR, linked to https://github.com/pytorch/pytorch/issues/22806, moves the sign function to ATen.

sign(x) supports bool and uses vectorized operations on CPU.
sign(NaN) is defined to return 0.
sign(bool) is a no-op; the resulting tensor holds the same values as the input one.
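
The semantics above as a small check (expected outputs per this PR's description):

```python
import torch

x = torch.tensor([float('nan'), -2.0, 0.0, 3.0])
print(torch.sign(x))            # per this PR: tensor([ 0., -1.,  0.,  1.])
b = torch.tensor([True, False])
print(torch.sign(b))            # no-op on bool: tensor([ True, False])
```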

- [x] CPU Backend
- [x] CUDA Backend
- [x] Bring support for bool dtype
- [x] Bring support for Half dtype
- [x] Add test for NaN
- [x] Add test for bool dtype
- [x] Delete legacy implementation in THTensorMoreMath.cpp

Performances:
```python
timeit -s 'import torch; x = torch.randn((1000, 1000))' -n 1000 'torch.sign(x)'
timeit -s 'import torch; x = torch.randn((1000, 1000), device="cuda")' -n 1000 'torch.sign(x); torch.cuda.synchronize()'
```

| device |  before  | after |
| :-------------: | :-------------: | :-----: |
| CPU    | 1.24 msec | 33.9 usec |
| GPU    | 680 usec | 7.13 usec  |
| CPU (1 thread) | 0.82 msec | 0.73 msec |
| GPU (1 thread) | 16.1 usec | 15.9 usec |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22861

Differential Revision: D16503452

Pulled By: VitalyFedyunin

fbshipit-source-id: a87ce7fff139642ef4ed791f15873074ad0d53af
2019-08-27 19:01:34 -07:00
Pavel Belevich
30bc65271d torch.from_numpy fix for np.int (#25139)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/22615
Because of the different sizeof(long), we have the following relations between NPY_TYPES and NPY_INTXX aliases:
```
int value	Enum			Unix		Windows
1		NPY_BYTE		NPY_INT8	NPY_INT8
3		NPY_SHORT		NPY_INT16	NPY_INT16
5		NPY_INT			NPY_INT32	-
7		NPY_LONG		NPY_INT64	NPY_INT32
9		NPY_LONGLONG		-		NPY_INT64
```
I suggest the following fix for `numpy_dtype_to_aten` method:
```
if (dtype == NPY_INT || dtype == NPY_INT32) {
	return kInt;
} else if (dtype == NPY_LONGLONG || dtype == NPY_INT64) {
	return kLong;
}
```
On Unix it will be replaced with:
```
if (dtype == 5 || dtype == 5) {
	return kInt;
} else if (dtype == 9 || dtype == 7) {
	return kLong;
}
```
and on Windows with:
```
if (dtype == 5 || dtype == 7) {
	return kInt;
} else if (dtype == 9 || dtype == 9) {
	return kLong;
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25139

Differential Revision: D17048443

Pulled By: pbelevich

fbshipit-source-id: 9f2c27ff2829b893a35d3d57f176a58e7749a468
2019-08-26 05:07:22 -07:00
Kexuan Sun
4b3ea92787 Test if descriptions of args are in the template (#24161)
Summary:
As in https://github.com/pytorch/pytorch/issues/23439, some descriptions of arguments in `_torch_docs.py` have been replaced by `common_args`; it would be helpful to check whether any descriptions can be replaced for new docs in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24161

Differential Revision: D16889293

Pulled By: ezyang

fbshipit-source-id: bf6f581494482d6eb32e634f73e84a4586766230
2019-08-20 16:34:50 -07:00
Max Balandat
d33623f7c1 Make SobolEngine use random seed if not specified (#24884)
Summary:
Addresses https://github.com/pytorch/pytorch/issues/24881. Makes behavior consistent with the rest of the random functions.
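
Usage sketch: with no seed given, a scrambled engine now seeds itself randomly.

```python
import torch

eng = torch.quasirandom.SobolEngine(dimension=3, scramble=True)   # no seed passed
print(eng.draw(2))   # differs from run to run, like other random functions
```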
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24884

Test Plan: Unit tests

Reviewed By: sdsingh

Differential Revision: D16912036

Pulled By: Balandat

fbshipit-source-id: eff00cca989926a5d9e20d8846a8674f7cd270cb
2019-08-20 09:22:41 -07:00
Pavel Belevich
6100205eb8 TensorIterator::binary_op input-output overlap check (#24058)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/8212

This fix is based on the idea that in-place ops (e.g. add_(...)) and out ops (e.g. tensor.add(..., out=...)) must check that the output tensor does not partially overlap with any of its input tensors. Otherwise the result of such an op is unexpected to the user. Since TensorIterator is a common backend for such ops and it's already used to check output self-overlapping, this fix is implemented in the same place.

A MemOverlapStatus enum class is introduced to model the overlap state of two tensors:

- TOO_HARD if at least one of them is not contiguous
- FULL if both are contiguous and share exactly the same memory array [data(), data() + numel() * itemsize()]
- PARTIAL if both are contiguous but the underlying memory is shared only partially, in other words the memory arrays overlap but are not identical.
- NO if both are contiguous but have independent, non-overlapping memory arrays
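
A small illustration of the PARTIAL case being rejected:

```python
import torch

base = torch.zeros(10)
a = base[0:6]
b = base[4:10]          # partially overlaps `a` in memory
try:
    a.add_(b)           # output (a) partially overlaps an input (b)
except RuntimeError as e:
    print('rejected:', e)
```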

Performance test of clone/addcmul_/addcdiv_ with check_mem_overlaps:

a = torch.empty(10000000, device='cpu')
b = torch.randn(10000000, device='cpu')
timeit a.copy_(b)
master: 10.3 ms ± 429 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
branch: 10.2 ms ± 946 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

a = torch.empty(10000000, device='cuda')
b = torch.randn(10000000, device='cuda')
timeit a.copy_(b)
master: 373 µs ± 97.9 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
branch: 373 µs ± 120 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

a = torch.randn(1000000, device='cpu')
b = torch.randn(1000000, device='cpu')
c = torch.randn(1000000, device='cpu')
timeit a.addcmul_(b, c)
master: 2.02 ms ± 212 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
branch: 2.11 ms ± 200 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

a = torch.randn(1000000, device='cuda')
b = torch.randn(1000000, device='cuda')
c = torch.randn(1000000, device='cuda')
timeit a.addcmul_(b, c)
master: 72.6 µs ± 627 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
branch:	72.4 µs ± 18.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

a = torch.randn(1000000, device='cpu')
b = torch.randn(1000000, device='cpu')
c = torch.randn(1000000, device='cpu')
timeit a.addcdiv_(b, c)
master: 2.19 ms ± 583 µs per loop (mean ± std. dev. of 7 runs, 1000 loop each)
branch:	1.97 ms ± 125 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

a = torch.randn(1000000, device='cuda')
b = torch.randn(1000000, device='cuda')
c = torch.randn(1000000, device='cuda')
timeit a.addcdiv_(b, c)
master: 71.3 µs ± 1.98 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
branch:	71.7 µs ± 3.96 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

a = torch.empty(100, device='cpu')
b = torch.randn(100, device='cpu')
timeit a.copy_(b)
master: 12.1 µs ± 1.11 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
branch:	11.1 µs ± 61.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

a = torch.empty(100, device='cuda')
b = torch.randn(100, device='cuda')
timeit a.copy_(b)
master: 20.9 µs ± 1.62 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
branch:	22.8 µs ± 2.63 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

a = torch.randn(100, device='cpu')
b = torch.randn(100, device='cpu')
c = torch.randn(100, device='cpu')
timeit a.addcmul_(b, c)
master: 24.1 µs ± 2.7 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
branch:	24 µs ± 91.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

a = torch.randn(100, device='cuda')
b = torch.randn(100, device='cuda')
c = torch.randn(100, device='cuda')
timeit a.addcmul_(b, c)
master: 34.5 µs ± 4.82 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
branch:	29.8 µs ± 496 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

a = torch.randn(100, device='cpu')
b = torch.randn(100, device='cpu')
c = torch.randn(100, device='cpu')
timeit a.addcdiv_(b, c)
master: 21.3 µs ± 210 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
branch:	23.8 µs ± 403 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

a = torch.randn(100, device='cuda')
b = torch.randn(100, device='cuda')
c = torch.randn(100, device='cuda')
timeit a.addcdiv_(b, c)
master: 30.3 µs ± 257 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
branch:	31.8 µs ± 214 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24058

Differential Revision: D16767892

Pulled By: pbelevich

fbshipit-source-id: 0cdaaa471d003a2886b1736f8985842226b8493a
2019-08-19 15:06:04 -07:00
Vishwak Srinivasan
4358cbe01b Allow torch.tril / triu to handle bool and half inputs (#24163)
Summary:
Changelog:
- Enable torch.tril / triu for bool and float16 dtypes
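
Quick check:

```python
import torch

m = torch.ones(3, 3, dtype=torch.bool)
print(torch.tril(m))
h = torch.ones(3, 3, dtype=torch.half)
print(torch.triu(h))
```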
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24163

Test Plan:
- Tests added in test_torch.py for all devices and dtypes (except bfloat16)

Fixes https://github.com/pytorch/pytorch/issues/24035

Differential Revision: D16793315

Pulled By: ezyang

fbshipit-source-id: 2bbc51ce567405a7cb2d8ab567eee6c2e40aa76a
2019-08-19 15:02:53 -07:00
vishwakftw
f849ebf1fe Enable torch.eye for bool and half (#24148)
Summary:
Changelog:
- Enable torch.eye for bool and float16 dtypes
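
Quick check:

```python
import torch

print(torch.eye(3, dtype=torch.bool))
print(torch.eye(3, dtype=torch.half))
```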
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24148

Test Plan:
- Tests added in test_torch.py for all available devices and dtypes (except torch.bfloat16)

Fixes https://github.com/pytorch/pytorch/issues/24088

Differential Revision: D16891048

Pulled By: ezyang

fbshipit-source-id: 3e86fe271bd434300c396e63f82c1a1f3adac2b4
2019-08-19 14:59:37 -07:00
Iurii Zdebskyi
eee3e92936 Enabled torch.mm and torch.mv for bfloat16
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/24224

Test Plan: Imported from OSS

Differential Revision: D16779996

Pulled By: izdeby

fbshipit-source-id: c859d8945a564edfa3f8a1430f140ae30d484d19
2019-08-16 15:46:15 -07:00
Heungsub Hans Lee
e166811598 Documentation for Tensor.record_stream() (#24078)
Summary:
This patch writes documentation for `Tensor.record_stream()`, which is not a documented API currently. I've discussed publishing it with colesbury in https://github.com/pytorch/pytorch/issues/23729.

The documentation is based on [the introduction at `CUDACachingAllocator.cpp`](25d1496d58/c10/cuda/CUDACachingAllocator.cpp (L47-L50)). ~~I didn't explain full details of the life cycle of memory blocks or stream awareness of the allocator for the consistent level of details with other documentations.~~ I explained about the stream awareness in a note block.
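
A sketch of the documented use case (requires CUDA):

```python
import torch

if torch.cuda.is_available():
    side = torch.cuda.Stream()
    x = torch.empty(1 << 20, device='cuda')   # allocated on the current (default) stream
    with torch.cuda.stream(side):
        y = x * 2                              # x is consumed on a different stream
    x.record_stream(side)   # tell the caching allocator x is still in use on `side`
```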
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24078

Differential Revision: D16743526

Pulled By: zou3519

fbshipit-source-id: 05819c3cc96733e2ba93c0a7c0ca06933acb22f3
2019-08-16 08:07:33 -07:00
Hong Xu
5ca612b55e Let logical_xor support non-bool tensors.
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/23978

Test Plan: Imported from OSS

Differential Revision: D16719299

Pulled By: gchanan

fbshipit-source-id: 2fe170be6090733e20410db7cf99266543299c58
2019-08-15 12:21:31 -07:00