Commit Graph

250 Commits

Author SHA1 Message Date
Edward Yang
173f224570 Turn on F401: Unused import warning. (#18598)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18598
ghimport-source-id: c74597e5e7437e94a43c163cee0639b20d0d0c6a

Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#18598 Turn on F401: Unused import warning.**

This was requested by someone at Facebook; this lint is turned
on for Facebook by default.  "Sure, why not."

I had to noqa a number of imports in __init__.  Hypothetically
we're supposed to use __all__ in this case, but I was too lazy
to fix it.  Left for future work.

Be careful!  flake8-2 and flake8-3 behave differently with
respect to import resolution for # type: comments: flake8-3 will
report such an import as unused; flake8-2 will not.  For now, I just
noqa'd all these sites.

All the changes were done by hand.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision: D14687478

fbshipit-source-id: 30d532381e914091aadfa0d2a5a89404819663e3
2019-03-30 09:01:17 -07:00
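A minimal sketch of the two options the message contrasts (the module and names below are hypothetical, not from this PR), assuming a standard flake8 setup:

```python
# mypkg/__init__.py
# Option used here: silence F401 per line for names re-exported as public API.
from .core import Tensor, load  # noqa: F401

# Option left for future work: declare the public surface via __all__ so
# flake8 treats the imports as intentional re-exports.
__all__ = ["Tensor", "load"]
```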
Vishwak Srinivasan
e73be58ff7 Rename btriunpack to lu_unpack (#18529)
Summary:
Changelog:
- Renames `btriunpack` to `lu_unpack` to remain consistent with the `lu` function interface.
- Rename all relevant tests, fix callsites
- Create a tentative alias for `lu_unpack` under the name `btriunpack` and add a deprecation warning to not promote usage.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18529

Differential Revision: D14683161

Pulled By: soumith

fbshipit-source-id: 994287eaa15c50fd74c2f1c7646edfc61e8099b1
2019-03-29 13:01:30 -07:00
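A hedged usage sketch of the renamed function; the unpack order (P, L, U) and the deprecated `btriunpack` alias follow the changelog above:

```python
import torch

A = torch.randn(2, 3, 3)                 # a batch of square matrices
A_LU, pivots = torch.lu(A)               # batched LU factorization
P, L, U = torch.lu_unpack(A_LU, pivots)  # new name
# P, L, U = torch.btriunpack(A_LU, pivots)  # deprecated alias, same result
assert torch.allclose(P @ L @ U, A, atol=1e-5)
```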
Vishwak Srinivasan
d859031ebf Rename btrifact* to lu (#18435)
Summary:
Changelog:

- Renames `btrifact` and `btrifact_with_info` to `lu` to remain consistent with other factorization methods (`qr` and `svd`).
- Now, we will only have one function and method named `lu`, which performs the `lu` decomposition. This function takes a `get_infos` kwarg, which when set to True includes an infos tensor in the tuple.
- Rename all tests, fix callsites
- Create a tentative alias for `lu` under the name `btrifact` and `btrifact_with_info`, and add a deprecation warning to not promote usage.
- Add the single batch version for `lu` so that users don't have to unsqueeze and squeeze for a single square matrix (see changes in determinant computation in `LinearAlgebra.cpp`)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18435

Differential Revision: D14680352

Pulled By: soumith

fbshipit-source-id: af58dfc11fa53d9e8e0318c720beaf5502978cd8
2019-03-29 00:34:30 -07:00
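A hedged sketch of the interface described in the changelog above:

```python
import torch

A = torch.randn(4, 3, 3)
A_LU, pivots = torch.lu(A)                          # batched; get_infos defaults to False
A_LU, pivots, infos = torch.lu(A, get_infos=True)   # infos reports per-matrix success
single = torch.lu(torch.randn(3, 3))                # single-matrix version, no unsqueeze needed
# torch.btrifact / torch.btrifact_with_info remain as deprecated aliases.
```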
jithunnair-amd
fdedc62c26 enable more unit tests (#18537)
Summary:
Enable unit tests working with ROCm 2.3. In particular, these are unit tests that we previously skipped for double data types, plus some tests for multi-GPU setups.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18537

Differential Revision: D14651822

Pulled By: ezyang

fbshipit-source-id: 7dd575504ebe235a91489866c91000e9754b1235
2019-03-27 14:27:23 -07:00
Tongzhou Wang
5292685d2f Improve numerical precision of (s)logdet (#18449)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/18448 and https://github.com/pytorch/pytorch/issues/18450
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18449

Differential Revision: D14611638

Pulled By: soumith

fbshipit-source-id: 4f1f27ab5316a92d2783e734169f599afed743cf
2019-03-26 15:32:14 -07:00
vishwakftw
291746f110 Rename trtrs to triangular_solve (#18213)
Summary:
Changelog:
- Renames `trtrs` to `triangular_solve` to remain consistent with `cholesky_solve` and `solve`.
- Rename all tests, fix callsites
- Create a tentative alias for `triangular_solve` under the name `trtrs`, and add a deprecation warning to not promote usage.
- Move `isnan` to _torch_docs.py
- Remove unnecessary imports
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18213

Differential Revision: D14566902

Pulled By: ezyang

fbshipit-source-id: 544f57c29477df391bacd5de700bed1add456d3f
2019-03-21 14:27:21 -07:00
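A hedged usage sketch of the renamed function (the `+ eye` term only keeps the example well conditioned):

```python
import torch

A = torch.randn(3, 3).triu() + 2 * torch.eye(3)   # upper-triangular coefficient matrix
b = torch.randn(3, 2)
X, M = torch.triangular_solve(b, A, upper=True)   # new name; returns solution and A
# X, M = torch.trtrs(b, A, upper=True)            # deprecated alias
assert torch.allclose(A @ X, b, atol=1e-5)
```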
Vishwak Srinivasan
a519217ee7 Add batched version of trtrs (#18025)
Summary:
- Remove single batch TH/THC implementations
- Remove `_batch_trtrs_lower` from `multivariate_normal`
- Add tests for batched behavior
- Modify trtrs_backward to accommodate the batched case
- Modify docs

In a future PR, this will be renamed to `triangular_solve`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18025

Differential Revision: D14523004

Pulled By: ifedan

fbshipit-source-id: 11c6a967d107f969b60e5a5c73ce6bb8099ebbe1
2019-03-20 11:11:32 -07:00
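A hedged sketch of the batched behavior, using the pre-rename `trtrs` name as in this PR (the `+ eye` term only keeps the systems well conditioned):

```python
import torch

A = torch.randn(5, 3, 3).triu() + 2 * torch.eye(3)   # batch of upper-triangular systems
b = torch.randn(5, 3, 2)
X, _ = torch.trtrs(b, A)                              # solves all 5 systems in one call
assert torch.allclose(torch.matmul(A, X), b, atol=1e-5)
```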
Vishwak Srinivasan
421b508d55 Rename gesv to solve (#18060)
Summary:
Changelog:

- Renames `gesv` to `solve` to remain consistent with `cholesky_solve`.
- Rename all tests, fix callsites
- Create a tentative alias for `solve` under the name `gesv`, and add a deprecation warning to not promote usage.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18060

Differential Revision: D14503117

Pulled By: zou3519

fbshipit-source-id: 99c16d94e5970a19d7584b5915f051c030d49ff5
2019-03-18 16:04:24 -07:00
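A hedged usage sketch of the renamed function:

```python
import torch

A = torch.randn(3, 3)
B = torch.randn(3, 2)
X, LU = torch.solve(B, A)     # new name; returns the solution and the LU factorization
# X, LU = torch.gesv(B, A)    # deprecated alias
assert torch.allclose(A @ X, B, atol=1e-4)
```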
Richard Zou
3c977fb7ce Error out on in-place (unary) ops on tensors that have internal overlap (#17927)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17927
ghimport-source-id: 626d321e430b6b5c0ea3aa1eb9df8c1e2d058bf8

Stack:
* #17926 Implement at::has_internal_overlap helper function
* **#17927 Error out on in-place (unary) ops on tensors that have internal overlap**

On the way to #17935.

Works for CPU and CUDA on the following ops:
- abs_, acos_, asin_, atan_, ceil_, cos_, erf_, erfc_, exp_, expm1_
- floor_, log_, log10_, log1p_, log2_, round_, rsqrt_,
- sin_, sqrt_, tan_, tanh_, trunc_

This PR adds a check to see if the out/result tensor has internal
overlap. If it does, then we error out because the result **may** be
incorrect.

This is overly conservative; there are some cases where if the result is
the same as the input, the inplace operation is OK (such as floor_,
round_, and trunc_). However, the current code isn't organized in such a
way that this is easy to check, so enabling those will come in the future.

Reviewed By: ezyang

Differential Revision: D14438871

fbshipit-source-id: 15e12bf1fdb2ab7f74bb806e22bc74840bd6abd1
2019-03-15 07:50:19 -07:00
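A hedged illustration of the new check (the exact error message is not quoted from this PR):

```python
import torch

# expand() creates a view whose elements alias one another (internal overlap).
t = torch.zeros(1).expand(8)
try:
    t.sin_()                  # in-place unary op writing into an overlapping tensor
except RuntimeError as e:
    print("refused:", e)      # after this change the op errors out instead of
                              # silently producing a possibly incorrect result
```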
J M Dieterich
1ba1ca0acb Update to ROCm2.2 (#18007)
Summary:
ROCm 2.2 was released today. If we respin the CI docker images with the attached changes, PyTorch/Caffe2 will support ROCm 2.2.

Changes necessary:
* for the Ubuntu target, HIP PR 934 needs to be applied to fix the forceinline definition. ROCm 2.3 will contain this.
* two unit tests prove flaky on different platforms; disable them defensively.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18007

Differential Revision: D14473903

Pulled By: bddppq

fbshipit-source-id: b1939f11d1c765a3bf71bb244b15f6ceb0e816d3
2019-03-14 18:47:22 -07:00
vaeksare
40a3e14ade Disable btri tests on Windows if MAGMA is not found (#17989)
Summary:
Fixes #17988
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17989

Reviewed By: ezyang

Differential Revision: D14454571

Pulled By: soumith

fbshipit-source-id: fc39a807a597d3574f4ca4e22cea12194e4693c0
2019-03-14 07:22:55 -07:00
Thomas Viehmann
aba9051a65 kthvalue consistency with sort in the presence of NaN (#17824)
Summary:
This PR causes kthvalue to be consistent with sort
(i.e. treat NaN as larger than any number), so that
`a.kthvalue(n) == a.sort()[n - 1]`.

One drawback is that median with a NaN argument does not return NaN,
which is a deviation from NumPy.

Thank you, ngimel, for raising this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17824

Differential Revision: D14410092

Pulled By: ezyang

fbshipit-source-id: bdec2d8272dc4c65bcf2f9b8995e237774c44c02
2019-03-12 08:49:19 -07:00
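A hedged sketch of the consistency described above:

```python
import torch

a = torch.tensor([1.0, float('nan'), 3.0])
print(a.sort()[0])        # tensor([1., 3., nan]) -- sort treats NaN as largest
print(a.kthvalue(3)[0])   # tensor(nan) -- now agrees with sort
print(a.median())         # a non-NaN value -- deviates from NumPy, as noted above
```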
vishwakftw
9d70e199f4 Move lerp to ATen, add functionality for tensor weights (#17348)
Summary:
Changelog:
- Remove TH/THC bindings
- Add tensor weights for `lerp`
- Modify derivatives appropriately
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17348

Differential Revision: D14355845

Pulled By: soumith

fbshipit-source-id: eaede4c09ee589d77ba6cf52583510ea8e3a2fcf
2019-03-07 14:04:58 -08:00
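A hedged sketch of the new tensor-weight support:

```python
import torch

start = torch.zeros(4)
end = torch.full((4,), 10.0)
print(torch.lerp(start, end, 0.5))   # scalar weight, as before
w = torch.tensor([0.0, 0.25, 0.5, 1.0])
print(torch.lerp(start, end, w))     # new: per-element tensor weight
# -> tensor([ 0.0000,  2.5000,  5.0000, 10.0000])
```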
jwu
8ec7357312 fix different round behavior on CPU and GPU #16498 (#17443)
Summary:
xxtemp, colesbury, bhushan23, zou3519: this converts the GPU round behavior to half-to-even, consistent with the torch CPU version and NumPy. Your feedback is welcome.
See #16498
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17443

Differential Revision: D14261786

Pulled By: VitalyFedyunin

fbshipit-source-id: 98156436b545d72769831a89e2775d43ad913ebc
2019-03-06 19:40:10 -08:00
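A hedged sketch of the unified half-to-even (banker's) rounding, assuming a CUDA device is available:

```python
import torch

x = torch.tensor([0.5, 1.5, 2.5, -0.5, -1.5])
print(x.round())               # tensor([ 0.,  2.,  2., -0., -2.])
print(x.cuda().round().cpu())  # after this fix: the GPU gives the same values
```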
Shen Li
1154506533 Always synchronize src and dst streams when copying tensors (#16966)
Summary:
fixes #15568
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16966

Differential Revision: D14213144

Pulled By: mrshenli

fbshipit-source-id: 2fcf5e07895fde80b4aee72e2736b0def876d21f
2019-02-27 14:57:56 -08:00
Johannes M Dieterich
76828647c1 Enable tests working on ROCm 2.1 dual gfx906
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/17473

Reviewed By: bddppq

Differential Revision: D14210243

Pulled By: ezyang

fbshipit-source-id: 519032a1e73c13ecb260ea93102dc8efb645e070
2019-02-26 20:41:16 -08:00
Shen Li
b527055fcf Restore current streams on dst device after switching streams (#17439)
Summary:
When switching back to `d0` from a stream on a different device `d1`, we need to restore the current streams on both `d0` and `d1`. The current implementation only does that for `d0`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17439

Differential Revision: D14208919

Pulled By: mrshenli

fbshipit-source-id: 89f2565b9977206256efbec42adbd789329ccad8
2019-02-25 12:06:41 -08:00
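A hedged sketch of the semantics being fixed, assuming at least two CUDA devices (stream comparison here relies on `torch.cuda.Stream` equality):

```python
import torch

s1 = torch.cuda.Stream(device='cuda:1')
before0 = torch.cuda.current_stream('cuda:0')
before1 = torch.cuda.current_stream('cuda:1')
with torch.cuda.stream(s1):
    pass  # work queued here runs on s1, with the current device switched to cuda:1
# After this fix, the current streams on BOTH devices are restored on exit.
assert torch.cuda.current_stream('cuda:0') == before0
assert torch.cuda.current_stream('cuda:1') == before1
```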
surgan12
fad9eda7fb Optional arg fixes (#17222)
Summary:
fixes #17210.
cc : ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17222

Differential Revision: D14130833

Pulled By: soumith

fbshipit-source-id: 19ff6020c47208e3436ae28cd16110a0f435b25e
2019-02-19 04:39:18 -08:00
jiej
b5193b6a81 Second PR to restore reverted commit (#16224) (#17040)
Summary:
update:
  1. global_reduce now checks should_block_y_reduce first.
     This avoids enabling global_reduce without block_y_reduce, which could lead to
     accessing shared memory during the global reduce without allocating it.
  2. updated block_y_reduce heuristics; improves perf on tiny tensors
  3. added a test case covering old cases where illegal memory access might occur

  TensorIterator cuda launch configs update (#16224)
    Update launch configs for TensorIterator gpu_reduce_kernel. Enable flexible
    block dimension to improve efficiency for reduction cases with small fast
    dimension.

    Previously TensorIterator launches blocks with fixed 32x16 threads.
    For cases like:

      import torch
      torch.randn(2**20, 4, device='cuda').sum(0)

    The fixed launch config does not handle coalesced memory access efficiently.

    The updated launch config enables flexible block dimensions. Combined with an
    improved reduction scheme (using flexible vertical / horizontal reduction
    instead of the limited warp / block reduction in the old code), it ensures an
    optimal memory access pattern even when reducing over a dimension with small stride.

    Possible future improvements:
    1. Precise dynamic shared memory allocation.
    2. Using warp shuffle for vertical (block_y) reduction.
    Pull Request resolved: https://github.com/pytorch/pytorch/pull/16224
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17040

Differential Revision: D14078295

Pulled By: umanwizard

fbshipit-source-id: ecc55054a5a4035e731f0196d633412225c3b06c
2019-02-14 15:23:01 -08:00
Johannes M Dieterich
3e1e5d5a8b enable unit tests in test_cuda that now pass with ROCm 2.1
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/17012

Differential Revision: D14059761

Pulled By: bddppq

fbshipit-source-id: 8309c3ffe1efed42b5db69fdec26427413c3f224
2019-02-12 17:28:46 -08:00
vishwakftw
0d95028bee Dispatch the correct legacy function for geqrf_out and ormqr_out (#16964)
Summary:
This fixes the segfault.

Changelog:
- Modify the function calls in LegacyDefinitions for `geqrf_out` and `ormqr_out`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16964

Differential Revision: D14025985

Pulled By: gchanan

fbshipit-source-id: aa50e2c1694cbf3642273ee14b09ba12625c7d33
2019-02-12 13:48:51 -08:00
Johannes M Dieterich
23e1c55cc0 enable unit tests working on ROCm 2.1 (#16871)
Summary:
This is the first round of enabling unit tests that work on ROCm 2.1 in my tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16871

Differential Revision: D13997662

Pulled By: bddppq

fbshipit-source-id: d909a3f7dd5fc8f85f126bf0613751c8e4ef949f
2019-02-09 00:30:50 -08:00
vishwakftw
6d86bc7c3f Fix issue with scalars and __rpow__ (#16687)
Summary:
Changelog:

- Modify __rpow__ function in tensor.py to adapt to scalars
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16687

Differential Revision: D13936720

Pulled By: soumith

fbshipit-source-id: b0c8727968b04efbc6e7461807c812d962f03370
2019-02-02 18:55:51 -08:00
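A hedged sketch of the `__rpow__` path this touches:

```python
import torch

t = torch.tensor([1.0, 2.0, 3.0])
print(2 ** t)     # Python scalar base, tensor exponent -> Tensor.__rpow__
# -> tensor([2., 4., 8.])
```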
Jacie Fan
a7796bc24d CUDA histogram implementation
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/15842

Reviewed By: zou3519

Differential Revision: D13868982

Pulled By: jaciefan

fbshipit-source-id: bce81dc121c4538d204047506f8f14d0b4d8f905
2019-01-30 11:36:20 -08:00
Shen Li
7ce634ebc2 Relax lower bound for nogil timing test to avoid false alarm (#16259)
Summary:
fixes #16250, #16271
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16259

Differential Revision: D13784505

Pulled By: mrshenli

fbshipit-source-id: 0b7ad98cd3c018b9907d70158de3abc3c4cb57ef
2019-01-24 17:16:02 -08:00
Shen Li
2235fb256e Add default_stream() and enhance current_stream() (#16200)
Summary:
Closes #16156
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16200

Differential Revision: D13747455

Pulled By: mrshenli

fbshipit-source-id: 00c0d5f341c3ac7a757bdb4631a17e11fbc6d3ec
2019-01-22 14:35:19 -08:00
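A hedged sketch of the two APIs, assuming a CUDA device (both accept an optional device argument):

```python
import torch

d = torch.device('cuda:0')
s_cur = torch.cuda.current_stream(d)   # stream currently selected on d
s_def = torch.cuda.default_stream(d)   # the device's default stream
print(s_cur == s_def)                  # True unless another stream was made current
```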
Shen Li
1c058de9ac Release GIL when synchronize or wait (#16182)
Summary:
address the second future work item in #15937
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16182

Differential Revision: D13744972

Pulled By: mrshenli

fbshipit-source-id: e9812e3fd4a5623e99b639d9f334bfc2d1827d92
2019-01-22 13:29:07 -08:00
Shen Li
898329c3f9 Unify device() return type in Stream, Event, and Tensor (#16150)
Summary:
Addresses one future work item in #15937
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16150

Differential Revision: D13732299

Pulled By: mrshenli

fbshipit-source-id: 4d0b35df573a3bf92dea6e2e7eb42fe8bac77b18
2019-01-19 23:01:31 -08:00
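A hedged sketch of the unified return type, assuming a CUDA device:

```python
import torch

t = torch.zeros(1, device='cuda')
s = torch.cuda.current_stream()
print(t.device, s.device)                  # both report a torch.device
assert isinstance(s.device, torch.device)  # same type as Tensor.device after this change
```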
Shen Li
292edfb087 Change current device in stream context manager if necessary (#16128)
Summary:
Fixes #16019
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16128

Differential Revision: D13721850

Pulled By: mrshenli

fbshipit-source-id: 422c6c0b97c1cd46e127e265b532cb8c74a3aac5
2019-01-18 12:39:51 -08:00
Shen Li
24f4d3987e Move all Stream and Event Python implementation to C++ (#15937)
Summary:
1. Added `torch/csrc/cuda/Event.h` and `torch/csrc/cuda/Event.cpp` to bind the Python Event class to the C++ implementation.
2. Moved all CUDA runtime invocations from `torch/cuda/streams.py` to C++.
3. Added tests to cover the Stream and Event APIs. ~(event IPC handle tests are introduced in #15974)~
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15937

Differential Revision: D13649001

Pulled By: mrshenli

fbshipit-source-id: 84ca58f35f6ba679a4ba33150ceba678d760d240
2019-01-17 07:29:22 -08:00
jiej
7c56db73d5 Moving torch.norm to ATen using TensorIterator (#15414)
Summary:
Adds support in torch.norm for:
i. multiple dimensions for dim
ii. a dtype argument that specifies the math/output tensor type
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15414

Differential Revision: D13702022

Pulled By: ezyang

fbshipit-source-id: da2676f2b6aff988889b1539d0de8ecd4946823a
2019-01-16 22:15:25 -08:00
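A hedged sketch of the two additions, assuming a CUDA device for the half-precision case:

```python
import torch

x = torch.randn(2, 3, 4)
print(torch.norm(x, dim=(1, 2)))           # i. reduce over several dims at once
h = torch.randn(1000, dtype=torch.float16, device='cuda')
print(torch.norm(h, dtype=torch.float32))  # ii. compute/return in float32
```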
Thomas Viehmann
d33e7d1236 multinomial: fix detection of zero probability (#16075)
Summary:
The cumsum over the probabilities may fail to be monotonically
non-decreasing, which makes it hard to detect zero-probability
classes using just the cumsum.
This changes the binary search postprocessing to use the
(non-cumulated) distribution instead.

Thank you, jcjohnson, for the bug report with
reproducing case.

Fixes: #13867
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16075

Differential Revision: D13695565

Pulled By: soumith

fbshipit-source-id: 02c4d6f868f0050c1ae7d333f4317c5610e49cd9
2019-01-16 12:50:49 -08:00
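A hedged sketch of the guarantee being fixed, assuming a CUDA device (the linked issue concerns the CUDA path):

```python
import torch

probs = torch.tensor([0.5, 0.0, 0.5], device='cuda')
samples = torch.multinomial(probs, 10000, replacement=True)
assert (samples != 1).all()   # the zero-probability class must never be drawn
```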
Brennan Vincent
fb68d813be Fix logic errors when accumulating reductions in output (CUDA) (#16023)
Summary:
The correct logic is as follows:

* If there is an earlier split, we need to combine with its result
* If there is *not* a later split, we need to project before saving into the output.

This should partially fix #15837. For example:
```
In [7]: a=torch.ones([1838860800], dtype=torch.float, device="cuda:1")

In [8]: a.mean()
Out[8]: tensor(1., device='cuda:1')
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16023

Differential Revision: D13678449

Pulled By: umanwizard

fbshipit-source-id: ab5078484c88e96bb30121b5cf24a0e8b0a8c2f8
2019-01-15 19:57:57 -08:00
SsnL
300dcc3b96 Add cuda.reset_max_memory_* (#15985)
Summary:
Addresses #15968
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15985

Differential Revision: D13649916

Pulled By: soumith

fbshipit-source-id: a207aea5709a79dba7a6fc541d0a70103f49efff
2019-01-14 07:31:51 -08:00
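A hedged usage sketch, assuming a CUDA device:

```python
import torch

x = torch.randn(1024, 1024, device='cuda')
print(torch.cuda.max_memory_allocated())   # peak since the start of the program
del x
torch.cuda.reset_max_memory_allocated()    # new: restart peak tracking from "now"
torch.cuda.reset_max_memory_cached()       # same for the caching allocator's peak
y = torch.randn(16, device='cuda')
print(torch.cuda.max_memory_allocated())   # peak within the new measurement window
```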
vishwakftw
b4c3268b23 Batched upper triangular, lower triangular (#15257)
Summary:
Changelog:

- Implements `triu` and `tril` for batches of 2D tensors.
- Remove TH/THC binding for `tril`
- Fix CUDA implementation
- Update docstrings for tril and triu.
- Remove mask-based `triu` and `tril` in cholesky forward and backward.
- Remove batched tril in torch.distributions.utils
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15257

Differential Revision: D13613888

Pulled By: mrshenli

fbshipit-source-id: 0949a05b9b8e974c1acfaf02a6284848ec5cc1c4
2019-01-09 19:46:39 -08:00
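A hedged sketch of the batched behavior:

```python
import torch

x = torch.randn(5, 4, 4)           # a batch of five 4x4 matrices
lower = torch.tril(x)              # applied independently to each matrix
upper = torch.triu(x, diagonal=1)
print(lower.shape, upper.shape)    # torch.Size([5, 4, 4]) twice
```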
Shen Li
7b9f794580 Wrap C10 CUDAStream instead of cudaStream_t in THCPStream
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/15833

Differential Revision: D13608337

Pulled By: mrshenli

fbshipit-source-id: 4c66ef89fad0dc14a11ddb69da92907797cd2828
2019-01-09 15:12:48 -08:00
Shen Li
1e9a6d7192 A quick fix for Stream operation errors on non-current device (#15689)
Summary:
see #15682

This is a quick fix implementing the simpler solution suggested by colesbury. As the benchmark results show, it slows down `Stream.query()` by ~20%. I would be happy to pursue a more complex solution by implementing this in C++/ATen, but I would still vote for merging this quick fix first just to get rid of the bug sooner.

~Test TBA~ Added

FYI jeffreyksmithjr

now

```python
In [1]: def f():
   ...:     d0 = torch.device('cuda:0')
   ...:     d1 = torch.device('cuda:1')
   ...:     with torch.cuda.device(d0):
   ...:         s0 = torch.cuda.current_stream()
   ...:     with torch.cuda.device(d1):
   ...:         s1 = torch.cuda.current_stream()
   ...:     s0.query()
   ...:     s1.query()

In [4]: %timeit f()
38.1 µs ± 4.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [5]: %timeit f()
37.6 µs ± 2.7 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```

before

```python
In [4]: %timeit f()
28.5 µs ± 1.74 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [5]: %timeit f()
35.3 µs ± 2.91 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15689

Differential Revision: D13571697

Pulled By: mrshenli

fbshipit-source-id: 4fe697f91248c6419136d37bb5b7147e612e2f4c
2019-01-03 15:14:58 -08:00
Natalia Gimelshein
e2549cbc01 initialize with ident value in global reduction (#15653)
Summary:
Fixes #15647. cc colesbury.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15653

Differential Revision: D13571132

Pulled By: soumith

fbshipit-source-id: 8f25943c974b3b931f4528e0e0a370bc095dab51
2019-01-02 19:52:57 -08:00
surgan12
b52420742d clamp fixes (#15479)
Summary: fix to #15338 .

Differential Revision: D13564343

Pulled By: soumith

fbshipit-source-id: be64b572945533e10ae6f627d335b47f093720a3
2019-01-01 23:12:17 -08:00
vishwakftw
7bb41e3953 Make btriunpack work for high dimensional batches and faster than before (#15286)
Summary:
Changelog:
- Optimize btriunpack by using `torch.where` instead of indexing, using in-place operations instead of out-of-place operations, and avoiding costly permutations by computing the final permutation over a list.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15286

Differential Revision: D13562038

Pulled By: soumith

fbshipit-source-id: e2c94cfab5322bf1d24bf56d7b056619f553acc6
2018-12-30 12:42:07 -08:00
Vishwak Srinivasan
9c8d8eab9d Remove TH/THC link for gesv (#15510)
Summary:
This PR removes the TH/THC binding for gesv.

Changelog:
- Remove TH/THC binding
- Port single matrix case to ATen
- Enable test_gesv for CUDA as well
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15510

Differential Revision: D13559990

Pulled By: soumith

fbshipit-source-id: 9da2825e94d3103627e719709e6b1f8b521a07fb
2018-12-28 16:54:27 -08:00
Frank Zhang
d4712ee218 Added correct isinf handling for Integral tensors (#15489)
Summary:
Currently, torch.isinf on an integral tensor raises `RuntimeError: value cannot be converted to type int16_t without overflow: inf`.
This PR suppresses the error and returns False (0) for all integral tensors, making the behavior consistent with np.isinf.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15489

Reviewed By: zou3519

Differential Revision: D13540786

Pulled By: flashhack

fbshipit-source-id: e730dea849da6a59f3752d347bcfbadfd12c6483
2018-12-26 06:36:09 -08:00
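A hedged sketch of the new behavior (the output dtype shown is the ByteTensor mask of this era):

```python
import torch

t = torch.tensor([1, 2, 3], dtype=torch.int16)
print(torch.isinf(t))   # tensor([0, 0, 0], dtype=torch.uint8) -- no RuntimeError anymore
```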
Shen Li
06a7cb5901 Implementing cuda kernel for tril_indices and triu_indices (#15203)
Summary:
Followup PR of #14904, and the stretch goal of #12653.

Directly calculate coordinates in the original tensor using the column index in the result tensor. Every GPU thread takes care of one column (two numbers) in the output tensor.

The implementation detects and handles precision loss when calculating the square root of an `int64_t` variable, and supports tensors with up to `row * column = 2 ^ 59` numbers.

Algorithm details are described in [comments of TensorFactories.cu](23ddb6f58a/aten/src/ATen/native/cuda/TensorFactories.cu (L109-L255)).

zou3519
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15203

Reviewed By: zou3519

Differential Revision: D13517695

Pulled By: mrshenli

fbshipit-source-id: 86b305d22cac08c8962a3b0cf8e9e620b7ec33ea
2018-12-20 10:23:38 -08:00
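A hedged usage sketch on the GPU, assuming a CUDA device:

```python
import torch

idx = torch.tril_indices(4, 4, device='cuda')  # computed by the new CUDA kernel
print(idx.shape)   # torch.Size([2, 10]): row 0 holds row indices, row 1 column indices
```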
vishwakftw
41e7e1bc40 Rename potrs to cholesky_solve (#15334)
Summary:
Changelog:
- Renames `potrs` to `cholesky_solve` to remain consistent with Tensorflow and Scipy (not really, they call their function chol_solve)
- The default argument for upper in cholesky_solve is False. This allows a seamless interface between `cholesky` and `cholesky_solve`, since the `upper` argument behaves the same in both functions.
- Rename all tests
- Create a tentative alias for `cholesky_solve` under the name `potrs`, and add deprecated warning to not promote usage.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15334

Differential Revision: D13507724

Pulled By: soumith

fbshipit-source-id: b826996541e49d2e2bcd061b72a38c39450c76d0
2018-12-19 12:31:24 -08:00
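A hedged usage sketch of the renamed function (the matrix construction just guarantees positive definiteness):

```python
import torch

A = torch.randn(3, 3)
A = A @ A.t() + 3 * torch.eye(3)      # positive-definite matrix
b = torch.randn(3, 2)
u = torch.cholesky(A)                 # lower-triangular factor (upper=False by default)
x = torch.cholesky_solve(b, u)        # new name; upper=False matches cholesky's default
# x = torch.potrs(b, u, upper=False)  # deprecated alias
assert torch.allclose(A @ x, b, atol=1e-4)
```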
Jie
bd958cde68 [TensorIterator fixing mean to output correct result for half precisi… (#14878)
Summary:
…on](#12115)

mean is calculated in two steps, sum()/numel(). For half precision, data gets
cast back to half after sum().
We fused the division into the reduction kernel by adding pre_op/post_op.

This allows torch.ones(65536).cuda().half().mean() to return the correct
result.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14878

Differential Revision: D13491159

Pulled By: soumith

fbshipit-source-id: e83802e1628b6d2615c45e18d7acf991d143a09e
2018-12-17 20:13:30 -08:00
Chaitanya Sri Krishna Lolla
9f1d8f2eeb enabled tests in test_nn, test_cuda and test_sparse (#15232)
Summary:
tests work on ROCm 1.9.2 as present on CI (fp16 bringup, hipMemset and sparse improvements)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15232

Differential Revision: D13470991

Pulled By: bddppq

fbshipit-source-id: 45acc4f9ea5baaaf7672b86eb022948055779925
2018-12-14 14:27:57 -08:00
Shen Li
90f9e8103c Implement torch.tril_indices and torch.triu_indices (#12653) (#14904)
Summary:
This is an optimized implementation that does the following:

1. create an empty Tensor of the correct size.
2. fill the Tensor with the correct values.

The following three designs to fill in the Tensor result in roughly the same performance. Hence, the 2nd option is taken for simpler code, and to return contiguous tensors.

1. Sequential: fill row coordinates first, then columns. This results in two for-loops and more arithmetic operations.
2. Interleaved: fill in index coordinates one by one, which jumps between the two output Tensor rows in every iteration.
3. Transpose: create a n X 2 Tensor, fill the Tensor sequentially, and then transpose it.


NOTE:

This implementation returns a 2D tensor, instead of a tuple of two tensors. It means that users will not be able to do the following:

```python
x = torch.ones(3, 3)
i = torch.tril_indices(3, 3)
x[i]  # need to first convert the 2D tensor into a tuple of two 1D tensors.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14904

Reviewed By: zou3519

Differential Revision: D13433027

Pulled By: mrshenli

fbshipit-source-id: 41c876aafcf584832d7069f7c5929ffb59e0ae6a
2018-12-12 15:40:14 -08:00
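A hedged sketch of the conversion mentioned in the code comment above (split the 2D index tensor into two 1D index tensors):

```python
import torch

x = torch.ones(3, 3)
i = torch.tril_indices(3, 3)
vals = x[i[0], i[1]]    # advanced indexing with two 1D tensors
print(vals.shape)       # torch.Size([6])
```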
SsnL
fab8085111 _get_device_index supports parsing device strings
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14929

Reviewed By: weiyangfb

Differential Revision: D13394498

Pulled By: soumith

fbshipit-source-id: 948c6118abdf6c1e1a8a17709333954cafb2345e
2018-12-09 21:12:46 -08:00
Johannes M Dieterich
52942e1f09 Enable unit tests known to work on ROCm (#14011)
Summary:
* Enable unit tests known to work on ROCm.
* Disable a few that are known to be flaky for the time being.
* Use std::abs for Half
* No more special casing for ROCm in TensorMathReduce
* Document an important detail for a hardcoded block size w.r.t. ROCm in TensorMathReduce

ezyang bddppq for awareness
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14011

Differential Revision: D13387679

Pulled By: bddppq

fbshipit-source-id: 4177f2a57b09d866ccbb82a24318f273e3292f71
2018-12-07 18:57:32 -08:00
Jie
d2fdc33411 (#14580)
Summary:
Removes the cast of half to float in torch.sum when given a float16 input tensor and
a float32 output tensor; instead we cast the data when loading the input in the kernel.

This supposedly saves a kernel launch as well as a full global memory load
of the promoted data type (float).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14580

Differential Revision: D13356203

Pulled By: ezyang

fbshipit-source-id: 85e91225b880a65fe3ceb493371b9b36407fdf48
2018-12-06 09:03:46 -08:00
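A hedged sketch of the pattern this optimizes (half input accumulated into a float32 output), assuming a CUDA device:

```python
import torch

x = torch.randn(2 ** 16, device='cuda', dtype=torch.float16)
s = torch.sum(x, dtype=torch.float32)   # input stays half; the cast now happens on load
print(s.dtype)                          # torch.float32
```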