Commit Graph

219 Commits

Author SHA1 Message Date
Thomas Viehmann
d33e7d1236 multinomial: fix detection of zero probability (#16075)
Summary:
The cumsum over the probabilities is not guaranteed to be
monotonically non-decreasing, so zero-probability classes
are hard to detect using just the cumsum.
This changes the binary search postprocessing to use the
(non-cumulated) distribution instead.
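A minimal pure-Python sketch of the idea (not the actual CUDA postprocessing): consult the raw distribution, not the cumsum, when deciding whether a class is selectable.

```python
def sample_index(probs, u):
    # Pick the first class whose cumulative mass reaches u * total,
    # but skip zero-probability classes by checking the raw
    # (non-cumulated) distribution -- duplicated or non-monotonic
    # cumsum entries can otherwise land on a zero-probability class.
    total = sum(probs)
    acc = 0.0
    last_nonzero = None
    for i, p in enumerate(probs):
        acc += p
        if p > 0:
            last_nonzero = i
            if acc >= u * total:
                return i
    return last_nonzero
```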

Thank you, jcjohnson, for the bug report with
reproducing case.

Fixes: #13867
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16075

Differential Revision: D13695565

Pulled By: soumith

fbshipit-source-id: 02c4d6f868f0050c1ae7d333f4317c5610e49cd9
2019-01-16 12:50:49 -08:00
Brennan Vincent
fb68d813be Fix logic errors when accumulating reductions in output (CUDA) (#16023)
Summary:
The correct logic is as follows:

* If there is an earlier split, we need to combine with its result
* If there is *not* a later split, we need to project before saving into the output.

This should partially fix #15837. For example:
```
In [7]: a=torch.ones([1838860800], dtype=torch.float, device="cuda:1")

In [8]: a.mean()
Out[8]: tensor(1., device='cuda:1')
```
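The two rules can be sketched in plain Python, with a chunked loop standing in for the CUDA splits (a sketch, not the kernel code):

```python
def chunked_mean(xs, chunk=4):
    # Keep a running (sum, count) accumulator across splits ("combine
    # with the earlier split's result"), and only project -- i.e.
    # divide by the count -- when there is no later split.
    total, count = 0.0, 0
    for start in range(0, len(xs), chunk):
        part = xs[start:start + chunk]
        total += sum(part)
        count += len(part)
    return total / count  # projection happens exactly once, at the end
```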
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16023

Differential Revision: D13678449

Pulled By: umanwizard

fbshipit-source-id: ab5078484c88e96bb30121b5cf24a0e8b0a8c2f8
2019-01-15 19:57:57 -08:00
SsnL
300dcc3b96 Add cuda.reset_max_memory_* (#15985)
Summary:
Addresses #15968
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15985

Differential Revision: D13649916

Pulled By: soumith

fbshipit-source-id: a207aea5709a79dba7a6fc541d0a70103f49efff
2019-01-14 07:31:51 -08:00
vishwakftw
b4c3268b23 Batched upper triangular, lower triangular (#15257)
Summary:
Changelog:

- Implements `triu` and `tril` for batches of 2D tensors.
- Remove TH/THC binding for `tril`
- Fix CUDA implementation
- Update docstrings for tril and triu.
- Remove mask-based `triu` and `tril` in cholesky forward and backward.
- Remove batched tril in torch.distributions.utils
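The batched semantics can be sketched over nested lists (a reference model, not the implementation):

```python
def batched_tril(batch, diagonal=0):
    # Apply tril independently to each 2D matrix in the batch: keep
    # entries with column - row <= diagonal, zero out the rest.
    return [
        [[v if j - i <= diagonal else 0 for j, v in enumerate(row)]
         for i, row in enumerate(mat)]
        for mat in batch
    ]
```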
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15257

Differential Revision: D13613888

Pulled By: mrshenli

fbshipit-source-id: 0949a05b9b8e974c1acfaf02a6284848ec5cc1c4
2019-01-09 19:46:39 -08:00
Shen Li
7b9f794580 Wrap C10 CUDAStream instead of cudaStream_t in THCPStream
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/15833

Differential Revision: D13608337

Pulled By: mrshenli

fbshipit-source-id: 4c66ef89fad0dc14a11ddb69da92907797cd2828
2019-01-09 15:12:48 -08:00
Shen Li
1e9a6d7192 A quick fix for Stream operation errors on non-current device (#15689)
Summary:
see #15682

This is a quick fix implementing the simpler solution suggested by colesbury. As the benchmark result shows, it slows down `Stream.query()` by ~20%. I would be happy to pursue a more complex solution by implementing this in C++/ATen, but I would still vote to merge this quick fix first, just to get rid of the bug sooner.

~Test TBA~ Added

FYI jeffreyksmithjr

now

```python
In [1]: def f():
   ...:     d0 = torch.device('cuda:0')
   ...:     d1 = torch.device('cuda:1')
   ...:     with torch.cuda.device(d0):
   ...:         s0 = torch.cuda.current_stream()
   ...:     with torch.cuda.device(d1):
   ...:         s1 = torch.cuda.current_stream()
   ...:     s0.query()
   ...:     s1.query()

In [4]: %timeit f()
38.1 µs ± 4.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [5]: %timeit f()
37.6 µs ± 2.7 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```

before

```python
In [4]: %timeit f()
28.5 µs ± 1.74 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [5]: %timeit f()
35.3 µs ± 2.91 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15689

Differential Revision: D13571697

Pulled By: mrshenli

fbshipit-source-id: 4fe697f91248c6419136d37bb5b7147e612e2f4c
2019-01-03 15:14:58 -08:00
Natalia Gimelshein
e2549cbc01 initialize with ident value in global reduction (#15653)
Summary:
Fixes #15647. cc colesbury.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15653

Differential Revision: D13571132

Pulled By: soumith

fbshipit-source-id: 8f25943c974b3b931f4528e0e0a370bc095dab51
2019-01-02 19:52:57 -08:00
surgan12
b52420742d clamp fixes (#15479)
Summary: fix to #15338 .

Differential Revision: D13564343

Pulled By: soumith

fbshipit-source-id: be64b572945533e10ae6f627d335b47f093720a3
2019-01-01 23:12:17 -08:00
vishwakftw
7bb41e3953 Make btriunpack work for high dimensional batches and faster than before (#15286)
Summary:
Changelog:
- Optimize btriunpack by using `torch.where` instead of indexing, inplace operations instead of out place operations and avoiding costly permutations by computing the final permutation over a list.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15286

Differential Revision: D13562038

Pulled By: soumith

fbshipit-source-id: e2c94cfab5322bf1d24bf56d7b056619f553acc6
2018-12-30 12:42:07 -08:00
Vishwak Srinivasan
9c8d8eab9d Remove TH/THC link for gesv (#15510)
Summary:
This PR removes the TH/THC binding for gesv.

Changelog:
- Remove TH/THC binding
- Port single matrix case to ATen
- Enable test_gesv for CUDA as well
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15510

Differential Revision: D13559990

Pulled By: soumith

fbshipit-source-id: 9da2825e94d3103627e719709e6b1f8b521a07fb
2018-12-28 16:54:27 -08:00
Frank Zhang
d4712ee218 Added correct isinf handling for Integral tensors (#15489)
Summary:
Currently, torch.isinf on an integral tensor raises `RuntimeError: value cannot be converted to type int16_t without overflow: inf`.
This PR suppresses the error and returns false (0) for all integral tensors. The behavior is also consistent with np.isinf.
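A sketch of the intended semantics over plain lists (the `integral` flag is a stand-in for dtype dispatch):

```python
import math

def isinf(values, integral=False):
    # Integral values can never be inf, so short-circuit to all-False
    # instead of converting float('inf') to an integer type (which
    # previously overflowed and raised a RuntimeError).
    if integral:
        return [False] * len(values)
    return [math.isinf(v) for v in values]
```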
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15489

Reviewed By: zou3519

Differential Revision: D13540786

Pulled By: flashhack

fbshipit-source-id: e730dea849da6a59f3752d347bcfbadfd12c6483
2018-12-26 06:36:09 -08:00
Shen Li
06a7cb5901 Implementing cuda kernel for tril_indices and triu_indices (#15203)
Summary:
Followup PR of #14904, and the stretch goal of #12653.

Directly calculate coordinates in the original tensor using column index in the result tensor. Every GPU thread takes care of a column (two numbers) in the output tensor.

The implementation detects and handles precision loss during calculating the square root of a `int64_t` variable, and supports tensors with up to `row * column = 2 ^ 59` numbers.

Algorithm details are described in [comments of TensorFactories.cu](23ddb6f58a/aten/src/ATen/native/cuda/TensorFactories.cu (L109-L255)).
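The core index arithmetic can be sketched in Python as a simplified model of the per-thread computation, with an integer correction loop standing in for the precision-loss handling:

```python
import math

def tril_coord(k):
    # Map the k-th entry of a lower triangle (row-major within the
    # triangle) to (row, col). Row r starts at linear offset
    # r*(r+1)/2, so r is roughly the inverse of that quadratic.
    r = int((math.sqrt(8.0 * k + 1) - 1) / 2)
    # Correct for floating-point error in the square root.
    while (r + 1) * (r + 2) // 2 <= k:
        r += 1
    while r * (r + 1) // 2 > k:
        r -= 1
    return r, k - r * (r + 1) // 2
```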

zou3519
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15203

Reviewed By: zou3519

Differential Revision: D13517695

Pulled By: mrshenli

fbshipit-source-id: 86b305d22cac08c8962a3b0cf8e9e620b7ec33ea
2018-12-20 10:23:38 -08:00
vishwakftw
41e7e1bc40 Rename potrs to cholesky_solve (#15334)
Summary:
Changelog:
- Renames `potrs` to `cholesky_solve` to remain consistent with Tensorflow and Scipy (not really, they call their function chol_solve)
- The default argument for `upper` in `cholesky_solve` is False. This allows a seamless interface between `cholesky` and `cholesky_solve`, since the `upper` argument in both functions is the same.
- Rename all tests
- Create a tentative alias for `cholesky_solve` under the name `potrs`, and add deprecated warning to not promote usage.
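What `cholesky_solve` computes, as a small reference over nested lists (assuming a lower factor, i.e. `upper=False`):

```python
def cholesky_solve(b, L):
    # Solve A x = b given A = L L^T (L lower-triangular), by forward
    # substitution (L y = b) then back substitution (L^T x = y).
    n = len(L)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][j] * y[j] for j in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[j][i] * x[j] for j in range(i + 1, n))) / L[i][i]
    return x
```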
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15334

Differential Revision: D13507724

Pulled By: soumith

fbshipit-source-id: b826996541e49d2e2bcd061b72a38c39450c76d0
2018-12-19 12:31:24 -08:00
Jie
bd958cde68 [TensorIterator] fixing mean to output correct result for half precision (#14878)
Summary:
See #12115.

mean is calculated in two steps, sum()/numel(). For half precision, data gets
cast back to half after sum().
We fused the division into the reduction kernel by adding pre_op/post_op.

This allows us to do torch.ones(65536).cuda().half().mean() to return correct
result.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14878

Differential Revision: D13491159

Pulled By: soumith

fbshipit-source-id: e83802e1628b6d2615c45e18d7acf991d143a09e
2018-12-17 20:13:30 -08:00
Chaitanya Sri Krishna Lolla
9f1d8f2eeb enabled tests in test_nn, test_cuda and test_sparse (#15232)
Summary:
tests work on ROCm 1.9.2 as present on CI (fp16 bringup, hipMemset and sparse improvements)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15232

Differential Revision: D13470991

Pulled By: bddppq

fbshipit-source-id: 45acc4f9ea5baaaf7672b86eb022948055779925
2018-12-14 14:27:57 -08:00
Shen Li
90f9e8103c Implement torch.tril_indices and torch.triu_indices (#12653) (#14904)
Summary:
This is an optimized implementation that does the following:

1. created an empty Tensor of correct size.
2. fill the Tensor with correct values.

The following three designs for filling in the Tensor result in roughly the same performance. Hence, the 2nd option is chosen, for simpler code and to return contiguous tensors.

1. Sequential: fill row coordinates first, then columns. This results in two for-loops and more arithmetic operations.
2. Interleaved: fill in index coordinates one by one, which jumps between the two output Tensor rows in every iteration.
3. Transpose: create an n x 2 Tensor, fill the Tensor sequentially, and then transpose it.

<img width="352" alt="screen shot 2018-12-10 at 3 54 39 pm" src="https://user-images.githubusercontent.com/16999635/49769172-07bd3580-fc94-11e8-8164-41839185e9f9.png">

NOTE:

This implementation returns a 2D tensor, instead of a tuple of two tensors. It means that users will not be able to do the following:

```python
x = torch.ones(3, 3)
i = torch.tril_indices(3, 3)
x[i]  # need to first convert the 2D tensor into a tuple of two 1D tensors.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14904

Reviewed By: zou3519

Differential Revision: D13433027

Pulled By: mrshenli

fbshipit-source-id: 41c876aafcf584832d7069f7c5929ffb59e0ae6a
2018-12-12 15:40:14 -08:00
SsnL
fab8085111 _get_device_index supports parsing device strings
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14929

Reviewed By: weiyangfb

Differential Revision: D13394498

Pulled By: soumith

fbshipit-source-id: 948c6118abdf6c1e1a8a17709333954cafb2345e
2018-12-09 21:12:46 -08:00
Johannes M Dieterich
52942e1f09 Enable unit tests known to work on ROCm (#14011)
Summary:
* Enable unit tests known to work on ROCm.
* Disable a few that are known to be flaky for the time being.
* Use std::abs for Half
* No more special casing for ROCm in TensorMathReduce
* Document an important detail for a hardcoded block size w.r.t. ROCm in TensorMathReduce

ezyang bddppq for awareness
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14011

Differential Revision: D13387679

Pulled By: bddppq

fbshipit-source-id: 4177f2a57b09d866ccbb82a24318f273e3292f71
2018-12-07 18:57:32 -08:00
Jie
d2fdc33411 (#14580)
Summary:
Removes the cast of half to float in torch.sum with a float16 input tensor and
float32 output tensor; instead, we cast the data when loading the input in the kernel.

This should save a kernel launch as well as a full global memory load
of the promoted data type (float).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14580

Differential Revision: D13356203

Pulled By: ezyang

fbshipit-source-id: 85e91225b880a65fe3ceb493371b9b36407fdf48
2018-12-06 09:03:46 -08:00
Francisco Massa
2d958b7f77 Storage.clone maintains original device (#14751)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/14673

As pointed out by vishwakftw, the root cause of the `deepcopy` issue was that `storage.clone()` would create a new storage on the default device.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14751

Reviewed By: soumith

Differential Revision: D13323061

Pulled By: fmassa

fbshipit-source-id: bfe46ebd78f0b6cd9518c11d09de7849282ed2a2
2018-12-05 08:33:56 -08:00
Roy Li
c03851e93a remove copy_wrapper (#13937)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13937

We can now replace s_copy_ with our new _copy_ function. Experimented with moving s_copy_ out of VariableManualType.cpp, but seemed like there was enough special casing to warrant it staying.

Reviewed By: ezyang

Differential Revision: D13053648

fbshipit-source-id: e9e04d460baf4ee49b500212cf91b95221acd769
2018-11-30 11:12:59 -08:00
Sam Gross
006505bb8f Speed-up "advanced" indexing operations (#13420)
Summary:
This speeds up "advanced" indexing (indexing a tensor by a tensor)
on CPU and GPU. There's still a bunch of work to do, including
speeding up indexing by a byte (boolean) mask and speeding up the derivative
calculation for advanced indexing.

Here's some speed comparisons to indexing on master using a little [benchmark script](https://gist.github.com/colesbury/c369db72aad594e5e032c8fda557d909) with 16 OpenMP threads and on a P100. The test cases are listed as (input shape -> output shape).

| Test case             | CPU (old vs. new)   | CUDA (old vs. new)     |
|-----------------------|---------------------|------------------------|
| 1024x1024 -> 512x1024 | 225 us vs. **57 us**  | 297 us vs. **47 us** |
| 1024x1024 -> 1024x512 | 208 us vs. **153 us** | 335 us vs. **54 us** |
| 50x50 -> 20000x50     | 617 us vs. **77 us**  | 239 us vs. **54 us** |
| 50x50 -> 50x20000     | 575 us vs. **236 us** | 262 us vs. **58 us** |
| 2x5x10 -> 10          | 65 us  vs. **18 us**  | 612 us vs. **93 us** |

See #11647
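For reference, the first table row corresponds to gathering whole rows by an index tensor; over plain lists the operation looks like:

```python
def gather_rows(matrix, idx):
    # t[idx] with a 1-D index "tensor": select whole rows, so a
    # 1024x1024 input with 512 indices yields a 512x1024 output.
    return [list(matrix[i]) for i in idx]
```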
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13420

Reviewed By: soumith

Differential Revision: D13088936

Pulled By: colesbury

fbshipit-source-id: 0a5c2ee9aa54e15f96d06692d1694c3b24b924e2
2018-11-27 15:23:59 -08:00
Your Name
07a8a730af Print warning when ROCm memory leaking is detected in pytorch tests (#14151)
Summary:
We keep seeing random failures in CI because of ROCm memory leaking, e.g:

https://ci.pytorch.org/jenkins/job/pytorch-builds/job/py2-clang7-rocmdeb-ubuntu16.04-test/3102//console
https://ci.pytorch.org/jenkins/job/pytorch-builds/job/py2-clang7-rocmdeb-ubuntu16.04-test/3080//console

To make the CI more stable, turn it into a warning instead of a failure.

iotamudelta please help investigating the memory leaking
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14151

Differential Revision: D13115096

Pulled By: bddppq

fbshipit-source-id: a13b68274ecba363d9d8436aa6a62ac40a77d78c
2018-11-18 00:11:44 -08:00
vishwakftw
a30ade1139 Batched cholesky decomposition (#14017)
Summary:
Implements batching for the Cholesky decomposition.

Performance could be improved with a dedicated batched `tril` and `triu` op, which is also impeding autograd operations.

Changes made:
- batching code
- tests in `test_torch.py`, `test_cuda.py` and `test_autograd.py`.
- doc string modification
- autograd modification
- removal of `_batch_potrf` in `MultivariateNormal`.
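The per-matrix decomposition being batched is the textbook Cholesky; a plain-Python reference for one matrix (batching maps this over the leading dimensions):

```python
import math

def cholesky(A):
    # Lower-triangular factor L with A = L L^T, for one symmetric
    # positive-definite matrix given as nested lists.
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L
```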
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14017

Differential Revision: D13087945

Pulled By: ezyang

fbshipit-source-id: 2386db887140295475ffc247742d5e9562a42f6e
2018-11-17 10:49:15 -08:00
Sam Gross
c3680e2b19 Fix sum() on fp16 (#13926)
Summary:
The size of the shared and global memory buffers were incorrect for float16.
They were sized based on float16 elements, but the buffers store intermediate
float32 values.

Fixes #13909
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13926

Differential Revision: D13048334

Pulled By: colesbury

fbshipit-source-id: 5a07df53f1152d5920258e91ed3f1e1de89b29e1
2018-11-13 16:50:36 -08:00
Richard Zou
e43fb1d26d Fix cuda out of memory test (#13864)
Summary:
torch.randn(big_number_here, dtype=torch.int8) is wrong because randn
isn't implemented for torch.int8. I've changed it to use torch.empty
instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13864

Differential Revision: D13032130

Pulled By: zou3519

fbshipit-source-id: d157b651b47b8bd736f3895cc242f07de4c1ea12
2018-11-13 07:30:30 -08:00
Johannes M Dieterich
ce48958606 enable more unit tests (#13166)
Summary:
This enables the distributions and utils test sets for ROCm.
Individual tests are enabled that now pass due to fixes in HIP/HCC/libraries versions in white rabbit.

For attention: bddppq ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13166

Differential Revision: D12814759

Pulled By: bddppq

fbshipit-source-id: ea70e775c707d7a8d2776fede6154a755adef43e
2018-11-12 18:49:52 -08:00
Vishwak Srinivasan
7b2fb012a8 Make potrs batched (#13453)
Summary:
- This is a straightforward PR, building on the batch inverse PR, except for one change:
  - The GENERATE_LINALG_HELPER_n_ARGS macro has been removed, since it is not very general and the resulting code is actually not very copy-pasty.

Billing of changes:
- Add batching for `potrs`
- Add relevant tests
- Modify doc string

Minor changes:
- Remove `_gesv_single`, `_getri_single` from `aten_interned_strings.h`.
- Add test for CUDA `potrs` (2D Tensor op)
- Move the batched shape checking to `LinearAlgebraUtils.h`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13453

Reviewed By: soumith

Differential Revision: D12942039

Pulled By: zou3519

fbshipit-source-id: 1b8007f00218e61593fc415865b51c1dac0b6a35
2018-11-09 15:16:26 -08:00
Sam Gross
014ea1e1f8 Improve CUDA out-of-memory error message (#13751)
Summary:
```
The new error message now looks like (from Python):

  RuntimeError: CUDA out of memory. Tried to allocate 16.00 GiB (GPU 0; 11.93 GiB total capacity; 4.00 GiB already allocated; 7.33 GiB free; 179.00 KiB cached)

Summary of terms:

  "total capacity": total global memory on GPU
  "already allocated": memory allocated by the program using the
                       caching allocator
  "free": free memory as reported by the CUDA API
  "cached": memory held by the allocator but not used by the program

  The "allocated" amount  does not include memory allocated outside
  of the caching allocator, such as memory allocated by other programs
  or memory held by the driver.

  The sum of "allocated" + "free" + "cached" may be less than the
  total capacity due to memory held by the driver and usage by other
  programs.

  Note that at this point cuda_malloc_retry has already returned all
  possible "cached" memory to the driver. The only remaining "cached"
  memory is split from a larger block that is partially in-use.
```

This also fixes an issue where on out-of-memory could cause an unrelated subsequent CUDA kernel launch to fail because `cudaGetLastError()` was not cleared.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13751

Differential Revision: D13007177

Pulled By: colesbury

fbshipit-source-id: ea7121461b3f2a34646102959b45bde19f2fabab
2018-11-09 14:33:28 -08:00
vishwakftw
0a090fe60a Fix torch.dist for infinity, zero and minus infinity norms (#13713)
Summary: Fixes #13559
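The special cases being fixed, sketched as a reference over plain lists (a model of the intended semantics, not the ATen code):

```python
def dist(a, b, p):
    # p-norm of (a - b) with the edge cases: inf -> max abs diff,
    # -inf -> min abs diff, 0 -> count of nonzero diffs.
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if p == float("inf"):
        return float(max(diffs))
    if p == float("-inf"):
        return float(min(diffs))
    if p == 0:
        return float(sum(1 for d in diffs if d != 0))
    return sum(d ** p for d in diffs) ** (1.0 / p)
```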

Differential Revision: D12981556

Pulled By: zou3519

fbshipit-source-id: 99e86abab3ca045257374a9212ca24e7ca59fe9d
2018-11-08 12:03:07 -08:00
Tongzhou Wang
2448a83d30 Give broadcast_coalesced tensors different version counters (#13594)
Summary:
In `broadcast_coalesced`, since multiple variables can be "views" of a big flattened tensor, they can share the same version counter. However, this base flat tensor is not exposed and they don't share any memory locations, so this is not necessary. Furthermore, it can cause problems, e.g., when two buffers are broadcast together in `DataParallel` and one of them is modified in-place during `forward` but the other is needed in backward, autograd engine will complain.

Fixing the bug discovered at https://github.com/pytorch/pytorch/pull/13350#issuecomment-436011370

edit: This is a very real problem. E.g., consider using Spectral Norm + Batch Norm together.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13594

Differential Revision: D12967311

Pulled By: SsnL

fbshipit-source-id: 52998dbabe149f575cf0fb79e7016f0b95e4b9e5
2018-11-07 21:49:35 -08:00
bddppq
4326873330 Skip std and var tests in pytorch rocm CI (#13662)
Summary:
https://github.com/pytorch/pytorch/pull/13435
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13662

Reviewed By: soumith

Differential Revision: D12958408

Pulled By: bddppq

fbshipit-source-id: 170b59769fbed149c9246b6549c62160e27d2404
2018-11-07 10:10:25 -08:00
Tongzhou Wang
2f82a06826 Fix half_tensor.bernoulli_(double) (#13474)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/12431
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13474

Differential Revision: D12897834

Pulled By: SsnL

fbshipit-source-id: 598250fd7b9f1d2509ec0e5012724d7895a62daf
2018-11-02 07:46:46 -07:00
Tongzhou Wang
6d2b3cc869 Fix pytest, make it work with run_test.py (#13416)
Summary:
Fixes #13326

Also now you can use `run_test.py` with `pytest`. E.g.,
```
python run_test.py -vci distributed -pt
```

Yes it works with `distributed` and `cpp_extension`.

cc zou3519 vishwakftw
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13416

Differential Revision: D12895622

Pulled By: SsnL

fbshipit-source-id: 2d18106f3a118d642a666bfb1318f41c859c3df7
2018-11-01 19:08:06 -07:00
jithunnair-amd
4d141bee98 Skip test_sum_noncontig in ROCm (#13341)
Summary:
Since it fails due to insufficient precision of DoubleTensor.sum() on ROCm.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13341

Differential Revision: D12851335

Pulled By: bddppq

fbshipit-source-id: e211c3868b685aa705160ce98a2a18a915ad493f
2018-10-30 16:54:44 -07:00
Tongzhou Wang
8ad69a80e3 Test scripts only run cases defined in the running script (#13250)
Summary:
1. Refactors `TestTorch` into `TestTorchMixin` (subclass of `object`) and `TestTorch` (subclass of `TestCase`, MRO `(TestCase, TestTorchMixin)`, only defined if `__name__ == '__main__'`). So other scripts won't accidentally run it.
2. Adds an assertion in `load_tests` that each script only runs cases defined in itself.

cc yf225 ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13250

Differential Revision: D12823734

Pulled By: SsnL

fbshipit-source-id: 7a169f35fe0794ce76e310d8a137d9a3265c012b
2018-10-29 13:57:40 -07:00
Sam Gross
52b6460d3a Fix bug in some reductions that use global memory (#13211)
Summary:
Reductions that used global memory but didn't reduce
across threads in a warp did not have enough global memory
allocated for their intermediate results. These were reductions
that were non-contiguous in their reduced dimension and
large enough to benefit from reducing across blocks in a
grid.

Fixes #13209
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13211

Differential Revision: D12815772

Pulled By: colesbury

fbshipit-source-id: f78be2cb302e7567a76097ca3ba1e7b801c0cdad
2018-10-29 10:23:30 -07:00
vishwakftw
1fe8278559 Batched Inverse (#9949)
Summary:
Complete billing of changes:

Related to Batch Inverse:
- [x] Add batched inverse (CPU)
- [x] Add batched inverse (CUDA)
- [x] Modify autograd entry
- [x] Add tests
  - [x] test_autograd
  - [x] test_cuda
  - [x] test_torch
- [x] Modify docs
- [x] Remove `_batch_inverse` in `MultivariateNormal`.
- [x] Allow batch matrices as inputs for negative powers in `matrix_power`

Miscellaneous modifications:
- [x] Move all batch operations to BatchLinearAlgebra.cpp/.cu and provide general framework for adding more batch ops.
- [x] Add a RAII structure for MAGMA queue management.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9949

Differential Revision: D10559089

Pulled By: zou3519

fbshipit-source-id: 7da24977f8a79d97dd42883302e13e708c1726e4
2018-10-27 23:42:46 -07:00
Zachary DeVito
dae7616078 Shard all of tests based on how many tests exist. (#13160)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13160

Reduces pytorch_core build from 2 hours to 30 minutes

Reviewed By: soumith, dzhulgakov

Differential Revision: D10524261

fbshipit-source-id: 97270ac73404b5ea4c264cd0e9d8d4b1be79b0e9
2018-10-26 18:20:34 -07:00
James Sun
f4944f0f8a Rename test/common.py to test/common_utils.py (#12794)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12794

common.py is used in base_module for almost all tests in test/. The
name of this file is so common that can easily conflict with other dependencies
if they happen to have another common.py in the base module. Rename the file to
avoid conflict.

Reviewed By: orionr

Differential Revision: D10438204

fbshipit-source-id: 6a996c14980722330be0a9fd3a54c20af4b3d380
2018-10-17 23:04:29 -07:00
Thomas Viehmann
d80a3eb549 Set philox seed and offset on cuda manual_seed (#12677)
Summary:
Fixes: #12669

Thank you Changmao Cheng for reporting this on the forum with a small example!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12677

Differential Revision: D10391989

Pulled By: ezyang

fbshipit-source-id: 5aa7a705bdb8ce6511a8eb1b3a207f22741046bf
2018-10-15 17:45:59 -07:00
vishwakftw
0740a5d521 compute_uv for SVD (#12517)
Summary:
Adds a `compute_uv` argument that defaults to `True` for optionally computing the singular vectors during SVD.

Closes https://github.com/pytorch/pytorch/issues/12420 .
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12517

Differential Revision: D10384554

Pulled By: SsnL

fbshipit-source-id: 704998a257afa815eda901b8ae830e8a661695be
2018-10-15 12:35:56 -07:00
vishwakftw
48bc57fa8d Introduce chain_matmul (#12380)
Summary:
- This was one of the few functions left out from the list of functions in
  NumPy's `linalg` module
- `multi_mm` is particularly useful for DL research, for quick analysis of
  deep linear networks
- Added tests and doc string
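The classic matrix-chain dynamic program that makes `chain_matmul` worthwhile, as a cost-only sketch (the real op also performs the multiplications):

```python
def chain_matmul_cost(dims):
    # Minimum scalar-multiplication cost for multiplying matrices
    # A_i of shape dims[i] x dims[i+1], via the interval DP.
    n = len(dims) - 1
    cost = [[0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            cost[i][j] = min(
                cost[i][k] + cost[k + 1][j] + dims[i] * dims[k + 1] * dims[j + 1]
                for k in range(i, j)
            )
    return cost[0][n - 1]
```

For dims `[10, 30, 5, 60]`, parenthesizing as `(A B) C` costs 4500 multiplications versus 27000 for `A (B C)`.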
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12380

Differential Revision: D10357136

Pulled By: SsnL

fbshipit-source-id: 52b44fa18d6409bdeb76cbbb164fe4e88224458e
2018-10-12 03:58:12 -07:00
Ailing Zhang
8734b174ca Multinomial raise error (#12490)
Summary:
Fixes #12260 #2896

```
torch.multinomial(torch.FloatTensor([0, 1, 0, 0]), 3, replacement=False)
```
The old behavior is that we return `0` after we run out of positive categories. Now we raise an error, based on the discussion in the issue thread.

- Add test cases for the CPU & CUDA case; in the CUDA case `n_samples=1` is a simple special case, so we test against `n_sample=2` instead.
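The new behavior, sketched in plain Python (sampling itself is replaced by a deterministic stand-in; the error message is illustrative):

```python
def multinomial_no_replacement(probs, n_samples):
    # Error out up front instead of silently returning index 0 once
    # the positive-probability categories run out.
    positive = [i for i, p in enumerate(probs) if p > 0]
    if n_samples > len(positive):
        raise RuntimeError("cannot sample more values than positive "
                           "categories without replacement")
    return positive[:n_samples]
```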
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12490

Differential Revision: D10278794

Pulled By: ailzhang

fbshipit-source-id: d04de7a60f60d0c0d648b975db3f3961fcf42db1
2018-10-10 20:39:04 -07:00
iotamudelta
64f707cd26 Enable more unit tests (ROCm 255) (#12486)
Summary:
* Enable more tests that relied on CPU LAPACK at compile time.
* enabled min/max tests in test_cuda (ROCm 236)

bddppq ezyang

Tests ran as part of the ROCm CI here: https://github.com/ROCmSoftwarePlatform/pytorch/pull/255
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12486

Differential Revision: D10262534

Pulled By: ezyang

fbshipit-source-id: 167a06fc8232af006f4b33dcc625815fd4b06d6b
2018-10-09 15:38:19 -07:00
iotamudelta
a2ebbccc9f fix unit tests on CI
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/12187

Differential Revision: D10118483

Pulled By: bddppq

fbshipit-source-id: 986c8fb48d61e00103c713548a50e74489a0e442
2018-09-28 23:11:55 -07:00
Sam Gross
b263078bc3 Fix CUDA division by a scalar on large arrays. (#12023)
Summary:
The gpu_unary_kernel function was not handling arrays that
cannot use 32-bit indexing. This function was only called directly
by CUDA division by a scalar. Other arithmetic operations go through
gpu_binary_kernel, which already properly handled large arrays.

This bug sometimes manifested as a crash and sometimes as an incorrect
answer.

Fixes #11788
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12023

Differential Revision: D10034017

Pulled By: colesbury

fbshipit-source-id: b17300f327de54035746bf02f576766007c9b144
2018-09-25 13:10:25 -07:00
Sam Gross
1c09bfde1b Make promoteType(half, integer) -> half (#11941)
Summary:
Changes the result type of half type and any integer type to return half
type (instead of float or double).
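The changed rule, as a hypothetical promotion-table sketch (dtype names and ordering here are illustrative, not PyTorch's actual tables):

```python
INT_TYPES = {"uint8", "int8", "int16", "int32", "int64"}
ORDER = ["uint8", "int8", "int16", "int32", "int64", "half", "float", "double"]

def promote(a, b):
    # New rule: half combined with any integer dtype stays half,
    # instead of promoting to float or double.
    if "half" in (a, b) and (a in INT_TYPES or b in INT_TYPES):
        return "half"
    return a if ORDER.index(a) >= ORDER.index(b) else b
```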

This is based on top of #11808. The first new commit is "Make promoteType(half, integer) -> half". I'll rebase on top of master once that PR lands.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11941

Differential Revision: D10014122

Pulled By: colesbury

fbshipit-source-id: 16a5eb3406a5712069201d872d8736d0599e9411
2018-09-24 13:55:42 -07:00
Sam Gross
1cf5b0c7c1 Fix casting logic for 0d CPU tensors in CUDA ops (#11808)
Summary:
Previously, we didn't cast any 0-dim tensors used in CUDA operations. We
can only avoid the casts for 0-dim CPU tensors used in CUDA operations.

Fixes #11795
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11808

Differential Revision: D9922406

Pulled By: colesbury

fbshipit-source-id: 940b8a8534770aa5cd70d5d09b96be0f0f8146ff
2018-09-21 14:19:56 -07:00
Thomas Viehmann
6834dcab1c Align cuda multinomial without replacement to CPU behaviour (#11933)
Summary:
We do this by being more NaN tolerant.

Fixes: #9062
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11933

Differential Revision: D9991129

Pulled By: soumith

fbshipit-source-id: c99b04462c1bee90d00eeabb0c111de12f855f4d
2018-09-21 11:04:17 -07:00