Summary:
Changelog:
- Renames `trtrs` to `triangular_solve` to remain consistent with `cholesky_solve` and `solve`.
- Rename all tests, fix callsites
- Create a tentative alias for `triangular_solve` under the name `trtrs`, and add a deprecation warning to not promote usage.
- Move `isnan` to _torch_docs.py
- Remove unnecessary imports
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18213
Differential Revision: D14566902
Pulled By: ezyang
fbshipit-source-id: 544f57c29477df391bacd5de700bed1add456d3f
Summary:
- Remove single batch TH/THC implementations
- Remove `_batch_trtrs_lower` from `multivariate_normal`
- Add tests for batched behavior
- Modify trtrs_backward to accommodate for batched case
- Modify docs
In a future PR, this will be renamed to `triangular_solve`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18025
Differential Revision: D14523004
Pulled By: ifedan
fbshipit-source-id: 11c6a967d107f969b60e5a5c73ce6bb8099ebbe1
Summary:
Changelog:
- Renames `gesv` to `solve` to remain consistent with `cholesky_solve`.
- Rename all tests, fix callsites
- Create a tentative alias for `solve` under the name `gesv`, and add a deprecated warning to not promote usage.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18060
Differential Revision: D14503117
Pulled By: zou3519
fbshipit-source-id: 99c16d94e5970a19d7584b5915f051c030d49ff5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17927
ghimport-source-id: 626d321e430b6b5c0ea3aa1eb9df8c1e2d058bf8
Stack:
* #17926 Implement at::has_internal_overlap helper function
* **#17927 Error out on in-place (unary) ops on tensors that have internal overlap**
On the way to #17935.
Works for CPU and CUDA on the following ops:
- abs_, acos_, asin_, atan_, ceil_, cos_, erf_, erfc_, exp_, expm1_
- floor_, log_, log10_, log1p_, log2_, round_, rsqrt_,
- sin_, sqrt_, tan_, tanh_, trunc_
This PR adds a check to see if the out/result tensor has internal
overlap. If it does, then we error out because the result **may** be
incorrect.
This is overly conservative; there are some cases where if the result is
the same as the input, the inplace operation is OK (such as floor_,
round_, and trunc_). However, the current code isn't organized in such a
way that this is easy to check, so enabling those will come in the future.
Reviewed By: ezyang
Differential Revision: D14438871
fbshipit-source-id: 15e12bf1fdb2ab7f74bb806e22bc74840bd6abd1
Summary:
ROCm 2.2 was released today, if we respin the CI docker images with the attached, PyTorch/Caffe2 will support ROCm 2.2
Changes necessary:
* for the Ubuntu target, HIP PR 934 needs to be applied to fix the forceinline definition. ROCm 2.3 will contain this.
* two unit tests proof flaky on different platforms, disable them defensively.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18007
Differential Revision: D14473903
Pulled By: bddppq
fbshipit-source-id: b1939f11d1c765a3bf71bb244b15f6ceb0e816d3
Summary:
This PR causes kthvalue to be consistent with sort
(i.e. treat NaN as larger than any number), so that
`a.kthvalue(n) == a.sort()[n - 1]`.
One drawback is that median with a NaN argument does not return NaN,
which is a deviation from NumPy.
Thank you, ngimel, for raising this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17824
Differential Revision: D14410092
Pulled By: ezyang
fbshipit-source-id: bdec2d8272dc4c65bcf2f9b8995e237774c44c02
Summary:
xxtemp, colesbury, bhushan23, zou3519, convert gpu round behavior to half-to-even, consistent with torch cpu version and numpy. You feedback are welcomed.
See #16498
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17443
Differential Revision: D14261786
Pulled By: VitalyFedyunin
fbshipit-source-id: 98156436b545d72769831a89e2775d43ad913ebc
Summary:
When switching back to `d0` from a stream on a different device `d1`, we need to restore the current streams on both `d0` and `d1`. The current implementation only does that for `d0`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17439
Differential Revision: D14208919
Pulled By: mrshenli
fbshipit-source-id: 89f2565b9977206256efbec42adbd789329ccad8
Summary:
update:
1. global_reduce check for should_block_y_reduce first.
This avoids the enabling global_reduce without block_y_reduce. Leading to
accessing shared memory during global reduce without allocation.
2. updating block_y_reduce heuristics. Improves perf on tiny tensors
3. adding test case covering old cases where illegal memory access might occur
TensorIterator cuda launch configs update (#16224)
Update launch configs for TensorIterator gpu_reduce_kernel. Enable flexible
block dimension to improve efficiency for reduction cases with small fast
dimension.
Previously TensorIterator launches blocks with fixed 32x16 threads.
For cases like:
import torch
torch.randn(2**20, 4, device='cuda').sum(0)
The fixed launch config does handle coalesced memory access efficiently.
Updated launch configure enables flexible block dimension. Combining with
improved reduction scheme (using flexible vertical / horizontal reduction
instead of limited warp / block reduction in the old code), it ensures optimal
memory access pattern even with reduction on dimension with small stride.
Possible future improvements:
1. Precise dynamic shared memory allocation.
2. Using warp shuffle for vertical (block_y) reduction.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16224
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17040
Differential Revision: D14078295
Pulled By: umanwizard
fbshipit-source-id: ecc55054a5a4035e731f0196d633412225c3b06c
Summary:
This fixes the segfault.
Changelog:
- Modify the function calls in LegacyDefinitions for `geqrf_out` and `ormqr_out`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16964
Differential Revision: D14025985
Pulled By: gchanan
fbshipit-source-id: aa50e2c1694cbf3642273ee14b09ba12625c7d33
Summary:
This is the first round of enabling unit tests that work on ROCm 2.1 in my tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16871
Differential Revision: D13997662
Pulled By: bddppq
fbshipit-source-id: d909a3f7dd5fc8f85f126bf0613751c8e4ef949f
Summary:
1. Added `torch/csrc/cuda/Event.h` and `torch/csrc/cuda/Event.cpp` to bind Python Event class to C++ implementation.
2. Move all CUDA runtime invocations from `torch/cuda/streams.py` to C++
3. Added tests to cover Stream and Event APIs. ~(event IPC handle tests is introduced in #15974)~
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15937
Differential Revision: D13649001
Pulled By: mrshenli
fbshipit-source-id: 84ca58f35f6ba679a4ba33150ceba678d760d240
Summary:
Adding supports for torch.nomr:
i. multi dimensions for dim
ii. dtype that specifies math/output tensor type
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15414
Differential Revision: D13702022
Pulled By: ezyang
fbshipit-source-id: da2676f2b6aff988889b1539d0de8ecd4946823a
Summary:
The cumsum over the probabilities can be not monotonically
non-decreasing. Thus it is hard to detect zero probability
classes using just the cumsum.
This changes the binary search postprocessing to use the
(non-cumulated) distribution instead.
Thank you, jcjohnson, for the bug report with
reproducing case.
Fixes: #13867
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16075
Differential Revision: D13695565
Pulled By: soumith
fbshipit-source-id: 02c4d6f868f0050c1ae7d333f4317c5610e49cd9
Summary:
The correct logic is as follows:
* If there is an earlier split, we need to combine with its result
* If there is *not* a later split, we need to project before saving into the output.
This should partially f i x #15837 . For example:
```
In [7]: a=torch.ones([1838860800], dtype=torch.float, device="cuda:1")
In [8]: a.mean()
Out[8]: tensor(1., device='cuda:1')
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16023
Differential Revision: D13678449
Pulled By: umanwizard
fbshipit-source-id: ab5078484c88e96bb30121b5cf24a0e8b0a8c2f8
Summary:
see #15682
This is a quick fix by implementing the simpler solution as suggested by colesbury. As benchmark result shows, it slows down `Stream.query()` by ~20%, I would be happy to further pursue a more complex solution by implementing this in C++/ATen. But I would still vote for merge this quick fix first just to get rid of the bug sooner.
~Test TBA~ Added
FYI jeffreyksmithjr
now
```python
In [1]: def f():
...: d0 = torch.device('cuda:0')
...: d1 = torch.device('cuda:1')
...: with torch.cuda.device(d0):
...: s0 = torch.cuda.current_stream()
...: with torch.cuda.device(d1):
...: s1 = torch.cuda.current_stream()
...: s0.query()
...: s1.query()
In [4]: %timeit f()
38.1 µs ± 4.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [5]: %timeit f()
37.6 µs ± 2.7 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```
before
```python
In [4]: %timeit f()
28.5 µs ± 1.74 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [5]: %timeit f()
35.3 µs ± 2.91 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15689
Differential Revision: D13571697
Pulled By: mrshenli
fbshipit-source-id: 4fe697f91248c6419136d37bb5b7147e612e2f4c
Summary:
Changelog:
- Optimize btriunpack by using `torch.where` instead of indexing, inplace operations instead of out place operations and avoiding costly permutations by computing the final permutation over a list.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15286
Differential Revision: D13562038
Pulled By: soumith
fbshipit-source-id: e2c94cfab5322bf1d24bf56d7b056619f553acc6
Summary:
This PR removes the TH/THC binding for gesv.
Changelog:
- Remove TH/THC binding
- Port single matrix case to ATen
- Enable test_gesv for CUDA as well
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15510
Differential Revision: D13559990
Pulled By: soumith
fbshipit-source-id: 9da2825e94d3103627e719709e6b1f8b521a07fb
Summary:
Currently torch.isinf on integral tensor will raise RuntimeError: value cannot be converted to type int16_t without overflow: inf.
This pr will suppress the error and return false(0) for all integral tensors. The behavior will also be consistent with np.isinf
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15489
Reviewed By: zou3519
Differential Revision: D13540786
Pulled By: flashhack
fbshipit-source-id: e730dea849da6a59f3752d347bcfbadfd12c6483
Summary:
Followup PR of #14904, and the stretch goal of #12653.
Directly calculate coordinates in the original tensor using column index in the result tensor. Every GPU thread takes care of a column (two numbers) in the output tensor.
The implementation detects and handles precision loss during calculating the square root of a `int64_t` variable, and supports tensors with up to `row * column = 2 ^ 59` numbers.
Algorithm details are describe in [comments of TensorFactories.cu](23ddb6f58a/aten/src/ATen/native/cuda/TensorFactories.cu (L109-L255)).
zou3519
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15203
Reviewed By: zou3519
Differential Revision: D13517695
Pulled By: mrshenli
fbshipit-source-id: 86b305d22cac08c8962a3b0cf8e9e620b7ec33ea
Summary:
Changelog:
- Renames `potrs` to `cholesky_solve` to remain consistent with Tensorflow and Scipy (not really, they call their function chol_solve)
- Default argument for upper in cholesky_solve is False. This will allow a seamless interface between `cholesky` and `cholesky_solve`, since the `upper` argument in both function are the same.
- Rename all tests
- Create a tentative alias for `cholesky_solve` under the name `potrs`, and add deprecated warning to not promote usage.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15334
Differential Revision: D13507724
Pulled By: soumith
fbshipit-source-id: b826996541e49d2e2bcd061b72a38c39450c76d0
Summary:
…on](#12115)
mean is calculated in two step sum()/numel(). For half precision, data gets
casted back to half after sum().
We fused the division into the reduction kernel by adding pre_op/post_op.
This allows us to do torch.ones(65536).cuda().half().mean() to return correct
result.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14878
Differential Revision: D13491159
Pulled By: soumith
fbshipit-source-id: e83802e1628b6d2615c45e18d7acf991d143a09e
Summary:
tests work on ROCm 1.9.2 as present on CI (fp16 bringup, hipMemset and sparse improvements)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15232
Differential Revision: D13470991
Pulled By: bddppq
fbshipit-source-id: 45acc4f9ea5baaaf7672b86eb022948055779925
Summary:
This is an optimized implementation that does the following:
1. created an empty Tensor of correct size.
2. fill the Tensor with correct values.
The following three designs to fill in the Tensor result in roughly the same performance. Hence, the 2nd option is taken for simpler code, and to return contiguous tensors.
1. Sequential: fill row coordinates first, then columns. This results in two for-loop and more arithmetic operations.
2. Interleaved: fill in index coordinates one by one, which jumps between the two output Tensor rows in every iteration.
3. Transpose: create a n X 2 Tensor, fill the Tensor sequentially, and then transpose it.
<img width="352" alt="screen shot 2018-12-10 at 3 54 39 pm" src="https://user-images.githubusercontent.com/16999635/49769172-07bd3580-fc94-11e8-8164-41839185e9f9.png">
NOTE:
This implementation returns a 2D tensor, instead of a tuple of two tensors. It means that users will not be able to do the following:
```python
x = torch.ones(3, 3)
i = torch.tril_indices(3, 3)
x[i] # need to first convert the 2D tensor into a tuple of two 1D tensors.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14904
Reviewed By: zou3519
Differential Revision: D13433027
Pulled By: mrshenli
fbshipit-source-id: 41c876aafcf584832d7069f7c5929ffb59e0ae6a
Summary:
* Enable unit tests known to work on ROCm.
* Disable a few that are known to be flaky for the time being.
* Use std::abs for Half
* No more special casing for ROCm in TensorMathReduce
* Document an important detail for a hardcoded block size w.r.t. ROCm in TensorMathReduce
ezyang bddppq for awareness
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14011
Differential Revision: D13387679
Pulled By: bddppq
fbshipit-source-id: 4177f2a57b09d866ccbb82a24318f273e3292f71
Summary:
Removes cast of half to float in torch.sum, with float16 input tensor and
float32 output tensor, instead we cast data when loading input in kernel.
This supposingly would save a kernel launch as well as a full global memory load
on promoted data type (float).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14580
Differential Revision: D13356203
Pulled By: ezyang
fbshipit-source-id: 85e91225b880a65fe3ceb493371b9b36407fdf48
Summary:
Fixes https://github.com/pytorch/pytorch/issues/14673
As pointed out by vishwakftw , the root case of the `deepcopy` issue was that `storage.clone()` would create a new storage in the default device.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14751
Reviewed By: soumith
Differential Revision: D13323061
Pulled By: fmassa
fbshipit-source-id: bfe46ebd78f0b6cd9518c11d09de7849282ed2a2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13937
We can now replace s_copy_ with our new _copy_ function. Experimented with moving s_copy_ out of VariableManualType.cpp, but seemed like there was enough special casing to warrant it staying.
Reviewed By: ezyang
Differential Revision: D13053648
fbshipit-source-id: e9e04d460baf4ee49b500212cf91b95221acd769
Summary:
This speeds-up "advanced" indexing (indexing a tensor by a tensor)
on CPU and GPU. There's still a bunch of work to do, including
speeding up indexing by a byte (boolean) mask and speeding up the derivative
calculation for advanced indexing.
Here's some speed comparisons to indexing on master using a little [benchmark script](https://gist.github.com/colesbury/c369db72aad594e5e032c8fda557d909) with 16 OpenMP threads and on a P100. The test cases are listed as (input shape -> output shape).
| Test case | CPU (old vs. new) | CUDA (old vs. new) |
|-----------------------|---------------------|------------------------|
| 1024x1024 -> 512x1024 | 225 us vs. **57 us** | 297 us vs. **47 us** |
| 1024x1024 -> 1024x512 | 208 us vs. **153 us** | 335 us vs. **54 us** |
| 50x50 -> 20000x50 | 617 us vs. **77 us** | 239 us vs. **54 us** |
| 50x50 -> 50x20000 | 575 us vs. **236 us** | 262 us vs. **58 us** |
| 2x5x10 -> 10 | 65 us vs. **18 us** | 612 us vs. **93 us** |
See #11647
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13420
Reviewed By: soumith
Differential Revision: D13088936
Pulled By: colesbury
fbshipit-source-id: 0a5c2ee9aa54e15f96d06692d1694c3b24b924e2
Summary:
Implements batching for the Cholesky decomposition.
Performance could be improved with a dedicated batched `tril` and `triu` op, which is also impeding autograd operations.
Changes made:
- batching code
- tests in `test_torch.py`, `test_cuda.py` and `test_autograd.py`.
- doc string modification
- autograd modification
- removal of `_batch_potrf` in `MultivariateNormal`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14017
Differential Revision: D13087945
Pulled By: ezyang
fbshipit-source-id: 2386db887140295475ffc247742d5e9562a42f6e
Summary:
The size of the shared and global memory buffers were incorrect for float16.
They were sized based on float16 elements, but the buffers store intermediate
float32 values.
Fixes#13909
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13926
Differential Revision: D13048334
Pulled By: colesbury
fbshipit-source-id: 5a07df53f1152d5920258e91ed3f1e1de89b29e1
Summary:
torch.randn(big_number_here, dtype=torch.int8) is wrong because randn
isn't implemented for torch.int8. I've changed it to use torch.empty
instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13864
Differential Revision: D13032130
Pulled By: zou3519
fbshipit-source-id: d157b651b47b8bd736f3895cc242f07de4c1ea12
Summary:
This enables the distributions and utils test sets for ROCm.
Individual tests are enabled that now pass due to fixes in HIP/HCC/libraries versions in white rabbit.
For attention: bddppq ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13166
Differential Revision: D12814759
Pulled By: bddppq
fbshipit-source-id: ea70e775c707d7a8d2776fede6154a755adef43e
Summary:
- This is a straightforward PR, building up on the batch inverse PR, except for one change:
- The GENERATE_LINALG_HELPER_n_ARGS macro has been removed, since it is not very general and the resulting code is actually not very copy-pasty.
Billing of changes:
- Add batching for `potrs`
- Add relevant tests
- Modify doc string
Minor changes:
- Remove `_gesv_single`, `_getri_single` from `aten_interned_strings.h`.
- Add test for CUDA `potrs` (2D Tensor op)
- Move the batched shape checking to `LinearAlgebraUtils.h`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13453
Reviewed By: soumith
Differential Revision: D12942039
Pulled By: zou3519
fbshipit-source-id: 1b8007f00218e61593fc415865b51c1dac0b6a35
Summary:
```
The new error message now looks like (from Python):
RuntimeError: CUDA out of memory. Tried to allocate 16.00 GiB (GPU 0; 11.93 GiB total capacity; 4.00 GiB already allocated; 7.33 GiB free; 179.00 KiB cached)
Summary of terms:
"total capacity": total global memory on GPU
"already allocated": memory allocated by the program using the
caching allocator
"free": free memory as reported by the CUDA API
"cached": memory held by the allocator but not used by the program
The "allocated" amount does not include memory allocated outside
of the caching allocator, such as memory allocated by other programs
or memory held by the driver.
The sum of "allocated" + "free" + "cached" may be less than the
total capacity due to memory held by the driver and usage by other
programs.
Note that at this point cuda_malloc_retry has already returned all
possible "cached" memory to the driver. The only remaining "cached"
memory is split from a larger block that is partially in-use.
```
This also fixes an issue where on out-of-memory could cause an unrelated subsequent CUDA kernel launch to fail because `cudaGetLastError()` was not cleared.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13751
Differential Revision: D13007177
Pulled By: colesbury
fbshipit-source-id: ea7121461b3f2a34646102959b45bde19f2fabab
Summary:
In `broadcast_coalesced`, since multiple variables can be "views" of a big flattened tensor, they can share the same version counter. However, this base flat tensor is not exposed and they don't share any memory locations, so this is not necessary. Furthermore, it can cause problems, e.g., when two buffers are broadcast together in `DataParallel` and one of them is modified in-place during `forward` but the other is needed in backward, autograd engine will complain.
Fixing the bug discovered at https://github.com/pytorch/pytorch/pull/13350#issuecomment-436011370
edit: This is a very real problem. E.g., consider using Spectral Norm + Batch Norm together.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13594
Differential Revision: D12967311
Pulled By: SsnL
fbshipit-source-id: 52998dbabe149f575cf0fb79e7016f0b95e4b9e5
Summary:
Fixes#13326
Also now you can use `run_test.py` with `pytest`. E.g.,
```
python run_test.py -vci distributed -pt
```
Yes it works with `distributed` and `cpp_extension`.
cc zou3519 vishwakftw
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13416
Differential Revision: D12895622
Pulled By: SsnL
fbshipit-source-id: 2d18106f3a118d642a666bfb1318f41c859c3df7
Summary:
Since it fails due to insufficient precision for DoubleTensor .sum() on ROCm
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13341
Differential Revision: D12851335
Pulled By: bddppq
fbshipit-source-id: e211c3868b685aa705160ce98a2a18a915ad493f
Summary:
1. Refactors `TestTorch` into `TestTorchMixin` (subclass of `object`) and `TestTorch` (subclass of `TestCase`, MRO `(TestCase, TestTorchMixin)`, only defined if `__name__ == '__main__'`). So other scripts won't accidentally run it.
2. Adds an assertion in `load_tests` that each script only runs cases defined in itself.
cc yf225 ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13250
Differential Revision: D12823734
Pulled By: SsnL
fbshipit-source-id: 7a169f35fe0794ce76e310d8a137d9a3265c012b
Summary:
Reductions that used global memory, but didn't reduce
across threads in a warp did not have enough global memory
allocated for their intermediate results. These reductions
that were non-contiguous in their reduced dimension and
large enough to benefit from reducing across blocks in a
grid.
Fixes#13209
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13211
Differential Revision: D12815772
Pulled By: colesbury
fbshipit-source-id: f78be2cb302e7567a76097ca3ba1e7b801c0cdad
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12794
common.py is used in base_module for almost all tests in test/. The
name of this file is so common that can easily conflict with other dependencies
if they happen to have another common.py in the base module. Rename the file to
avoid conflict.
Reviewed By: orionr
Differential Revision: D10438204
fbshipit-source-id: 6a996c14980722330be0a9fd3a54c20af4b3d380
Summary:
Fixes: #12669
Thank you Changmao Cheng for reporting this on the forum with a small example!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12677
Differential Revision: D10391989
Pulled By: ezyang
fbshipit-source-id: 5aa7a705bdb8ce6511a8eb1b3a207f22741046bf
Summary:
- This was one of the few functions left out from the list of functions in
NumPy's `linalg` module
- `multi_mm` is particularly useful for DL research, for quick analysis of
deep linear networks
- Added tests and doc string
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12380
Differential Revision: D10357136
Pulled By: SsnL
fbshipit-source-id: 52b44fa18d6409bdeb76cbbb164fe4e88224458e
Summary:
Fixes#12260#2896
```
torch.multinomial(torch.FloatTensor([0, 1, 0, 0]), 3, replacement=False)
```
The old behavior is that we return `0` after we run out of postive categories. Now we raise an error based on discussion in the issue thread.
- Add testcase for cpu & cuda case, in cuda case `n_samples=1` is a simple special case, so we test against `n_sample=2` instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12490
Differential Revision: D10278794
Pulled By: ailzhang
fbshipit-source-id: d04de7a60f60d0c0d648b975db3f3961fcf42db1
Summary:
* Enable more tests that relied on CPU LAPACK at compile time.
* enabled min/max tests in test_cuda (ROCm 236)
bddppq ezyang
Tests ran as part of the ROCm CI here: https://github.com/ROCmSoftwarePlatform/pytorch/pull/255
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12486
Differential Revision: D10262534
Pulled By: ezyang
fbshipit-source-id: 167a06fc8232af006f4b33dcc625815fd4b06d6b
Summary:
The gpu_unary_kernel function was not handling arrays that
cannot use 32-bit indexing. This functions was only called directly
by CUDA division by a scalar. Other arithmetic operations go through
gpu_binary_kernel, which already properly handled large arrays.
This bug sometimes manifested as a crash and sometimes as an incorrect
answer.
Fixes#11788
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12023
Differential Revision: D10034017
Pulled By: colesbury
fbshipit-source-id: b17300f327de54035746bf02f576766007c9b144
Summary:
Changes the result type of half type and any integer type to return half
type (instead of float or double).
This is based on top of #11808. The first new commit is "Make promoteType(half, integer) -> half". I'll rebase on top of master once that PR lands.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11941
Differential Revision: D10014122
Pulled By: colesbury
fbshipit-source-id: 16a5eb3406a5712069201d872d8736d0599e9411
Summary:
Previously, we didn't cast any 0-dim tensors used in CUDA operations. We
can only avoid the casts for 0-dim CPU tensors used in CUDA operations.
Fixes#11795
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11808
Differential Revision: D9922406
Pulled By: colesbury
fbshipit-source-id: 940b8a8534770aa5cd70d5d09b96be0f0f8146ff
Summary:
We do this by being more NaN tolerant.
Fixes: #9062
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11933
Differential Revision: D9991129
Pulled By: soumith
fbshipit-source-id: c99b04462c1bee90d00eeabb0c111de12f855f4d
Summary:
+ https://github.com/pytorch/pytorch/issues/10236 : torch.bernoulli's out kwarg is broken
fixed in moving `bernoulli_out` to ATen
+ https://github.com/pytorch/pytorch/issues/9917 : BUG torch.bernoulli(p.expand(shape)) is broken
fixed in moving all `bernoulli` ops in ATen to use the modern apply utils methods
+ https://github.com/pytorch/pytorch/issues/10357 : torch.bernoulli inconsistent gpu/cpu results
fixed by adding CUDA asserts
In order to use `curand_uniform4`, I made some changes to `CUDAApplyUtils.cuh`. Specifically, I introduced an optional template parameter `int step` to the `CUDA_tensor_applyN` methods, representing that we want to process `step` values at each time for each of the `N` tensors.
The calling convention for `step = 1` (default) isn't changed. But if `step > 1`, the given lambda `op` must take in `int n` as its first argument, representing the number of valid values, because there may not be full `step` values at the boundary. E.g., here is what the `bernoulli(self, p_tensor)` call look like:
```cpp
// The template argument `4` below indicates that we want to operate on four
// element at each time. See NOTE [ CUDA_tensor_applyN helpers ] for details.
at::cuda::CUDA_tensor_apply2<scalar_t, prob_t, 4>(
ret, p,
[seeds] __device__(
int n, scalar_t& v1, scalar_t& v2, scalar_t& v3, scalar_t& v4,
const prob_t& p1, const prob_t& p2, const prob_t& p3, const prob_t& p4) {
curandStatePhilox4_32_10_t state;
curand_init(
seeds.first,
blockIdx.x * blockDim.x + threadIdx.x,
seeds.second,
&state);
float4 rand = curand_uniform4(&state);
switch (n) {
case 4: {
assert(0 <= p4 && p4 <= 1);
v4 = static_cast<scalar_t>(rand.w <= p4);
}
case 3: {
assert(0 <= p3 && p3 <= 1);
v3 = static_cast<scalar_t>(rand.z <= p3);
}
case 2: {
assert(0 <= p2 && p2 <= 1);
v2 = static_cast<scalar_t>(rand.y <= p2);
}
case 1: {
assert(0 <= p1 && p1 <= 1);
v1 = static_cast<scalar_t>(rand.x <= p1);
}
}
}
);
```
Benchmarking on `torch.rand(200, 300, 400)` 20 times, each time with 20 loops:
post patch
```
➜ ~ numactl --cpunodebind 1 --membind 1 -- taskset -c 12,13,14,15,16,17,18,19,20,21,22,23 env CUDA_LAUNCH_BLOCKING=1 python bern.py
torch.bernoulli(x)
6.841588497161865 +- 0.05413117632269859
torch.bernoulli(xc)
0.05963418632745743 +- 0.0008014909108169377
x.bernoulli_()
0.4024486541748047 +- 0.0021550932433456182
xc.bernoulli_()
0.02167394384741783 +- 2.3818030967959203e-05
```
pre-patch
```
➜ ~ numactl --cpunodebind 1 --membind 1 -- taskset -c 12,13,14,15,16,17,18,19,20,21,22,23 env CUDA_LAUNCH_BLOCKING=1 python bern.py
torch.bernoulli(x)
12.394511222839355 +- 0.0966421514749527
torch.bernoulli(xc)
0.08970972150564194 +- 0.0038722590543329716
x.bernoulli_()
1.654480218887329 +- 0.02364428900182247
xc.bernoulli_()
0.058352887630462646 +- 0.003094920190051198
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10273
Differential Revision: D9831294
Pulled By: SsnL
fbshipit-source-id: 65e0655a36b90d5278b675d35cb5327751604088
Summary:
- Incorporates MKL addition by mingfeima Thank you! (but all errors are my own)
- Native CPU implementation: defer to matrix multiplication for
small batches and parallelize over batch dimension for large
batches.
- Add bmm test for CUDA just to be sure.
This is a partial fix for #10661, getting down to a factor ~5.
Considerable overhead is incurred for the setup in einsum. It might
be more efficient to eventually define an optimized contraction
functions for arbitrary and several dimensions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11292
Differential Revision: D9784941
Pulled By: ezyang
fbshipit-source-id: f6dded2c6f5e8f0461fb38f31f9a824992a58358
Summary:
vishwakftw Your patch needed some updates because the default native function dispatches changed from `[function, method]` to `[function]`. The CI was run before that change happened so it still shows green, but the internal test caught it.
I did some changes when rebasing and updating so I didn't just force push to your branch. Let's see if this passes CI and internal test. If it does, let me know if you want me to force push to your branch or use this PR instead.
Note to reviewers: patch was already approved at #10068 .
cc yf225
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11421
Differential Revision: D9733407
Pulled By: SsnL
fbshipit-source-id: cf2ed293bb9942dcc5158934ff4def2f63252599
Summary:
Disables two of the unit tests in test_cuda that got introduced after test_cuda was enabled that fail on ROCm.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11191
Differential Revision: D9628702
Pulled By: ezyang
fbshipit-source-id: 4c298c728f42bb43d39b57967aa3e44385980265
Summary:
* improve docker packages (install OpenBLAS to have at-compile-time LAPACK functionality w/ optimizations for both Intel and AMD CPUs)
* integrate rocFFT (i.e., enable Fourier functionality)
* fix bugs in ROCm caused by wrong warp size
* enable more test sets, skip the tests that don't work on ROCm yet
* don't disable asserts any longer in hipification
* small improvements
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10893
Differential Revision: D9615053
Pulled By: ezyang
fbshipit-source-id: 864b4d27bf089421f7dfd8065e5017f9ea2f7b3b
Summary:
Fix#10345, which only happens in CUDA case.
* Instead of returning some random buffer, we fill it with zeros.
* update torch.symeig doc.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10645
Reviewed By: soumith
Differential Revision: D9395762
Pulled By: ailzhang
fbshipit-source-id: 0f3ed9bb6a919a9c1a4b8eb45188f65a68bfa9ba
Summary:
In the shortcut for n_sample=1, when category 0 has 0 weight,
we should not map the (uniform) sample 0 to category 0.
The conversion uniform->multinomial was apparently written to work on
a (0,1] range (like curand uses), but PyTorch uses a [0,1) range.
Fixes: #4858. Thank you, Roy Fejgin for reporting.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9960
Reviewed By: soumith
Differential Revision: D9341793
Pulled By: ailzhang
fbshipit-source-id: 6b1a96419a7bc58cc594f761f34c6408ff6354cf
Summary:
```
This adds TensorIterator, a helper class for computing element-wise
operations that's intended to replace the CPU and CUDA apply utils
functions.
CPU kernels are implemented as functions that operate on strided 1-d
tensors compared to CPUApplyUtils which operated individual elements. This
allows the kernels to handle vectorization, while TensorIterator handles
parallelization and non-coalesced dimensions.
GPU kernels continue to operate on elements, but the number of
specializations is reduced. The contiguous case remains the same. The
non-contiguous case uses a single (reduced) shape for all operands and
the fast integer division from THCIntegerDivider. To avoid extra
specializations for indexing with 64-bits, large operations are split
into smaller operations that can be indexed with 32-bits.
Major semantic changes:
- No more s_add, s_mul, s_div, or s_sub. Broadcasting is handled by
TensorIterator. The autograd engine performs the reduction assuming
standard broadcasting if the gradient shape does not match the
expected shape. Functions that do not use standard broadcasting rules
should either continue to trace the expand calls or handle the
reduction in their derivative formula.
- Use ONNX v7, which supports broadcasting ops.
Performance impact:
- Small increased fixed overhead (~0.5 us)
- Larger overhead for wrapped numbers (~2.5 us)
- No significant change for ops on contiguous tensors
- Much faster worst-case performance for non-contiguous GPU tensors
- Faster CPU bias addition (~2x)
- Faster GPU bias addition (~30% faster)
Future work:
- Decrease overhead, especially for wrapping numbers in Tensors
- Handle general inter-type operations
- Extend to unary ops and reductions
- Use buffering for compute-bound operations on non-contiguous tensors
(pull in from CPUApplyUtils)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/8919
Differential Revision: D8677600
Pulled By: colesbury
fbshipit-source-id: 61bc9cc2a36931dfd00eb7153501003fe0584afd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9497Fixes#7883 by using `rfft`.
It's worth noting that this is BC breaking. And it's impossible to detect the change because the two signatures before and after this change supports a common subset of calling patterns, e.g., `stft(Tensor, int, int)`. (some other calling patterns will raise error).
soumith and I plan to change the current `stft` interface because it is a bit messy and non-standard. rafaelvalle suggested us that `librosa` is a good reference API to align with. After discussing with soumith and ezyang , and given that `stft` is only out for 1 release, I decide to go with directly changing the signature. Also, my understanding is that most researchers in this field will welcome this change as `librosa` seems to be the golden-standard here. (it doesn't yet support all `pad_mode` but those will become available if added to `F.pad`.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9308
Reviewed By: ezyang
Differential Revision: D8806148
Pulled By: SsnL
fbshipit-source-id: f6e8777d0c34d4a4d7024e638dc9c63242e8bb58
Summary:
test_cuda.py uses routine 'number' to prepare many testscases.
number should return a floating point value for float-type tensor
types, or integer otherwise. But number's test to classify the type
is incorrect, so it always returns the integer value.
(type(t).__name__ is always 'torch.tensortype' so never matches
'Double', 'Float', or 'Half'.)
Update number to use the existing is_floating() helper to make the
check.
The change to number causes a few tests to fail for HalfTensor. Relax
the tolerance for those in line with other HalfTensor testcases. The
failing tests--for addcdiv and fill--were not previously relaxed for
HalfTensor so are held to the over-strict 1e-5 default tolerance.
Finally, update a couple other tests for HalfTensor type to use the
existing is_half() helper.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9475
Reviewed By: yf225
Differential Revision: D8872112
Pulled By: ezyang
fbshipit-source-id: 016e3e15adb23f6606bd4c08218954c1396699db
Summary:
1. Let `ModuleTest` raise when they fail on non-contiguous inputs. Fix legacy modules.
2. Fix BN (both THNN and cuDNN) not working on non-contiguous inputs.
3. Fix CUDA EmbeddingBag not working on non-contiguous inputs. To prevent calling `.contiguous()` on in both `forward` and `backward`,
a. prefix all current `embedding_bag*` functions with `_`, indicating that they require input to be contiguous (there is a check in each function).
b. create `embedding_bag`, which makes input arguments `.contiguous()`, and calls `_embedding_bag`
3. Make many ATen `embedding*` functions to work on non-contiguous inputs so we don't need to call `input = input.contiguous()` in Python `nn.functional.embedding`.
4. Fix dense-sparse addition when the sparse input is not coalesced and indices or values tensor is not contiguous. This came up in the test cases of Embedding modules with `sparse=True`. Added tests.
5. Update `TensorUtils.cpp` to use `AT_*` macros.
Request:
review from cpuhrsch on the `Embedding*` changes.
review from ezyang on ATen sparse & BN changes.
Closes https://github.com/pytorch/pytorch/pull/9114
Differential Revision: D8717299
Pulled By: SsnL
fbshipit-source-id: 0acc6f1c9522b5b605361e75112c16bbe1e98527
Summary:
This will resolve some of the timeout issues in CPU and GPU tests internally.
Closes https://github.com/pytorch/pytorch/pull/9061
Reviewed By: ezyang
Differential Revision: D8707471
Pulled By: yf225
fbshipit-source-id: 9dc82a2c9da0c540ae015442f74b9b2b1a67a246