Summary:
Rehash of https://github.com/pytorch/pytorch/issues/22322.
Given that Python 2.7 will be EOL'd on Jan 1, 2020 and we have models depending on Python 3.5+, we'd like to update the ROCm CI across the board to Python 3.6.
This PR adds the test skips and some semantic changes for PyTorch.
Compared to #22322, the Python find step in the PyTorch build now has a pattern-match skip for anything but the ROCm CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23088
Differential Revision: D16448261
Pulled By: bddppq
fbshipit-source-id: 69ece1a213418d9abf1444c496dce1c190ee07c8
Summary:
Given that Python 2.7 will be EOL'd on Jan 1, 2020 and we have models depending on Python 3.5+, we'd like to update the ROCm CI across the board to Python 3.6.
This PR adds the test skips and some semantic changes for PyTorch.
Open tasks/questions:
* RoiAlignTest.CheckCPUGPUEqual fails in the Caffe2 unit tests. Is this expected / can it be skipped?
* For testing, I've used update-alternatives on CentOS/Ubuntu to select python == python 3.6. Is this the preferred way?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22322
Differential Revision: D16199862
Pulled By: ezyang
fbshipit-source-id: 46ca6029a232f7d23f3fdb5efc33ae39a379fca8
Summary:
Changelog:
- Port SVD TH implementation to ATen/native/BatchLinearAlgebra.cpp
- Port SVD THC implementation to ATen/native/cuda/BatchLinearAlgebra.cu
- Allow batches of matrices as arguments to `torch.svd`
- Remove existing implementations in TH and THC
- Update doc string
- Update derivatives to support batching
- Modify nuclear norm implementation to use at::svd instead of _batch_svd
- Remove _batch_svd as it is redundant
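For illustration, a minimal sketch of the batched interface described above (shapes and tolerances are arbitrary):
```python
import torch

# A batch of 3 random 5x5 matrices, decomposed in one call.
A = torch.randn(3, 5, 5)
U, S, V = torch.svd(A)

# Reconstruct each matrix as U diag(S) V^T and compare against the input.
recon = U @ torch.diag_embed(S) @ V.transpose(-2, -1)
print(torch.allclose(recon, A, atol=1e-5))  # True
```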
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21588
Test Plan:
- Add new test suite for SVD in test_torch.py with port to test_cuda.py
- Add tests in common_methods_invocations.py for derivative testing
Differential Revision: D16266115
Pulled By: nairbv
fbshipit-source-id: e89bb0dbd8f2d58bd758b7830d2389c477aa61fb
Summary:
Some of my qpth users have told me that updating to the latest version of PyTorch and replacing the btrifact/btrisolve calls with the LU ones wasn't working, and I didn't believe them until I tried it myself :)
These updates broke unpivoted LU factorizations/solves on CUDA. The LU factorization code used to return the identity permutation when pivoting wasn't used, but now returns all zeros as the pivots. This PR reverts it to returning the identity permutation. I've not yet tested this code, as I'm having some trouble compiling PyTorch and am hitting https://github.com/pytorch/pytorch/issues/21700, and I'm not sure how to disable that option.
Here's an MWE that reproduces the broken behavior, along with my fix.
```python
import torch

torch.manual_seed(0)
n = 4
L = torch.randn(n, n)
A = L.mm(L.t()).unsqueeze(0)  # a single SPD matrix with a batch dimension
b = torch.randn(1, n)

A_lu_cpu = torch.lu(A)
A_lu_cuda_nopivot = torch.lu(A.cuda(), pivot=False)
A_lu_cuda_pivot = torch.lu(A.cuda(), pivot=True)
print('A_lu_cuda_nopivot\n', A_lu_cuda_nopivot)
# nb: this reprints the nopivot tuple (a copy-paste slip kept as-is so the
# output below matches what was actually printed)
print('-----\nA_lu_cuda_pivot\n', A_lu_cuda_nopivot)

x_cpu = b.lu_solve(*A_lu_cpu)
x_cuda_nopivot = b.cuda().lu_solve(*A_lu_cuda_nopivot)
# the fix: substitute the identity permutation for the all-zero pivots
x_cuda_nopivot_fixed = b.cuda().lu_solve(
    A_lu_cuda_nopivot[0], torch.arange(1, n + 1, device='cuda:0').int())
x_cuda_pivot = b.cuda().lu_solve(*A_lu_cuda_pivot)
print(x_cpu, x_cuda_nopivot, x_cuda_nopivot_fixed, x_cuda_pivot)
```
Output:
```
A_lu_cuda_nopivot
(tensor([[[ 2.8465, -0.7560, 0.8716, -1.7337],
[-0.2656, 5.5724, -1.1316, 0.6678],
[ 0.3062, -0.2031, 1.4206, -0.5438],
[-0.6091, 0.1198, -0.3828, 1.5103]]], device='cuda:0'), tensor([[0, 0, 0, 0]], device='cuda:0', dtype=torch.int32))
-----
A_lu_cuda_pivot
(tensor([[[ 2.8465, -0.7560, 0.8716, -1.7337],
[-0.2656, 5.5724, -1.1316, 0.6678],
[ 0.3062, -0.2031, 1.4206, -0.5438],
[-0.6091, 0.1198, -0.3828, 1.5103]]], device='cuda:0'), tensor([[0, 0, 0, 0]], device='cuda:0', dtype=torch.int32))
(tensor([[-0.3121, -0.1673, -0.4450, -0.2483]]),
tensor([[-0.1661, -0.1875, -0.5694, -0.4772]], device='cuda:0'),
tensor([[-0.3121, -0.1673, -0.4450, -0.2483]], device='cuda:0'),
tensor([[-0.3121, -0.1673, -0.4450, -0.2483]], device='cuda:0'))
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22242
Differential Revision: D16049334
Pulled By: ezyang
fbshipit-source-id: 7eacae810d87ffbdf8e07159bbbc03866dd9979d
Summary:
Try to fix a sporadic failure on some CIs.
I've run this test hundreds of times on my machine (GeForce 1060, MAGMA), but I cannot reproduce the failure.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21638
Differential Revision: D15827779
Pulled By: ezyang
fbshipit-source-id: 3586075e48907b3b84a101c560a34cc733514a02
Summary:
This PR covers two important points with respect to the QR decomposition:
- batching of input matrices (#7500)
- adding `some` as an option in `torch.qr` akin to NumPy's `mode` option (#10538)
Changelog:
- Enable batching for inputs to `torch.qr`
- Move QR decomposition implementation to ATen (CPU and CUDA)
- Remove existing implementations in TH/THC
- Add a `some` option to `torch.qr` that will enable users to switch between complete and reduced decomposition
- Modify doc strings
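As a quick sketch of the new interface (shapes are arbitrary):
```python
import torch

# A batch of 4 tall 6x3 matrices.
A = torch.randn(4, 6, 3)

Q, R = torch.qr(A)                # reduced (some=True, the default): Q is 4x6x3, R is 4x3x3
Qc, Rc = torch.qr(A, some=False)  # complete: Q is 4x6x6, R is 4x6x3

print(torch.allclose(Q @ R, A, atol=1e-5))    # True
print(torch.allclose(Qc @ Rc, A, atol=1e-5))  # True
```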
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20689
Differential Revision: D15529230
Pulled By: soumith
fbshipit-source-id: 16af82b1d2db8a3a758fa8a5f798d83f5f950efb
Summary:
This is #20919 without the changes to aten/src/THC/THCIntegerDivider.cuh
that broke the ROCm build.
cc bddppq
Original summary:
This fixes advanced indexing in cases where there's more than 2^31-1
bytes in the output. The `gpu_index_kernel` was missing the
`can_use_32bit_indexing`/`with_32bit_indexing` check.
This also adds a number of TORCH_INTERNAL_ASSERTs in Loops.cuh,
OffsetCalculator, and IntDivider checking that sizes fit in a signed 32-bit
integer.
More comprehensive tests that require a 32 GB GPU are here:
https://gist.github.com/colesbury/e29387f5851521256dff562be07b981e
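A smaller illustrative sketch of the failure mode (an assumption-laden repro, not the linked tests: it needs a GPU with roughly 10 GB free, since the output of `src[idx]` is just over 2^31 bytes):
```python
import torch

n = 2**29 + 1                                     # float32 output is just over 2 GiB
src = torch.randn(n, device='cuda')
idx = torch.arange(n - 1, -1, -1, device='cuda')  # reversed indices

out = src[idx]                                    # advanced indexing (gpu_index_kernel)
assert torch.equal(out.cpu(), src.cpu()[idx.cpu()])
```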
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21019
Differential Revision: D15518477
Pulled By: colesbury
fbshipit-source-id: 4db5626fda76eb58250793e8aa7d4f2832db3a34
Summary:
This fixes advanced indexing in cases where there's more than 2^31-1
bytes in the output. The `gpu_index_kernel` was missing the
`can_use_32bit_indexing`/`with_32bit_indexing` check.
This also adds a number of TORCH_INTERNAL_ASSERTs in Loops.cuh,
OffsetCalculator, and IntDivider checking that sizes fit in a signed 32-bit
integer.
More comprehensive tests that require a 32 GB GPU are here:
https://gist.github.com/colesbury/e29387f5851521256dff562be07b981e
Fixes #20888
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20919
Differential Revision: D15501945
Pulled By: colesbury
fbshipit-source-id: e876e678e866d2efda8ee92c47a1d2d1310671f0
Summary:
This PR also moves Device::validate into the header file, which makes
statements like `Device d = kCPU` effectively free.
Device includes the device's index, so TensorIterator::compute_types
now implicitly checks that all CUDA inputs are on the same GPU.
Previously, this was done ad-hoc in places like TensorIterator::binary_op.
Note that zero-dim Tensors (scalars) are NOT required to be on the
same device as other inputs because they behave almost like Python numbers.
TensorIterator handles copying zero-dim Tensors to the common device.
Prior to this PR, TensorIterator would copy zero-dim Tensors between CPU
and GPU, but not between different GPUs (because Backend didn't encode
the GPU index). This removes that restriction.
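A small sketch of the resulting behavior (the second half assumes a machine with at least two GPUs):
```python
import torch

a = torch.randn(3, device='cuda:0')
s = torch.tensor(2.0)  # zero-dim CPU tensor, treated like a Python number
print(a * s)           # OK: the scalar is copied to the common device

b = torch.randn(3, device='cuda:1')
try:
    a + b              # non-scalar inputs must share a device
except RuntimeError as e:
    print(e)
```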
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20690
Differential Revision: D15414826
Pulled By: colesbury
fbshipit-source-id: 1d0ad1f7d663252af36dd4590bcda418c2f7a09f
Summary:
Copy.cu goes from 308 to 190 lines of code. In general it uses the same
copy strategy: cudaMemcpyAsync, a pointwise kernel, or a copy
using temporary buffers. The pointwise kernel has slightly improved
performance when broadcasting due to faster index calculation.
This deletes `s_copy_`, `_s_copy_from`, and `_copy_same_type_`. The only
entry point now is `copy_`.
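A minimal sketch of the single entry point, including the broadcasting case benchmarked below:
```python
import torch

src = torch.randn(1024, device='cuda')
dst = torch.empty(1024, 1024, device='cuda')
dst.copy_(src)                   # broadcasting copy: src is expanded across dim 0
print(torch.equal(dst[5], src))  # True

cpu = torch.empty(1024, 1024)
cpu.copy_(dst)                   # cross-device copy (device-to-host)
```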
A mini-benchmark is here:
https://gist.github.com/colesbury/706de1d4e8260afe046020988410b992
Before:
https://gist.github.com/colesbury/ab454b6fe3791bff420d7bcf8c041f18
After:
https://gist.github.com/colesbury/9024d242b56ab09a9ec985fa6d1620bc
Results were measured on 2.2 GHz Broadwell; no-turbo; one thread;
compiled with GCC 7.3.0. (Results are slower than typical usage due to
turbo being off.)
The only significant difference is in the CUDA [1024] -> [1024, 1024]
broadcasting copy, which is ~25% faster. I don't expect a noticeable
difference in real programs.
CPU copy overhead is a tiny bit (~200 ns) faster, but I don't expect
anyone to notice that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20685
Differential Revision: D15414819
Pulled By: colesbury
fbshipit-source-id: d3c6e04a5020470e3bef15b1fc09503cae5df440
Summary:
Add base support for torch.logspace. See #19220 for details.
SsnL, could you give feedback? Thanks a lot.
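For reference, a minimal usage sketch:
```python
import torch

# Five points spaced evenly on a log scale between 10**0 and 10**2.
print(torch.logspace(0, 2, steps=5))
# tensor([  1.0000,   3.1623,  10.0000,  31.6228, 100.0000])
```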
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19542
Differential Revision: D15028484
Pulled By: soumith
fbshipit-source-id: fe5a58a203b279103abbc192c754c25d5031498e
Summary:
Changelog:
- Rename `potri` to `cholesky_inverse` to remain consistent with names of `cholesky` methods (`cholesky`, `cholesky_solve`)
- Fix all callsites
- Rename all tests
- Create a tentative alias for `cholesky_inverse` under the name `potri` and add a deprecation warning to not promote usage
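A short sketch of the renamed API (the positive-definite construction is arbitrary):
```python
import torch

A = torch.randn(3, 3)
A = A @ A.t() + 3 * torch.eye(3)  # make A symmetric positive definite
L = torch.cholesky(A)             # lower-triangular Cholesky factor
Ainv = torch.cholesky_inverse(L)  # formerly potri
print(torch.allclose(Ainv, A.inverse(), atol=1e-5))  # True
```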
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19498
Differential Revision: D15029901
Pulled By: ezyang
fbshipit-source-id: 2074286dc93d8744cdc9a45d54644fe57df3a57a
Summary:
This adds checks for `mul_`, `add_`, `sub_`, `div_`, the most common
binops. See #17935 for more details.
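A hedged sketch of the kind of case these checks catch (the exact error message may differ):
```python
import torch

x = torch.randn(1, 3).expand(4, 3)  # all 4 rows alias the same storage
try:
    x.add_(1)                       # in-place binop on overlapping memory
except RuntimeError as e:
    print(e)                        # errors out instead of writing garbage
```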
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19317
Differential Revision: D14972399
Pulled By: zou3519
fbshipit-source-id: b9de331dbdb2544ee859ded725a5b5659bfd11d2
Summary:
Unit tests that hung on clock64() calls are now fixed, as is
test_gamma_gpu_sample.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19307
Differential Revision: D14953420
Pulled By: bddppq
fbshipit-source-id: efe807b54e047578415eb1b1e03f8ad44ea27c13
Summary:
The caching allocator tries to free all blocks on an out-of-memory
error. Previously, it did not free blocks that still had outstanding
stream uses. This change synchronizes on the outstanding events and
frees those blocks.
See #19219
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19222
Differential Revision: D14925071
Pulled By: colesbury
fbshipit-source-id: a2e9fe957ec11b00ea8e6c0468436c519667c558
Summary:
Enable multi-GPU tests that work with ROCm 2.2. They have been run three times on CI to ensure stability.
While there, remove skipIfRocm annotations for tests that depend on MAGMA. They still skip, but now for the correct reason (no MAGMA), which improves our diagnostics.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19169
Differential Revision: D14924812
Pulled By: bddppq
fbshipit-source-id: 8b88f58bba58a08ddcd439e899a0abc6198fef64
Summary:
Changelog:
- Rename `btrisolve` to `lu_solve` to remain consistent with names of solve methods (`cholesky_solve`, `triangular_solve`, `solve`)
- Fix all callsites
- Rename all tests
- Create a tentative alias for `lu_solve` under the name `btrisolve` and add a deprecation warning to not promote usage
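A minimal sketch of the renamed API (shapes are arbitrary):
```python
import torch

A = torch.randn(2, 3, 3)
b = torch.randn(2, 3, 1)
LU, pivots = torch.lu(A)
x = torch.lu_solve(b, LU, pivots)  # new name; btrisolve is a deprecated alias
print(torch.allclose(A @ x, b, atol=1e-5))  # True
```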
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18726
Differential Revision: D14726237
Pulled By: zou3519
fbshipit-source-id: bf25f6c79062183a4153015e0ec7ebab2c8b986b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18598
ghimport-source-id: c74597e5e7437e94a43c163cee0639b20d0d0c6a
Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#18598 Turn on F401: Unused import warning.**
This was requested by someone at Facebook; this lint is turned
on for Facebook by default. "Sure, why not."
I had to noqa a number of imports in __init__. Hypothetically
we're supposed to use __all__ in this case, but I was too lazy
to fix it. Left for future work.
Be careful! flake8-2 and flake8-3 behave differently with
respect to import resolution for `# type:` comments. flake8-3 will
report the import as unused; flake8-2 will not. For now, I just
noqa'd all these sites.
All the changes were done by hand.
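A hypothetical example of the pattern in question (names are illustrative):
```python
from typing import Optional  # noqa: F401 -- only used in the "# type:" comment below


def bump(x):
    # type: (Optional[int]) -> int
    # flake8-3 would flag the import above as unused; flake8-2 would not.
    return (x or 0) + 1
```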
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Differential Revision: D14687478
fbshipit-source-id: 30d532381e914091aadfa0d2a5a89404819663e3
Summary:
Changelog:
- Renames `btriunpack` to `lu_unpack` to remain consistent with the `lu` function interface.
- Rename all relevant tests, fix callsites
- Create a tentative alias for `lu_unpack` under the name `btriunpack` and add a deprecation warning to not promote usage.
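A minimal sketch of the renamed API:
```python
import torch

A = torch.randn(2, 3, 3)
LU, pivots = torch.lu(A)
P, L, U = torch.lu_unpack(LU, pivots)  # new name; btriunpack is a deprecated alias
print(torch.allclose(P @ L @ U, A, atol=1e-5))  # True
```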
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18529
Differential Revision: D14683161
Pulled By: soumith
fbshipit-source-id: 994287eaa15c50fd74c2f1c7646edfc61e8099b1
Summary:
Changelog:
- Renames `btrifact` and `btrifact_with_info` to `lu` to remain consistent with other factorization methods (`qr` and `svd`).
- Now, we will only have one function and method named `lu`, which performs the LU decomposition. This function takes a `get_infos` kwarg, which when set to True includes an infos tensor in the returned tuple.
- Rename all tests, fix callsites
- Create a tentative alias for `lu` under the name `btrifact` and `btrifact_with_info`, and add a deprecation warning to not promote usage.
- Add the single batch version for `lu` so that users don't have to unsqueeze and squeeze for a single square matrix (see changes in determinant computation in `LinearAlgebra.cpp`)
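A minimal sketch of the consolidated interface described above (shapes are arbitrary):
```python
import torch

A = torch.randn(2, 3, 3)
LU, pivots = torch.lu(A)                         # replaces btrifact
LU, pivots, infos = torch.lu(A, get_infos=True)  # replaces btrifact_with_info
print(infos)  # zeros mean each factorization succeeded

# A single square matrix now works without unsqueeze/squeeze.
LU1, piv1 = torch.lu(torch.randn(3, 3))
```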
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18435
Differential Revision: D14680352
Pulled By: soumith
fbshipit-source-id: af58dfc11fa53d9e8e0318c720beaf5502978cd8
Summary:
Enable unit tests working with ROCm 2.3. In particular, these are unit tests where we skipped for double data types previously and some tests for multi-GPU setups.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18537
Differential Revision: D14651822
Pulled By: ezyang
fbshipit-source-id: 7dd575504ebe235a91489866c91000e9754b1235
Summary:
Changelog:
- Renames `trtrs` to `triangular_solve` to remain consistent with `cholesky_solve` and `solve`.
- Rename all tests, fix callsites
- Create a tentative alias for `triangular_solve` under the name `trtrs`, and add a deprecation warning to not promote usage.
- Move `isnan` to _torch_docs.py
- Remove unnecessary imports
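A minimal sketch of the renamed API (the well-conditioned triangular matrix is an arbitrary choice):
```python
import torch

A = 3 * torch.eye(3) + torch.randn(3, 3).tril(-1)  # lower triangular, nonzero diagonal
b = torch.randn(3, 2)
x, _ = torch.triangular_solve(b, A, upper=False)   # new name; trtrs is a deprecated alias
print(torch.allclose(A @ x, b, atol=1e-5))  # True
```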
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18213
Differential Revision: D14566902
Pulled By: ezyang
fbshipit-source-id: 544f57c29477df391bacd5de700bed1add456d3f
Summary:
- Remove single batch TH/THC implementations
- Remove `_batch_trtrs_lower` from `multivariate_normal`
- Add tests for batched behavior
- Modify trtrs_backward to accommodate the batched case
- Modify docs
In a future PR, this will be renamed to `triangular_solve`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18025
Differential Revision: D14523004
Pulled By: ifedan
fbshipit-source-id: 11c6a967d107f969b60e5a5c73ce6bb8099ebbe1
Summary:
Changelog:
- Renames `gesv` to `solve` to remain consistent with `cholesky_solve`.
- Rename all tests, fix callsites
- Create a tentative alias for `solve` under the name `gesv`, and add a deprecation warning to not promote usage.
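A minimal sketch of the renamed API (shapes are arbitrary):
```python
import torch

A = torch.randn(2, 3, 3)
b = torch.randn(2, 3, 1)
x, LU = torch.solve(b, A)  # new name; gesv is a deprecated alias
print(torch.allclose(A @ x, b, atol=1e-5))  # True
```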
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18060
Differential Revision: D14503117
Pulled By: zou3519
fbshipit-source-id: 99c16d94e5970a19d7584b5915f051c030d49ff5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17927
ghimport-source-id: 626d321e430b6b5c0ea3aa1eb9df8c1e2d058bf8
Stack:
* #17926 Implement at::has_internal_overlap helper function
* **#17927 Error out on in-place (unary) ops on tensors that have internal overlap**
On the way to #17935.
Works for CPU and CUDA on the following ops:
- abs_, acos_, asin_, atan_, ceil_, cos_, erf_, erfc_, exp_, expm1_
- floor_, log_, log10_, log1p_, log2_, round_, rsqrt_,
- sin_, sqrt_, tan_, tanh_, trunc_
This PR adds a check to see if the out/result tensor has internal
overlap. If it does, then we error out because the result **may** be
incorrect.
This is overly conservative; there are some cases where, if the result is
the same as the input, the in-place operation is OK (such as floor_,
round_, and trunc_). However, the current code isn't organized in a way
that makes this easy to check, so enabling those cases is left for future work.
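A hedged sketch of the check in action (the exact error wording may differ):
```python
import torch

x = torch.empty(1, 4).expand(3, 4)  # internal overlap: rows share memory
try:
    x.cos_()                        # in-place unary op on overlapping memory
except RuntimeError as e:
    print(e)                        # errors out because the result may be wrong
```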
Reviewed By: ezyang
Differential Revision: D14438871
fbshipit-source-id: 15e12bf1fdb2ab7f74bb806e22bc74840bd6abd1
Summary:
ROCm 2.2 was released today. If we respin the CI docker images with the attached changes, PyTorch/Caffe2 will support ROCm 2.2.
Changes necessary:
* for the Ubuntu target, HIP PR 934 needs to be applied to fix the forceinline definition. ROCm 2.3 will contain this fix.
* two unit tests prove flaky on different platforms; disable them defensively.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18007
Differential Revision: D14473903
Pulled By: bddppq
fbshipit-source-id: b1939f11d1c765a3bf71bb244b15f6ceb0e816d3
Summary:
This PR makes kthvalue consistent with sort
(i.e., NaN is treated as larger than any number), so that
`a.kthvalue(n) == a.sort()[n - 1]`.
One drawback is that median with a NaN argument does not return NaN,
which is a deviation from NumPy.
Thank you, ngimel, for raising this.
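A small sketch of the new contract (values are illustrative):
```python
import torch

a = torch.tensor([2.0, float('nan'), 1.0])
values, _ = a.sort()
print(values)    # tensor([1., 2., nan]) -- NaN sorts as the largest element
val, _ = a.kthvalue(3)
print(val)       # tensor(nan) -- now consistent with sort
```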
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17824
Differential Revision: D14410092
Pulled By: ezyang
fbshipit-source-id: bdec2d8272dc4c65bcf2f9b8995e237774c44c02