Commit Graph

110 Commits

Edward Yang
1f36ce6e4d Restore storage on meta tensors; increase meta coverage (#53973)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53973

Two parts to this PR; I had to put them together because adding support for X causes more test code to be exercised, which in turn may require a fix for Y.

The first part is restoring the concept of storage to meta tensors. Previously, meta tensors had a nullptr storage (e.g., `meta_tensor.storage()` was an error). As I was increasing the coverage of meta tensors, I started running into test cases (specifically memory overlap tests) that were failing because not having storage meant I couldn't check for memory overlap. After some discussion, we decided that it would make sense for meta tensors to model this as well (we already model strides, so getting accurate view information also seems useful). This PR does that by:

* Rewrite all of the factory functions in MetaTensor.cpp to use the generic versions (which are very carefully written not to actually poke at the data pointer, so everything works out). The key idea here is that we give meta tensors a special allocator, MetaAllocator, which always returns a nullptr even if you ask for a nonzero number of bytes. resize_ is also made generic; the normal variant can be used directly rather than having to instruct it to avoid resizing storage.
* Turn on memory overlap checking in TensorIterator even for meta tensors
* Although meta tensors now have storage, the concept of meta storage is NOT exposed to Python land (as it would imply I would have to codegen MetaFloatStorage, MetaDoubleStorage, etc. classes). So `x.storage()` still raises an error, and I have a kludge in `__deepcopy__` to break storage sharing upon deep copy (this is wrong, but no tests exercise this at the moment).

The second part is adding more support for the most used functions in the test suite.

* Inplace operations have very simple meta functions. I added `fill_`, `zero_`, `random_`, `uniform_` and `normal_`. In the case of random, I take advantage of pbelevich's templates for defining random kernels, so that I can reuse the common scaffolding, and then just register a noop stub that actually does the RNG. (Look, another structured kernels tiny variant!)
* `copy_` is now implemented. Copying into a meta tensor is always OK, but copying out of a meta tensor raises an error, as we don't know what the "correct" data to copy out is in this case (see the sketch after this list)
* `empty_strided` usage from structured kernels now is implemented (TBH, this could have been done as soon as `empty_strided` was added)
* Meta was missing in a few places in TensorOptions/DispatchKey utility functions, so I added them
* The autograd engine now correctly places meta tensors alongside CPU tensors (they have a -1 device index, so CUDA queues wouldn't work anyway)
* `apply_`, `map_` and `map2_` are special cased to no-op on meta tensor self. These count as inplace operations too but they are implemented a little differently.
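
A minimal sketch of these `copy_` semantics (the exact exception type is an assumption; the PR only specifies that copying out raises an error):

```python
import torch

meta = torch.empty(2, 3, device="meta")  # shape/stride/dtype only, no data
cpu = torch.randn(2, 3)

meta.copy_(cpu)       # OK: copying *into* a meta tensor always succeeds
try:
    cpu.copy_(meta)   # copying *out of* a meta tensor raises
except RuntimeError as e:
    print("copy out of meta failed:", e)
```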

Getting more meta function support triggers a number of bugs in the test suite, which I then fix:

- Linear algebra functions sometimes don't report NotImplementedError because it gets swallowed by catch-all try blocks. This is tracked in https://github.com/pytorch/pytorch/issues/53739
- dlpack obviously doesn't work with meta tensors, so I just disabled the test

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision: D27036572

Test Plan: Imported from OSS

Reviewed By: agolynski, bdhirsh

Pulled By: ezyang

fbshipit-source-id: 7005ecf4feb92a643c37389fdfbd852dbf00ac78
2021-03-29 08:37:46 -07:00
Heitor Schueroff
f9e7f132fb Added torch.linalg.matrix_power (#52608)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52608

**TODO**

- [x] Add OpInfo
- [x] Update documentation
- [x] Add more tests and compare against NumPy
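
For reference, a quick usage sketch of the new function, which mirrors `numpy.linalg.matrix_power`:

```python
import torch

A = torch.randn(3, 3, dtype=torch.float64)

torch.linalg.matrix_power(A, 3)    # A @ A @ A
torch.linalg.matrix_power(A, 0)    # identity matrix
torch.linalg.matrix_power(A, -2)   # inv(A) @ inv(A); requires an invertible A
```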

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D27261532

Pulled By: heitorschueroff

fbshipit-source-id: c1e4ab297da3683f6d5751be8790602f9dc37b6b
2021-03-23 15:10:06 -07:00
Mike Ruberry
544a996f83 Revert D27155845: [pytorch][PR] Fixed the size of the workspace array in functions calling MAGMA
Test Plan: revert-hammer

Differential Revision:
D27155845 (04a2506091)

Original commit changeset: 04439bfa82a5

fbshipit-source-id: f45967e94883effbb43d8d0a019596f1f82caa56
2021-03-19 08:27:18 -07:00
Ivan Yashchuk
04a2506091 Fixed the size of the workspace array in functions calling MAGMA (#54009)
Summary:
The size of the workspace arrays should not be less than 1. This PR fixes lstsq calls to LAPACK and MAGMA. Also `max(1, ...)` guards were added to a few other functions (symeig, svd).
ROCm testing is enabled for lstsq, pinv, pinverse.

Fixes https://github.com/pytorch/pytorch/issues/53976

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54009

Reviewed By: ejguan

Differential Revision: D27155845

Pulled By: mruberry

fbshipit-source-id: 04439bfa82a5bdbe2297a6d62b6e68ba1c30e4a2
2021-03-18 10:07:45 -07:00
Kurt Mohler
382a47b493 Add torch.linalg.vector_norm function (#51099)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/50214
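
A short usage sketch of the new function:

```python
import torch

x = torch.randn(4, 4)

torch.linalg.vector_norm(x)                # 2-norm over all elements (input is flattened)
torch.linalg.vector_norm(x, ord=1)         # 1-norm
torch.linalg.vector_norm(x, ord=2, dim=0)  # per-column 2-norms
```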

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51099

Reviewed By: agolynski

Differential Revision: D27147360

Pulled By: mruberry

fbshipit-source-id: 1056f840e7027ad81971c9d1a9f952ab9648f1b5
2021-03-18 06:41:39 -07:00
Ivan Yashchuk
564456ac44 Added autograd support for torch.orgqr (#52637)
Summary:
This PR adds autograd support for `torch.orgqr`.

Since `torch.orgqr` is one of the few functions that still expose LAPACK's naming, and all other linear algebra routines were renamed a long time ago, I also added a new function with a new name; `torch.orgqr` is now an alias for it.

The new proposed name is `householder_product`. For a matrix `input` and a vector `tau`, LAPACK's orgqr operation takes the columns of `input` (called Householder vectors or elementary reflectors) and the scalars of `tau`, which together represent Householder matrices, and computes the product of these matrices. See https://www.netlib.org/lapack/lug/node128.html.
Other linear algebra libraries that I'm aware of do not expose this LAPACK function, so there is some freedom in naming it. It is usually used internally only for QR decomposition, but it can be useful for deep learning tasks now that it supports differentiation.
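
A small sketch of the reflector representation and the differentiable product, assuming the proposed name lands as `torch.linalg.householder_product`:

```python
import torch

A = torch.randn(5, 3, dtype=torch.float64)
reflectors, tau = torch.geqrf(A)  # Householder representation of A's QR factorization
reflectors.requires_grad_(True)
tau.requires_grad_(True)

Q = torch.linalg.householder_product(reflectors, tau)  # torch.orgqr is now an alias

# Q has orthonormal columns by construction
assert torch.allclose(Q.T @ Q, torch.eye(3, dtype=torch.float64), atol=1e-10)
Q.sum().backward()  # autograd support added by this PR
```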

Resolves https://github.com/pytorch/pytorch/issues/50104

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52637

Reviewed By: agolynski

Differential Revision: D27114246

Pulled By: mruberry

fbshipit-source-id: 9ab51efe52aec7c137aa018c7bd486297e4111ce
2021-03-18 05:42:18 -07:00
Edward Yang
c2f41b6b84 Add meta device to generic device testing framework, skip NotImplementedError (#53682)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53682

With this, under the meta device, 101 tests passed and 16953 skipped.
It ain't much, but it's a start.

Some various bits and bobs:
- NotImplementedError suppression at test level is implemented
  in the same way as CUDA memory leak check, i.e., by wrapping
  test methods and monkeypatching them back in.
- I had to reimplement assertRaises/assertRaisesRegex from scratch to
  ignore NotImplementedError when _ignore_not_implemented_error is True.
  The implementation relies on a small amount of private API that hasn't
  changed since 2010.
- expectedAlertNondeterministic doesn't really work, so I skipped them
  all; there's probably a better way to do it

I tested this using `pytest --disable-warnings --tb=native -k meta --sw
test/*.py` and a pile of extra patches to make collection actually work
(lol).

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D26955539

Pulled By: ezyang

fbshipit-source-id: ac21c8734562497fdcca3b614a28010bc4c03d74
2021-03-14 20:41:19 -07:00
Mike Ruberry
319ab58e27 Skips test_linalg_lstsq on ROCm (#53977)
Summary:
This test is flaky (tracked in https://github.com/pytorch/pytorch/issues/53976). This PR skips it to let the rest of the ROCm CI run.

cc nikitaved

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53977

Reviewed By: ngimel

Differential Revision: D27036705

Pulled By: mruberry

fbshipit-source-id: 5bae741fd2a68f23717cb3a7c8b73e97cfb23b5c
2021-03-14 05:42:39 -07:00
Ivan Yashchuk
7df176b1f9 Added OpInfo-based testing of some linalg functions (#51107)
Summary:
Added OpInfo-based testing of the following linear algebra functions:
* cholesky, linalg.cholesky
* linalg.eigh
* inverse, linalg.inv
* qr, linalg.qr
* solve

The output of `torch.linalg.pinv` for empty inputs was not differentiable, now it's fixed.

In some cases, batched grad checks are disabled because it doesn't work well with 0x0 matrices (see https://github.com/pytorch/pytorch/issues/50743#issuecomment-767376085).

Ref. https://github.com/pytorch/pytorch/issues/50006

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51107

Reviewed By: albanD

Differential Revision: D27006115

Pulled By: mruberry

fbshipit-source-id: 3c1d00e3d506948da25d612fb114e6d4a478c5b1
2021-03-14 01:10:02 -08:00
Mike Ruberry
d46978cc55 Refines test_orgqr_* skip (#53975)
Summary:
https://github.com/pytorch/pytorch/pull/51348 added CUDA support for orgqr but only a cuSOLVER path; the orgqr tests, however, were marked to run on builds with either MAGMA or cuSOLVER.

This PR addresses the issue by creating a skipCUDAIfNoCusolver decorator and applying it to the orgqr tests. It triggers ci-all because our CI build with MAGMA but no cuSOLVER is CUDA 9.2, which does not run in the typical PR CI.

cc IvanYashchuk

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53975

Reviewed By: ngimel

Differential Revision: D27036683

Pulled By: mruberry

fbshipit-source-id: f6c0a3e526bde08c44b119ed2ae5d51fee27e283
2021-03-14 00:41:26 -08:00
Ivan Yashchuk
fe08671756 Added cuBLAS path for torch.triangular_solve (#53147)
Summary:
This PR adds the cuBLAS-based path for `torch.triangular_solve`.
The device dispatching helper function was removed from native_functions.yml; it is replaced with DECLARE/DEFINE_DISPATCH.

`magmaTriangularSolve` is removed and replaced with cuBLAS calls; this is not a BC-breaking change because internally MAGMA just calls the same cuBLAS function and doesn't do anything else.

Batched cuBLAS is faster than batched MAGMA for matrices of size up to 512x512; after that, MAGMA is faster. For batches smaller than ~8 and matrix sizes larger than 64x64, a for-loop of cuBLAS calls is faster than the batched version (a sketch of this heuristic follows below).
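
A rough Python sketch of that dispatch heuristic (the function name and exact thresholds are illustrative paraphrases of the numbers quoted above; the real dispatch lives in ATen's C++ code):

```python
def pick_triangular_solve_backend(batch_size: int, n: int, has_magma: bool) -> str:
    # Illustrative only: not the actual implementation.
    if batch_size < 8 and n > 64:
        return "cublas_loop"      # for-loop of single-matrix cuBLAS calls
    if n <= 512 or not has_magma:
        return "cublas_batched"   # batched cuBLAS wins up to ~512x512
    return "magma_batched"        # batched MAGMA wins for larger matrices
```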

Ref. https://github.com/pytorch/pytorch/issues/47953

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53147

Reviewed By: heitorschueroff

Differential Revision: D27007416

Pulled By: mruberry

fbshipit-source-id: ddfc190346e6a56b84145ed0a9af67ca9cde3506
2021-03-12 13:38:42 -08:00
Nikita Vedeneev
afa1ff8e04 Implements torch.linalg.lstsq (#49093)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44378 by providing a wider range of drivers, similar to what SciPy does.

The supported CPU drivers are `gels`, `gelsy`, `gelsd`, and `gelss`.
The CUDA interface implements only `gels`, and only for overdetermined systems.
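
A usage sketch with an explicit driver (per the list above; on CUDA only `gels` is accepted):

```python
import torch

A = torch.randn(6, 4, dtype=torch.float64)
B = torch.randn(6, 2, dtype=torch.float64)

# CPU drivers: "gels", "gelsy", "gelsd", "gelss"
result = torch.linalg.lstsq(A, B, driver="gelsd")
print(result.solution.shape)  # torch.Size([4, 2])
```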

The current state of this PR:
- [x] CPU interface
- [x] CUDA interface
- [x] CPU tests
- [x] CUDA tests
- [x] Memory-efficient batch-wise iteration with broadcasting which fixes https://github.com/pytorch/pytorch/issues/49252
- [x] docs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49093

Reviewed By: albanD

Differential Revision: D26991788

Pulled By: mruberry

fbshipit-source-id: 8af9ada979240b255402f55210c0af1cba6a0a3c
2021-03-12 13:25:55 -08:00
Nikita Vedeneev
8f15a2f052 eig_backward: faster and with complex support (#52875)
Summary:
As per title. Compared to the previous version, it makes lighter use of `at::solve` and `at::matmul`.

Fixes https://github.com/pytorch/pytorch/issues/51621

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52875

Reviewed By: mrshenli

Differential Revision: D26768653

Pulled By: anjali411

fbshipit-source-id: aab141968d02587440128003203fed4b94c4c655
2021-03-10 11:33:30 -08:00
Ivan Yashchuk
e937db5dba Added CUDA support for torch.orgqr (#51348)
Summary:
**Update:** MAGMA support was dropped from this PR. Only the cuSOLVER path is implemented and it's used exclusively.

**Original PR message:**

This PR adds support for CUDA inputs for `torch.orgqr`.

CUDA implementation is based on both [cuSOLVER](https://docs.nvidia.com/cuda/cusolver/index.html#cuSolverDN-lt-t-gt-orgqr) and MAGMA. cuSOLVER doesn't have a specialized routine for the batched case, while MAGMA doesn't have a specialized GPU-native (without CPU sync) `orgqr`. However, MAGMA has an implemented (but undocumented) batched GPU-native version of the `larft` function (for small inputs of size <= 32), which together with the `larfb` operation forms `orgqr` (see the call graph [here at the end of the page](http://www.netlib.org/lapack/explore-html/da/dba/group__double_o_t_h_e_rcomputational_ga14b45f7374dc8654073aa06879c1c459.html)).

So now there are two main codepaths for CUDA inputs (if both MAGMA and cuSOLVER are available):
* if `batchsize > 1` and `tau.shape[-1] <= 32`, then the MAGMA-based function is called
* else [cuSOLVER's `orgqr`](https://docs.nvidia.com/cuda/cusolver/index.html#cuSolverDN-lt-t-gt-orgqr) is used.

If MAGMA is not available then only cuSOLVER is used and vice versa.

Documentation updates and possibly a new name for this function will be in a follow-up PR.

Ref. https://github.com/pytorch/pytorch/issues/50104

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51348

Reviewed By: heitorschueroff

Differential Revision: D26882415

Pulled By: mruberry

fbshipit-source-id: 9f91ff962921932777ff108bedc133b55fe22842
2021-03-10 09:59:56 -08:00
mattip
54a2498919 Modify tests to use assertWarnsOnceRegex instead of maybeWarnsRegex (#52387)
Summary:
Related to https://github.com/pytorch/pytorch/issues/50006

Follow-on for https://github.com/pytorch/pytorch/issues/48560 to ensure TORCH_WARN_ONCE warnings are caught. Most of this is a straightforward find-and-replace, but I did find one place where the TORCH_WARN_ONCE warning was not wrapped into a Python warning.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52387

Reviewed By: albanD

Differential Revision: D26773387

Pulled By: mruberry

fbshipit-source-id: 5be7efbc8ab4a32ec8437c9c45f3b6c3c328f5dd
2021-03-08 03:32:14 -08:00
Peter Bell
5ebfabb310 MAGMA: Initialize ipiv data to avoid internal memory access violation (#53064)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/51930

Running the reproducer under `cuda-gdb`, I see access violations sometimes in [`zswap_kernel_batched`](4fd4634f35/magmablas/zgetf2_kernels.cu (lines-276)) (part of the LU factorization) and other times in [`zlaswp_columnserial_kernel`](4fd4634f35/magmablas/zlaswp_batched.cu (lines-335)) (part of the inverse).

The common factor between both of these is that they use `ipiv` to index into the matrix. My best guess is the `ipiv` indices aren't written when the factorization fails, hence garbage data is used as matrix indices and we get an access violation. Initializing `ipiv` to a known-good value before the factorization fixes the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53064

Reviewed By: zhangguanheng66

Differential Revision: D26829053

Pulled By: heitorschueroff

fbshipit-source-id: 842854a6ee182f20b2acad0d76d32d27cb51b061
2021-03-05 08:59:27 -08:00
Kyle Chen
bf5e5bf901 [ROCm] Enable test in test_linalg.py, test_optim.py and test_vmap.py … (#52818)
Summary:
Enable tests in test_linalg.py, test_optim.py, and test_vmap.py for ROCm because they now pass.

Signed-off-by: Kyle Chen <kylechen@amd.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52818

Reviewed By: H-Huang

Differential Revision: D26694091

Pulled By: mruberry

fbshipit-source-id: 285d17aa7f271f4d94b5fa9d9f6620de8a70847b
2021-03-04 02:29:45 -08:00
Mike Ruberry
9c2673df46 Revert D26723384: [pytorch][PR] Implements torch.linalg.lstsq
Test Plan: revert-hammer

Differential Revision:
D26723384 (3ac9013235)

Original commit changeset: c9866a95f140

fbshipit-source-id: 3e5263d71facdc91ca09d7dcbbbe3ba818ee2821
2021-03-03 15:24:25 -08:00
Mike Ruberry
20860ab01a Revert D26727918: [pytorch][PR] Added CUDA support for torch.orgqr
Test Plan: revert-hammer

Differential Revision:
D26727918 (e29d8477a6)

Original commit changeset: 1c4d15fa76ba

fbshipit-source-id: f3d5d6811ab77332a333cd165d69fcd9ecd92dc6
2021-03-03 10:06:49 -08:00
Ivan Yashchuk
926e011cde Fixed out= variant of linalg.solve (#51968)
Summary:
This PR modifies the behavior of the `linalg_solve_out` variant to match the description here: https://github.com/pytorch/pytorch/wiki/Developer-FAQ#how-does-out-work-in-pytorch
With this PR, result and input tensors must be on the same device and have the same "type kind".
It's allowed to pass out tensors with complex dtypes for float inputs (see the sketch below).

`linalg_solve_out` was broken for batched vector inputs; it's now fixed.
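
A small sketch of the out= rules described above (complex out for a float input is allowed; an incompatible dtype is rejected):

```python
import torch

A = torch.randn(3, 3)
b = torch.randn(3, 1)

out = torch.empty(3, 1, dtype=torch.complex64)  # complex out for a float input is allowed
torch.linalg.solve(A, b, out=out)

bad = torch.empty(3, 1, dtype=torch.int64)      # an integral out must be rejected
try:
    torch.linalg.solve(A, b, out=bad)
except RuntimeError as e:
    print(e)
```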

Ref. https://github.com/pytorch/pytorch/issues/42666

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51968

Reviewed By: H-Huang

Differential Revision: D26728825

Pulled By: mruberry

fbshipit-source-id: c06fe937e7f452193b23ba09ca6cfa2703488455
2021-03-02 22:33:19 -08:00
Ivan Yashchuk
e29d8477a6 Added CUDA support for torch.orgqr (#51348)
Summary:
This PR adds support for CUDA inputs for `torch.orgqr`.

CUDA implementation is based on both [cuSOLVER](https://docs.nvidia.com/cuda/cusolver/index.html#cuSolverDN-lt-t-gt-orgqr) and MAGMA. cuSOLVER doesn't have a specialized routine for the batched case, while MAGMA doesn't have a specialized GPU-native (without CPU sync) `orgqr`. However, MAGMA has an implemented (but undocumented) batched GPU-native version of the `larft` function (for small inputs of size <= 32), which together with the `larfb` operation forms `orgqr` (see the call graph [here at the end of the page](http://www.netlib.org/lapack/explore-html/da/dba/group__double_o_t_h_e_rcomputational_ga14b45f7374dc8654073aa06879c1c459.html)).

So now there are two main codepaths for CUDA inputs (if both MAGMA and cuSOLVER are available):
* if `batchsize > 1` and `tau.shape[-1] <= 32`, then the MAGMA-based function is called
* else [cuSOLVER's `orgqr`](https://docs.nvidia.com/cuda/cusolver/index.html#cuSolverDN-lt-t-gt-orgqr) is used.

If MAGMA is not available then only cuSOLVER is used and vice versa.

Documentation updates and possibly a new name for this function will be in a follow-up PR.

Ref. https://github.com/pytorch/pytorch/issues/50104

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51348

Reviewed By: ngimel

Differential Revision: D26727918

Pulled By: mruberry

fbshipit-source-id: 1c4d15fa76ba624e341a69a32337a9a16cc01013
2021-03-02 21:34:23 -08:00
Nikita Vedeneev
3ac9013235 Implements torch.linalg.lstsq (#49093)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44378 by providing a wider range of drivers, similar to what SciPy does.

The supported CPU drivers are `gels`, `gelsy`, `gelsd`, and `gelss`.
The CUDA interface implements only `gels`, and only for overdetermined systems.

The current state of this PR:
- [x] CPU interface
- [x] CUDA interface
- [x] CPU tests
- [x] CUDA tests
- [x] Memory-efficient batch-wise iteration with broadcasting which fixes https://github.com/pytorch/pytorch/issues/49252
- [x] docs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49093

Reviewed By: H-Huang

Differential Revision: D26723384

Pulled By: mruberry

fbshipit-source-id: c9866a95f14091955cf42de22f4ac9e2da009713
2021-03-02 19:00:07 -08:00
Ivan Yashchuk
870bac13bc Fixed out= variant of linalg.inv (#51977)
Summary:
This PR modifies the behavior of the `linalg_inv_out` variant to match the description here: https://github.com/pytorch/pytorch/wiki/Developer-FAQ#how-does-out-work-in-pytorch
With this PR, result and input tensors must be on the same device and have the same "type kind".
It's allowed to pass out tensors with complex dtypes for float inputs.

Ref. https://github.com/pytorch/pytorch/issues/42666

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51977

Reviewed By: H-Huang

Differential Revision: D26725718

Pulled By: mruberry

fbshipit-source-id: 2acc2a311328268706ce27ce060fc88fc7416753
2021-03-02 18:45:29 -08:00
Luca Wehrstedt
92a4ee1cf6 Revert D26375734: Implemented torch.linalg.multi_dot
Test Plan: revert-hammer

Differential Revision:
D26375734 (0396f492b9)

Original commit changeset: 839642692424

fbshipit-source-id: cb64db646010128d802e1930d5e9526c1f7aa6a2
2021-02-25 00:43:57 -08:00
Heitor Schueroff
0396f492b9 Implemented torch.linalg.multi_dot (#51807)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51807

Implemented torch.linalg.multi_dot similar to [numpy.linalg.multi_dot](https://numpy.org/doc/stable/reference/generated/numpy.linalg.multi_dot.html).

This function does not support broadcasting or batched inputs at the moment.

**NOTE**
numpy.linalg.multi_dot allows the first and last tensors to have more than 2 dimensions despite their docs stating these must be either 1D or 2D. This PR diverges from NumPy in that it enforces this restriction.
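
A usage sketch; as in NumPy, the association order is chosen to minimize the total number of scalar multiplications:

```python
import torch

A = torch.randn(10, 100)
B = torch.randn(100, 5)
C = torch.randn(5, 50)

# Evaluated as (A @ B) @ C here, since that ordering needs far fewer FLOPs
D = torch.linalg.multi_dot([A, B, C])
print(D.shape)  # torch.Size([10, 50])
```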

**TODO**
- [ ] Benchmark against NumPy
- [x] Add OpInfo testing
- [x] Remove unnecessary copy for out= argument

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D26375734

Pulled By: heitorschueroff

fbshipit-source-id: 839642692424c4b1783606c76dd5b34455368f0b
2021-02-24 15:32:30 -08:00
Ivan Yashchuk
7ca9776874 Fixed _out variants of linear algebra functions (#51560)
Summary:
This PR modifies the behavior of `_out` variants to match the description here: https://github.com/pytorch/pytorch/wiki/Developer-FAQ#how-does-out-work-in-pytorch
With this PR, result and input tensors must be on the same device and have the same "type kind".

I skipped `qr` and `eig` in this process as they require a bit more work.

Functions that can use the provided storage directly do so. If `result` is not empty but is not in the batched column-major format, or does not have the same type as the input, then we have to allocate a temporary tensor and copy it over.

TODO:

- [x] Add more tests for same device and valid safe dtype
- [x] Move inv and solve changes to separate PRs https://github.com/pytorch/pytorch/pull/51968, https://github.com/pytorch/pytorch/pull/51977

Ref. https://github.com/pytorch/pytorch/issues/42666

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51560

Reviewed By: albanD

Differential Revision: D26400734

Pulled By: heitorschueroff

fbshipit-source-id: a6201ed7e919c1670c6ff3ef60217d1dbfb72e67
2021-02-19 04:03:35 -08:00
Jeff Daily
70a805a286 [ROCm] skip one more magma test that is flaky (#52064)
Summary:
Skipped hipMAGMA tests are tracked in https://github.com/pytorch/pytorch/issues/51303.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52064

Reviewed By: albanD

Differential Revision: D26406745

Pulled By: walterddr

fbshipit-source-id: 2405ea06e03450eb22177c2c8b12a366cfbdaa93
2021-02-11 14:02:52 -08:00
Jeff Daily
5dd1568aa3 [ROCm] skip more magma tests (#51915)
Summary:
Additional magma tests have been identified as failing after integrating hipMAGMA into the ROCm builds.  Skipping is necessary until they can be fixed properly.  This is blocking migration of ROCm CI to 4.0.1.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51915

Reviewed By: izdeby

Differential Revision: D26326404

Pulled By: malfet

fbshipit-source-id: 558cce66f216f404c0316ab036e2e5637fc99798
2021-02-09 09:14:42 -08:00
Jeff Daily
d02ea9a141 [ROCm] add hipMAGMA support (#51238)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/48831.

- CI image is updated to build hipMAGMA from source and set env MAGMA_HOME.
- CMake is updated to separate different requirements for CUDA versus ROCm MAGMA.
- Some unit tests that become enabled with MAGMA are currently skipped for ROCm due to failures.  Fixing these failures will be follow-on work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51238

Reviewed By: ngimel

Differential Revision: D26184918

Pulled By: malfet

fbshipit-source-id: ada632f1ae7b413e8cae6543fe931dcd46985821
2021-02-01 22:09:33 -08:00
Ivan Yashchuk
5e09ec6518 Fixed SVD ignoring "some/full_matrices" flag for empty inputs (#51109)
Summary:
For empty inputs `torch.svd` (and `torch.linalg.svd`) was returning incorrect results for `some=True` (`full_matrices=False`).
Behaviour on master branch:
```python
In [1]: import torch
In [2]: a = torch.randn(0, 7)
In [3]: a.svd()
Out[3]:
torch.return_types.svd(
U=tensor([], size=(0, 0)),
S=tensor([]),
V=tensor([[0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0.]]))
In [4]: a.svd(some=False)
Out[4]:
torch.return_types.svd(
U=tensor([], size=(0, 0)),
S=tensor([]),
V=tensor([[0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0.]]))
```
The `some` flag is ignored and a 7x7 `V` matrix is returned in both cases; `V` should have shape 7x0 when `some=True`.

This PR fixes that.
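
After the fix, a minimal check of the expected shapes:

```python
import torch

a = torch.randn(0, 7)

U, S, V = a.svd()            # some=True is the default
print(V.shape)               # torch.Size([7, 0])

U, S, V = a.svd(some=False)
print(V.shape)               # torch.Size([7, 7])
```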

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51109

Reviewed By: ngimel

Differential Revision: D26170897

Pulled By: mruberry

fbshipit-source-id: 664c09ca27bb375fabef2a046d0a09ca57b01aac
2021-02-01 21:51:58 -08:00
Ivan Yashchuk
30675d0921 Added OpInfo-based testing of triangular_solve (#50948)
Summary:
Added OpInfo-based testing of `torch.triangular_solve`.

These tests helped discover that CPU `triangular_solve` wasn't working for empty matrices and that, for CUDA inputs, a warning was printed to the terminal. Both issues are now fixed.

CUDA gradgrad checks are skipped.
```
11.44s call     test/test_ops.py::TestGradientsCUDA::test_fn_gradgrad_triangular_solve_cuda_complex128
2.97s call     test/test_ops.py::TestGradientsCUDA::test_fn_gradgrad_triangular_solve_cuda_float64
1.60s call     test/test_ops.py::TestGradientsCPU::test_fn_gradgrad_triangular_solve_cpu_complex128
1.36s call     test/test_ops.py::TestOpInfoCUDA::test_supported_dtypes_triangular_solve_cuda_complex128
1.20s call     test/test_ops.py::TestGradientsCUDA::test_fn_grad_triangular_solve_cuda_complex128
0.86s call     test/test_ops.py::TestCommonCUDA::test_variant_consistency_jit_triangular_solve_cuda_complex64
0.85s call     test/test_ops.py::TestCommonCUDA::test_variant_consistency_jit_triangular_solve_cuda_complex128
0.81s call     test/test_ops.py::TestCommonCUDA::test_variant_consistency_jit_triangular_solve_cuda_float64
0.77s call     test/test_ops.py::TestCommonCUDA::test_variant_consistency_jit_triangular_solve_cuda_float32
0.46s call     test/test_ops.py::TestCommonCPU::test_variant_consistency_jit_triangular_solve_cpu_complex128
0.44s call     test/test_ops.py::TestCommonCPU::test_variant_consistency_jit_triangular_solve_cpu_complex64
0.44s call     test/test_ops.py::TestGradientsCUDA::test_fn_grad_triangular_solve_cuda_float64
0.42s call     test/test_ops.py::TestGradientsCPU::test_fn_gradgrad_triangular_solve_cpu_float64
0.40s call     test/test_ops.py::TestCommonCPU::test_variant_consistency_jit_triangular_solve_cpu_float32
0.40s call     test/test_ops.py::TestCommonCPU::test_variant_consistency_jit_triangular_solve_cpu_float64
0.17s call     test/test_ops.py::TestGradientsCPU::test_fn_grad_triangular_solve_cpu_complex128
```

Ref. https://github.com/pytorch/pytorch/issues/50006

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50948

Reviewed By: ailzhang

Differential Revision: D26123998

Pulled By: mruberry

fbshipit-source-id: 54136e8fc8a71f107dddb692c5be298c6d5ed168
2021-01-29 10:31:07 -08:00
Jeffrey Wan
c0966914bc Internal gradcheck wrapper in testing._internal that sets certain flags to True (#51133)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49409

There are many call sites where gradcheck/gradgradcheck is now implicitly invoked with `check_batched_grad` as True, where it was previously False. Cases fall into two basic categories:
1) the call site was previously using `torch.autograd.gradcheck` but is now changed to use the globally imported function instead
2) the call site was already using the globally imported function, but does not explicitly pass the `check_batched_grad` flag

Only in the _assertGradAndGradgradChecks cases, which are infrequent, did I assume that the author is aware that omitting the flag means not applying check_batched_grad=True (but maybe that is not the case?).

Overall, this PR in its current state assumes that unless the author explicitly specified `check_batched_grad=False`, they were probably just not aware of this flag and did not mean to have it as False.
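
For reference, a minimal explicit call with the flag this wrapper now turns on:

```python
import torch
from torch.autograd import gradcheck

x = torch.randn(3, dtype=torch.float64, requires_grad=True)
# The internal wrapper effectively defaults this flag to True for tests:
assert gradcheck(torch.sin, (x,), check_batched_grad=True)
```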

So far exceptions to the above (as discovered by CI) include:
 - Mkldnn (opaque tensors do not have strides) https://app.circleci.com/pipelines/github/pytorch/pytorch/264416/workflows/e4d87886-6247-4305-8526-2696130aa9a4/jobs/10401882/tests
 - all cases in test_sparse (https://app.circleci.com/pipelines/github/pytorch/pytorch/264553/workflows/3c1cbe30-830d-4acd-b240-38d833dccd9b/jobs/10407103)
 - all cases in test_overrides (https://app.circleci.com/pipelines/github/pytorch/pytorch/264553/workflows/3c1cbe30-830d-4acd-b240-38d833dccd9b/jobs/10407236)
 - test_autograd (test_LSTM_grad_and_gradgrad) - (https://app.circleci.com/pipelines/github/pytorch/pytorch/264553/workflows/3c1cbe30-830d-4acd-b240-38d833dccd9b/jobs/10407235)
 - test_data_parallel (test_data_parallel_buffers_requiring_grad) - *SIGSEGV* (https://app.circleci.com/pipelines/github/pytorch/pytorch/264820/workflows/14d89503-040d-4e3d-9f7b-0bc04833589b/jobs/10422697)
 - test_nn (https://app.circleci.com/pipelines/github/pytorch/pytorch/264919/workflows/df79e3ed-8a31-4a8e-b584-858ee99686ff/jobs/10427315)

A possible TODO is to prevent new tests from invoking the external gradcheck.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51133

Reviewed By: ezyang

Differential Revision: D26147919

Pulled By: soulitzer

fbshipit-source-id: dff883b50f337510a89f391ea2fd87de2d531432
2021-01-29 09:13:37 -08:00
Ivan Yashchuk
6e4746c1ac Port cholesky_inverse to ATen (#50269)
Summary:
Now we can remove `_th_potri`!

Compared to the original TH-based `cholesky_inverse`, complex (https://github.com/pytorch/pytorch/issues/33152) and batched inputs (https://github.com/pytorch/pytorch/issues/7500) are now supported on both CPU and CUDA.
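
A short sketch exercising the newly supported complex and batched paths:

```python
import torch

# Batched Hermitian positive-definite complex input
A = torch.randn(2, 3, 3, dtype=torch.complex128)
A = A @ A.conj().transpose(-2, -1) + 3 * torch.eye(3, dtype=torch.complex128)

L = torch.linalg.cholesky(A)
Ainv = torch.cholesky_inverse(L)  # inverse computed from the Cholesky factor
assert torch.allclose(Ainv, torch.inverse(A), atol=1e-10)
```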

Closes https://github.com/pytorch/pytorch/issues/24685.
Closes https://github.com/pytorch/pytorch/issues/24543.

Ref. https://github.com/pytorch/pytorch/issues/49421, https://github.com/pytorch/pytorch/issues/42666

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50269

Reviewed By: bdhirsh

Differential Revision: D26047548

Pulled By: anjali411

fbshipit-source-id: e4f191e39c684f241b7cb0f4b4c025de082cccef
2021-01-28 16:24:41 -08:00
Scott Wolchok
1321f2bfe6 [PyTorch] Port Caffe2 opti for BatchMatMul batch size 1 to baddbmm (#51057)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51057

Caffe2 has an
[optimization](f8eefbdf7a/caffe2/operators/batch_matmul_op.h (L192))
for the case where the batch size is 1 that uses the underlying `gemm`
instead of the `gemm_batched` BLAS function. This diff tries to port that
optimization to `baddbmm_mkl`.

Note that I have very little linear algebra background and am just
going off existing code and cblas API documentation, so please
review without assuming I know what I'm doing with the math itself.
ghstack-source-id: 120342923

Reviewed By: hlu1

Differential Revision: D26056613

fbshipit-source-id: feef80344b96601fc2bd0a2e8c8f6b57510d7856
2021-01-27 15:59:57 -08:00
Gao, Xiang
16dd5ca8ab Followup of kron PR (#51045)
Summary:
Followup of https://github.com/pytorch/pytorch/pull/50927

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51045

Reviewed By: mruberry

Differential Revision: D26089204

Pulled By: ngimel

fbshipit-source-id: 77291dd83fba32d6f80a8540910b112a1d85a892
2021-01-27 10:33:05 -08:00
Xiang Gao
ba316a7612 Fix TF32 failures in test_linalg.py (#50453)
Summary:
On Ampere GPUs, matmuls are computed by default with TF32 when the dtype is `torch.float`: https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices, which results in reduced precision. However, linear algebra usually needs higher precision, so lots of tests in `test_linalg.py` are failing on Ampere GPUs because of precision issues.

To fix this issue:
- Most linear algebra methods, except for matmuls, should add `NoTF32Guard` (a Python-level sketch of disabling TF32 follows below)
- Expected results in unit tests should be computed with NumPy matmuls instead of PyTorch CUDA matmuls.
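
From Python, the effect of the C++ `NoTF32Guard` can be sketched with the global matmul toggle:

```python
import torch

# On Ampere, float32 matmuls may use TF32 by default; reference values for
# linear algebra tests need full float32 precision, so disable TF32 locally.
prev = torch.backends.cuda.matmul.allow_tf32
torch.backends.cuda.matmul.allow_tf32 = False
try:
    pass  # ... compute full-precision reference results here ...
finally:
    torch.backends.cuda.matmul.allow_tf32 = prev
```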

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50453

Reviewed By: glaringlee

Differential Revision: D26023005

Pulled By: ngimel

fbshipit-source-id: f0ea533494fee322b07925565b57e3b0db2570c5
2021-01-26 19:51:20 -08:00
Xiang Gao
b822aba8ec Enable BFloat support for gemms on arch other than ampere (#50442)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50442

Reviewed By: bdhirsh

Differential Revision: D26044981

Pulled By: mruberry

fbshipit-source-id: 65c42f2c1de8d24e4852a1b5bd8f4b1735b2230e
2021-01-26 11:07:07 -08:00
Antonio Cuni
880f007480 Add torch.eig complex forward (CPU, CUDA) (#49168)
Summary:
Related to issue https://github.com/pytorch/pytorch/issues/42666

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49168

Reviewed By: mrshenli

Differential Revision: D25954027

Pulled By: mruberry

fbshipit-source-id: e429f9587efff5e638bfd0e4de864c06f41c63b1
2021-01-25 21:27:08 -08:00
Ivan Yashchuk
ddf26816d3 Make torch.svd return V, not V.conj() for complex inputs (#51012)
Summary:
**BC-breaking note:**

torch.svd() added support for complex inputs in PyTorch 1.7, but was not documented as doing so. The complex "V" tensor returned was actually the complex conjugate of what's expected. This PR fixes the discrepancy.

This will silently break all users of torch.svd() with complex inputs.

**Original PR Summary:**

This PR resolves https://github.com/pytorch/pytorch/issues/45821.

The problem was that when introducing support for complex inputs to `torch.svd`, it was overlooked that LAPACK/MAGMA returns the conjugate transpose of the V matrix, not just the transpose of V. So `torch.svd` was silently returning U, S, V.conj() instead of U, S, V.

Behavior of `torch.linalg.pinv`, `torch.pinverse` and `torch.linalg.svd` (they depend on `torch.svd`) is not changed in this PR.
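
A reconstruction check reflecting the fixed convention, assuming the documented `torch.svd` identity a = U @ diag(S) @ Vᴴ:

```python
import torch

a = torch.randn(4, 4, dtype=torch.complex128)
U, S, V = torch.svd(a)

# With this fix, V (not V.conj()) is returned, so the usual SVD identity holds
assert torch.allclose(U @ torch.diag(S).to(a.dtype) @ V.conj().T, a, atol=1e-12)
```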

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51012

Reviewed By: bdhirsh

Differential Revision: D26047593

Pulled By: albanD

fbshipit-source-id: d1e08dbc3aab9ce1150a95806ef3b5da98b5d3ca
2021-01-25 14:06:41 -08:00
Heitor Schueroff
a7cf04ec40 Workaround for MAGMA accessing illegal memory in batched cholesky (#50957)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50957

MAGMA has an off-by-one error in their batched cholesky implementation which is causing illegal memory access for certain inputs. The workaround implemented in this PR is to pad the input to MAGMA with 1 extra element.

**Benchmark**
Ran the script below both before and after my PR and got similar results.

*Script*
```
import torch
from torch.utils import benchmark

DTYPE = torch.float32
BATCHSIZE = 512 * 512
MATRIXSIZE = 16

a = torch.eye(MATRIXSIZE, device='cuda', dtype=DTYPE)

t0 = benchmark.Timer(
    stmt='torch.cholesky(a)',
    globals={'a': a},
    label='Single'
)

t1 = benchmark.Timer(
    stmt='torch.cholesky(a)',
    globals={'a': a.expand(BATCHSIZE, -1, -1)},
    label='Batched'
)

print(t0.timeit(100))
print(t1.timeit(100))
```

*Results before*
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7faf9bc63400>
Single
  2.08 ms
  1 measurement, 100 runs , 1 thread
<torch.utils.benchmark.utils.common.Measurement object at 0x7faf9bc63400>
Batched
  7.68 ms
  1 measurement, 100 runs , 1 thread
```

*Results after*
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7faf9bc63400>
Single
  2.10 ms
  1 measurement, 100 runs , 1 thread
<torch.utils.benchmark.utils.common.Measurement object at 0x7faf9bc63400>
Batched
  7.56 ms
  1 measurement, 100 runs , 1 thread
```

Fixes https://github.com/pytorch/pytorch/issues/41394, https://github.com/pytorch/pytorch/issues/26996, https://github.com/pytorch/pytorch/issues/48996

See also https://github.com/pytorch/pytorch/issues/42666, https://github.com/pytorch/pytorch/pull/26789

TODO
 ---
- [x] Benchmark to check for perf regressions

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D26050978

Pulled By: heitorschueroff

fbshipit-source-id: 7a5ba7e34c9d74b58568b2a0c631cc6d7ba63f86
2021-01-25 13:39:24 -08:00
Ivan Yashchuk
627a331257 Port CPU torch.orgqr to ATen (#50502)
Summary:
Now we can remove `_th_orgqr`!

Compared to the original TH-based `orgqr`, complex (https://github.com/pytorch/pytorch/issues/33152) and batched inputs are now supported.
CUDA support will be added in a follow-up PR.

Closes https://github.com/pytorch/pytorch/issues/24747

Ref. https://github.com/pytorch/pytorch/issues/49421, https://github.com/pytorch/pytorch/issues/42666

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50502

Reviewed By: mrshenli

Differential Revision: D25953300

Pulled By: mruberry

fbshipit-source-id: f52a74e1c8f51b5e24f7b461430ca8fc96e4d149
2021-01-25 02:57:05 -08:00
Xiao Wang
186c3da037 Add cusolver gesvdj and gesvdjBatched to the backend of torch.svd (#48436)
Summary:
This PR adds cusolver `gesvdj` and `gesvdjBatched` to the backend of `torch.svd`.

I've tested the performance using CUDA 11.1 on a 2070, V100, and A100. The cuSOLVER gesvdj and gesvdjBatched performance is better than MAGMA's in all square-matrix cases, so the cuSOLVER backend will replace the MAGMA backend when available.

When both matrix dimensions are no greater than 32, `gesvdjBatched` is used. Otherwise, `gesvdj` is used.

Detailed benchmark is available at https://github.com/xwang233/code-snippet/tree/master/svd.

Some relevant code and discussions
- https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/linalg/svd_op_gpu.cu.cc
- https://github.com/google/jax/blob/master/jaxlib/cusolver.cc
- https://github.com/cupy/cupy/issues/3174
- https://github.com/tensorflow/tensorflow/issues/13603
- https://www.nvidia.com/en-us/on-demand/session/gtcsiliconvalley2019-s9226/

See also https://github.com/pytorch/pytorch/issues/42666 https://github.com/pytorch/pytorch/issues/47953

Close https://github.com/pytorch/pytorch/pull/50516

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48436

Reviewed By: ejguan

Differential Revision: D25977046

Pulled By: heitorschueroff

fbshipit-source-id: c27e705cd29b6fd7c8ac674c1f9f490fa26ee1bf
2021-01-24 15:47:05 -08:00
Xiang Gao
ab331da7ac Rewrite kron with broadcasting at::mul (#50927)
Summary:
Because it is shorter, faster, and does not have the TF32 issue.
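
A minimal sketch of the broadcasting-`mul` formulation for the 2D case (the actual implementation also handles other ranks):

```python
import torch

def kron2d(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # kron(a, b)[i*p + k, j*q + l] == a[i, j] * b[k, l]
    m, n = a.shape
    p, q = b.shape
    return (a[:, None, :, None] * b[None, :, None, :]).reshape(m * p, n * q)

a, b = torch.randn(2, 3), torch.randn(4, 5)
assert torch.allclose(kron2d(a, b), torch.kron(a, b))
```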

Benchmark: https://github.com/zasdfgbnm/things/blob/master/2021Q1/kron.ipynb

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50927

Reviewed By: glaringlee

Differential Revision: D26022385

Pulled By: ngimel

fbshipit-source-id: 513c9e9138c35c70d3a475a8407728af21321dae
2021-01-22 20:58:17 -08:00
Kurt Mohler
8ab1a1495d Rename set_deterministic to use_deterministic_algorithms (#49904)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49100
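
The rename in one line:

```python
import torch

torch.use_deterministic_algorithms(True)  # formerly torch.set_deterministic(True)
print(torch.are_deterministic_algorithms_enabled())  # True
```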

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49904

Reviewed By: ezyang, mrshenli

Differential Revision: D25956761

Pulled By: mruberry

fbshipit-source-id: 86a59289d50825a0ebbd7c358b483c8d8039ffa6
2021-01-22 11:27:07 -08:00
Kurt Mohler
c082e2184d Add autograd tests for complex matrix norm nuclear and +/-2 (#50746)
Summary:
Also upgrades `linalg.norm`'s autograd and jit tests to `OpInfo`

Fixes https://github.com/pytorch/pytorch/issues/48842

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50746

Reviewed By: mruberry

Differential Revision: D25968246

Pulled By: anjali411

fbshipit-source-id: d457069ddb4caf2a5caed1aa64c791ef0790952c
2021-01-21 15:33:08 -08:00
Richard Zou
884fb48794 Miscellaneous batched grad testing (#50738)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50738

This PR adds batched grad testing for:
- test_linalg.py
- test_unary_ufuncs.py

Future:
- add batched grad testing for test_nn
- enable option for batched grad testing in OpInfo

Test Plan: - run tests

Reviewed By: ejguan

Differential Revision: D25997678

Pulled By: zou3519

fbshipit-source-id: 9a9f6694c041580061bd52b5e45661c872b0b761
2021-01-21 14:26:46 -08:00
Ivan Yashchuk
f9a5ba7398 Added linalg.slogdet (#49194)
Summary:
This PR adds `torch.linalg.slogdet`.

Changes compared to the original torch.slogdet:

- Complex input now works as in NumPy
- Added out= variant (allocates temporary and makes a copy for now)
- Updated `slogdet_backward` to work with complex input
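
A short usage sketch, including the complex case:

```python
import torch

A = torch.randn(3, 3, dtype=torch.complex128)
sign, logabsdet = torch.linalg.slogdet(A)

# sign has unit modulus for complex input; det(A) == sign * exp(logabsdet)
assert torch.allclose(sign * torch.exp(logabsdet), torch.linalg.det(A))
```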

Ref. https://github.com/pytorch/pytorch/issues/42666

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49194

Reviewed By: VitalyFedyunin

Differential Revision: D25916959

Pulled By: mruberry

fbshipit-source-id: cf9be8c5c044870200dcce38be48cd0d10e61a48
2021-01-19 07:28:12 -08:00
Ivan Yashchuk
9384d31af5 Added linalg.pinv (#48399)
Summary:
This PR adds `torch.linalg.pinv`.

Changes compared to the original `torch.pinverse`:
 * New kwarg "hermitian": with `hermitian=True` eigendecomposition is used instead of singular value decomposition.
 * `rcond` argument can now be a `Tensor` of appropriate shape to apply matrix-wise clipping of singular values.
 * Added `out=` variant (allocates temporary and makes a copy for now)
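
A usage sketch of the two new knobs (the batched `rcond` shape, one value per matrix, is an assumption based on the description above):

```python
import torch

A = torch.randn(4, 4, dtype=torch.complex128)
H = A + A.conj().T                        # Hermitian input

P = torch.linalg.pinv(H, hermitian=True)  # eigendecomposition instead of SVD
assert torch.allclose(H @ P @ H, H, atol=1e-10)

batch = torch.randn(2, 4, 4)
rcond = torch.tensor([1e-15, 1e-3])       # per-matrix clipping of singular values
torch.linalg.pinv(batch, rcond=rcond)
```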

Ref. https://github.com/pytorch/pytorch/issues/42666

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48399

Reviewed By: zhangguanheng66

Differential Revision: D25869572

Pulled By: mruberry

fbshipit-source-id: 0f330a91d24ba4e4375f648a448b27594e00dead
2021-01-12 06:52:06 -08:00
Ivan Yashchuk
4774c6800b Added linalg.inv (#48261)
Summary:
This PR adds `torch.linalg.inv` for NumPy compatibility.

`linalg_inv_out` uses in-place operations on the provided `result` tensor.

I modified `apply_inverse` to accept a tensor of Int instead of a std::vector; that way we can write a function similar to `linalg_inv_out` but without the error checks and device memory synchronization.

I fixed `lda` (the leading dimension parameter, which is max(1, n)) in many places to handle 0x0 matrices correctly.
Zero batch dimensions are also working and tested.
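
A short sketch covering both the regular and the degenerate shapes mentioned above:

```python
import torch

A = torch.randn(3, 3, dtype=torch.float64)
Ainv = torch.linalg.inv(A)
assert torch.allclose(A @ Ainv, torch.eye(3, dtype=torch.float64), atol=1e-10)

torch.linalg.inv(torch.empty(0, 0))     # 0x0 matrices work (the lda fix)
torch.linalg.inv(torch.empty(0, 3, 3))  # zero batch dimensions work too
```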

Ref https://github.com/pytorch/pytorch/issues/42666

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48261

Reviewed By: gchanan

Differential Revision: D25849590

Pulled By: mruberry

fbshipit-source-id: cfee6f1daf7daccbe4612ec68f94db328f327651
2021-01-10 04:00:51 -08:00
Antonio Cuni
b5ab0a7f78 Improve torch.linalg.qr (#50046)
Summary:
This is a follow-up to PR https://github.com/pytorch/pytorch/issues/47764 to fix the remaining details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50046

Reviewed By: zou3519

Differential Revision: D25825557

Pulled By: mruberry

fbshipit-source-id: b8e335e02265e73484a99b0189e4cc042828e0a9
2021-01-08 09:52:31 -08:00