Use `c10::checked_convert` to detect overflows during tensor construction from scalars, avoiding `TypeError: 'float' object cannot be interpreted as an integer` when creating an integer tensor from floating-point values. Also modify a sparse_csr test that violated this rule.
Fixes #69319
Tested in #81233
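A minimal sketch of the resulting behavior (the exact call site and error message may differ):
```python
import torch

# Constructing an integer tensor from an out-of-range scalar should now
# raise instead of silently wrapping around.
try:
    torch.full((2,), 2**31, dtype=torch.int32)
except RuntimeError as e:
    print(e)  # value cannot be converted without overflow
```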
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81372
Approved by: https://github.com/ezyang, https://github.com/ngimel
As per title. Previously this was done via converting to COO.
A better approach could be using `dense.out_`, but it is not yet allowed for `sparse_csc`.
And are we fine with implementing very critical operations like `add` via transpositions?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79635
Approved by: https://github.com/cpuhrsch
Adds
- to_sparse_csc for strided input
- to_sparse_csc for COO input
- CSC to strided
- CSC to CSR
- CSC to CSC
Uses SciPy as a reference
Follow-up work: change transpose to return CSC when passed CSR, and handle the resulting ripples through our matmul operations.
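A minimal usage sketch of the conversions added here (values illustrative):
```python
import torch

dense = torch.tensor([[0., 1.], [2., 0.]])

csc = dense.to_sparse_csc()                       # strided -> CSC
csc_from_coo = dense.to_sparse().to_sparse_csc()  # COO -> CSC
back = csc.to_dense()                             # CSC -> strided
csr = csc.to_sparse_csr()                         # CSC -> CSR
same = csc.to_sparse_csc()                        # CSC -> CSC
```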
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77521
Approved by: https://github.com/pearu, https://github.com/anjali411
This PR adds a for-loop around cuSPARSE calls to support batched inputs;
the cuSPARSE function itself doesn't support batched inputs yet.
`mat1` and `mat2` must have the same batch shape. It's allowed to pass
`self` as a single matrix when `mat1` and `mat2` are batched.
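In Python terms, the batching is roughly equivalent to the sketch below; the real implementation is in C++, and `single_op` is a hypothetical stand-in for the non-batched cuSPARSE-backed kernel:
```python
import torch

def batched_via_loop(self_mat, mat1, mat2, single_op):
    # mat1, mat2: (B, m, k) and (B, k, n); self_mat may be a single matrix
    # shared across the whole batch.
    assert mat1.shape[0] == mat2.shape[0]
    results = [single_op(self_mat, a, b) for a, b in zip(mat1, mat2)]
    return torch.stack(results)
```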
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77243
Approved by: https://github.com/cpuhrsch
`torch.sparse.sampled_addmm` was incorrect for noncontiguous inputs on CUDA.
Unfortunately, the tests overlooked this: noncontiguous inputs were not
exercised properly because 1x5 and 5x1 shapes were used.
The block sparse triangular solver on CUDA could return incorrect results if
there's a zero on the diagonal of the sparse matrix. Now it returns NaN.
Tests also revealed that the unitriangular=True flag is not working
correctly on CPU in some cases. That part needs more investigation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76590
Approved by: https://github.com/cpuhrsch
This PR implements `torch.select` for CSR tensors. Currently, it's not possible to select rows or columns of a batched CSR tensor. The non-batched case works fine by converting to COO and calling select. Initially, I implemented raw manipulations of indices, but converting to COO is only slightly slower and more readable.
This PR also enables indexing into a batched CSR tensor with `[x, y, z]`. Assignment is disabled.
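A minimal sketch of the new surface (values illustrative; the return layout of select follows the internal COO path):
```python
import torch

crow = torch.tensor([0, 2, 3])
col = torch.tensor([0, 1, 1])
val = torch.tensor([1., 2., 3.])
csr = torch.sparse_csr_tensor(crow, col, val, size=(2, 2))

row = csr.select(0, 1)  # row selection, implemented via a COO conversion
```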
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76228
Approved by: https://github.com/cpuhrsch
This pull request enables accumulating gradients for the CSR tensor.
Functions that work and are tested:
- tensor.abs()
- tensor.neg()
- tensor.conj_physical()
- torch.addmm
`torch.mm` also works, but tests will be added later.
In addition, this PR makes accessing strides, storage, and contiguity info on a CSR tensor throw an error.
`tensor.to_sparse_csr().to_sparse_csr()` was failing and is now fixed.
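A minimal sketch exercising one of the tested paths, `torch.addmm` with a CSR operand (shapes illustrative):
```python
import torch

a_csr = torch.randn(3, 3).to_sparse_csr()
b = torch.randn(3, 3, requires_grad=True)
c = torch.randn(3, 3, requires_grad=True)

out = torch.addmm(c, a_csr, b)  # sparse @ dense with autograd
out.sum().backward()
print(b.grad.shape, c.grad.shape)  # gradients flow to the dense inputs
```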
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75435
Approved by: https://github.com/cpuhrsch
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73642
Reland of https://github.com/pytorch/pytorch/pull/73471, which was reverted due to lack of `to_sparse(sparse_dim)` support.
Test Plan: Imported from OSS
Reviewed By: zou3519
Differential Revision: D34580353
Pulled By: cpuhrsch
fbshipit-source-id: a8a4ea381daeb80d8365fe931af9f55a7e789ea1
(cherry picked from commit 5a3cf8110980e5a10dbb687e87e67d5524ebf2f5)
Summary:
This PR introduces the `cuSolverSP` backend for `linalg.solve` with sparse CSR input matrices. The motivation comes from the issue: https://github.com/pytorch/pytorch/issues/69538.
`cuSolver` provides the [`cusolverSp<t>csrlsvluHost`](https://docs.nvidia.com/cuda/cusolver/index.html#cusolver-lt-t-gt-csrlsvlu) API; a few things to note:
1. As mentioned in the documentation: `only CPU (Host) path is provided.` The profiling below confirms that no GPU kernels are launched for the solve itself.
2. Since only the `host` path is provided, the CPU path uses `csrlsvluHost` (but requires PyTorch to be installed/built with CUDA support).
3. The documentation mentions that reordering can improve performance, but it isn't clear by how much. Among the available reordering options, we stick to `reorder = 0` as the default choice.
`cuSolver` has [`csrlsvqr`](https://docs.nvidia.com/cuda/cusolver/index.html#cusolver-lt-t-gt-csrlsvqr) function which provides a `device` path to solve the linear system. This function is used for the CUDA path in this PR.
**Gist:**
For CPU Path: we call [`csrlsvluHost` function of cuSolver](https://docs.nvidia.com/cuda/cusolver/index.html#cusolver-lt-t-gt-csrlsvlu).
For CUDA Path: we call [`csrlsvqr` function of cuSolver](https://docs.nvidia.com/cuda/cusolver/index.html#cusolver-lt-t-gt-csrlsvqr).
**Profiling:** (on a sparse input tensor of size 1000 x 1000, with a vector of length 1000), for the `csrlsvlu` function (to show there is no GPU optimization)
```
==3999651== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 2.1440us 1 2.1440us 2.1440us 2.1440us [CUDA memcpy HtoD]
API calls: 99.72% 1.07199s 9 119.11ms 500ns 1.07164s cudaFree
0.11% 1.2182ms 398 3.0600us 140ns 137.94us cuDeviceGetAttribute
0.06% 674.45us 4 168.61us 165.50us 173.64us cuDeviceTotalMem
0.03% 357.07us 4 89.268us 2.7800us 201.89us cudaMalloc
0.03% 309.29us 1 309.29us 309.29us 309.29us cudaGetDeviceProperties
0.01% 160.47us 332 483ns 350ns 3.3300us cudaFuncSetAttribute
0.01% 115.12us 4 28.780us 26.290us 33.410us cuDeviceGetName
0.00% 28.591us 5 5.7180us 440ns 16.921us cudaGetDevice
0.00% 22.061us 4 5.5150us 871ns 18.690us cudaDeviceSynchronize
0.00% 20.370us 18 1.1310us 410ns 6.9900us cudaEventDestroy
0.00% 16.390us 1 16.390us 16.390us 16.390us cudaMemcpy
0.00% 11.540us 2 5.7700us 1.4900us 10.050us cuDeviceGetPCIBusId
0.00% 10.510us 18 583ns 430ns 1.6200us cudaEventCreateWithFlags
0.00% 7.9100us 21 376ns 290ns 700ns cudaDeviceGetAttribute
0.00% 1.4300us 6 238ns 150ns 590ns cuDeviceGet
0.00% 1.2200us 4 305ns 190ns 500ns cuDeviceGetCount
0.00% 900ns 1 900ns 900ns 900ns cuInit
0.00% 860ns 4 215ns 180ns 260ns cuDeviceGetUuid
0.00% 240ns 1 240ns 240ns 240ns cuDriverGetVersion
0.00% 230ns 1 230ns 230ns 230ns cudaGetDeviceCount
```
Script:
```python
import torch
def solve(x, other, out):
    torch.linalg.solve(x, other, out=out)

if __name__ == "__main__":
    dense_inp = torch.randn((1000, 1000), dtype=torch.float64)
    # Set 50% of the values to 0 randomly
    dense_inp = torch.nn.functional.dropout(dense_inp, p=0.5)
    sparse_inp = dense_inp.to_sparse_csr()
    other = torch.randint(100, (1000,), dtype=torch.float64)
    out = torch.randint(1, (1000,), dtype=torch.float64)
    solve(sparse_inp, other, out)
```
The following error is raised when the function is used with a PyTorch build that lacks CUDA support:
```python
/home/krshrimali/pytorch/torch/autograd/profiler.py:151: UserWarning: CUDA is not available, disabling CUDA profiling
warn("CUDA is not available, disabling CUDA profiling")
Traceback (most recent call last):
File "/home/krshrimali/pytorch/test_sp.py", line 17, in <module>
solve(x, other, out)
File "/home/krshrimali/pytorch/test_sp.py", line 5, in solve
torch.linalg.solve(x, other, out=out)
RuntimeError: PyTorch was not built with CUDA support. Please use PyTorch built CUDA support
```
**Performance Comparison** (vs SciPy's [`scipy.sparse.linalg.spsolve`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.linalg.spsolve.html)):
Time taken by `scipy.sparse.linalg.spsolve` : 0.595 seconds
On CPU: Time taken by `torch.linalg.solve` : 4.565 seconds
On CUDA: Time taken by `torch.linalg.solve`: 1.838 seconds
The inputs are of dimensions: (17281, 17281) and (17281, 1), and were taken from https://math.nist.gov/MatrixMarket/extreme.html.
Thanks to IvanYashchuk for helping me with the PR, and guiding me through it.
cc: IvanYashchuk pearu nikitaved cpuhrsch
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71399
Reviewed By: VitalyFedyunin
Differential Revision: D33767740
Pulled By: cpuhrsch
fbshipit-source-id: a945f065210cd719096eb8d7cdbf8e8937c2fce9
(cherry picked from commit f4f35c17da414e1ca6c6d91402933521857aa1ea)
Summary:
When PyTorch is not built with MKL, or on Windows, there's a native implementation of `torch.addmm` for tensors on CPU. There was a bug where the `beta` value was ignored, causing new tests to fail (see https://github.com/pytorch/pytorch/pull/71949#issuecomment-1024639741).
In addition, I also enabled complex number support for this code path.
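A minimal sketch of the fixed semantics on this code path (values illustrative):
```python
import torch

a = torch.eye(2).to_sparse_csr()
b = torch.ones(2, 2)
c = torch.full((2, 2), 10.0)

# out = beta * c + alpha * (a @ b) = 0.5 * 10 + 1.0 * 1 = 6.0 everywhere;
# before the fix, beta was ignored on the native (non-MKL) CPU path.
out = torch.addmm(c, a, b, beta=0.5, alpha=1.0)
```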
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72430
Reviewed By: davidberard98
Differential Revision: D34045670
Pulled By: cpuhrsch
fbshipit-source-id: b2b63f22ba3eea895a31c5c2925b0fb1555d2c6f
(cherry picked from commit ac0a2080bb)
Summary:
The rest of the tests in the CUDA test suite are skipped after GPU context corruption is encountered.
For tests decorated with `expectedFailure` this creates the false impression that the entire test suite is passing.
Remedy this by suppressing the exception and reporting an unexpected success if `should_stop_early` is true.
Also, print warnings both when this happens (to make attribution easier) and when the condition is detected.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72016
Test Plan:
`python test_ops.py -v -k test_fn_fwgrad_bwgrad_gradient`
Before the change:
```
test_fn_fwgrad_bwgrad_gradient_cpu_complex128 (__main__.TestGradientsCPU) ... ok
test_fn_fwgrad_bwgrad_gradient_cpu_float64 (__main__.TestGradientsCPU) ... ok
test_fn_fwgrad_bwgrad_gradient_cuda_complex128 (__main__.TestGradientsCUDA) ... expected failure
----------------------------------------------------------------------
Ran 3 tests in 0.585s
OK (expected failures=1)
```
After the change:
```
test_fn_fwgrad_bwgrad_gradient_cpu_complex128 (__main__.TestGradientsCPU) ... ok
test_fn_fwgrad_bwgrad_gradient_cpu_float64 (__main__.TestGradientsCPU) ... ok
test_fn_fwgrad_bwgrad_gradient_cuda_complex128 (__main__.TestGradientsCUDA) ... /home/conda/miniconda3/lib/python3.9/site-packages/torch/testing/_internal/common_utils.py:1670: UserWarning: TEST SUITE EARLY TERMINATION due to torch.cuda.synchronize() failed with CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
warn(f"TEST SUITE EARLY TERMINATION due to torch.cuda.synchronize() failed with {rte}")
/home/conda/miniconda3/lib/python3.9/site-packages/torch/testing/_internal/common_device_type.py:382: UserWarning: Suppressed expected failure that resulted in fatal error
warn("Suppressed expected failure that resulted in fatal error")
unexpected success
----------------------------------------------------------------------
Ran 3 tests in 0.595s
FAILED (unexpected successes=1)
```
And `stderr` from XML file contains requested info:
```
/home/conda/miniconda3/lib/python3.9/site-packages/torch/testing/_internal/common_utils.py:1670: UserWarning: TEST SUITE EARLY TERMINATION due to torch.cuda.synchronize() failed with CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
warn(f"TEST SUITE EARLY TERMINATION due to torch.cuda.synchronize() failed with {rte}")
/home/conda/miniconda3/lib/python3.9/site-packages/torch/testing/_internal/common_device_type.py:382: UserWarning: Suppressed expected failure that resulted in fatal error
warn("Suppressed expected failure that resulted in fatal error")
```
Fixes https://github.com/pytorch/pytorch/issues/71973
Reviewed By: janeyx99, ngimel
Differential Revision: D33854287
Pulled By: malfet
fbshipit-source-id: dd0f5a4d2fcd21ebb7ee50ce4ec4914405a812d0
(cherry picked from commit 0c0baf3931)
Summary:
Since there is no rule in PyTorch (Sparse CSR) for filling zeros, it was decided to support only ops that preserve the 0->0 correspondence. This PR adds a test to ensure that rule is not broken.
`sample_inputs_unary` may or may not generate a zero in the sample input, so this separate test is needed to validate the rule and the Sparse CSR support.
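A minimal sketch of the property being checked (hypothetical test body, not the actual OpInfo machinery):
```python
import torch

def preserves_zero(op):
    # Ops supported for Sparse CSR must map 0 -> 0.
    return op(torch.zeros(())).item() == 0

assert preserves_zero(torch.abs)
assert preserves_zero(torch.neg)
assert not preserves_zero(torch.cos)  # cos(0) == 1, so it breaks the rule
```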
cc nikitaved pearu cpuhrsch
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70302
Reviewed By: albanD
Differential Revision: D33922501
Pulled By: cpuhrsch
fbshipit-source-id: 10f67a220b95a8e75205345a33744ad536fdcf53
(cherry picked from commit ade9bf7818)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68711
This PR adds the possibility to multiply a single CSR matrix by a batch of dense matrices.
cc nikitaved pearu cpuhrsch IvanYashchuk ngimel
Test Plan: Imported from OSS
Reviewed By: davidberard98
Differential Revision: D33773319
Pulled By: cpuhrsch
fbshipit-source-id: 1623ce9affbc4fdc6d6130a95c5a42022858b62b
(cherry picked from commit 628c8e366d)
Summary:
This PR enables `test_block_triangular` tests on the CPU.
These tests revealed a problem with how the nnz == 0 case was handled. Now we return a tensor filled with NaNs on both CUDA and CPU.
cc nikitaved pearu cpuhrsch
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71304
Reviewed By: davidberard98
Differential Revision: D33600482
Pulled By: cpuhrsch
fbshipit-source-id: d09cb619f8b6e54b9f07eb16765ad1c183c42487
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68083
This PR adds support for `torch.randn_like(sparse_csr_tensor)`.
It creates a new sparse CSR tensor with the same indices but new values drawn from a normal distribution.
In addition, `.normal_()` and `torch.empty_like` were implemented, because `randn_like` is a composite of these two functions.
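A minimal sketch of the composite (values illustrative):
```python
import torch

csr = torch.tensor([[0., 1.], [2., 0.]]).to_sparse_csr()

r = torch.randn_like(csr)             # same sparsity pattern, normal values
r2 = torch.empty_like(csr).normal_()  # the equivalent decomposition
```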
cc nikitaved pearu cpuhrsch IvanYashchuk
Test Plan: Imported from OSS
Reviewed By: jbschlosser
Differential Revision: D33511280
Pulled By: cpuhrsch
fbshipit-source-id: 6129083e8bc6cc5af2e0191294bd5e4e864f6c0e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68710
This PR adds support for block sparse (BSR) matrices in functions that
use the Inspector-Executor MKL Sparse API. At the moment of this PR these are:
* torch.addmm
* torch.addmv
* torch.triangular_solve (once https://github.com/pytorch/pytorch/pull/62180 is merged)
cc nikitaved pearu cpuhrsch IvanYashchuk
Test Plan: Imported from OSS
Reviewed By: ZolotukhinM
Differential Revision: D33179486
Pulled By: cpuhrsch
fbshipit-source-id: e1dec0dccdbfed8b280be16b8c11fc9e770d50ae
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68709
This PR adds support for triangular solver with a block CSR matrix.
cc nikitaved pearu cpuhrsch IvanYashchuk ngimel
Test Plan: Imported from OSS
Reviewed By: ngimel
Differential Revision: D33066067
Pulled By: cpuhrsch
fbshipit-source-id: 9eaf1839071e9526be8d8c6d47732b24200f3557
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68007
This PR adds a new function to the sparse module.
`sampled_addmm` computes α*(A @ B) * spy(C) + β*C, where C is a sparse CSR matrix and A, B are dense (strided) matrices.
This function is currently restricted to single 2D matrices; it doesn't support batched input.
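A minimal usage sketch (CUDA-only at the time of this PR; values illustrative):
```python
import torch

C = torch.eye(3).to_sparse_csr().cuda()  # sparsity pattern to sample at
A = torch.randn(3, 3, device="cuda")
B = torch.randn(3, 3, device="cuda")

# out = alpha * (A @ B) * spy(C) + beta * C, returned as sparse CSR
out = torch.sparse.sampled_addmm(C, A, B, beta=1.0, alpha=1.0)
```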
cc nikitaved pearu cpuhrsch IvanYashchuk
Test Plan: Imported from OSS
Reviewed By: mrshenli
Differential Revision: D32435799
Pulled By: cpuhrsch
fbshipit-source-id: b1ffac795080aef3fa05eaeeded03402bc097392
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68707
This PR adds a path for block CSR matrices to `torch.addmm`. The cuSPARSE interface is restricted to 32-bit indices and square blocks.
My plan is to first make everything work and tests pass using an unsafe constructor, keeping it all private, then discuss and implement constructors with block information separately, unlocking the functions for wider use. Documentation will come with the update to constructors.
cc nikitaved pearu cpuhrsch IvanYashchuk ngimel
Test Plan: Imported from OSS
Reviewed By: anjali411
Differential Revision: D32650366
Pulled By: cpuhrsch
fbshipit-source-id: 430a9627901781ee3d2e2496097b71ec17727d98
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68707
This PR adds a path for block CSR matrices to `torch.addmm`. The cuSPARSE interface is restricted to 32-bit indices and square blocks.
My plan is to first make everything work and tests pass using an unsafe constructor, keeping it all private, then discuss and implement constructors with block information separately, unlocking the functions for wider use. Documentation will come with the update to constructors.
cc nikitaved pearu cpuhrsch IvanYashchuk ngimel
Test Plan: Imported from OSS
Reviewed By: pbelevich
Differential Revision: D32633806
Pulled By: cpuhrsch
fbshipit-source-id: b98db0bd655cce651a5da457e78fca08619a5066
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62180
This PR adds CPU dispatch for `triangular_solve` with sparse CSR matrix.
The implementation uses the MKL Sparse library. If it's not available, a runtime error is thrown.
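A minimal usage sketch (requires an MKL build, per the note above; values illustrative):
```python
import torch

A = torch.tensor([[1., 0.], [2., 3.]]).to_sparse_csr()  # lower triangular
b = torch.ones(2, 1)

# Returns (solution, cloned_coefficient); the second output is not
# meaningful for sparse inputs.
x, _ = torch.triangular_solve(b, A, upper=False)
```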
cc nikitaved pearu cpuhrsch IvanYashchuk
Test Plan: Imported from OSS
Reviewed By: pbelevich
Differential Revision: D32581395
Pulled By: cpuhrsch
fbshipit-source-id: 41c7133a0d2754ef60b5a7f1d14aa0bf7680a844
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61536
This PR adds CPU dispatch for `addmv_out` with Sparse CSR matrix.
The implementation uses the MKL Sparse library. If it's not available, a
runtime error is thrown.
Since structured_delegate is used, we only need to implement the out variant; the in-place and normal variants are autogenerated.
The MKL descriptor of sparse matrices is implemented in `at::mkl::sparse::MklSparseCsrDescriptor`.
MKL Sparse doesn't allow switching the indices type at runtime; it's
predetermined at build time. Only the 32-bit version of MKL was tested
locally, but I expect the 64-bit version to work correctly as well.
When the indices type of a PyTorch CSR tensor doesn't match MKL's, the
indices tensor is converted to an MKL-compatible type (`int` vs `int64_t`).
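A minimal usage sketch of the new CPU path (requires an MKL build; values illustrative):
```python
import torch

A = torch.eye(3).to_sparse_csr()
x = torch.arange(3, dtype=torch.float32)
y = torch.zeros(3)

# out = beta * y + alpha * (A @ x)
res = torch.addmv(y, A, x)
```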
cc nikitaved pearu cpuhrsch IvanYashchuk
Test Plan: Imported from OSS
Reviewed By: ngimel
Differential Revision: D32141787
Pulled By: malfet
fbshipit-source-id: b818a0b186aa227982221c3862a594266a58a2a6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66401
This PR fixes the case when the result and input tensors have different
strides.
cuSPARSE from CUDA 11.3.1 has a bug: it doesn't use the correct strides to
write the result. This is "fixed" on the PyTorch side by copying the input
tensor to a tensor with the same strides as the result tensor.
cc nikitaved pearu cpuhrsch IvanYashchuk ngimel
Test Plan: Imported from OSS
Reviewed By: davidberard98
Differential Revision: D32177966
Pulled By: cpuhrsch
fbshipit-source-id: 118437409df147f04dce02763aff9bfd33f87c63
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63948
This PR adds `torch.add(a, b, alpha=None, out=out)` variant with `a, b,
out` all being sparse CSR tensors.
The underlying cuSPARSE function works only with 32-bit indices, and in
the current implementation the result tensor has 32-bit indices. Input
tensors can have either 64-bit or 32-bit indices.
Fixes https://github.com/pytorch/pytorch/issues/59060
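A minimal usage sketch (CUDA per the cuSPARSE note above; values illustrative):
```python
import torch

a = torch.eye(2).to_sparse_csr().cuda()
b = (2 * torch.eye(2)).to_sparse_csr().cuda()

out = torch.add(a, b, alpha=2.0)  # a + 2.0 * b, result is sparse CSR
```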
cc nikitaved pearu cpuhrsch IvanYashchuk ngimel
Test Plan: Imported from OSS
Reviewed By: zou3519
Differential Revision: D31909731
Pulled By: cpuhrsch
fbshipit-source-id: 656f523e3947fec56b2f93c474fb6fd49f0360ca
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61858
This PR adds `triangular_solve_out_sparse_csr_cuda`. The operation is
used to compute the solution to a linear system where the coefficient
matrix is triangular.
Structured kernels are used, and the meta function needed some changes to
support the sparse CSR layout. With sparse matrix input the `cloned_coefficient`
tensor is a 0-sized tensor.
cc nikitaved pearu cpuhrsch IvanYashchuk ngimel
Test Plan: Imported from OSS
Reviewed By: pbelevich
Differential Revision: D31948435
Pulled By: cpuhrsch
fbshipit-source-id: 7775fece83ca705a26d75f82aead10b956b14bfd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63511
This PR adds the `torch.addmm(c, a, b)` variant with `c, a, b` all being CSR tensors.
The underlying cuSPARSE function works only with 32-bit indices, and in
the current implementation the result tensor has 32-bit indices. Input
tensors can have either 64-bit or 32-bit indices.
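A minimal usage sketch (values illustrative):
```python
import torch

c = torch.eye(2).to_sparse_csr().cuda()
a = torch.eye(2).to_sparse_csr().cuda()
b = torch.eye(2).to_sparse_csr().cuda()

out = torch.addmm(c, a, b)  # beta * c + alpha * (a @ b), all operands CSR
```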
cc nikitaved pearu cpuhrsch IvanYashchuk ngimel
Test Plan: Imported from OSS
Reviewed By: eellison
Differential Revision: D31809838
Pulled By: cpuhrsch
fbshipit-source-id: 97005dba27d8adcae445eb756bcbd7271061e9b5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63510
Sparse CSR matrix resizing behavior:
- If we _increase the number of rows_, the number of specified elements in the matrix remains the same -> the sizes of col_indices and values don't change; the size of crow_indices becomes `rows+1`.
- If we _decrease the number of rows_, the number of specified elements becomes `min(nnz, rows*cols)` -> we need to resize `crow_indices` to `rows+1` and set its last element to `min(nnz, rows*cols)`, and decrease the sizes of col_indices and values to `min(nnz, rows*cols)`.
- If we _increase the number of columns_, the number of specified elements and the number of rows remain the same -> no need to resize anything, just set the new sizes.
- We _cannot decrease the number of columns_ because it would require recomputing `crow_indices`.
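A sketch of the behavior, assuming an in-place `resize_` entry point (hypothetical example):
```python
import torch

csr = torch.tensor([[1., 0.], [0., 2.]]).to_sparse_csr()

# Growing rows keeps nnz; crow_indices grows to rows + 1.
csr.resize_(4, 2)

# Growing columns only updates the sizes.
csr.resize_(4, 5)
```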
cc nikitaved pearu cpuhrsch IvanYashchuk
Test Plan: Imported from OSS
Reviewed By: anjali411
Differential Revision: D31796680
Pulled By: cpuhrsch
fbshipit-source-id: 7d8a9701ce06d30a1841f94bba0a057cacea9401
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63509
The primary use of `torch.empty` is to reserve memory for a tensor and set its type, device, and size information. The same is done here for SparseCSR.
`crow_indices` is initialized as an empty tensor of size `num_rows + 1`. `col_indices` and `values` are initialized as empty tensors of size 0.
cc nikitaved pearu cpuhrsch IvanYashchuk
Test Plan: Imported from OSS
Reviewed By: anjali411
Differential Revision: D31770359
Pulled By: cpuhrsch
fbshipit-source-id: c83f2a2e0d7514ba24780add1086e1bccf541dd9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66485
The errors for incorrectly sized inputs should match the dense variants
of functions.
Moved addmm_out_sparse_csr_dense_cuda from SparseCsrTensorMath.cu and
removed unnecessary device check.
cc nikitaved pearu cpuhrsch IvanYashchuk
Test Plan: Imported from OSS
Reviewed By: jbschlosser
Differential Revision: D31764036
Pulled By: cpuhrsch
fbshipit-source-id: 76900fe9e4a49474695a01f34bad41cb3422321c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61407
This PR adds `addmv_out_sparse_csr_cuda`. The operation is used to
compute matrix-vector multiplication. Since structured_delegate is used,
we only need to implement the out variant; the in-place and normal
variants are autogenerated.
Working on this PR revealed that float16 (and probably bfloat16) inputs
do not work correctly in cuSPARSE, therefore for this case `addmm` is
used with squeezes and unsqueezes.
cc nikitaved pearu cpuhrsch IvanYashchuk ngimel
Test Plan: Imported from OSS
Reviewed By: malfet
Differential Revision: D31584499
Pulled By: ngimel
fbshipit-source-id: 4c507791471ada88969116b88eeaaba7a7536431
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60838
Rewrote the `addmm_out_sparse_csr_dense_cuda` implementation using the new cuSPARSE descriptors.
`addmm` now works without conversions with both 32-bit and 64-bit indices.
The dense tensors can have a row- or column-major layout. If the dense tensors are a contiguous slice of a larger tensor, their storage is used directly without temporary copies.
Test Plan: Imported from OSS
Reviewed By: pbelevich
Differential Revision: D30643191
Pulled By: cpuhrsch
fbshipit-source-id: 5555f5b59b288daa3a3987d322a93dada63b46c8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63554
Following https://github.com/pytorch/pytorch/pull/61840#issuecomment-884087809, this deprecates all the dtype getters publicly exposed in the `torch.testing` namespace. The reason for this is twofold:
1. If someone is not familiar with the C++ dispatch macros PyTorch uses, the names are misleading. For example, `torch.testing.floating_types()` will only give you `float32` and `float64`, skipping `float16` and `bfloat16`.
2. The dtype getters provide very minimal functionality that can be easily emulated by downstream libraries.
We thought about [providing a replacement](https://gist.github.com/pmeier/3dfd2e105842ad0de4505068a1a0270a), but ultimately decided against it. The major problem is BC: if we keep it, either the namespace gets messy again after a new dtype is added, or we need to somehow version the return values of the getters.
Test Plan: Imported from OSS
Reviewed By: H-Huang
Differential Revision: D30662206
Pulled By: mruberry
fbshipit-source-id: a2bdb10ab02ae665df1b5b76e8afa9af043bbf56
Summary:
The `ops` decorator provides a way to parameterize a test across a given list of ops. This would be useful for modules as well (e.g. a `modules` decorator), but the mechanism by which this is accomplished is specific to ops. In the details, the `ops` decorator tags a test function with the metadata needed (list of ops, `dtypes`) and the actual tests are generated according to this metadata during the call to `instantiate_device_type_tests()`.
This PR makes this mechanism more generic, allowing for test parameterization across arbitrary dimensions. This makes a `modules` decorator (or any similar type of decorator) straightforward to implement without changes to the device-specific test instantiation logic.
One caveat is that, since this is implemented where the old `ops` decorator was (within `instantiate_device_type_tests()`), this only works for tests instantiated using the device-specific instantiation logic. Longer term, even device-specific test instantiation could be treated as an optional parameterization across device types, but this PR takes a low-risk approach for now. In practice, this just means that a `device` kwarg is required for all test signatures used with the mechanism.
The `ops` decorator has been refactored to use the generic mechanism and works the same as before, with one difference: when `OpDTypes.none` is specified, the test signature no longer needs an unused `dtype` kwarg. This is a nice bonus that demonstrates the added flexibility of a generic parameterization mechanism. The refactored form also has the bonus that all op-specific test generation logic is contained within the `ops` decorator class, improving readability.
Behind the scenes, the generic mechanism is a base decorator class (`_TestParameterizer`) from which `ops` derives. The core functionality is in the `_parameterize_test()` method, which takes in a test function and returns a generator that produces parameterized tests, including names and parameter kwargs to pass to them. Using the `ops` decorator results in a set of op-specific tests from a given generic test.
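A minimal sketch of the mechanism as described (illustrative signatures; the real class lives in the device-type test framework):
```python
class _TestParameterizer:
    def _parameterize_test(self, test, generic_cls, device_cls):
        # Yield (parameterized_test, name_suffix, param_kwargs) triples;
        # subclasses such as `ops` define the actual parameter space.
        raise NotImplementedError

class over_values(_TestParameterizer):
    # Hypothetical subclass parameterizing a test over a list of values.
    def __init__(self, values):
        self.values = values

    def _parameterize_test(self, test, generic_cls, device_cls):
        for v in self.values:
            def parameterized(self_, device, value=v, _test=test):
                return _test(self_, device, value)
            yield parameterized, f"_{v}", {"value": v}
```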
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60233
Reviewed By: iramazanli
Differential Revision: D29494995
Pulled By: jbschlosser
fbshipit-source-id: a14446488c106094fafcaa75ccf8e9e3faf33bfc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60656
This PR uses `torch.testing.get_all_dtypes()` for the dtype parametrization
of tests in `test_sparse_csr.py`. It adds the bool, half, bfloat16, and
complex dtypes, which were previously excluded from the tests.
`torch.complex32` is omitted due to lack of coverage and lack of a
specialized `AT_DISPATCH...`.
The process of adding more dtypes to tests revealed that `.to_dense()`
doesn't work for all dtypes.
Test Plan: Imported from OSS
Reviewed By: ngimel
Differential Revision: D29408058
Pulled By: cpuhrsch
fbshipit-source-id: 319b6f51b9786d6957d508f51657657a6d00267a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58768
Fixes gh-58757
This PR has a fix for the CPU version of the addmm op. Just for context: before this PR, only CSR @ vector was supported. I found a minor bug in `addmm_out_sparse_csr_dense_cpu` for the non-MKL code path, which is fixed in this PR.
Moreover, I discovered a limitation in the current MKL implementation: it only works well (acceptable tolerance for output error) with square matrices. I looked into this issue and found that it could be a limitation of the MKL API.
I used this [gist code](https://gist.github.com/aocsa/0606e833cd16a8bfb7d37a5fbb3a5b14), based on [this benchmark](https://github.com/baidu-research/DeepBench/blob/master/code/intel/spmm/spmm_bench.cpp), to test this behavior.
As you can see, the output error (last column) is acceptable when the matrices are square, but not when they are non-square. I reported the issue here: https://github.com/pytorch/pytorch/issues/58770
Looking forward to your comments.
Test Plan: Imported from OSS
Reviewed By: zou3519
Differential Revision: D28629563
Pulled By: malfet
fbshipit-source-id: 5ee00ae667336e0d9301e5117057213f472cbc86
Summary:
Fixes https://github.com/pytorch/pytorch/issues/58632.
Added several skips related to test asserts and MKL. Will address them in a separate PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58666
Reviewed By: seemethere, janeyx99
Differential Revision: D28607966
Pulled By: walterddr
fbshipit-source-id: 066d4afce2672e4026334528233e69f68da04965
Summary:
A `NULL` return from `PyObject_GetAttrString` should never be ignored without handling the exception, as the behavior of subsequent Python C API calls is undefined until `PyErr_Fetch` or `PyErr_Clear` is called.
This accidentally led to the `list` type being incorrectly identified as `Tensor`.
Fixes https://github.com/pytorch/pytorch/issues/58520
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58631
Reviewed By: albanD
Differential Revision: D28559454
Pulled By: malfet
fbshipit-source-id: 46f044b5f0f94264779a6108474d04a8ba851c53