Commit Graph

294 Commits

Pearu Peterson
419f2ca3e3 Fix a crash in sparse compressed tensor invariants check when nnz == 0 (#115825)
Fixes python crash example from https://github.com/pytorch/pytorch/issues/115755

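A minimal sketch (not the exact repro from the issue) of the kind of zero-nnz construction this fix is about, using the invariant-checking switch shown elsewhere in this log:

```python
import torch

# Construct an empty (nnz == 0) CSR tensor with invariant checking enabled;
# the point of the fix is that this validates cleanly instead of crashing.
torch.sparse.check_sparse_tensor_invariants.enable()
t = torch.sparse_csr_tensor(
    torch.zeros(4, dtype=torch.int64),   # crow_indices for 3 empty rows
    torch.empty(0, dtype=torch.int64),   # col_indices, nnz == 0
    torch.empty(0),                      # values, nnz == 0
    (3, 3),
)
print(t)  # nnz=0, layout=torch.sparse_csr
torch.sparse.check_sparse_tensor_invariants.disable()
```
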
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115825
Approved by: https://github.com/cpuhrsch
2023-12-17 17:36:15 +00:00
Pearu Peterson
32286512cc Add tune_bsr_dense_addmm as an API to find optimal triton kernel parameters for bsr_dense_addmm (#115499)
As in the title.

In addition:
- improve the algorithm for finding a minimum of operation timings: break the inner loop early when the next minimum candidate is found
- add tests and fix bugs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115499
Approved by: https://github.com/cpuhrsch
2023-12-12 16:44:51 +00:00
PyTorch MergeBot
d7180161b5 Revert "[SparseCsr] Remove triton sdpa skip after triton pin update (#109601)"
This reverts commit f64b10803f.

Reverted https://github.com/pytorch/pytorch/pull/109601 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing in trunk with this error ZeroDivisionError: integer division or modulo by zero ([comment](https://github.com/pytorch/pytorch/pull/109601#issuecomment-1847784383))
2023-12-08 20:12:53 +00:00
Peter Bell
f64b10803f [SparseCsr] Remove triton sdpa skip after triton pin update (#109601)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109601
Approved by: https://github.com/desertfire, https://github.com/amjames
2023-12-08 15:49:16 +00:00
Alexander Grund
ca15671c30 Fix failing test_invalid_input_csr_large (#114940)
The test introduced in #102530 has a bug:
Construction of `crow_indices` raises the exception "value cannot be converted to type int32 without overflow", which is correct in itself.
However, it makes the test fail before it can exercise the check it is supposed to verify, namely the overflow check on nnz.
Fix this by letting the construction of `crow_indices` succeed, although with an invalid value that would error later, so that the intended check is triggered.

Given the following, I'm not sure it is even worth checking for an overflow in nnz:
- `crow_indices[..., -1] == nnz` is already enforced
- this can only hold if `crow_indices` is able to hold `nnz` without overflow
- `col_indices` has to be of the same type as `crow_indices`
- Hence the type of `col_indices` has to be able to hold the value of `nnz`

So in conclusion: The situation being checked for cannot reasonably occur
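
The overflow mentioned above is easy to see in isolation; a minimal illustration (the exact test setup is in the PR):

```python
import torch

# An int32 index tensor cannot even be constructed with a value of 2**31,
# so the index construction fails before any nnz overflow check could run.
try:
    torch.tensor([0, 2**31], dtype=torch.int32)
except RuntimeError as e:
    print(e)  # value cannot be converted to type int32 without overflow
```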

CC @pearu as the test author for additional insight

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114940
Approved by: https://github.com/pearu, https://github.com/cpuhrsch
2023-12-08 11:55:21 +00:00
Pearu Peterson
12085914b8 Replace bsr_dense_mm triton kernel with bsr_dense_addmm triton kernel (#115030)
The `bsr_dense_addmm` triton kernel introduced in https://github.com/pytorch/pytorch/pull/114595 is a generalization of the `bsr_dense_mm` triton kernel and a more efficient version of it because it uses an extra kernel parameter `SPLIT_N` that has a notable effect on performance when the r.h.s. operand has a larger number of columns.

This PR eliminates the `bsr_dense_mm` triton kernel in favor of using `bsr_dense_addmm` triton kernel.

The performance increase of `bsr_dense_mm` is as follows (float16, `NVIDIA A100-SXM4-80GB`):
- with 16x16 blocks, the average/maximal speed up is 50/71 %
- with 32x32 blocks, the average/maximal speed up is 30/63 %
- with 64x64 blocks, the average/maximal speed up is 12/26 %
- with 128x128 blocks, the average/maximal speed up is 7/17 %

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115030
Approved by: https://github.com/cpuhrsch
2023-12-05 22:29:24 +00:00
Pearu Peterson
4ba37e1804 Add tests for bsr_dense_addmm and bsr_dense_mm triton kernels (#114800)
As in the title.

In addition,
- resolve https://github.com/pytorch/pytorch/pull/114757#discussion_r1409547917 re triton-contiguous inputs
- support non-contiguous inputs and outputs in triton kernels
- fix a couple of minor bugs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114800
Approved by: https://github.com/cpuhrsch
2023-12-04 22:07:47 +00:00
Jason Ansel
9664190952 [dynamo] Eagerly install guards (#111415)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111415
Approved by: https://github.com/voznesenskym
ghstack dependencies: #111306
2023-11-07 19:55:19 +00:00
Andrew M. James
0bd2955f15 Memory leak from bsr_scatter_mm_indices_data argument cache (#112301)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112301
Approved by: https://github.com/cpuhrsch, https://github.com/pearu
2023-11-02 18:43:10 +00:00
Pearu Peterson
cf6041e942 Use weakref in storing tensors as keys (follow-up to #111470) (#112076)
This PR addresses the discussion items in https://github.com/pytorch/pytorch/pull/111470#discussion_r1369008167, that is,
- use weakref when storing tensors as keys,
- add `storage_offset` to the key data,
- and revise the description of the `TensorAsKey` utility.
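
A generic sketch of the idea (this is not the actual `TensorAsKey` utility, just an illustration of the design): key the cache on stable identity data plus a weak reference, so cached entries do not keep their tensors alive, and include `storage_offset` in the key data as done in this PR.

```python
import weakref
import torch

class TensorAsKeySketch:
    """Hashable cache key for a tensor that does not own the tensor."""

    def __init__(self, t):
        self._ref = weakref.ref(t)  # does not extend the tensor's lifetime
        self._key = (t.data_ptr(), t.storage_offset(), tuple(t.shape),
                     t.dtype, t.device)

    def __hash__(self):
        return hash(self._key)

    def __eq__(self, other):
        return isinstance(other, TensorAsKeySketch) and self._key == other._key

x = torch.randn(4, 4)
cache = {TensorAsKeySketch(x): "kernel metadata"}
print(TensorAsKeySketch(x) in cache)  # True: same storage, offset and metadata
```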

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112076
Approved by: https://github.com/cpuhrsch
ghstack dependencies: #112154
2023-10-30 19:16:05 +00:00
Pearu Peterson
b969c675f5 Add batched dimensions support to the second operand of bsr_scatter_mm (#111796)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111796
Approved by: https://github.com/cpuhrsch
ghstack dependencies: #110396, #111470, #111489, #111760
2023-10-23 23:52:49 +00:00
Pearu Peterson
d4708a6da7 Add scatter_mm and bsr_scatter_mm operations. (#110396)
This PR introduces the `scatter_mm` operation (compute `mm` of arbitrary pairs of tensors given in batches of tensors), which is used to implement `bsr_scatter_mm`, an equivalent of `bsr_dense_mm` (the `mm` operation on BSR and strided tensors). The implementation is provided both in Triton (when tensor dimensions are multiples of 16) and in PyTorch (otherwise).

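A pure-PyTorch sketch of the semantics described above (the actual `scatter_mm` signature in the PR may differ; the helper name and argument layout here are illustrative only):

```python
import torch

def scatter_mm_reference(blocks1, blocks2, pairs, out_indices, out):
    # blocks1: (B1, m, k), blocks2: (B2, k, n), pairs: (P, 2) index pairs,
    # out_indices: (P,) destination index into out, out: (B, m, n).
    # Compute mm over arbitrary pairs of matrices drawn from the two batches
    # and scatter-accumulate the products into the output batch.
    for (i, j), dst in zip(pairs.tolist(), out_indices.tolist()):
        out[dst] += blocks1[i] @ blocks2[j]
    return out

blocks1 = torch.randn(4, 2, 3)
blocks2 = torch.randn(5, 3, 2)
pairs = torch.tensor([[0, 1], [2, 4], [3, 0]])
out_indices = torch.tensor([0, 0, 1])
out = scatter_mm_reference(blocks1, blocks2, pairs, out_indices,
                           torch.zeros(2, 2, 2))
print(out.shape)  # (2, 2, 2)
```
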
The figures below illustrate the performance differences of `bsr_scatter_mm` and `bsr_dense_mm` (GPU: `NVIDIA GeForce RTX 2060 SUPER`). The first figure represents the performance equilibrium point in BSR tensor sparsity at which value `bsr_scatter_mm` or `bsr_dense_mm` have the same performance characteristics as `torch.matmul`. The second figure represents speedups from using `bsr_scatter_mm` at its performance equilibrium points with respect to `bsr_dense_mm`.

<img src="https://github.com/pytorch/pytorch/assets/402156/526d182e-937f-4812-a6c4-904f52d6d5ab" width="48%"> <img src="https://github.com/pytorch/pytorch/assets/402156/ccb606ab-1f3f-4133-887c-b56285f4f168" width="48%">

The same figures for GPU card `NVIDIA A100-SXM4-80GB`:

<img src="https://github.com/pytorch/pytorch/assets/402156/25466f1d-df34-4d1c-a975-afb478e4d9f0" width="48%"> <img src="https://github.com/pytorch/pytorch/assets/402156/6ada91f0-a20f-4f0d-8a48-1f4ccc60d08e" width="48%">

In sum:
- `bsr_scatter_mm` is about 2x faster than `bsr_dense_mm` for small block sizes of 16 and 32 and large tensors [GPU: `NVIDIA GeForce RTX 2060 SUPER`].
- `bsr_scatter_mm` is up to 2x faster than `bsr_dense_mm` for small block sizes of 16 and large tensors [GPU: `NVIDIA A100-SXM4-80GB`].
- `bsr_dense_mm` is up to 20 % faster than `bsr_scatter_mm` for block sizes of 64 or larger [GPU: `NVIDIA GeForce RTX 2060 SUPER`].
- However, `bsr_dense_mm` fails with `OutOfResources` exception for block sizes of 256 or larger whereas `bsr_scatter_mm` succeeds.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110396
Approved by: https://github.com/cpuhrsch
2023-10-23 19:45:30 +00:00
Evgeni Burovski
48989bc820 trace frames with np.ndarray (#110512)
Fixes #109604

Resubmit gh-109715 + several skips and small fixes to make tests pass.

The main fix here is by @ysiraichi : previously, dynamo did not resume tracing numpy ndarrays after a graph break.
While at it, fix several small issues Yukio's fix uncovers:

- graph break gracefully on numpy dtypes which do not map to torch.dtypes (uint16 etc)
- recognize array scalars in dynamo, treat them as 0D ndarrays
- make sure that iterating over torch.ndarray generates arrays not bare tensors

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110512
Approved by: https://github.com/lezcano
2023-10-15 00:56:10 +00:00
Oguz Ulgen
1df14f1bf8 Move has_triton to top level triton utils so that dynamo can also access it without creating cyclic dependencies (#109832)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109832
Approved by: https://github.com/zou3519
2023-09-22 19:33:41 +00:00
Shunting Zhang
e68b3ad14f update triton pin with needed inductor change (#107722)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107722
Approved by: https://github.com/jansel, https://github.com/cpuhrsch
2023-08-29 04:31:44 +00:00
Pearu Peterson
d7c0c5de2d Set crow_indices outputs as non-differentiable. (#107447)
Fixes https://github.com/pytorch/pytorch/issues/107083

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107447
Approved by: https://github.com/cpuhrsch
2023-08-21 19:52:32 +00:00
rraminen
239578beff [ROCm] Enable a few bfloat16 unit tests (#105177)
Currently a few unit tests from **test_matmul_cuda** and **test_sparse_csr** test suites are being skipped on ROCm.

This PR is to enable the following unit tests on ROCm (~30 UTs):

test_cublas_baddbmm_large_input_* (__main__.TestMatmulCudaCUDA)
test_addmm_sizes_all_sparse_csr* (__main__.TestSparseCSRCUDA) when m==0 or n==0 or k==0

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105177
Approved by: https://github.com/pruthvistony, https://github.com/jithunnair-amd, https://github.com/malfet
2023-08-03 21:17:19 +00:00
yanbing-j
a54043516f Add SparseCsrCPU and SparseCsrCUDA dispatch to sum.dim_IntList (#99292)
This PR adds support for sum.dim_IntList for sparse tensors, as requested in https://github.com/pytorch/pytorch/issues/98796.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99292
Approved by: https://github.com/mingfeima, https://github.com/rusty1s, https://github.com/cpuhrsch
2023-07-24 17:30:58 +00:00
Justin Chu
73e1455327 [BE] Enable ruff's UP rules and autoformat test/ (#105434)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105434
Approved by: https://github.com/albanD
2023-07-19 20:36:06 +00:00
nikitaved
44c8515d0d SDPA: frontend for BSR masks (#104042)
This PR implements a (yet private) frontend for scaled_dot_product_attention that works with BSR `attn_mask`.

This function is directly comparable (with suitable masks) with `torch.nn.functional.scaled_dot_product_attention` when `attn_mask.dtype == torch.bool`, but its behavior is different when `attn_mask.dtype != torch.bool`. This is because `torch.nn.functional.scaled_dot_product_attention` assumes that irrelevant values are filled with `-inf`, while the selected ones are `0`.

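The boolean-vs-additive mask semantics described above can be seen with the standard dense SDPA; a small illustration (masks chosen so every row keeps at least one position):

```python
import torch
import torch.nn.functional as F

q, k, v = (torch.randn(1, 2, 4, 8) for _ in range(3))

# bool mask: True marks positions to keep
bool_mask = torch.tril(torch.ones(4, 4, dtype=torch.bool))
# float mask: added to the scores, so "keep" is 0.0 and "drop" is -inf
float_mask = torch.zeros(4, 4).masked_fill(~bool_mask, float("-inf"))

out_bool = F.scaled_dot_product_attention(q, k, v, attn_mask=bool_mask)
out_float = F.scaled_dot_product_attention(q, k, v, attn_mask=float_mask)
print(torch.allclose(out_bool, out_float, atol=1e-5))  # True: same semantics
```
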
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104042
Approved by: https://github.com/amjames, https://github.com/cpuhrsch
2023-07-13 18:01:21 +00:00
yanbing-j
053654b9cf Optimize scatter_add/scatter_reduce in BFloat16/Half data type in CPU backend (#103427)
### Description

This PR optimizes scatter_add/scatter_reduce for the BFloat16/Half data types in the CPU backend, which is one task in https://github.com/pyg-team/pytorch_geometric/issues/7057. The main point is creating a buffer among threads to accumulate intermediate data in fp32.

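A toy illustration of why the fp32 accumulation buffer matters (independent of scatter_add itself): near 256 the bf16 spacing is 2.0, so repeatedly adding 0.5 directly in bf16 is lost to rounding, while an fp32 buffer keeps every contribution and is rounded to bf16 only once at the end.

```python
import torch

acc_bf16 = torch.tensor(256.0, dtype=torch.bfloat16)
acc_fp32 = torch.tensor(256.0, dtype=torch.float32)
for _ in range(8):
    acc_bf16 = acc_bf16 + torch.tensor(0.5, dtype=torch.bfloat16)
    acc_fp32 = acc_fp32 + 0.5
print(acc_bf16.item())                     # 256.0: increments rounded away
print(acc_fp32.to(torch.bfloat16).item())  # 260.0: single final rounding
```
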
Next step:

 - [x] Add benchmarks
 - [x] Extend to Half
 - [x] Simplify code

### Performance test (Updated)

Test BFloat16 in Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz
With jemalloc and iomp

Single socket (40C)
![image](https://github.com/pytorch/pytorch/assets/61222868/4b4342f1-8cc3-46f7-81f5-651becd9b1e3)

Single core
![image](https://github.com/pytorch/pytorch/assets/61222868/09e5f700-2c2e-4208-979e-74b85474dea6)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103427
Approved by: https://github.com/mingfeima, https://github.com/albanD
2023-07-13 09:34:29 +00:00
PyTorch MergeBot
f8aedf1efe Revert "Optimize scatter_add/scatter_reduce in BFloat16/Half data type in CPU backend (#103427)"
This reverts commit da7675621e.

Reverted https://github.com/pytorch/pytorch/pull/103427 on behalf of https://github.com/clee2000 due to sorry but it looks like this pr broke test_scatter_gather_ops.py::TestScatterGatherCPU::test_scatter_expanded_index_cpu_bfloat16 on periodic parallelnative testing da7675621e https://github.com/pytorch/pytorch/actions/runs/5477783108/jobs/9977608393 ([comment](https://github.com/pytorch/pytorch/pull/103427#issuecomment-1624008753))
2023-07-06 17:02:03 +00:00
yanbing-j
da7675621e Optimize scatter_add/scatter_reduce in BFloat16/Half data type in CPU backend (#103427)
### Description

This PR optimizes scatter_add/scatter_reduce for the BFloat16/Half data types in the CPU backend, which is one task in https://github.com/pyg-team/pytorch_geometric/issues/7057. The main point is creating a buffer among threads to accumulate intermediate data in fp32.

Next step:

 - [x] Add benchmarks
 - [x] Extend to Half
 - [x] Simplify code

### Performance test (Updated)

Test BFloat16 in Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz
With jemalloc and iomp

Single socket (40C)
![image](https://github.com/pytorch/pytorch/assets/61222868/4b4342f1-8cc3-46f7-81f5-651becd9b1e3)

Single core
![image](https://github.com/pytorch/pytorch/assets/61222868/09e5f700-2c2e-4208-979e-74b85474dea6)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103427
Approved by: https://github.com/mingfeima, https://github.com/albanD
2023-07-06 01:23:56 +00:00
Andrew M. James
5364366f8c Sparse Compressed mm avoid creating temp sparse (#104062)
When mm forwards to addmm, it creates a zeroed-out self; this tensor
should take its options from the result, not from one of the sparse arguments.

The bug led to an error when calling linear with an `out` kwarg.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104062
Approved by: https://github.com/nikitaved, https://github.com/pearu
2023-06-26 16:45:04 +00:00
Aleksandar Samardžić
09fdea8564 Fix autograd issue with identity conversions (#92022)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92022
Approved by: https://github.com/pearu, https://github.com/mtaaooby, https://github.com/amjames, https://github.com/cpuhrsch
2023-06-21 21:23:03 +00:00
Nikita Vedeneev
39a22e2791 softmax: Triton kernel for BSR inputs (#102095)
Implements `softmax` Triton kernel for BSR inputs. So far, only over `dim=-1`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102095
Approved by: https://github.com/cpuhrsch
2023-06-21 01:23:27 +00:00
Pearu Peterson
cbe270d233 Fix zeros_like for sparse tensors with batch dimensions. Add opinfo-based tests to like-functions. (#101215)
Fixes #101078

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101215
Approved by: https://github.com/cpuhrsch
2023-06-13 16:02:10 +00:00
Xiao Wang
6340aa5d58 Skip test test_triton_bsr_dense_bmm if not TEST_WITH_TORCHINDUCTOR [v2] (#102660)
Test was originally skipped in https://github.com/pytorch/pytorch/pull/98462

Not sure why it was removed in https://github.com/pytorch/pytorch/pull/94825

Now the test hits CUDA illegal memory access on H100 again after https://github.com/pytorch/pytorch/pull/101163

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102660
Approved by: https://github.com/zou3519
2023-06-01 20:36:45 +00:00
Pearu Peterson
9f97b7c43b Add integer overflow checks for large compressed tensor dimensions and nnz (#102530)
With the previous PR allowing large compressed tensors (dimensions larger than `2 ** 31 - 1`), sparse compressed tensor invariants checks may give false-positive results:
```python
>>> nnz=2**31
>>> torch.sparse.check_sparse_tensor_invariants.enable()
>>> torch.sparse_csr_tensor(torch.arange(nnz+1, dtype=torch.int32), torch.zeros(nnz, dtype=torch.int32), torch.ones(nnz), (nnz, 1))
tensor(crow_indices=tensor([          0,           1,           2,  ...,
                             2147483646,  2147483647, -2147483648]),
       col_indices=tensor([0, 0, 0,  ..., 0, 0, 0]),
       values=tensor([1., 1., 1.,  ..., 1., 1., 1.]), size=(2147483648, 1),
       nnz=2147483648, layout=torch.sparse_csr)
```
(notice that the last entry in `crow_indices` is invalid) or raise a bogus exception as in
```python
>>> torch.sparse_csr_tensor(torch.arange(nnz+1, dtype=torch.int32), torch.arange(nnz, dtype=torch.int32), torch.ones(nnz), (nnz, 1))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: `0 <= col_indices < ncols` is not satisfied.
```
(notice that `col_indices` is actually valid).

This PR fixes the above-reported bugs by introducing integer overflow checks for sparse compressed tensors dimensions as well as nnz.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102530
Approved by: https://github.com/nikitaved
2023-05-31 15:34:08 +00:00
Nikita Vedeneev
d80d3b18d0 nn.Linear with BSR inputs: spare the user from explicit Triton kernel registrations (#98403)
### <samp>🤖 Generated by Copilot at 08f7a6a</samp>

This pull request adds support for triton kernels in `torch` and `torch/cuda`, and refactors and tests the existing triton kernel for BSR matrix multiplication. It also adds a test case to ensure that importing `torch` does not implicitly import `triton`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98403
Approved by: https://github.com/malfet, https://github.com/cpuhrsch
2023-05-31 13:09:45 +00:00
Pearu Peterson
fcbdbd6682 Fix silent nnz overflow for large sparse compressed tensors. (#102523)
Fixes https://github.com/pytorch/pytorch/issues/102520

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102523
Approved by: https://github.com/nikitaved, https://github.com/cpuhrsch
2023-05-30 16:58:01 +00:00
Nikita Vedeneev
6c7410ddc3 sampled_addmm: BSR support (#101163)
This PR implements a `sampled_addmm` kernel that works with a BSR mask.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101163
Approved by: https://github.com/cpuhrsch
2023-05-25 12:33:50 +00:00
Nikita Vedeneev
346e1f512f sparse compressed validation: allow empty-batched inputs (#101180)
Fixes https://github.com/pytorch/pytorch/issues/101179.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101180
Approved by: https://github.com/pearu, https://github.com/cpuhrsch
2023-05-11 20:30:20 +00:00
Nikita Vedeneev
dd2c22f4bb bsr_dense_bmm(): enable more precise float32 support with float64 accumulators (#100882)
Float64 is there in Triton! This PR increases precision for float32 inputs with float64 accumulation dtype.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100882
Approved by: https://github.com/cpuhrsch
2023-05-11 11:22:55 +00:00
Pearu Peterson
92a7640b76 Add mul tests with sparse sample inputs (#100393)
This PR implements sparse sample inputs and error inputs for mul OpInfo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100393
Approved by: https://github.com/amjames, https://github.com/cpuhrsch
2023-05-09 16:13:14 +00:00
Nikita Vedeneev
0141a242fd bsr_dense_bmm(): remove sparse_rowspace kernel and some dead code (#100876)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100876
Approved by: https://github.com/cpuhrsch, https://github.com/Skylion007
2023-05-09 16:12:11 +00:00
Nikita Vedeneev
c4bc259f00 bsr_dense_mm(): better test coverage (#100543)
This PR improves test coverage for `bsr_dense_mm` by:
- ~~enabling correctness tests for `float32`~~.
- extending and testing input correctness checks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100543
Approved by: https://github.com/cpuhrsch, https://github.com/malfet
2023-05-09 09:26:02 +00:00
Pearu Peterson
3ae0e23b90 Fix sum OpInfo for sparse sample inputs and assert coverage for sparse-enabled operators (#100391)
This PR enables sum tests for sparse sample inputs. Previously, the tests existed but were never run because the sum OpInfo instance was created without specifying `supports_sparse_*=True`. To avoid such mistakes in the future, the following PR https://github.com/pytorch/pytorch/pull/100392 enables the `supports_sparse_*` flags automatically when OpInfo creation specifies `sample_inputs_sparse_*_func`.

In addition, the PR applies several fixes to sum tests for sparse sample inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100391
Approved by: https://github.com/cpuhrsch
2023-05-03 02:04:39 +00:00
Nikita Vedeneev
1adb6fa922 nn.Linear: dispatch to bsr_dense_mm for half and bfloat16 (#94825)
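
A hedged usage sketch of what this dispatch enables (assumptions: a CUDA device with Triton available, half precision, and a block size of at least 16 as required by the kernel):

```python
import torch
import torch.nn.functional as F

# F.linear with a BSR weight routes the matmul through the Triton
# bsr_dense_mm path for half/bfloat16 inputs.
weight = torch.randn(64, 64, dtype=torch.half, device="cuda")
weight_bsr = weight.to_sparse_bsr(blocksize=(16, 16))
x = torch.randn(32, 64, dtype=torch.half, device="cuda")
y = F.linear(x, weight_bsr)
print(y.shape)  # (32, 64)
```
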
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94825
Approved by: https://github.com/albanD, https://github.com/cpuhrsch
2023-04-15 13:38:42 +00:00
Xiao Wang
bd83b205cc Skip test test_triton_bsr_dense_bmm if not TEST_WITH_TORCHINDUCTOR (#98462)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98462
Approved by: https://github.com/zou3519
2023-04-10 21:21:06 +00:00
eqy
2fddcf0fc0 [CUDA][CUDA 11] Remove more CUDA 11 version checks (#92934)
Working on removing stragglers missed in previous CUDA version < 11.0 cleanup PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92934
Approved by: https://github.com/ngimel
2023-03-30 19:49:52 +00:00
Aaron Gokaslan
47dca20d80 [BE] Enable flake8-comprehension rule C417 (#97880)
Enables flake8-comprehension rule C417. Ruff autogenerated these fixes to the codebase.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97880
Approved by: https://github.com/ezyang, https://github.com/kit1980, https://github.com/albanD
2023-03-30 14:34:24 +00:00
Sergii Dymchenko
5ab50cf048 Fix shoud/shoudl typos (#97930)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97930
Approved by: https://github.com/clee2000
2023-03-30 08:27:16 +00:00
Nikita Shulga
2c16b73a1b Remove comma from parametrized test name (#97844)
Using the `name_fn` argument of the `@parametrize` decorator.

The internal test runner can't figure out how to parse names containing commas; otherwise this is a no-op.

For those with intern access, see [T149211516](https://www.internalfb.com/intern/tasks/?t=149211516)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97844
Approved by: https://github.com/weiwangmeta
2023-03-29 14:20:13 +00:00
Nikita Shulga
b443198966 Fix sparse addmv ref impl for non-contig tensors (#97730)
Fix logic in `test_block_addmm` that tested the op against itself rather than against the dense implementation, by implementing a `ref_addvm` function that converts the tensor back to dense before multiplying it with the vector.

Fix the reference implementation by passing strides for the vector and the result. (Not sure whether it would be more performant to iterate over the strided tensor or to request a dense copy, as the MKL implementation does.)

Print a more verbose error message if values differ.

Fixes https://github.com/pytorch/pytorch/issues/97629 , https://github.com/pytorch/pytorch/issues/97589 ,  https://github.com/pytorch/pytorch/issues/97563
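
A sketch of the dense-reference idea described above (the `ref_addvm` signature here is an assumption, not the one used in the test):

```python
import torch

def ref_addvm(beta, t, alpha, mat_sparse, vec):
    # densify the sparse matrix so the comparison is against the dense path
    return beta * t + alpha * (mat_sparse.to_dense() @ vec)

mat = torch.randn(4, 4).relu().to_sparse_csr()
vec, t = torch.randn(4), torch.randn(4)
expected = ref_addvm(0.5, t, 2.0, mat, vec)
actual = torch.addmv(t, mat, vec, beta=0.5, alpha=2.0)
torch.testing.assert_close(actual, expected)
```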

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97730
Approved by: https://github.com/cpuhrsch
2023-03-28 20:46:32 +00:00
Nikita Shulga
ad5d81adda [Sparse] Add reference implementation for addmv (#97353)
Partially addresses the problem raised in https://github.com/pytorch/pytorch/issues/96972

Add `test_addmv` and enable `test_block_addmv` on all platforms (so the test could be run on M1)

TODO: Make sure that the test_block_addmv non-contiguous mode actually
generates non-contiguous inputs, as right now it probably does not: the test
passes assuming values are contiguous.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97353
Approved by: https://github.com/cpuhrsch
2023-03-24 06:14:32 +00:00
haozhe.zhu
fe0afc5852 use accumulate type in BF16 gemm(include dot, mv) ref path (#96074)
Fix https://github.com/pytorch/pytorch/issues/95125 and https://github.com/pytorch/pytorch/issues/83863 for bf16 accumulation in gemm ref path

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96074
Approved by: https://github.com/lezcano, https://github.com/peterbell10
2023-03-23 01:22:59 +00:00
Nikita Vedeneev
55cf7eef86 add/add_ for sparse compressed formats: fix silent index downcast int64 -> int32 (#95294)
Fixes https://github.com/pytorch/pytorch/issues/95224.
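
A sketch of the invariant this fix protects (assumption about the exact repro in the issue): adding two CSR tensors with int64 indices should not silently downcast the result's indices to int32.

```python
import torch

a = torch.eye(3).to_sparse_csr()   # int64 indices by default
b = torch.eye(3).to_sparse_csr()
c = a + b
print(c.crow_indices().dtype)      # torch.int64 after the fix
```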

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95294
Approved by: https://github.com/cpuhrsch, https://github.com/amjames
2023-03-10 17:51:40 +00:00
Nikita Vedeneev
98a4d74a68 COO intersection primitives: performance improvement (#96094)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96094
Approved by: https://github.com/pearu
2023-03-07 13:21:29 +00:00
Nikita Vedeneev
d809020fc8 Triton kernel for bsr @ dense (#94823)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94823
Approved by: https://github.com/cpuhrsch, https://github.com/malfet
2023-03-03 15:11:28 +00:00
PyTorch MergeBot
d7637801d3 Revert "COO intersection primitives: performance improvement (#92976)"
This reverts commit b033594943.

Reverted https://github.com/pytorch/pytorch/pull/92976 on behalf of https://github.com/seemethere due to Need to revert this so I can revert https://github.com/pytorch/pytorch/pull/94048 cleanly
2023-03-03 01:38:56 +00:00
Nikita Vedeneev
b033594943 COO intersection primitives: performance improvement (#92976)
This PR improves COO intersection primitives by:
* making it sync-less (for dims <= 8; this can be changed to any value that fits on the stack).
* improving performance with far fewer kernel calls.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92976
Approved by: https://github.com/cpuhrsch, https://github.com/pearu
2023-03-02 17:42:39 +00:00
Nikita Vedeneev
325b43661e add/add_ for compressed sparse inputs: bypass BLAS in some trivial cases (#95293)
In `add(self, other, out=...)` we can bypass calls to BLAS in cases when `self == other == out` and `self == other`.
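
A sketch of the trivial aliasing case described above (assuming the shape of the original repro): `x.add_(x)` on a compressed sparse tensor, where self, other, and out are all the same tensor.

```python
import torch

x = torch.eye(3).to_sparse_csr()
x.add_(x)
print(x.values())  # tensor([2., 2., 2.])
```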

This PR fixes the repro from https://github.com/pytorch/pytorch/issues/94966, but the issue is still present when `x.add_(x)` is replaced, say, with `x = x.clone().add_(x)`.
Could that be a synchronization issue? CC @IvanYashchuk .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95293
Approved by: https://github.com/cpuhrsch
2023-02-27 16:06:02 +00:00
mingfeima
c620ece726 port sparse_mm.reduce to pytorch and optimize it on CPU (#83727)
### Motivation of this PR

This patch is to migrate `spmm_reduce` from `torch-sparse` (a 3rd party dependency for PyG) to `torch`, which is a response to the initial proposal for fusion of **Gather, Apply Scatter** in Message Passing of GNN inference/training. https://github.com/pytorch/pytorch/issues/71300

**GAS** is the major step in Message Passing; its behavior can be classified into 2 kinds depending on the storage type of `EdgeIndex`, which records the connections of nodes:

* COO: the hotspot is `scatter_reduce`
* CSR: the hotspot is `spmm_reduce`

The reduce type can be chosen from: "sum", "mean", "max", "min".

Extend `torch.sparse.mm` with a `reduce` argument, which maps to `torch.sparse_mm.reduce` internally (see the usage sketch below).
`sparse_mm_reduce` is registered under the TensorTypeId of `SparseCsrCPU`, and this operator requires an internal interface `_sparse_mm_reduce_impl` which has dual outputs:
* `out` - the actual output
* `arg_out` - records the output indices of the non-zero elements if the reduce type is "max" or "min"; this is only useful for training, so for inference it is not calculated.

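A hedged usage sketch of the extended Python API described above (the set of accepted reduce strings is an assumption; "mean" is used here, and per this PR the reduce path is registered for CSR on CPU only):

```python
import torch

src = torch.tensor([[1., 0., 2.],
                    [0., 3., 0.]]).to_sparse_csr()
other = torch.randn(3, 4)
out = torch.sparse.mm(src, other, reduce="mean")
print(out.shape)  # (2, 4)
```
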
### Performance

Benchmarking GCN on ogbn-products on a single Xeon socket, the workload is improved by `4.3x` with this patch.

The performance benefit for training will be bigger: the original backward impl for `sum|mean` is sequential, and the original backward impl for `max|min` is not fused.

#### before:
```
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                         Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------
       torch_sparse::spmm_sum        97.09%       56.086s        97.09%       56.088s        6.232s             9
                 aten::linear         0.00%      85.000us         1.38%     795.485ms      88.387ms             9
                 aten::matmul         0.00%      57.000us         1.38%     795.260ms      88.362ms             9
                     aten::mm         1.38%     795.201ms         1.38%     795.203ms      88.356ms             9
                   aten::relu         0.00%      50.000us         0.76%     440.434ms      73.406ms             6
              aten::clamp_min         0.76%     440.384ms         0.76%     440.384ms      73.397ms             6
                   aten::add_         0.57%     327.801ms         0.57%     327.801ms      36.422ms             9
            aten::log_softmax         0.00%      23.000us         0.10%      55.503ms      18.501ms             3
```

#### after
```
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                         Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------
               aten::spmm_sum        87.35%       11.826s        87.36%       11.827s        1.314s             9
                 aten::linear         0.00%      92.000us         5.87%     794.451ms      88.272ms             9
                 aten::matmul         0.00%      62.000us         5.87%     794.208ms      88.245ms             9
                     aten::mm         5.87%     794.143ms         5.87%     794.146ms      88.238ms             9
                   aten::relu         0.00%      53.000us         3.35%     452.977ms      75.496ms             6
              aten::clamp_min         3.35%     452.924ms         3.35%     452.924ms      75.487ms             6
                   aten::add_         2.58%     348.663ms         2.58%     348.663ms      38.740ms             9
                 aten::argmax         0.42%      57.473ms         0.42%      57.475ms      14.369ms             4
            aten::log_softmax         0.00%      22.000us         0.39%      52.605ms      17.535ms             3
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83727
Approved by: https://github.com/jgong5, https://github.com/cpuhrsch, https://github.com/rusty1s, https://github.com/pearu
2023-02-10 15:56:40 +00:00
Aleksandar Samardžić
e1f17b3530 Add CSR->BSC and CSC->BSR conversions (#93301)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93301
Approved by: https://github.com/cpuhrsch
2023-02-07 19:22:05 +00:00
Nikita Vedeneev
bb6af061a0 torch.triangular_solve for CSR: materialize diagonal elements when unitriangular=True. (#93352)
Fixes https://github.com/pytorch/pytorch/issues/88890

A temporary fix until MKL is fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93352
Approved by: https://github.com/cpuhrsch
2023-01-31 16:33:57 +00:00
Aleksandar Samardžić
53f7fb9a22 Add CSC->BSC conversion (#92307)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92307
Approved by: https://github.com/cpuhrsch
2023-01-30 17:03:36 +00:00
Pearu Peterson
65d6802e2f Improve error messages for sparse methods on tensors with unsupported backends/layouts. (#93149)
Fixes https://github.com/pytorch/pytorch/issues/92790

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93149
Approved by: https://github.com/cpuhrsch
2023-01-27 19:50:23 +00:00
PyTorch MergeBot
7012d985fa Revert "Improve bsr @ strided performance in baddmm for bfloat16/half with Triton kernels. (#88078)"
This reverts commit 46f16b9363.

Reverted https://github.com/pytorch/pytorch/pull/88078 on behalf of https://github.com/ZainRizvi due to Causing a test to fail consistently: test_decomp.py::HasDecompTest::test_has_decomposition
2023-01-26 16:22:29 +00:00
Nikita Vedeneev
46f16b9363 Improve bsr @ strided performance in baddmm for bfloat16/half with Triton kernels. (#88078)
As per title.

Additionally we also introduce support for:
- Rectangular block sizes which are powers of 2 and at least 16 (triton's `dot` limitation).
- Batch support with broadcasting for either of the arguments.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88078
Approved by: https://github.com/cpuhrsch
2023-01-26 07:58:27 +00:00
Eddie Yan
0bf7506051 [CUDA] Drop CUDA < 11.0 test flags (#92605)
Follow-up of #89582 to drop flags like `CUDA11OrLater` in tests. Note that in some places it appears that `TEST_WITH_ROCM` is _implicitly_ guarded against via the `CUDA11OrLater` version check, based on my best-guess of how `torch.version.cuda` would behave in ROCM builds, so I've added `not TEST_WITH_ROCM` in cases where ROCM wasn't previously explicitly allowed.

CC @ptrblck @malfet @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92605
Approved by: https://github.com/ngimel
2023-01-24 04:34:06 +00:00
Yanbo Liang
0ab4ab9f8d [Dynamo] Fix calling UserDefinedObject.func should pass self object (#92050)
Fixes #90834

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92050
Approved by: https://github.com/jansel
2023-01-21 05:47:01 +00:00
PyTorch MergeBot
60bf851931 Revert "Improve bsr @ strided performance in baddmm for bfloat16/half with Triton kernels. (#88078)"
This reverts commit 8383b5c488.

Reverted https://github.com/pytorch/pytorch/pull/88078 on behalf of https://github.com/malfet due to This seems to have broke sm_86 testing, see https://hud.pytorch.org/hud/pytorch/pytorch/master/1?per_page=50&name_filter=sm86%20%2F%20test%20(default%2C%203
2023-01-19 23:37:59 +00:00
Nikita Vedeneev
8383b5c488 Improve bsr @ strided performance in baddmm for bfloat16/half with Triton kernels. (#88078)
As per title.

Additionally we also introduce support for:
- Rectangular block sizes which are powers of 2 and at least 16 (triton's `dot` limitation).
- Batch support with broadcasting for either of the arguments.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88078
Approved by: https://github.com/cpuhrsch
2023-01-19 03:14:54 +00:00
PyTorch MergeBot
89f1ad08b4 Revert "Improve bsr @ strided performance in baddmm for bfloat16/half with Triton kernels. (#88078)"
This reverts commit 7f256fff77.

Reverted https://github.com/pytorch/pytorch/pull/88078 on behalf of https://github.com/huydhn due to This breaks lint 7f256fff77
2023-01-17 22:14:37 +00:00
Nikita Vedeneev
7f256fff77 Improve bsr @ strided performance in baddmm for bfloat16/half with Triton kernels. (#88078)
As per title.

Additionally we also introduce support for:
- Rectangular block sizes which are powers of 2 and at least 16 (triton's `dot` limitation).
- Batch support with broadcasting for either of the arguments.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88078
Approved by: https://github.com/cpuhrsch
2023-01-17 21:43:20 +00:00
Pearu Peterson
b3e4f5029b Add check-sparse-tensor-invariants flag to Context - 2nd try. (#92094)
This PR is a copy of https://github.com/pytorch/pytorch/pull/90849 whose merge was reverted.

The PR adds a "check sparse tensor invariants" flag to Context that, when enabled, triggers sparse tensor data invariant checks in the unsafe methods of constructing sparse COO/CSR/CSC/BSR/BSC tensors. The feature includes the following changes to the UI:

`torch.sparse.check_sparse_tensor_invariants` class provides different ways to enable/disable the invariant checking.

`torch.sparse_coo/csr/csc/bsr/bsc/compressed_tensor` functions have a new optional argument `check_invariants` to enable/disable the invariant checks explicitly. When the `check_invariants` argument is specified, the global state of the feature is temporarily overridden.
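
A small usage sketch of the two entry points described above:

```python
import torch

# per-call override: validate this construction regardless of the global flag
i = torch.tensor([[0, 1]])
v = torch.tensor([1., 2.])
t = torch.sparse_coo_tensor(i, v, (2,), check_invariants=True)

# scoped/global switch via the helper class
torch.sparse.check_sparse_tensor_invariants.enable()
print(torch.sparse.check_sparse_tensor_invariants.is_enabled())  # True
torch.sparse.check_sparse_tensor_invariants.disable()
```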

The PR fixes https://github.com/pytorch/pytorch/issues/90833

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92094
Approved by: https://github.com/cpuhrsch
2023-01-13 14:50:33 +00:00
mingfeima
3ab58fd5ed optimize sampled_addmm performance on CPU (SparseCSR) (#90978)
### Target and Background
This PR improves the performance of `sampled_addmm` on the CPU device. It is part of the effort to improve PyG performance on CPU for GNN training/inference.

The current implementation is a reference design which converts the `SparseCSR` tensor back to a dense tensor, does the addmm, and then converts back to `SparseCSR` again: this is very slow and cannot run most of the datasets under https://github.com/snap-stanford/ogb (converting to dense would trigger `OOM`).

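For reference, a minimal CPU usage sketch of the op being optimized here:

```python
import torch

# sampled_addmm computes beta * input + alpha * (mat1 @ mat2), evaluated only
# at the sparsity pattern of the CSR `input`.
input_csr = torch.eye(3).to_sparse_csr()
mat1 = torch.randn(3, 5)
mat2 = torch.randn(5, 3)
out = torch.sparse.sampled_addmm(input_csr, mat1, mat2, beta=0.5, alpha=2.0)
print(out)  # CSR tensor with the same sparsity pattern as input_csr
```
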
### Benchmarks

Right now we don't have any hands-on benchmark or workload to test this since this operator is not used in PyG yet. I fetched the dataset from `ogb-products` where:

* number of nodes: 2.4 * 10^6
* number of edges: 1.26 * 10^8
* number of features: 128

So if we store the **adjacency matrix** as dense, it is going to be 2.4 * 2.4 * 4 * 10^12 bytes; this would OOM with the current code. I extract the first 1k rows to compare, with a **1100x** speedup:

CPU: Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz, dual socket, 20 cores per socket.
```
### before: run 1000 rows from the whole dataset
sampled_addmm: running dataset ogb-products first 1000 rows: each iter takes 1212.000 ms!

### after: run 1000 rows from the whole dataset
sampled_addmm: running dataset ogb-products first 1000 rows: each iter takes 1.102 ms!

### after: run the whole dataset
sampled_addmm: running dataset ogb-products (the whole dataset) 2449029 rows: each iter takes 873.306 ms!
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90978
Approved by: https://github.com/pearu, https://github.com/cpuhrsch
2023-01-12 12:04:07 +00:00
PyTorch MergeBot
c7a22bb7c7 Revert "Add check-sparse-tensor-invariants flag to Context. (#90849)"
This reverts commit b9a035c1c5.

Reverted https://github.com/pytorch/pytorch/pull/90849 on behalf of https://github.com/DanilBaibak due to Break internal build
2023-01-12 09:58:16 +00:00
Aleksandar Samardžić
8612ec5b90 Implement hybrid sparse to/from dense conversions. (#90177)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90177
Approved by: https://github.com/cpuhrsch, https://github.com/pearu
2023-01-12 03:31:30 +00:00
PyTorch MergeBot
c5836153f5 Revert "optimize sampled_addmm performance on CPU (SparseCSR) (#90978)"
This reverts commit 645fb217c0.

Reverted https://github.com/pytorch/pytorch/pull/90978 on behalf of https://github.com/seemethere due to This broke internal builds for android due to the new file added being missing in build_variables.bzl
2023-01-11 20:12:12 +00:00
Pearu Peterson
b9a035c1c5 Add check-sparse-tensor-invariants flag to Context. (#90849)
This PR adds a "check sparse tensor invariants" flag to Context that, when enabled, triggers sparse tensor data invariant checks in the unsafe methods of constructing sparse COO/CSR/CSC/BSR/BSC tensors. The feature includes the following changes to the UI:

- `torch.enable_check_sparse_tensor_invariants` and `torch.is_check_sparse_tensor_invariants_enabled` functions to globally enable/disable the invariant checks and to retrieve the state of the feature, respectively
- `torch.sparse_coo/csr/csc/bsr/bsc/compressed_tensor` functions have a new optional argument `check_invariants` to enable/disable the invariant checks explicitly. When the `check_invariants` argument is specified, the global state of the feature is temporarily overridden.

The PR also fixes https://github.com/pytorch/pytorch/issues/90833

# Main issue

*The following content is outdated after merging the PRs in this ghstack but kept for the record.*

The importance of this feature is that when enabling the invariants checks by default, say, via

<details>

```
$ git diff
diff --git a/torch/__init__.py b/torch/__init__.py
index c8543057c7..19a91d0482 100644
--- a/torch/__init__.py
+++ b/torch/__init__.py
@@ -1239,3 +1239,8 @@ if 'TORCH_CUDA_SANITIZER' in os.environ:

 # Populate magic methods on SymInt and SymFloat
 import torch.fx.experimental.symbolic_shapes
+
+# temporarily enable sparse tensor arguments validation in unsafe
+# constructors:
+
+torch._C._set_check_sparse_tensor_invariants(True)
```

</details>

a massive number of test failures/errors occur in test_sparse_csr.py tests:
```
$ pytest -sv test/test_sparse_csr.py
<snip>
==== 4293 failed, 1557 passed, 237 skipped, 2744 errors in 69.71s (0:01:09) ====
```
that means that we are silently constructing sparse compressed tensors that do not satisfy the sparse tensor invariants. In particular, the following errors are raised:

```
AssertionError: "resize_as_sparse_compressed_tensor_: self and src must have the same layout" does not match "expected values to be a strided and contiguous tensor"

RuntimeError: CUDA error: device-side assert triggered

RuntimeError: `col_indices[..., crow_indices[..., i - 1]:crow_indices[..., i]] for all i = 1, ..., nrows are sorted and distinct along the last dimension values` is not satisfied.

RuntimeError: expected col_indices to be a strided and contiguous tensor

RuntimeError: expected row_indices to be a strided and contiguous tensor

RuntimeError: expected values to be a strided and contiguous tensor

RuntimeError: for_each: failed to synchronize: cudaErrorAssert: device-side assert triggered

RuntimeError: tensor dimensionality must be sum of batch, base, and dense dimensionalities (=0 + 2 + 0) but got 3
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90849
Approved by: https://github.com/amjames, https://github.com/cpuhrsch
2023-01-11 01:05:14 +00:00
mingfeima
645fb217c0 optimize sampled_addmm performance on CPU (SparseCSR) (#90978)
### Target and Background
This PR improves the performance of `sampled_addmm` on the CPU device. It is part of the effort to improve PyG performance on CPU for GNN training/inference.

The current implementation is a reference design which converts the `SparseCSR` tensor back to a dense tensor, does the addmm, and then converts back to `SparseCSR` again: this is very slow and cannot run most of the datasets under https://github.com/snap-stanford/ogb (converting to dense would trigger `OOM`).

### Benchmarks

Right now we don't have any hands-on benchmark or workload to test this since this operator is not used in PyG yet. I fetched the dataset from `ogb-products` where:

* number of nodes: 2.4 * 10^6
* number of edges: 1.26 * 10^8
* number of features: 128

So if we store the **adjacency matrix** as dense, it is going to be 2.4 * 2.4 * 4 * 10^12 bytes; this would OOM with the current code. I extract the first 1k rows to compare, with a **1100x** speedup:

CPU: Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz, dual socket, 20 cores per socket.
```
### before: run 1000 rows from the whole dataset
sampled_addmm: running dataset ogb-products first 1000 rows: each iter takes 1212.000 ms!

### after: run 1000 rows from the whole dataset
sampled_addmm: running dataset ogb-products first 1000 rows: each iter takes 1.102 ms!

### after: run the whole dataset
sampled_addmm: running dataset ogb-products (the whole dataset) 2449029 rows: each iter takes 873.306 ms!
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90978
Approved by: https://github.com/pearu, https://github.com/cpuhrsch
2023-01-10 22:13:35 +00:00
Pearu Peterson
cdc30048e5 Fix numel() result after resizing a sparse compressed tensor. (#91831)
Fixes #91830

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91831
Approved by: https://github.com/cpuhrsch
2023-01-10 18:21:07 +00:00
Pearu Peterson
b797a24259 Support indices contiguity per batch and non-contiguous values in sparse compressed tensors (#91243)
Fixes https://github.com/pytorch/pytorch/issues/91062

With this PR, all reported failures in https://github.com/pytorch/pytorch/pull/90849 are resolved (modulo test_bmm that uses an unorthodox way to construct a batch CSR tensor).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91243
Approved by: https://github.com/nikitaved, https://github.com/amjames, https://github.com/lezcano
2023-01-02 18:08:46 +00:00
Kurt Mohler
08a47549af Rename Tensor._storage to Tensor.untyped_storage and update docs (#91414)
Fixes #89224

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91414
Approved by: https://github.com/ezyang
2022-12-28 19:21:34 +00:00
Nikita Vedeneev
4c5928e387 Fix for mul(compressed, wrapped scalar) (#91239)
Fixes https://github.com/pytorch/pytorch/issues/90819.

The path with `Scalar` should have been picked up by the dispatcher, but still the path with a 0-dim wrapped scalar was broken.
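
A sketch of the fixed path (assumption about the original repro): multiplying a compressed sparse tensor by a 0-dim, i.e. wrapped-scalar, tensor.

```python
import torch

a = torch.eye(3).to_sparse_csr()
b = a * torch.tensor(2.0)   # 0-dim dense operand
print(b.values())           # tensor([2., 2., 2.])
```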

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91239
Approved by: https://github.com/pearu, https://github.com/cpuhrsch
2022-12-22 13:11:13 +00:00
Pearu Peterson
01e7f46215 Ensure sorted indices from the CSR->BSR conversion (#90918)
Fixes https://github.com/pytorch/pytorch/issues/90910

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90918
Approved by: https://github.com/cpuhrsch
2022-12-16 15:49:48 +00:00
Nikita Vedeneev
c2c14f9597 Sparse compressed mm: fix for orthogonal inputs (#90917)
Fixes https://github.com/pytorch/pytorch/issues/90836
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90917
Approved by: https://github.com/cpuhrsch
2022-12-16 13:08:00 +00:00
Nikita Vedeneev
4dd3de23dd Sparse compressed mm: fix for empty inputs (#90763)
Fixes [#90693](https://github.com/pytorch/pytorch/issues/90693)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90763
Approved by: https://github.com/cpuhrsch
2022-12-16 12:33:57 +00:00
Pearu Peterson
76c6dfeaa6 Add layout and blocksize arguments to Tensor.to_sparse method (#89502)
This PR extends the `Tensor.to_sparse()` method to `Tensor.to_sparse(layout=None, blocksize=None)` in a BC manner (`layout=None` means `layout=torch.sparse_coo`).

In addition, the PR adds support for the following conversions:
- non-hybrid/hybrid COO tensor to CSR or CSC or a COO tensor
- short, bool, byte, char, bfloat16, int, long, half CSR tensor to a BSR tensor

and fixes the following conversions:
- hybrid COO to COO tensor
- non-batch/batch hybrid BSR to BSR or BSC tensor
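
A small usage sketch of the extended signature:

```python
import torch

d = torch.tensor([[1., 0., 0., 2.],
                  [0., 0., 0., 0.],
                  [3., 0., 4., 0.],
                  [0., 5., 0., 0.]])
coo = d.to_sparse()                                          # default: COO, unchanged
csr = d.to_sparse(layout=torch.sparse_csr)
bsr = d.to_sparse(layout=torch.sparse_bsr, blocksize=(2, 2))
print(coo.layout, csr.layout, bsr.layout)
```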

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89502
Approved by: https://github.com/amjames, https://github.com/cpuhrsch
2022-11-30 20:21:10 +00:00
Pearu Peterson
296e1ba4d0 Row and column select support for block compressed sparse tensors (#88733)
As in the title:

- Support `select` and `select_copy` on block sparse compressed tensors
- Fixes incorrect results when selecting dense dimensions

The PR also improves the performance of indexing sparse compressed tensors considerably:

<details>

Before:

```python
In [3]: a=torch.rand((1000, 1000)).to_sparse_csr()

In [4]: %timeit a.select(0, 0)
606 µs ± 4.27 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [5]: %timeit a.select(1, 0)
527 µs ± 57.7 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [6]: %timeit a[0, 0]
617 µs ± 3.74 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [7]: a = a.cuda()

In [8]: %timeit a.select(0, 0); torch.cuda.synchronize();
1.19 ms ± 137 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [9]: %timeit a.select(1, 0); torch.cuda.synchronize();
1.2 ms ± 119 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [10]: %timeit a[0, 0]; torch.cuda.synchronize();
1.23 ms ± 482 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```

This PR:

```python
In [3]: a=torch.rand((1000, 1000)).to_sparse_csr()

In [4]: %timeit a.select(0, 0)
4.75 µs ± 8.94 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [5]: %timeit a.select(1, 0)
565 µs ± 156 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [6]: %timeit a[0, 0]
13.1 µs ± 435 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [7]: a = a.cuda()

In [8]: %timeit a.select(0, 0); torch.cuda.synchronize();
21.6 µs ± 23.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [9]: %timeit a.select(1, 0); torch.cuda.synchronize();
1.15 ms ± 3.13 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [10]: %timeit a[0, 0]; torch.cuda.synchronize();
63.7 µs ± 2.5 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88733
Approved by: https://github.com/nikitaved, https://github.com/amjames, https://github.com/cpuhrsch
2022-11-30 11:15:56 +00:00
Pearu Peterson
90bed8874f Generator of tensor inputs with variable layout and structure (batch/non-batch, hybrid/non-hybrid, block/non-block) (#88914)
This PR introduces `TestCase.generate_simple_inputs` method that is an improved and generalized version of the `TestSparseCompressed._generate_small_inputs` method.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88914
Approved by: https://github.com/cpuhrsch
2022-11-30 02:13:33 +00:00
Pearu Peterson
50e2e4faf3 Sparse CSC/BSR/BSC serialization and pickle support (#89553)
Fixes https://github.com/pytorch/pytorch/issues/89497
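
A usage sketch of the support added here: round-trip a CSC tensor through `torch.save`/`torch.load`.

```python
import io
import torch

a = torch.eye(3).to_sparse_csc()
buf = io.BytesIO()
torch.save(a, buf)
buf.seek(0)
b = torch.load(buf)
print(b.layout, torch.equal(a.to_dense(), b.to_dense()))  # torch.sparse_csc True
```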

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89553
Approved by: https://github.com/cpuhrsch
2022-11-23 20:56:48 +00:00
Andrew M. James
a41f70603a Round out rad2deg sparse support (#88442)
- Add sparse coo dispatch
- Modify backward to work with sparse compressed layouts
- Enable sparse_compressed autograd testing
- Correct layout support attributes on OpInfo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88442
Approved by: https://github.com/cpuhrsch
2022-11-17 06:00:23 +00:00
Nikita Vedeneev
8dc3353b0b add to(dtype) support for all sparse compressed formats (#89055)
Fixes [#88419](https://github.com/pytorch/pytorch/issues/88419)
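
A usage sketch of the conversion enabled here: a dtype cast on a compressed sparse tensor keeps the layout and converts only the values.

```python
import torch

a = torch.eye(3).to_sparse_csr()
b = a.to(torch.float16)
print(b.layout, b.values().dtype)  # torch.sparse_csr torch.float16
```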

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89055
Approved by: https://github.com/cpuhrsch
2022-11-15 21:16:18 +00:00
Kazuaki Ishizaki
03296844aa Fix typos in messages under aten (#88964)
This PR fixes typos in messages and parameters in C++ source files under the `aten` directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88964
Approved by: https://github.com/lezcano
2022-11-14 09:50:50 +00:00
Andrew M. James
ff6770a9a1 enable backward for log1p (sparse layouts) (#88155)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88155
Approved by: https://github.com/cpuhrsch
2022-11-04 20:59:26 +00:00
Andrew M. James
6938dd0b2c Support sparse inputs to deg2rad (#88156)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88156
Approved by: https://github.com/cpuhrsch
2022-11-04 20:59:26 +00:00
Andrew M. James
1964d8c34f Enable sparse_csr autograd testing for relu (#88154)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88154
Approved by: https://github.com/cpuhrsch
2022-11-04 20:59:23 +00:00
Andrew M. James
f03302ba49 Add sparse layout support for torch.frac (#88153)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88153
Approved by: https://github.com/cpuhrsch
2022-11-04 20:59:22 +00:00
Andrew M. James
b2dfd20260 Remove BSC conversion skip from TestSparseCompressed.test_consistency (#88152)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88152
Approved by: https://github.com/cpuhrsch
2022-11-01 22:18:56 +00:00
Andrew M. James
d044b4cc58 Update torch.abs and torch.positive opinfos to reflect sparse support (#88151)
cc @nikitaved @pearu @cpuhrsch @bhosmer
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88151
Approved by: https://github.com/cpuhrsch
2022-11-01 22:18:56 +00:00
Ivan Yashchuk
51ea441862 Upcast to fp32 in test_addmm_block ref_half_bfloat16 (#86682)
Fixes https://github.com/pytorch/pytorch/issues/86681
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86682
Approved by: https://github.com/nikitaved
2022-10-11 16:39:57 +00:00
nikitaved
e15a48def7 (bsr/csr) x dense mm (#85551)
As per title. This implementation is not the most optimal and could be improved, albeit with native kernels (i.e. block matching need not be materialized).

Compared to existing kernels it offers:

- Half float support (In fact, any dtype that supports `matmul` will work).
- Arbitrary block sizes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85551
Approved by: https://github.com/amjames, https://github.com/cpuhrsch
2022-09-29 17:12:04 +00:00
Andrew M. James
8a926b3187 Enable CSC @ CSC addmm (#85379)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85379
Approved by: https://github.com/pearu, https://github.com/cpuhrsch
2022-09-27 19:49:31 +00:00
Andrew M. James
bb5001ce3d Enable dense x bsc mm/addmm (#85308)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85308
Approved by: https://github.com/pearu
2022-09-27 19:49:31 +00:00
Andrew M. James
aaef5d8f2c sparse mm/addmm enable dense x csc, csc x dense and simplify layout check logic. (#85307)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85307
Approved by: https://github.com/pearu, https://github.com/cpuhrsch
2022-09-27 16:46:28 +00:00
Andrew M. James
f64857189d resize_as_sparse support all compressed layouts (#85378)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85378
Approved by: https://github.com/pearu, https://github.com/cpuhrsch
2022-09-27 06:59:18 +00:00
George Qi
686555b663 [maskedtensor] port torch/_masked into torch/masked (#85515)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85515
Approved by: https://github.com/cpuhrsch
2022-09-26 23:41:13 +00:00