Commit Graph

231 Commits

Author SHA1 Message Date
Oguz Ulgen
1df14f1bf8 Move has_triton to top level triton utils so that dynamo can also access (#109832)
it without creating cyclic dependencies

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109832
Approved by: https://github.com/zou3519
2023-09-22 19:33:41 +00:00
Shunting Zhang
e68b3ad14f update triton pin with needed inductor change (#107722)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107722
Approved by: https://github.com/jansel, https://github.com/cpuhrsch
2023-08-29 04:31:44 +00:00
Pearu Peterson
d7c0c5de2d Set crow_indices outputs as non-differentiable. (#107447)
Fixes https://github.com/pytorch/pytorch/issues/107083

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107447
Approved by: https://github.com/cpuhrsch
2023-08-21 19:52:32 +00:00
rraminen
239578beff [ROCm] Enable a few bfloat16 unit tests (#105177)
Currently a few unit tests from **test_matmul_cuda** and **test_sparse_csr** test suites are being skipped on ROCm.

This PR is to enable the following unit tests on ROCm (~30 UTs):

test_cublas_baddbmm_large_input_* (__main__.TestMatmulCudaCUDA)
test_addmm_sizes_all_sparse_csr* (__main__.TestSparseCSRCUDA) when m==0 or n==0 or k==0

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105177
Approved by: https://github.com/pruthvistony, https://github.com/jithunnair-amd, https://github.com/malfet
2023-08-03 21:17:19 +00:00
yanbing-j
a54043516f Add SparseCsrCPU and SparseCsrCUDA dispatch to sum.dim_IntList (#99292)
This PR is to add support of sum.dim_IntList for Sparse Tensor, which is exposed in https://github.com/pytorch/pytorch/issues/98796.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99292
Approved by: https://github.com/mingfeima, https://github.com/rusty1s, https://github.com/cpuhrsch
2023-07-24 17:30:58 +00:00
Justin Chu
73e1455327 [BE] Enable ruff's UP rules and autoformat test/ (#105434)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105434
Approved by: https://github.com/albanD
2023-07-19 20:36:06 +00:00
nikitaved
44c8515d0d SDPA: frontend for BSR masks (#104042)
This PR implements a (yet private) frontend for scaled_dot_product_attention that works with BSR `attn_mask`.

This function is directly comparable (with suitable masks) with `torch.nn.functional.scaled_dot_product_attention` once `attn_mask.dtype == torch.bool`, but it's behavior is different when `attn_mask.dtype != torch.bool`. This is because `torch.nn.functional.scaled_dot_product_attention` assumes that irrelevant values are supposed to be filled with `-inf`, while the selected ones should be `0`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104042
Approved by: https://github.com/amjames, https://github.com/cpuhrsch
2023-07-13 18:01:21 +00:00
yanbing-j
053654b9cf Optimize scatter_add/scatter_reduce in BFloat16/Half data type in CPU backend (#103427)
### Description

This PR is to optimize scatter_add/scatter_reduce of BFloat16/Half data type in CPU backend, which is one task in https://github.com/pyg-team/pytorch_geometric/issues/7057. Main point is creating a buffer among threads to accumulate intermediate data as fp32 data type.

Next step:

 - [x] Add benchmarks
 - [x] Extend to Half
 - [x] Simplify code

### Performance test (Updated)

Test BFloat16 in Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz
With jemalloc and iomp

Single socket (40C)
![image](https://github.com/pytorch/pytorch/assets/61222868/4b4342f1-8cc3-46f7-81f5-651becd9b1e3)

Single core
![image](https://github.com/pytorch/pytorch/assets/61222868/09e5f700-2c2e-4208-979e-74b85474dea6)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103427
Approved by: https://github.com/mingfeima, https://github.com/albanD
2023-07-13 09:34:29 +00:00
PyTorch MergeBot
f8aedf1efe Revert "Optimize scatter_add/scatter_reduce in BFloat16/Half data type in CPU backend (#103427)"
This reverts commit da7675621e.

Reverted https://github.com/pytorch/pytorch/pull/103427 on behalf of https://github.com/clee2000 due to sorry but it looks like this pr broke test_scatter_gather_ops.py::TestScatterGatherCPU::test_scatter_expanded_index_cpu_bfloat16 on periodic parallelnative testing da7675621e https://github.com/pytorch/pytorch/actions/runs/5477783108/jobs/9977608393 ([comment](https://github.com/pytorch/pytorch/pull/103427#issuecomment-1624008753))
2023-07-06 17:02:03 +00:00
yanbing-j
da7675621e Optimize scatter_add/scatter_reduce in BFloat16/Half data type in CPU backend (#103427)
### Description

This PR is to optimize scatter_add/scatter_reduce of BFloat16/Half data type in CPU backend, which is one task in https://github.com/pyg-team/pytorch_geometric/issues/7057. Main point is creating a buffer among threads to accumulate intermediate data as fp32 data type.

Next step:

 - [x] Add benchmarks
 - [x] Extend to Half
 - [x] Simplify code

### Performance test (Updated)

Test BFloat16 in Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz
With jemalloc and iomp

Single socket (40C)
![image](https://github.com/pytorch/pytorch/assets/61222868/4b4342f1-8cc3-46f7-81f5-651becd9b1e3)

Single core
![image](https://github.com/pytorch/pytorch/assets/61222868/09e5f700-2c2e-4208-979e-74b85474dea6)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103427
Approved by: https://github.com/mingfeima, https://github.com/albanD
2023-07-06 01:23:56 +00:00
Andrew M. James
5364366f8c Sparse Compressed mm avoid creating temp sparse (#104062)
When mm forwards to addmm it creates a zeroed out self this tensor
should take options from the result not one of the sparse arguments.

The bug was leading to an error when calling linear with an `out` kwarg.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104062
Approved by: https://github.com/nikitaved, https://github.com/pearu
2023-06-26 16:45:04 +00:00
Aleksandar Samardžić
09fdea8564 Fix autograd issue with identity conversions (#92022)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92022
Approved by: https://github.com/pearu, https://github.com/mtaaooby, https://github.com/amjames, https://github.com/cpuhrsch
2023-06-21 21:23:03 +00:00
Nikita Vedeneev
39a22e2791 softmax: Triton kernel for BSR inputs (#102095)
Implements `softmax` Triton kernel for BSR inputs. So far, only over `dim=-1`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102095
Approved by: https://github.com/cpuhrsch
2023-06-21 01:23:27 +00:00
Pearu Peterson
cbe270d233 Fix zeros_like for sparse tensors with batch dimensions. Add opinfo-based tests to like-functions. (#101215)
Fixes #101078

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101215
Approved by: https://github.com/cpuhrsch
2023-06-13 16:02:10 +00:00
Xiao Wang
6340aa5d58 Skip test test_triton_bsr_dense_bmm if not TEST_WITH_TORCHINDUCTOR [v2] (#102660)
Test was originally skipped in https://github.com/pytorch/pytorch/pull/98462

Not sure why it was removed in https://github.com/pytorch/pytorch/pull/94825

Now the test hits CUDA illegal memory access on H100 again after https://github.com/pytorch/pytorch/pull/101163

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102660
Approved by: https://github.com/zou3519
2023-06-01 20:36:45 +00:00
Pearu Peterson
9f97b7c43b Add integer overflow checks for large compressed tensor dimensions and nnz (#102530)
With the previous PR allowing large compressed tensors (dimensions larger than `2 ** 31 - 1`), sparse compressed tensor invariants checks may give false-positive results:
```python
>>> nnz=2**31
>>> torch.sparse.check_sparse_tensor_invariants.enable()
>>> torch.sparse_csr_tensor(torch.arange(nnz+1, dtype=torch.int32), torch.zeros(nnz, dtype=torch.int32), torch.ones(nnz), (nnz, 1))
tensor(crow_indices=tensor([          0,           1,           2,  ...,
                             2147483646,  2147483647, -2147483648]),
       col_indices=tensor([0, 0, 0,  ..., 0, 0, 0]),
       values=tensor([1., 1., 1.,  ..., 1., 1., 1.]), size=(2147483648, 1),
       nnz=2147483648, layout=torch.sparse_csr)
```
(notice that the last entry in `crow_indices` is invalid) or raise a bogus exception as in
```python
>>> torch.sparse_csr_tensor(torch.arange(nnz+1, dtype=torch.int32), torch.arange(nnz, dtype=torch.int32), torch.ones(nnz), (nnz, 1))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: `0 <= col_indices < ncols` is not satisfied.
```
(notice that `col_indices` is actually valid).

This PR fixes the above-reported bugs by introducing integer overflow checks for sparse compressed tensors dimensions as well as nnz.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102530
Approved by: https://github.com/nikitaved
2023-05-31 15:34:08 +00:00
Nikita Vedeneev
d80d3b18d0 nn.Linear with BSR inputs: spare the user from explicit Triton kernel registrations (#98403)
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 08f7a6a</samp>

This pull request adds support for triton kernels in `torch` and `torch/cuda`, and refactors and tests the existing triton kernel for BSR matrix multiplication. It also adds a test case to ensure that importing `torch` does not implicitly import `triton`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98403
Approved by: https://github.com/malfet, https://github.com/cpuhrsch
2023-05-31 13:09:45 +00:00
Pearu Peterson
fcbdbd6682 Fix silent nnz overflow for large sparse compressed tensors. (#102523)
Fixes https://github.com/pytorch/pytorch/issues/102520

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102523
Approved by: https://github.com/nikitaved, https://github.com/cpuhrsch
2023-05-30 16:58:01 +00:00
Nikita Vedeneev
6c7410ddc3 sampled_addmm: BSR support (#101163)
This PR implements a `sampled_addmm` kernel that works with a BSR mask.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101163
Approved by: https://github.com/cpuhrsch
2023-05-25 12:33:50 +00:00
Nikita Vedeneev
346e1f512f sparse compressed validation: allow empty-batched inputs (#101180)
Fixes https://github.com/pytorch/pytorch/issues/101179.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101180
Approved by: https://github.com/pearu, https://github.com/cpuhrsch
2023-05-11 20:30:20 +00:00
Nikita Vedeneev
dd2c22f4bb bsr_dense_bmm(): enable more precise float32 support with float64 accumulators (#100882)
Float64 is there in Triton! This PR increases precision for float32 inputs with float64 accumulation dtype.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100882
Approved by: https://github.com/cpuhrsch
2023-05-11 11:22:55 +00:00
Pearu Peterson
92a7640b76 Add mul tests with sparse sample inputs (#100393)
This PR implements sparse sample inputs and error inputs for mul OpInfo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100393
Approved by: https://github.com/amjames, https://github.com/cpuhrsch
2023-05-09 16:13:14 +00:00
Nikita Vedeneev
0141a242fd bsr_dense_bmm(): remove sparse_rowspace kernel and some dead code (#100876)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100876
Approved by: https://github.com/cpuhrsch, https://github.com/Skylion007
2023-05-09 16:12:11 +00:00
Nikita Vedeneev
c4bc259f00 bsr_dense_mm(): better test coverage (#100543)
This PR improves test coverage for `bsr_dense_mm` by:
- ~~enabling correctness tests for `float32`~~.
- extending and testing input correctness checks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100543
Approved by: https://github.com/cpuhrsch, https://github.com/malfet
2023-05-09 09:26:02 +00:00
Pearu Peterson
3ae0e23b90 Fix sum OpInfo for sparse sample inputs and assert coverage for sparse-enabled operators (#100391)
This PR enables sum tests for sparse sample inputs. Previously, the tests existed but were never run because the sum OpInfo instance was created without specifying `supports_sparse_*=True`. To avoid such mistakes in the future, the following PR https://github.com/pytorch/pytorch/pull/100392 enables the `supports_sparse_*` flags automatically when OpInfo creation specifies `sample_inputs_sparse_*_func`.

In addition, the PR applies several fixes to sum tests for sparse sample inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100391
Approved by: https://github.com/cpuhrsch
2023-05-03 02:04:39 +00:00
Nikita Vedeneev
1adb6fa922 nn.Linear: dispatch to bsr_dense_mm for half and bfloat16 (#94825)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94825
Approved by: https://github.com/albanD, https://github.com/cpuhrsch
2023-04-15 13:38:42 +00:00
Xiao Wang
bd83b205cc Skip test test_triton_bsr_dense_bmm if not TEST_WITH_TORCHINDUCTOR (#98462)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98462
Approved by: https://github.com/zou3519
2023-04-10 21:21:06 +00:00
eqy
2fddcf0fc0 [CUDA][CUDA 11] Remove more CUDA 11 version checks (#92934)
Working on removing stragglers missed in previous CUDA version < 11.0 cleanup PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92934
Approved by: https://github.com/ngimel
2023-03-30 19:49:52 +00:00
Aaron Gokaslan
47dca20d80 [BE] Enable flake8-comprehension rule C417 (#97880)
Enables flake8-comprehension rule C417. Ruff autogenerated these fixes to the codebase.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97880
Approved by: https://github.com/ezyang, https://github.com/kit1980, https://github.com/albanD
2023-03-30 14:34:24 +00:00
Sergii Dymchenko
5ab50cf048 Fix shoud/shoudl typos (#97930)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97930
Approved by: https://github.com/clee2000
2023-03-30 08:27:16 +00:00
Nikita Shulga
2c16b73a1b Remove comma from parametrized test name (#97844)
Using `name_fn` argument of `@paramterize` decorator.

As internal test runner can't figure out how to parse those, otherwise this is a no-op.

For those with intern access, see [T149211516](https://www.internalfb.com/intern/tasks/?t=149211516)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97844
Approved by: https://github.com/weiwangmeta
2023-03-29 14:20:13 +00:00
Nikita Shulga
b443198966 Fix sparse addmv ref impl for non-contig tensors (#97730)
Fix logic in `test_block_addmm` that tested op against itself rather than against dense implementation, by implementing `ref_addvm` function that converts tensor back to dense before multiplying it with vector.

Fix reference implementation by passing stride for vector and result. (Not sure wether it will be more perf efficient to iterate over strided tensor or request a dense copy as MKL implementation does)

Print more verbose error message if values differ.

Fixes https://github.com/pytorch/pytorch/issues/97629 , https://github.com/pytorch/pytorch/issues/97589 ,  https://github.com/pytorch/pytorch/issues/97563

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97730
Approved by: https://github.com/cpuhrsch
2023-03-28 20:46:32 +00:00
Nikita Shulga
ad5d81adda [Sparse] Add reference implementation for addmv (#97353)
Partially addresses the problem raised in https://github.com/pytorch/pytorch/issues/96972

Add `test_addmv` and enable `test_block_addmv` on all platforms (so the test could be run on M1)

TODO: Make sure that test_block_addmv non-contiguous mode actually
generate non-contiguous as rigth now it probably does not, as test
passes assuming values are contiguous.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97353
Approved by: https://github.com/cpuhrsch
2023-03-24 06:14:32 +00:00
haozhe.zhu
fe0afc5852 use accumulate type in BF16 gemm(include dot, mv) ref path (#96074)
Fix https://github.com/pytorch/pytorch/issues/95125 and https://github.com/pytorch/pytorch/issues/83863 for bf16 accumulation in gemm ref path

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96074
Approved by: https://github.com/lezcano, https://github.com/peterbell10
2023-03-23 01:22:59 +00:00
Nikita Vedeneev
55cf7eef86 add/add_ for sparse compressed formats: fix silent index downcast int64 -> int32 (#95294)
Fixes https://github.com/pytorch/pytorch/issues/95224.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95294
Approved by: https://github.com/cpuhrsch, https://github.com/amjames
2023-03-10 17:51:40 +00:00
Nikita Vedeneev
98a4d74a68 COO intersection primitives: performance improvement (#96094)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96094
Approved by: https://github.com/pearu
2023-03-07 13:21:29 +00:00
Nikita Vedeneev
d809020fc8 Triton kernel for bsr @ dense (#94823)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94823
Approved by: https://github.com/cpuhrsch, https://github.com/malfet
2023-03-03 15:11:28 +00:00
PyTorch MergeBot
d7637801d3 Revert "COO intersection primitives: performance improvement (#92976)"
This reverts commit b033594943.

Reverted https://github.com/pytorch/pytorch/pull/92976 on behalf of https://github.com/seemethere due to Need to revert this so I can revert https://github.com/pytorch/pytorch/pull/94048 cleanly
2023-03-03 01:38:56 +00:00
Nikita Vedeneev
b033594943 COO intersection primitives: performance improvement (#92976)
This PR improves COO intersection primitives by:
* making it sync-less (dims <= 8, can be changed to any value that fits stack).
* improving performance with much less kernel calls.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92976
Approved by: https://github.com/cpuhrsch, https://github.com/pearu
2023-03-02 17:42:39 +00:00
Nikita Vedeneev
325b43661e add/add_ for compressed sparse inputs: bypass BLAS in some trivial cases (#95293)
In `add(self, other, out=...)` we can bypass calls to BLAS in cases when `self == other == out` and `self == other`.

This PR fixes the repro from https://github.com/pytorch/pytorch/issues/94966, but the issue is still present when `x.add_(x)` is replaced, say, with `x = x.clone().add_(x)`.
Could that be a synchronization issue? CC @IvanYashchuk .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95293
Approved by: https://github.com/cpuhrsch
2023-02-27 16:06:02 +00:00
mingfeima
c620ece726 port sparse_mm.reduce to pytorch and optimize it on CPU (#83727)
### Motivation of this PR

This patch is to migrate `spmm_reduce` from `torch-sparse` (a 3rd party dependency for PyG) to `torch`, which is a response to the initial proposal for fusion of **Gather, Apply Scatter** in Message Passing of GNN inference/training. https://github.com/pytorch/pytorch/issues/71300

**GAS** is the major step for Message Passing, the behavior of **GAS** can be classified into 2 kinds depending on the storage type of `EdgeIndex` which records the connections of nodes:

* COO: the hotspot is `scatter_reduce`
* CSR: the hotspot is `spmm_reduce`

The reduce type can be choose from: "max", "mean", "max",  "min".

extend `torch.sparse.mm` with an `reduce` argument, maps to `torch.sparse_mm.reduce` internally.
`sparse_mm_reduce` is registered under the TensorTypeId of `SparseCsrCPU`, and this operator requires an internal interface `_sparse_mm_reduce_impl` which has dual outputs:
* `out` - the actual output
* `arg_out` - records output indices in the non zero elements if the reduce type is "max" or "min", this is only useful for training. So for inference, it will not be calculated.

### Performance

Benchmark on GCN for obgn-products on Xeon single socket, the workload is improved by `4.3x` with this patch.

Performance benefit for training will be bigger, the original backward impl for `sum|mean` is sequential; the original backward impl for `max|min` is not fused.

#### before:
```
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                         Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------
       torch_sparse::spmm_sum        97.09%       56.086s        97.09%       56.088s        6.232s             9
                 aten::linear         0.00%      85.000us         1.38%     795.485ms      88.387ms             9
                 aten::matmul         0.00%      57.000us         1.38%     795.260ms      88.362ms             9
                     aten::mm         1.38%     795.201ms         1.38%     795.203ms      88.356ms             9
                   aten::relu         0.00%      50.000us         0.76%     440.434ms      73.406ms             6
              aten::clamp_min         0.76%     440.384ms         0.76%     440.384ms      73.397ms             6
                   aten::add_         0.57%     327.801ms         0.57%     327.801ms      36.422ms             9
            aten::log_softmax         0.00%      23.000us         0.10%      55.503ms      18.501ms             3
```

#### after
```
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                         Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------
               aten::spmm_sum        87.35%       11.826s        87.36%       11.827s        1.314s             9
                 aten::linear         0.00%      92.000us         5.87%     794.451ms      88.272ms             9
                 aten::matmul         0.00%      62.000us         5.87%     794.208ms      88.245ms             9
                     aten::mm         5.87%     794.143ms         5.87%     794.146ms      88.238ms             9
                   aten::relu         0.00%      53.000us         3.35%     452.977ms      75.496ms             6
              aten::clamp_min         3.35%     452.924ms         3.35%     452.924ms      75.487ms             6
                   aten::add_         2.58%     348.663ms         2.58%     348.663ms      38.740ms             9
                 aten::argmax         0.42%      57.473ms         0.42%      57.475ms      14.369ms             4
            aten::log_softmax         0.00%      22.000us         0.39%      52.605ms      17.535ms             3
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83727
Approved by: https://github.com/jgong5, https://github.com/cpuhrsch, https://github.com/rusty1s, https://github.com/pearu
2023-02-10 15:56:40 +00:00
Aleksandar Samardžić
e1f17b3530 Add CSR->BSC and CSC->BSR conversions (#93301)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93301
Approved by: https://github.com/cpuhrsch
2023-02-07 19:22:05 +00:00
Nikita Vedeneev
bb6af061a0 torch.triangular_solve for CSR: materialize diagonal elements when unitriangular=True. (#93352)
Fixes https://github.com/pytorch/pytorch/issues/88890

A temporary fix until MKL is fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93352
Approved by: https://github.com/cpuhrsch
2023-01-31 16:33:57 +00:00
Aleksandar Samardžić
53f7fb9a22 Add CSC->BSC conversion (#92307)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92307
Approved by: https://github.com/cpuhrsch
2023-01-30 17:03:36 +00:00
Pearu Peterson
65d6802e2f Improve error messages for sparse methods on tensors with unsupported backends/layouts. (#93149)
Fixes https://github.com/pytorch/pytorch/issues/92790

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93149
Approved by: https://github.com/cpuhrsch
2023-01-27 19:50:23 +00:00
PyTorch MergeBot
7012d985fa Revert "Improve bsr @ strided performance in baddmm for bfloat16/half with Triton kernels. (#88078)"
This reverts commit 46f16b9363.

Reverted https://github.com/pytorch/pytorch/pull/88078 on behalf of https://github.com/ZainRizvi due to Causing a test to fail consistently: test_decomp.py::HasDecompTest::test_has_decomposition
2023-01-26 16:22:29 +00:00
Nikita Vedeneev
46f16b9363 Improve bsr @ strided performance in baddmm for bfloat16/half with Triton kernels. (#88078)
As per title.

Additionally we also introduce support for:
- Rectangular block sizes which are powers of 2 and at least 16 (triton's `dot` limitation).
- Batch support with broadcasting for either of the arguments.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88078
Approved by: https://github.com/cpuhrsch
2023-01-26 07:58:27 +00:00
Eddie Yan
0bf7506051 [CUDA] Drop CUDA < 11.0 test flags (#92605)
Follow-up of #89582 to drop flags like `CUDA11OrLater` in tests. Note that in some places it appears that `TEST_WITH_ROCM` is _implicitly_ guarded against via the `CUDA11OrLater` version check, based on my best-guess of how `torch.version.cuda` would behave in ROCM builds, so I've added `not TEST_WITH_ROCM` in cases where ROCM wasn't previously explicitly allowed.

CC @ptrblck @malfet @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92605
Approved by: https://github.com/ngimel
2023-01-24 04:34:06 +00:00
Yanbo Liang
0ab4ab9f8d [Dynamo] Fix calling UserDefinedObject.func should pass self object (#92050)
Fixes #90834

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92050
Approved by: https://github.com/jansel
2023-01-21 05:47:01 +00:00
PyTorch MergeBot
60bf851931 Revert "Improve bsr @ strided performance in baddmm for bfloat16/half with Triton kernels. (#88078)"
This reverts commit 8383b5c488.

Reverted https://github.com/pytorch/pytorch/pull/88078 on behalf of https://github.com/malfet due to This seems to have broke sm_86 testing, see https://hud.pytorch.org/hud/pytorch/pytorch/master/1?per_page=50&name_filter=sm86%20%2F%20test%20(default%2C%203
2023-01-19 23:37:59 +00:00