Oguz Ulgen
1df14f1bf8
Move has_triton to top level triton utils so that dynamo can also access it without creating cyclic dependencies (#109832)
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109832
Approved by: https://github.com/zou3519
2023-09-22 19:33:41 +00:00
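The commit above moves an availability check to a top level so that dynamo can use it without an import cycle. A minimal sketch of what such a dependency-free check might look like (the function name here is illustrative, not necessarily the one in `torch`):

```python
import importlib.util

def has_triton_package() -> bool:
    # Probe the import system without actually importing triton (or
    # torch).  Keeping this helper at top level, with no heavyweight
    # imports of its own, is what lets callers like dynamo use it
    # without closing an import cycle.
    return importlib.util.find_spec("triton") is not None
```

Because `find_spec` only consults the import machinery, the check is cheap and safe to call at module import time.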
Shunting Zhang
e68b3ad14f
update triton pin with needed inductor change ( #107722 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107722
Approved by: https://github.com/jansel , https://github.com/cpuhrsch
2023-08-29 04:31:44 +00:00
Pearu Peterson
d7c0c5de2d
Set crow_indices outputs as non-differentiable. ( #107447 )
...
Fixes https://github.com/pytorch/pytorch/issues/107083
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107447
Approved by: https://github.com/cpuhrsch
2023-08-21 19:52:32 +00:00
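`crow_indices` are the compressed row pointers of a CSR tensor: integer bookkeeping rather than values, which is why gradients should not flow through them. A toy sketch of how they relate to plain (sorted) COO row indices, in pure Python rather than PyTorch:

```python
def build_crow_indices(row_indices, nrows):
    """Compress sorted COO row indices into CSR crow_indices.

    crow_indices[i] is the offset of the first stored element of row i,
    and crow_indices[nrows] equals nnz.  These are positions into the
    values array, not values themselves, so autograd treats them as
    non-differentiable.
    """
    counts = [0] * nrows
    for r in row_indices:
        counts[r] += 1
    crow = [0]
    for c in counts:
        crow.append(crow[-1] + c)
    return crow
```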
rraminen
239578beff
[ROCm] Enable a few bfloat16 unit tests ( #105177 )
...
Currently a few unit tests from **test_matmul_cuda** and **test_sparse_csr** test suites are being skipped on ROCm.
This PR is to enable the following unit tests on ROCm (~30 UTs):
test_cublas_baddbmm_large_input_* (__main__.TestMatmulCudaCUDA)
test_addmm_sizes_all_sparse_csr* (__main__.TestSparseCSRCUDA) when m==0 or n==0 or k==0
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105177
Approved by: https://github.com/pruthvistony , https://github.com/jithunnair-amd , https://github.com/malfet
2023-08-03 21:17:19 +00:00
yanbing-j
a54043516f
Add SparseCsrCPU and SparseCsrCUDA dispatch to sum.dim_IntList ( #99292 )
...
This PR is to add support of sum.dim_IntList for Sparse Tensor, which is exposed in https://github.com/pytorch/pytorch/issues/98796 .
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99292
Approved by: https://github.com/mingfeima , https://github.com/rusty1s , https://github.com/cpuhrsch
2023-07-24 17:30:58 +00:00
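For a CSR tensor, a sum over the last dimension reduces to summing each row's stored values; implicit zeros contribute nothing. A pure-Python sketch of the idea (illustrative only, not the dispatched kernel):

```python
def csr_sum_dim1(crow_indices, values, nrows):
    # Sum over the column dimension (dim=-1): for each row, add up the
    # stored values between its crow_indices offsets.  Only the nnz
    # stored elements are touched; implicit zeros are skipped entirely.
    return [sum(values[crow_indices[i]:crow_indices[i + 1]])
            for i in range(nrows)]
```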
Justin Chu
73e1455327
[BE] Enable ruff's UP rules and autoformat test/ ( #105434 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105434
Approved by: https://github.com/albanD
2023-07-19 20:36:06 +00:00
nikitaved
44c8515d0d
SDPA: frontend for BSR masks ( #104042 )
...
This PR implements a (yet private) frontend for scaled_dot_product_attention that works with BSR `attn_mask`.
This function is directly comparable (with suitable masks) with `torch.nn.functional.scaled_dot_product_attention` when `attn_mask.dtype == torch.bool`, but its behavior is different when `attn_mask.dtype != torch.bool`. This is because `torch.nn.functional.scaled_dot_product_attention` assumes that irrelevant values are supposed to be filled with `-inf`, while the selected ones should be `0`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104042
Approved by: https://github.com/amjames , https://github.com/cpuhrsch
2023-07-13 18:01:21 +00:00
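The additive-mask convention mentioned above (kept positions contribute `0`, masked positions `-inf`) can be sketched for a single score row in pure Python, independent of any sparse layout:

```python
import math

def masked_softmax(scores, bool_mask):
    # Convert a boolean keep-mask into the additive convention: a kept
    # position adds 0 to its score, a masked position adds -inf, so
    # exp() zeroes it out of the softmax.
    additive = [0.0 if keep else -math.inf for keep in bool_mask]
    shifted = [s + a for s, a in zip(scores, additive)]
    # Subtract the max of the surviving entries for numerical stability.
    m = max(x for x in shifted if x != -math.inf)
    exps = [math.exp(x - m) if x != -math.inf else 0.0 for x in shifted]
    z = sum(exps)
    return [e / z for e in exps]
```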
yanbing-j
053654b9cf
Optimize scatter_add/scatter_reduce in BFloat16/Half data type in CPU backend ( #103427 )
...
### Description
This PR is to optimize scatter_add/scatter_reduce of BFloat16/Half data type in CPU backend, which is one task in https://github.com/pyg-team/pytorch_geometric/issues/7057 . Main point is creating a buffer among threads to accumulate intermediate data as fp32 data type.
Next step:
- [x] Add benchmarks
- [x] Extend to Half
- [x] Simplify code
### Performance test (Updated)
Test BFloat16 in Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz
With jemalloc and iomp
Single socket (40C)

Single core

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103427
Approved by: https://github.com/mingfeima , https://github.com/albanD
2023-07-13 09:34:29 +00:00
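The effect of the fp32 accumulation buffer described above can be demonstrated without PyTorch by emulating bfloat16 via bit truncation (a simplification: truncation rather than round-to-nearest, and a single scalar rather than a per-thread buffer):

```python
import struct

def to_bf16(x: float) -> float:
    # Emulate bfloat16 by keeping only the top 16 bits of the float32
    # representation (truncation; good enough to show the effect).
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def accumulate_bf16_naive(n, delta):
    # Round after every update: once the running sum is large, small
    # deltas fall below half an ulp (bf16 has only ~8 mantissa bits)
    # and are silently dropped.
    acc = to_bf16(256.0)
    for _ in range(n):
        acc = to_bf16(acc + delta)
    return acc

def accumulate_fp32_buffer(n, delta):
    # Accumulate in higher precision and round once at the end -- the
    # strategy this PR applies with an fp32 buffer shared per thread.
    acc = 256.0
    for _ in range(n):
        acc += delta
    return to_bf16(acc)
```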
PyTorch MergeBot
f8aedf1efe
Revert "Optimize scatter_add/scatter_reduce in BFloat16/Half data type in CPU backend ( #103427 )"
...
This reverts commit da7675621e .
Reverted https://github.com/pytorch/pytorch/pull/103427 on behalf of https://github.com/clee2000 due to sorry but it looks like this pr broke test_scatter_gather_ops.py::TestScatterGatherCPU::test_scatter_expanded_index_cpu_bfloat16 on periodic parallelnative testing da7675621e https://github.com/pytorch/pytorch/actions/runs/5477783108/jobs/9977608393 ([comment](https://github.com/pytorch/pytorch/pull/103427#issuecomment-1624008753 ))
2023-07-06 17:02:03 +00:00
yanbing-j
da7675621e
Optimize scatter_add/scatter_reduce in BFloat16/Half data type in CPU backend ( #103427 )
...
### Description
This PR is to optimize scatter_add/scatter_reduce of BFloat16/Half data type in CPU backend, which is one task in https://github.com/pyg-team/pytorch_geometric/issues/7057 . Main point is creating a buffer among threads to accumulate intermediate data as fp32 data type.
Next step:
- [x] Add benchmarks
- [x] Extend to Half
- [x] Simplify code
### Performance test (Updated)
Test BFloat16 in Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz
With jemalloc and iomp
Single socket (40C)

Single core

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103427
Approved by: https://github.com/mingfeima , https://github.com/albanD
2023-07-06 01:23:56 +00:00
Andrew M. James
5364366f8c
Sparse Compressed mm avoid creating temp sparse ( #104062 )
...
When mm forwards to addmm, it creates a zeroed-out self; this tensor should take its options from the result, not from one of the sparse arguments.
The bug was leading to an error when calling linear with an `out` kwarg.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104062
Approved by: https://github.com/nikitaved , https://github.com/pearu
2023-06-26 16:45:04 +00:00
Aleksandar Samardžić
09fdea8564
Fix autograd issue with identity conversions ( #92022 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92022
Approved by: https://github.com/pearu , https://github.com/mtaaooby , https://github.com/amjames , https://github.com/cpuhrsch
2023-06-21 21:23:03 +00:00
Nikita Vedeneev
39a22e2791
softmax: Triton kernel for BSR inputs ( #102095 )
...
Implements `softmax` Triton kernel for BSR inputs. So far, only over `dim=-1`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102095
Approved by: https://github.com/cpuhrsch
2023-06-21 01:23:27 +00:00
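Softmax over `dim=-1` of a sparse input acts row by row on the specified elements, with unspecified entries behaving as `-inf` (so they stay implicit zeros in the output). A pure-Python CSR-flavored sketch of that semantics, ignoring the block structure the Triton kernel handles:

```python
import math

def csr_softmax_dim_last(crow_indices, values):
    # Row-wise softmax over the specified elements only; unspecified
    # entries act as -inf, i.e. they remain (implicit) zeros.
    out = values[:]
    for i in range(len(crow_indices) - 1):
        lo, hi = crow_indices[i], crow_indices[i + 1]
        if lo == hi:
            continue  # empty row: nothing specified, nothing to do
        row = values[lo:hi]
        m = max(row)  # subtract the row max for numerical stability
        exps = [math.exp(v - m) for v in row]
        z = sum(exps)
        out[lo:hi] = [e / z for e in exps]
    return out
```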
Pearu Peterson
cbe270d233
Fix zeros_like for sparse tensors with batch dimensions. Add opinfo-based tests to like-functions. ( #101215 )
...
Fixes #101078
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101215
Approved by: https://github.com/cpuhrsch
2023-06-13 16:02:10 +00:00
Xiao Wang
6340aa5d58
Skip test test_triton_bsr_dense_bmm if not TEST_WITH_TORCHINDUCTOR [v2] ( #102660 )
...
Test was originally skipped in https://github.com/pytorch/pytorch/pull/98462
Not sure why it was removed in https://github.com/pytorch/pytorch/pull/94825
Now the test hits CUDA illegal memory access on H100 again after https://github.com/pytorch/pytorch/pull/101163
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102660
Approved by: https://github.com/zou3519
2023-06-01 20:36:45 +00:00
Pearu Peterson
9f97b7c43b
Add integer overflow checks for large compressed tensor dimensions and nnz ( #102530 )
...
With the previous PR allowing large compressed tensors (dimensions larger than `2 ** 31 - 1`), sparse compressed tensor invariants checks may give false-positive results:
```python
>>> nnz=2**31
>>> torch.sparse.check_sparse_tensor_invariants.enable()
>>> torch.sparse_csr_tensor(torch.arange(nnz+1, dtype=torch.int32), torch.zeros(nnz, dtype=torch.int32), torch.ones(nnz), (nnz, 1))
tensor(crow_indices=tensor([ 0, 1, 2, ...,
2147483646, 2147483647, -2147483648]),
col_indices=tensor([0, 0, 0, ..., 0, 0, 0]),
values=tensor([1., 1., 1., ..., 1., 1., 1.]), size=(2147483648, 1),
nnz=2147483648, layout=torch.sparse_csr)
```
(notice that the last entry in `crow_indices` is invalid) or raise a bogus exception as in
```python
>>> torch.sparse_csr_tensor(torch.arange(nnz+1, dtype=torch.int32), torch.arange(nnz, dtype=torch.int32), torch.ones(nnz), (nnz, 1))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
RuntimeError: `0 <= col_indices < ncols` is not satisfied.
```
(notice that `col_indices` is actually valid).
This PR fixes the above-reported bugs by introducing integer overflow checks for sparse compressed tensors dimensions as well as nnz.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102530
Approved by: https://github.com/nikitaved
2023-05-31 15:34:08 +00:00
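The failure mode above is int32 wraparound: `2**31` stored in an int32 becomes `-2147483648`, so invariants evaluated on wrapped values give nonsense. A small sketch of the wraparound and the kind of range check the PR introduces (names here are illustrative):

```python
def wrap_int32(x: int) -> int:
    # Simulate two's-complement int32 arithmetic.
    return (x + 2**31) % 2**32 - 2**31

def fits_int32(*sizes) -> bool:
    # The spirit of the added check: dimensions and nnz must be
    # representable in the index dtype *before* invariants such as
    # `0 <= col_indices < ncols` are evaluated on wrapped values.
    return all(-2**31 <= s <= 2**31 - 1 for s in sizes)
```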
Nikita Vedeneev
d80d3b18d0
nn.Linear with BSR inputs: spare the user from explicit Triton kernel registrations ( #98403 )
...
### <samp>🤖 Generated by Copilot at 08f7a6a</samp>
This pull request adds support for triton kernels in `torch` and `torch/cuda`, and refactors and tests the existing triton kernel for BSR matrix multiplication. It also adds a test case to ensure that importing `torch` does not implicitly import `triton`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98403
Approved by: https://github.com/malfet , https://github.com/cpuhrsch
2023-05-31 13:09:45 +00:00
Pearu Peterson
fcbdbd6682
Fix silent nnz overflow for large sparse compressed tensors. ( #102523 )
...
Fixes https://github.com/pytorch/pytorch/issues/102520
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102523
Approved by: https://github.com/nikitaved , https://github.com/cpuhrsch
2023-05-30 16:58:01 +00:00
Nikita Vedeneev
6c7410ddc3
sampled_addmm: BSR support ( #101163 )
...
This PR implements a `sampled_addmm` kernel that works with a BSR mask.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101163
Approved by: https://github.com/cpuhrsch
2023-05-25 12:33:50 +00:00
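`sampled_addmm` computes `beta * input + alpha * (mat1 @ mat2)`, but only at the positions present in the sparsity pattern. A dense pure-Python sketch of that semantics using a boolean mask in place of the BSR pattern:

```python
def sampled_addmm(mask, input_vals, a, b, alpha=1.0, beta=1.0):
    # Compute beta*input + alpha*(a @ b), but only where mask is True;
    # every other position stays zero.  This is the dense picture of
    # sampled_addmm: the mask plays the role of the sparsity pattern,
    # so the full product never needs to be materialized.
    m, k = len(a), len(a[0])
    n = len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            if mask[i][j]:
                dot = sum(a[i][p] * b[p][j] for p in range(k))
                out[i][j] = beta * input_vals[i][j] + alpha * dot
    return out
```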
Nikita Vedeneev
346e1f512f
sparse compressed validation: allow empty-batched inputs ( #101180 )
...
Fixes https://github.com/pytorch/pytorch/issues/101179 .
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101180
Approved by: https://github.com/pearu , https://github.com/cpuhrsch
2023-05-11 20:30:20 +00:00
Nikita Vedeneev
dd2c22f4bb
bsr_dense_bmm(): enable more precise float32 support with float64 accumulators ( #100882 )
...
Float64 is there in Triton! This PR increases precision for float32 inputs with float64 accumulation dtype.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100882
Approved by: https://github.com/cpuhrsch
2023-05-11 11:22:55 +00:00
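Why a wider accumulator helps can be shown without Triton: rounding each partial sum to float32 loses small contributions next to a large one, while accumulating in float64 and rounding once keeps them. A pure-Python sketch (Python floats are already float64; float32 is emulated via `struct` round-tripping):

```python
import struct

def to_f32(x: float) -> float:
    # Round a Python float (f64) to the nearest float32.
    return struct.unpack("<f", struct.pack("<f", x))[0]

def dot_f32_acc(xs, ys):
    # Accumulate in float32: every partial sum is rounded, so adding
    # 1.0 to 2**24 is a no-op (half an ulp, ties round to even).
    acc = 0.0
    for x, y in zip(xs, ys):
        acc = to_f32(acc + to_f32(x * y))
    return acc

def dot_f64_acc(xs, ys):
    # Accumulate in float64 and round once at the end -- the higher
    # precision this PR enables for float32 inputs.
    return to_f32(sum(x * y for x, y in zip(xs, ys)))
```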
Pearu Peterson
92a7640b76
Add mul tests with sparse sample inputs ( #100393 )
...
This PR implements sparse sample inputs and error inputs for mul OpInfo.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100393
Approved by: https://github.com/amjames , https://github.com/cpuhrsch
2023-05-09 16:13:14 +00:00
Nikita Vedeneev
0141a242fd
bsr_dense_bmm(): remove sparse_rowspace kernel and some dead code ( #100876 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100876
Approved by: https://github.com/cpuhrsch , https://github.com/Skylion007
2023-05-09 16:12:11 +00:00
Nikita Vedeneev
c4bc259f00
bsr_dense_mm(): better test coverage ( #100543 )
...
This PR improves test coverage for `bsr_dense_mm` by:
- ~~enabling correctness tests for `float32`~~.
- extending and testing input correctness checks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100543
Approved by: https://github.com/cpuhrsch , https://github.com/malfet
2023-05-09 09:26:02 +00:00
Pearu Peterson
3ae0e23b90
Fix sum OpInfo for sparse sample inputs and assert coverage for sparse-enabled operators ( #100391 )
...
This PR enables sum tests for sparse sample inputs. Previously, the tests existed but were never run because the sum OpInfo instance was created without specifying `supports_sparse_*=True`. To avoid such mistakes in the future, the following PR https://github.com/pytorch/pytorch/pull/100392 enables the `supports_sparse_*` flags automatically when OpInfo creation specifies `sample_inputs_sparse_*_func`.
In addition, the PR applies several fixes to sum tests for sparse sample inputs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100391
Approved by: https://github.com/cpuhrsch
2023-05-03 02:04:39 +00:00
Nikita Vedeneev
1adb6fa922
nn.Linear: dispatch to bsr_dense_mm for half and bfloat16 ( #94825 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94825
Approved by: https://github.com/albanD , https://github.com/cpuhrsch
2023-04-15 13:38:42 +00:00
Xiao Wang
bd83b205cc
Skip test test_triton_bsr_dense_bmm if not TEST_WITH_TORCHINDUCTOR ( #98462 )
...
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98462
Approved by: https://github.com/zou3519
2023-04-10 21:21:06 +00:00
eqy
2fddcf0fc0
[CUDA][CUDA 11] Remove more CUDA 11 version checks ( #92934 )
...
Working on removing stragglers missed in previous CUDA version < 11.0 cleanup PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92934
Approved by: https://github.com/ngimel
2023-03-30 19:49:52 +00:00
Aaron Gokaslan
47dca20d80
[BE] Enable flake8-comprehension rule C417 ( #97880 )
...
Enables flake8-comprehension rule C417. Ruff autogenerated these fixes to the codebase.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97880
Approved by: https://github.com/ezyang , https://github.com/kit1980 , https://github.com/albanD
2023-03-30 14:34:24 +00:00
Sergii Dymchenko
5ab50cf048
Fix shoud/shoudl typos ( #97930 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97930
Approved by: https://github.com/clee2000
2023-03-30 08:27:16 +00:00
Nikita Shulga
2c16b73a1b
Remove comma from parametrized test name ( #97844 )
...
Using the `name_fn` argument of the `@parametrize` decorator.
The internal test runner can't figure out how to parse names containing commas; otherwise this is a no-op.
For those with intern access, see [T149211516](https://www.internalfb.com/intern/tasks/?t=149211516 )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97844
Approved by: https://github.com/weiwangmeta
2023-03-29 14:20:13 +00:00
Nikita Shulga
b443198966
Fix sparse addmv ref impl for non-contig tensors ( #97730 )
...
Fix logic in `test_block_addmm` that tested the op against itself rather than against the dense implementation, by implementing a `ref_addvm` function that converts the tensor back to dense before multiplying it with the vector.
Fix the reference implementation by passing strides for the vector and the result. (Not sure whether it will be more efficient to iterate over a strided tensor or to request a dense copy as the MKL implementation does.)
Print more verbose error message if values differ.
Fixes https://github.com/pytorch/pytorch/issues/97629 , https://github.com/pytorch/pytorch/issues/97589 , https://github.com/pytorch/pytorch/issues/97563
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97730
Approved by: https://github.com/cpuhrsch
2023-03-28 20:46:32 +00:00
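The key idea of the fix, testing the sparse kernel against an independent dense path, can be sketched in pure Python (function names here are illustrative, not the ones in the test file):

```python
def csr_to_dense(crow, col, vals, shape):
    # Expand a CSR matrix into a dense list-of-lists.
    nrows, ncols = shape
    dense = [[0.0] * ncols for _ in range(nrows)]
    for i in range(nrows):
        for p in range(crow[i], crow[i + 1]):
            dense[i][col[p]] += vals[p]
    return dense

def ref_addmv(beta, y, alpha, crow, col, vals, shape, x):
    # Reference addmv: densify the sparse matrix first, then compute
    # beta*y + alpha*(A @ x) with plain loops, so the sparse kernel is
    # checked against something other than itself.
    a = csr_to_dense(crow, col, vals, shape)
    return [beta * y[i] + alpha * sum(a[i][j] * x[j]
                                      for j in range(shape[1]))
            for i in range(shape[0])]
```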
Nikita Shulga
ad5d81adda
[Sparse] Add reference implementation for addmv ( #97353 )
...
Partially addresses the problem raised in https://github.com/pytorch/pytorch/issues/96972
Add `test_addmv` and enable `test_block_addmv` on all platforms (so the test could be run on M1)
TODO: Make sure that test_block_addmv's non-contiguous mode actually
generates non-contiguous inputs, as right now it probably does not: the
test passes assuming values are contiguous.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97353
Approved by: https://github.com/cpuhrsch
2023-03-24 06:14:32 +00:00
haozhe.zhu
fe0afc5852
use accumulate type in BF16 gemm(include dot, mv) ref path ( #96074 )
...
Fix https://github.com/pytorch/pytorch/issues/95125 and https://github.com/pytorch/pytorch/issues/83863 for bf16 accumulation in gemm ref path
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96074
Approved by: https://github.com/lezcano , https://github.com/peterbell10
2023-03-23 01:22:59 +00:00
Nikita Vedeneev
55cf7eef86
add/add_ for sparse compressed formats: fix silent index downcast int64 -> int32 ( #95294 )
...
Fixes https://github.com/pytorch/pytorch/issues/95224 .
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95294
Approved by: https://github.com/cpuhrsch , https://github.com/amjames
2023-03-10 17:51:40 +00:00
Nikita Vedeneev
98a4d74a68
COO intersection primitives: performance improvement ( #96094 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96094
Approved by: https://github.com/pearu
2023-03-07 13:21:29 +00:00
Nikita Vedeneev
d809020fc8
Triton kernel for bsr @ dense ( #94823 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94823
Approved by: https://github.com/cpuhrsch , https://github.com/malfet
2023-03-03 15:11:28 +00:00
PyTorch MergeBot
d7637801d3
Revert "COO intersection primitives: performance improvement ( #92976 )"
...
This reverts commit b033594943 .
Reverted https://github.com/pytorch/pytorch/pull/92976 on behalf of https://github.com/seemethere due to Need to revert this so I can revert https://github.com/pytorch/pytorch/pull/94048 cleanly
2023-03-03 01:38:56 +00:00
Nikita Vedeneev
b033594943
COO intersection primitives: performance improvement ( #92976 )
...
This PR improves COO intersection primitives by:
* making them sync-free (for dims <= 8; this can be changed to any value that fits on the stack).
* improving performance with far fewer kernel calls.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92976
Approved by: https://github.com/cpuhrsch , https://github.com/pearu
2023-03-02 17:42:39 +00:00
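A COO intersection matches coordinates that appear in both tensors' index sets. A pure-Python sketch of the single-pass, hash-based idea (one map build plus one lookup pass, rather than repeated searches):

```python
def coo_intersect(indices_a, indices_b):
    # indices_* follow the COO convention: one list per dimension, so
    # zip(*indices) yields the coordinate tuples.  Build a hash map of
    # the first tensor's coordinates, then stream through the second --
    # the same "do it in few passes" spirit as reducing kernel calls.
    pos_a = {coord: p for p, coord in enumerate(zip(*indices_a))}
    matches = []
    for p_b, coord in enumerate(zip(*indices_b)):
        p_a = pos_a.get(coord)
        if p_a is not None:
            matches.append((p_a, p_b))
    return matches
```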
Nikita Vedeneev
325b43661e
add/add_ for compressed sparse inputs: bypass BLAS in some trivial cases ( #95293 )
...
In `add(self, other, out=...)` we can bypass calls to BLAS in cases when `self == other == out` and `self == other`.
This PR fixes the repro from https://github.com/pytorch/pytorch/issues/94966 , but the issue is still present when `x.add_(x)` is replaced, say, with `x = x.clone().add_(x)`.
Could that be a synchronization issue? CC @IvanYashchuk .
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95293
Approved by: https://github.com/cpuhrsch
2023-02-27 16:06:02 +00:00
mingfeima
c620ece726
port sparse_mm.reduce to pytorch and optimize it on CPU ( #83727 )
...
### Motivation of this PR
This patch is to migrate `spmm_reduce` from `torch-sparse` (a 3rd party dependency for PyG) to `torch`, which is a response to the initial proposal for fusion of **Gather, Apply Scatter** in Message Passing of GNN inference/training. https://github.com/pytorch/pytorch/issues/71300
**GAS** is the major step for Message Passing, the behavior of **GAS** can be classified into 2 kinds depending on the storage type of `EdgeIndex` which records the connections of nodes:
* COO: the hotspot is `scatter_reduce`
* CSR: the hotspot is `spmm_reduce`
The reduce type can be chosen from: "sum", "mean", "max", "min".
This extends `torch.sparse.mm` with a `reduce` argument, which maps to `torch.sparse_mm.reduce` internally.
`sparse_mm_reduce` is registered under the TensorTypeId of `SparseCsrCPU`; this operator requires an internal interface `_sparse_mm_reduce_impl` which has dual outputs:
* `out` - the actual output
* `arg_out` - records the output indices of the non-zero elements when the reduce type is "max" or "min"; this is only useful for training, so for inference it is not calculated.
### Performance
Benchmark on GCN for obgn-products on Xeon single socket, the workload is improved by `4.3x` with this patch.
The performance benefit for training will be bigger: the original backward impl for `sum|mean` is sequential, and the original backward impl for `max|min` is not fused.
#### before:
```
----------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls
----------------------------- ------------ ------------ ------------ ------------ ------------ ------------
torch_sparse::spmm_sum 97.09% 56.086s 97.09% 56.088s 6.232s 9
aten::linear 0.00% 85.000us 1.38% 795.485ms 88.387ms 9
aten::matmul 0.00% 57.000us 1.38% 795.260ms 88.362ms 9
aten::mm 1.38% 795.201ms 1.38% 795.203ms 88.356ms 9
aten::relu 0.00% 50.000us 0.76% 440.434ms 73.406ms 6
aten::clamp_min 0.76% 440.384ms 0.76% 440.384ms 73.397ms 6
aten::add_ 0.57% 327.801ms 0.57% 327.801ms 36.422ms 9
aten::log_softmax 0.00% 23.000us 0.10% 55.503ms 18.501ms 3
```
#### after
```
----------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls
----------------------------- ------------ ------------ ------------ ------------ ------------ ------------
aten::spmm_sum 87.35% 11.826s 87.36% 11.827s 1.314s 9
aten::linear 0.00% 92.000us 5.87% 794.451ms 88.272ms 9
aten::matmul 0.00% 62.000us 5.87% 794.208ms 88.245ms 9
aten::mm 5.87% 794.143ms 5.87% 794.146ms 88.238ms 9
aten::relu 0.00% 53.000us 3.35% 452.977ms 75.496ms 6
aten::clamp_min 3.35% 452.924ms 3.35% 452.924ms 75.487ms 6
aten::add_ 2.58% 348.663ms 2.58% 348.663ms 38.740ms 9
aten::argmax 0.42% 57.473ms 0.42% 57.475ms 14.369ms 4
aten::log_softmax 0.00% 22.000us 0.39% 52.605ms 17.535ms 3
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83727
Approved by: https://github.com/jgong5 , https://github.com/cpuhrsch , https://github.com/rusty1s , https://github.com/pearu
2023-02-10 15:56:40 +00:00
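The `out`/`arg_out` contract described above can be sketched in pure Python: a CSR sparse-dense product with the reduction fused in, recording argmax/argmin positions only for the reduce types whose backward needs them (a toy model, not the optimized CPU kernel):

```python
def spmm_reduce(crow, col, vals, dense, reduce="sum"):
    # CSR sparse @ dense with a fused reduction over each row's
    # contributions.  For "max"/"min", arg_out records which stored
    # element (its position in vals) produced the result, as needed
    # for the backward pass; for "sum"/"mean" it is not computed.
    nrows = len(crow) - 1
    ncols = len(dense[0])
    out = [[0.0] * ncols for _ in range(nrows)]
    arg_out = ([[-1] * ncols for _ in range(nrows)]
               if reduce in ("max", "min") else None)
    for i in range(nrows):
        lo, hi = crow[i], crow[i + 1]
        for j in range(ncols):
            contribs = [(vals[p] * dense[col[p]][j], p)
                        for p in range(lo, hi)]
            if not contribs:
                continue  # empty row: output stays zero
            if reduce == "sum":
                out[i][j] = sum(c for c, _ in contribs)
            elif reduce == "mean":
                out[i][j] = sum(c for c, _ in contribs) / len(contribs)
            elif reduce == "max":
                out[i][j], arg_out[i][j] = max(contribs)
            elif reduce == "min":
                out[i][j], arg_out[i][j] = min(contribs)
    return out, arg_out
```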
Aleksandar Samardžić
e1f17b3530
Add CSR->BSC and CSC->BSR conversions ( #93301 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93301
Approved by: https://github.com/cpuhrsch
2023-02-07 19:22:05 +00:00
Nikita Vedeneev
bb6af061a0
torch.triangular_solve for CSR: materialize diagonal elements when unitriangular=True. (#93352 )
...
Fixes https://github.com/pytorch/pytorch/issues/88890
A temporary fix until MKL is fixed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93352
Approved by: https://github.com/cpuhrsch
2023-01-31 16:33:57 +00:00
Aleksandar Samardžić
53f7fb9a22
Add CSC->BSC conversion ( #92307 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92307
Approved by: https://github.com/cpuhrsch
2023-01-30 17:03:36 +00:00
Pearu Peterson
65d6802e2f
Improve error messages for sparse methods on tensors with unsupported backends/layouts. ( #93149 )
...
Fixes https://github.com/pytorch/pytorch/issues/92790
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93149
Approved by: https://github.com/cpuhrsch
2023-01-27 19:50:23 +00:00
PyTorch MergeBot
7012d985fa
Revert "Improve bsr @ strided performance in baddmm for bfloat16/half with Triton kernels. ( #88078 )"
...
This reverts commit 46f16b9363 .
Reverted https://github.com/pytorch/pytorch/pull/88078 on behalf of https://github.com/ZainRizvi due to Causing a test to fail consistently: test_decomp.py::HasDecompTest::test_has_decomposition
2023-01-26 16:22:29 +00:00
Nikita Vedeneev
46f16b9363
Improve bsr @ strided performance in baddmm for bfloat16/half with Triton kernels. ( #88078 )
...
As per title.
Additionally we also introduce support for:
- Rectangular block sizes which are powers of 2 and at least 16 (triton's `dot` limitation).
- Batch support with broadcasting for either of the arguments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88078
Approved by: https://github.com/cpuhrsch
2023-01-26 07:58:27 +00:00
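The constraint quoted above (block sides must be powers of 2 and at least 16, due to triton's `dot`) is a simple predicate; a sketch with a hypothetical helper name:

```python
def is_supported_block_size(br: int, bc: int) -> bool:
    # Rectangular blocks are allowed as long as each side is a power
    # of two and at least 16 (a limitation of triton's `dot`).
    def ok(n: int) -> bool:
        return n >= 16 and (n & (n - 1)) == 0
    return ok(br) and ok(bc)
```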
Eddie Yan
0bf7506051
[CUDA] Drop CUDA < 11.0 test flags ( #92605 )
...
Follow-up of #89582 to drop flags like `CUDA11OrLater` in tests. Note that in some places it appears that `TEST_WITH_ROCM` is _implicitly_ guarded against via the `CUDA11OrLater` version check, based on my best-guess of how `torch.version.cuda` would behave in ROCM builds, so I've added `not TEST_WITH_ROCM` in cases where ROCM wasn't previously explicitly allowed.
CC @ptrblck @malfet @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92605
Approved by: https://github.com/ngimel
2023-01-24 04:34:06 +00:00
Yanbo Liang
0ab4ab9f8d
[Dynamo] Fix calling UserDefinedObject.func should pass self object ( #92050 )
...
Fixes #90834
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92050
Approved by: https://github.com/jansel
2023-01-21 05:47:01 +00:00
PyTorch MergeBot
60bf851931
Revert "Improve bsr @ strided performance in baddmm for bfloat16/half with Triton kernels. ( #88078 )"
...
This reverts commit 8383b5c488 .
Reverted https://github.com/pytorch/pytorch/pull/88078 on behalf of https://github.com/malfet due to This seems to have broke sm_86 testing, see https://hud.pytorch.org/hud/pytorch/pytorch/master/1?per_page=50&name_filter=sm86%20%2F%20test%20 (default%2C%203
2023-01-19 23:37:59 +00:00