Commit Graph

36 Commits

Author SHA1 Message Date
drisspg
9108b74bbc Updates to scaled_mm for rowwise scaling (#130059)
# Summary

This updates `_scaled_mm`'s API to enforce that input scales are always 2-dimensional, which resolves ambiguity around the intended scaling scheme.
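
A hedged sketch of a rowwise-scaled call with 2-D scales (illustrative only; it needs an fp8-capable GPU, Hopper-class for the rowwise path at the time of this PR, and the `(M, 1)` / `(1, N)` scale-shape convention plus the tensor names are assumptions rather than text from this PR):

```python
import torch

M, K, N = 32, 64, 16
a = torch.randn(M, K, device="cuda").to(torch.float8_e4m3fn)       # row-major
b = torch.randn(N, K, device="cuda").to(torch.float8_e4m3fn).t()   # column-major view

scale_a = torch.rand(M, 1, device="cuda") + 0.5   # one fp32 scale per row of `a`
scale_b = torch.rand(1, N, device="cuda") + 0.5   # one fp32 scale per column of `b`

out = torch._scaled_mm(a, b, scale_a, scale_b, out_dtype=torch.bfloat16)
```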
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130059
Approved by: https://github.com/vkuzo
2024-07-04 00:53:17 +00:00
y-sq
ff026f3d0a Fix an issue in meta_scaled_mm (#129521)
Summary:
To fix the following failure cases:

For example, when `M, K, N = 245760, 656, 6560`, fp8 with compile fails due to `RuntimeError: mat2 must be col_major`.

---------
From the inductor generated code (https://fburl.com/everpaste/epcagkrd)
```
V0625 01:38:55.551000 140329914449920 torch/_inductor/scheduler.py:1623] [0/0] scheduling ComputedBuffer(name='buf12', layout=FixedLayout('cuda', torch.float8_e4m3fn, size=[656, 6560], stride=[6656, 1]),
... ...
V0625 01:38:56.194000 140329914449920 torch/_inductor/graph.py:1680] [0/0] [__output_code]         buf12 = empty_strided_cuda((656, 6560), (6656, 1), torch.float8_e4m3fn)
... ...
V0625 01:38:56.194000 140329914449920 torch/_inductor/graph.py:1680] [0/0] [__output_code]     return (buf10, buf2, buf5, buf6, reinterpret_tensor(buf11, (245760, 656), (1, 245760), 0), reinterpret_tensor(buf12, (6560, 656), (1, 6656), 0), )
... ...
V0625 01:39:12.098000 140312968167424 torch/_inductor/graph.py:1680] [1/0_1] [__output_code]     assert_size_stride(permute_10, (6560, 656), (1, 6656))
... ...
V0625 01:39:12.098000 140312968167424 torch/_inductor/graph.py:1680] [1/0_1] [__output_code]         buf8 = aten._scaled_mm.default(buf6, permute_10, buf7, reciprocal_3, None, None, torch.bfloat16)
```

Inductor gives mat2 (`permute_10`, shape `(6560, 656)`) a stride of `6656` instead of its `shape[0]` (`6560`).

Therefore, the `stride[1] == shape[0]` condition fails.

To fix the issue, relax the `is_col_major` check by excluding this condition, as it doesn't hold for all valid cases (see the sketch below).
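
A minimal sketch of what a relaxed check could look like (illustrative only; the name and exact conditions are assumptions, not the code in this PR):

```python
def is_col_major_relaxed(shape, stride):
    # Column-major: elements within a column are contiguous (stride 1 along
    # dim 0). The leading dimension (stride[1]) may be padded by whoever
    # allocated the buffer, so only require it to be >= shape[0] rather
    # than exactly equal.
    return stride[0] == 1 and stride[1] >= shape[0]


# The case from the log above: shape (6560, 656), stride (1, 6656).
assert is_col_major_relaxed((6560, 656), (1, 6656))
```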

Test Plan:
Re-run the previously failing case; it passes with the fix.
-----
Sandcastle / GitHub CI will verify that the existing tests still pass.

Reviewed By: vkuzo

Differential Revision: D58994704

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129521
Approved by: https://github.com/drisspg
2024-06-27 07:03:34 +00:00
Andres Lugo
b9a1c2c991 [ROCm] Enable F8 Inductor Unit tests (#128353)
First batch of Inductor unit tests enabled on ROCm for the fnuz f8 variant on MI300.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128353
Approved by: https://github.com/jansel, https://github.com/eellison
2024-06-26 18:30:43 +00:00
drisspg
fcf2a1378b Enable fp8 rowwise scaling kernel on cuda, TAKE 2: #125204 (#128989)
# Summary
First PR got reverted and needed a redo

This pull request introduces an fp8 row-scaling kernel as an optional implementation for `scaled_mm`. The kernel selection is based on the scaling tensors of the inputs. For inputs `x` and `y` of shape `[M, K]` and `[K, N]` respectively, the following conditions must be met:
- `x`'s scale should be a 1-dimensional tensor of length `M`.
- `y`'s scale should be a 1-dimensional tensor of length `N`.

It's important to note that this kernel is not called "rowwise, columnwise" scaling because, although the scales for `y` are semantically along its columns, this implementation only supports the TN format. This means the scaling is along the faster-moving dimension, or the "row".
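
A minimal sketch of the dispatch rule these conditions imply (a pure-Python illustration with made-up names, not the actual C++ selection logic):

```python
import torch

def selects_rowwise_kernel(x: torch.Tensor, y: torch.Tensor,
                           scale_x: torch.Tensor, scale_y: torch.Tensor) -> bool:
    # x: [M, K], y: [K, N]. The rowwise kernel is chosen only when both
    # scales are 1-D vectors of length M and N respectively; otherwise the
    # existing tensorwise (scalar-scale) path is used.
    M = x.shape[0]
    N = y.shape[1]
    return (scale_x.dim() == 1 and scale_x.numel() == M
            and scale_y.dim() == 1 and scale_y.numel() == N)
```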

The following two PRs were required to enable local builds:
- [PR #126185](https://github.com/pytorch/pytorch/pull/126185)
- [PR #125523](https://github.com/pytorch/pytorch/pull/125523)

### Todo
We still do not build our Python wheels with this architecture.

@ptrblck @malfet, should we replace `sm_90` with `sm_90a`?

The NVRTC TMA shadowing feels wrong, but I am not sure of the right way to spoof the symbol for this compilation unit:
https://github.com/pytorch/pytorch/pull/125204/files#r1586986954

#### ifdef

I tried to use `#if !defined(USE_ROCM) && defined(CUDA_VERSION) && CUDA_VERSION >= 12000 && defined(__CUDA_ARCH__) && __CUDA_ARCH__ > 900` to gate building of the kernel, but I was having a hell of a time with it, so I am not sure this is the right way to do it.

Kernel Credit:
@jwfromm

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128989
Approved by: https://github.com/yangsiyu007, https://github.com/vkuzo
2024-06-19 04:49:39 +00:00
drisspg
fc2913fb80 Remove amax return from _scaled_mm (#128683)
# Summary
The primary reason for the change was the lack of a current use case and the need to work around two Inductor issues:
- Tensor arguments as kwarg-only
- multiple outputs from Triton templates

If the need for the amax return arises, we can consider adding it back or, more likely, creating a separate op.

In principle PyTorch is moving away from ops that bundle lots of functionality into "mega ops". We instead rely upon the compiler to generate appropriate fused kernels.

### Changes:
- This removes the amax return from `scaled_mm`. We have found that the common use case is to return in "high precision" (a type with more precision than fp8), and amax is only relevant when returning in low precision.
- We still allow fp8 returns and a scaled result; perhaps we should ban this as well...

New signature:
```Python
def meta_scaled_mm(
    self: torch.Tensor,
    mat2: torch.Tensor,
    scale_a: torch.Tensor,
    scale_b: torch.Tensor,
    bias: Optional[torch.Tensor] = None,
    scale_result: Optional[torch.Tensor] = None,
    out_dtype: Optional[torch.dtype] = None,
    use_fast_accum: bool = False,
) -> torch.Tensor:
```
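
A hedged usage sketch matching the signature above (needs an fp8-capable GPU; tensor names are made up): the op now returns a single tensor instead of an `(output, amax)` tuple.

```python
import torch

a = torch.randn(32, 64, device="cuda").to(torch.float8_e4m3fn)   # row-major
b = torch.randn(16, 64, device="cuda").to(torch.float8_e4m3fn)
scale_a = torch.tensor(1.0, device="cuda")   # tensorwise (scalar) scales
scale_b = torch.tensor(1.0, device="cuda")

out = torch._scaled_mm(a, b.t(), scale_a, scale_b, out_dtype=torch.bfloat16)
assert isinstance(out, torch.Tensor)   # single return value after this change
```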
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128683
Approved by: https://github.com/vkuzo
2024-06-17 16:48:00 +00:00
PyTorch MergeBot
a5b86a1ec0 Revert "FP8 rowwise scaling (#125204)"
This reverts commit 5dc9128229.

Reverted https://github.com/pytorch/pytorch/pull/125204 on behalf of https://github.com/atalman due to Sorry need to revert this failing, on internal CI. I suggest to reimport this and try to land internally resolving all issues ([comment](https://github.com/pytorch/pytorch/pull/125204#issuecomment-2152905513))
2024-06-06 16:12:34 +00:00
drisspg
5dc9128229 FP8 rowwise scaling (#125204)
# Summary
This pull request introduces an fp8 row-scaling kernel as an optional implementation for `scaled_mm`. The kernel selection is based on the scaling tensors of the inputs. For inputs `x` and `y` of shape `[M, K]` and `[K, N]` respectively, the following conditions must be met:
- `x`'s scale should be a 1-dimensional tensor of length `M`.
- `y`'s scale should be a 1-dimensional tensor of length `N`.

It's important to note that this kernel is not called "rowwise, columnwise" scaling because, although the scales for `y` are semantically along its columns, this implementation only supports the TN format. This means the scaling is along the faster-moving dimension, or the "row".

The following two PRs were required to enable local builds:
- [PR #126185](https://github.com/pytorch/pytorch/pull/126185)
- [PR #125523](https://github.com/pytorch/pytorch/pull/125523)

### Todo
We still do not build our Python wheels with this architecture.

@ptrblck @malfet, should we replace `sm_90` with `sm_90a`?

The NVRTC TMA shadowing feels wrong, but I am not sure of the right way to spoof the symbol for this compilation unit:
https://github.com/pytorch/pytorch/pull/125204/files#r1586986954

#### ifdef

I tried to use `#if !defined(USE_ROCM) && defined(CUDA_VERSION) && CUDA_VERSION >= 12000 && defined(__CUDA_ARCH__) && __CUDA_ARCH__ > 900` to gate building of the kernel, but I was having a hell of a time with it, so I am not sure this is the right way to do it.

Kernel Credit:
@jwfromm

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125204
Approved by: https://github.com/lw, https://github.com/malfet
2024-06-05 15:46:40 +00:00
PyTorch MergeBot
d05cddfe23 Revert "FP8 rowwise scaling (#125204)"
This reverts commit 923edef31c.

Reverted https://github.com/pytorch/pytorch/pull/125204 on behalf of https://github.com/atalman due to Broke nightlies and internal tests ([comment](https://github.com/pytorch/pytorch/pull/125204#issuecomment-2145422196))
2024-06-03 15:00:21 +00:00
drisspg
923edef31c FP8 rowwise scaling (#125204)
# Summary
This pull request introduces an fp8 row-scaling kernel as an optional implementation for `scaled_mm`. The kernel selection is based on the scaling tensors of the inputs. For inputs `x` and `y` of shape `[M, K]` and `[K, N]` respectively, the following conditions must be met:
- `x`'s scale should be a 1-dimensional tensor of length `M`.
- `y`'s scale should be a 1-dimensional tensor of length `N`.

It's important to note that this kernel is not called "rowwise, columnwise" scaling because, although the scales for `y` are semantically along its columns, this implementation only supports the TN format. This means the scaling is along the faster-moving dimension, or the "row".

The following two PRs were required to enable local builds:
- [PR #126185](https://github.com/pytorch/pytorch/pull/126185)
- [PR #125523](https://github.com/pytorch/pytorch/pull/125523)

### Todo
We still do not build our Python wheels with this architecture.

@ptrblck @malfet, should we replace `sm_90` with `sm_90a`?

The NVRTC TMA shadowing feels wrong, but I am not sure of the right way to spoof the symbol for this compilation unit:
https://github.com/pytorch/pytorch/pull/125204/files#r1586986954

#### ifdef

I tried to use `#if !defined(USE_ROCM) && defined(CUDA_VERSION) && CUDA_VERSION >= 12000 && defined(__CUDA_ARCH__) && __CUDA_ARCH__ > 900` to gate building of the kernel, but I was having a hell of a time with it, so I am not sure this is the right way to do it.

Kernel Credit:
@jwfromm

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125204
Approved by: https://github.com/lw
2024-05-31 20:09:08 +00:00
Andres Lugo-Reyes
69c6e0b851 [ROCm] Fix ROCm bug that causes numerical errors in float8_experimental (#123275)
Recently there has been work in an experimental repo to start implementing the intrinsics necessary to handle F8 workloads. (see: https://github.com/pytorch-labs/float8_experimental)

A recent PR was submitted to add support for AMD F8 types (fnuz). That PR uncovered a bug in the ROCm code that caused unit tests to fail due to numerical inaccuracy. This PR fixes the bug by swapping `abs_()` for `abs()`: the former takes the elementwise absolute value of the tensor in place, so the final assertion failed because the tensor then contained only positive values.
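
A small illustration of the difference (not the actual test code):

```python
import torch

out = torch.randn(4, 4)        # stand-in for the matmul output
amax = out.abs_().max()        # abs_() rewrites `out` in place ...
assert (out >= 0).all()        # ... so `out` now holds only non-negative values

out = torch.randn(4, 4)
amax = out.abs().max()         # abs() allocates a new tensor; `out` is untouched
```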

Important to note, this fix is part of a workaround as hipblasLT does not yet support amax (HIPBLASLT_MATMUL_DESC_AMAX_D_POINTER). This functionality has been implemented internally and is going through the proper channels to propagate to the community.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123275
Approved by: https://github.com/drisspg, https://github.com/jeffdaily
2024-04-10 21:52:02 +00:00
Jeff Daily
96793e0f10 [ROCm] enable scaled_gemm (#117822)
Adds scaled_gemm for ROCm using hipblaslt. As of ROCm 6.0, HIPBLASLT_MATMUL_DESC_AMAX_D_POINTER is not supported. A workaround is provided that performs the absmax operation on the output buffer, but this results in some loss of accuracy for the absmax result. For this reason the feature should be considered beta/preview.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117822
Approved by: https://github.com/jianyuh, https://github.com/xw285cornell
2024-02-29 10:20:48 +00:00
eqy
65efece3a4 [CUDA][cuBLAS] Bump test_cublas_baddbmm_large_input tolerances (#117889)
It is unfortunate that the current `rtol=1e-5` hits a literal 1-in-1,000,000 element mismatch (at `rtol=1.04e-5`) on L40.
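
For reference, a toy illustration of the tolerance arithmetic (made-up numbers, unrelated to the failing test): `assert_close` accepts a value when `|actual - expected| <= atol + rtol * |expected|`.

```python
import torch

ref = torch.tensor([1.0])
val = torch.tensor([1.0 + 1.04e-5])          # off by ~1.04e-5 relative to ref

torch.testing.assert_close(val, ref, rtol=1.1e-5, atol=0.0)   # passes
# torch.testing.assert_close(val, ref, rtol=1e-5, atol=0.0)   # would raise
```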

CC @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117889
Approved by: https://github.com/atalman
2024-02-27 19:05:20 +00:00
drisspg
6b009aceea Enable scaled_mm on sm89 devices (#118881)
Fixes #118703

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118881
Approved by: https://github.com/malfet
2024-02-03 00:44:03 +00:00
Jithun Nair
2ea2421b44 Skip unit tests that fail on MI210 runners (#114613)
Taken from https://github.com/pytorch/pytorch/pull/105980
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114613
Approved by: https://github.com/malfet
2023-11-27 22:25:35 +00:00
Eddie Yan
785e586eb0 [CUDA][cuBLAS] Separate reduced precision reductions on/off for addmm tests (#112545)
CC @malfet @ptrblck
~~We've been seeing a lot of noise from Ampere and later devices due to reduced precision reductions, so preemptively disabling them for addmm tests.~~

Breaking out addmm tests into one with and without reduced precision reductions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112545
Approved by: https://github.com/malfet
2023-11-07 19:09:29 +00:00
drisspg
a469aca1cc Exposes a fast_fp8_accum option to _scaled_mm (#111847)
# Summary
Adds the option to use the fast-accumulation mode for the fp8 matmul in `scaled_mm`.

Information can be found here: https://docs.nvidia.com/cuda/cublas/#cublasltmatmuldescattributes-t

The option defaults to 0 (off).
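
A hedged usage sketch (needs an fp8-capable GPU; the keyword name follows the `use_fast_accum` parameter in the signature quoted earlier in this log, and the tensor setup is made up):

```python
import torch

a = torch.randn(32, 64, device="cuda").to(torch.float8_e4m3fn)
b = torch.randn(16, 64, device="cuda").to(torch.float8_e4m3fn)
one = torch.tensor(1.0, device="cuda")

# use_fast_accum=True turns on cuBLASLt's fast-accumulation mode for fp8.
out = torch._scaled_mm(a, b.t(), one, one, out_dtype=torch.bfloat16,
                       use_fast_accum=True)
```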

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111847
Approved by: https://github.com/ipiszy, https://github.com/malfet
2023-10-24 03:26:53 +00:00
Christian Puhrsch
3553eb9b89 Add CUTLASS-based support for mixed dtypes matrix multiplication (#110981)
Resubmission without ghstack to make it easier to import https://github.com/pytorch/pytorch/pull/110934/commits

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110981
Approved by: https://github.com/drisspg
2023-10-11 21:47:52 +00:00
drisspg
09a17c512d Add better error messaging to scaled_mm (#108454)
Fixes #108411

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108454
Approved by: https://github.com/vkuzo
2023-09-07 21:26:47 +00:00
Kurt Mohler
3f88e3105f Reland: Remove remaining global set_default_dtype calls from tests (#108088)
Fixes #68972

Relands #107246

To avoid causing Meta-internal CI failures, this PR avoids always asserting that the default dtype is float in the `TestCase.setUp/tearDown` methods. Instead, the assert is only done if `TestCase._default_dtype_check_enabled == True`. `_default_dtype_check_enabled` is set to True in the `if __name__ == "__main__":` blocks of all the relevant test files that required changes for this issue.
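
A simplified sketch of the opt-in pattern described above (a stand-in for PyTorch's internal `TestCase`; only `_default_dtype_check_enabled` comes from the PR text, the rest is illustrative):

```python
import unittest

import torch

class TestCase(unittest.TestCase):
    _default_dtype_check_enabled = False   # off by default to avoid internal CI failures

    def setUp(self):
        if self._default_dtype_check_enabled:
            assert torch.get_default_dtype() is torch.float

    def tearDown(self):
        if self._default_dtype_check_enabled:
            assert torch.get_default_dtype() is torch.float

if __name__ == "__main__":
    TestCase._default_dtype_check_enabled = True   # each opted-in test file flips this on
    unittest.main()
```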

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108088
Approved by: https://github.com/ezyang
2023-09-07 03:04:34 +00:00
drisspg
d5ff8ca4ef Relax divsibilty by 16 for leading dimension of mat1 in scaled_gemm (#108308)
# Summary
cuBLASLt requires that the matrices be 16-byte aligned. If `mat1.size(-1) % 16 == 0` and the matrix is row-major, then the leading dimension can be any value. See this comment: https://github.com/pytorch/pytorch/pull/107341#discussion_r1310934737
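
A hedged sketch of the relaxed condition (illustrative Python, not the C++ check in this PR; for fp8, 16 elements are 16 bytes):

```python
import torch

def mat1_leading_dim_ok(mat1: torch.Tensor) -> bool:
    # Per the note above: for a row-major mat1 whose innermost dimension is
    # a multiple of 16, the leading dimension may take any value, so only
    # these two properties need to be checked.
    return mat1.stride(-1) == 1 and mat1.size(-1) % 16 == 0
```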

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108308
Approved by: https://github.com/eqy, https://github.com/vkuzo
2023-08-31 20:31:47 +00:00
drisspg
00eed6f367 Better Error Message for invalid Out_dtype + Bias for scaled_mm (#108097)
# Summary
Fixes an error case that was directly throwing a cuBLASLt error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108097
Approved by: https://github.com/vkuzo
2023-08-29 04:10:17 +00:00
PyTorch MergeBot
161ea463e6 Revert "Remove remaining global set_default_dtype calls from tests (#107246)"
This reverts commit aa8ea1d787.

Reverted https://github.com/pytorch/pytorch/pull/107246 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/107246#issuecomment-1693838522))
2023-08-25 19:34:55 +00:00
Kurt Mohler
aa8ea1d787 Remove remaining global set_default_dtype calls from tests (#107246)
Fixes #68972

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107246
Approved by: https://github.com/ezyang
2023-08-24 16:10:48 +00:00
drisspg
c093fdf924 Fix wrong hardcoded value for _scaled_mm (#107719)
## Summary
Sneaky little bug: we were accidentally fusing a ReLU into the epilogue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107719
Approved by: https://github.com/vkuzo
2023-08-22 21:52:20 +00:00
Driss Guessous
8ccfd801be Introduce CUDA-only _scaled_mm op (#107341)
Summary:
Based on D48377631, with updates to guard the use of cuBLAS features only available after CUDA 11.8.

According to https://docs.nvidia.com/cuda/cublas/#id99, only FP8 matrix types can be scaled, and `Float8_e4m3`x`Float8_e4m3` results can be returned as `Float8_e4m3`, or upcast to `Half`, `BFloat16` or `Float`; in that case, however, `result_scale` has no effect and `amax` is not computed.
An optional `bias` argument can also be passed, which should be a vector of either `Half` or `BFloat16` whose values are added to each row of the result matrix.

See table below for supported input and output types:
| Mat1 type  | Mat2 type | Bias type | Output types |
| ----------- | ----------- | ----------- | ----------- |
| Float8_e4m3  | Float8_e4m3  | Float16  | Float8_e4m3, Float16 |
| Float8_e4m3  | Float8_e4m3  | BFloat16 | Float8_e4m3, BFloat16, Float |
| Float8_e5m2  | Float8_e4m3  | Float16 |  Float8_e4m3, Float8_e5m2, Float16  |
| Float8_e5m2  | Float8_e4m3  | BFloat16 |  Float8_e4m3, Float8_e5m2, BFloat16, Float |
| Float8_e4m3  | Float8_e5m2  | Float16 |  Float8_e4m3, Float8_e5m2, Float16 |
| Float8_e4m3  | Float8_e5m2  | BFloat16 |  Float8_e4m3, Float8_e5m2,  BFloat16, Float |
| Float8_e4m3  | Float8_e5m2  | Not supported | Not supported |

The decomposition implementation is skipped until the fp8-on-Triton story is better defined. A potential decomposition could look something like the following:
```python
@register_decomposition(aten._scaled_mm)
def _scaled_mm(
    mat1: Tensor,
    mat2: Tensor,
    *,
    dtype: Optional[torch.dtype] = None,
    scale_a: Optional[Tensor] = None,
    scale_b: Optional[Tensor] = None,
    scale_result: Optional[Tensor] = None,
) -> Tuple[Tensor, Tensor]:
    rc = torch.mm(mat1.to(torch.float32), mat2.to(torch.float32))
    rc = scale_a * rc if scale_a is not None else rc
    rc = scale_b * rc if scale_b is not None else rc
    rc = scale_result * rc if scale_result is not None else rc
    rc = rc.to(dtype if dtype is not None else mat1.dtype)
    return rc, torch.tensor(0.0, device=mat1.device)
```

Known limitations:
  - Only works for matrix sizes divisible by 16
  - The 1st operand must be in row-major and the 2nd in column-major order (i.e. if `x` and `y` are contiguous, then only `torch._scaled_mm(x, y.t())` will work); see the sketch below
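
A small sketch of the layout requirement (illustrative; it only inspects strides and does not call the op):

```python
import torch

M, K, N = 32, 64, 16                              # all divisible by 16, per the limitation above
x = torch.randn(M, K).to(torch.float8_e4m3fn)     # contiguous, i.e. row-major
y = torch.randn(N, K).to(torch.float8_e4m3fn)

# y.t() has shape (K, N) but reuses y's memory, so it is column-major,
# which is what the second operand must be.
assert x.stride() == (K, 1)
assert y.t().stride() == (1, K)
```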

Test Plan: Tests in test_matmul_cuda.py

Differential Revision: D48415871

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107341
Approved by: https://github.com/vkuzo
2023-08-17 21:24:43 +00:00
PyTorch MergeBot
1af324b560 Revert "Introduce CUDA-only _scaled_mm op (#106844)"
This reverts commit 9440a8cbec.

Reverted https://github.com/pytorch/pytorch/pull/106844 on behalf of https://github.com/izaitsevfb due to Breaks internal builds ([comment](https://github.com/pytorch/pytorch/pull/106844#issuecomment-1679858327))
2023-08-16 02:05:29 +00:00
Nikita Shulga
9440a8cbec Introduce CUDA-only _scaled_mm op (#106844)
According to https://docs.nvidia.com/cuda/cublas/#id99, only FP8 matrix types can be scaled, and `Float8_e4m3`x`Float8_e4m3` results can be returned as `Float8_e4m3`, or upcast to `Half`, `BFloat16` or `Float`; in that case, however, `result_scale` has no effect and `amax` is not computed.
An optional `bias` argument can also be passed, which should be a vector of either `Half` or `BFloat16` whose values are added to each row of the result matrix.

See table below for supported input and output types:
| Mat1 type  | Mat2 type | Bias type | Output types |
| ----------- | ----------- | ----------- | ----------- |
| Float8_e4m3  | Float8_e4m3  | Float16  | Float8_e4m3, Float16 |
| Float8_e4m3  | Float8_e4m3  | BFloat16 | Float8_e4m3, BFloat16, Float |
| Float8_e5m2  | Float8_e4m3  | Float16 |  Float8_e4m3, Float8_e5m2, Float16  |
| Float8_e5m2  | Float8_e4m3  | BFloat16 |  Float8_e4m3, Float8_e5m2, BFloat16, Float |
| Float8_e4m3  | Float8_e5m2  | Float16 |  Float8_e4m3, Float8_e5m2, Float16 |
| Float8_e4m3  | Float8_e5m2  | BFloat16 |  Float8_e4m3, Float8_e5m2,  BFloat16, Float |
| Float8_e4m3  | Float8_e5m2  | Not supported | Not supported |

The decomposition implementation is skipped until the fp8-on-Triton story is better defined. A potential decomposition could look something like the following:
```python
@register_decomposition(aten._scaled_mm)
def _scaled_mm(
    mat1: Tensor,
    mat2: Tensor,
    *,
    dtype: Optional[torch.dtype] = None,
    scale_a: Optional[Tensor] = None,
    scale_b: Optional[Tensor] = None,
    scale_result: Optional[Tensor] = None,
) -> Tuple[Tensor, Tensor]:
    rc = torch.mm(mat1.to(torch.float32), mat2.to(torch.float32))
    rc = scale_a * rc if scale_a is not None else rc
    rc = scale_b * rc if scale_b is not None else rc
    rc = scale_result * rc if scale_result is not None else rc
    rc = rc.to(dtype if dtype is not None else mat1.dtype)
    return rc, torch.tensor(0.0, device=mat1.device)
```

Known limitations:
  - Only works for matrix sizes divisible by 16
  - The 1st operand must be in row-major and the 2nd in column-major order (i.e. if `x` and `y` are contiguous, then only `torch._scaled_mm(x, y.t())` will work)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106844
Approved by: https://github.com/albanD
ghstack dependencies: #106977
2023-08-15 02:59:41 +00:00
rraminen
239578beff [ROCm] Enable a few bfloat16 unit tests (#105177)
Currently, a few unit tests from the **test_matmul_cuda** and **test_sparse_csr** test suites are skipped on ROCm.

This PR is to enable the following unit tests on ROCm (~30 UTs):

test_cublas_baddbmm_large_input_* (__main__.TestMatmulCudaCUDA)
test_addmm_sizes_all_sparse_csr* (__main__.TestSparseCSRCUDA) when m==0 or n==0 or k==0

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105177
Approved by: https://github.com/pruthvistony, https://github.com/jithunnair-amd, https://github.com/malfet
2023-08-03 21:17:19 +00:00
Justin Chu
73e1455327 [BE] Enable ruff's UP rules and autoformat test/ (#105434)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105434
Approved by: https://github.com/albanD
2023-07-19 20:36:06 +00:00
Eddie Yan
5c5ad53517 [CUBLAS] Specify alignment for cuBlasLt addmm (#98975)
Fixes the underlying issue previously addressed in #92201 by specifying minimum alignments explicitly to `cuBLAS` rather than relying on a handcrafted rule. ~~We're still investigating some potential failure modes on `sm80` and `sm90` but those would be real `cuBlasLt` heuristics bugs rather than being caused by underspecifying constraints to the heuristics.~~

According to the `cuBLAS` docs, the default alignment is 256 bytes, so that is the maximum currently being checked: https://docs.nvidia.com/cuda/cublas/
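
A hedged sketch of how a per-tensor alignment might be computed before being reported to the heuristics (illustrative only, not the code in this PR):

```python
import torch

def max_alignment_bytes(t: torch.Tensor, cap: int = 256) -> int:
    # Largest power-of-two alignment, capped at 256 bytes (the cuBLASLt
    # default mentioned above), satisfied by the tensor's data pointer.
    addr = t.data_ptr()
    align = 1
    while align < cap and addr % (align * 2) == 0:
        align *= 2
    return align

x = torch.randn(128, 128)
print(max_alignment_bytes(x))   # typically 256 for freshly allocated tensors
```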

CC @ptrblck @ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98975
Approved by: https://github.com/ngimel
2023-04-18 06:19:30 +00:00
blorange-amd
079452ea0f Enable test_matmul_cuda UTs for ROCm (#98797)
test_file | test_name | test_class
-- | -- | --
test_matmul_cuda | test_cublas_addmm_size_10000_cuda_bfloat16 | (__main__.TestMatmulCudaCUDA)
test_matmul_cuda | test_cublas_addmm_size_10000_cuda_float16 | (__main__.TestMatmulCudaCUDA)
test_matmul_cuda | test_cublas_addmm_size_10000_cuda_float32 | (__main__.TestMatmulCudaCUDA)
test_matmul_cuda | test_cublas_addmm_size_1000_cuda_bfloat16 | (__main__.TestMatmulCudaCUDA)
test_matmul_cuda | test_cublas_addmm_size_1000_cuda_float16 | (__main__.TestMatmulCudaCUDA)
test_matmul_cuda | test_cublas_addmm_size_1000_cuda_float32 | (__main__.TestMatmulCudaCUDA)
test_matmul_cuda | test_cublas_addmm_size_100_cuda_bfloat16 | (__main__.TestMatmulCudaCUDA)
test_matmul_cuda | test_cublas_addmm_size_100_cuda_float16 | (__main__.TestMatmulCudaCUDA)
test_matmul_cuda | test_cublas_addmm_size_100_cuda_float32 | (__main__.TestMatmulCudaCUDA)

This PR is the same fix as https://github.com/pytorch/pytorch/pull/88888. Creating this new PR to sanitize the history.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98797
Approved by: https://github.com/pruthvistony, https://github.com/jithunnair-amd, https://github.com/malfet
2023-04-13 19:36:07 +00:00
puririshi98
8aa34602f7 Jetson Update for CI Redo (#94549)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94549
Approved by: https://github.com/ezyang, https://github.com/malfet
2023-02-21 17:13:38 +00:00
Eddie Yan
0bf7506051 [CUDA] Drop CUDA < 11.0 test flags (#92605)
Follow-up of #89582 to drop flags like `CUDA11OrLater` in tests. Note that in some places `TEST_WITH_ROCM` appears to be _implicitly_ guarded against via the `CUDA11OrLater` version check, based on my best guess of how `torch.version.cuda` would behave in ROCm builds, so I've added `not TEST_WITH_ROCM` in cases where ROCm wasn't previously explicitly allowed.

CC @ptrblck @malfet @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92605
Approved by: https://github.com/ngimel
2023-01-24 04:34:06 +00:00
Eddie Yan
1af40d5108 [cublas][cublasLt] Fall back to unfused addmm for 2-byte-aligned inputs (#92201)
Fix for this issue surfaced from the discuss forum: https://discuss.pytorch.org/t/cuda-error-cublas-status-not-supported-when-calling-cublasltmatmul-from-torch-nn-functional-linear/170214

Note that PyTorch builds before #71200 should not be affected as there was no `cublasLt` dispatch path. Additionally, the provided repro has the quirk of using a 3D input, which means it will not dispatch to `cublasLt`-backed `addmm` until builds that include #72728. Changing the input to 2D by trivially removing the size `1` dimension will surface the failure on builds after #71200.

Interestingly, the use case where _all_ inputs are 2-byte aligned is supported (it runs without crashing), but the case where some inputs are more than 2-byte aligned and some are exactly 2-byte aligned is not. This behavior suggests that the `cuBLASLt` heuristics are incorrect, as the heuristic function has visibility of the raw pointer values via the descriptors when it is called.

We will follow up with `cuBlasLt` but this fix is needed to prevent unnecessary crashes for now.

CC @ptrblck @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92201
Approved by: https://github.com/ngimel
2023-01-21 00:32:02 +00:00
Eddie Yan
8c0289a61c [CUDA][CUBLAS][BFloat16] Tenatively disable reduced precision reductions for some matmul tests (#92599)
We've observed some failures in numerical checks on newer compute capabilities stemming from cuBLAS allowing reduced precision reductions.

CC @ptrblck @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92599
Approved by: https://github.com/ngimel
2023-01-20 22:19:11 +00:00
Fang Wang
160118d72a Add test case for matrix multiply-add with large inputs (#85550)
Summary:
- Added test case for addmm, baddbmm and linear with large inputs
- Testing with torch types: float32, float16, bfloat16

Test Plan:
Run unit tests with:
`buck2 run mode/opt //caffe2/test:linalg_re_cuda`

```
...
test_addmm_baddbmm_large_input_1_10000_10000_10000_cpu_bfloat16 (test_linalg_re_cuda.TestLinalgReCudaCPU) ... skipped 'Only runs on cuda'
test_addmm_baddbmm_large_input_1_10000_10000_10000_cpu_float16 (test_linalg_re_cuda.TestLinalgReCudaCPU) ... skipped 'Only runs on cuda'
test_addmm_baddbmm_large_input_1_10000_10000_10000_cpu_float32 (test_linalg_re_cuda.TestLinalgReCudaCPU) ... skipped 'Only runs on cuda'
test_addmm_baddbmm_large_input_1_10000_1000_10000_cpu_bfloat16 (test_linalg_re_cuda.TestLinalgReCudaCPU) ... skipped 'Only runs on cuda'
test_addmm_baddbmm_large_input_1_10000_1000_10000_cpu_float16 (test_linalg_re_cuda.TestLinalgReCudaCPU) ... skipped 'Only runs on cuda'
test_addmm_baddbmm_large_input_1_10000_1000_10000_cpu_float32 (test_linalg_re_cuda.TestLinalgReCudaCPU) ... skipped 'Only runs on cuda'
test_addmm_baddbmm_large_input_2_1000_1000_1000_cpu_bfloat16 (test_linalg_re_cuda.TestLinalgReCudaCPU) ... skipped 'Only runs on cuda'
test_addmm_baddbmm_large_input_2_1000_1000_1000_cpu_float16 (test_linalg_re_cuda.TestLinalgReCudaCPU) ... skipped 'Only runs on cuda'
test_addmm_baddbmm_large_input_2_1000_1000_1000_cpu_float32 (test_linalg_re_cuda.TestLinalgReCudaCPU) ... skipped 'Only runs on cuda'
test_addmm_baddbmm_large_input_2_100_100_100_cpu_bfloat16 (test_linalg_re_cuda.TestLinalgReCudaCPU) ... skipped 'Only runs on cuda'
test_addmm_baddbmm_large_input_2_100_100_100_cpu_float16 (test_linalg_re_cuda.TestLinalgReCudaCPU) ... skipped 'Only runs on cuda'
test_addmm_baddbmm_large_input_2_100_100_100_cpu_float32 (test_linalg_re_cuda.TestLinalgReCudaCPU) ... skipped 'Only runs on cuda'
test_addmm_baddbmm_large_input_1_10000_10000_10000_cuda_bfloat16 (test_linalg_re_cuda.TestLinalgReCudaCUDA) ... ok
test_addmm_baddbmm_large_input_1_10000_10000_10000_cuda_float16 (test_linalg_re_cuda.TestLinalgReCudaCUDA) ... ok
test_addmm_baddbmm_large_input_1_10000_10000_10000_cuda_float32 (test_linalg_re_cuda.TestLinalgReCudaCUDA) ... ok
test_addmm_baddbmm_large_input_1_10000_1000_10000_cuda_bfloat16 (test_linalg_re_cuda.TestLinalgReCudaCUDA) ... ok
test_addmm_baddbmm_large_input_1_10000_1000_10000_cuda_float16 (test_linalg_re_cuda.TestLinalgReCudaCUDA) ... ok
test_addmm_baddbmm_large_input_1_10000_1000_10000_cuda_float32 (test_linalg_re_cuda.TestLinalgReCudaCUDA) ... ok
test_addmm_baddbmm_large_input_2_1000_1000_1000_cuda_bfloat16 (test_linalg_re_cuda.TestLinalgReCudaCUDA) ... ok
test_addmm_baddbmm_large_input_2_1000_1000_1000_cuda_float16 (test_linalg_re_cuda.TestLinalgReCudaCUDA) ... ok
test_addmm_baddbmm_large_input_2_1000_1000_1000_cuda_float32 (test_linalg_re_cuda.TestLinalgReCudaCUDA) ... ok
test_addmm_baddbmm_large_input_2_100_100_100_cuda_bfloat16 (test_linalg_re_cuda.TestLinalgReCudaCUDA) ... ok
test_addmm_baddbmm_large_input_2_100_100_100_cuda_float16 (test_linalg_re_cuda.TestLinalgReCudaCUDA) ... ok
test_addmm_baddbmm_large_input_2_100_100_100_cuda_float32 (test_linalg_re_cuda.TestLinalgReCudaCUDA) ... ok

----------------------------------------------------------------------
Ran 24 tests in 63.224s

OK (skipped=12)
```

Differential Revision: D39718256

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85550
Approved by: https://github.com/IvanYashchuk, https://github.com/malfet
2022-10-11 17:52:21 +00:00