Commit Graph

121 Commits

drisspg
a469aca1cc Exposes a fast_fp8_accum option to _scaled_mm (#111847)
# Summary
Adds the option to use fast_accumulation_mode for the fp8 matmul in scaled_mm

Information can be found here: https://docs.nvidia.com/cuda/cublas/#cublasltmatmuldescattributes-t
Defaults to 0 (off).
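A hedged usage sketch, assuming the option is surfaced as a `use_fast_accum` keyword argument on `torch._scaled_mm` (the exact name, the required scale arguments, and the return value have varied across releases; an FP8-capable GPU is required):

```python
import torch

# Minimal sketch, not the exact API of any single release: the fast-accumulation
# mode is assumed to be exposed as the use_fast_accum keyword argument.
x = torch.randn(32, 64, device="cuda").to(torch.float8_e4m3fn)  # row-major
w = torch.randn(16, 64, device="cuda").to(torch.float8_e4m3fn)  # used transposed below
result = torch._scaled_mm(x, w.t(), use_fast_accum=True)
# In this PR's version the call returns (out, amax); newer releases return a
# single tensor and additionally require scale_a/scale_b tensors.
```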

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111847
Approved by: https://github.com/ipiszy, https://github.com/malfet
2023-10-24 03:26:53 +00:00
Christian Puhrsch
3553eb9b89 Add CUTLASS-based support for mixed dtypes matrix multiplication (#110981)
Resubmission without ghstack to make it easier to import https://github.com/pytorch/pytorch/pull/110934/commits

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110981
Approved by: https://github.com/drisspg
2023-10-11 21:47:52 +00:00
drisspg
09a17c512d Add better error messaging to scaled_mm (#108454)
Fixes #108411

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108454
Approved by: https://github.com/vkuzo
2023-09-07 21:26:47 +00:00
Kurt Mohler
3f88e3105f Reland: Remove remaining global set_default_dtype calls from tests (#108088)
Fixes #68972

Relands #107246

To avoid causing Meta-internal CI failures, this PR avoids always asserting that the default dtype is float in the `TestCase.setUp/tearDown` methods. Instead, the assert is only done if `TestCase._default_dtype_check_enabled == True`; `_default_dtype_check_enabled` is set to True in the `if __name__ == "__main__":` blocks of all the relevant test files that required changes for this issue.
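A minimal sketch of that gating, with simplified names; the real `TestCase` lives in `torch.testing._internal.common_utils` and has far more machinery:

```python
import torch

class TestCase:
    # Off by default so importing a test file (e.g. in Meta-internal CI) does
    # not trigger the default-dtype assertion.
    _default_dtype_check_enabled = False

    def setUp(self):
        if self._default_dtype_check_enabled:
            assert torch.get_default_dtype() == torch.float

    def tearDown(self):
        if self._default_dtype_check_enabled:
            assert torch.get_default_dtype() == torch.float

if __name__ == "__main__":
    # Each relevant test file opts in from its __main__ block.
    TestCase._default_dtype_check_enabled = True
```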

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108088
Approved by: https://github.com/ezyang
2023-09-07 03:04:34 +00:00
drisspg
d5ff8ca4ef Relax divisibility by 16 for leading dimension of mat1 in scaled_gemm (#108308)
# Summary
cuBLASLt requires that the matrices be 16-byte aligned. If `mat1.size(-1) % 16 == 0` and the matrix is row-major, then the leading dimension can be any value. See this comment: https://github.com/pytorch/pytorch/pull/107341#discussion_r1310934737
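A minimal Python sketch of the relaxed check (illustrative only; the real logic is in C++, and `mat1_layout_ok` is a hypothetical helper):

```python
import torch

def mat1_layout_ok(mat1: torch.Tensor) -> bool:
    # For a row-major mat1, the leading dimension (stride(0)) no longer has to
    # be a multiple of 16 as long as the innermost dimension is.
    row_major = mat1.stride(1) == 1
    return row_major and mat1.size(1) % 16 == 0

base = torch.empty(16, 24, dtype=torch.float8_e4m3fn)
mat1 = base[:, :16]          # shape (16, 16), strides (24, 1): row-major, ld = 24
print(mat1_layout_ok(mat1))  # True under the relaxed rule
```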

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108308
Approved by: https://github.com/eqy, https://github.com/vkuzo
2023-08-31 20:31:47 +00:00
drisspg
00eed6f367 Better Error Message for invalid Out_dtype + Bias for scaled_mm (#108097)
# Summary
Fixes an error case that was directly throwing a cuBLASLt error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108097
Approved by: https://github.com/vkuzo
2023-08-29 04:10:17 +00:00
PyTorch MergeBot
161ea463e6 Revert "Remove remaining global set_default_dtype calls from tests (#107246)"
This reverts commit aa8ea1d787.

Reverted https://github.com/pytorch/pytorch/pull/107246 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/107246#issuecomment-1693838522))
2023-08-25 19:34:55 +00:00
Kurt Mohler
aa8ea1d787 Remove remaining global set_default_dtype calls from tests (#107246)
Fixes #68972

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107246
Approved by: https://github.com/ezyang
2023-08-24 16:10:48 +00:00
drisspg
c093fdf924 Fix wrong hardcoded value for _scaled_mm (#107719)
## Summary
Sneaky little bug where we were accidentally fusing ReLU into the epilogue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107719
Approved by: https://github.com/vkuzo
2023-08-22 21:52:20 +00:00
Driss Guessous
8ccfd801be Introduce CUDA-only _scaled_mm op (#107341)
Summary:
Based on D48377631, with updates to guard the use of cuBLAS features only available after 11.8.

According to https://docs.nvidia.com/cuda/cublas/#id99, only FP8 matrix types can be scaled, and `Float8_e4m3`x`Float8_e4m3` results can be returned as the `Float8_e4m3` type or upcast to `Half`, `BFloat16`, or `Float`; in the upcast case `result_scale` has no effect and `amax` is not computed.
An optional `bias` argument can also be passed to the function; it should be a vector of either `Half` or `BFloat16` whose values are added to each row of the result matrix.

See table below for supported input and output types:
| Mat1 type  | Mat2 type | Bias type | Output types |
| ----------- | ----------- | ----------- | ----------- |
| Float8_e4m3  | Float8_e4m3  | Float16  | Float8_e4m3, Float16 |
| Float8_e4m3  | Float8_e4m3  | BFloat16 | Float8_e4m3, BFloat16, Float |
| Float8_e5m2  | Float8_e4m3  | Float16 |  Float8_e4m3, Float8_e5m2, Float16  |
| Float8_e5m2  | Float8_e4m3  | BFloat16 |  Float8_e4m3, Float8_e5m2, BFloat16, Float |
| Float8_e4m3  | Float8_e5m2  | Float16 |  Float8_e4m3, Float8_e5m2, Float16 |
| Float8_e4m3  | Float8_e5m2  | BFloat16 |  Float8_e4m3, Float8_e5m2,  BFloat16, Float |
| Float8_e4m3  | Float8_e5m2  | Not supported | Not supported |

The decomposition implementation is skipped until the fp8-on-Triton story is better defined. A potential decomposition could look something like the following:
```python
# imports added for a self-contained sketch
from typing import Optional, Tuple

import torch
from torch import Tensor
from torch._decomp import register_decomposition

aten = torch.ops.aten


@register_decomposition(aten._scaled_mm)
def _scaled_mm(
    mat1: Tensor,
    mat2: Tensor,
    *,
    dtype: Optional[torch.dtype] = None,
    scale_a: Optional[Tensor] = None,
    scale_b: Optional[Tensor] = None,
    scale_result: Optional[Tensor] = None,
) -> Tuple[Tensor, Tensor]:
    rc = torch.mm(mat1.to(torch.float32), mat2.to(torch.float32))
    rc = scale_a * rc if scale_a is not None else rc
    rc = scale_b * rc if scale_b is not None else rc
    rc = scale_result * rc if scale_result is not None else rc
    rc = rc.to(dtype if dtype is not None else mat1.dtype)
    return rc, torch.tensor(0.0, device=mat1.device)
```

Known limitations:
  - Only works for matrix sizes divisible by 16
  - 1st operand must be row-major and the 2nd column-major (i.e. if `x` and `y` are contiguous, then only `torch._scaled_mm(x, y.t())` will work); see the usage sketch below
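A hedged usage sketch of the layout requirement above (the Python signature shown is the one from this PR and has since changed — later releases require `scale_a`/`scale_b` and return a single tensor; an FP8-capable GPU is required):

```python
import torch

# Both matrix dimensions are multiples of 16, per the first limitation.
x = torch.randn(32, 64, device="cuda").to(torch.float8_e4m3fn)  # contiguous -> row-major
y = torch.randn(16, 64, device="cuda").to(torch.float8_e4m3fn)
# y.t() is column-major, satisfying the second limitation.
out, amax = torch._scaled_mm(x, y.t())
```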

Test Plan: Tests in test_matmul_cuda.py

Differential Revision: D48415871

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107341
Approved by: https://github.com/vkuzo
2023-08-17 21:24:43 +00:00
PyTorch MergeBot
1af324b560 Revert "Introduce CUDA-only _scaled_mm op (#106844)"
This reverts commit 9440a8cbec.

Reverted https://github.com/pytorch/pytorch/pull/106844 on behalf of https://github.com/izaitsevfb due to Breaks internal builds ([comment](https://github.com/pytorch/pytorch/pull/106844#issuecomment-1679858327))
2023-08-16 02:05:29 +00:00
Nikita Shulga
9440a8cbec Introduce CUDA-only _scaled_mm op (#106844)
According to https://docs.nvidia.com/cuda/cublas/#id99, only FP8 matrix types can be scaled, and `Float8_e4m3`x`Float8_e4m3` results can be returned as the `Float8_e4m3` type or upcast to `Half`, `BFloat16`, or `Float`; in the upcast case `result_scale` has no effect and `amax` is not computed.
An optional `bias` argument can also be passed to the function; it should be a vector of either `Half` or `BFloat16` whose values are added to each row of the result matrix.

See table below for supported input and output types:
| Mat1 type  | Mat2 type | Bias type | Output types |
| ----------- | ----------- | ----------- | ----------- |
| Float8_e4m3  | Float8_e4m3  | Float16  | Float8_e4m3, Float16 |
| Float8_e4m3  | Float8_e4m3  | BFloat16 | Float8_e4m3, BFloat16, Float |
| Float8_e5m2  | Float8_e4m3  | Float16 |  Float8_e4m3, Float8_e5m2, Float16  |
| Float8_e5m2  | Float8_e4m3  | BFloat16 |  Float8_e4m3, Float8_e5m2, BFloat16, Float |
| Float8_e4m3  | Float8_e5m2  | Float16 |  Float8_e4m3, Float8_e5m2, Float16 |
| Float8_e4m3  | Float8_e5m2  | BFloat16 |  Float8_e4m3, Float8_e5m2,  BFloat16, Float |
| Float8_e4m3  | Float8_e5m2  | Not supported | Not supported |

The decomposition implementation is skipped until the fp8-on-Triton story is better defined. A potential decomposition could look something like the following:
```python
# imports added for a self-contained sketch
from typing import Optional, Tuple

import torch
from torch import Tensor
from torch._decomp import register_decomposition

aten = torch.ops.aten


@register_decomposition(aten._scaled_mm)
def _scaled_mm(
    mat1: Tensor,
    mat2: Tensor,
    *,
    dtype: Optional[torch.dtype] = None,
    scale_a: Optional[Tensor] = None,
    scale_b: Optional[Tensor] = None,
    scale_result: Optional[Tensor] = None,
) -> Tuple[Tensor, Tensor]:
    rc = torch.mm(mat1.to(torch.float32), mat2.to(torch.float32))
    rc = scale_a * rc if scale_a is not None else rc
    rc = scale_b * rc if scale_b is not None else rc
    rc = scale_result * rc if scale_result is not None else rc
    rc = rc.to(dtype if dtype is not None else mat1.dtype)
    return rc, torch.tensor(0.0, device=mat1.device)
```

Known limitations:
  - Only works for matrix sizes divisible by 16
  - 1st operand must be row-major and the 2nd column-major (i.e. if `x` and `y` are contiguous, then only `torch._scaled_mm(x, y.t())` will work)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106844
Approved by: https://github.com/albanD
ghstack dependencies: #106977
2023-08-15 02:59:41 +00:00
rraminen
239578beff [ROCm] Enable a few bfloat16 unit tests (#105177)
Currently a few unit tests from **test_matmul_cuda** and **test_sparse_csr** test suites are being skipped on ROCm.

This PR enables the following unit tests on ROCm (~30 UTs):

test_cublas_baddbmm_large_input_* (__main__.TestMatmulCudaCUDA)
test_addmm_sizes_all_sparse_csr* (__main__.TestSparseCSRCUDA) when m==0 or n==0 or k==0

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105177
Approved by: https://github.com/pruthvistony, https://github.com/jithunnair-amd, https://github.com/malfet
2023-08-03 21:17:19 +00:00
Justin Chu
73e1455327 [BE] Enable ruff's UP rules and autoformat test/ (#105434)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105434
Approved by: https://github.com/albanD
2023-07-19 20:36:06 +00:00
Eddie Yan
5c5ad53517 [CUBLAS] Specify alignment for cuBlasLt addmm (#98975)
Fixes the underlying issue previously addressed in #92201 by specifying minimum alignments explicitly to `cuBLAS` rather than relying on a handcrafted rule. ~~We're still investigating some potential failure modes on `sm80` and `sm90` but those would be real `cuBlasLt` heuristics bugs rather than being caused by underspecifying constraints to the heuristics.~~

According to the `cuBLAS` docs the default alignment is 256 bytes, so that is the maximum currently being checked: https://docs.nvidia.com/cuda/cublas/
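A minimal Python sketch of the idea (the actual logic is C++ inside ATen; `alignment_bytes` is a hypothetical helper):

```python
import torch

def alignment_bytes(t: torch.Tensor, cap: int = 256) -> int:
    # Largest power-of-two alignment of the data pointer, capped at cuBLAS's
    # 256-byte default, so the heuristics see the real constraint instead of a
    # handcrafted rule.
    addr = t.data_ptr()
    return cap if addr == 0 else min(addr & -addr, cap)
```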

CC @ptrblck @ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98975
Approved by: https://github.com/ngimel
2023-04-18 06:19:30 +00:00
blorange-amd
079452ea0f Enable test_matmul_cuda UTs for ROCm (#98797)
test_file | test_name | test_class
-- | -- | --
test_matmul_cuda | test_cublas_addmm_size_10000_cuda_bfloat16 | (__main__.TestMatmulCudaCUDA)
test_matmul_cuda | test_cublas_addmm_size_10000_cuda_float16 | (__main__.TestMatmulCudaCUDA)
test_matmul_cuda | test_cublas_addmm_size_10000_cuda_float32 | (__main__.TestMatmulCudaCUDA)
test_matmul_cuda | test_cublas_addmm_size_1000_cuda_bfloat16 | (__main__.TestMatmulCudaCUDA)
test_matmul_cuda | test_cublas_addmm_size_1000_cuda_float16 | (__main__.TestMatmulCudaCUDA)
test_matmul_cuda | test_cublas_addmm_size_1000_cuda_float32 | (__main__.TestMatmulCudaCUDA)
test_matmul_cuda | test_cublas_addmm_size_100_cuda_bfloat16 | (__main__.TestMatmulCudaCUDA)
test_matmul_cuda | test_cublas_addmm_size_100_cuda_float16 | (__main__.TestMatmulCudaCUDA)
test_matmul_cuda | test_cublas_addmm_size_100_cuda_float32 | (__main__.TestMatmulCudaCUDA)

This PR is the same fix as https://github.com/pytorch/pytorch/pull/88888. Creating this new PR to sanitize the history.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98797
Approved by: https://github.com/pruthvistony, https://github.com/jithunnair-amd, https://github.com/malfet
2023-04-13 19:36:07 +00:00
puririshi98
8aa34602f7 Jetson Update for CI Redo (#94549)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94549
Approved by: https://github.com/ezyang, https://github.com/malfet
2023-02-21 17:13:38 +00:00
Eddie Yan
0bf7506051 [CUDA] Drop CUDA < 11.0 test flags (#92605)
Follow-up of #89582 to drop flags like `CUDA11OrLater` in tests. Note that in some places it appears that `TEST_WITH_ROCM` is _implicitly_ guarded against via the `CUDA11OrLater` version check, based on my best guess of how `torch.version.cuda` behaves in ROCm builds, so I've added `not TEST_WITH_ROCM` in cases where ROCm wasn't previously explicitly allowed.
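A hedged illustration of that implicit guard (a simplified stand-in for the dropped `CUDA11OrLater` flag, not the actual test-framework code):

```python
import torch
from torch.testing._internal.common_utils import TEST_WITH_ROCM

def cuda_at_least(major: int, minor: int = 0) -> bool:
    # On ROCm builds torch.version.cuda is None (torch.version.hip is set
    # instead), so a version flag like this quietly excluded ROCm as well.
    if torch.version.cuda is None:
        return False
    ver = tuple(int(p) for p in torch.version.cuda.split(".")[:2])
    return ver >= (major, minor)

# Once the version flag is dropped, the ROCm exclusion must be spelled out:
run_test = cuda_at_least(11) and not TEST_WITH_ROCM
```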

CC @ptrblck @malfet @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92605
Approved by: https://github.com/ngimel
2023-01-24 04:34:06 +00:00
Eddie Yan
1af40d5108 [cublas][cublasLt] Fall back to unfused addmm for 2-byte-aligned inputs (#92201)
Fix for this issue surfaced from the discuss forum: https://discuss.pytorch.org/t/cuda-error-cublas-status-not-supported-when-calling-cublasltmatmul-from-torch-nn-functional-linear/170214

Note that PyTorch builds before #71200 should not be affected as there was no `cublasLt` dispatch path. Additionally, the provided repro has the quirk of using a 3D input, which means it will not dispatch to `cublasLt`-backed `addmm` until builds that include #72728. Changing the input to 2D by trivially removing the size `1` dimension will surface the failure on builds after #71200.

Interestingly, the use case where _all_ inputs are 2-byte aligned is supported (runs without crashing), but the case where some inputs are more than 2-byte aligned and some are exactly 2-byte aligned is not. This behavior suggests that the `cuBLASLt` heuristics are incorrect, as the heuristic function has visibility of the raw pointer values via the descriptors when it is called.

We will follow up with the `cuBLASLt` team, but this fix is needed to prevent unnecessary crashes for now.
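A hedged sketch of how a minimally aligned fp16 input can arise in practice (illustrative shapes; whether a particular call reaches the `cublasLt` path depends on build and dispatch details):

```python
import torch

# Slicing at an odd element offset leaves the fp16 data pointer only 2-byte
# aligned, even though the base allocation is well aligned.
base = torch.randn(65, device="cuda", dtype=torch.float16)
x = base[1:].view(8, 8)                    # data_ptr() offset by 2 bytes
linear = torch.nn.Linear(8, 8, device="cuda", dtype=torch.float16)
y = linear(x)  # with this fix, falls back to unfused addmm instead of erroring
```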

CC @ptrblck @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92201
Approved by: https://github.com/ngimel
2023-01-21 00:32:02 +00:00
Eddie Yan
8c0289a61c [CUDA][CUBLAS][BFloat16] Tentatively disable reduced precision reductions for some matmul tests (#92599)
We've observed some failures in numerical checks on newer compute capabilities stemming from cuBLAS allowing reduced precision reductions.
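A short sketch of the knobs involved (these flags exist on `torch.backends.cuda.matmul` in current PyTorch; whether the tests flip them or simply loosen tolerances is a detail of the commit itself):

```python
import torch

# Reduced-precision reductions let fp16/bf16 matmul results drift from a
# float32 reference on newer compute capabilities; turning them off tightens
# the comparison.
torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = False
torch.backends.cuda.matmul.allow_bf16_reduced_precision_reduction = False
```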

CC @ptrblck @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92599
Approved by: https://github.com/ngimel
2023-01-20 22:19:11 +00:00
Fang Wang
160118d72a Add test case for matrix multiply-add with large inputs (#85550)
Summary:
- Added test cases for addmm, baddbmm, and linear with large inputs (see the illustrative sketch below)
- Tested with torch dtypes: float32, float16, bfloat16
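An illustrative sketch of the kind of check added (not the exact test code; shapes and tolerances are placeholders):

```python
import torch

def check_large_addmm(m=1000, k=1000, n=1000, dtype=torch.float16):
    # Run addmm with large low-precision inputs and compare against a
    # float32 reference with loose tolerances.
    a = torch.randn(m, k, device="cuda", dtype=dtype)
    b = torch.randn(k, n, device="cuda", dtype=dtype)
    c = torch.randn(m, n, device="cuda", dtype=dtype)
    out = torch.addmm(c, a, b)
    ref = torch.addmm(c.float(), a.float(), b.float())
    torch.testing.assert_close(out.float(), ref, rtol=1e-2, atol=1e-2)

check_large_addmm()  # requires a CUDA device
```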

Test Plan:
Run unit tests with:
`buck2 run mode/opt //caffe2/test:linalg_re_cuda`

```
...
test_addmm_baddbmm_large_input_1_10000_10000_10000_cpu_bfloat16 (test_linalg_re_cuda.TestLinalgReCudaCPU) ... skipped 'Only runs on cuda'
test_addmm_baddbmm_large_input_1_10000_10000_10000_cpu_float16 (test_linalg_re_cuda.TestLinalgReCudaCPU) ... skipped 'Only runs on cuda'
test_addmm_baddbmm_large_input_1_10000_10000_10000_cpu_float32 (test_linalg_re_cuda.TestLinalgReCudaCPU) ... skipped 'Only runs on cuda'
test_addmm_baddbmm_large_input_1_10000_1000_10000_cpu_bfloat16 (test_linalg_re_cuda.TestLinalgReCudaCPU) ... skipped 'Only runs on cuda'
test_addmm_baddbmm_large_input_1_10000_1000_10000_cpu_float16 (test_linalg_re_cuda.TestLinalgReCudaCPU) ... skipped 'Only runs on cuda'
test_addmm_baddbmm_large_input_1_10000_1000_10000_cpu_float32 (test_linalg_re_cuda.TestLinalgReCudaCPU) ... skipped 'Only runs on cuda'
test_addmm_baddbmm_large_input_2_1000_1000_1000_cpu_bfloat16 (test_linalg_re_cuda.TestLinalgReCudaCPU) ... skipped 'Only runs on cuda'
test_addmm_baddbmm_large_input_2_1000_1000_1000_cpu_float16 (test_linalg_re_cuda.TestLinalgReCudaCPU) ... skipped 'Only runs on cuda'
test_addmm_baddbmm_large_input_2_1000_1000_1000_cpu_float32 (test_linalg_re_cuda.TestLinalgReCudaCPU) ... skipped 'Only runs on cuda'
test_addmm_baddbmm_large_input_2_100_100_100_cpu_bfloat16 (test_linalg_re_cuda.TestLinalgReCudaCPU) ... skipped 'Only runs on cuda'
test_addmm_baddbmm_large_input_2_100_100_100_cpu_float16 (test_linalg_re_cuda.TestLinalgReCudaCPU) ... skipped 'Only runs on cuda'
test_addmm_baddbmm_large_input_2_100_100_100_cpu_float32 (test_linalg_re_cuda.TestLinalgReCudaCPU) ... skipped 'Only runs on cuda'
test_addmm_baddbmm_large_input_1_10000_10000_10000_cuda_bfloat16 (test_linalg_re_cuda.TestLinalgReCudaCUDA) ... ok
test_addmm_baddbmm_large_input_1_10000_10000_10000_cuda_float16 (test_linalg_re_cuda.TestLinalgReCudaCUDA) ... ok
test_addmm_baddbmm_large_input_1_10000_10000_10000_cuda_float32 (test_linalg_re_cuda.TestLinalgReCudaCUDA) ... ok
test_addmm_baddbmm_large_input_1_10000_1000_10000_cuda_bfloat16 (test_linalg_re_cuda.TestLinalgReCudaCUDA) ... ok
test_addmm_baddbmm_large_input_1_10000_1000_10000_cuda_float16 (test_linalg_re_cuda.TestLinalgReCudaCUDA) ... ok
test_addmm_baddbmm_large_input_1_10000_1000_10000_cuda_float32 (test_linalg_re_cuda.TestLinalgReCudaCUDA) ... ok
test_addmm_baddbmm_large_input_2_1000_1000_1000_cuda_bfloat16 (test_linalg_re_cuda.TestLinalgReCudaCUDA) ... ok
test_addmm_baddbmm_large_input_2_1000_1000_1000_cuda_float16 (test_linalg_re_cuda.TestLinalgReCudaCUDA) ... ok
test_addmm_baddbmm_large_input_2_1000_1000_1000_cuda_float32 (test_linalg_re_cuda.TestLinalgReCudaCUDA) ... ok
test_addmm_baddbmm_large_input_2_100_100_100_cuda_bfloat16 (test_linalg_re_cuda.TestLinalgReCudaCUDA) ... ok
test_addmm_baddbmm_large_input_2_100_100_100_cuda_float16 (test_linalg_re_cuda.TestLinalgReCudaCUDA) ... ok
test_addmm_baddbmm_large_input_2_100_100_100_cuda_float32 (test_linalg_re_cuda.TestLinalgReCudaCUDA) ... ok

----------------------------------------------------------------------
Ran 24 tests in 63.224s

OK (skipped=12)
```

Differential Revision: D39718256

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85550
Approved by: https://github.com/IvanYashchuk, https://github.com/malfet
2022-10-11 17:52:21 +00:00