Fixes #68972
Relands #107246
To avoid causing Meta-internal CI failures, this PR does not always assert that the default dtype is float in the `TestCase.setUp`/`tearDown` methods. Instead, the assert is only performed when `TestCase._default_dtype_check_enabled == True`; `_default_dtype_check_enabled` is set to `True` in the `if __name__ == "__main__":` blocks of all the relevant test files that required changes for this issue.
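For illustration, a minimal sketch of the gating described above (the `TestCase` here stands in for `torch.testing._internal.common_utils.TestCase`; the exact attribute handling is assumed):
```python
import unittest

import torch


class TestCase(unittest.TestCase):  # stand-in for the common_utils TestCase
    _default_dtype_check_enabled = False  # opt-in flag, off by default

    def setUp(self):
        if self._default_dtype_check_enabled:
            assert torch.get_default_dtype() == torch.float

    def tearDown(self):
        if self._default_dtype_check_enabled:
            assert torch.get_default_dtype() == torch.float


if __name__ == "__main__":
    # Relevant test files opt in to the check before running their tests.
    TestCase._default_dtype_check_enabled = True
    unittest.main()
```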
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108088
Approved by: https://github.com/ezyang
Summary:
Based on D48377631, with updates to guard the use of cuBLAS features only available after CUDA 11.8.
According to https://docs.nvidia.com/cuda/cublas/#id99, only FP8 matrix types can be scaled, and the result of a `Float8_e4m3` x `Float8_e4m3` multiplication can be returned as `Float8_e4m3` or upcast to `Half`, `BFloat16`, or `Float`; when the result is upcast, however, `result_scale` has no effect and `amax` is not computed.
An optional `bias` argument, a vector of either `Half` or `BFloat16` values, can also be passed; it is added to each row of the result matrix.
See the table below for the supported input and output types:
| Mat1 type | Mat2 type | Bias type | Output types |
| ----------- | ----------- | ----------- | ----------- |
| Float8_e4m3 | Float8_e4m3 | Float16 | Float8_e4m3, Float16 |
| Float8_e4m3 | Float8_e4m3 | BFloat16 | Float8_e4m3, BFloat16, Float |
| Float8_e5m2 | Float8_e4m3 | Float16 | Float8_e4m3, Float8_e5m2, Float16 |
| Float8_e5m2 | Float8_e4m3 | BFloat16 | Float8_e4m3, Float8_e5m2, BFloat16, Float |
| Float8_e4m3 | Float8_e5m2 | Float16 | Float8_e4m3, Float8_e5m2, Float16 |
| Float8_e4m3 | Float8_e5m2 | BFloat16 | Float8_e4m3, Float8_e5m2, BFloat16, Float |
| Float8_e5m2 | Float8_e5m2 | Not supported | Not supported |
The decomposition implementation is skipped until the fp8-on-Triton story is better defined. A potential decomposition could look something like the following:
```python
# Sketch only: imports assume the standard decomposition utilities.
from typing import Optional, Tuple

import torch
from torch import Tensor
from torch._decomp import register_decomposition

aten = torch.ops.aten


@register_decomposition(aten._scaled_mm)
def _scaled_mm(
    mat1: Tensor,
    mat2: Tensor,
    *,
    dtype: Optional[torch.dtype] = None,
    scale_a: Optional[Tensor] = None,
    scale_b: Optional[Tensor] = None,
    scale_result: Optional[Tensor] = None,
) -> Tuple[Tensor, Tensor]:
    # Compute in fp32, apply the optional scales, then cast to the requested dtype.
    rc = torch.mm(mat1.to(torch.float32), mat2.to(torch.float32))
    rc = scale_a * rc if scale_a is not None else rc
    rc = scale_b * rc if scale_b is not None else rc
    rc = scale_result * rc if scale_result is not None else rc
    rc = rc.to(dtype if dtype is not None else mat1.dtype)
    # The second return value stands in for amax, which this sketch does not compute.
    return rc, torch.tensor(0.0, device=mat1.device)
```
Known limitations:
- Only works for matrix sizes divisible by 16
- 1st operand must be in row-major and 2nd in column-major order (i.e., if `x` and `y` are contiguous, only `torch._scaled_mm(x, y.t())` will work)
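For reference, a hedged usage sketch that respects the constraints above; the exact operator signature and return values are assumed from the decomposition sketch and may differ from the final API:
```python
import torch

# Dimensions divisible by 16, as required.
x = torch.randn(32, 64, device="cuda", dtype=torch.float16).to(torch.float8_e4m3fn)
y = torch.randn(16, 64, device="cuda", dtype=torch.float16).to(torch.float8_e4m3fn)

# x is row-major; y.t() provides the required column-major second operand.
out, amax = torch._scaled_mm(x, y.t())  # signature assumed; see the decomposition sketch above
print(out.shape, out.dtype)  # e.g. torch.Size([32, 16]) with an FP8 dtype
```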
Test Plan: Tests in test_matmul_cuda.py
Differential Revision: D48415871
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107341
Approved by: https://github.com/vkuzo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106844
Approved by: https://github.com/albanD
ghstack dependencies: #106977
Fixes the underlying issue previously addressed in #92201 by specifying minimum alignments explicitly to `cuBLAS` rather than relying on a handcrafted rule. ~~We're still investigating some potential failure modes on `sm80` and `sm90`, but those would be real `cuBLASLt` heuristics bugs rather than being caused by under-specifying constraints to the heuristics.~~
According to the `cuBLAS` docs, the default alignment is 256 bytes, so that is the maximum currently being checked: https://docs.nvidia.com/cuda/cublas/
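For illustration, a minimal Python sketch of the alignment rule being described (the actual change lives in the C++ cuBLASLt bindings; the helper name here is hypothetical):
```python
def _get_alignment(address: int, max_alignment: int = 256) -> int:
    # Largest power-of-two divisor of the data pointer, capped at 256 bytes
    # (the cuBLAS default alignment); this value would be reported to the
    # cuBLASLt heuristics as the minimum guaranteed alignment of an operand.
    alignment = 1
    while alignment < max_alignment and address % (alignment * 2) == 0:
        alignment *= 2
    return alignment


# Example: an address aligned to 64 bytes but not to 128.
assert _get_alignment(0x7F0000000040) == 64
```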
CC @ptrblck @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98975
Approved by: https://github.com/ngimel
Follow-up of #89582 to drop flags like `CUDA11OrLater` in tests. Note that in some places `TEST_WITH_ROCM` appears to be _implicitly_ guarded against via the `CUDA11OrLater` version check (based on my best guess of how `torch.version.cuda` behaves in ROCm builds), so I've added `not TEST_WITH_ROCM` in cases where ROCm wasn't previously explicitly allowed.
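For illustration, a hypothetical before/after of the test-gating pattern described above (the test names are made up; `TEST_WITH_ROCM` comes from `torch.testing._internal.common_utils`):
```python
import unittest

from torch.testing._internal.common_utils import TEST_WITH_ROCM

# Before: gated on CUDA >= 11, which also implicitly excluded ROCm builds.
#   @unittest.skipIf(not CUDA11OrLater, "requires CUDA >= 11")
# After: the version flag is dropped and ROCm is excluded explicitly.
@unittest.skipIf(TEST_WITH_ROCM, "not supported on ROCm")
class TestSomeCublasFeature(unittest.TestCase):  # hypothetical test case
    def test_feature(self):
        ...
```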
CC @ptrblck @malfet @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92605
Approved by: https://github.com/ngimel
Fix for an issue surfaced on the PyTorch discussion forum: https://discuss.pytorch.org/t/cuda-error-cublas-status-not-supported-when-calling-cublasltmatmul-from-torch-nn-functional-linear/170214
Note that PyTorch builds before #71200 should not be affected as there was no `cublasLt` dispatch path. Additionally, the provided repro has the quirk of using a 3D input, which means it will not dispatch to `cublasLt`-backed `addmm` until builds that include #72728. Changing the input to 2D by trivially removing the size `1` dimension will surface the failure on builds after #71200.
Interestingly, the case where _all_ inputs are 2-byte aligned is supported (runs without crashing), but the case where some inputs are more than 2-byte aligned and others are exactly 2-byte aligned is not. This behavior suggests that the `cuBLASLt` heuristics are incorrect, as the heuristic function has visibility into the raw pointer values via the descriptors when it is called.
We will follow up with the `cuBLASLt` team, but this fix is needed for now to prevent unnecessary crashes.
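To make the alignment distinction concrete, an illustrative snippet (not a guaranteed repro of the crash) showing how a storage offset can leave an fp16 tensor only 2-byte aligned:
```python
import torch

t = torch.randn(64, 65, dtype=torch.half, device="cuda")
full = t[:, :64]    # shares the freshly allocated storage: well aligned
offset = t[:, 1:]   # storage offset of one fp16 element -> 2-byte alignment

print(full.data_ptr() % 16)    # typically 0
print(offset.data_ptr() % 16)  # 2
```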
CC @ptrblck @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92201
Approved by: https://github.com/ngimel