Fixes #68972
Relands #107246
To avoid causing Meta-internal CI failures, this PR does not always assert that the default dtype is float in the `TestCase.setUp`/`tearDown` methods. Instead, the assert is only performed when `TestCase._default_dtype_check_enabled == True`; `_default_dtype_check_enabled` is set to `True` in the `if __name__ == "__main__":` blocks of all the relevant test files that required changes for this issue.
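For illustration, a minimal sketch of the gating described above (the `TestCase` here stands in for `torch.testing._internal.common_utils.TestCase`; the exact attribute handling is assumed):
```python
import unittest

import torch


class TestCase(unittest.TestCase):  # stand-in for the common_utils TestCase
    _default_dtype_check_enabled = False  # opt-in flag, off by default

    def setUp(self):
        if self._default_dtype_check_enabled:
            assert torch.get_default_dtype() == torch.float

    def tearDown(self):
        if self._default_dtype_check_enabled:
            assert torch.get_default_dtype() == torch.float


if __name__ == "__main__":
    # Relevant test files opt in to the check before running their tests.
    TestCase._default_dtype_check_enabled = True
    unittest.main()
```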
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108088
Approved by: https://github.com/ezyang
Summary:
Based on D48377631, with updates to guard the use of cuBLAS features only available after CUDA 11.8.
According to https://docs.nvidia.com/cuda/cublas/#id99, only FP8 matrix types can be scaled, and the result of a `Float8_e4m3` x `Float8_e4m3` multiplication can be returned as `Float8_e4m3` or upcast to `Half`, `BFloat16`, or `Float`; when the result is upcast, however, `result_scale` has no effect and `amax` is not computed.
An optional `bias` argument, a vector of either `Half` or `BFloat16` values, can also be passed; it is added to each row of the result matrix.
See the table below for the supported input and output types:
| Mat1 type | Mat2 type | Bias type | Output types |
| ----------- | ----------- | ----------- | ----------- |
| Float8_e4m3 | Float8_e4m3 | Float16 | Float8_e4m3, Float16 |
| Float8_e4m3 | Float8_e4m3 | BFloat16 | Float8_e4m3, BFloat16, Float |
| Float8_e5m2 | Float8_e4m3 | Float16 | Float8_e4m3, Float8_e5m2, Float16 |
| Float8_e5m2 | Float8_e4m3 | BFloat16 | Float8_e4m3, Float8_e5m2, BFloat16, Float |
| Float8_e4m3 | Float8_e5m2 | Float16 | Float8_e4m3, Float8_e5m2, Float16 |
| Float8_e4m3 | Float8_e5m2 | BFloat16 | Float8_e4m3, Float8_e5m2, BFloat16, Float |
| Float8_e5m2 | Float8_e5m2 | Not supported | Not supported |
The decomposition implementation is skipped until the fp8-on-Triton story is better defined. A potential decomposition could look something like the following:
```python
# Sketch only: imports assume the standard decomposition utilities.
from typing import Optional, Tuple

import torch
from torch import Tensor
from torch._decomp import register_decomposition

aten = torch.ops.aten


@register_decomposition(aten._scaled_mm)
def _scaled_mm(
    mat1: Tensor,
    mat2: Tensor,
    *,
    dtype: Optional[torch.dtype] = None,
    scale_a: Optional[Tensor] = None,
    scale_b: Optional[Tensor] = None,
    scale_result: Optional[Tensor] = None,
) -> Tuple[Tensor, Tensor]:
    # Compute in fp32, apply the optional scales, then cast to the requested dtype.
    rc = torch.mm(mat1.to(torch.float32), mat2.to(torch.float32))
    rc = scale_a * rc if scale_a is not None else rc
    rc = scale_b * rc if scale_b is not None else rc
    rc = scale_result * rc if scale_result is not None else rc
    rc = rc.to(dtype if dtype is not None else mat1.dtype)
    # The second return value stands in for amax, which this sketch does not compute.
    return rc, torch.tensor(0.0, device=mat1.device)
```
Known limitations:
- Only works for matrix sizes divisible by 16
- 1st operand must be in row-major and 2nd in column-major order (i.e., if `x` and `y` are contiguous, only `torch._scaled_mm(x, y.t())` will work)
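For reference, a hedged usage sketch that respects the constraints above; the exact operator signature and return values are assumed from the decomposition sketch and may differ from the final API:
```python
import torch

# Dimensions divisible by 16, as required.
x = torch.randn(32, 64, device="cuda", dtype=torch.float16).to(torch.float8_e4m3fn)
y = torch.randn(16, 64, device="cuda", dtype=torch.float16).to(torch.float8_e4m3fn)

# x is row-major; y.t() provides the required column-major second operand.
out, amax = torch._scaled_mm(x, y.t())  # signature assumed; see the decomposition sketch above
print(out.shape, out.dtype)  # e.g. torch.Size([32, 16]) with an FP8 dtype
```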
Test Plan: Tests in test_matmul_cuda.py
Differential Revision: D48415871
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107341
Approved by: https://github.com/vkuzo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106844
Approved by: https://github.com/albanD
ghstack dependencies: #106977
Fixes the underlying issue previously addressed in #92201 by specifying minimum alignments explicitly to `cuBLAS` rather than relying on a handcrafted rule. ~~We're still investigating some potential failure modes on `sm80` and `sm90`, but those would be real `cuBLASLt` heuristics bugs rather than being caused by under-specifying constraints to the heuristics.~~
According to the `cuBLAS` docs, the default alignment is 256 bytes, so that is the maximum currently being checked: https://docs.nvidia.com/cuda/cublas/
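For illustration, a minimal Python sketch of the alignment rule being described (the actual change lives in the C++ cuBLASLt bindings; the helper name here is hypothetical):
```python
def _get_alignment(address: int, max_alignment: int = 256) -> int:
    # Largest power-of-two divisor of the data pointer, capped at 256 bytes
    # (the cuBLAS default alignment); this value would be reported to the
    # cuBLASLt heuristics as the minimum guaranteed alignment of an operand.
    alignment = 1
    while alignment < max_alignment and address % (alignment * 2) == 0:
        alignment *= 2
    return alignment


# Example: an address aligned to 64 bytes but not to 128.
assert _get_alignment(0x7F0000000040) == 64
```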
CC @ptrblck @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98975
Approved by: https://github.com/ngimel
Follow-up of #89582 to drop flags like `CUDA11OrLater` in tests. Note that in some places `TEST_WITH_ROCM` appears to be _implicitly_ guarded against via the `CUDA11OrLater` version check (based on my best guess of how `torch.version.cuda` behaves in ROCm builds), so I've added `not TEST_WITH_ROCM` in cases where ROCm wasn't previously explicitly allowed.
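For illustration, a hypothetical before/after of the test-gating pattern described above (the test names are made up; `TEST_WITH_ROCM` comes from `torch.testing._internal.common_utils`):
```python
import unittest

from torch.testing._internal.common_utils import TEST_WITH_ROCM

# Before: gated on CUDA >= 11, which also implicitly excluded ROCm builds.
#   @unittest.skipIf(not CUDA11OrLater, "requires CUDA >= 11")
# After: the version flag is dropped and ROCm is excluded explicitly.
@unittest.skipIf(TEST_WITH_ROCM, "not supported on ROCm")
class TestSomeCublasFeature(unittest.TestCase):  # hypothetical test case
    def test_feature(self):
        ...
```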
CC @ptrblck @malfet @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92605
Approved by: https://github.com/ngimel
Fix for an issue surfaced on the PyTorch discussion forum: https://discuss.pytorch.org/t/cuda-error-cublas-status-not-supported-when-calling-cublasltmatmul-from-torch-nn-functional-linear/170214
Note that PyTorch builds before #71200 should not be affected as there was no `cublasLt` dispatch path. Additionally, the provided repro has the quirk of using a 3D input, which means it will not dispatch to `cublasLt`-backed `addmm` until builds that include #72728. Changing the input to 2D by trivially removing the size `1` dimension will surface the failure on builds after #71200.
Interestingly, the case where _all_ inputs are 2-byte aligned is supported (runs without crashing), but the case where some inputs are more than 2-byte aligned and others are exactly 2-byte aligned is not. This behavior suggests that the `cuBLASLt` heuristics are incorrect, as the heuristic function has visibility into the raw pointer values via the descriptors when it is called.
We will follow up with the `cuBLASLt` team, but this fix is needed for now to prevent unnecessary crashes.
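To make the alignment distinction concrete, an illustrative snippet (not a guaranteed repro of the crash) showing how a storage offset can leave an fp16 tensor only 2-byte aligned:
```python
import torch

t = torch.randn(64, 65, dtype=torch.half, device="cuda")
full = t[:, :64]    # shares the freshly allocated storage: well aligned
offset = t[:, 1:]   # storage offset of one fp16 element -> 2-byte alignment

print(full.data_ptr() % 16)    # typically 0
print(offset.data_ptr() % 16)  # 2
```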
CC @ptrblck @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92201
Approved by: https://github.com/ngimel