Commit Graph

9 Commits

Pearu Peterson
e918461377 Add instructions for generating optimal Triton kernel parameters of bsr_dense_addmm (#115504)
As in the title.

In addition, enable verbose output when executing the `torch/sparse/_triton_ops_meta.py` script.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115504
Approved by: https://github.com/cpuhrsch
ghstack dependencies: #115499
2023-12-12 16:44:51 +00:00
Pearu Peterson
32286512cc Add tune_bsr_dense_addmm as an API to find optimal triton kernel parameters for bsr_dense_addmm (#115499)
As in the title.

In addition:
- improve the algorithm for finding a minimum of operation timings: break the inner loop early when the next minimum candidate is found
- add tests and fix bugs
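
For illustration, a minimal sketch of calling the new API; the module path matches the script referenced in the neighbouring commits, while the keyword arguments (`beta`, `alpha`, `verbose`) are assumptions about the interface rather than a confirmed signature:

```python
import torch
from torch.sparse._triton_ops_meta import tune_bsr_dense_addmm

# Illustrative shapes and dtype; the keyword arguments below are assumptions.
bsr = torch.randn(256, 256, dtype=torch.float16, device="cuda").to_sparse_bsr((16, 16))
dense = torch.randn(256, 4096, dtype=torch.float16, device="cuda")
inp = torch.randn(256, 4096, dtype=torch.float16, device="cuda")

# Search for the triton kernel parameters that minimize the timing of
# bsr_dense_addmm(inp, bsr, dense) on the local device.
meta = tune_bsr_dense_addmm(inp, bsr, dense, beta=1, alpha=1, verbose=True)
print(meta)
```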

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115499
Approved by: https://github.com/cpuhrsch
2023-12-12 16:44:51 +00:00
Pearu Peterson
12085914b8 Replace bsr_dense_mm triton kernel with bsr_dense_addmm triton kernel (#115030)
The `bsr_dense_addmm` triton kernel introduced in https://github.com/pytorch/pytorch/pull/114595 is a generalization of the `bsr_dense_mm` triton kernel and a more efficient version of it, because it uses an extra kernel parameter `SPLIT_N` that has a notable effect on performance for r.h.s. operands with a larger number of columns.

This PR eliminates the `bsr_dense_mm` triton kernel in favor of the `bsr_dense_addmm` triton kernel.
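
As a hedged sketch of why a single kernel can serve both operations (assuming both functions are exposed in `torch.sparse._triton_ops`; the exact keyword names are assumptions): with `beta=0` and `alpha=1`, `bsr_dense_addmm` reduces to the plain product computed by `bsr_dense_mm`.

```python
import torch
from torch.sparse._triton_ops import bsr_dense_mm, bsr_dense_addmm

bsr = torch.randn(256, 256, dtype=torch.float16, device="cuda").to_sparse_bsr((16, 16))
dense = torch.randn(256, 4096, dtype=torch.float16, device="cuda")
inp = torch.zeros(256, 4096, dtype=torch.float16, device="cuda")

# With beta=0 the addmm epilogue contributes nothing, so the result matches mm.
ref = bsr_dense_mm(bsr, dense)
out = bsr_dense_addmm(inp, bsr, dense, beta=0, alpha=1)
torch.testing.assert_close(out, ref)
```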

The performance increase of `bsr_dense_mm` is as follows (float16, `NVIDIA A100-SXM4-80GB`):
- with 16x16 blocks, the average/maximal speed up is 50/71 %
- with 32x32 blocks, the average/maximal speed up is 30/63 %
- with 64x64 blocks, the average/maximal speed up is 12/26 %
- with 128x128 blocks, the average/maximal speed up is 7/17 %

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115030
Approved by: https://github.com/cpuhrsch
2023-12-05 22:29:24 +00:00
Pearu Peterson
4ba37e1804 Add tests for bsr_dense_addmm and bsr_dense_mm triton kernels (#114800)
As in the title.

In addition:
- resolve https://github.com/pytorch/pytorch/pull/114757#discussion_r1409547917 regarding triton-contiguous inputs
- support non-contiguous inputs and outputs in triton kernels
- fix a couple of minor bugs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114800
Approved by: https://github.com/cpuhrsch
2023-12-04 22:07:47 +00:00
Pearu Peterson
69c4819f53 Add bsr_dense_addmm triton kernel (#114595)
As in the title.

The `bsr_dense_addmm` kernel implemented in this PR is a generalization of `bsr_dense_mm` in the following respects (in addition to having `input`, `beta`, and `alpha` parameters):
- it implements a `SPLIT_N` kernel parameter that enables efficient kernel launches for wide inputs. For instance, the timing of nn.linear with 256x256 BSR weights having 16x16 blocks and a 256x131072 strided input was reduced by about 16x (this corresponds to the 94 % speed up value listed below).
- it supports rectangular blocks in sparse BSR tensor weights

The performance increase of nn.linear is as follows (float16, `NVIDIA A100-SXM4-80GB`):
- with 16x16 blocks, the average/maximal speed up is 55/94 %
- with 32x32 blocks, the average/maximal speed up is 33/63 %
- with 64x64 blocks, the average/maximal speed up is 23/42 %
- with 128x128 blocks, the average/maximal speed up is 15/39 %
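
The nn.linear timings above refer to linear layers whose weight is stored as a BSR tensor; a minimal sketch of that setup follows (shapes and dtype are illustrative, and whether the call actually reaches the triton kernel depends on the device/dtype support described in these commits):

```python
import torch
import torch.nn.functional as F

out_features, in_features, batch, bs = 256, 256, 131072, 16
weight = torch.randn(out_features, in_features, dtype=torch.float16, device="cuda")
bias = torch.randn(out_features, dtype=torch.float16, device="cuda")
x = torch.randn(batch, in_features, dtype=torch.float16, device="cuda")  # wide r.h.s. operand

bsr_weight = weight.to_sparse_bsr((bs, bs))  # 16x16 blocks
y = F.linear(x, bsr_weight, bias)            # may dispatch to the BSR triton kernels (CUDA, float16)
print(y.shape)                               # torch.Size([131072, 256])
```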

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114595
Approved by: https://github.com/cpuhrsch
2023-11-29 05:29:25 +00:00
Pearu Peterson
12f95df0e9 Eliminate unnecessary multiplications by 1 in addmm with sparse compressed tensor operand (#114026)
This PR:
- updates `torch/sparse/_triton_ops_meta.py` for the API change in `triton.testing.do_bench`
- forces `num_stages` to be 1 when the blocksize is 128x128 to avoid an out-of-resources exception when `bsr_dense_mm` is called from `nn.linear`
- eliminates unnecessary multiplications by 1, as in the title (a sketch of the idea follows the list). The performance of `nn.linear` on BSR tensor weights (dtypes `float16` and `bfloat16`) increases as follows (`NVIDIA A100-SXM4-80GB`):
  - for blocksize 16x16, the average/maximum speed up is about 11/20 %
  - for blocksize 32x32, the average/maximum speed up is about 15/24 %
  - for blocksize 64x64, the average/maximum speed up is about 18/26 %
  - for blocksize 128x128, the average/maximum speed up is about 15/28 %
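
A minimal sketch of the multiplication-by-1 elimination mentioned above (a pure-Python illustration of the idea, not the actual kernel code):

```python
import torch

def addmm_epilogue(acc: torch.Tensor, inp: torch.Tensor, alpha=1, beta=1) -> torch.Tensor:
    # Skip the alpha scaling when alpha == 1: multiplying by 1 costs time but changes nothing.
    out = acc if alpha == 1 else alpha * acc
    # Likewise skip the beta scaling of the input operand; beta == 0 drops it entirely.
    if beta != 0:
        out = out + (inp if beta == 1 else beta * inp)
    return out
```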

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114026
Approved by: https://github.com/cpuhrsch
2023-11-19 12:13:54 +00:00
Pearu Peterson
e1c872e009 Add optimal triton kernel parameters to bsr_dense_mm and scatter_mm for bfloat16 and float32 dtypes (#113553)
As in the title.

This PR is a follow-up to PR https://github.com/pytorch/pytorch/pull/112737 to address bfloat16 and float32 dtype cases. The performance increase is as follows (`NVIDIA A100-SXM4-80GB`):

- bsr_scatter_mm and bfloat16
  - for blocksize 16x16, the average/maximum speed up is about 29/75 %.
  - for blocksize 32x32, the average/maximum speed up is about 23/58 %.
  - for blocksize 64x64, the average/maximum speed up is about 27/66 %.
  - for blocksize 128x128, the average/maximum speed up is about 33/72 %.
- bsr_dense_mm and bfloat16
  - for blocksize 16x16, the average/maximum speed up is about 47/61 %.
  - for blocksize 32x32, the average/maximum speed up is about 29/43 %.
  - for blocksize 64x64, the average/maximum speed up is about 21/41 %.
  - for blocksize 128x128, the average/maximum speed up is about 12/29 %.
- bsr_dense_mm and float32
  - for blocksize 16x16, the average/maximum speed up is about 35/49 %.
  - for blocksize 32x32, the average/maximum speed up is about 2/5 %.
  - for blocksize 64x64, the average/maximum speed up is about 2/21 %.
  - for blocksize 128x128, the average/maximum speed up is about 79/84 %.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113553
Approved by: https://github.com/cpuhrsch
2023-11-14 00:47:59 +00:00
Aaron Gokaslan
8219bf051b [BE]: Apply RUF015 to torch folder (#113025)
Removes unnecessary allocations of iterators. There is a small chance this may have side effects, since the entire iterator is no longer consumed, but this is a much more efficient way to retrieve the first element.
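
For reference, the kind of rewrite RUF015 performs (hypothetical snippet, not taken from the PR):

```python
items = {"a": 1, "b": 2, "c": 3}

first = list(items.keys())[0]  # before: materializes the whole key list
first = next(iter(items))      # after: stops after the first element
```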

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113025
Approved by: https://github.com/ezyang, https://github.com/malfet
2023-11-07 00:48:15 +00:00
Pearu Peterson
e64d250210 Add a tool for a semi-automatic optimization of bsr_dense_mm meta parameters. (#112737)
Finding optimal meta parameters for the `bsr_dense_mm` and `bsr_scatter_mm` triton kernels is a tedious job. This PR introduces a tool (a Python script, `torch/sparse/_triton_ops_meta.py`) that finds the optimal set of meta parameters for a given set of matrix multiplication inputs and their block sizes. Currently, such a set is found for square BSR tensor inputs with sizes 256...16384 and square blocksizes 16...128, and dense tensor inputs with sizes 256...131072.
As a result, `bsr_dense_mm` performance has increased as follows (`NVIDIA A100-SXM4-80GB`):
- for blocksize 16x16, the average/maximum speed up is about 40/60 %.
- for blocksize 32x32, the average/maximum speed up is about 28/45 %.
- for blocksize 64x64, the average/maximum speed up is about 26/43 %.
- for blocksize 128x128, the average/maximum speed up is about 12/28 %.

To enable these performance improvements through meta parameter optimization on other CUDA devices, one must execute the `_triton_ops_meta.py` script, which calculates the optimal meta parameters and stores the results in a dictionary object defined in `_triton_ops_meta.py`.
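
A hedged sketch of that step (assuming the script is run from a PyTorch source checkout and takes no mandatory arguments, as the description above suggests):

```python
import subprocess
import sys

# Benchmark candidate meta parameters on the local CUDA device and store the
# best ones in the dictionary defined in torch/sparse/_triton_ops_meta.py.
subprocess.run([sys.executable, "torch/sparse/_triton_ops_meta.py"], check=True)
```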

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112737
Approved by: https://github.com/cpuhrsch
2023-11-05 12:52:09 +00:00