Commit Graph

56 Commits

Author SHA1 Message Date
Jiong Gong
3267814d53 [inductor] refactor: device dispatch inside do_bench (#125736)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125736
Approved by: https://github.com/shunting314
2024-05-09 23:50:02 +00:00
Jiong Gong
8def2e92f2 [inductor] autotune benchmark support for cpu (#125159)
This PR adds the autotune infrastructure for CPU. It generalizes and extends `BenchmarkRequest` with CPU support and a C++ module loader. A `do_bench_cpu` util function is added for benchmarking functions on CPU; it performs warmup runs and returns the median time from multiple trials.
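A minimal sketch of the warmup-then-median pattern (the name `do_bench_cpu_sketch` and its signature are illustrative, not the actual utility):

```python
import statistics
import time


def do_bench_cpu_sketch(fn, warmup=5, trials=20):
    """Benchmark `fn` on CPU: discard warmup runs, return the median time in ms."""
    for _ in range(warmup):
        fn()
    times = []
    for _ in range(trials):
        start = time.perf_counter()
        fn()
        times.append((time.perf_counter() - start) * 1e3)
    return statistics.median(times)
```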

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125159
Approved by: https://github.com/jansel
2024-05-09 00:28:49 +00:00
Aaron Gokaslan
1dd42e42c4 [BE]: Try TCH autofixes on torch/ (#125536)
Tries TCH autofixes to see what breaks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125536
Approved by: https://github.com/ezyang
2024-05-05 23:13:59 +00:00
Kai Londenberg
10f673541e [Inductor cutlass backend] Enabled nonzero workspace and Cutlass StreamK (#125406)
Enable nonzero workspace and Cutlass StreamK for Inductor Cutlass GEMM ops.

This is a simpler rewrite of my original version of #119005, using @peterbell10's workspace allocation mechanism from #117992.

Test Plan:
 - Additional unit test in test_cutlass_backend.py which specifically tests StreamK GEMM with workspace requirement
 - CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125406
Approved by: https://github.com/jansel
2024-05-05 15:28:45 +00:00
Aaron Gokaslan
e3b9b71684 [BE]: Ruff - TRY401 - Avoid verbose exception logging (#125126)
Don't bother logging the exception object explicitly with the logger; it is captured anyway and would generate verbose output.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125126
Approved by: https://github.com/ezyang
2024-04-28 21:44:33 +00:00
Kai Londenberg
f8f6c460cd [Inductor max autotune] Make autotuning robust against very slow Kernels (#123932)
If a kernel does not return in a reasonable amount of time during autotuning, it can delay Inductor compilation significantly. This change introduces soft/hard kill timeouts and a mechanism to kill kernels being profiled in subprocesses if they take too long.

Correspondingly, a few new config options are introduced in _inductor/config.py, all of them with inline docs.
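As a rough illustration of the timeout mechanism (a sketch using only the standard library; the function names and the treatment of a killed kernel as infinitely slow are assumptions, not the actual Inductor implementation):

```python
import multiprocessing as mp


def _run_benchmark(fn, queue):
    queue.put(fn())


def benchmark_with_timeouts(fn, soft_timeout=1.0, hard_timeout=2.0):
    """Run `fn` in a subprocess; terminate it softly, then forcibly, if it hangs."""
    queue = mp.Queue()
    proc = mp.Process(target=_run_benchmark, args=(fn, queue))
    proc.start()
    proc.join(soft_timeout)
    if proc.is_alive():
        proc.terminate()                       # soft kill (SIGTERM)
        proc.join(hard_timeout - soft_timeout)
        if proc.is_alive():
            proc.kill()                        # hard kill (SIGKILL)
            proc.join()
        return float("inf")                    # treat the kernel as unusably slow
    return queue.get()
```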

Test Plan:
Existing tests in test_max_autotune.py and test_cutlass_backend.py cover the new code paths.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123932
Approved by: https://github.com/jansel
ghstack dependencies: #121497, #123930
2024-04-23 11:56:15 +00:00
Jason Ansel
7fd8870e6b [inductor] Refactor runtime files into torch._inductor.runtime (part 3) (#124557)
I am planning to make the compile_worker process not import torch so it can start up much faster.  This stack is prep for that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124557
Approved by: https://github.com/yanboliang
ghstack dependencies: #124552, #124553
2024-04-22 18:46:24 +00:00
PyTorch MergeBot
0b90af0bf5 Revert "[inductor] Refactor runtime files into torch._inductor.runtime (part 3) (#124557)"
This reverts commit fcf28b0ad5.

Reverted https://github.com/pytorch/pytorch/pull/124557 on behalf of https://github.com/jeanschmidt due to There are internal breakages, already discussed with author and he'll FF ([comment](https://github.com/pytorch/pytorch/pull/124552#issuecomment-2070548223))
2024-04-22 18:28:05 +00:00
Jason Ansel
fcf28b0ad5 [inductor] Refactor runtime files into torch._inductor.runtime (part 3) (#124557)
I am planning to make the compile_worker process not import torch so it can start up much faster.  This stack is prep for that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124557
Approved by: https://github.com/yanboliang
ghstack dependencies: #124552, #124553
2024-04-22 04:51:15 +00:00
Aaron Gokaslan
c5fafe9f48 [BE]: TRY002 - Ban raising vanilla exceptions (#124570)
Adds a ruff lint rule to ban raising vanilla exceptions. Most of these should at the very least be a RuntimeError, ValueError, TypeError, or some other more specific error. There are hundreds of instances of these bad exception types already in the codebase, so I have noqa'd most of them. Hopefully this error code will get committers to rethink what exception type they should raise when they submit a PR.

I also encourage people to gradually fix the existing noqas that have been added, so they can be removed over time and our exception typing can be improved.
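For illustration, the kind of rewrite the rule encourages (hypothetical snippet):

```python
def check_square(shape: tuple[int, int]) -> None:
    if shape[0] != shape[1]:
        # Flagged by TRY002 (would need "# noqa: TRY002" to keep it):
        #     raise Exception("expected a square matrix")
        # Preferred: a specific exception type with a descriptive message.
        raise ValueError(f"expected a square matrix, got shape {shape}")
```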

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124570
Approved by: https://github.com/ezyang
2024-04-21 22:26:40 +00:00
eellison
136f8378e1 Re-land precompile triton templates (#124030)
Re-land precompile triton templates. This got reverted because we were precompiling templates without checking the cache. I have since added logic and a test to ensure we do not precompile if there is a cache hit.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124030
Approved by: https://github.com/shunting314, https://github.com/nmacchioni, https://github.com/yoyoyocmu
2024-04-19 17:03:33 +00:00
PyTorch MergeBot
2b82345e48 Revert "Re-land precompile triton templates (#124030)"
This reverts commit 030bb13fe8.

Reverted https://github.com/pytorch/pytorch/pull/124030 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/124030#issuecomment-2063191117))
2024-04-18 07:21:41 +00:00
eellison
030bb13fe8 Re-land precompile triton templates (#124030)
Re-land precompile triton templates. This got reverted because we were precompiling templates without checking the cache. I have since added logic and a test to ensure we do not precompile if there is a cache hit.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124030
Approved by: https://github.com/shunting314, https://github.com/nmacchioni, https://github.com/yoyoyocmu
2024-04-18 01:22:13 +00:00
Xuehai Pan
93e249969b [BE] enable ruff rule RSE and remove useless parentheses in raise statements (#124261)
Remove useless parentheses in `raise` statements if the exception type is raised with no argument.
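For example, the autofix turns a parenthesized no-argument raise into a bare class reference (illustrative snippet):

```python
def not_supported() -> None:
    # Before the autofix: redundant parentheses on a no-argument raise.
    #     raise NotImplementedError()
    # After the RSE autofix: raise the class itself; Python instantiates it.
    raise NotImplementedError
```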

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124261
Approved by: https://github.com/albanD
2024-04-17 19:29:34 +00:00
PyTorch MergeBot
3f89f565bb Revert "Re-land precompile triton templates (#124030)"
This reverts commit d68196e7ef.

Reverted https://github.com/pytorch/pytorch/pull/124030 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/124030#issuecomment-2061044960))
2024-04-17 11:31:33 +00:00
eellison
d68196e7ef Re-land precompile triton templates (#124030)
Re-land precompile triton templates. This got reverted because we were precompiling templates without checking the cache. I have since added logic and a test to ensure we do not precompile if there is a cache hit.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124030
Approved by: https://github.com/shunting314, https://github.com/nmacchioni, https://github.com/yoyoyocmu
2024-04-17 02:30:46 +00:00
Yue Dong
e0c9764660 Back out "Precompile triton templates (#121998)" (#123305)
Summary: We are reverting #121998 because the change, combined with search-autotune-cache, led to a significant compilation time increase, causing the stuck job detector to trigger and then kill the training job.

Test Plan:
CI tests

Reviewed By: nmacchioni

Differential Revision: D55712203

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123305
Approved by: https://github.com/eellison, https://github.com/nmacchioni, https://github.com/xw285cornell
2024-04-04 16:05:10 +00:00
eellison
ebde6c72cb Precompile triton templates (#121998)
Before this PR we were not precompiling triton templates in parallel. Compilation would occur during benchmarking.

Triton benchmarking templates were emitted as:

```
@triton.jit
def triton_mm(arg_A, arg_B, out_ptr0):
```

In order to precompile, we need to provide the full kernel specification, as we do when we emit the template in the final output code generation.

```
@triton_heuristics.template(
    num_stages=3,
    num_warps=8,
    triton_meta={'signature': {0: '*fp32', 1: '*fp32', 2: '*fp32'}, 'device': 0, 'device_type': 'cuda', 'constants': {}, 'configs': [AttrsDescriptor(divisible_by_16=(0, 1, 2), equal_to_1=(), ids_of_folded_args=(), divisible_by_8=())]},
    inductor_meta={'kernel_name': 'Placeholder.DESCRIPTIVE_NAME', 'backend_hash': 'cdeecfeccd31ad7810f96b5752194b1c2406d0a81e39a6ca09c8ee150baae183'},
)
@triton.jit
def triton_mm(arg_A, arg_B, out_ptr0):
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121998
Approved by: https://github.com/jansel
2024-03-25 21:33:36 +00:00
Shunting Zhang
a3d4eaf253 [inductor] device guard for max autotune benchmark (#122479)
Internal users reported that max-autotune fails if tensors are not on device 0. It turns out that we may use tensors on, say, device 6 but run the kernel on them on device 0.

This PR enforces that we do benchmarking for max-autotune on the correct device.
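A minimal sketch of the idea, assuming CUDA and the public `torch.cuda.device` context manager (the actual Inductor benchmarking path differs):

```python
import time

import torch


def benchmark_on_correct_device(kernel, *tensors, iters=10):
    """Time `kernel` with the inputs' device made current, rather than device 0."""
    device = tensors[0].device
    # Without the guard, the kernel could be launched on device 0 even though
    # the inputs live on, e.g., device 6.
    with torch.cuda.device(device):
        torch.cuda.synchronize(device)
        start = time.perf_counter()
        for _ in range(iters):
            kernel(*tensors)
        torch.cuda.synchronize(device)
        elapsed_ms = (time.perf_counter() - start) * 1e3
    return elapsed_ms / iters
```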

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122479
Approved by: https://github.com/xintwfb, https://github.com/Chillee
2024-03-22 17:27:53 +00:00
PyTorch MergeBot
bce640709c Revert "Precompile triton templates (#121998)"
This reverts commit b8df2f0ca5.

Reverted https://github.com/pytorch/pytorch/pull/121998 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is causing all ROCm trunk job to fail b8df2f0ca5 ([comment](https://github.com/pytorch/pytorch/pull/121998#issuecomment-2014003037))
2024-03-21 23:05:59 +00:00
eellison
b8df2f0ca5 Precompile triton templates (#121998)
Before this PR we were not precompiling triton templates in parallel. Compilation would occur during benchmarking.

Triton benchmarking templates were emitted as:

```
@triton.jit
def triton_mm(arg_A, arg_B, out_ptr0):
```

In order to precompile, we need to provide the full kernel specification, as we do when we emit the template in the final output code generation.

```
@triton_heuristics.template(
    num_stages=3,
    num_warps=8,
    triton_meta={'signature': {0: '*fp32', 1: '*fp32', 2: '*fp32'}, 'device': 0, 'device_type': 'cuda', 'constants': {}, 'configs': [AttrsDescriptor(divisible_by_16=(0, 1, 2), equal_to_1=(), ids_of_folded_args=(), divisible_by_8=())]},
    inductor_meta={'kernel_name': 'Placeholder.DESCRIPTIVE_NAME', 'backend_hash': 'cdeecfeccd31ad7810f96b5752194b1c2406d0a81e39a6ca09c8ee150baae183'},
)
@triton.jit
def triton_mm(arg_A, arg_B, out_ptr0):
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121998
Approved by: https://github.com/jansel
ghstack dependencies: #121996, #120275, #121997
2024-03-21 17:04:53 +00:00
Hongtao Yu
5aa7f8646f [inductor][Gemm] Autotune with matrix_instr_nonkdim for AMDGPU (#120742)
Relanding https://github.com/pytorch/pytorch/pull/120639 plus a fix to drop `matrix_instr_nonkdim` values that do not align with `BLOCK_M` or `BLOCK_N`.

Matrix multiplication with Triton is usually done in a tiled way. A tile size too large for a single hardware instruction, e.g., a 128x128 matmul, has to be broken down into a sequence of smaller hardware mma instructions. On AMDGPU, `matrix_instr_nonkdim` controls the shape of those mma instructions; its default value in Triton is 0, which means that by default Triton will decompose a large tiled matmul operation into a sequence of 32x32x8 mma instructions. Other mma instructions are available, such as 16x16x16, which requires `matrix_instr_nonkdim=16`. This change enables tuning the value for GEMM, which seems to improve its performance by 20% to 2x.
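As an illustration of the extra tuning dimension (a sketch only; the structure below is hypothetical and not the actual Inductor mm config generation):

```python
base_configs = [
    {"BLOCK_M": 64, "BLOCK_N": 64, "BLOCK_K": 32, "num_warps": 4, "num_stages": 0},
    {"BLOCK_M": 128, "BLOCK_N": 128, "BLOCK_K": 32, "num_warps": 8, "num_stages": 0},
]


def expand_with_mma_shapes(configs, nonkdims=(16, 32)):
    """Add candidate matrix_instr_nonkdim values, dropping misaligned ones."""
    out = list(configs)  # keep the defaults (matrix_instr_nonkdim left at 0)
    for cfg in configs:
        for nonkdim in nonkdims:
            # Mirror the relanding fix: skip instruction shapes that do not
            # divide the tile evenly.
            if cfg["BLOCK_M"] % nonkdim or cfg["BLOCK_N"] % nonkdim:
                continue
            out.append({**cfg, "matrix_instr_nonkdim": nonkdim})
    return out
```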

Before:
  ```
AUTOTUNE mm(1024x1024, 1024x1024)
  ExternKernelCaller(extern_kernels.mm) 0.0410 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.0487 ms 84.2%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.0544 ms 75.4%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.0633 ms 64.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.0687 ms 59.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.0716 ms 57.3%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.0748 ms 54.9%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.0788 ms 52.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=2 0.1014 ms 40.5%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.1069 ms 38.4%
  SingleProcess AUTOTUNE takes 8.1153 seconds
```

After:
  ```
AUTOTUNE mm(1024x1024, 1024x1024)
  ExternKernelCaller(extern_kernels.mm) 0.0417 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 0.0470 ms 88.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 0.0488 ms 85.4%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=4 0.0490 ms 85.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 0.0525 ms 79.5%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=4 0.0553 ms 75.4%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 0.0574 ms 72.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=8 0.0634 ms 65.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=2 0.0655 ms 63.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 0.0681 ms 61.2%
  SingleProcess AUTOTUNE takes 11.4076 seconds
```

Before:
  ```
AUTOTUNE mm(2048x2048, 2048x2048)
  ExternKernelCaller(extern_kernels.mm) 0.2094 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.2452 ms 85.4%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.2763 ms 75.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.2836 ms 73.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.2854 ms 73.4%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.2951 ms 71.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.2970 ms 70.5%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.4184 ms 50.1%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.5097 ms 41.1%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=2 0.5570 ms 37.6%
  SingleProcess AUTOTUNE takes 3.4052 seconds
```

After:
  ```
AUTOTUNE mm(2048x2048, 2048x2048)
  ExternKernelCaller(extern_kernels.mm) 0.2117 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=8 0.2429 ms 87.2%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 0.2485 ms 85.2%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 0.2526 ms 83.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 0.2537 ms 83.5%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 0.2554 ms 82.9%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 0.2623 ms 80.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 0.2695 ms 78.5%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=8 0.2758 ms 76.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=4 0.2792 ms 75.8%
  SingleProcess AUTOTUNE takes 11.3538 seconds

```

Before:
  ```
AUTOTUNE mm(4096x4096, 4096x4096)
  ExternKernelCaller(extern_kernels.mm) 1.5901 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 1.9380 ms 82.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 1.9943 ms 79.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 2.0640 ms 77.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 2.0941 ms 75.9%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 2.1272 ms 74.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 2.1554 ms 73.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 2.2931 ms 69.3%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 3.7016 ms 43.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=2 4.6021 ms 34.6%
  SingleProcess AUTOTUNE takes 9.0523 seconds
```

After:
  ```
AUTOTUNE mm(4096x4096, 4096x4096)
  ExternKernelCaller(extern_kernels.mm) 1.5862 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 1.6924 ms 93.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 1.7616 ms 90.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 1.8159 ms 87.4%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 1.9340 ms 82.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 1.9352 ms 82.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 2.0378 ms 77.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 2.0983 ms 75.6%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 2.1138 ms 75.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 2.1657 ms 73.2%
  SingleProcess AUTOTUNE takes 8.2225 seconds
```

Before:
  ```
AUTOTUNE mm(8192x8192, 8192x8192)
  ExternKernelCaller(extern_kernels.mm) 12.0134 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 14.8082 ms 81.1%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 15.4242 ms 77.9%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 16.6869 ms 72.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 16.7751 ms 71.6%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 17.0145 ms 70.6%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 17.1363 ms 70.1%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 18.2159 ms 66.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 29.4726 ms 40.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=2 37.9039 ms 31.7%
  SingleProcess AUTOTUNE takes 11.0074 seconds
```

After:
  ```
AUTOTUNE mm(8192x8192, 8192x8192)
  ExternKernelCaller(extern_kernels.mm) 11.9554 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 12.9953 ms 92.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 13.7726 ms 86.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 13.9647 ms 85.6%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=8 14.9728 ms 79.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 15.3729 ms 77.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 15.3955 ms 77.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=4 15.5647 ms 76.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 16.0037 ms 74.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 16.7432 ms 71.4%
  SingleProcess AUTOTUNE takes 14.9839 seconds
```

Reviewed By: xw285cornell, nmacchioni

Differential Revision: D54203170

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120742
Approved by: https://github.com/xw285cornell
2024-02-28 20:27:14 +00:00
PyTorch MergeBot
f36e00b8ce Revert "[inductor][Gemm] Autotune with matrix_instr_nonkdim for AMDGPU (#120639)"
This reverts commit 78f53a3f73.

Reverted https://github.com/pytorch/pytorch/pull/120639 on behalf of https://github.com/izaitsevfb due to breaking ROCm ([comment](https://github.com/pytorch/pytorch/pull/120639#issuecomment-1967585568))
2024-02-27 21:05:57 +00:00
Hongtao Yu
78f53a3f73 [inductor][Gemm] Autotune with matrix_instr_nonkdim for AMDGPU (#120639)
Summary:
Matrix multiplication with Triton is usually done in a tiled way. A tile size too large for a single hardware instruction, e.g., a 128x128 matmul, has to be broken down into a sequence of smaller hardware mma instructions. On AMDGPU, `matrix_instr_nonkdim` controls the shape of those mma instructions; its default value in Triton is 32, which means that by default Triton will decompose a large tiled matmul operation into a sequence of 32x32x8 mma instructions. Other mma instructions are available, such as 16x16x16, which requires `matrix_instr_nonkdim=16`. This change enables tuning the value for GEMM, which seems to improve its performance by 20% to 2x.

Similar changes have been made to the HSTU ragged attention kernel, D53386525.

Test Plan:

Before:
  ```
AUTOTUNE mm(1024x1024, 1024x1024)
  ExternKernelCaller(extern_kernels.mm) 0.0410 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.0487 ms 84.2%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.0544 ms 75.4%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.0633 ms 64.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.0687 ms 59.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.0716 ms 57.3%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.0748 ms 54.9%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.0788 ms 52.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=2 0.1014 ms 40.5%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.1069 ms 38.4%
  SingleProcess AUTOTUNE takes 8.1153 seconds
```

After:
  ```
AUTOTUNE mm(1024x1024, 1024x1024)
  ExternKernelCaller(extern_kernels.mm) 0.0417 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 0.0470 ms 88.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 0.0488 ms 85.4%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=4 0.0490 ms 85.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 0.0525 ms 79.5%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=4 0.0553 ms 75.4%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 0.0574 ms 72.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=8 0.0634 ms 65.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=2 0.0655 ms 63.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 0.0681 ms 61.2%
  SingleProcess AUTOTUNE takes 11.4076 seconds
```

Before:
  ```
AUTOTUNE mm(2048x2048, 2048x2048)
  ExternKernelCaller(extern_kernels.mm) 0.2094 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.2452 ms 85.4%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.2763 ms 75.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.2836 ms 73.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.2854 ms 73.4%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.2951 ms 71.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.2970 ms 70.5%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.4184 ms 50.1%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.5097 ms 41.1%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=2 0.5570 ms 37.6%
  SingleProcess AUTOTUNE takes 3.4052 seconds
```

After:
  ```
AUTOTUNE mm(2048x2048, 2048x2048)
  ExternKernelCaller(extern_kernels.mm) 0.2117 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=8 0.2429 ms 87.2%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 0.2485 ms 85.2%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 0.2526 ms 83.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 0.2537 ms 83.5%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 0.2554 ms 82.9%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 0.2623 ms 80.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 0.2695 ms 78.5%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=8 0.2758 ms 76.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=4 0.2792 ms 75.8%
  SingleProcess AUTOTUNE takes 11.3538 seconds

```

Before:
  ```
AUTOTUNE mm(4096x4096, 4096x4096)
  ExternKernelCaller(extern_kernels.mm) 1.5901 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 1.9380 ms 82.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 1.9943 ms 79.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 2.0640 ms 77.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 2.0941 ms 75.9%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 2.1272 ms 74.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 2.1554 ms 73.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 2.2931 ms 69.3%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 3.7016 ms 43.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=2 4.6021 ms 34.6%
  SingleProcess AUTOTUNE takes 9.0523 seconds
```

After:
  ```
AUTOTUNE mm(4096x4096, 4096x4096)
  ExternKernelCaller(extern_kernels.mm) 1.5862 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 1.6924 ms 93.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 1.7616 ms 90.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 1.8159 ms 87.4%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 1.9340 ms 82.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 1.9352 ms 82.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 2.0378 ms 77.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 2.0983 ms 75.6%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 2.1138 ms 75.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 2.1657 ms 73.2%
  SingleProcess AUTOTUNE takes 8.2225 seconds
```

Before:
  ```
AUTOTUNE mm(8192x8192, 8192x8192)
  ExternKernelCaller(extern_kernels.mm) 12.0134 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 14.8082 ms 81.1%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 15.4242 ms 77.9%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 16.6869 ms 72.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 16.7751 ms 71.6%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 17.0145 ms 70.6%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 17.1363 ms 70.1%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 18.2159 ms 66.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 29.4726 ms 40.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=2 37.9039 ms 31.7%
  SingleProcess AUTOTUNE takes 11.0074 seconds
```

After:
  ```
AUTOTUNE mm(8192x8192, 8192x8192)
  ExternKernelCaller(extern_kernels.mm) 11.9554 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 12.9953 ms 92.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 13.7726 ms 86.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 13.9647 ms 85.6%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=8 14.9728 ms 79.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 15.3729 ms 77.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 15.3955 ms 77.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=4 15.5647 ms 76.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 16.0037 ms 74.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 16.7432 ms 71.4%
  SingleProcess AUTOTUNE takes 14.9839 seconds
```

Reviewed By: xw285cornell, nmacchioni

Differential Revision: D54203170

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120639
Approved by: https://github.com/xw285cornell, https://github.com/jansel
2024-02-27 18:16:33 +00:00
Kai Londenberg
5d81ade484 [Inductor max autotune] Multithreaded Precompilation (#119386)
When using the Cutlass backend, the compilation of CUDA source files can completely dominate the runtime of the benchmarking done as part of autotuning.

This change adds a multithreaded precompilation phase, which pre-populates the compilation cache (both the in-memory one and a possible on-disk sccache).

It also eliminates the unnecessary compilation and benchmarking steps that were previously performed.
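A rough sketch of the precompilation idea under simplified assumptions (the cache, compiler call, and names below are placeholders, not the actual Cutlass code cache):

```python
from concurrent.futures import ThreadPoolExecutor

_compile_cache: dict[str, str] = {}


def compile_source(source: str) -> str:
    """Stand-in for an expensive compiler invocation (e.g. nvcc via subprocess)."""
    return f"binary-for-{hash(source)}"  # placeholder result


def precompile_all(sources, max_threads=8):
    """Warm the compilation cache for every autotune candidate up front."""
    def compile_once(src: str) -> None:
        if src not in _compile_cache:  # skip sources that are already cached
            _compile_cache[src] = compile_source(src)

    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # Benchmarking afterwards hits the warm cache instead of compiling serially.
        list(pool.map(compile_once, sources))
```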

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119386
Approved by: https://github.com/aakhundov
2024-02-09 16:11:30 +00:00
Catherine Lee
4f5785b6b3 Enable possibly-undefined error code (#118533)
Fixes https://github.com/pytorch/pytorch/issues/118129

Suppressions automatically added with

```
import re

with open("error_file.txt", "r") as f:
    errors = f.readlines()

error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type

for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f"  # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Co-authored-by: Catherine Lee <csl@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118533
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2024-01-30 21:07:01 +00:00
PyTorch MergeBot
40ece2e579 Revert "Enable possibly-undefined error code (#118533)"
This reverts commit 4f13f69a45.

Reverted https://github.com/pytorch/pytorch/pull/118533 on behalf of https://github.com/clee2000 due to sorry i'm trying to figure out a codev merge conflict, if this works i'll be back to rebase and merge ([comment](https://github.com/pytorch/pytorch/pull/118533#issuecomment-1917695185))
2024-01-30 19:00:34 +00:00
Edward Z. Yang
4f13f69a45 Enable possibly-undefined error code (#118533)
Fixes https://github.com/pytorch/pytorch/issues/118129

Suppressions automatically added with

```
import re

with open("error_file.txt", "r") as f:
    errors = f.readlines()

error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type

for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f"  # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118533
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2024-01-30 05:08:10 +00:00
Shunting Zhang
99f7e721fe [inductor] make inductor work with new triton compile interface (#115878)
Two recent Triton PRs (https://github.com/openai/triton/pull/2701, https://github.com/openai/triton/pull/2756) change the interface of triton.compile; this PR adds the necessary changes on the Inductor side to work with both the old and new compile APIs.

There is also some simplification between the compilation call in the subprocess and the one in the main process:
- Previously we passed warm_cache_only=True if the compilation happened in a subprocess, but Triton never uses that argument at the currently pinned version, so it has been removed.
- Previously we only passed compute_capability if the compilation happened in a subprocess. This PR changes that to always pass compute_capability to triton.compile, whether the compilation happens in the main process or a subprocess.

Updated:
There are more interface changes on the Triton side, e.g.:
- tl.math.{min, max} now requires a propagate_nan argument.
- JITFunction.run now requires a warmup argument. This affects the benchmarking phase of matmul max-autotune. On the other hand, JITFunction.run no longer accepts a stream argument; simply not passing it when benchmarking the matmul Triton kernel works for both the old and new versions of Triton (see the sketch after this list).
- The Triton Autotuner renamed its attributes from 'warmup' to 'num_warmup' and from 'rep' to 'num_rep'. This causes Dynamo to fail to handle Triton Autotuner objects, since Dynamo's TritonKernelVariable makes assumptions about attribute names. This matters for some test cases in which a model calls the Triton Autotuner directly.
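For illustration, one way to stay compatible with both calling conventions is to dispatch on the function signature. This is a hedged sketch that assumes the presence of a `warmup` parameter distinguishes the new interface; the actual Inductor handling may differ:

```python
import inspect


def run_compat(jit_fn, *args, stream=None, **kwargs):
    """Call a Triton JIT function's `run` under either interface version."""
    params = inspect.signature(jit_fn.run).parameters
    if "warmup" in params:
        # Newer interface: warmup is required and stream is no longer accepted.
        return jit_fn.run(*args, warmup=False, **kwargs)
    # Older interface: stream is still accepted.
    return jit_fn.run(*args, stream=stream, **kwargs)
```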

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115878
Approved by: https://github.com/jansel
2023-12-22 00:09:29 +00:00
PyTorch MergeBot
db35ccf463 Revert "[inductor] make inductor work with new triton compile interface (#115878)"
This reverts commit bbded928b3.

Reverted https://github.com/pytorch/pytorch/pull/115878 on behalf of https://github.com/kit1980 due to Broke ROCm https://github.com/pytorch/pytorch/actions/runs/7282149837/job/19844618618 ([comment](https://github.com/pytorch/pytorch/pull/115878#issuecomment-1865369349))
2023-12-21 02:00:17 +00:00
Shunting Zhang
bbded928b3 [inductor] make inductor work with new triton compile interface (#115878)
Two recent Triton PRs (https://github.com/openai/triton/pull/2701, https://github.com/openai/triton/pull/2756) change the interface of triton.compile; this PR adds the necessary changes on the Inductor side to work with both the old and new compile APIs.

There is also some simplification between the compilation call in the subprocess and the one in the main process:
- Previously we passed warm_cache_only=True if the compilation happened in a subprocess, but Triton never uses that argument at the currently pinned version, so it has been removed.
- Previously we only passed compute_capability if the compilation happened in a subprocess. This PR changes that to always pass compute_capability to triton.compile, whether the compilation happens in the main process or a subprocess.

Updated:
There are more interface changes on the Triton side, e.g.:
- tl.math.{min, max} now requires a propagate_nan argument.
- JITFunction.run now requires a warmup argument. This affects the benchmarking phase of matmul max-autotune. On the other hand, JITFunction.run no longer accepts a stream argument; simply not passing it when benchmarking the matmul Triton kernel works for both the old and new versions of Triton.
- The Triton Autotuner renamed its attributes from 'warmup' to 'num_warmup' and from 'rep' to 'num_rep'. This causes Dynamo to fail to handle Triton Autotuner objects, since Dynamo's TritonKernelVariable makes assumptions about attribute names. This matters for some test cases in which a model calls the Triton Autotuner directly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115878
Approved by: https://github.com/jansel
2023-12-21 00:03:38 +00:00
Jez Ng
47e6cc4d22 Remove yet more type-ignores in dynamo/inductor (#114684)
Probably the last big batch for a while

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114684
Approved by: https://github.com/Skylion007
2023-11-28 22:09:38 +00:00
Jez Ng
87925789ae Make V.graph properly typed (#114025)
Previously it lacked a type hint and so was treated as an Any type. This
resulted in a lot of untyped code downstream as V.graph is referenced in
many places in inductor code. I've typed it properly now as
GraphLowering, and fixed the numerous type errors this surfaced.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114025
Approved by: https://github.com/eellison
ghstack dependencies: #114013
2023-11-21 02:14:29 +00:00
Aaron Gokaslan
b7b2178204 [BE]: Remove useless lambdas (#113602)
Applies PLW0108, which removes useless lambdas in Python. The rule is in preview, so it is not ready to be enabled by default just yet. These are the autofixes from the rule.
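A typical instance of what the rule removes (illustrative snippet):

```python
# Flagged by PLW0108: the lambda only forwards its argument to len().
names = sorted(["inductor", "dynamo", "autograd"], key=lambda s: len(s))

# Autofixed form: pass the callable directly.
names = sorted(["inductor", "dynamo", "autograd"], key=len)
```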

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113602
Approved by: https://github.com/albanD
2023-11-14 20:06:48 +00:00
Adnan Akhundov
d7317d8a11 Fix size_hint call sites failing on unbacked SymInts (#110520)
Summary: Unbacked SymInts can't get a `sizevars.size_hint` because they are data-dependent. #109893 added a new `fallback` parameter to `sizevars.size_hint` to specify the fallback value in cases like unbacked SymInts. In this PR we add the fallback at more call sites.
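A minimal sketch of the pattern, with a hypothetical helper that is not the actual `V.graph.sizevars.size_hint` (the real check for unbacked symbols is different):

```python
def size_hint(expr, *, fallback=None):
    """Return a concrete hint, or `fallback` when the expression has none."""
    try:
        return int(expr)             # backed symbols carry a concrete hint
    except (TypeError, ValueError):
        if fallback is not None:
            return fallback          # data-dependent (unbacked) case
        raise
```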

Test Plan: CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110520
Approved by: https://github.com/jansel, https://github.com/ezyang
2023-10-14 08:10:09 +00:00
Jez Ng
c77dd684c9 Enable typechecking in _inductor/ir.py (#110112)
I used a bunch of type-ignore comments, mostly due to https://github.com/pytorch/pytorch/issues/109963.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110112
Approved by: https://github.com/peterbell10
2023-10-07 04:19:38 +00:00
Kazuaki Ishizaki
434a996c42 Fix typo under torch/_inductor directory (#110530)
This PR fixes typos in comments and messages in files under the `torch/_inductor` directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110530
Approved by: https://github.com/kit1980
2023-10-05 02:17:20 +00:00
Sam Larsen
4eada253e1 [inductor] Set CUDA_VISIBLE_DEVICES for multi-device subprocess autotuning (#109500)
Summary: The current parallel autotune implementation sets the CUDA_VISIBLE_DEVICES env var too late (after the benchmarking subprocess has started), and the torch libraries don't recognize the change. Since the multiprocessing library doesn't support providing an environment for the subprocess, we temporarily set CUDA_VISIBLE_DEVICES in the parent process so that the change is inherited by the subprocess.
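A minimal sketch of the parent-side workaround (hypothetical helper names; the actual implementation differs):

```python
import contextlib
import multiprocessing as mp
import os


@contextlib.contextmanager
def visible_devices(device_index: int):
    """Temporarily restrict CUDA_VISIBLE_DEVICES in the parent process."""
    saved = os.environ.get("CUDA_VISIBLE_DEVICES")
    os.environ["CUDA_VISIBLE_DEVICES"] = str(device_index)
    try:
        yield
    finally:
        if saved is None:
            os.environ.pop("CUDA_VISIBLE_DEVICES", None)
        else:
            os.environ["CUDA_VISIBLE_DEVICES"] = saved


def start_benchmark_worker(device_index, target, *args):
    with visible_devices(device_index):
        proc = mp.Process(target=target, args=args)
        proc.start()  # the child inherits the restricted environment
    return proc
```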

Test Plan:
* New unit test to verify the env var is set in the sub-process and fail the benchmark if it's not.
* Ran multiprocess autotuning and looked at the output from `nvidia-smi pmon` to make sure that all GPUs were assigned processes.

Snippet:
```
    1    3442314     C     2     1     -     -   python
    2    3442318     C     2     1     -     -   python
    3    3442320     C     8     2     -     -   python
    4    3442323     C     9     4     -     -   python
    5    3442325     C    10     4     -     -   python
    6    3442327     C    10     4     -     -   python
    7    3442329     C     2     0     -     -   python
    0    3434906     C     0     0     -     -   python
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109500
Approved by: https://github.com/eellison, https://github.com/shunting314
2023-09-21 14:29:30 +00:00
Ying Zhang
d8da2a7c85 Switch to CUDA event based profiling (#109338)
In https://github.com/pytorch/pytorch/pull/107901, CUDA-event-based profiling was changed to profiler-based profiling to avoid counting CPU-side kernel launch overhead in the final latency numbers. However, it turns out that torch.profile() is significantly slower than CUDA events, which affects model compilation speed quite significantly. This PR changes back to CUDA-event-based profiling.

Follow-ups:
* Try CUDA event profiling with CUDAGraphs;
* Multi-GPU profiling;
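For reference, CUDA-event-based timing follows roughly this pattern (a generic sketch, not the actual Inductor helper):

```python
import torch


def cuda_event_time_ms(fn, iters=100):
    """Average milliseconds per call, measured with CUDA events on the current stream."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    fn()                           # one warmup call
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()       # wait so elapsed_time() is valid
    return start.elapsed_time(end) / iters
```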

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109338
Approved by: https://github.com/frank-wei
2023-09-17 06:04:41 +00:00
Sam Larsen
4a09ed5459 [inductor] Parallelize Max Autotune step 2: Use multiple GPUs (#109127)
Test Plan:
`python test/inductor/test_max_autotune.py`
`TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 python benchmarks/dynamo/torchbench.py --device cuda --performance --backend inductor --inference --only hf_Bart`
`TORCHINDUCTOR_AUTOTUNE_MULTI_DEVICE=1 TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 python benchmarks/dynamo/torchbench.py --device cuda --performance --backend inductor --inference --only hf_Bart`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109127
Approved by: https://github.com/shunting314, https://github.com/eellison
ghstack dependencies: #109126
2023-09-14 00:37:39 +00:00
Sam Larsen
ce4283933f [inductor] Parallelize Max Autotune step 1: refactor autotune_process (#109126)
Summary: Step 1 in revamping subprocess autotune to support multiple GPUs. This diff just does some refactoring to autotune_process.py in order to prepare for the next diff:
* Move all logic for managing the sub-process (like detecting sub-process crashes) into the TuningProcess class.
* Use log.debug statements instead of print statements

Test Plan: python test/inductor/test_max_autotune.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109126
Approved by: https://github.com/shunting314, https://github.com/eellison
2023-09-14 00:37:39 +00:00
Ying Zhang
097fd43f8c [Inductor CUTLASS backend] Step 4: CUDA (template) kernels (#107931)
This is the step 4 to add cutlass as an alternative inductor backend.
Full tests can be found from the last PR in the stack.

Feature request: https://github.com/pytorch/pytorch/issues/106991.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107931
Approved by: https://github.com/aakhundov, https://github.com/jansel, https://github.com/kadeng
ghstack dependencies: #107802, #107847, #107901
2023-09-12 17:44:38 +00:00
Ying Zhang
b2d764ece0 [Inductor CUTLASS backend] Step 3: autotune_process, and CUDABenchmarkRequest (#107901)
This is step 3 of adding Cutlass as an alternative Inductor backend.
Full tests can be found in the last PR of the stack.

Feature request: https://github.com/pytorch/pytorch/issues/106991.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107901
Approved by: https://github.com/jansel, https://github.com/aakhundov, https://github.com/kadeng
ghstack dependencies: #107802, #107847
2023-09-12 17:44:36 +00:00
PyTorch MergeBot
2039f30c06 Revert "[inductor] Parallelize Max Autotune step 1: Use Popen (#107982)"
This reverts commit d685668003.

Reverted https://github.com/pytorch/pytorch/pull/107982 on behalf of https://github.com/masnesral due to fbcode failures ([comment](https://github.com/pytorch/pytorch/pull/107982#issuecomment-1714818307))
2023-09-12 01:12:22 +00:00
PyTorch MergeBot
c36c2bfcb2 Revert "[inductor] Parallelize Max Autotune step 2: Use all GPUs (#107983)"
This reverts commit 2c61313ff3.

Reverted https://github.com/pytorch/pytorch/pull/107983 on behalf of https://github.com/masnesral due to fbcode failures ([comment](https://github.com/pytorch/pytorch/pull/107983#issuecomment-1714816358))
2023-09-12 01:08:08 +00:00
Sam Larsen
2c61313ff3 [inductor] Parallelize Max Autotune step 2: Use all GPUs (#107983)
Summary: Step 2 in revamping subprocess autotune to support multiple GPUs: use a pool of subprocesses and distribute benchmark calls across them.
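A hedged sketch of the distribution pattern (illustrative only; the real pool and worker management in autotune_process.py differ, and this assumes the benchmark callables are picklable and that each worker binds its GPU before any CUDA initialization):

```python
import multiprocessing as mp
import os


def _worker(device_index, request_queue, result_queue):
    # Bind this worker to a single GPU before any CUDA work happens.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(device_index)
    for key, fn in iter(request_queue.get, None):  # None is the shutdown signal
        result_queue.put((key, fn()))


def benchmark_across_gpus(requests, device_indices):
    """Distribute benchmark callables across one worker subprocess per GPU."""
    request_queue, result_queue = mp.Queue(), mp.Queue()
    workers = [
        mp.Process(target=_worker, args=(d, request_queue, result_queue))
        for d in device_indices
    ]
    for w in workers:
        w.start()
    for item in enumerate(requests):
        request_queue.put(item)
    for _ in workers:
        request_queue.put(None)
    results = dict(result_queue.get() for _ in requests)
    for w in workers:
        w.join()
    return [results[i] for i in range(len(requests))]
```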

Test Plan:
`python test/inductor/test_max_autotune.py`
`TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 python benchmarks/dynamo/torchbench.py --device cuda --performance --backend inductor --inference --only hf_Bart`
`TORCHINDUCTOR_AUTOTUNE_MULTI_DEVICE=1 TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 python benchmarks/dynamo/torchbench.py --device cuda --performance --backend inductor --inference --only hf_Bart`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107983
Approved by: https://github.com/eellison, https://github.com/shunting314
ghstack dependencies: #107982
2023-09-10 15:43:03 +00:00
Sam Larsen
d685668003 [inductor] Parallelize Max Autotune step 1: Use Popen (#107982)
Summary: Step 1 in revamping subprocess autotune to support multiple GPUs: use Popen to create a new process with an entry point we control so we don't reinterpret the toplevel script.

Test Plan: `python test/inductor/test_max_autotune.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107982
Approved by: https://github.com/eellison, https://github.com/shunting314
2023-09-10 15:43:03 +00:00
FFFrog
83b5027027 Enable Mypy Check in torch/_inductor/select_algorithm.py (#106701)
Fixes #105230 to enable mypy check.

```
$ mypy  torch/_inductor/select_algorithm.py
Success: no issues found in 1 source file
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106701
Approved by: https://github.com/eellison
2023-08-10 03:19:50 +00:00
Jouni K. Seppänen
186352a625 [inductor] Make autotune_process.py pass mypy (#105791)
`TensorMeta.from_irnodes` handles either a single `IRNode` or a tuple or list of them. I tried to express this with overloading, but because this file is in MYPYNOFOLLOW, the `IRNode` subclasses become `Any`, which causes the overloads to be overlapping.

This changes the type of the argument to `benchmark_in_sub_process` to the more specific `TritonTemplateCaller`, since that one has the `bmreq` member and existing docstrings indicate that only the triton template benchmark is handled.

The `rand_strided` call caused a mypy error because the default value for device was a string. This is fixed by adding type hints to `rand_strided` in `torch/_dynamo/testing.py`. Likewise, the return value of `PyCodeCache.load_by_key_path` can be inferred from the type hint on `PyCodeCache.cache`.

Fixes one part of #105230

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105791
Approved by: https://github.com/jansel, https://github.com/Skylion007
2023-07-31 23:58:38 +00:00
Khushi Agrawal
64843993a4 [mypy] autotune_process.py (#105732)
Follows: #105571 / #105230

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105732
Approved by: https://github.com/eellison
2023-07-26 22:45:04 +00:00