pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
Adnan Akhundov	0bed0501fa	Don't skip register-spilling configs in custom Triton kernel auto-tuning (#119634 ) Summary: There has been some empirical evidence that, for (non-trivial) custom (user-written) Triton kernels, a register-spilling config yields the best result in auto-tuning. For this reason, we don't skip register-spilling config from auto-tuning of the custom Triton kernels. <details> <summary>An example of auto-tuning result with the register-spilling config outperforming others</summary> ``` BLOCK_M: 16, BLOCK_N: 16, num_warps: 2, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.748896, nreg 255, nspill 0, #shared-mem 8704 BLOCK_M: 16, BLOCK_N: 16, num_warps: 4, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 1.723424, nreg 249, nspill 0, #shared-mem 8704 BLOCK_M: 16, BLOCK_N: 16, num_warps: 8, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 2.202656, nreg 190, nspill 0, #shared-mem 8704 BLOCK_M: 16, BLOCK_N: 16, num_warps: 2, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.748256, nreg 255, nspill 0, #shared-mem 8704 BLOCK_M: 16, BLOCK_N: 16, num_warps: 4, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 1.724896, nreg 249, nspill 0, #shared-mem 8704 BLOCK_M: 16, BLOCK_N: 16, num_warps: 8, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 2.201632, nreg 190, nspill 0, #shared-mem 8704 BLOCK_M: 16, BLOCK_N: 32, num_warps: 2, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.651664, nreg 255, nspill 56, #shared-mem 13312 BLOCK_M: 16, BLOCK_N: 32, num_warps: 4, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.846368, nreg 255, nspill 14, #shared-mem 13312 BLOCK_M: 16, BLOCK_N: 32, num_warps: 8, num_ctas: 1, num_stages: 1, 
enable_warp_specialization: False, enable_persistent: False: 1.841792, nreg 243, nspill 0, #shared-mem 13312 BLOCK_M: 16, BLOCK_N: 32, num_warps: 2, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.651584, nreg 255, nspill 56, #shared-mem 13312 BLOCK_M: 16, BLOCK_N: 32, num_warps: 4, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.846432, nreg 255, nspill 14, #shared-mem 13312 BLOCK_M: 16, BLOCK_N: 32, num_warps: 8, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 1.841904, nreg 243, nspill 0, #shared-mem 13312 BLOCK_M: 16, BLOCK_N: 64, num_warps: 2, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 1.236448, nreg 255, nspill 254, #shared-mem 22528 BLOCK_M: 16, BLOCK_N: 64, num_warps: 4, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 1.484384, nreg 255, nspill 174, #shared-mem 22528 BLOCK_M: 16, BLOCK_N: 64, num_warps: 8, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 1.131168, nreg 255, nspill 6, #shared-mem 22528 BLOCK_M: 16, BLOCK_N: 64, num_warps: 2, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 1.236544, nreg 255, nspill 254, #shared-mem 22528 BLOCK_M: 16, BLOCK_N: 64, num_warps: 4, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 1.483648, nreg 255, nspill 174, #shared-mem 22528 BLOCK_M: 16, BLOCK_N: 64, num_warps: 8, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 1.131408, nreg 255, nspill 6, #shared-mem 22528 BLOCK_M: 32, BLOCK_N: 16, num_warps: 2, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.516112, nreg 255, nspill 28, #shared-mem 13312 BLOCK_M: 32, BLOCK_N: 16, num_warps: 4, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, 
enable_persistent: False: 0.737792, nreg 255, nspill 0, #shared-mem 13312 BLOCK_M: 32, BLOCK_N: 16, num_warps: 8, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 1.411632, nreg 193, nspill 0, #shared-mem 13312 BLOCK_M: 32, BLOCK_N: 16, num_warps: 2, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.515904, nreg 255, nspill 28, #shared-mem 13312 BLOCK_M: 32, BLOCK_N: 16, num_warps: 4, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.736608, nreg 255, nspill 0, #shared-mem 13312 BLOCK_M: 32, BLOCK_N: 16, num_warps: 8, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 1.409808, nreg 193, nspill 0, #shared-mem 13312 BLOCK_M: 32, BLOCK_N: 32, num_warps: 2, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.553536, nreg 255, nspill 130, #shared-mem 18432 BLOCK_M: 32, BLOCK_N: 32, num_warps: 4, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.569792, nreg 255, nspill 56, #shared-mem 18432 BLOCK_M: 32, BLOCK_N: 32, num_warps: 8, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.892448, nreg 255, nspill 4, #shared-mem 18432 BLOCK_M: 32, BLOCK_N: 32, num_warps: 2, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.553584, nreg 255, nspill 130, #shared-mem 18432 BLOCK_M: 32, BLOCK_N: 32, num_warps: 4, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.569568, nreg 255, nspill 56, #shared-mem 18432 BLOCK_M: 32, BLOCK_N: 32, num_warps: 8, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.892240, nreg 255, nspill 4, #shared-mem 18432 BLOCK_M: 32, BLOCK_N: 64, num_warps: 2, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 1.332928, nreg 255, 
nspill 366, #shared-mem 28672 BLOCK_M: 32, BLOCK_N: 64, num_warps: 4, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.922256, nreg 255, nspill 228, #shared-mem 28672 BLOCK_M: 32, BLOCK_N: 64, num_warps: 8, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.758400, nreg 255, nspill 26, #shared-mem 28672 BLOCK_M: 32, BLOCK_N: 64, num_warps: 2, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 1.333440, nreg 255, nspill 366, #shared-mem 28672 BLOCK_M: 32, BLOCK_N: 64, num_warps: 4, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.922336, nreg 255, nspill 228, #shared-mem 28672 BLOCK_M: 32, BLOCK_N: 64, num_warps: 8, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.758496, nreg 255, nspill 26, #shared-mem 28672 BLOCK_M: 64, BLOCK_N: 16, num_warps: 2, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 1.231648, nreg 255, nspill 292, #shared-mem 22528 BLOCK_M: 64, BLOCK_N: 16, num_warps: 4, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.639424, nreg 255, nspill 90, #shared-mem 22528 BLOCK_M: 64, BLOCK_N: 16, num_warps: 8, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.917952, nreg 240, nspill 0, #shared-mem 22528 BLOCK_M: 64, BLOCK_N: 16, num_warps: 2, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 1.230624, nreg 255, nspill 292, #shared-mem 22528 BLOCK_M: 64, BLOCK_N: 16, num_warps: 4, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.639168, nreg 255, nspill 90, #shared-mem 22528 BLOCK_M: 64, BLOCK_N: 16, num_warps: 8, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.917440, nreg 240, nspill 0, #shared-mem 22528 BLOCK_M: 
64, BLOCK_N: 32, num_warps: 2, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.838080, nreg 255, nspill 354, #shared-mem 28672 BLOCK_M: 64, BLOCK_N: 32, num_warps: 4, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.569184, nreg 255, nspill 178, #shared-mem 28672 BLOCK_M: 64, BLOCK_N: 32, num_warps: 8, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.614720, nreg 255, nspill 28, #shared-mem 28672 BLOCK_M: 64, BLOCK_N: 32, num_warps: 2, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.838048, nreg 255, nspill 354, #shared-mem 28672 BLOCK_M: 64, BLOCK_N: 32, num_warps: 4, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.569472, nreg 255, nspill 178, #shared-mem 28672 BLOCK_M: 64, BLOCK_N: 32, num_warps: 8, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.615104, nreg 255, nspill 28, #shared-mem 28672 BLOCK_M: 64, BLOCK_N: 64, num_warps: 2, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 1.012128, nreg 255, nspill 522, #shared-mem 40960 BLOCK_M: 64, BLOCK_N: 64, num_warps: 4, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.861536, nreg 255, nspill 378, #shared-mem 40960 BLOCK_M: 64, BLOCK_N: 64, num_warps: 8, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.771584, nreg 255, nspill 134, #shared-mem 40960 BLOCK_M: 64, BLOCK_N: 64, num_warps: 2, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 1.012512, nreg 255, nspill 522, #shared-mem 40960 BLOCK_M: 64, BLOCK_N: 64, num_warps: 4, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.861024, nreg 255, nspill 378, #shared-mem 40960 BLOCK_M: 64, BLOCK_N: 64, num_warps: 8, 
num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.771712, nreg 255, nspill 134, #shared-mem 40960 ``` </details> In the above, the winning config is `BLOCK_M: 32, BLOCK_N: 16, num_warps: 2, num_ctas: 1, num_stages: 2`, although it has non-zero `nspill 28`. This is an example where we need to consider all configs, including the register-spilling ones, to obtain the best result from auto-tuning. In the worst case, this will just make auto-tuning longer, but can't regress the results. And, as the number of custom Triton kernels in the model is normally much smaller than the number of Inductor-generated ones, this should be acceptable. Test Plan: CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/119634 Approved by: https://github.com/oulgen	2024-02-11 02:13:25 +00:00
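The effect described in the register-spilling commit (#119634) above can be illustrated with a small sketch. It uses four rows copied from the auto-tuning log and a hypothetical `pick_best` helper; the `TuningResult` type and the helper are illustrative assumptions, not Inductor's auto-tuner API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TuningResult:
    config: str    # human-readable config label
    timing: float  # benchmark timing as printed in the log (lower is better)
    n_spills: int  # register spills reported for the compiled kernel


# Four rows copied from the auto-tuning log in the commit message.
RESULTS = [
    TuningResult("BLOCK_M=16, BLOCK_N=16, num_warps=2, num_stages=2", 0.748256, 0),
    TuningResult("BLOCK_M=32, BLOCK_N=16, num_warps=2, num_stages=2", 0.515904, 28),
    TuningResult("BLOCK_M=32, BLOCK_N=16, num_warps=4, num_stages=2", 0.736608, 0),
    TuningResult("BLOCK_M=64, BLOCK_N=32, num_warps=4, num_stages=1", 0.569184, 178),
]


def pick_best(results, skip_spilling):
    """Return the fastest result, optionally ignoring register-spilling configs.

    Hypothetical helper for illustration only.
    """
    candidates = [r for r in results if not (skip_spilling and r.n_spills > 0)]
    return min(candidates, key=lambda r: r.timing)


best_all = pick_best(RESULTS, skip_spilling=False)
best_no_spill = pick_best(RESULTS, skip_spilling=True)
print("all configs:     ", best_all.config, best_all.timing)
print("spill-free only: ", best_no_spill.config, best_no_spill.timing)
```

On these rows, filtering out spilling configs selects a slower config than the one found when every config is considered, which is the situation the commit message describes.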
PyTorch MergeBot	3ab08946d5	Revert "[aot_inductor] move CudaWrapperCodeGen into a separate file (#119448 )" This reverts commit `0597dab523`. Reverted https://github.com/pytorch/pytorch/pull/119448 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/119448#issuecomment-1937345167))	2024-02-10 23:04:36 +00:00
PyTorch MergeBot	d8e319a961	Revert "[aot_inductor] move CppWrapperCodeGen into a separate file (#119491 )" This reverts commit `760056bbdc`. Reverted https://github.com/pytorch/pytorch/pull/119491 on behalf of https://github.com/DanilBaibak due to Reverted as a dependency for #119448 ([comment](https://github.com/pytorch/pytorch/pull/119491#issuecomment-1937344548))	2024-02-10 23:02:05 +00:00
Peter Bell	c0f1183eb4	[inductor] Fix compile error on scan with no mask (#119555 ) Fixes #119591 Currently this results in invalid syntax: ```python tmp4 = tl.where(, tmp1, tmp2) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/119555 Approved by: https://github.com/lezcano	2024-02-10 12:38:40 +00:00
Yang Chen	760056bbdc	[aot_inductor] move CppWrapperCodeGen into a separate file (#119491 ) This PR moved the CppWrapperCodeGen class into a separate file, cpp_wrapper.py, to simplify wrapper.py. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119491 Approved by: https://github.com/desertfire, https://github.com/albanD	2024-02-10 02:15:56 +00:00
Yang Chen	0597dab523	[aot_inductor] move CudaWrapperCodeGen into a separate file (#119448 ) wrapper.py is getting more complex. Let's first split it into smaller pieces. Will have another PR to move CppWrapperCodeGen. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119448 Approved by: https://github.com/desertfire	2024-02-09 20:18:04 +00:00
Elias Ellison	bf8a5a11be	Fix Inductor CSE Across Separate Reductions (#119410 ) We were CSE'ing a load across two separate reduction loop bodies. This is because we were examining an indirect indexing that did not have an explicit rindex in its load. I've commented with more details and other potential approaches to the fix. Tried using the minifier unsuccessfully and hand-minified some, but could do more. Fix for https://github.com/pytorch/pytorch/issues/119327 Pull Request resolved: https://github.com/pytorch/pytorch/pull/119410 Approved by: https://github.com/shunting314, https://github.com/jansel	2024-02-09 19:34:57 +00:00
Kai Londenberg	5d81ade484	[Inductor max autotune] Multithreaded Precompilation (#119386 ) When using the Cutlass backend, the compilation of CUDA source files can totally dominate the runtime required for the benchmarking done as part of Autotuning. This change adds a multithreaded precompilation phase, which serves to pre-populate the compilation cache (both in-memory, and a possible on-disk sccache). It also ensures that no unnecessary compilation and benchmarking steps are performed, which was previously the case. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119386 Approved by: https://github.com/aakhundov	2024-02-09 16:11:30 +00:00
Jiong Gong	a050d146b7	[Inductor] Add Int8 data type into Inductor CPP backend vectorized code generation (#119179 ) Summary Part 1 of fixing https://github.com/pytorch/pytorch/issues/119141 which needs vectorized code generation of per channel quant and int8 data type. In the current implementation for quantization, the vectorized code generation only supports the `uint8` data type. In this PR, we introduce support for the `int8` data type within the vectorized code generation. TestPlan ``` python -u -m pytest -s -v test_cpu_repro.py -k test_decomposed_dequant_relu_quant_int8 python -u -m pytest -s -v test_cpu_repro.py -k test_dequant_quant_lowering_int8 python -u -m pytest -s -v test_cpu_repro.py -k test_dequant_maxpool2d_lowering_int8 python -u -m pytest -s -v test_cpu_repro.py -k test_tile2d_load_decomposed_dequant_add_relu_quant_int8 python -u -m pytest -s -v test_cpu_repro.py -k test_per_tensor_fake_quant_int8 python -u -m pytest -s -v test_cpu_repro.py -k test_non_contiguous_load_buf_quant_int8 python -u -m pytest -s -v test_cpu_repro.py -k test_tile2d_store_channel_shuffle_cl_quant_output_int8 python -u -m pytest -s -v test_cpu_repro.py -k test_dequant_relu_quant_dequant_relu_quant_lowering_int8 ``` Co-authored-by: Jiong Gong <jiong.gong@intel.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/119179 Approved by: https://github.com/peterbell10, https://github.com/jgong5, https://github.com/jansel	2024-02-09 07:33:12 +00:00
Peter Bell	88429a8084	[inductor] Add split scan kernel (#117992 ) This PR adds a new type of triton kernel in which data is persistent but the reduction dimension is split over multiple blocks (up to the entire kernel). Though this is called a reduction dimension, in actuality we only support scans. Because of this limitation, I have to be able to block fusions of split scan operations with reductions, so I chose to add a new `ir.SplitScan` node, which is identical but allows for differentiation in the scheduler. The split scan kernel is also the first to require an additional workspace buffer, which is used to communicate between CUDA blocks. This is slightly tricky, as the exact scratch space requirement isn't known until the grid size is calculated. Here I work around the issue by setting a minimum rblock size and always allocating to the maximum possible grid size for a given input tensor. Pull Request resolved: https://github.com/pytorch/pytorch/pull/117992 Approved by: https://github.com/jansel ghstack dependencies: #117991	2024-02-09 01:56:00 +00:00
Peter Bell	01edb8a559	[inductor] Refactor triton range_tree handling (#117991 ) Currently the dimension handling in triton kernels has various special cases e.g. - handling "r" for non-reduction vs persistent reduction vs non-persistent reduction. - handling "x" when `no_x_dim` is set This adds three new properties to the range tree objects which capture the same information in a more generic way: - `is_loop`: true for the "r" dimension of a non-persistent reduction - `tensor_dim`: Optional index of the triton tensor dimension - `grid_dim`: Optional index of the triton grid dimension The motivation here is I want to add a new split scan kernel type which is: - not a persistent reduction, yet has `is_loop=False` for the "r" dimension - Has a `grid_dim` for the "r" dimension These flags now only need to be set once in `initialize_range_trees`, instead of having to infer them throughout the code based on the tree prefix and various other kernel flags. Pull Request resolved: https://github.com/pytorch/pytorch/pull/117991 Approved by: https://github.com/lezcano	2024-02-09 01:56:00 +00:00
Yang Chen	9f8ade04cc	[aot_inductor] replace TORCH_CHECK with AOTI_CHECK in the generate cpp code (#119220 ) In some cases where we have TORCH_CHECK in loops, it may cause the host compiler to spend hours optimizing the run_impl function. This PR mitigated the issue by replacing TORCH_CHECK with a custom AOTI_CHECK, where we force the underlying assert function to be noinline. If forcing noinline causes any serious perf regression, we could either add an option to turn noinline on/off, or add another option to just turn AOTI_CHECK into a no-op, similar to the `assert` macro from cassert. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119220 Approved by: https://github.com/hl475, https://github.com/desertfire	2024-02-08 21:57:27 +00:00
Pearu Peterson	7ec6ac89e8	Add lowering to special.modified_bessel_i0 (#118993 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/118993 Approved by: https://github.com/peterbell10	2024-02-08 18:42:40 +00:00
Jiong Gong	896cf9d1ce	[inductor][cpp] vectorization support for int32/int64 (#119001 ) This pull request aims to complete most of the support for vectorizing int32 and int64 data types except for indirect indexing and masks. The basic data type support for uint32 and uint64 is also added but without vectorization. More vectorized conversion functions are added between integer and float. In order to support int64 vectors, a new VectorizedN class is introduced to handle vectors of arbitrary length. Below are the details: 1. Complete most of the int32 and int64 vectorization support including load, store, reduction, constant and conversion. The indirect indexing and masks will be addressed in follow-up PRs, after which the legality checking logic in `CppVecKernelChecker` can be further simplified. 2. Util functions for conversion between integer and float vectors (in cpp_prefix.h and ATen vec). Ideally, we should move them from cpp_prefix.h to ATen vec to simplify cpp_prefix.h; this will be addressed in follow-up PRs. 3. Introduced a new template class VectorizedN, designed to handle vectors of arbitrary length by encapsulating multiple Vectorized<T> instances. This class supports most of the operations of `Vectorized<T>`. It makes the support of int64 vectorization simpler. I will also apply it to bf16/fp16/int8 in the follow-up PRs for better efficiency. For example, bf16 currently only uses half of the vector lanes. With `VectorizedN`, we can use all of the lanes and map a bf16 vector to `VectorizedN<float,2>` on conversion. 4. Basic data type support is added for uint32 and uint64 (in graph.py). Vectorization support will be added later but is not of high priority due to fewer usages. Next steps: - [ ] Refactor the vector mask handling to support data types other than float. Currently vector masks are implemented with float vectors. - [ ] Fully utilize vector lanes for bfloat16/float16/int8. - [ ] Support indirect indexing with vectorized index via scalarization. 
- [ ] Clean up `CppVecKernelChecker`. - [ ] Simplify `cpp_prefix.h` including refactoring vector conversion logic. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119001 Approved by: https://github.com/peterbell10, https://github.com/jansel	2024-02-08 17:38:49 +00:00
PyTorch MergeBot	088d538a8d	Revert "[Inductor] GEMM shape padding improvements (#118522 )" This reverts commit `cc46829f96`. Reverted https://github.com/pytorch/pytorch/pull/118522 on behalf of https://github.com/eellison due to regresses HF ~4/5% ([comment](https://github.com/pytorch/pytorch/pull/118522#issuecomment-1932557670))	2024-02-07 17:42:14 +00:00
Bin Bao	40ec155e58	[AOTI][refactor] Split common aoti_runtime utils into a separate header (#119066 ) Summary: Split common utils from aoti_runtime/model.h into a separate header file, because when turning on ABI-compatible mode for JIT Inductor we won't need AOTInductorModel, but we do need some common utils, e.g. RAIIAtenTensorHandle. Differential Revision: [D53478809](https://our.internmc.facebook.com/intern/diff/D53478809) Pull Request resolved: https://github.com/pytorch/pytorch/pull/119066 Approved by: https://github.com/khabinov	2024-02-07 16:54:00 +00:00
Bin Bao	e868a7fedd	[AOTI] Rename config.aot_inductor.abi_compatible (#119065 ) Summary: Rename config.aot_inductor.abi_compatible to config.abi_compatible, since the cpp_wrapper mode in JIT Inductor will share the same flag. Differential Revision: [D53478752](https://our.internmc.facebook.com/intern/diff/D53478752) Pull Request resolved: https://github.com/pytorch/pytorch/pull/119065 Approved by: https://github.com/khabinov	2024-02-07 00:14:33 +00:00
Colin Peppler	7d7a3f0b37	[inductor] Support sympy.expr in user-defined Triton kernel grid fn (#119165 ) ## Problem A user-defined Triton kernel grid may use a sympy magic method like `Max`. This comes in the form of a `sympy.Expr`, namely `sympy.core.function.FunctionClass`. Handling this is not trivial since `user_defined_kernel_grid_fn_code` is used in Eager & Inductor. Eager usage below. ## Approach Pass in wrapper when Inductor codegens grid with ints/sympy.Expr, so we can utilize wrapper functions, such as `codegen_shape_tuple()`. Differential Revision: D53367012 Pull Request resolved: https://github.com/pytorch/pytorch/pull/119165 Approved by: https://github.com/aakhundov	2024-02-06 08:39:55 +00:00
Andrew M. James	884b6d2a67	[inductor] Implementing missing magic methods on IR values. (#118933 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/118933 Approved by: https://github.com/peterbell10	2024-02-06 05:50:26 +00:00
Colin Peppler	3829b55416	[inductor] Support ProxyExecutor argument codegen for sympy.Expr (#119166 ) Differential Revision: D53398312 ## Problem Currently, if a sympy expression that uses a magic method like `Max` is passed as an argument to ProxyExecutor, then C++ compilation will fail. We need to use std::max method instead. ``` # What we see aoti_torch_proxy_executor_call_function(..., std::vector<int64_t>{Max(1025, u1)}.data(), ...); # What we want aoti_torch_proxy_executor_call_function(..., std::vector<int64_t>{std::max(1025L, u1)}.data(), ...) ``` ## Approach Use C++ wrapper's expression printer to handle this conversion Pull Request resolved: https://github.com/pytorch/pytorch/pull/119166 Approved by: https://github.com/aakhundov	2024-02-06 00:33:25 +00:00
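The `Max` to `std::max` conversion described in #119166 above can be sketched with a toy expression printer. This is a minimal illustration over hand-built tuples, assuming a hypothetical `cpp_expr` helper; it does not use sympy and is not the C++ wrapper's actual expression printer.

```python
def cpp_expr(expr):
    """Render a tiny expression tree as C++ source text.

    Hypothetical representation for illustration: ints are literals,
    strings are symbol names, and ("Max", a, b, ...) is a function call.
    """
    if isinstance(expr, int):
        # "L" suffix matches the desired output quoted in the commit message.
        return f"{expr}L"
    if isinstance(expr, str):
        return expr
    op, *args = expr
    if op == "Max":
        rendered = [cpp_expr(a) for a in args]
        out = rendered[0]
        # std::max is binary here, so fold longer argument lists pairwise.
        for r in rendered[1:]:
            out = f"std::max({out}, {r})"
        return out
    raise ValueError(f"unsupported operation: {op}")


print(cpp_expr(("Max", 1025, "u1")))
```

Applied to the example from the commit message, the printer emits `std::max(1025L, u1)` instead of the Python-style `Max(1025, u1)` that fails C++ compilation.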
Shunting Zhang	fd0bf96c2b	[inductor] make multi-kernel work with cpp-wrapper (#117813 ) Make multi-kernel work with cpp-wrapper. multi-kernel generates two equivalent variants for a reduction. At runtime the faster one is picked. But cpp-wrapper needs to save the cubin file during codegen, so the two did not work together initially. Thanks Jason for suggesting a neat way to integrate these two. cpp-wrapper does 2-pass codegen right now. For the first pass, we still generate multi-kernel code and run it; for the second pass, we load the cubin file for the faster kernel directly. And multi-kernel python code is not generated for the second pass since it should not be needed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/117813 Approved by: https://github.com/jansel	2024-02-05 23:35:41 +00:00
PyTorch MergeBot	b964a1222c	Revert "[inductor] make multi-kernel work with cpp-wrapper (#117813 )" This reverts commit `c24ffc3f66`. Reverted https://github.com/pytorch/pytorch/pull/117813 on behalf of https://github.com/atalman due to Failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/117813#issuecomment-1927877102))	2024-02-05 19:25:39 +00:00
Yang Chen	b2e0f8d82d	[mypy] added type annotations to codegen_nodes methods (#119080 ) added correct type annotations to scheduler and backends' codegen_nodes methods Pull Request resolved: https://github.com/pytorch/pytorch/pull/119080 Approved by: https://github.com/eellison	2024-02-05 18:33:52 +00:00
Edward Z. Yang	abc09b27b9	Some minor type stub improvements (#118529 ) I was just playing around with improving the typing of symbolic_shapes. The PR is not "complete" but I in particular wanted to get feedback on whether or not people liked making ValueRanges Generic; it seems that distinguishing if you have an Expr ValueRange or a SympyBoolean ValueRange is a lot of trouble for downstream. Using TypeGuard, we can perform refinements on the generic parameter inside methods, although we still have to cast back to ValueRange[T] due to https://github.com/python/mypy/issues/14425#issuecomment-1914852707 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/118529 Approved by: https://github.com/Skylion007	2024-02-04 00:19:00 +00:00
Shunting Zhang	c24ffc3f66	[inductor] make multi-kernel work with cpp-wrapper (#117813 ) Make multi-kernel work with cpp-wrapper. multi-kernel generates two equivalent variants for a reduction. At runtime the faster one is picked. But cpp-wrapper needs to save the cubin file during codegen, so the two did not work together initially. Thanks Jason for suggesting a neat way to integrate these two. cpp-wrapper does 2-pass codegen right now. For the first pass, we still generate multi-kernel code and run it; for the second pass, we load the cubin file for the faster kernel directly. And multi-kernel python code is not generated for the second pass since it should not be needed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/117813 Approved by: https://github.com/jansel	2024-02-03 00:06:21 +00:00
Pearu Peterson	a69016a741	Add lowering to special.bessel_j1 (#118992 ) As in the title. Pull Request resolved: https://github.com/pytorch/pytorch/pull/118992 Approved by: https://github.com/peterbell10	2024-02-02 20:16:08 +00:00
Bin Bao	c7ba5f6c6f	[AOTI] Fix a cpp kernel missing arg type issue (#119021 ) Summary: The current way of fetching the kernel arg types only works for tensors, not symbols. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119021 Approved by: https://github.com/aakhundov, https://github.com/hl475, https://github.com/khabinov	2024-02-02 20:11:58 +00:00
Bin Bao	0e5fe4b3ae	[AOTI] Fix a RAIIAtenTensorHandle premature deallocation bug (#118963 ) Summary: generate_index_put_fallback currently generates something like the following, ``` AtenTensorHandle tensor_handle_array_1[] = {nullptr, nullptr, arg1_1, wrap_with_raii_handle_if_needed(tmp_tensor_handle_0)}; ``` The problem is that wrap_with_raii_handle_if_needed creates a RAIIAtenTensorHandle which only lives during this tmp array initialization. After the initialization is done, the RAIIAtenTensorHandle dies and releases the underlying Tensor, and when tensor_handle_array_1 is later passed to aoti_torch_index_put_out, some of its AtenTensorHandle elements have become invalid, causing a segfault. Differential Revision: [D53339348](https://our.internmc.facebook.com/intern/diff/D53339348) Pull Request resolved: https://github.com/pytorch/pytorch/pull/118963 Approved by: https://github.com/aakhundov	2024-02-02 16:49:45 +00:00
Kai Londenberg	cc46829f96	[Inductor] GEMM shape padding improvements (#118522 ) Improvements to shape padding logic in torch/_inductor/pad_mm.py. These changes could lead to up to 14% perf improvement for certain Meta internal models in experiments. Most notably: * 1.) Use aten.const_pad_nd operation to pad Tensors in a single op instead of using multiple steps involving intermediate buffers. This appears to be more performant than the previous logic, confirmed by Profiling & Benchmarking results (Meta internal) * 2.) Make many paddings unnecessary using explicitly transposed GEMM when either M or N dimension is properly aligned but the other is not, configurable via config.shape_pad_use_transpose (default: True). * 3.) Enable shape padding for the Inductor CUDA / Cutlass backend for all GEMM ops where Cutlass would be enabled, without benchmarking in that case. * Add config flag to always pad shapes (without benchmarking first), configurable via config.force_shape_pad (default: False) * Added several new unit tests to ensure tensors are padded such that they meet all alignment requirements after padding. Pull Request resolved: https://github.com/pytorch/pytorch/pull/118522 Approved by: https://github.com/jansel, https://github.com/eellison	2024-02-02 08:50:06 +00:00
Colin Peppler	babd6c776d	[inductor] skip launching kernels with zero grid in AOTInductor when using backed symints (#118654 ) Like #110312 but we also run this check when backed symints are in the grid (e.g. s1 / 512) ### Why? Let's say we lower a model and generate a GPU kernel grid with symbolic shapes, e.g. `s1 / 512`. If at some point later we run the lowered model with inputs s.t. `s1 = 0`, then we'll launch the kernel with a `0` sized grid. This surfaces as `CUDA driver error: invalid argument`. To avoid this, we check for a `0` sized grid whenever there are symbolic shapes, which include backed and unbacked symints. This adds non-zero overhead to the CPU. However, in return, we get better reliability when encountering this scenario. This scenario happened when serving an internal model. ### Test ``` $ python test/inductor/test_aot_inductor.py -k test_zero_grid_with_unbacked_symbols OK (skipped=3) $ python test/inductor/test_aot_inductor.py -k test_zero_grid_with_backed_symbols # Before Error: CUDA driver error: invalid argument FAILED (errors=2, skipped=3) # Now OK (skipped=3) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/118654 Approved by: https://github.com/chenyang78, https://github.com/desertfire	2024-02-02 03:19:52 +00:00
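The zero-grid guard from #118654 above can be sketched in plain Python. The `ceil_div` and `maybe_launch` names are hypothetical, and rounding the `s1 / 512` grid up with ceiling division is an assumption made for this sketch; the real check is emitted by AOTInductor's generated wrapper code.

```python
def ceil_div(a, b):
    """Ceiling division for non-negative integers."""
    return -(-a // b)


def maybe_launch(grid, launch):
    """Call `launch(grid)` only when every grid dimension is non-zero.

    Returns True if the kernel was launched, False if it was skipped.
    Hypothetical helper for illustration only.
    """
    if any(dim == 0 for dim in grid):
        return False  # a zero-sized grid would be rejected by the driver
    launch(grid)
    return True


launched = []
for s1 in (4096, 0):
    grid = (ceil_div(s1, 512), 1, 1)  # grid derived from the symbolic size s1
    maybe_launch(grid, launched.append)

print(launched)
```

Only the `s1 = 4096` case reaches the launch callback; the `s1 = 0` case is skipped on the CPU side, at the cost of one extra comparison per grid dimension.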
PyTorch MergeBot	796278b57e	Revert "[inductor] make multi-kernel work with cpp-wrapper (#117813 )" This reverts commit `20484a1936`. Reverted https://github.com/pytorch/pytorch/pull/117813 on behalf of https://github.com/atalman due to broke linux-focal-rocm5.7-py3.8 tests ([comment](https://github.com/pytorch/pytorch/pull/117813#issuecomment-1922613135))	2024-02-02 01:19:19 +00:00
PyTorch MergeBot	dbba1d4bf5	Revert "Some minor type stub improvements (#118529 )" This reverts commit `c978f38bd4`. Reverted https://github.com/pytorch/pytorch/pull/118529 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/118529#issuecomment-1922362331))	2024-02-01 22:18:36 +00:00
Yang Chen	61b572ed56	[inductor] more accurate throughput calculations for kernel benchmarks (#118858 ) Our current throughput calculations for kernel benchmarks have some issues, particularly when we slice inputs in the kernel. In such cases, we count the original inputs as part of the memory traffic passed across the kernel. This is incorrect because it may result in a much larger throughput calculation, which can even exceed the theoretical bandwidth. Instead, we should only count the size of the "slices" that contribute to the actual memory traffic. Pull Request resolved: https://github.com/pytorch/pytorch/pull/118858 Approved by: https://github.com/jansel	2024-02-01 21:42:14 +00:00
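The throughput issue described in #118858 above comes down to which bytes are counted. The following is a rough sketch with made-up sizes and timing (a kernel that reads a 4096 x 256 float32 slice of a 4096 x 4096 input and writes an output of the slice's size); the `throughput_gbps` helper is hypothetical and not Inductor's benchmarking code.

```python
def throughput_gbps(num_bytes, elapsed_ms):
    """Bytes moved per second, expressed in GB/s (1 GB = 1e9 bytes)."""
    return num_bytes / (elapsed_ms * 1e-3) / 1e9


FLOAT32_BYTES = 4
full_input_bytes = 4096 * 4096 * FLOAT32_BYTES  # whole input tensor
slice_bytes = 4096 * 256 * FLOAT32_BYTES        # part the kernel actually reads
output_bytes = slice_bytes                      # output has the slice's shape
elapsed_ms = 0.01                               # made-up kernel time

# Counting the whole input overstates the memory traffic.
overcounted = throughput_gbps(full_input_bytes + output_bytes, elapsed_ms)
# Counting only the slice reflects the bytes actually moved.
corrected = throughput_gbps(slice_bytes + output_bytes, elapsed_ms)

print(f"counting full input: {overcounted:.1f} GB/s")
print(f"counting slice only: {corrected:.1f} GB/s")
```

With these illustrative numbers the overcounted figure is 8.5 times the corrected one, which shows how a sliced input can push a reported throughput past a device's theoretical bandwidth.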
Shunting Zhang	20484a1936	[inductor] make multi-kernel work with cpp-wrapper (#117813) Make multi-kernel work with cpp-wrapper. multi-kernel generates two equivalent variants for a reduction. At runtime the faster one is picked. But cpp-wrapper needs to save the cubin file during codegen, so the two did not work together at first. Thanks Jason for suggesting a neat way to integrate them. cpp-wrapper currently does two codegen passes. For the first pass, we still generate multi-kernel code and run it; for the second pass, we load the cubin file for the faster kernel directly. Multi-kernel Python code is not generated for the second pass since it should not be needed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/117813 Approved by: https://github.com/jansel	2024-02-01 21:29:02 +00:00
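The "generate two variants, pick the faster one at runtime" idea from the commit above can be sketched as follows. `MultiKernelSketch` and `bench` are hypothetical names for illustration; the real multi-kernel and cpp-wrapper code paths in Inductor are more involved.

```python
import time
from typing import Callable, Optional, Sequence


def bench(fn: Callable[[], object], repeat: int = 5) -> float:
    """Best-of-N wall-clock time of fn, in seconds."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


class MultiKernelSketch:
    """Holds equivalent variants; benchmarks once, then reuses the winner."""

    def __init__(self, variants: Sequence[Callable[[], object]],
                 timer: Callable[[Callable[[], object]], float] = bench):
        self.variants = list(variants)
        self.timer = timer
        self.choice: Optional[int] = None  # index of the faster variant

    def __call__(self):
        if self.choice is None:
            timings = [self.timer(v) for v in self.variants]
            self.choice = timings.index(min(timings))
        return self.variants[self.choice]()


if __name__ == "__main__":
    data = list(range(10_000))
    mk = MultiKernelSketch([lambda: sum(data), lambda: sum(x for x in data)])
    print(mk(), "picked variant", mk.choice)
```

In the two-pass scheme the commit describes, the first pass plays the role of the benchmarking call, and the second pass would only need the recorded `choice` to load the winning kernel directly.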
Andrew M. James	9c2b43cc50	[inductor] Handle special values correctly in ir.Scan codegen (#118788) Special values (`NaN`/`+/-Inf`) are not handled correctly during codegen for `ir.Scan` nodes. This is a fairly minor bugfix that has not come up since the only two scan ops with lowerings use "normal" values. Pull Request resolved: https://github.com/pytorch/pytorch/pull/118788 Approved by: https://github.com/peterbell10	2024-02-01 14:54:20 +00:00
Mu-Chu Lee	2b48891e62	[AOTInductor] Add Runtime Constant-folding for AOTInductor (#118765) Summary: Add runtime constant folding for AOTInductor. This also includes invoking constant folding at load time. The constant-folding lowering is a two-step process. First, we split the graph into two modules. One of them is the constant module, which doesn't depend on any input, so the whole module can be inferred (constant-folded) once and reused. The constant module is lowered, codegen-ed as usual, and cached (let's call this the constant code). The constant code reuses the whole lowering/profiling/etc. process; the only difference is that we do not generate any headers or initialization for it. Second, after handling the constant module, we take care of the main module (the part that depends on user input). Compared with a normal lowering, the main module takes one additional component, the constant code. The additional step here is that we inject the constant code into the codegen-ed main module and create the caller that lets the main module consume the result of the constant module. Test Plan: Unit tests included in commit. Differential Revision: D53274382 Pull Request resolved: https://github.com/pytorch/pytorch/pull/118765 Approved by: https://github.com/chenyang78	2024-02-01 04:54:25 +00:00
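A toy sketch of the two-module split described above, using plain Python lists instead of tensors. `ConstantFoldedModel` and its methods are hypothetical and stand in for the generated code; the sketch shows only the control flow: the constant part runs once, and the main part consumes its cached result on every call.

```python
from typing import List, Optional


class ConstantFoldedModel:
    """Toy model computing x + (2 * weight + 1) with the weight term folded."""

    def __init__(self, weight: List[float]):
        self.weight = weight
        self._folded: Optional[List[float]] = None  # cached constant result
        self.fold_calls = 0  # counts how often the constant module ran

    def _constant_module(self) -> List[float]:
        # Depends only on weights, never on user input.
        self.fold_calls += 1
        return [2 * w + 1 for w in self.weight]

    def _main_module(self, x: List[float], folded: List[float]) -> List[float]:
        # Depends on user input and consumes the constant module's result.
        return [xi + fi for xi, fi in zip(x, folded)]

    def run(self, x: List[float]) -> List[float]:
        if self._folded is None:  # fold once, e.g. at load time or first run
            self._folded = self._constant_module()
        return self._main_module(x, self._folded)


if __name__ == "__main__":
    model = ConstantFoldedModel([1.0, 2.0])
    print(model.run([1.0, 2.0]), model.run([0.0, 0.0]), model.fold_calls)
```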
Edward Z. Yang	c978f38bd4	Some minor type stub improvements (#118529 ) I was just playing around with improving the typing of symbolic_shapes. The PR is not "complete" but I in particular wanted to get feedback on whether or not people liked making ValueRanges Generic; it seems that distinguishing if you have an Expr ValueRange or a SympyBoolean ValueRange is a lot of trouble for downstream. Using TypeGuard, we can perform refinements on the generic parameter inside methods, although we still have to cast back to ValueRange[T] due to https://github.com/python/mypy/issues/14425#issuecomment-1914852707 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/118529 Approved by: https://github.com/Skylion007	2024-01-31 20:56:56 +00:00
hodavand	8026534a2f	Add torch.complex128 and torch.complex32 to DTYPE_TO_ATEN dictionary. (#117929 ) Fixes #117370 Pull Request resolved: https://github.com/pytorch/pytorch/pull/117929 Approved by: https://github.com/Skylion007, https://github.com/desertfire	2024-01-31 19:34:58 +00:00
Catherine Lee	4f5785b6b3	Enable possibly-undefined error code (#118533) Fixes https://github.com/pytorch/pytorch/issues/118129 Suppressions automatically added with ``` import re with open("error_file.txt", "r") as f: errors = f.readlines() error_lines = {} for error in errors: match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error) if match: file_path, line_number, error_type = match.groups() if file_path not in error_lines: error_lines[file_path] = {} error_lines[file_path][int(line_number)] = error_type for file_path, lines in error_lines.items(): with open(file_path, "r") as f: code = f.readlines() for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True): code[line_number - 1] = code[line_number - 1].rstrip() + f" # type: ignore[{error_type}]\n" with open(file_path, "w") as f: f.writelines(code) ``` Signed-off-by: Edward Z. Yang <ezyang@meta.com> Co-authored-by: Catherine Lee <csl@fb.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/118533 Approved by: https://github.com/Skylion007, https://github.com/zou3519	2024-01-30 21:07:01 +00:00
Jason Ansel	e332653eb3	[inductor] Use at::detail::empty_strided_* in cpp_wraper mode (#118490 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/118490 Approved by: https://github.com/desertfire	2024-01-30 21:03:19 +00:00
Aaron Gokaslan	1562dae62c	[BE]: Apply RUF025 dict.fromkeys preview rule (#118637 ) Simplifies and optimizes dict construction using the `fromkeys` classmethod ctor. This also makes it really obvious when all the keys will have the same static value, which could be a bug if unintentional. It is also significantly faster than using a dict comprehension. The rule is in preview, but I am adding a forward fix for when it becomes stable. Pull Request resolved: https://github.com/pytorch/pytorch/pull/118637 Approved by: https://github.com/albanD	2024-01-30 20:46:54 +00:00
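A short illustration of the rewrite that the rule above suggests, including the shared-value pitfall the commit message alludes to. The variable names are illustrative only.

```python
keys = ["a", "b", "c"]

# Flagged form: a dict comprehension whose value is the same for every key.
by_comprehension = {k: 0 for k in keys}

# Suggested form: dict.fromkeys builds the same mapping.
by_fromkeys = dict.fromkeys(keys, 0)
assert by_comprehension == by_fromkeys

# Caveat: fromkeys stores the *same* object under every key, so a mutable
# default is shared. A comprehension with a fresh list per key is not
# equivalent to this and should stay a comprehension.
shared = dict.fromkeys(keys, [])
shared["a"].append(1)
assert shared["b"] is shared["a"]

independent = {k: [] for k in keys}
independent["a"].append(1)
assert independent["b"] == []
```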
PyTorch MergeBot	40ece2e579	Revert "Enable possibly-undefined error code (#118533 )" This reverts commit `4f13f69a45`. Reverted https://github.com/pytorch/pytorch/pull/118533 on behalf of https://github.com/clee2000 due to sorry i'm trying to figure out a codev merge conflict, if this works i'll be back to rebase and merge ([comment](https://github.com/pytorch/pytorch/pull/118533#issuecomment-1917695185))	2024-01-30 19:00:34 +00:00
Pearu Peterson	2327879fb6	Add lowering to special.bessel_j0 (2nd try) (#118565 ) This PR is a copy of https://github.com/pytorch/pytorch/pull/118464 that was merged without using pytorchbot. Sorry for the noise! Pull Request resolved: https://github.com/pytorch/pytorch/pull/118565 Approved by: https://github.com/peterbell10	2024-01-30 15:26:59 +00:00
Jiong Gong	e5bb527d3e	[inductor][cpp] support scalar value in vec reduction (#118511 ) Fix https://github.com/pytorch/pytorch/issues/118379 Pull Request resolved: https://github.com/pytorch/pytorch/pull/118511 Approved by: https://github.com/leslie-fang-intel, https://github.com/lezcano, https://github.com/jansel	2024-01-30 13:07:43 +00:00
Colin Peppler	8be6dee14b	[inductor] Fix codegen bug with Native Triton kernels with ReinterpretView args (#118569) Summary: ### Context It's possible for the args of a user-defined Triton kernel to be codegen-ed twice. But this only happens if the arg is a `ReinterpretView`. * First via `arg.codegen_reference()` in `define_user_defined_triton_kernel()` * Second in `self.codegen_kwargs()`. When using `abi_compatible=True`, the duplicate codegen will look like the code below. The issue in the code is that one of the Tensors, internal to the graph, isn't properly freed. This scenario was eventually exposed as a memory leak when we re-ran an AOTInductor model many times and observed `memory.used` increase after each iteration. ``` auto tmp_tensor_handle_0 = reinterpret_tensor_wrapper(buf1, 2, int_array_0, int_array_1, 0L); auto tmp_tensor_handle_1 = reinterpret_tensor_wrapper(buf1, 2, int_array_0, int_array_1, 0L); ... // There's no wrap_with_raii_handle_if_needed() for tmp_tensor_handle_0. // And there's no reference to tmp_tensor_handle_0. // Thus, tmp_tensor_handle_0 is left as an AtenTensorHandle which isn't // automatically cleaned-up like RAIIAtenTensorHandle CUdeviceptr var_6; aoti_torch_get_data_ptr(wrap_with_raii_handle_if_needed(tmp_tensor_handle_1), reinterpret_cast<void**>(&var_6)); void* kernel_args_var_2[] = {..., &var_6, ...}; launchKernel(kernels.add_kernel_0, ..., kernel_args_var_2); ``` ### Solution We just need the arg's buffer name when creating the `TensorArg` in `define_user_defined_triton_kernel()`. Thus, just return the buffer's name and avoid any potential side-effects with `arg.codegen_reference()`. 
Test Plan: ### Inspect device memory allocated ``` # Before diff 0 device memory 2048 1 device memory 2560 2 device memory 3072 3 device memory 3584 4 device memory 4096 5 device memory 4608 # With diff (memory usage doesn't grow) 0 device memory 1536 1 device memory 1536 2 device memory 1536 3 device memory 1536 4 device memory 1536 5 device memory 1536 ``` Reviewed By: jingsh, tissue3 Differential Revision: D53190934 Pull Request resolved: https://github.com/pytorch/pytorch/pull/118569 Approved by: https://github.com/oulgen	2024-01-30 05:19:32 +00:00
Edward Z. Yang	4f13f69a45	Enable possibly-undefined error code (#118533) Fixes https://github.com/pytorch/pytorch/issues/118129 Suppressions automatically added with ``` import re with open("error_file.txt", "r") as f: errors = f.readlines() error_lines = {} for error in errors: match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error) if match: file_path, line_number, error_type = match.groups() if file_path not in error_lines: error_lines[file_path] = {} error_lines[file_path][int(line_number)] = error_type for file_path, lines in error_lines.items(): with open(file_path, "r") as f: code = f.readlines() for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True): code[line_number - 1] = code[line_number - 1].rstrip() + f" # type: ignore[{error_type}]\n" with open(file_path, "w") as f: f.writelines(code) ``` Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/118533 Approved by: https://github.com/Skylion007, https://github.com/zou3519	2024-01-30 05:08:10 +00:00
Jiong Gong	04c1df651a	[inductor][cpp] enable vectorization with constant bool (#118380 ) Related model DebertaForQuestionAnswering etc. For DebertaForQuestionAnswering, single thread, measured on ICX: Before: 0.990x, After: 1.043x Pull Request resolved: https://github.com/pytorch/pytorch/pull/118380 Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel	2024-01-29 13:31:22 +00:00
leslie-fang-intel	ee3dfbbe47	[Inductor] Fix Argmax codegen with Nan input (#118358) Summary Fix issue https://github.com/pytorch/pytorch/issues/118266: currently `torch.argmax` and `torch.argmin` return different values in eager mode and in the Inductor cpp backend when the input contains `NaN` values. Align the cpp backend results with eager by reusing the compare function. Test Plan ``` python -u -m pytest -s -v test_cpu_repro.py -k test_argmin_cpu_only python -u -m pytest -s -v test_cpu_repro.py -k test_argmax_argmin_with_nan_value ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/118358 Approved by: https://github.com/lezcano, https://github.com/jgong5, https://github.com/jansel	2024-01-29 09:09:46 +00:00
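A pure-Python sketch of a NaN-aware compare for argmax/argmin. It assumes the convention that a `NaN` wins the comparison for both reductions and that ties go to the lower index; `argreduce` is a hypothetical helper, not the compare function the commit reuses.

```python
import math
from typing import Sequence


def argreduce(values: Sequence[float], want_max: bool) -> int:
    """Index of the max (or min) element, treating NaN as the winner.

    Assumed convention: the first NaN wins; otherwise the first extreme wins.
    """
    best_idx = 0
    for idx in range(1, len(values)):
        cur, best = values[idx], values[best_idx]
        if math.isnan(best):
            continue  # an earlier NaN already won; nothing can replace it
        better = cur > best if want_max else cur < best
        if math.isnan(cur) or better:
            best_idx = idx
    return best_idx


if __name__ == "__main__":
    nan = float("nan")
    print(argreduce([1.0, nan, 3.0], want_max=True))
    print(argreduce([1.0, nan, 0.0], want_max=False))
```

A plain `>` or `<` comparison is always false when either side is `NaN`, so a reduction written without the explicit `isnan` checks would silently skip `NaN` entries. That kind of mismatch with the eager result is what the commit describes.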
Edward Z. Yang	2951bbf0f7	Add some type annotations to torch._inductor.codegen.wrapper (#118491 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/118491 Approved by: https://github.com/Skylion007	2024-01-29 06:17:27 +00:00
Edward Z. Yang	cad79bd0bb	Remove follow_imports = skip from sympy (#118469) dmypy silently ignores follow_imports = skip, so to get parity between dmypy and mypy we have to suck it up and type: ignore all of the sympy typing problems. The suppressions were added automatically with the following script generated by GPT-4: ``` import re # Read the error file with open("error_file.txt", "r") as f: errors = f.readlines() # Parse the lines with errors and error types error_lines = {} for error in errors: match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error) if match: file_path, line_number, error_type = match.groups() if file_path not in error_lines: error_lines[file_path] = {} error_lines[file_path][int(line_number)] = error_type # Insert ignore comments in the source files for file_path, lines in error_lines.items(): with open(file_path, "r") as f: code = f.readlines() for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True): code[line_number - 1] = code[line_number - 1].rstrip() + f" # type: ignore[{error_type}]\n" with open(file_path, "w") as f: f.writelines(code) ``` Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/118469 Approved by: https://github.com/Skylion007 ghstack dependencies: #118414, #118418, #118432, #118467, #118468	2024-01-28 13:38:38 +00:00

771 Commits