pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 00:21:07 +01:00

Author	SHA1	Message	Date
etaf	7a6cb9fdfb	[Inductor Intel GPU backend Upstream] Step 1/3: Generalize device-bias code in code generation. (#116020 ) As the [RFC](https://github.com/pytorch/pytorch/issues/114856) mentions, this is step 1 of adding the Intel GPU backend as an alternative inductor backend. ### Design Typically, in order to integrate the Intel GPU backend into Inductor, we need to inherit from `WrapperCodegen` and `TritonScheduling` and implement the corresponding subclasses. However, since `WrapperCodegen` and `TritonScheduling` have some device-bias code generation scattered in their methods, overriding them in subclasses would introduce a lot of duplicated parent class code. For example: `2a44034895/torch/_inductor/codegen/wrapper.py (L487)` `2a44034895/torch/_inductor/codegen/triton.py (L1996)` So we abstract the device-bias code scattered in `WrapperCodegen` and `TritonScheduling` and provide a unified interface, `DeviceOpOverrides`. This way, when integrating a new backend, we can maximize the reuse of `WrapperCodegen` and `TritonScheduling` code by inheriting from and implementing this interface for device flexibility. Currently `DeviceOpOverrides` only covers Python wrapper code generation. We can further extend it to cover Cpp wrapper code generation on demand. Pull Request resolved: https://github.com/pytorch/pytorch/pull/116020 Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/jansel	2023-12-22 08:42:51 +00:00
Yifu Wang	7d0ad6e870	Make native c10d_functional ops work with AOTInductor (#113735 ) Summary: - Revised `c10d_functional` ops to conform to https://github.com/pytorch/pytorch/tree/main/aten/src/ATen/native#func - Modified `get_cpp_op_schema()` to handle mutable args and aliasing returns Pull Request resolved: https://github.com/pytorch/pytorch/pull/113735 Approved by: https://github.com/desertfire ghstack dependencies: #113438	2023-12-22 08:12:13 +00:00
Shunting Zhang	99f7e721fe	[inductor] make inductor work with new triton compile interface (#115878 ) Two recent triton PRs (https://github.com/openai/triton/pull/2701, https://github.com/openai/triton/pull/2756) change the interface for triton.compile; this PR adds the necessary changes on the inductor side to work with both the old and new compile API. There is also some simplification between the compilation call in the subprocess and the one in the main process - previously we passed warm_cache_only=True if the compilation happened in a subprocess. But triton never uses that argument in the currently used pin, so I removed it - previously we only passed compute_capability if the compilation happened in a subprocess. This PR changes that to always pass compute_capability to triton.compile, whether the compilation happens in the main process or a subprocess. Updated: There are more interface changes on the triton side. E.g. - tl.math.{min, max} now requires a propagate_nan argument - JITFunction.run now requires a warmup argument. This affects the benchmarking phase of matmul max-autotune; on the other hand, JITFunction.run now forbids the stream argument. Simply no longer passing it when benchmarking the matmul triton kernel works for both old and new versions of triton. - triton Autotuner changed attribute names from 'warmup' to 'num_warmup' and from 'rep' to 'num_rep'. This caused dynamo to fail to handle triton Autotuner objects, since dynamo's TritonKernelVariable makes assumptions about attribute names. It's used in some test cases where a model calls the triton Autotuner directly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/115878 Approved by: https://github.com/jansel	2023-12-22 00:09:29 +00:00
PyTorch MergeBot	db35ccf463	Revert "[innductor] make inductor work with new triton compile interface (#115878 )" This reverts commit `bbded928b3`. Reverted https://github.com/pytorch/pytorch/pull/115878 on behalf of https://github.com/kit1980 due to Broke ROCm https://github.com/pytorch/pytorch/actions/runs/7282149837/job/19844618618 ([comment](https://github.com/pytorch/pytorch/pull/115878#issuecomment-1865369349))	2023-12-21 02:00:17 +00:00
Shunting Zhang	bbded928b3	[innductor] make inductor work with new triton compile interface (#115878 ) Two recent triton PRs (https://github.com/openai/triton/pull/2701, https://github.com/openai/triton/pull/2756) change the interface for triton.compile; this PR adds the necessary changes on the inductor side to work with both the old and new compile API. There is also some simplification between the compilation call in the subprocess and the one in the main process - previously we passed warm_cache_only=True if the compilation happened in a subprocess. But triton never uses that argument in the currently used pin, so I removed it - previously we only passed compute_capability if the compilation happened in a subprocess. This PR changes that to always pass compute_capability to triton.compile, whether the compilation happens in the main process or a subprocess. Updated: There are more interface changes on the triton side. E.g. - tl.math.{min, max} now requires a propagate_nan argument - JITFunction.run now requires a warmup argument. This affects the benchmarking phase of matmul max-autotune; on the other hand, JITFunction.run now forbids the stream argument. Simply no longer passing it when benchmarking the matmul triton kernel works for both old and new versions of triton. - triton Autotuner changed attribute names from 'warmup' to 'num_warmup' and from 'rep' to 'num_rep'. This caused dynamo to fail to handle triton Autotuner objects, since dynamo's TritonKernelVariable makes assumptions about attribute names. It's used in some test cases where a model calls the triton Autotuner directly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/115878 Approved by: https://github.com/jansel	2023-12-21 00:03:38 +00:00
Bin Bao	a597a00c87	[AOTI][refactor][3/n] Declare python_kernel_name and cpp_kernel_name in ExternKernel (#115972 ) Summary: Both ExternKernelAlloc and ExternKernelOut need the two fields, so declaring them in the base class. Also add cpp codegen for IndexPutFallback and InplaceBernoulliFallback in this PR. This is a reland of https://github.com/pytorch/pytorch/pull/115831 Differential Revision: [D52290900](https://our.internmc.facebook.com/intern/diff/D52290900) Pull Request resolved: https://github.com/pytorch/pytorch/pull/115972 Approved by: https://github.com/chenyang78	2023-12-20 03:22:03 +00:00
Oguz Ulgen	c55210b4f0	[Inductor] Deduplicate grid wrapper statements for user defined triton kernels (#115849 ) Noticed that on many MRS kernels the grid wrapper for autotuning is huge with a bunch of duplicates due to num_warps and num_stages not being needed for grid calculation. Lets deduplicate these entries. Previously, we would see wrapper like ``` def grid_wrapper_for_add_kernel_2d_autotuned_0(meta): if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1) if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1) if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1) if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1) ``` now it looks like ``` def grid_wrapper_for_add_kernel_2d_autotuned_0(meta): if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1) if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/115849 Approved by: https://github.com/jansel	2023-12-20 00:25:32 +00:00
PyTorch MergeBot	c539f7df10	Revert "[Inductor] Deduplicate grid wrapper statements for user defined triton kernels (#115849 )" This reverts commit `21b8127f1c`. Reverted https://github.com/pytorch/pytorch/pull/115849 on behalf of https://github.com/jeanschmidt due to Breaking internal tests, please check internal diff for more details ([comment](https://github.com/pytorch/pytorch/pull/115849#issuecomment-1863012933))	2023-12-19 15:47:55 +00:00
PyTorch MergeBot	d5115bfb06	Revert "[AOTI][refactor][3/n] Declare python_kernel_name and cpp_kernel_name in ExternKernel (#115831 )" This reverts commit `287a865677`. Reverted https://github.com/pytorch/pytorch/pull/115831 on behalf of https://github.com/desertfire due to rocm CI failure ([comment](https://github.com/pytorch/pytorch/pull/115831#issuecomment-1858322270))	2023-12-15 18:34:55 +00:00
Bin Bao	287a865677	[AOTI][refactor][3/n] Declare python_kernel_name and cpp_kernel_name in ExternKernel (#115831 ) Summary: Both ExternKernelAlloc and ExternKernelOut need the two fields, so declaring them in the base class. Also add cpp codegen for IndexPutFallback and InplaceBernoulliFallback in this PR. Differential Revision: [D52189999](https://our.internmc.facebook.com/intern/diff/D52189999) Pull Request resolved: https://github.com/pytorch/pytorch/pull/115831 Approved by: https://github.com/chenyang78	2023-12-15 14:40:44 +00:00
Bin Bao	7d4ccd7b9e	[AOTI][refactor][2/n] Rename kernel to python_kernel_name (#115766 ) Differential Revision: [D52164940](https://our.internmc.facebook.com/intern/diff/D52164940) Pull Request resolved: https://github.com/pytorch/pytorch/pull/115766 Approved by: https://github.com/chenyang78 ghstack dependencies: #115783	2023-12-15 03:08:13 +00:00
Oguz Ulgen	21b8127f1c	[Inductor] Deduplicate grid wrapper statements for user defined triton kernels (#115849 ) Noticed that on many MRS kernels the grid wrapper for autotuning is huge with a bunch of duplicates due to num_warps and num_stages not being needed for grid calculation. Lets deduplicate these entries. Previously, we would see wrapper like ``` def grid_wrapper_for_add_kernel_2d_autotuned_0(meta): if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1) if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1) if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1) if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1) ``` now it looks like ``` def grid_wrapper_for_add_kernel_2d_autotuned_0(meta): if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1) if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/115849 Approved by: https://github.com/jansel	2023-12-14 23:26:04 +00:00
Scott Wolchok	81321baf5c	[PyTorch] Remove ArrayRefTensor::dtype (#113578 ) Knocks off a few nanoseconds from CPU inference due to not having to set this field; paths that would've needed it are expensive anyway. Differential Revision: [D51182794](https://our.internmc.facebook.com/intern/diff/D51182794/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/113578 Approved by: https://github.com/khabinov, https://github.com/Neilblaze ghstack dependencies: #112800, #113577	2023-12-13 21:32:14 +00:00
Scott Wolchok	b9af126908	[PyTorch] Add input numel assert for minimal arrayref interface (#113577 ) We currently have no shape checking on CPU IIUC. Now we at least do numel checking for the minimal arrayref interface. Differential Revision: [D51165703](https://our.internmc.facebook.com/intern/diff/D51165703/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/113577 Approved by: https://github.com/chenyang78, https://github.com/jansel ghstack dependencies: #112800	2023-12-13 21:31:55 +00:00
Scott Wolchok	f9cf6ae889	[PyTorch] AOTI: add minimal arrayref interface (#112800 ) This implements an optional alternate interface to the AOTI generated DSO, intended to increase efficiency for models running on CPU and requiring minimal overhead. See comment in config.py for more explanation. This took a while to get right (e.g., I initially required 1-D MiniArrayRef<T> for the inputs, but found that multi-dimensional ArrayRefTensor<T> ended up simplifying the implementation and allowed test_aot_inductor.py to run) and is somewhat intricate, so I am anticipating that review will require some back-and-forth. Differential Revision: [D50699890](https://our.internmc.facebook.com/intern/diff/D50699890/) NOTE FOR REVIEWERS: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D50699890/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/112800 Approved by: https://github.com/chenyang78	2023-12-13 12:06:35 +00:00
Scott Wolchok	2b323e61ad	[PyTorch] AOTI: Use static_cast, not dynamic_cast (#112798 ) dynamic_cast is for when we aren't certain about the type. We are certain (and will crash anyway if we're wrong). Differential Revision: [D50812978](https://our.internmc.facebook.com/intern/diff/D50812978/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/112798 Approved by: https://github.com/chenyang78, https://github.com/desertfire, https://github.com/jansel, https://github.com/khabinov ghstack dependencies: #112116, #112174, #112405	2023-12-12 06:19:45 +00:00
Scott Wolchok	ca52195112	[PyTorch] AOTI: Avoid aoti_torch_data_ptr calls for constants at inference time (#112405 ) Cache aoti_torch_get_data_ptr at constants update time. Differential Revision: [D50708982](https://our.internmc.facebook.com/intern/diff/D50708982/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/112405 Approved by: https://github.com/chenyang78, https://github.com/desertfire, https://github.com/khabinov ghstack dependencies: #112116, #112174	2023-12-12 06:19:45 +00:00
Scott Wolchok	24c67fe8cf	[PyTorch] AOTI: Emit static constexpr int array vars when possible (#112174 ) No need to populate a stack-based array for a shape/stride array when it's statically known. Differential Revision: [D50699889](https://our.internmc.facebook.com/intern/diff/D50699889/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/112174 Approved by: https://github.com/chenyang78, https://github.com/desertfire, https://github.com/jansel ghstack dependencies: #112116	2023-12-12 06:19:45 +00:00
Scott Wolchok	ff6f987adc	[PyTorch] Replace cached thread_locals with stack allocation in AOTI (#112116 ) This changes cached thread_local tensors to stack-allocated buffers. Since we were incidentally caching output in a thread_local, I had to add manual thread_local caching of outputs, which I implemented by caching a buffer and a Tensor whose storage is that buffer and then just memcpying the result into the cached buffer every time. Ideally, memory planning would be able to identify allocations that are the backing storage for outputs, but this should be good enough in the absence of planning. Differential Revision: [D50416438](https://our.internmc.facebook.com/intern/diff/D50416438/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/112116 Approved by: https://github.com/jansel, https://github.com/desertfire	2023-12-12 06:19:45 +00:00
Bin Bao	2e6b809d6b	[AOTI] Fix a missing declaration for the result of item() (#115175 ) Differential Revision: [D51968539](https://our.internmc.facebook.com/intern/diff/D51968539) Pull Request resolved: https://github.com/pytorch/pytorch/pull/115175 Approved by: https://github.com/chenyang78	2023-12-10 22:49:45 +00:00
Mu-Chu Lee	80527c0cf2	[AOTInductor] Double buffering for Weights (#114446 ) Summary: This adds function to model container doing weight swapping with double buffering. There are 2 parts for double buffering a) Write constants into inactive buffer b) Swap active buffer For (a), we write the constants into the buffer that's currently not in use, and store the information in both constants map and the corresponding constant array to read. For (b), we obtain the lock, and activate the constant map/constant array that is inactive, and flag the one that's currently in use to inactive. Test Plan: test/cpp/aot_inductor/test.cpp Reviewers: Subscribers: Tasks: Tags: Differential Revision: [D51543732](https://our.internmc.facebook.com/intern/diff/D51543732) Pull Request resolved: https://github.com/pytorch/pytorch/pull/114446 Approved by: https://github.com/chenyang78, https://github.com/eellison	2023-12-05 22:31:56 +00:00
Yang Chen	4d8b9964e1	[aotinductor] support at::convolution for AOTInductor (#114961 ) This PR adds support to at::convolution for AOTInductor Pull Request resolved: https://github.com/pytorch/pytorch/pull/114961 Approved by: https://github.com/desertfire	2023-12-03 07:52:28 +00:00
Bin Bao	8a90249bc2	[inductor] Update triton pin (#114772 ) Differential Revision: [D51761353](https://our.internmc.facebook.com/intern/diff/D51761353) Pull Request resolved: https://github.com/pytorch/pytorch/pull/114772 Approved by: https://github.com/shunting314, https://github.com/atalman	2023-12-02 19:13:56 +00:00
chilli	1f51f977ae	misc visualization/utility improvements (#114984 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/114984 Approved by: https://github.com/weifengpy ghstack dependencies: #114520	2023-12-02 04:02:39 +00:00
Jez Ng	f1fd02503b	Reland #113487 and #112527 (sdpa shim & fp8 AOTInductor support) (#114974 ) This is a backout of #113747 which reverted the above two commits. Now that #113997 has landed, this diff can be landed safely without breaking ABI compatibility. Pull Request resolved: https://github.com/pytorch/pytorch/pull/114974 Approved by: https://github.com/chenyang78	2023-12-02 03:25:51 +00:00
Mu-Chu Lee	a9aad4ea21	[AOTInductor] Generate Triton header even if scheduler is not invoked. (#114972 ) Summary: Generate Triton header for profiling. If Triton header isn't generated through Scheduler, generate it directly when in wrapper codegen. Test Plan: Test included in commit. (test_aot_inductor.py:test_with_no_triton_profiler) Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/114972 Approved by: https://github.com/chenyang78, https://github.com/desertfire	2023-12-02 02:03:38 +00:00
chunyuan	e3c42d3fb3	Inductor cpp wrapper: fix buffer free in non-AOT mode (#114741 ) We found a performance regression when using the cpp wrapper in non-AOT mode due to the change in https://github.com/pytorch/pytorch/pull/110892. https://github.com/pytorch/pytorch/pull/110892 only handles the buffer cache in AOT mode but removes the `reset` call without checking whether AOT mode is on or off. This PR updates the buffer free change to only happen when `V.graph.aot_mode is True`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/114741 Approved by: https://github.com/jgong5, https://github.com/desertfire	2023-11-30 16:46:55 +00:00
colinpeppler	5262484ece	[easy][aotinductor] fix typos & add static typing (#114728 ) ``` // check all references $ grep -rl 'cpp_kernel_overlad_name' * ir.py ``` ``` $ lintrunner --take MYPYINDUCTOR torch/_inductor/codegen/wrapper.py torch/_inductor/ir.py ok No lint issues. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/114728 Approved by: https://github.com/Skylion007, https://github.com/chenyang78	2023-11-30 02:10:56 +00:00
Jack Taylor	4a4c9fb0b8	[ROCm] Add ROCm AMDGPU support for inductor cpp codegen (#105141 ) Follows from previous enablement attempt: https://github.com/pytorch/pytorch/pull/101797 Adds support for hsaco binaries in inductor's cpp_wrapper codegen and enables the CUDA tests in test_cpp_wrapper. This PR also brings in additional required hipify mappings for the wrapper codegen file. NOTE: we can unskip some of these tests when we enabled MI210 runners. Pull Request resolved: https://github.com/pytorch/pytorch/pull/105141 Approved by: https://github.com/jansel, https://github.com/malfet	2023-11-29 15:11:24 +00:00
Scott Wolchok	5b9add666f	[PyTorch] AOTI: Emit CACHED_TORCH_TYPE only as needed (#113997 ) Avoids potential compatibility issues where a new dtype is supported by the DSO but not the binary loading it. Differential Revision: [D51434335](https://our.internmc.facebook.com/intern/diff/D51434335/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/113997 Approved by: https://github.com/int3	2023-11-29 03:12:32 +00:00
Adnan Akhundov	0a063ad2c0	[inductor] Pass None and skip constexpr in custom Triton kernel calls from C++ (#114475 ) Summary: `None` arguments are codegened as `*i8` in the `triton_meta` of the generated or user-defined Triton kernels: `85aa372374/torch/_inductor/codegen/triton_utils.py (L33-L36)` Due to this, in contrary to the conventional Triton, we actually should pass `nullptr` to the Triton kernels in C++ wrapper codegen instead of passing nothing (as normally `None` doesn't make it to the generated PTX parameters, just like `tl.constexpr` args). This PR adds two things: 1. Proper C++ wrapper codegening (ABI and non-ABI) of `nullptr` and `c10::nullopt`, as the prior way codegening `c10::nullopt` as tensor breaks (also `c10` breaks in the ABI mode). 2. Skipping `tl.constexpr` args when calling the loaded-from-cubin compiled Triton kernel in the C++ wrapper codegen. As a side effect, this also resolves an issue with string arguments: now they are simply omitted in the C++ wrapper codegen. Test Plan: ``` $ python test/inductor/test_aot_inductor.py -k test_triton_kernel_with_none_input ... ---------------------------------------------------------------------- Ran 4 tests in 40.364s OK (skipped=2) ``` Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/114475 Approved by: https://github.com/oulgen	2023-11-24 12:51:56 +00:00
Yang Chen	ebeaec71bf	[aotinductor] don't generate python profiling code in the cpp world (#114182 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/114182 Approved by: https://github.com/aakhundov, https://github.com/desertfire	2023-11-21 21:11:58 +00:00
Oguz Ulgen	ef90508f75	[AOTI] Support ReinterpretView in abi mode (#114169 ) https://github.com/pytorch/pytorch/pull/113967 added support for ReinterpretView, but it turns out we codegen it differently in abi compat mode. This PR adds support for abi compat mode as well. Pull Request resolved: https://github.com/pytorch/pytorch/pull/114169 Approved by: https://github.com/aakhundov	2023-11-21 17:08:00 +00:00
Jez Ng	87925789ae	Make V.graph properly typed (#114025 ) Previously it lacked a type hint and so was treated as an Any type. This resulted in a lot of untyped code downstream as V.graph is referenced in many places in inductor code. I've typed it properly now as GraphLowering, and fixed the numerous type errors this surfaced. Pull Request resolved: https://github.com/pytorch/pytorch/pull/114025 Approved by: https://github.com/eellison ghstack dependencies: #114013	2023-11-21 02:14:29 +00:00
Adnan Akhundov	ae00d9623e	[inductor] Add ABI shim function for torch.scatter (#114027 ) Summary: Scatter fallback calls `at::scatter` in the C++ wrapper codegen. This doesn't work in the ABI compatibility mode, as the latter requires a shim function. One is added in this PR. Test Plan: ``` $ python test/inductor/test_aot_inductor.py -k test_scatter_fallback s... ---------------------------------------------------------------------- Ran 4 tests in 52.713s OK (skipped=1) ``` Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/114027 Approved by: https://github.com/chenyang78, https://github.com/desertfire ghstack dependencies: #114024	2023-11-20 22:51:59 +00:00
Oguz Ulgen	e0c3936843	[Inductor] Support ReinterpretView in inductor codegen (#113967 ) Adding support for ReinterpretView in inductor so that jagged MRS kernels can use native triton kernels Pull Request resolved: https://github.com/pytorch/pytorch/pull/113967 Approved by: https://github.com/aakhundov	2023-11-18 18:19:32 +00:00
Bin Bao	1480c670a0	[AOTI] Delay the fallback kernel naming decision to the codegen time (#113660 ) Summary: This is to prepare for a later change that changes AOTI's second-pass to perform codegen only. Differential Revision: [D51382677](https://our.internmc.facebook.com/intern/diff/D51382677) Pull Request resolved: https://github.com/pytorch/pytorch/pull/113660 Approved by: https://github.com/chenyang78	2023-11-16 23:07:30 +00:00
Wei Wei	b19cf868e8	Back out "Support fp8 in AOTInductor + support optional<> in C ABI (#112527 )" (#113747 ) Test Plan: sandcastle Differential Revision: D51330618 Pull Request resolved: https://github.com/pytorch/pytorch/pull/113747 Approved by: https://github.com/chenyang78, https://github.com/khabinov	2023-11-15 22:42:22 +00:00
Yang Chen	a144eb502a	[aotinductor] add versions for the sdpa shim api (#113487 ) In our first implementation of the sdpa shim api, we didn't consider the case where the optional scale argument could be None. It went unnoticed because we always got a default argument for the cuda backend. The issue was detected with the cpu backend. This PR implements versioning for shim kernels. Currently, we only have different versions for the sdpa api. We expect to maintain only a very small number of abi-compatible shim APIs that have different versions. Pull Request resolved: https://github.com/pytorch/pytorch/pull/113487 Approved by: https://github.com/int3, https://github.com/desertfire	2023-11-13 20:18:58 +00:00
Oguz Ulgen	6ea20f5dc5	[AOTI] Use expr_printer to print sympy expr (#113317 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/113317 Approved by: https://github.com/aakhundov, https://github.com/chenyang78	2023-11-13 20:14:04 +00:00
Jez Ng	7afb503e3c	[inductor] Label align() with [[maybe_unused]] (#113502 ) This squelches the "defined but not used" warning that occurs when memory planning is disabled. Pull Request resolved: https://github.com/pytorch/pytorch/pull/113502 Approved by: https://github.com/jansel	2023-11-12 16:33:47 +00:00
Jez Ng	5e03af8295	[inductor] Enable floor_div indexing to work under ABI-compat mode (#113276 ) Previously, floor_div operations were defined in ATen/native/BinaryOps.h. Since this header was not included under ABI-compat mode, trying to use those indexing operations would result in compilation errors. Technically, it is safe to use aten::native::floor_div_* functions in ABI-compat mode as they are header-only; we could simply include BinaryOps.h. However, there are other declarations in BinaryOps.h that are not binary-compatible, so this is not ideal. Thus, I have moved those functions into a separate file, and put them under c10/util, since they don't really have tensor-specific logic. c10 functions are not all header-only, so this still isn't ideal, but this still seems like an improvement. Moreover, cpp_prefix.h -- used when compiling cpp kernels -- already includes c10 header files, so ABI-compatibility already depends on maintaining some c10 functions as header-only. Pull Request resolved: https://github.com/pytorch/pytorch/pull/113276 Approved by: https://github.com/chenyang78, https://github.com/desertfire	2023-11-11 02:51:29 +00:00
Oguz Ulgen	06dc2f162d	[AOTI] Implement support for user defined kernels that use triton.autotune (#113229 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/113229 Approved by: https://github.com/chenyang78	2023-11-10 22:40:51 +00:00
Jez Ng	a2c32b8bd0	[inductor] Make codegen/{common,wrapper,cuda/cutlass_utils}.py pass follow_imports typechecking (#113411 ) SymIntType is referenced by wrapper.py, so I added its .pyi definition. I also added SymBoolType along the way for completeness. The `isinstance` checks in wrapper.py reference torch.Type, which seems to cause mypy to choke. Not entirely sure why; I've just added type-ignore comments for now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/113411 Approved by: https://github.com/Skylion007 ghstack dependencies: #113409, #113410	2023-11-10 19:58:08 +00:00
PyTorch MergeBot	2cd8c0565c	Revert "[AOTI] Implement support for user defined kernels that use triton.autotune (#113229 )" This reverts commit `1488bafb27`. Reverted https://github.com/pytorch/pytorch/pull/113229 on behalf of https://github.com/PaliC due to breaking test_aot_inductor.py tests though a forward fix is coming ([comment](https://github.com/pytorch/pytorch/pull/113229#issuecomment-1806159396))	2023-11-10 17:46:14 +00:00
Oguz Ulgen	1488bafb27	[AOTI] Implement support for user defined kernels that use triton.autotune (#113229 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/113229 Approved by: https://github.com/chenyang78	2023-11-10 01:39:00 +00:00
Jez Ng	297c26bb8e	Support fp8 in AOTInductor + support optional<> in C ABI (#112527 ) This was originally ipiszy's PR: https://github.com/pytorch/pytorch/pull/112358 It turns out that we need to add support for optional types in order to support fp8 gemm (i.e. scaled_mm). Since our ABI-stable C interface can't support optional<> directly, I am passing in optional types via pointer instead. `AtenTensorHandle`s are already pointers, so nothing needs to change there. Only value types need to change. We decided on this approach instead of adding an extra `bool` param to the callee because this simplifies things. Having the same number of arguments regardless of whether we are emitting Python / C++ / ABI-compatible C++ makes codegen easier. There are a number of existing ABI-compatible functions that have optional-typed value parameters. Previously, they just assumed they would never be passed a `nullopt` / `None` at runtime. Changing them to use pointer types now would break ABI stability, so I have created an exclude list for those functions. Finally, I think the current implementation is kind of messy, and only works for FallbackKernels, even though technically ExternKernels could also have the same issue. It also doesn't support optional types nested in lists. I've left FIXME comments for both issues. Differential Revision: [D51084289](https://our.internmc.facebook.com/intern/diff/D51084289) Pull Request resolved: https://github.com/pytorch/pytorch/pull/112527 Approved by: https://github.com/chenyang78, https://github.com/desertfire	2023-11-08 22:56:48 +00:00
Oguz Ulgen	8ba11bf79d	[AOTI] Support non auto-tuned triton kernels in aoti (#113090 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/113090 Approved by: https://github.com/aakhundov, https://github.com/chenyang78, https://github.com/desertfire	2023-11-08 07:48:15 +00:00
Oguz Ulgen	dbf44dffc9	[Inductor] Cache generated user defined triton kernels on tensor dtype and non tensor parameters (#112752 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/112752 Approved by: https://github.com/jansel	2023-11-07 05:29:16 +00:00
Oguz Ulgen	13d62e28a3	[Inductor] Add Dynamic shape support to user defined triton kernels (#112523 ) 1) This PR moves the grid function codegen to wrapper so that we can use IndentBuffers as opposed to manually adding tabs for indentation. 2) In inductor, emits the grid function in the body of the kernel call so that it can use free symbols from dynamic shapes Pull Request resolved: https://github.com/pytorch/pytorch/pull/112523 Approved by: https://github.com/Chillee	2023-11-02 23:58:50 +00:00
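
Note on #115849 (grid wrapper deduplication): the commit message shows the generated wrapper before and after, but not how duplicates are dropped. The following is a minimal sketch of one way to do it, not the code from the PR. The helper names `dedupe_grid_entries` and `render_grid_wrapper`, the `grid_keys` parameter, and the 512x256 problem size are all hypothetical and chosen only so the output matches the example in the commit message.

```python
# Hypothetical sketch: autotune configs that differ only in num_warps or
# num_stages produce identical grid guards, so emit each guard once.

def dedupe_grid_entries(configs, grid_fn, grid_keys):
    """Return unique (guard, grid) pairs in first-seen order."""
    seen = set()
    entries = []
    for cfg in configs:
        # Only the keys that affect the grid go into the guard condition.
        guard = tuple((key, cfg[key]) for key in grid_keys)
        grid = grid_fn(cfg)
        if (guard, grid) not in seen:
            seen.add((guard, grid))
            entries.append((guard, grid))
    return entries


def render_grid_wrapper(name, entries):
    """Render the deduplicated entries as Python source for a grid wrapper."""
    lines = [f"def {name}(meta):"]
    for guard, grid in entries:
        cond = " and ".join(f"meta[{key!r}] == {value!r}" for key, value in guard)
        lines.append(f"    if {cond}: return {grid!r}")
    return "\n".join(lines)


# Four autotune configs; two pairs differ only in num_warps.
configs = [
    {"BLOCK_SIZE_X": 128, "BLOCK_SIZE_Y": 128, "num_warps": 4},
    {"BLOCK_SIZE_X": 128, "BLOCK_SIZE_Y": 128, "num_warps": 8},
    {"BLOCK_SIZE_X": 64, "BLOCK_SIZE_Y": 64, "num_warps": 4},
    {"BLOCK_SIZE_X": 64, "BLOCK_SIZE_Y": 64, "num_warps": 8},
]

# Assumed problem size of 512 x 256 elements, purely for illustration.
entries = dedupe_grid_entries(
    configs,
    grid_fn=lambda c: (512 // c["BLOCK_SIZE_X"], 256 // c["BLOCK_SIZE_Y"], 1),
    grid_keys=("BLOCK_SIZE_X", "BLOCK_SIZE_Y"),
)
source = render_grid_wrapper("grid_wrapper_for_add_kernel_2d_autotuned_0", entries)
print(source)
```

Keying the `seen` set on both the guard and the resulting grid keeps the sketch conservative: two configs are merged only when they would emit an identical `if` line.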


252 Commits