pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 00:21:07 +01:00

Author	SHA1	Message	Date
Boyuan Feng	f3e5078c27	[Inductor] Relax size constraints for re-inplacing (#143884 ) Current reinplacing requires input buffer and output buffer has exactly the same storage size. However, matmul padding may increase the tensor size slightly for better performance, which prevents reinplacing. This PR changes the size constraints to be: - input and output buffer have exactly the same symbolic expression for storage size (i.e., sympy str). - it's statically known that 0.99 * input_size <= output_size <= input_size ### Apply on llm.c See the reuse of `buf1`. Before relaxing size requirements on re-inplacing: ([P1703512078](https://www.internalfb.com/phabricator/paste/view/P1703512078)) ![1](https://github.com/user-attachments/assets/1472f550-6eb8-4d5c-9965-49bbb20d81a9) After relaxing size requirements on re-inplacing: ([P1703513053](https://www.internalfb.com/phabricator/paste/view/P1703513053)) ![2](https://github.com/user-attachments/assets/416294dd-30eb-4e12-a36c-1aebf9af530b) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143884 Approved by: https://github.com/eellison	2024-12-31 03:52:47 +00:00
Animesh Jain	969415885d	[inductor][invoke_subgraph] Support None/int as input/output of invoke_subgraph (#139373 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139373 Approved by: https://github.com/eellison	2024-12-27 06:46:09 +00:00
Tom Ritchford	f1cbf4b1b5	Enable ruff's unused variable checking everywhere in pytorch (#136965 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136965 Approved by: https://github.com/cyyever, https://github.com/albanD	2024-12-22 02:33:11 +00:00
Tom Ritchford	da67a6a7bb	[inductor] Replace set by OrderedSet (#138466 ) Uses the set_linter from https://github.com/pytorch/pytorch/pull/138454 and considerable manual editing Pull Request resolved: https://github.com/pytorch/pytorch/pull/138466 Approved by: https://github.com/eellison	2024-12-13 16:08:45 +00:00
Tom Ritchford	dc23f1944a	Remove unused Python variables in torch/[_-a]* (#133492 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133492 Approved by: https://github.com/albanD	2024-12-12 17:39:14 +00:00
PyTorch MergeBot	5c97ac9721	Revert "Remove unused Python variables in torch/[_-a]* (#133492 )" This reverts commit `fda975a7b3`. Reverted https://github.com/pytorch/pytorch/pull/133492 on behalf of https://github.com/clee2000 due to Sorry, I need to revert this in order to revert something else. The only thing you need to do is rebase and remerge ([comment](https://github.com/pytorch/pytorch/pull/133492#issuecomment-2536635516))	2024-12-11 17:29:12 +00:00
Tom Ritchford	fda975a7b3	Remove unused Python variables in torch/[_-a]* (#133492 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133492 Approved by: https://github.com/albanD	2024-12-10 21:48:44 +00:00
Alex Denisov	539286a67b	Inductor annotations (#130429 ) Add NVTX annotations around training phases and buffer computations RFC/discussion: https://dev-discuss.pytorch.org/t/rfc-performance-profiling-at-scale-with-details-nvtx-annotations/2224 <img width="2160" alt="Screenshot 2024-07-10 at 11 48 04" src="https://github.com/pytorch/pytorch/assets/1175576/9ade139c-d393-473f-9b68-6c25da367dc4"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130429 Approved by: https://github.com/aorenste, https://github.com/eellison, https://github.com/albanD Co-authored-by: Cedric GESTES <cedric.gestes@flex.ai>	2024-12-10 08:53:39 +00:00
Bin Bao	4d43ec2189	[AOTI] Swith GPU codegen to one-pass (#141980 ) Summary: With autotune_at_compile_time enabled, AOTI now can perform CUDA codegen in one pass. CUDA kernel related code is generated in a deferred way, after autotuning is done. This one-pass implementation will eliminate any issue caused by disparity between passes in the previous two-pass implementation (which caused multiple bug reports in the past). One-pass implementation also avoids cloning mutated inputs needed in the two-pass implementation, which will reduce GPU memory consumption. Differential Revision: [D66739414](https://our.internmc.facebook.com/intern/diff/D66739414) Pull Request resolved: https://github.com/pytorch/pytorch/pull/141980 Approved by: https://github.com/chenyang78	2024-12-09 14:40:34 +00:00
Bin Bao	5035ff0796	[AOTI] Refactor codegen_inputs signature (#142133 ) Summary: Since codegen_inputs only writes to self.prefix, drop IndentedBuffer from its parameters, to make the API consistent with other similar functions. Differential Revision: [D66881040](https://our.internmc.facebook.com/intern/diff/D66881040) Pull Request resolved: https://github.com/pytorch/pytorch/pull/142133 Approved by: https://github.com/chenyang78	2024-12-08 15:05:03 +00:00
Jason Ansel	0367a31401	[inductor] Minor typing changes (#142219 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/142219 Approved by: https://github.com/Skylion007, https://github.com/yanboliang	2024-12-07 17:48:37 +00:00
Bin Bao	39482907be	[AOTI] Refactor codegen_inputs in wrapper codegen (#141965 ) Summary: Fork codegen_inputs for CppWrapperCodegen, because the behavior between python and cpp needs to diverge. On the python side, input backed symbols need to be generated for the autotune block. This is to prepare for one-pass AOTI CUDA codegen. Differential Revision: [D66718225](https://our.internmc.facebook.com/intern/diff/D66718225) Pull Request resolved: https://github.com/pytorch/pytorch/pull/141965 Approved by: https://github.com/chenyang78 ghstack dependencies: #141388, #141387, #141979	2024-12-05 19:49:34 +00:00
Bin Bao	2fd8a7be71	[AOTI] Refactor additional_files generation (#141979 ) Summary: https://github.com/pytorch/pytorch/pull/140675 adds logic to collect all the generated cubin file paths into an additional_files list, but the collection should only happen when DeferredGpuKernelLine is materialized. This is to prepare for one-pass AOTI CUDA codegen. Differential Revision: [D66718227](https://our.internmc.facebook.com/intern/diff/D66718227) Pull Request resolved: https://github.com/pytorch/pytorch/pull/141979 Approved by: https://github.com/chenyang78 ghstack dependencies: #141388, #141387	2024-12-05 19:49:02 +00:00
Bin Bao	5f28c42746	[AOIT] Remove several overloaded members from WrapperCodegen (#141387 ) Summary: Remove several overloaded string members from WrapperCodegen classes, including open_bracket, closed_braket, size, stride. Instead of relying on polymorphism, we explicitly generate different strings for PythonWrapperCodegen and CppWrapperCodegen. This is to prepare for one-pass AOTI CUDA codegen. Differential Revision: [D66459991](https://our.internmc.facebook.com/intern/diff/D66459991) Pull Request resolved: https://github.com/pytorch/pytorch/pull/141387 Approved by: https://github.com/chenyang78 ghstack dependencies: #141388	2024-12-05 19:29:38 +00:00
Bin Bao	4cc0fc2707	[AOTI] Remove WrapperCodegen.expr_printer (#141388 ) Summary: Avoid using expr_printer as an overriden class member for WrapperCodegen. Instead, use pexpr and cexpr explicitly for python and cpp expression print respectively. This is to prepare for one-pass AOTI CUDA codegen, where PythonWrapperCodegen is used to generate the autotune block and CppWrapperCodegen is used to generate the model code. Differential Revision: [D66459992](https://our.internmc.facebook.com/intern/diff/D66459992) Pull Request resolved: https://github.com/pytorch/pytorch/pull/141388 Approved by: https://github.com/chenyang78	2024-12-05 19:20:39 +00:00
Adnan Akhundov	f16e08042c	[user triton] Fix grid codegen for configs with empty kwargs (#141824 ) Fixes #141823 by adding special handling of the codegen `if <config kwargs>: return <grid>` for the cases when there are no kwargs in the config. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141824 Approved by: https://github.com/Chillee	2024-12-02 04:17:21 +00:00
Boyuan Feng	17fd53d8e5	[Inductor] Inplacing with Donated Buffer (#140113 ) Currently, inductor does not inplace update a buffer if it is an input buffer. Because we don't know if an input will be used by other functions. Donated buffer provides additional information that an input buffer will not be used by other functions. So we can inplace update donated buffer when possible. [Dashboard](https://hud.pytorch.org/benchmark/torchbench/inductor_dynamic?dashboard=torchinductor&startTime=Mon,%2011%20Nov%202024%2018:14:36%20GMT&stopTime=Mon,%2018%20Nov%202024%2018:14:36%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(a100)&lBranch=bf/donated-buffer-inplace&lCommit=5df0769c00e6f9000caeb10fd5cbf0b165f69c2a&rBranch=main&rCommit=2b39a8db7741b816b03677a9c6fec1af05640dee) ![image](https://github.com/user-attachments/assets/f19d961f-7973-418e-9de8-5c2a97950478) ![image](https://github.com/user-attachments/assets/df3bd6a9-58b8-4e8a-8397-9e3b1de9adfe) Pull Request resolved: https://github.com/pytorch/pytorch/pull/140113 Approved by: https://github.com/eellison	2024-11-27 18:51:52 +00:00
PyTorch MergeBot	65dbd5cc2d	Revert "[Inductor] Inplacing with Donated Buffer (#140113 )" This reverts commit `eecc8e362c`. Reverted https://github.com/pytorch/pytorch/pull/140113 on behalf of https://github.com/BoyuanFeng due to break test_donated_buffer_inplace internally since donated_buffer = False if is_fbcode() else True ([comment](https://github.com/pytorch/pytorch/pull/140113#issuecomment-2501954300))	2024-11-26 21:20:59 +00:00
Yidi Wu	000d4e9d43	[hop][inductor] remove codegen_subgraph_suffix and directly assign call function result to outer outputs (#141181 ) Before the PR: P1683356646 after the pr: P1683356585 Relevant changes: ``` @@ -231,7 +421,8 @@ true_graph_0_args = [true_graph_0_arg0_1, true_graph_0_arg1_1] del true_graph_0_arg0_1 del true_graph_0_arg1_1 + (buf5[0],) = true_graph_0(true_graph_0_args) - (true_graph_0_buf0,) = true_graph_0(true_graph_0_args) - buf5[0] = true_graph_0_buf0 else: # subgraph: false_graph_0 false_graph_0_arg0_1 = buf4 @@ -239,7 +430,8 @@ false_graph_0_args = [false_graph_0_arg0_1, false_graph_0_arg1_1] del false_graph_0_arg0_1 del false_graph_0_arg1_1 + (buf5[0],) = false_graph_0(false_graph_0_args) - (false_graph_0_buf0,) = false_graph_0(false_graph_0_args) - buf5[0] = false_graph_0_buf0 del arg2_1 del buf4 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/141181 Approved by: https://github.com/anijain2305 ghstack dependencies: #140334, #141172	2024-11-26 17:32:51 +00:00
Yidi Wu	aae581d921	[hop free symbols][inductor] remove un-used add_symbol_graph_inputs (#141172 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/141172 Approved by: https://github.com/Chillee ghstack dependencies: #140334	2024-11-26 17:32:50 +00:00
Boyuan Feng	eecc8e362c	[Inductor] Inplacing with Donated Buffer (#140113 ) Currently, inductor does not inplace update a buffer if it is an input buffer. Because we don't know if an input will be used by other functions. Donated buffer provides additional information that an input buffer will not be used by other functions. So we can inplace update donated buffer when possible. [Dashboard](https://hud.pytorch.org/benchmark/torchbench/inductor_dynamic?dashboard=torchinductor&startTime=Mon,%2011%20Nov%202024%2018:14:36%20GMT&stopTime=Mon,%2018%20Nov%202024%2018:14:36%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(a100)&lBranch=bf/donated-buffer-inplace&lCommit=5df0769c00e6f9000caeb10fd5cbf0b165f69c2a&rBranch=main&rCommit=2b39a8db7741b816b03677a9c6fec1af05640dee) ![image](https://github.com/user-attachments/assets/f19d961f-7973-418e-9de8-5c2a97950478) ![image](https://github.com/user-attachments/assets/df3bd6a9-58b8-4e8a-8397-9e3b1de9adfe) Pull Request resolved: https://github.com/pytorch/pytorch/pull/140113 Approved by: https://github.com/eellison	2024-11-26 17:19:50 +00:00
xinan.lin	4742080ed9	[AOTI XPU] Enable Cpp wraper for Intel GPU. (#135318 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135318 Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/guangyey, https://github.com/desertfire	2024-11-26 11:51:32 +00:00
Jason Ansel	6eca0aee76	[inductor] Refactor ir.Layout into ir.OutputSpec (#140910 ) This separate the concepts of a Layout (size/stride/etc) and an OutputSpec (which includes multiple outputs). Which should make typing easier. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140910 Approved by: https://github.com/ezyang ghstack dependencies: #140895	2024-11-21 20:01:57 +00:00
Sam Ginzburg	a847790400	[inductor] reset to zero support for user defined Triton kernels (#140982 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/140982 Approved by: https://github.com/aakhundov	2024-11-21 18:53:23 +00:00
Jason Ansel	808f0f656d	[inductor] Refactor MutableBox to make IRNode typing easier (#140895 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/140895 Approved by: https://github.com/ezyang, https://github.com/Skylion007	2024-11-20 19:50:46 +00:00
Benjamin Glass	4ffce45100	AOTInductor: properly generate cpp_wrapper runtime assertions (#141050 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/141050 Approved by: https://github.com/desertfire ghstack dependencies: #141058	2024-11-20 19:17:47 +00:00
PyTorch MergeBot	d472a5f680	Revert "[inductor] Refactor MutableBox to make IRNode typing easier (#140895 )" This reverts commit `c79e78b503`. Reverted https://github.com/pytorch/pytorch/pull/140895 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think test_torchbind_inductor is failing in trunk after this lands ([comment](https://github.com/pytorch/pytorch/pull/140895#issuecomment-2484679319))	2024-11-19 04:25:41 +00:00
Jason Ansel	c79e78b503	[inductor] Refactor MutableBox to make IRNode typing easier (#140895 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/140895 Approved by: https://github.com/ezyang, https://github.com/Skylion007	2024-11-19 00:24:35 +00:00
Bin Bao	819b0ebd94	cpp_wrapper_cpu: Ensure reinterpret_view results in RAIIAtenTensorHandle (#139411 ) Fixes segfaults caused by views being implicitly converted to AtenTensorHandle, then being destroyed before use. Closes #135559. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139411 Approved by: https://github.com/desertfire Co-authored-by: Bin Bao <binbao@meta.com>	2024-11-17 04:16:59 +00:00
Oguz Ulgen	a173186566	[RFC] Implement caching for user defined triton kernels (#140326 ) This PR adds caching for user defined triton kernels by putting the transitive closure of source code in node.meta along with constant arguments. One HUGE hack we do here is a node looks like ``` triton_kernel_wrapper_functional_proxy = torch.ops.higher_order.triton_kernel_wrapper_functional(kernel_idx = 0, constant_args_idx = 1, grid = [(1, 1, 1)], tma_descriptor_ metadata = {}, kwargs = {'in_ptr0': arg0_1, 'in_ptr1': arg1_1, 'out_ptr': arg0_1}, tensors_to_clone = ['out_ptr']); ``` so we use regex to remove `kernel_idx = 0, constant_args_idx = 1` parts as they are not relevant to cache hash. This is horrible and I'd like to eventually not use pickle as a hashing alternative but this is a longer project. Differential Revision: [D65895744](https://our.internmc.facebook.com/intern/diff/D65895744) Pull Request resolved: https://github.com/pytorch/pytorch/pull/140326 Approved by: https://github.com/zou3519	2024-11-16 02:37:16 +00:00
Angela Yi	baf756a785	[reland] [aoti] Selectively package AOTI generated files (#140675 ) Summary: Reland https://github.com/pytorch/pytorch/pull/140022 Test Plan: CI Differential Revision: D65929964 Pull Request resolved: https://github.com/pytorch/pytorch/pull/140675 Approved by: https://github.com/desertfire	2024-11-15 23:48:34 +00:00
PyTorch MergeBot	222d4b48b1	Revert "cpp_wrapper_cpu: Ensure reinterpret_view results in RAIIAtenTensorHandle (#139411 )" This reverts commit `761b42bc08`. Reverted https://github.com/pytorch/pytorch/pull/139411 on behalf of https://github.com/kit1980 due to breaking internal inductor test ([comment](https://github.com/pytorch/pytorch/pull/139411#issuecomment-2477235367))	2024-11-14 19:25:46 +00:00
Bin Bao	85deef9ede	[AOTI][refactor] Rename generate_extern_kernel_alloc_and_find_schema_if_needed (#140447 ) Summary: Rename generate_extern_kernel_alloc_and_find_schema_if_needed to better reflect its meaning. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140447 Approved by: https://github.com/chenyang78	2024-11-14 01:40:58 +00:00
Oguz Ulgen	26fde110db	Refactor user-defined triton kernel source code collection (#140577 ) Differential Revision: [D65895743](https://our.internmc.facebook.com/intern/diff/D65895743) Pull Request resolved: https://github.com/pytorch/pytorch/pull/140577 Approved by: https://github.com/zou3519	2024-11-13 22:12:17 +00:00
PyTorch MergeBot	b4cc5d38b4	Revert "[aoti] Remove dir after packaging (#140022 )" This reverts commit `ba136a78ba`. Reverted https://github.com/pytorch/pytorch/pull/140022 on behalf of https://github.com/angelayi due to sorry I realized I need to land from internal ([comment](https://github.com/pytorch/pytorch/pull/140022#issuecomment-2473814720))	2024-11-13 14:43:15 +00:00
angelayi	ba136a78ba	[aoti] Remove dir after packaging (#140022 ) Update AOTI to return a list of files that it generates when `aot_inductor.package=True`. Then we will only package the files that are in that list. This should fix the [caching issue](https://fb.workplace.com/groups/1028545332188949/permalink/1081702043539944/) and hopefully https://github.com/pytorch/pytorch/issues/140053. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140022 Approved by: https://github.com/larryliu0820, https://github.com/desertfire, https://github.com/malfet	2024-11-13 12:17:19 +00:00
PyTorch MergeBot	d48ea29b9a	Revert "[aoti] Remove dir after packaging (#140022 )" This reverts commit `8c6abe5a8c`. Reverted https://github.com/pytorch/pytorch/pull/140022 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the lint failure is legit ([comment](https://github.com/pytorch/pytorch/pull/140022#issuecomment-2471847439))	2024-11-12 23:35:27 +00:00
Bin Bao	1f590feaf7	[AOTI][refactor] Update codegen_int_array_var API (#140299 ) Summary: codegen_int_array_var and codegen_reinterpret_view need to call different writeline functions depending on which part of code it's writing. Previously their APIs take a writer and implicitly assign a default writer if needed, which is not intuitive. Update their APIs to explicitly take a writeline function. Differential Revision: [D65774584](https://our.internmc.facebook.com/intern/diff/D65774584) Pull Request resolved: https://github.com/pytorch/pytorch/pull/140299 Approved by: https://github.com/frank-wei, https://github.com/chenyang78	2024-11-12 21:39:41 +00:00
angelayi	8c6abe5a8c	[aoti] Remove dir after packaging (#140022 ) Update AOTI to return a list of files that it generates when `aot_inductor.package=True`. Then we will only package the files that are in that list. This should fix the [caching issue](https://fb.workplace.com/groups/1028545332188949/permalink/1081702043539944/) and hopefully https://github.com/pytorch/pytorch/issues/140053. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140022 Approved by: https://github.com/larryliu0820, https://github.com/desertfire, https://github.com/malfet	2024-11-12 21:36:24 +00:00
Benjamin Glass	761b42bc08	cpp_wrapper_cpu: Ensure reinterpret_view results in RAIIAtenTensorHandle (#139411 ) Fixes segfaults caused by views being implicitly converted to AtenTensorHandle, then being destroyed before use. Closes #135559. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139411 Approved by: https://github.com/desertfire	2024-11-12 15:22:38 +00:00
Bin Bao	2c77352fe2	[AOTI][refactor] Clean up call chain in wrapper codegen (#136531 ) Summary: For cpp wrapper, generate_kernel_call and define_kernel need to handle both cpu and gpu kernels. Refactor the code to remove nested super() calls. Differential Revision: [D65639095](https://our.internmc.facebook.com/intern/diff/D65639095) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136531 Approved by: https://github.com/frank-wei	2024-11-11 22:00:42 +00:00
Adnan Akhundov	838958de94	[inductor] Support autotune restore_value for user-defined Triton kernels (#139851 ) This PR adds support for the `restore_value` argument of the `@triton.autotune` for the user-defined Triton kernels in PT2. The `kernel.restore_idx` are extracted in the `ir.UserDefinedTritonKernel` and the corresponding arg names are placed into the `triton_meta["restore_value"]`. From there, those are added to the existing `mutated_arg_names` in the caching autotuner infra which already exists and leads to the listed argss being cloned. This achieves the equivalent effect to the native `restore_value`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139851 Approved by: https://github.com/oulgen	2024-11-08 14:59:00 +00:00
Wu, Chunyuan	a3052b3b7c	Inductor cpp wrapper: clean-up hard-coded schema and related code (#139873 ) Fixes https://github.com/pytorch/pytorch/issues/112552. non-ABI compatible mode has been removed thus the following values are not needed anymore: `extern_call_ops` `cpp_op_schema` `cpp_kernel_key` `cpp_kernel_overload_name` Pull Request resolved: https://github.com/pytorch/pytorch/pull/139873 Approved by: https://github.com/jgong5, https://github.com/desertfire	2024-11-08 08:15:51 +00:00
PyTorch MergeBot	f3238106fd	Revert "Allow inplacing buffer when other users are inconsequential (#138383 )" This reverts commit `030f70b40b`. Reverted https://github.com/pytorch/pytorch/pull/138383 on behalf of https://github.com/huydhn due to Sorry for reverting this again, but I think it has a test failing internally and also on ROCm ([comment](https://github.com/pytorch/pytorch/pull/138383#issuecomment-2452898229))	2024-11-02 06:53:48 +00:00
Gabriel Ferns	030f70b40b	Allow inplacing buffer when other users are inconsequential (#138383 ) Summary: I think we can inplace a buffer if all of the users of said buffer are "inconsequential", defined as having been removed, being completed, or being part of the ancestors set. In particular, this allows LayerNorm to inplace its input buffer. Implements: https://github.com/pytorch/pytorch/issues/132826 Test Plan: New unit test of matmul followed by LayerNorm, make sure there's an inplaced buffer. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138383 Approved by: https://github.com/eellison	2024-11-01 01:24:40 +00:00
Yifu Wang	7765d1ef70	Preliminary registered-buffer collective support via Inductor (#138029 ) ``` NOTE [lowering-time collective optimization] In collective communication libraries such as NCCL, every rank maintains communication buffers that are remotely accessible by some peers. Depending on the underlying transport, remote accessibility may be established via mechanisms such as ib_reg_mr, CUDA P2P, or CUDA multicast. Typically, these buffers are private to the communication library by default, and communication ops copy user data in and out of these buffers. To prevent these copies, an optimization commonly known as "user buffer registration" can be employed. This allows direct establishment of remote accessibility on user buffers, eliminating the need for copying. However, this optimization introduces stringent usage requirements, which are typically hard to satisfy without being intrusive to the user code: - Establishing remote accessibility is expensive and often done ahead of time. In such implementations, all ranks must agree on the set of allocations used for every collective op. Failing to meet this requirement can lead to runtime errors or even silent correctness issues. - Even if the collective communication library supports gracefully falling back to "unregistered" implementations, the fallback mechanism would nullify the optimization. - Some communication mechanisms impose stricter requirements than others. For example, CUDA's multicast + multi-mem instructions require all ranks to agree not only on the allocations used for every collective but also on the offsets within these allocations. To support all different mechanisms with optimal results, we aim to satisfy the strictest requirement for this family of optimizations - we ensures that every collective op invocation is guaranteed to operate on the same allocation, at the same offset, in every iteration. For eligible collective ops, we identify communication buffers at lowering time and optionally choose to lower the op to a different kernel (ommunication libraries like NCCL handle both registered and non-registered buffers transparently within the same op, though some may require different ops for different cases). Later, the codegen will perform "persistent allocation" to satisfy the aforementioned constraints, and optionally, perform buffer planning to optimize overall memory usage. ``` ### Changes - Created `comm_lowering.py` for the lowerings of `_c10d_functional` ops. This is to prevent cluttering `lowering.py` as we add more lowering-time collective optimizations. This PR moved the lowerings for `all_reduce` and `all_reduce_` to the file. - Added `comm_buffer_type: Dict[str, str]` to `GraphLowering` to track whether a buffer is a comm buffer and the type of the comm buffer. - Added codegen allocation support for comm buffers of type "symm_mem". - Added support for auto-lowering `_c10d_functional.all_reduce_` to `symm_mem.one_shot_all_reduce`. - Added an Inductor config for collective optimizations in general (`config._collective`). ### Limitation Currently, each persistently allocated comm buffer is dedicated to a single callsite. This is not viable in terms of memory usage. However, this is a neccesary intermediate state before we tackle memory planning for comm buffers. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138029 Approved by: https://github.com/Chillee ghstack dependencies: #138028	2024-10-30 18:11:09 +00:00
drisspg	a884462bca	Add workspace to TritonTemplates (#138050 ) Here's a markdown summary for the PR: # Add workspace buffer support for Triton templates ## Summary Adds support for templates to allocate and use temporary workspace buffers ## Key Changes - Add `WorkspaceArg` support in Triton template system - Automatic workspace allocation/deallocation around kernel execution - Zero-initialization support for workspace buffers - Seamless integration with existing tensor management ## Example Usage ```python def generate(self, ...): workspace_arg = WorkspaceArg( count=1024*1024, # 1MB workspace zero_fill=True # Zero-initialized ) return TritonTemplateCaller(..., workspace_arg=workspace_arg) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/138050 Approved by: https://github.com/Chillee, https://github.com/eellison	2024-10-29 18:17:54 +00:00
Sam Ginzburg	93d7f90c3a	[inductor] getting AOT inductor to treat None args correctly (#139114 ) Differential Revision: [D65102228](https://our.internmc.facebook.com/intern/diff/D65102228) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139114 Approved by: https://github.com/aakhundov	2024-10-29 08:11:53 +00:00
Jason Ansel	2b937e4e6d	[inductor] Cooperative reductions (#137756 ) Example generated code for `(x+y).sum()`: ```py @triton.jit def triton_unk_fused_add_sum_0(in_ptr0, in_ptr1, out_ptr0, ws_ptr, semaphores_ptr, xnumel, rnumel, XBLOCK : tl.constexpr, RBLOCK : tl.constexpr, RSPLIT : tl.constexpr): xnumel = 1 rnumel = 1048576 rsplit_id = tl.program_id(0) num_rblocks = (rnumel + RBLOCK - 1) // RBLOCK rsplit_chunk = (num_rblocks + RSPLIT - 1) // RSPLIT * RBLOCK rsplit_start = rsplit_chunk * rsplit_id rsplit_end = rsplit_chunk * (rsplit_id + 1) xoffset = tl.program_id(1) * XBLOCK xindex = xoffset + tl.arange(0, XBLOCK)[:, None] xmask = tl.full([XBLOCK, RBLOCK], True, tl.int1) rbase = tl.arange(0, RBLOCK)[None, :] _tmp4 = tl.full([XBLOCK, RBLOCK], 0, tl.float32) for roffset in range(rsplit_start, rsplit_end, RBLOCK): rindex = roffset + rbase rmask = rindex < rnumel r0 = rindex tmp0 = tl.load(in_ptr0 + (r0), rmask, eviction_policy='evict_first', other=0.0) tmp1 = tl.load(in_ptr1 + (r0), rmask, eviction_policy='evict_first', other=0.0) tmp2 = tmp0 + tmp1 tmp3 = tl.broadcast_to(tmp2, [XBLOCK, RBLOCK]) tmp5 = _tmp4 + tmp3 _tmp4 = tl.where(rmask, tmp5, _tmp4) tmp4 = tl.sum(_tmp4, 1)[:, None] if RSPLIT > 1: tmp4_ws = (ws_ptr + 0).to(tl.pointer_type(tl.float32)) tl.store(tmp4_ws + (xindex * RSPLIT + rsplit_id), tmp4, None) if RSPLIT > 1: triton_helpers.gpu_barrier(semaphores_ptr + (2 * tl.program_id(1) + 0), RSPLIT, True) if RSPLIT > 1: tmp4_peers = tl.load(tmp4_ws + (xindex * RSPLIT + tl.arange(0, RSPLIT)[None,:]), None, eviction_policy='evict_first') tmp4 = tl.sum(tmp4_peers, 1)[:, None] if rsplit_id == (0 % RSPLIT): tl.store(out_ptr0 + (tl.full([XBLOCK, 1], 0, tl.int32)), tmp4, None) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/137756 Approved by: https://github.com/eellison	2024-10-29 00:45:53 +00:00
Adnan Akhundov	ab09c4d913	Add host-side TMA support to AOTInductor (#138878 ) This adds host-side Triton TMA support to AOTInductor. Notes: - Two helper functions, `init1DTMADescriptor` and `init2DTMADescriptor` are added to the C++ wrapper codegen on GPU, conditioned on the model having user-defined Triton kernels with host-side TMA (CUDA-specific). - C++ wrapper codegen on GPU emits TMA descriptor initialization via the aforementioned helper functions. - Special handling added for the TMA descriptors (in the Python wrapper codegen) during the compile-time autotuning, as the underlying tensor can't be passed directly to the user-defined Triton kernel. TMA descriptors are generated in-between the source tensor's buffer and the kernel call, like in the full Python wrapper codegen. - This PR concludes the host-side Triton TMA support in PT2. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138878 Approved by: https://github.com/desertfire, https://github.com/chenyang78 ghstack dependencies: #138759, #138877	2024-10-28 23:39:53 +00:00

1 2 3 4 5 ...

494 Commits