pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
Oguz Ulgen	dbf44dffc9	[Inductor] Cache generated user defined triton kernels on tensor dtype and non tensor parameters (#112752 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/112752 Approved by: https://github.com/jansel	2023-11-07 05:29:16 +00:00
Oguz Ulgen	13d62e28a3	[Inductor] Add Dynamic shape support to user defined triton kernels (#112523 ) 1) This PR moves the grid function codegen to wrapper so that we can use IndentBuffers as opposed to manually adding tabs for indentation. 2) In inductor, emits the grid function in the body of the kernel call so that it can use free symbols from dynamic shapes Pull Request resolved: https://github.com/pytorch/pytorch/pull/112523 Approved by: https://github.com/Chillee	2023-11-02 23:58:50 +00:00
Jez Ng	ae85ba820f	[inductor] Memory planning (#112178 ) This was originally @jansel's PR: https://github.com/pytorch/pytorch/pull/102625, which I've built upon. This diff implements static memory planning. It's disabled by default while we examine its performance. We use a greedy-by-size approach. For dynamic shapes, the sizes of the example inputs are used as estimates when making planning decisions. We generate expressions to calculate the actual memory offsets and sizes at runtime when the values of the dynamic shapes are known. In order to simplify these calculations, we have organized the allocations into a tree that branches on space (address offsets) and time (live ranges). Finally, we need to align these offsets, so we have added an `align` sympy Expr to express these calculations. Some limitations: 1. It is only enabled during inference for now. Enabling it for training increases peak memory usage as we allocate all the memory needed for training upfront, before freeing the memory allocated during inference. We can probably address this by doing planning for both the inference and training passes together. 2. It doesn't work with PyTorch Distributed, because kernels like AllGatherIntoTensor codegen strings which do memory operations. We can fix this down the line by having them emit MemoryPlanningLines instead. Pull Request resolved: https://github.com/pytorch/pytorch/pull/112178 Approved by: https://github.com/desertfire, https://github.com/jansel	2023-11-02 07:39:13 +00:00
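The greedy-by-size idea in the commit message above can be illustrated with a small, self-contained sketch. This is not Inductor's implementation: the `Buffer` class, the `plan` function, and the 64-byte alignment are hypothetical names and values chosen for illustration, sizes are plain integers rather than sympy expressions, and the space/time allocation tree described in the message is replaced by a simple scan over already-placed buffers.

```python
from dataclasses import dataclass

ALIGN_BYTES = 64  # hypothetical alignment, chosen only for this sketch


def align(nbytes, alignment=ALIGN_BYTES):
    """Round nbytes up to the next multiple of alignment."""
    return (nbytes + alignment - 1) // alignment * alignment


@dataclass
class Buffer:
    name: str
    size: int   # size in bytes
    start: int  # first step at which the buffer is live
    end: int    # last step at which the buffer is live (inclusive)


def plan(buffers):
    """Greedy-by-size planning: place the largest buffers first, each at the
    lowest aligned offset that does not collide with a buffer whose live
    range overlaps. Returns ({name: offset}, total_pool_size)."""
    placed = []  # (offset, aligned_size, buffer)
    offsets = {}
    for buf in sorted(buffers, key=lambda b: b.size, reverse=True):
        size = align(buf.size)
        # Address ranges held by buffers that are live at the same time.
        busy = sorted(
            (off, off + sz)
            for off, sz, other in placed
            if not (buf.end < other.start or other.end < buf.start)
        )
        offset = 0
        for lo, hi in busy:
            if offset + size <= lo:
                break  # the gap before this range is large enough
            offset = max(offset, hi)
        placed.append((offset, size, buf))
        offsets[buf.name] = offset
    pool_size = max((off + sz for off, sz, _ in placed), default=0)
    return offsets, pool_size
```

For example, `plan([Buffer("a", 100, 0, 1), Buffer("b", 60, 1, 2), Buffer("c", 100, 2, 3)])` lets `a` and `c` share offset 0 because their live ranges never overlap, while `b` is pushed past them.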
David Berard	8191fb3e06	[Reland2] [inductor][BE] split triton_meta and inductor_meta (#112351 ) triton_meta is intended to be passed directly to triton. Previously we were also putting other metadata into triton_meta; but we should split out the other metadata into a separate dict to avoid possible conflicts in the future. This PR splits out triton_meta and inductor_meta so we have a place to put additional metadata that isn't intended to be passed to triton. Tests - wait for CI Differential Revision: [D50864493](https://our.internmc.facebook.com/intern/diff/D50864493) Pull Request resolved: https://github.com/pytorch/pytorch/pull/112351 Approved by: https://github.com/eellison	2023-11-02 00:40:12 +00:00
PyTorch MergeBot	74e6c877e9	Revert "[inductor] Memory planning (#112178 )" This reverts commit `f64a97c6f8`. Reverted https://github.com/pytorch/pytorch/pull/112178 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it seems that ROCm will need to be fixed for the new test too `f64a97c6f8` ([comment](https://github.com/pytorch/pytorch/pull/112178#issuecomment-1788195311))	2023-11-01 00:03:56 +00:00
Jez Ng	f64a97c6f8	[inductor] Memory planning (#112178 ) This was originally @jansel's PR: https://github.com/pytorch/pytorch/pull/102625, which I've built upon. This diff implements static memory planning. It's disabled by default while we examine its performance. We use a greedy-by-size approach. For dynamic shapes, the sizes of the example inputs are used as estimates when making planning decisions. We generate expressions to calculate the actual memory offsets and sizes at runtime when the values of the dynamic shapes are known. In order to simplify these calculations, we have organized the allocations into a tree that branches on space (address offsets) and time (live ranges). Finally, we need to align these offsets, so we have added an `align` sympy Expr to express these calculations. Some limitations: 1. It is only enabled during inference for now. Enabling it for training increases peak memory usage as we allocate all the memory needed for training upfront, before freeing the memory allocated during inference. We can probably address this by doing planning for both the inference and training passes together. 2. It doesn't work with PyTorch Distributed, because kernels like AllGatherIntoTensor codegen strings which do memory operations. We can fix this down the line by having them emit MemoryPlanningLines instead. Pull Request resolved: https://github.com/pytorch/pytorch/pull/112178 Approved by: https://github.com/desertfire, https://github.com/jansel	2023-10-31 20:02:30 +00:00
Yang Chen	94f3df27e4	[aotinductor] reland: return a copy of any constant (#112370 ) When the model returns a constant, we cannot "release" its handle, because the constant doesn't have any handle at all. Instead, we should allocate a new tensor and then return a copy of the constant. Pull Request resolved: https://github.com/pytorch/pytorch/pull/112370 Approved by: https://github.com/hl475, https://github.com/desertfire	2023-10-31 18:36:44 +00:00
chunyuan	f50ec341bc	inductor cpp wrapper: add GIL release and acquire (#111888 ) Support multiple instances inference (in different threads of the same process) as in https://github.com/pytorch/pytorch/issues/93524#issuecomment-1421816158. Pull Request resolved: https://github.com/pytorch/pytorch/pull/111888 Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/desertfire	2023-10-31 03:23:30 +00:00
Oguz Ulgen	1250032c2e	[Inductor] Add triton.autotune support for user defined triton kernels with complex grids (#112290 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/112290 Approved by: https://github.com/jansel	2023-10-30 17:48:27 +00:00
Oguz Ulgen	c14c4efc0e	[Inductor] Add triton.autotune support for user defined triton kernels with constant/simple grids (#112228 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/112228 Approved by: https://github.com/jansel	2023-10-28 17:30:35 +00:00
PyTorch MergeBot	8d44999183	Revert "[Inductor] Add triton.autotune support for user defined triton kernels with constant/simple grids (#112228 )" This reverts commit `dbb31a2984`. Reverted https://github.com/pytorch/pytorch/pull/112228 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing ROCm test in trunk `dbb31a2984` ([comment](https://github.com/pytorch/pytorch/pull/112228#issuecomment-1783660326))	2023-10-28 01:51:32 +00:00
Oguz Ulgen	dbb31a2984	[Inductor] Add triton.autotune support for user defined triton kernels with constant/simple grids (#112228 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/112228 Approved by: https://github.com/jansel	2023-10-27 21:40:22 +00:00
Bin Bao	f66cc67562	[aotinductor] Fix duplicated unbacked symbol declarations (#111823 ) Summary: For https://github.com/pytorch/pytorch/issues/111711 Pull Request resolved: https://github.com/pytorch/pytorch/pull/111823 Approved by: https://github.com/ezyang, https://github.com/aakhundov	2023-10-26 21:11:08 +00:00
angelayi	b126adcdee	[aotinductor] Pass TorchIR to AOTInductor (#110020 ) Updates `_export.aot_compile` to pass a torch IR graph to inductor, allowing inductor to now run the pre_grad_passes, and reuse more of inductor's code. Also updates the API to only return the `so_path`, and no longer return the exported program. The pytree call spec is now serialized and placed inside of the generated model code. When calling the model, because there is no C++ pytree implementation linked yet, we can access the call specs through `get_call_spec()`, and call pytree flatten/unflatten in Python. Pull Request resolved: https://github.com/pytorch/pytorch/pull/110020 Approved by: https://github.com/desertfire	2023-10-26 15:54:31 +00:00
Oguz Ulgen	a29a844938	[Inductor] Support top level constants in user defined triton kernels (#111970 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/111970 Approved by: https://github.com/jansel ghstack dependencies: #111956	2023-10-25 02:43:51 +00:00
Oguz Ulgen	bb550b25c9	[Inductor] Support user defined triton kernels calling other triton kernels and activation functions (#111956 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/111956 Approved by: https://github.com/jansel	2023-10-25 02:39:43 +00:00
Oguz Ulgen	ddcf9c050b	[Inductor] Support calling user defined kernels with different type of arguments (#111939 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/111939 Approved by: https://github.com/jansel, https://github.com/zou3519 ghstack dependencies: #111770, #111808	2023-10-24 19:49:48 +00:00
Scott Wolchok	a0043d4840	[PyTorch] AOTI: cache dtypes and device types at DSO load (#111820 ) Calling the `aoti_torch_{device_type,dtype}` functions on each iteration can impose high costs on overhead-bound CPU models because they can't be inlined across a DSO boundary. If we call them on load, we can use simple load instructions at run time. Differential Revision: [D50563682](https://our.internmc.facebook.com/intern/diff/D50563682/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/111820 Approved by: https://github.com/chenyang78, https://github.com/desertfire ghstack dependencies: #111815, #111816	2023-10-24 18:37:26 +00:00
Scott Wolchok	6afd00a318	[PyTorch] AOTI: use array of constants (#111815 ) We continue to allow the user to set clients with a map, but under the hood we use an array of constants. model_container thought it was OK to hand over the map, assume we just kept a pointer, and then mutate the map later; I had to fix that. I hope there aren't other sites that do the same thing... Differential Revision: [D50111512](https://our.internmc.facebook.com/intern/diff/D50111512/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/111815 Approved by: https://github.com/jansel, https://github.com/desertfire	2023-10-24 18:37:18 +00:00
Jez Ng	cbc6213f5d	[inductor] Defer memory operation lowering to wrapper (#111402 ) Right now, memory ops are being lowered to strings partly in scheduler.codegen() and partly in wrapper.codegen(). But that makes static memory planning (which is done entirely in `wrapper.codegen()`) difficult to implement as information is "lost" by that point. Pull Request resolved: https://github.com/pytorch/pytorch/pull/111402 Approved by: https://github.com/jansel	2023-10-24 03:47:56 +00:00
Jez Ng	e264b42a2e	[re-land][inductor] Refactor and optimize allocation calls (#111117 ) (#111511 ) Summary: This is a re-land of https://github.com/pytorch/pytorch/pull/111117 with updates to our internal tests included. This splits out changes from https://github.com/pytorch/pytorch/pull/102625 to make things easier to review. This diff creates a `make_allocation()` method that extracts the logic from `make_buffer_allocation()` while allowing us to allocate non-buffer objects. In particular, we will use this to allocate memory pools during memory planning. This diff also includes a small optimization -- if the desired allocation is contiguous, then we emit a call to `empty()` instead of `empty_strided()` with its superfluous stride argument. Test Plan: contbuild & OSS CI, see `9ce0ae836d` Differential Revision: D50429424 Pull Request resolved: https://github.com/pytorch/pytorch/pull/111511 Approved by: https://github.com/jansel	2023-10-23 19:18:32 +00:00
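The small optimization described above (emit `empty()` when the requested strides are already contiguous, otherwise `empty_strided()`) can be sketched as follows. The helper names `contiguous_strides` and `make_allocation_call`, the default dtype and device, and the exact emitted strings are assumptions made for illustration; they are not the code from the PR, and corner cases such as size-0 or size-1 dimensions are ignored.

```python
def contiguous_strides(size):
    """Row-major (C-contiguous) strides for the given shape."""
    strides = []
    running = 1
    for dim in reversed(size):
        strides.append(running)
        running *= dim
    return tuple(reversed(strides))


def make_allocation_call(name, size, stride, dtype="torch.float32", device="cuda"):
    """Return the source line a wrapper could emit to allocate a tensor.

    Hypothetical helper: drops the stride argument when it is redundant.
    """
    size, stride = tuple(size), tuple(stride)
    if stride == contiguous_strides(size):
        return f"{name} = torch.empty({size}, dtype={dtype}, device='{device}')"
    return (
        f"{name} = torch.empty_strided({size}, {stride}, "
        f"dtype={dtype}, device='{device}')"
    )
```

Calling `make_allocation_call("buf0", (2, 3), (3, 1))` yields a plain `torch.empty` line, while a transposed layout such as strides `(1, 2)` keeps the `torch.empty_strided` form.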
Oguz Ulgen	2b2b6caf8f	[inductor] Implement clone removal for user defined triton kernel via reinplace_scatters (#111627 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/111627 Approved by: https://github.com/jansel ghstack dependencies: #111434	2023-10-22 22:28:00 +00:00
Oguz Ulgen	977d3bcc46	[Inductor] Support user defined triton kernels in inductor (#111434 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/111434 Approved by: https://github.com/jansel	2023-10-22 17:04:19 +00:00
Jason Ansel	a1154e673b	[Compiled Autograd] Turn accumulate_grad into an op (#111700 ) Relands #111271 Pull Request resolved: https://github.com/pytorch/pytorch/pull/111700 Approved by: https://github.com/voznesenskym	2023-10-21 17:31:09 +00:00
PyTorch MergeBot	3eb5cae3af	Revert "[Compiled Autograd] Turn accumulate_grad into an op (#111271 )" This reverts commit `04b04c0686`. Reverted https://github.com/pytorch/pytorch/pull/111271 on behalf of https://github.com/jeanschmidt due to Breaking internal CI ([comment](https://github.com/pytorch/pytorch/pull/111271#issuecomment-1768527932))	2023-10-18 14:02:34 +00:00
PyTorch MergeBot	ed7739d690	Revert "[aot_inductor] return a copy of any constant (#111356 )" This reverts commit `71e1f34923`. Reverted https://github.com/pytorch/pytorch/pull/111356 on behalf of https://github.com/jeanschmidt due to Breaking internal ci ([comment](https://github.com/pytorch/pytorch/pull/111356#issuecomment-1768503640))	2023-10-18 13:51:30 +00:00
PyTorch MergeBot	08f580d498	Revert "[inductor] Refactor and optimize allocation calls (#111117 )" This reverts commit `9ce0ae836d`. Reverted https://github.com/pytorch/pytorch/pull/111117 on behalf of https://github.com/jeanschmidt due to Breaking internal CI ([comment](https://github.com/pytorch/pytorch/pull/111117#issuecomment-1768489865))	2023-10-18 13:45:02 +00:00
soulitzer	2dc1726ab7	Compile NestedTensor with AOTAutograd (#110529 ) This PR has a number of changes that improve subclass support for AOTAutograd/Inductor in general: - Previously, if a subclass did extra aliasing between graph outputs and inputs, the partitioner would complain because grad_outputs are the outputs reused as-is. Now we do a view_as(self) to work around this. - Use dense -> dense metadata when working with fwd_output_strides during backward. This is important since the stride information comes from inductor, which sees the dense-to-dense graph. - Inductor requires the inputs to the compiled backward to match some expected strides computed during compilation. We make sure to make the inner tensors of the subclass contiguous (previously, we only made the subclass itself contiguous). Changes specific to NestedTensor relevant to compilation: - Properly handle the case where `__tensor_unflatten__` is passed non-symbolic dense tensors and meta extracted from fake subclasses. - Skip var_to_range logic for singleton int - Skip size hint logic in inductor for singleton int Pull Request resolved: https://github.com/pytorch/pytorch/pull/110529 Approved by: https://github.com/bdhirsh	2023-10-17 21:17:10 +00:00
Yang Chen	71e1f34923	[aot_inductor] return a copy of any constant (#111356 ) When the model returns a constant, we cannot "release" its handle, because the constant doesn't have any handle at all. Instead, we should allocate a new tensor and then return a copy of the constant. Pull Request resolved: https://github.com/pytorch/pytorch/pull/111356 Approved by: https://github.com/hl475	2023-10-17 08:44:21 +00:00
Jez Ng	9ce0ae836d	[inductor] Refactor and optimize allocation calls (#111117 ) This splits out changes from https://github.com/pytorch/pytorch/pull/102625 to make things easier to review. This diff creates a `make_allocation()` method that extracts the logic from `make_buffer_allocation()` while allowing us to allocate non-buffer objects. In particular, we will use this to allocate memory pools during memory planning. This diff also includes a small optimization -- if the desired allocation is contiguous, then we emit a call to `empty()` instead of `empty_strided()` with its superfluous stride argument. Pull Request resolved: https://github.com/pytorch/pytorch/pull/111117 Approved by: https://github.com/jansel	2023-10-17 03:06:52 +00:00
Jason Ansel	04b04c0686	[Compiled Autograd] Turn accumulate_grad into an op (#111271 ) Rather than baking the behavior of `AccumulateGrad` nodes into the generated graph (either as `+=`, or as a return value of the graph). This creates a new `accumulate_grad_` dispatcher op that is included in the generated graph like: ``` def forward(self, inputs, sizes, hooks): getitem = inputs[0] getitem_1 = inputs[1] getitem_2 = inputs[2] getitem_3 = inputs[3] getitem_4 = inputs[4] getitem_5 = inputs[5] getitem_6 = inputs[6] getitem_7 = inputs[7] getitem_8 = inputs[8] getitem_9 = inputs[9]; inputs = None expand = torch.ops.aten.expand.default(getitem, [2, 4]); getitem = None threshold_backward = torch.ops.aten.threshold_backward.default(expand, getitem_1, 0); expand = getitem_1 = None t = torch.ops.aten.t.default(getitem_3); getitem_3 = None mm = torch.ops.aten.mm.default(threshold_backward, t); t = None t_1 = torch.ops.aten.t.default(threshold_backward) mm_1 = torch.ops.aten.mm.default(t_1, getitem_2); t_1 = getitem_2 = None t_2 = torch.ops.aten.t.default(mm_1); mm_1 = None sum_1 = torch.ops.aten.sum.dim_IntList(threshold_backward, [0], True); threshold_backward = None view = torch.ops.aten.view.default(sum_1, [4]); sum_1 = None t_3 = torch.ops.aten.t.default(t_2); t_2 = None accumulate_grad_ = torch.ops.inductor.accumulate_grad_.default(getitem_4, t_3); getitem_4 = t_3 = None threshold_backward_1 = torch.ops.aten.threshold_backward.default(mm, getitem_5, 0); mm = getitem_5 = None t_4 = torch.ops.aten.t.default(threshold_backward_1) mm_2 = torch.ops.aten.mm.default(t_4, getitem_6); t_4 = getitem_6 = None t_5 = torch.ops.aten.t.default(mm_2); mm_2 = None sum_2 = torch.ops.aten.sum.dim_IntList(threshold_backward_1, [0], True); threshold_backward_1 = None view_1 = torch.ops.aten.view.default(sum_2, [4]); sum_2 = None t_6 = torch.ops.aten.t.default(t_5); t_5 = None accumulate_grad__1 = torch.ops.inductor.accumulate_grad_.default(getitem_7, t_6); getitem_7 = t_6 = None 
accumulate_grad__2 = torch.ops.inductor.accumulate_grad_.default(getitem_8, view_1); getitem_8 = view_1 = None accumulate_grad__3 = torch.ops.inductor.accumulate_grad_.default(getitem_9, view); getitem_9 = view = None return [] ``` The motivation here is `AccumulateGrad` nodes are causing trouble in FSDP tracing, since FSDP is in-place resizing parameters and parameter storage in hooks. We will model this mutation in dynamo, but not during the initial compiled autograd capture. This allows us to bypass failing shape checks in the initial capture. Pull Request resolved: https://github.com/pytorch/pytorch/pull/111271 Approved by: https://github.com/voznesenskym	2023-10-16 21:16:17 +00:00
Scott Wolchok	84975339bd	[PyTorch] AOTI: generate reused thread_locals when tensors provably have static shape (#110892 ) If a Tensor can be reused and has static shape, we can just cache it across iterations. This is meant as a quickly shippable overhead reduction for CPU overhead-bound use cases that we can ship without relying on memory planning. Differential Revision: [D50023678](https://our.internmc.facebook.com/intern/diff/D50023678/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/110892 Approved by: https://github.com/bertmaher ghstack dependencies: #110876, #110877, #110909	2023-10-13 16:07:05 +00:00
Bin Bao	6b4c686b9a	[aotindutor] Forward fix a performance regression (#110800 ) Summary: Forward fix a performance regression caused by https://github.com/pytorch/pytorch/pull/110510. When a model is run once, all those kernel pointers are initialized, and removing the if-nullptr check will cause those loadKernel calls to be unnecessarily executed again when we rerun the forward function. Another way to do this is to codegen loadKernel in the initializer, which I may do in a later PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/110800 Approved by: https://github.com/jansel	2023-10-08 04:06:44 +00:00
Adnan Akhundov	abb00f66d8	[inductor] Add AOTI ABI shim function for repeat_interleave.Tensor (#110745 ) Summary: `repeat_interleave.Tensor` doesn't have inductor lowering. To invoke the operator in AOT Inductor's ABI compatibility mode we need a dedicated shim function. Test Plan: ``` $ python test/inductor/test_aot_inductor.py -k test_repeat_interleave ... ---------------------------------------------------------------------- Ran 4 tests in 70.526s OK ``` Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/110745 Approved by: https://github.com/chenyang78 ghstack dependencies: #110713	2023-10-07 08:18:01 +00:00
Bin Bao	298f01d9a2	[aotinductor] Avoid generating redundant kernel loading code (#110510 ) Summary: 1) Stop forcing triton.unique_kernel_names to True for AOTInductor, because the unique kernel name can be read from metadata; 2) Only generate load_kernel once for each kernel since we don't have control flow in our generated code. This solves https://github.com/pytorch/pytorch/issues/105553. Pull Request resolved: https://github.com/pytorch/pytorch/pull/110510 Approved by: https://github.com/chenyang78, https://github.com/jansel	2023-10-05 19:59:38 +00:00
Sherlock Huang	f1b94461aa	[AOTInductor] ProxyExecutor support Dynamic Shape (#110526 ) Summary: Extend ProxyExecutor to support dynamic shape. Example of ProxyExecutor invocation with symints. ``` int64_t* arg0_1_size; AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_get_sizes(arg0_1, &arg0_1_size)); auto s0 = arg0_1_size[0]; auto s1 = arg0_1_size[1]; int64_t* arg1_1_size; AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_get_sizes(arg1_1, &arg1_1_size)); auto s2 = arg1_1_size[0]; auto s3 = arg1_1_size[1]; ... aoti_torch_proxy_executor_call_function(proxy_executor, 0, 15, std::vector<int64_t>{42, 16, 17, s0 + s1, s0 + s1, s2*s3, 45, 67, 16, 17, s2*s3, s2*s3, s0 + s1, 89, 910}.data(), 7, std::vector<AtenTensorHandle>{arg0_1, arg0_1, arg1_1, buf2, arg0_1, arg1_1, buf4}.data()); ``` Example of serialized SymInt(s) arguments: ``` { "name": "symint", "arg": { "asSymInt": { "asName": "s0 + s1" } } }, { "name": "symints", "arg": { "asSymInts": [ { "asName": "s0 + s1" }, { "asName": "s2*s3" } ] } }, ... { "name": "o_symint", "arg": { "asSymInt": { "asName": "s2*s3" } } }, { "name": "o_symints", "arg": { "asSymInts": [ { "asName": "s2*s3" }, { "asName": "s0 + s1" } ] } }, ``` Test Plan: buck2 run mode/dev-nosan deeplearning/aot_inductor/test:test_custom_ops Differential Revision: D49887555 Pull Request resolved: https://github.com/pytorch/pytorch/pull/110526 Approved by: https://github.com/chenyang78	2023-10-05 19:05:20 +00:00
Oleg Khabinov	cf1b494afd	[AOTInductor] Store loaded kernels in the model (#110554 ) Defining kernels as static vars is problematic for subsequent model loading on non-default CUDA devices. If those kernels were first loaded in the context of device #0, they are no longer nullptr, so they are not loaded again and therefore won't work on devices other than device #0. This change makes devices remembered at model level in AOT mode. Pull Request resolved: https://github.com/pytorch/pytorch/pull/110554 Approved by: https://github.com/chenyang78, https://github.com/desertfire	2023-10-05 10:17:05 +00:00
Kazuaki Ishizaki	434a996c42	Fix typo under torch/_inductor directory (#110530 ) This PR fixes typos in comments and messages in files under the `torch/_inductor` directory. Pull Request resolved: https://github.com/pytorch/pytorch/pull/110530 Approved by: https://github.com/kit1980	2023-10-05 02:17:20 +00:00
Yang Chen	46a5558cd5	[AOTInductor] Simplified AOTInductor interface and model class (#110411 ) Summary: This PR removed several APIs from the AOTInductor interface, which are not used by the client. It also simplified AOTInductor's model class by removing the dim info for input/output tensors. We included dim info before to return max output shapes, which was used by the client to allocate memory for output tensors. Now, we allocate output tensor memory from the .so so that we don't need to maintain such information any more. The deletion of dim info from the model class also simplified the codegen quite a bit. Test Plan: ci Reviewed By: khabinov Differential Revision: D49835430 Pull Request resolved: https://github.com/pytorch/pytorch/pull/110411 Approved by: https://github.com/khabinov, https://github.com/desertfire, https://github.com/jansel	2023-10-04 18:35:24 +00:00
Bin Bao	539367f0bc	[aotindutor] Refactor optional value codegen (#110233 ) Summary: Simplify the codegen for optional values by using c10::nullopt, and we don't need placeholders like OptionalScalar because we can simply use None for that purpose. Pull Request resolved: https://github.com/pytorch/pytorch/pull/110233 Approved by: https://github.com/jansel	2023-10-04 17:18:02 +00:00
Sherlock Huang	50054b1a62	[AOTInductor] ProxyExecutor support ReinterpretView inputs (#110451 ) Summary: See wrapper.codegen_reinterpret_view(): it returns a temporary handle for the tensor, which has the following problem. ``` # NB, the return handle here represents a temporary tensor, which will be automatically # released. # Here's a sample usage in the cpp wrapper code: # ``` # aoti_torch_addmm_out( # buf1, # arg1_1, # RAIIAtenTensorHandle(tmp_tensor_handle_0), # buf0, # 1L, # 1L)); # ``` # RAIIAtenTensorHandle(tmp_tensor_handle_0) will be released after the call to addmm_out. # This could be problematic when it's used in a different pattern, for example: # ```` # AtenTensorHandle tensor_args[] = {RAIIAtenTensorHandle(tmp_tensor_handle_2), buf5, buf6}; # aoti_torch_proxy_executor_call_function(..., tensor_args); # ```` # RAIIAtenTensorHandle(tmp_tensor_handle_2) will be invalid when it's used in the latter # kernel call. return f"RAIIAtenTensorHandle({tmp_name})" ``` As a result, ProxyExecutor would generate the following code, which causes an invalid memory access. Before: ``` // Source Nodes: [fn_with_tuple_output], Original ATen: [fb.fn_with_tuple_output] AtenTensorHandle tmp_tensor_handle_2; AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch__reinterpret_tensor(buf3, 2, int_array_0, int_array_1, 0L, &tmp_tensor_handle_2)); ... AtenTensorHandle tensor_args[] = {RAIIAtenTensorHandle(tmp_tensor_handle_2), buf5, buf6}; int64_t int_args[] = {1}; aoti_torch_proxy_executor_call_function(proxy_executor, 1, 1, int_args, 3, tensor_args); buf3.reset(); ``` With the fix in this diff, ProxyExecutor generates the following code. After: ``` // Source Nodes: [fn_with_tuple_output], Original ATen: [fb.fn_with_tuple_output] AtenTensorHandle tmp_tensor_handle_2; AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch__reinterpret_tensor(buf3, 2, int_array_0, int_array_1, 0L, &tmp_tensor_handle_2)); ... 
aoti_torch_proxy_executor_call_function(proxy_executor, 1, 1, std::vector<int64_t>{1}.data(), 3, std::vector<AtenTensorHandle>{RAIIAtenTensorHandle(tmp_tensor_handle_2), buf5, buf6}.data()); buf3.reset(); ``` I am not exactly a big fan of such `std::vector{...}.data()` for creating a temp array, but I can't think of another fix. Test Plan: buck2 run mode/dev-nosan deeplearning/aot_inductor/test:test_custom_ops Reviewed By: desertfire Differential Revision: D49758764 Pull Request resolved: https://github.com/pytorch/pytorch/pull/110451 Approved by: https://github.com/desertfire	2023-10-04 02:20:31 +00:00
Mu-Chu Lee	836ba6430a	[AOTInductor] Initial functionality for Inf and NaN checker (#109526 ) Summary: Add initial functionality for Inf and NaN checker for AOTInductor. Test Plan: Included in commit. Skipped for CI as SIGABRT can't be captured by pytest. Reviewers: Subscribers: Tasks: Tags: Differential Revision: [D49379751](https://our.internmc.facebook.com/intern/diff/D49379751) Pull Request resolved: https://github.com/pytorch/pytorch/pull/109526 Approved by: https://github.com/chenyang78	2023-10-03 22:39:42 +00:00
Yang Chen	da63c7f2c3	[AOTInductor] remove CUDA dependency for cpp backend (#110409 ) Summary: Previously, we linked against cuda libs even for the pure cpp backend. This caused issues for cases where the inference platform does not have GPUs. This diff removes the cuda dependency for the cpp backend. Reviewed By: bertmaher, muchulee8, mikekgfb Differential Revision: D49800712 Pull Request resolved: https://github.com/pytorch/pytorch/pull/110409 Approved by: https://github.com/bertmaher, https://github.com/desertfire	2023-10-03 18:36:00 +00:00
Sherlock Huang	898656e9d1	[AOTInductor] ProxyExecutor supports Tuple of Tensor and List[Tensor] in returns (#110187 ) Summary: ProxyExecutor supports custom ops that return a tuple mixed of Tensor and List[Tensor] e.g. `"fn_with_mix_outputs(Tensor t, Tensor[] tensors) -> (Tensor, Tensor[])"` Example: `out7, [out8, out9] = torch.ops.fb.fn_with_mix_outputs(out5, [out6, out4])` got compiled into ``` AtenTensorHandle buf11_handle; // output buffer AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_new_uninitialized_tensor(&buf11_handle)); RAIIAtenTensorHandle buf11(buf11_handle); AtenTensorHandle buf12_handle; // output buffer AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_new_uninitialized_tensor(&buf12_handle)); RAIIAtenTensorHandle buf12(buf12_handle); AtenTensorHandle buf13_handle; // output buffer AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_new_uninitialized_tensor(&buf13_handle)); RAIIAtenTensorHandle buf13(buf13_handle); AtenTensorHandle tensor_args_var_7[] = {buf8.get(), buf9.get(), buf6.get(), buf11.get(), buf12.get(), buf13.get()}; int64_t int_args_var_8[] = {}; aoti_torch_proxy_executor_call_function(proxy_executor, 3, 0, int_args_var_8, 6, tensor_args_var_7); ``` Serialized extern node ``` { "name": "buf10", "node": { "target": "fb::fn_with_mix_outputs", "inputs": [ { "name": "t", "arg": { "asTensor": { "name": "buf8" } } }, { "name": "tensors", "arg": { "asTensors": [ { "name": "buf9" }, { "name": "buf6" } ] } } ], "outputs": [ { "asTensor": { "name": "buf11" } }, { "asTensors": [ { "name": "buf12" }, { "name": "buf13" } ] } ], "metadata": {} } } ``` Test Plan: Test Differential Revision: D49710320 Pull Request resolved: https://github.com/pytorch/pytorch/pull/110187 Approved by: https://github.com/chenyang78	2023-09-30 19:47:01 +00:00
Adnan Akhundov	2ead6c2f6e	Skip launching kernels with zero grid in AOT Inductor (#110312 ) Summary: with the grid computed in terms of unbacked `SymInt`s, it can happen that the grid is zero size. This causes CUDA error on `cuLaunchKernel` in the AOT Inductor codegen. In this PR, when the grid contains unbacked `SymInt`s, a check is added around the `launchKernel` in the AOT Inductor's C++ wrapper codegen to make sure that the grid is not zero-size. Pull Request resolved: https://github.com/pytorch/pytorch/pull/110312 Approved by: https://github.com/chenyang78	2023-09-30 09:12:56 +00:00
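The guard described above can be pictured with a minimal Python analogue. The actual change wraps `launchKernel` in the generated C++ wrapper; the function `launch_if_nonempty` and its callback-style `launch` argument below are hypothetical and exist only to show the control flow of skipping a launch when any grid dimension is zero.

```python
def launch_if_nonempty(launch, grid, *args):
    """Call launch(grid, *args) only when every grid dimension is positive.

    Returns True if the kernel was launched and False if the launch was
    skipped because the grid is empty (for example, when a dimension derived
    from an unbacked symbolic size evaluates to 0 at runtime).
    """
    if any(dim <= 0 for dim in grid):
        return False
    launch(grid, *args)
    return True
```

With a recording callback such as `lambda grid, *a: calls.append(grid)`, a grid of `(0, 1, 1)` leaves `calls` empty, while `(4, 1, 1)` triggers exactly one launch.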
Sherlock Huang	d7de26804e	[AOTInductor] ProxyExecutor supports List[Tensor] return type (#110182 ) Summary: Support custom ops that return the List[Tensor] type, like `"fn_with_list_output(Tensor[] tensors, int i) -> Tensor[]"` As an example, `out5, out6 = torch.ops.fb.fn_with_list_output([out3, out4], 1)` is compiled into ``` AtenTensorHandle buf8_handle; // output buffer AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_new_uninitialized_tensor(&buf8_handle)); RAIIAtenTensorHandle buf8(buf8_handle); AtenTensorHandle buf9_handle; // output buffer AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_new_uninitialized_tensor(&buf9_handle)); RAIIAtenTensorHandle buf9(buf9_handle); AtenTensorHandle tensor_args_var_5[] = {buf5.get(), buf6.get(), buf8.get(), buf9.get()}; int64_t int_args_var_6[] = {1}; aoti_torch_proxy_executor_call_function(proxy_executor, 2, 1, int_args_var_6, 4, tensor_args_var_5); ``` Test Plan: Test Differential Revision: D49694691 Pull Request resolved: https://github.com/pytorch/pytorch/pull/110182 Approved by: https://github.com/chenyang78	2023-09-29 18:21:48 +00:00
Yang Chen	30759848fa	[inductor] handle non-list/tuple outputs for FallbackKernel (#110145 ) generate_output may return non-list/tuple outputs. Let's force those to be list, because we will enumerate kernel.outputs later in the codegen. Also fixed a minor issue in an assertion message. Pull Request resolved: https://github.com/pytorch/pytorch/pull/110145 Approved by: https://github.com/aakhundov	2023-09-29 17:13:26 +00:00
Sherlock Huang	7f2b51c668	[AOTInductor] ProxyExecutor supports custom op with tuple output (#110140 ) Summary: Extend ProxyExecutor to support custom ops with tuple outputs. Generated wrapper code for `out3, out4 = torch.ops.fb.fn_with_tuple_output(out2, 1)` ``` AtenTensorHandle buf5_handle; // output buffer AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_new_uninitialized_tensor(&buf5_handle)); RAIIAtenTensorHandle buf5(buf5_handle); AtenTensorHandle buf6_handle; // output buffer AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_new_uninitialized_tensor(&buf6_handle)); RAIIAtenTensorHandle buf6(buf6_handle); AtenTensorHandle tensor_args_var_3[] = {buf3.get(), buf5.get(), buf6.get()}; int64_t int_args_var_4[] = {1}; aoti_torch_proxy_executor_call_function(proxy_executor, 1, 1, int_args_var_4, 3, tensor_args_var_3); ``` Test Plan: Test Differential Revision: D49673994 Pull Request resolved: https://github.com/pytorch/pytorch/pull/110140 Approved by: https://github.com/chenyang78	2023-09-28 02:50:39 +00:00
Sherlock Huang	ec5bbef8af	[AOTInductor] Switch ProxyExecutor to use AtenTensorHandle (#109748 ) Summary: Switch ProxyExecutor to use AtenTensorHandle. Test Plan: E2E Test Differential Revision: D49471659 Pull Request resolved: https://github.com/pytorch/pytorch/pull/109748 Approved by: https://github.com/yifuwang, https://github.com/desertfire, https://github.com/chenyang78	2023-09-27 17:51:30 +00:00
Yang Chen	4d0ae7c9da	[inductor] support _scaled_dot_product_flash_attention fallback (#110085 ) Summary: This PR supports _scaled_dot_product_flash_attention fallback kernel. Note that in the abi_compatible mode, we retrieve outputs by passing output argument pointers rather than relying on std::get. It also fixes an issue related to dynamic shapes, where we wrongfully query undefined dynamic symbols. Test Plan: ci Reviewed By: frank-wei Differential Revision: D49620191 Pull Request resolved: https://github.com/pytorch/pytorch/pull/110085 Approved by: https://github.com/desertfire	2023-09-27 00:09:56 +00:00


204 Commits