pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
Jason Ansel	a762dc0357	[inductor] Multi-kernel + cooperative reductions (#138893 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138893 Approved by: https://github.com/shunting314 ghstack dependencies: #138533	2024-10-29 15:45:17 +00:00
Jason Ansel	2b937e4e6d	[inductor] Cooperative reductions (#137756 ) Example generated code for `(x+y).sum()`: ```py @triton.jit def triton_unk_fused_add_sum_0(in_ptr0, in_ptr1, out_ptr0, ws_ptr, semaphores_ptr, xnumel, rnumel, XBLOCK : tl.constexpr, RBLOCK : tl.constexpr, RSPLIT : tl.constexpr): xnumel = 1 rnumel = 1048576 rsplit_id = tl.program_id(0) num_rblocks = (rnumel + RBLOCK - 1) // RBLOCK rsplit_chunk = (num_rblocks + RSPLIT - 1) // RSPLIT * RBLOCK rsplit_start = rsplit_chunk * rsplit_id rsplit_end = rsplit_chunk * (rsplit_id + 1) xoffset = tl.program_id(1) * XBLOCK xindex = xoffset + tl.arange(0, XBLOCK)[:, None] xmask = tl.full([XBLOCK, RBLOCK], True, tl.int1) rbase = tl.arange(0, RBLOCK)[None, :] _tmp4 = tl.full([XBLOCK, RBLOCK], 0, tl.float32) for roffset in range(rsplit_start, rsplit_end, RBLOCK): rindex = roffset + rbase rmask = rindex < rnumel r0 = rindex tmp0 = tl.load(in_ptr0 + (r0), rmask, eviction_policy='evict_first', other=0.0) tmp1 = tl.load(in_ptr1 + (r0), rmask, eviction_policy='evict_first', other=0.0) tmp2 = tmp0 + tmp1 tmp3 = tl.broadcast_to(tmp2, [XBLOCK, RBLOCK]) tmp5 = _tmp4 + tmp3 _tmp4 = tl.where(rmask, tmp5, _tmp4) tmp4 = tl.sum(_tmp4, 1)[:, None] if RSPLIT > 1: tmp4_ws = (ws_ptr + 0).to(tl.pointer_type(tl.float32)) tl.store(tmp4_ws + (xindex * RSPLIT + rsplit_id), tmp4, None) if RSPLIT > 1: triton_helpers.gpu_barrier(semaphores_ptr + (2 * tl.program_id(1) + 0), RSPLIT, True) if RSPLIT > 1: tmp4_peers = tl.load(tmp4_ws + (xindex * RSPLIT + tl.arange(0, RSPLIT)[None,:]), None, eviction_policy='evict_first') tmp4 = tl.sum(tmp4_peers, 1)[:, None] if rsplit_id == (0 % RSPLIT): tl.store(out_ptr0 + (tl.full([XBLOCK, 1], 0, tl.int32)), tmp4, None) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/137756 Approved by: https://github.com/eellison	2024-10-29 00:45:53 +00:00
Adnan Akhundov	ab09c4d913	Add host-side TMA support to AOTInductor (#138878 ) This adds host-side Triton TMA support to AOTInductor. Notes: - Two helper functions, `init1DTMADescriptor` and `init2DTMADescriptor` are added to the C++ wrapper codegen on GPU, conditioned on the model having user-defined Triton kernels with host-side TMA (CUDA-specific). - C++ wrapper codegen on GPU emits TMA descriptor initialization via the aforementioned helper functions. - Special handling added for the TMA descriptors (in the Python wrapper codegen) during the compile-time autotuning, as the underlying tensor can't be passed directly to the user-defined Triton kernel. TMA descriptors are generated in-between the source tensor's buffer and the kernel call, like in the full Python wrapper codegen. - This PR concludes the host-side Triton TMA support in PT2. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138878 Approved by: https://github.com/desertfire, https://github.com/chenyang78 ghstack dependencies: #138759, #138877	2024-10-28 23:39:53 +00:00
PyTorch MergeBot	60d1c7138d	Revert "[inductor] Cooperative reductions (#137756 )" This reverts commit `fed37dbfbc`. Reverted https://github.com/pytorch/pytorch/pull/137756 on behalf of https://github.com/jeanschmidt due to ROCM tests are timing out :( ([comment](https://github.com/pytorch/pytorch/pull/137756#issuecomment-2441579322))	2024-10-28 13:24:33 +00:00
Jason Ansel	fed37dbfbc	[inductor] Cooperative reductions (#137756 ) Example generated code for `(x+y).sum()`: ```py @triton.jit def triton_unk_fused_add_sum_0(in_ptr0, in_ptr1, out_ptr0, ws_ptr, semaphores_ptr, xnumel, rnumel, XBLOCK : tl.constexpr, RBLOCK : tl.constexpr, RSPLIT : tl.constexpr): xnumel = 1 rnumel = 1048576 rsplit_id = tl.program_id(0) num_rblocks = (rnumel + RBLOCK - 1) // RBLOCK rsplit_chunk = (num_rblocks + RSPLIT - 1) // RSPLIT * RBLOCK rsplit_start = rsplit_chunk * rsplit_id rsplit_end = rsplit_chunk * (rsplit_id + 1) xoffset = tl.program_id(1) * XBLOCK xindex = xoffset + tl.arange(0, XBLOCK)[:, None] xmask = tl.full([XBLOCK, RBLOCK], True, tl.int1) rbase = tl.arange(0, RBLOCK)[None, :] _tmp4 = tl.full([XBLOCK, RBLOCK], 0, tl.float32) for roffset in range(rsplit_start, rsplit_end, RBLOCK): rindex = roffset + rbase rmask = rindex < rnumel r0 = rindex tmp0 = tl.load(in_ptr0 + (r0), rmask, eviction_policy='evict_first', other=0.0) tmp1 = tl.load(in_ptr1 + (r0), rmask, eviction_policy='evict_first', other=0.0) tmp2 = tmp0 + tmp1 tmp3 = tl.broadcast_to(tmp2, [XBLOCK, RBLOCK]) tmp5 = _tmp4 + tmp3 _tmp4 = tl.where(rmask, tmp5, _tmp4) tmp4 = tl.sum(_tmp4, 1)[:, None] if RSPLIT > 1: tmp4_ws = (ws_ptr + 0).to(tl.pointer_type(tl.float32)) tl.store(tmp4_ws + (xindex * RSPLIT + rsplit_id), tmp4, None) if RSPLIT > 1: triton_helpers.gpu_barrier(semaphores_ptr + (2 * tl.program_id(1) + 0), RSPLIT, True) if RSPLIT > 1: tmp4_peers = tl.load(tmp4_ws + (xindex * RSPLIT + tl.arange(0, RSPLIT)[None,:]), None, eviction_policy='evict_first') tmp4 = tl.sum(tmp4_peers, 1)[:, None] if rsplit_id == (0 % RSPLIT): tl.store(out_ptr0 + (tl.full([XBLOCK, 1], 0, tl.int32)), tmp4, None) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/137756 Approved by: https://github.com/eellison ghstack dependencies: #138970	2024-10-27 16:31:38 +00:00
Xinran / Allan Rui	ba6526814a	Add dtype attribute to CSEVariable (#136778 ) Summary: - This diff introduces `dtype` attribute to `TritonCSEVariable` and a dtype propagation helper function to infer dtype from input to output for each op. - There will be a follow-up diff that uses this `dtype` information in `TritonCSEVariable` to perform dtype-aware codegen. Test Plan: CI Differential Revision: D61815079 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136778 Approved by: https://github.com/eellison, https://github.com/blaine-rister	2024-10-25 18:00:30 +00:00
Adam Mainz	d0640b945b	[inductor][nit] removing unnecessary else statements (#138789 ) Summary: while reading through inductor template code I found a few places where else statements were driving me crazy. Fixing them as I read Test Plan: CI Differential Revision: D64882385 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138789 Approved by: https://github.com/aakhundov	2024-10-25 17:59:25 +00:00
PyTorch MergeBot	3b186c5659	Revert "[AOTI] Fix test_index_put_with_none_index_cpu_with_stack_allocation (#138303 )" This reverts commit `1417b2cd05`. Reverted https://github.com/pytorch/pytorch/pull/138303 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/138303#issuecomment-2427991065))	2024-10-22 00:46:48 +00:00
Bin Bao	1417b2cd05	[AOTI] Fix test_index_put_with_none_index_cpu_with_stack_allocation (#138303 ) Summary: The problem happened after splitting CppWrapperCpu and CppWrapperCpuArrayRef, because CppWrapperCpuArrayRef.generate_index_put_fallback missed a statement. Running test_aot_inductor.py as a whole didn't reveal the problem, but running test_index_put_with_none_index_cpu_with_stack_allocation individually did. Digging deeper, the root cause is init_backend_registration has incorrectly cached CPU CppWrapperCodegen class, which means CppWrapperCpuArrayRef was never picked when running test_aot_inductor.py as a whole. Differential Revision: [D64598714](https://our.internmc.facebook.com/intern/diff/D64598714) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138303 Approved by: https://github.com/hl475	2024-10-21 13:47:50 +00:00
Jason Ansel	4632594546	[inductor] Move V.graph.scheduler.current_device to V.graph.current_device (#138252 ) There are some places where it would be nice to use this, but the scheduler hasn't yet been created. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138252 Approved by: https://github.com/eellison ghstack dependencies: #138170	2024-10-18 23:05:54 +00:00
Jason Ansel	85a6a782e5	[inductor] Generalize WorkspaceArg for graph-level semaphores (#138170 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138170 Approved by: https://github.com/Chillee	2024-10-18 23:05:54 +00:00
Adnan Akhundov	d116d007ee	Add host-side Triton TMA support to Inductor (#137950 ) This adds Dynamo tracing support for the host-side Triton TMA API (see `create_2d_tma_descriptor` calls on the host in the [Triton tutorial](https://triton-lang.org/main/getting-started/tutorials/09-persistent-matmul.html#sphx-glr-getting-started-tutorials-09-persistent-matmul-py)). A few notes: - Here we assume the availability of the host-side TMA API added to upstream Triton in https://github.com/triton-lang/triton/pull/4498. As of time of writing, this is not a part of the PT2 OSS Triton pin (although back-ported internally). OSS Triton pin update should be done in December 2024. - Due to Dynamo support implemented in the previous PR, the `tma_descriptor_metadata` dict is delivered to the `triton_kerenl_wrap_` lowering and passed to the `ir.UserDefinedTritonKernel` as additional argument. - Looking into the `tma_descriptor_metadata`, `ir.UserDefinedTritonKernel` substitutes the corresponding `TensorBox` arguments of the kernel (swapped upstream in Dynamo) by the new `ir.TMADescriptor` nodes implementing TMA descriptors in Inductor IR. - `ir.TMADescriptor.__init__` provides the wiring between the upstream underlying `ir.TensorBox` and the downstream `ir.UserDefinedTritonKernel` kernel. In particular, we use `ir.NonOwnedLayout` wrapping `ir.ReinterpretView` to avoid the upstream tensor's buffer being deleted prematurely (before the TMA descriptor is used in the Triton kernel). - Via `ir.TMADescriptor.codegen`, the Triton's `create_{1d,2d}_tma_descriptor` function call is codegened in the wrapper (in the host code). - New `TMADescriptorArg` dataclass is added to handle the Triton kernel metadata pertinent to host-side TMA. - AOT Inductor support will be implemented in a follow-up PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137950 Approved by: https://github.com/eellison ghstack dependencies: #137677	2024-10-18 06:27:24 +00:00
Bin Bao	2e67d7cc35	[AOTI] Remove the non-ABI-compatible mode (part 1) (#138009 ) Summary: The ABI-compatible mode has been turned on as default in https://github.com/pytorch/pytorch/pull/136534. Removing the non-ABI-compatible logic to greatly simplify the wrapper codegen logic. Differential Revision: [D64439676](https://our.internmc.facebook.com/intern/diff/D64439676) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138009 Approved by: https://github.com/chenyang78 ghstack dependencies: #137982, #138016	2024-10-17 02:48:26 +00:00
Benjamin Glass	a968576777	Add lowering for aten.searchsorted (#135701 ) Adds lowering for `aten.searchsorted`. This entails: 1. Adding support for multi-dimensional bucket tensors to `ops.bucketize`. 2. Adding support for striding to `ops.bucketize`. 3. Adding support for sorting tensors to `ops.bucketize`. 4. Adding a lowering for `aten.searchsorted.Tensor`. 5. Adding a basic decomposition for `aten.searchsorted.Scalar` that calls into the lowering for tensors. 6. Updating the meta-function for `aten.searchsorted` to properly check some of the sizing conditions. Closes #135873 Differential Revision: [D63766514](https://our.internmc.facebook.com/intern/diff/D63766514) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135701 Approved by: https://github.com/amjames, https://github.com/eellison, https://github.com/davidberard98	2024-10-04 19:26:05 +00:00
Jez Ng	71aac59e93	Add Triton CPU as an Inductor backend (#133408 ) The goal is to use Inductor-generated kernels to stress test the new Triton CPU backend. Differential Revision: [D63298968](https://our.internmc.facebook.com/intern/diff/D63298968) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133408 Approved by: https://github.com/jansel, https://github.com/blaine-rister, https://github.com/malfet	2024-09-30 20:24:52 +00:00
Jason Ansel	cf53ab95dc	[halide-backend] Fix ops.fma codegen (#136810 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136810 Approved by: https://github.com/eellison ghstack dependencies: #136808, #136809	2024-09-28 19:26:04 +00:00
PyTorch MergeBot	36428f91e9	Revert "Add Triton CPU as an Inductor backend (#133408 )" This reverts commit `31c0467594`. Reverted https://github.com/pytorch/pytorch/pull/133408 on behalf of https://github.com/int3 due to internal tests failing ([comment](https://github.com/pytorch/pytorch/pull/133408#issuecomment-2379692517))	2024-09-27 16:54:27 +00:00
Jez Ng	31c0467594	Add Triton CPU as an Inductor backend (#133408 ) The goal is to use Inductor-generated kernels to stress test the new Triton CPU backend. Differential Revision: [D63298968](https://our.internmc.facebook.com/intern/diff/D63298968) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133408 Approved by: https://github.com/jansel, https://github.com/blaine-rister, https://github.com/malfet	2024-09-26 15:35:26 +00:00
Bin Bao	5ad5f40283	[AOTI][reland] Create another wrapper class to handle ArrayRef (#136461 ) Summary: Create another wrapper codegen class to handle ArrayRef for CPU. The goal is to simplify the regular cpp wrapper codegen logic and the generated cpp code. Test Plan: CI Differential Revision: [D63300361](https://our.internmc.facebook.com/intern/diff/D63300361) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136461 Approved by: https://github.com/angelayi, https://github.com/chenyang78 ghstack dependencies: #136062	2024-09-25 14:00:09 +00:00
Bin Bao	95c0f7493f	[Inductor] Rename WrapperCodeGen to PythonWrapperCodegen (#136062 ) Summary: Rename WrapperCodeGen to PythonWrapperCodegen to make its meaning more explicit. Differential Revision: [D63300358](https://our.internmc.facebook.com/intern/diff/D63300358) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136062 Approved by: https://github.com/angelayi, https://github.com/chenyang78	2024-09-24 21:02:51 +00:00
PyTorch MergeBot	274883083d	Revert "[AOTI] Create another wrapper class to handle ArrayRef (#136318 )" This reverts commit `d21841d077`. Reverted https://github.com/pytorch/pytorch/pull/136318 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/136318#issuecomment-2368957264))	2024-09-23 17:47:49 +00:00
Bin Bao	d21841d077	[AOTI] Create another wrapper class to handle ArrayRef (#136318 ) Summary: Create another wrapper codegen class to handle ArrayRef for CPU. The goal is to simplify the regular cpp wrapper codegen logic and the generated cpp code. Test Plan: CI Differential Revision: D62961885 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136318 Approved by: https://github.com/frank-wei	2024-09-23 15:10:27 +00:00
Shangdi Yu	3bc073d728	[aoti] Fix workspace generation for triton (#135552 ) Fixes #131337 - add `arg_type` for workspace_arg, the type is consistent with the type in `generate_workspace_allocation()`. - do not generate example tensors for `workspace`, and use `generate_workspace_allocation()` instead. - add workspace allocation generation code to `kernel_autotune_calls`. e.g. ```python workspace = empty_strided_cuda((1280, ), (1, ), torch.uint8) workspace.zero_() ..... triton_spl_fused_add_cumprod_0.run(buf2, arg0_1, arg1_1, workspace, 1, 10000, grid=split_scan_grid(1, 10000), stream=stream0) del buf2, arg0_1, arg1_1, workspace ``` - add `empty_strided_cuda = torch._C._dynamo.guards._empty_strided_cuda` to the header of triton autotune code. The generated cpp has lines like below, so we also implement a `zero_()` for ` AtenTensorHandle `. ```cpp static constexpr int64_t int_array_0[] = {1280L, }; static constexpr int64_t int_array_1[] = {1L, }; AtenTensorHandle workspace_handle; AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_empty_strided(1, int_array_0, int_array_1, cached_torch_dtype_uint8, cached_torch_device_type_cuda, 0, &workspace_handle)); RAIIAtenTensorHandle workspace(workspace_handle); workspace.zero_(); ``` - Fix handle grid_fn for grid computation. Pass in "RBLOCK" to `split_scan_grid` - Fix dynamic shapes: Without the fix we generate code that looks like this `workspace = empty_strided_cuda((32((255 + s0) // 256), ), (1, ), torch.uint8)` when doing triton autotune and `s0` is not defined. The solution approach is to use `V.graph.sizevars.size_hint(nbytes)` to realize the workspace size for triton autotune. Note that we only realize it for triton autotune code, but not for the cpp cuda code. - We also generate slightly different cpp code depending on if `abi_compatible` is turned on. ```cpp RAIIAtenTensorHandle workspace(workspace_handle); AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_zero_(workspace.get())); ``` vs ```cpp at::Tensor workspace = at::detail::empty_strided_cuda({8L(c10::div_floor_integer(static_cast<int64_t>((255L + s0)), static_cast<int64_t>(256L))), }, {1L, }, at::kByte, c10::DeviceType::CUDA); workspace.zero_(); ``` Test Plan: ``` TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCHINDUCTOR_CPP_WRAPPER=1 python test/inductor/test_torchinductor.py -k GPUTests.test_consecutive_split_cumprod_cuda python test/inductor/test_cuda_cpp_wrapper.py TestCudaWrapper.test_consecutive_split_cumprod_cuda_cuda_wrapper python test/inductor/test_cuda_cpp_wrapper.py DynamicShapesCudaWrapperCudaTests.test_consecutive_split_cumprod_cuda_dynamic_shapes_cuda_wrapper TORCHINDUCTOR_ABI_COMPATIBLE=1 python test/inductor/test_cuda_cpp_wrapper.py TestCudaWrapper.test_consecutive_split_cumprod_cuda_cuda_wrapper TORCHINDUCTOR_CPP_WRAPPER=1 python test/inductor/test_torchinductor.py -k GPUTests.test_consecutive_split_cumprod_cuda ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135552 Approved by: https://github.com/desertfire	2024-09-22 04:51:37 +00:00
PyTorch MergeBot	d0cebedb31	Revert "Add Triton CPU as an Inductor backend (#133408 )" This reverts commit `e498b02b47`. Reverted https://github.com/pytorch/pytorch/pull/133408 on behalf of https://github.com/jeanschmidt due to Broke internal signals, see D62737208 for more details ([comment](https://github.com/pytorch/pytorch/pull/133408#issuecomment-2353623816))	2024-09-16 18:33:33 +00:00
Bin Bao	d833f49602	[reland][Inductor] Rename `cpp_wrapper_cuda.py` as `cpp_wrapper_gpu.py` (#136046 ) Summary: Reland https://github.com/pytorch/pytorch/pull/135313 after fixing internal build issues Test Plan: CI Differential Revision: D62658837 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136046 Approved by: https://github.com/chenyang78, https://github.com/etaf, https://github.com/jansel	2024-09-16 14:35:19 +00:00
Jez Ng	e498b02b47	Add Triton CPU as an Inductor backend (#133408 ) The goal is to use Inductor-generated kernels to stress test the new Triton CPU backend. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133408 Approved by: https://github.com/jansel	2024-09-14 21:45:19 +00:00
PyTorch MergeBot	deee21cb78	Revert "[Inductor] Rename `cpp_wrapper_cuda.py` as `cpp_wrapper_gpu.py` (#135313 )" This reverts commit `16b37b309f`. Reverted https://github.com/pytorch/pytorch/pull/135313 on behalf of https://github.com/izaitsevfb due to breaks internal builds ([comment](https://github.com/pytorch/pytorch/pull/135313#issuecomment-2349662091))	2024-09-13 17:53:21 +00:00
PyTorch MergeBot	18f9331e5d	Revert "[aoti] Fix workspace generation for triton (#135552 )" This reverts commit `d383325392`. Reverted https://github.com/pytorch/pytorch/pull/135552 on behalf of https://github.com/izaitsevfb due to blocks revert of #135313, internal failures, see D62511427 ([comment](https://github.com/pytorch/pytorch/pull/135552#issuecomment-2349641372))	2024-09-13 17:47:36 +00:00
Shangdi Yu	d383325392	[aoti] Fix workspace generation for triton (#135552 ) Fixes #131337 - add `arg_type` for workspace_arg, the type is consistent with the type in `generate_workspace_allocation()`. - do not generate example tensors for `workspace`, and use `generate_workspace_allocation()` instead. - add workspace allocation generation code to `kernel_autotune_calls`. e.g. ```python workspace = empty_strided_cuda((1280, ), (1, ), torch.uint8) workspace.zero_() ..... triton_spl_fused_add_cumprod_0.run(buf2, arg0_1, arg1_1, workspace, 1, 10000, grid=split_scan_grid(1, 10000), stream=stream0) del buf2, arg0_1, arg1_1, workspace ``` - add `empty_strided_cuda = torch._C._dynamo.guards._empty_strided_cuda` to the header of triton autotune code. The generated cpp has lines like below, so we also implement a `zero_()` for ` AtenTensorHandle `. ```cpp static constexpr int64_t int_array_0[] = {1280L, }; static constexpr int64_t int_array_1[] = {1L, }; AtenTensorHandle workspace_handle; AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_empty_strided(1, int_array_0, int_array_1, cached_torch_dtype_uint8, cached_torch_device_type_cuda, 0, &workspace_handle)); RAIIAtenTensorHandle workspace(workspace_handle); workspace.zero_(); ``` - Fix handle grid_fn for grid computation. Pass in "RBLOCK" to `split_scan_grid` - Fix dynamic shapes: Without the fix we generate code that looks like this `workspace = empty_strided_cuda((32((255 + s0) // 256), ), (1, ), torch.uint8)` when doing triton autotune and `s0` is not defined. The solution approach is to use `V.graph.sizevars.size_hint(nbytes)` to realize the workspace size for triton autotune. Note that we only realize it for triton autotune code, but not for the cpp cuda code. - We also generate slightly different cpp code depending on if `abi_compatible` is turned on. ```cpp RAIIAtenTensorHandle workspace(workspace_handle); AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_zero_(workspace.get())); ``` vs ```cpp at::Tensor workspace = at::detail::empty_strided_cuda({8L(c10::div_floor_integer(static_cast<int64_t>((255L + s0)), static_cast<int64_t>(256L))), }, {1L, }, at::kByte, c10::DeviceType::CUDA); workspace.zero_(); ``` Test Plan: ``` TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCHINDUCTOR_CPP_WRAPPER=1 python test/inductor/test_torchinductor.py -k GPUTests.test_consecutive_split_cumprod_cuda python test/inductor/test_cuda_cpp_wrapper.py TestCudaWrapper.test_consecutive_split_cumprod_cuda_cuda_wrapper python test/inductor/test_cuda_cpp_wrapper.py DynamicShapesCudaWrapperCudaTests.test_consecutive_split_cumprod_cuda_dynamic_shapes_cuda_wrapper TORCHINDUCTOR_ABI_COMPATIBLE=1 python test/inductor/test_cuda_cpp_wrapper.py TestCudaWrapper.test_consecutive_split_cumprod_cuda_cuda_wrapper TORCHINDUCTOR_CPP_WRAPPER=1 python test/inductor/test_torchinductor.py -k GPUTests.test_consecutive_split_cumprod_cuda ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135552 Approved by: https://github.com/desertfire	2024-09-12 23:53:09 +00:00
xinan.lin	16b37b309f	[Inductor] Rename `cpp_wrapper_cuda.py` as `cpp_wrapper_gpu.py` (#135313 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135313 Approved by: https://github.com/jansel, https://github.com/desertfire ghstack dependencies: #135312	2024-09-11 23:59:54 +00:00
xinan.lin	13ee85ca5e	[Inductor] Generalize cuda cpp wrapper as common triton based GPU cpp wrapper, will be reused by xpu in next PR. (#135312 ) [Inductor] Generalize cuda cpp wrapper as common triton based GPU cpp wrapper, will be reused by xpu in next PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135312 Approved by: https://github.com/jansel, https://github.com/desertfire, https://github.com/eellison	2024-09-11 23:59:54 +00:00
xinan.lin	ca16956b20	[Inductor] Generalize device guard codegen for cpp_wrapper mode. (#134761 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134761 Approved by: https://github.com/jansel, https://github.com/EikanWang ghstack dependencies: #134693	2024-09-10 10:11:52 +00:00
Jason Ansel	eac5e12548	[inductor] Move LoopBody to its own file (#135257 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135257 Approved by: https://github.com/oulgen	2024-09-07 16:29:15 +00:00
leslie-fang-intel	2c7e314803	[Inductor][CPP] Fix the issue of view dtype (#135301 ) Summary Fix issue: https://github.com/pytorch/pytorch/issues/135160, it's a regression introduced by https://github.com/pytorch/pytorch/pull/134569, where the dtype of `to_dtype_bitcast` was incorrectly handled when using the scalarize implementation. TestPlan ``` python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_view_dtype ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135301 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-09-06 23:36:44 +00:00
haozhe.zhu	f4641ca481	[Inductor] Remove VecChecker and fallback non-supported Vec op to Scalar impl with a for loop (#134569 ) Fall back non-vectorized op by scalar impl + for loop. Example code: ``` cpp_fused_igammac_0 = async_compile.cpp_pybinding(['const double', 'const double', 'double'], ''' #include "/tmp/torchinductor_root/z4/cz4j2mmotlx3z2b7u4fbjtdt4x6plhd67ljwzg5bk7ekv4xz6y7q.h" extern "C" void kernel(const double in_ptr0, const double* in_ptr1, double* out_ptr0) { { for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(48L); x0+=static_cast<int64_t>(8L)) { auto tmp0 = at::vec::VectorizedN<double,2>::loadu(in_ptr0 + static_cast<int64_t>(x0), 8); auto tmp1 = in_ptr1[static_cast<int64_t>(0L)]; auto tmp2 = at::vec::VectorizedN<double,2>(tmp1); auto tmp3 = [&]() { __at_align__ std::array<double, 8> tmpbuf0; tmp0.store(tmpbuf0.data(), 8); __at_align__ std::array<double, 8> tmpbuf1; tmp2.store(tmpbuf1.data(), 8); __at_align__ std::array<double, 8> tmpbuf_out; for (int i = 0; i < 8; i++) { tmpbuf_out[i] = calc_igammac(tmpbuf0[i], tmpbuf1[i]); } return at::vec::VectorizedN<double, 2>::loadu(tmpbuf_out.data(), 8); } () ; tmp3.store(out_ptr0 + static_cast<int64_t>(x0), 8); } #pragma omp simd simdlen(4) for(int64_t x0=static_cast<int64_t>(48L); x0<static_cast<int64_t>(50L); x0+=static_cast<int64_t>(1L)) { auto tmp0 = in_ptr0[static_cast<int64_t>(x0)]; auto tmp1 = in_ptr1[static_cast<int64_t>(0L)]; auto tmp2 = calc_igammac(tmp0, tmp1); out_ptr0[static_cast<int64_t>(x0)] = tmp2; } } } ''') ``` `frexp` are difficult to be handled by common `fallback` since it returns two `cse_var` `2ba60a1618/torch/_inductor/codegen/cpp.py (L752-L766)` So we added a special function to do that. ``` cpp_fused_frexp_0 = async_compile.cpp_pybinding(['const double', 'double', 'int32_t'], ''' #include "/tmp/torchinductor_root/z4/cz4j2mmotlx3z2b7u4fbjtdt4x6plhd67ljwzg5bk7ekv4xz6y7q.h" extern "C" void kernel(const double in_ptr0, double* out_ptr0, int32_t* out_ptr1) { { for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(16L); x0+=static_cast<int64_t>(8L)) { auto tmp0 = at::vec::VectorizedN<double,2>::loadu(in_ptr0 + static_cast<int64_t>(x0), 8); at::vec::Vectorized<int32_t> tmp1; at::vec::VectorizedN<double, 2> tmp2; [&]() { __at_align__ std::array<double, 8> tmpbuf; tmp0.store(tmpbuf.data(), 8); __at_align__ std::array<int32_t, 8> tmpbuf_exponent; __at_align__ std::array<double, 8> tmpbuf_mantissa; for (int i = 0; i < 8; i++) { tmpbuf_mantissa[i] = std::frexp(tmpbuf[i], &tmpbuf_exponent[i]); } tmp1 = at::vec::Vectorized<int32_t>::loadu(tmpbuf_exponent.data(), 8); tmp2 = at::vec::VectorizedN<double, 2>::loadu(tmpbuf_mantissa.data(), 8); } (); tmp2.store(out_ptr0 + static_cast<int64_t>(x0), 8); tmp1.store(out_ptr1 + static_cast<int64_t>(x0), 8); } #pragma omp simd simdlen(4) for(int64_t x0=static_cast<int64_t>(16L); x0<static_cast<int64_t>(20L); x0+=static_cast<int64_t>(1L)) { auto tmp0 = in_ptr0[static_cast<int64_t>(x0)]; int32_t tmp1; auto tmp2 = std::frexp(tmp0, &tmp1); out_ptr0[static_cast<int64_t>(x0)] = tmp2; out_ptr1[static_cast<int64_t>(x0)] = tmp1; } } } ''') ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/134569 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-08-31 11:19:57 +00:00
Rachel Guo	3965f11837	Minor type annotation updates following up D60954888 (#133382 ) Summary: As title. Test Plan: CI Ran lintrunner locally but might have to continue to keep an eye on more oss linting issue if comes up. Differential Revision: D61240900 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133382 Approved by: https://github.com/ColinPeppler	2024-08-14 21:36:42 +00:00
Oguz Ulgen	72d2dba992	Add None return type to init (#132335 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132335 Approved by: https://github.com/albanD	2024-08-01 15:26:45 +00:00
eellison	f32ab3b9e3	Migrate Inductor scheduler, dependencies, ir, and codegen/common to use OrderedSet (#130004 ) Python's set is non deterministic. There is an internal failure which we recently ran into which did not consistently fail. See, repro here: P1453035092. Now, with these changes, it does consistently fail. In follow ups we could also consider adding a lintrule for uses of either set() or set literals. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130004 Approved by: https://github.com/oulgen	2024-08-01 04:37:15 +00:00
PyTorch MergeBot	784a6ec5a3	Revert "Migrate Inductor scheduler, dependencies, ir, and codegen/common to use OrderedSet (#130004 )" This reverts commit `13d744464f`. Reverted https://github.com/pytorch/pytorch/pull/130004 on behalf of https://github.com/clee2000 due to broke lint [GH job link](https://github.com/pytorch/pytorch/actions/runs/10183945999/job/28170099930) [HUD commit link](`13d744464f`) probably a landrace, the base is 21 hours old ([comment](https://github.com/pytorch/pytorch/pull/130004#issuecomment-2260946562))	2024-07-31 16:49:21 +00:00
eellison	13d744464f	Migrate Inductor scheduler, dependencies, ir, and codegen/common to use OrderedSet (#130004 ) Python's set is non deterministic. There is an internal failure which we recently ran into which did not consistently fail. See, repro here: P1453035092. Now, with these changes, it does consistently fail. In follow ups we could also consider adding a lintrule for uses of either set() or set literals. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130004 Approved by: https://github.com/oulgen	2024-07-31 16:22:11 +00:00
leslie-fang-intel	f8e4060484	[Inductor][CPP] Enhance cppcsevar data type deduce (#130827 ) Summary Previously, we used `data_type_propagation` at the start of `codegen` to deduce the data type of each node and save this information in `node.meta[OptimizationContext.key]`. Then, we used this node metadata to update the cppcsevar data type in `update_on_args`. However, this method is not always correct. For example, in the codegen of `indirect_indexing` (see [here](`096dc444ce/torch/_inductor/codegen/common.py (L1844)`)), we insert nodes on the fly and reuse the node of `indirect_indexing` to set the `cppcsevar` data type. In this PR, we plan to enhance the `cppcsevar` data type deduction: - We will deduce the `cppcsevar` data type in `update_on_args` by reusing the code in `data_type_propagation`. - To align the data type of scalar and vector variables, we previously always cast the scalar to the vector's data type. This caused a data type misalignment between `codegen` and `data_type_propagation`. We should use the same data type promotion logic to align the data types of scalar and vector variables. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130827 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-07-30 02:51:31 +00:00
eellison	5772c13f56	Dont wrap negative indexing in scatter reduce (#131503 ) Fix for https://github.com/pytorch/pytorch/issues/131321 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131503 Approved by: https://github.com/shunting314	2024-07-24 04:01:32 +00:00
eellison	16a2a1aad3	Annotate graph.py (#131400 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131400 Approved by: https://github.com/shunting314	2024-07-23 07:04:12 +00:00
Peter Bell	27c2a0d63b	[inductor] Separate Buffer and Operation into two concepts (#130831 ) Resubmit of #128893 Currently a buffer represents both a tensor with physical storage and a computation that produces the tensor as a result. This PR attempts to split these into two different concepts in the scheduler. This should allow us to have multiple outputs from a single operation. Differential Revision: [D59876059](https://our.internmc.facebook.com/intern/diff/D59876059) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130831 Approved by: https://github.com/lezcano	2024-07-20 02:05:07 +00:00
Xuehai Pan	f0075c179b	Pin `sympy >= 1.13.0` (#130895 ) ------ The opposite of #130836. Pin `sympy >= 1.13.0` for Python >= 3.9 and `sympy == 1.12.1` for Python 3.8. - #130836 See the PR description of #130836 for more details. `sympy` 1.13.0 introduces some breaking changes which break our tests. More specifically: - Ref [Backwards compatibility breaks and deprecations](https://github.com/sympy/sympy/wiki/release-notes-for-1.13.0#backwards-compatibility-breaks-and-deprecations) > BREAKING CHANGE: Float and Integer/Rational no longer compare equal with a == b. From now on Float(2.0) != Integer(2). Previously expressions involving Float would compare unequal e.g. x2.0 != x2 but an individual Float would compare equal to an Integer. In SymPy 1.7 a Float will always compare unequal to an Integer even if they have the same "value". Use sympy.numbers.int_valued(number) to test if a number is a concrete number with no decimal part. ([#25614](https://github.com/sympy/sympy/pull/25614) by [@smichr](https://github.com/smichr)) `sympy >= 1.13.0` is required to enable Python 3.13 support. This should be part of #130689. - #130689 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130895 Approved by: https://github.com/ezyang	2024-07-20 00:59:24 +00:00
Isuru Fernando	b7d2abd766	Fix vectorized ops.masked (#130130 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130130 Approved by: https://github.com/jgong5, https://github.com/lezcano	2024-07-17 14:55:11 +00:00
chilli	f9f85bfc0b	[Inductor] FlexAttention supports partial masking (#130415 ) (#130626 ) This is the new version of https://github.com/pytorch/pytorch/pull/130415 Updated test script: https://gist.github.com/yanboliang/7c34a82df611d4ea8869cb9e041bfbfc Updated perf numbers: ``` (pt) [ybliang@devgpu002.ash8 ~/local/debug]$ CUDA_VISIBLE_DEVICES=4 python debug7.py fwd speedup: 0.7166695598192317 bwd speedup: 0.7142133867805904 (pt) [ybliang@devgpu002.ash8 ~/local/debug]$ CUDA_VISIBLE_DEVICES=4 python debug7.py --partial-mask fwd speedup: 0.8428246087169973 bwd speedup: 0.8486261278030254 ``` Approved by: https://github.com/Chillee Pull Request resolved: https://github.com/pytorch/pytorch/pull/130626 Approved by: https://github.com/drisspg, https://github.com/yanboliang	2024-07-14 00:37:26 +00:00
Xuehai Pan	973037be6a	[BE][Easy] apply autofix for ruff rules unnecessary-collection-call (C408): `list()` / `tuple()` / `dict()` (#130199 ) This PR changes the empty collection factory call to Python literals: - `list()` -> `[]` - `tuple()` -> `()` - `dict()` -> `{}` The Python literals are more performant and safer. For example, the bytecode for building an empty dictionary: ```bash $ python3 -m dis - <<EOS import collections d1 = {} d2 = dict() dict = collections.OrderedDict d3 = dict() EOS ``` ```text 0 0 RESUME 0 1 2 LOAD_CONST 0 (0) 4 LOAD_CONST 1 (None) 6 IMPORT_NAME 0 (collections) 8 STORE_NAME 0 (collections) 3 10 BUILD_MAP 0 12 STORE_NAME 1 (d1) 4 14 PUSH_NULL 16 LOAD_NAME 2 (dict) 18 CALL 0 26 STORE_NAME 3 (d2) 6 28 LOAD_NAME 0 (collections) 30 LOAD_ATTR 8 (OrderedDict) 50 STORE_NAME 2 (dict) 7 52 PUSH_NULL 54 LOAD_NAME 2 (dict) 56 CALL 0 64 STORE_NAME 5 (d3) 66 RETURN_CONST 1 (None) ``` The dict literal `{}` only has one bytecode `BUILD_MAP`, while the factory call `dict()` has three `PUSH_NULL + LOAD_NAME + CALL`. Also, the factory call is not safe if users override the `dict` name in `locals` or `globals` (see the example of replacing with `OrderedDict` above). Pull Request resolved: https://github.com/pytorch/pytorch/pull/130199 Approved by: https://github.com/malfet	2024-07-11 17:30:28 +00:00
Richard Zou	edf273edf4	Revert some PRs (#130303 ) Summary: Revert https://github.com/pytorch/pytorch/pull/129346 thru https://github.com/pytorch/pytorch/pull/128893 For S430832 Test Plan: Tests Differential Revision: D59503843 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130303 Approved by: https://github.com/bdhirsh	2024-07-09 14:46:00 +00:00
chilli	cd683212a2	Fix indexing twice with score_mod (#130224 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130224 Approved by: https://github.com/yanboliang ghstack dependencies: #130160, #130106	2024-07-08 18:15:35 +00:00

1 2 3 4 5 ...

290 Commits