pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
PyTorch MergeBot	d0cebedb31	Revert "Add Triton CPU as an Inductor backend (#133408 )" This reverts commit `e498b02b47`. Reverted https://github.com/pytorch/pytorch/pull/133408 on behalf of https://github.com/jeanschmidt due to Broke internal signals, see D62737208 for more details ([comment](https://github.com/pytorch/pytorch/pull/133408#issuecomment-2353623816))	2024-09-16 18:33:33 +00:00
PyTorch MergeBot	0199fd4d7e	Revert "[inductor] More fixes on the keys of `constants` and `signature` dictionaries (#135406 )" This reverts commit `e54b559e88`. Reverted https://github.com/pytorch/pytorch/pull/135406 on behalf of https://github.com/jeanschmidt due to Reverting as it is breaking triton_mtia internal signals @jansel could you have a look and help get those changes merged? ([comment](https://github.com/pytorch/pytorch/pull/135406#issuecomment-2353557481))	2024-09-16 17:58:02 +00:00
Jez Ng	e498b02b47	Add Triton CPU as an Inductor backend (#133408 ) The goal is to use Inductor-generated kernels to stress test the new Triton CPU backend. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133408 Approved by: https://github.com/jansel	2024-09-14 21:45:19 +00:00
PyTorch MergeBot	18f9331e5d	Revert "[aoti] Fix workspace generation for triton (#135552 )" This reverts commit `d383325392`. Reverted https://github.com/pytorch/pytorch/pull/135552 on behalf of https://github.com/izaitsevfb due to blocks revert of #135313, internal failures, see D62511427 ([comment](https://github.com/pytorch/pytorch/pull/135552#issuecomment-2349641372))	2024-09-13 17:47:36 +00:00
Jez Ng	b346e99376	remove fast_flush arguments (#135387 ) I've removed them from upstream Triton in https://github.com/triton-lang/triton/pull/4485. It looks like most places in the code use the default value of `fast_flush=True` anyway, though there are two PRs from @pearu that use `False`. To my knowledge, there's no reason to use the `False` value. Differential Revision: [D62325778](https://our.internmc.facebook.com/intern/diff/D62325778) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135387 Approved by: https://github.com/nmacchioni, https://github.com/jansel	2024-09-13 08:13:46 +00:00
Jokeren	e54b559e88	[inductor] More fixes on the keys of `constants` and `signature` dictionaries (#135406 ) Previous PR forgets to change two other places that also create `constants` and `signature`. https://github.com/pytorch/pytorch/pull/135170 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135406 Approved by: https://github.com/jansel	2024-09-13 04:10:41 +00:00
Shangdi Yu	d383325392	[aoti] Fix workspace generation for triton (#135552 ) Fixes #131337 - add `arg_type` for workspace_arg, the type is consistent with the type in `generate_workspace_allocation()`. - do not generate example tensors for `workspace`, and use `generate_workspace_allocation()` instead. - add workspace allocation generation code to `kernel_autotune_calls`. e.g. ```python workspace = empty_strided_cuda((1280, ), (1, ), torch.uint8) workspace.zero_() ..... triton_spl_fused_add_cumprod_0.run(buf2, arg0_1, arg1_1, workspace, 1, 10000, grid=split_scan_grid(1, 10000), stream=stream0) del buf2, arg0_1, arg1_1, workspace ``` - add `empty_strided_cuda = torch._C._dynamo.guards._empty_strided_cuda` to the header of triton autotune code. The generated cpp has lines like below, so we also implement a `zero_()` for ` AtenTensorHandle `. ```cpp static constexpr int64_t int_array_0[] = {1280L, }; static constexpr int64_t int_array_1[] = {1L, }; AtenTensorHandle workspace_handle; AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_empty_strided(1, int_array_0, int_array_1, cached_torch_dtype_uint8, cached_torch_device_type_cuda, 0, &workspace_handle)); RAIIAtenTensorHandle workspace(workspace_handle); workspace.zero_(); ``` - Fix handle grid_fn for grid computation. Pass in "RBLOCK" to `split_scan_grid` - Fix dynamic shapes: Without the fix we generate code that looks like this `workspace = empty_strided_cuda((32((255 + s0) // 256), ), (1, ), torch.uint8)` when doing triton autotune and `s0` is not defined. The solution approach is to use `V.graph.sizevars.size_hint(nbytes)` to realize the workspace size for triton autotune. Note that we only realize it for triton autotune code, but not for the cpp cuda code. - We also generate slightly different cpp code depending on if `abi_compatible` is turned on. 
```cpp RAIIAtenTensorHandle workspace(workspace_handle); AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_zero_(workspace.get())); ``` vs ```cpp at::Tensor workspace = at::detail::empty_strided_cuda({8L(c10::div_floor_integer(static_cast<int64_t>((255L + s0)), static_cast<int64_t>(256L))), }, {1L, }, at::kByte, c10::DeviceType::CUDA); workspace.zero_(); ``` Test Plan: ``` TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCHINDUCTOR_CPP_WRAPPER=1 python test/inductor/test_torchinductor.py -k GPUTests.test_consecutive_split_cumprod_cuda python test/inductor/test_cuda_cpp_wrapper.py TestCudaWrapper.test_consecutive_split_cumprod_cuda_cuda_wrapper python test/inductor/test_cuda_cpp_wrapper.py DynamicShapesCudaWrapperCudaTests.test_consecutive_split_cumprod_cuda_dynamic_shapes_cuda_wrapper TORCHINDUCTOR_ABI_COMPATIBLE=1 python test/inductor/test_cuda_cpp_wrapper.py TestCudaWrapper.test_consecutive_split_cumprod_cuda_cuda_wrapper TORCHINDUCTOR_CPP_WRAPPER=1 python test/inductor/test_torchinductor.py -k GPUTests.test_consecutive_split_cumprod_cuda ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135552 Approved by: https://github.com/desertfire	2024-09-12 23:53:09 +00:00
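The workspace size in the entry above can be checked with a little arithmetic. This sketch assumes the dynamic-shape expression reads `32 * ((255 + s0) // 256)`; the multiplication sign is not visible in the log text, but that reading reproduces the `(1280, )` workspace shown for 10000 elements. The function name and its parameters are hypothetical and are not part of Inductor.

```python
def workspace_nbytes(numel_hint: int, block: int = 256, bytes_per_block: int = 32) -> int:
    """Return a concrete workspace size for a given size hint.

    Assumption: the workspace needs `bytes_per_block` bytes for every
    `block` elements, rounded up (ceil division written with floor division).
    """
    num_blocks = (block - 1 + numel_hint) // block
    return bytes_per_block * num_blocks


if __name__ == "__main__":
    # A concrete hint stands in for the undefined symbol s0 during autotuning.
    print(workspace_nbytes(10000))
```

Using a concrete hint only in the autotune code keeps the size computable when `s0` is not defined, which is the situation the commit message describes.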
xinan.lin	13ee85ca5e	[Inductor] Generalize cuda cpp wrapper as common triton based GPU cpp wrapper, will be reused by xpu in next PR. (#135312 ) [Inductor] Generalize cuda cpp wrapper as common triton based GPU cpp wrapper, will be reused by xpu in next PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135312 Approved by: https://github.com/jansel, https://github.com/desertfire, https://github.com/eellison	2024-09-11 23:59:54 +00:00
Shunting Zhang	8057b72763	[ez][inductor] don't benchmark cloning if there are no mutated args (#135533 ) When a kernel does not have mutated args (this is quite common?), benchmarking the cost of cloning actually benchmarks a no-op. This still takes >100 ms because `triton.testing.do_bench` allocates a 100 ms budget to run the kernel. Skipping this benchmark can save significant compilation time if the code path is hit multiple times: if it is hit 100 times on a large graph, we save >10 s. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135533 Approved by: https://github.com/jansel ghstack dependencies: #135531	2024-09-10 20:54:31 +00:00
Shunting Zhang	7b17918dc9	[inductor] fix a device sync issue for benchmarking fusion (#135531 ) Fix https://github.com/pytorch/pytorch/issues/134768 . When we benchmark the latency for a fused node set, we benchmark twice: 1. benchmark the latency of the kernel including cloning mutated args 2. benchmark the latency of cloning mutated args without running the kernel We subtract result 2 from result 1 to get the latency of the kernel itself. But when the tensors are not on cuda device 0, we get equal numbers for result 1 and result 2 no matter how much work the kernel does. The root cause is that in `triton.testing.do_bench` the `torch.cuda.synchronize` call syncs the current cuda device (which is device 0 if it's not overridden). But since the tensors and kernels are located on another device, the sync actually does nothing (unless there happen to be other kernels on device 0). The fix is to set the correct current device in our benchmarking code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135531 Approved by: https://github.com/jansel	2024-09-10 20:54:31 +00:00
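The two benchmarking entries above (#135533 and #135531) describe a subtraction: kernel latency equals the latency of "kernel plus cloning" minus the latency of "cloning only". A minimal sketch of that logic follows. All names are hypothetical, the timers are stand-ins rather than `triton.testing.do_bench`, and the device-selection fix is not modelled.

```python
from typing import Callable, Sequence


def kernel_only_latency_ms(
    bench_kernel_with_clone: Callable[[], float],
    bench_clone_only: Callable[[], float],
    mutated_arg_names: Sequence[str],
) -> float:
    """Estimate kernel latency by subtracting the cost of cloning mutated args."""
    total_ms = bench_kernel_with_clone()
    if not mutated_arg_names:
        # Nothing is cloned, so the second benchmark would time a no-op.
        return total_ms
    return total_ms - bench_clone_only()


# Stand-in timers used only for illustration.
calls = {"clone": 0}


def fake_kernel_with_clone() -> float:
    return 1.5


def fake_clone_only() -> float:
    calls["clone"] += 1
    return 0.5
```

With no mutated args the clone benchmark is never invoked, which is where the time saving in #135533 comes from.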
Sam Larsen	1adf28a5c0	[inductor] print triton float64 constants correctly (#135260 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135260 Approved by: https://github.com/jansel	2024-09-10 07:05:02 +00:00
David Berard	a086882d72	[inductor][triton] mark workspace args as mutated (#134648 ) SplitScan makes use of a workspace arg that needs to be zeroed before it is used - then, it is used to communicate between thread blocks during the triton kernel implementation. It is mutated during the execution of the kernel, so it should be marked as such. Before this PR, it is not marked as mutated; AFAIK this is fine during normal execution, but during autotuning it causes problems. The workspace starts off zeroed (as expected), but during autotuning the kernel will be executed multiple times and the workspace does not get reset between executions, resulting in incorrect data. If the data is used for indexing, then you can fail device-side asserts (and the results after the initial run (with autotuning) could be wrong). The test added in this PR repros the issue when the fix is removed. When we mark the arg as mutated, then the arg gets cloned before autotuning, so that the arg passed to the kernel during autotuning will always be zeroed as expected. `804852c1f9/torch/_inductor/runtime/triton_heuristics.py (L685-L689)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/134648 Approved by: https://github.com/peterbell10, https://github.com/jansel	2024-09-06 14:23:37 +00:00
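The failure mode described above can be reproduced with a toy model in plain Python. Everything here is hypothetical: `toy_kernel` stands in for a kernel that expects a zeroed workspace, and `autotune_runs` stands in for an autotuner that launches the kernel several times. It is a sketch of the idea, not Inductor's autotuning code.

```python
from typing import Callable, Dict, List, Sequence


def toy_kernel(workspace: List[int], data: List[int]) -> int:
    """Accumulate into the workspace in place; correct only if it starts zeroed."""
    for i, value in enumerate(data):
        workspace[i % len(workspace)] += value
    return sum(workspace)


def autotune_runs(
    kernel: Callable[..., int],
    args: Dict[str, List[int]],
    mutated_arg_names: Sequence[str],
    n_runs: int = 3,
) -> List[int]:
    """Run the kernel repeatedly, cloning only the args marked as mutated."""
    results = []
    for _ in range(n_runs):
        call_args = {
            name: list(value) if name in mutated_arg_names else value
            for name, value in args.items()
        }
        results.append(kernel(**call_args))
    return results
```

When `workspace` is not listed as mutated, each run sees the leftovers of the previous one and the results drift; once it is listed, every run starts from a fresh zeroed copy.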
Xinran / Allan Rui	1f19ccb5b3	[Inductor/Triton] Customize triton codegen to optionally preserve input dtype on tl.load (#132406 ) Differential Revision: D60536337 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132406 Approved by: https://github.com/jfix71, https://github.com/blaine-rister	2024-08-23 22:58:43 +00:00
Mwiza Kunda	be207af6e1	Disable unwrapping scalar tensors when used as outputs (#132859 ) If the scalar tensor is an output tensor, it shouldn't be unwrapped (i.e. `.item()` called) since `tl.store` requires a pointer type for outputs. This issue only occurs for mutated buffers: the input tensor is also used as an output tensor. @yanboliang @jansel @ngimel Pull Request resolved: https://github.com/pytorch/pytorch/pull/132859 Approved by: https://github.com/jansel	2024-08-16 21:40:45 +00:00
Isuru Fernando	b444343087	Fix printing symfloat pow in triton (#133614 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133614 Approved by: https://github.com/Skylion007	2024-08-16 13:08:29 +00:00
Isuru Fernando	7470ae85e4	Fix triton codegen with math.trunc (#133354 ) Fixes https://github.com/pytorch/pytorch/issues/133172 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133354 Approved by: https://github.com/ezyang, https://github.com/jansel	2024-08-15 16:38:26 +00:00
y-sq	b6335cfeab	Add an option to use do_bench_using_profiling in TORCHINDUCTOR_PROFILE (#133523 ) When I did profiling using the "TORCHINDUCTOR_PROFILE" option, some kernel shows less bandwidth than expected. So, added the option to exclude the CPU overheads from the profiling time: ``` # With the option: (pytorch-3.10) [shuqiyangdevgpu001.lla3 ~/local/pytorch (gh/shunting314/144/head)]$ TORCHINDUCTOR_PROFILE=1 TORCHINDUCTOR_PROFILE_WITH_DO_BENCH_USING_PROFILING=1 TORCHINDUCTOR_PROFILE_OUTPUT=/tmp/profile.txt python ../test_pt/a.py 0.038ms 0.067 GB 1777.11GB/s triton_poi_fused__to_copy_clamp_clone_mul_0 SUMMARY (/tmp/torchinductor_shuqiyang/tmp03wdg8e4/m6/cm6vdqp62ofwsone3u3fmb42vs3fti5omseo3qn4ddh2bhalsvbn.py) 0.04ms 0.07 GB 1777.11GB/s # Without the option: (pytorch-3.10) [shuqiyangdevgpu001.lla3 ~/local/pytorch (gh/shunting314/144/head)]$ TORCHINDUCTOR_PROFILE=1 TORCHINDUCTOR_PROFILE_OUTPUT=/tmp/profile.txt python ../test_pt/a.py 0.040ms 0.067 GB 1663.09GB/s triton_poi_fused__to_copy_clamp_clone_mul_0 SUMMARY (/tmp/torchinductor_shuqiyang/tmpwr6rraao/s4/cs4npkh77myatwpcmsizyduyfm6ne6o4pg4n3eodejdvvg2j3xzd.py) 0.04ms 0.07 GB 1663.09GB/s ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/133523 Approved by: https://github.com/nmacchioni	2024-08-15 09:27:11 +00:00
Rachel Guo	c17d26c3c1	[AOTI][Tooling] A couple fixes / minor updates for initial debug printer (#133016 ) Summary: Follow up small diff to fix a couple issues: - add condition for cuda/gpu case to only print kernel name list in the second pass i.e. when we do the cpp wrapper codegen - other minor fixes around `AOT_INDUCTOR_FILTERED_KERNELS_TO_PRINT` option Test Plan: ``` AOT_INDUCTOR_FILTERED_KERNELS_TO_PRINT="triton_poi_fused_0" AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=1 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+graph, inductor, +schedule, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_addmm_abi_compatible_cuda ``` Differential Revision: D60954888 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133016 Approved by: https://github.com/ColinPeppler	2024-08-13 23:00:29 +00:00
Feng Shi	19416bf38b	Reland "[2/2] PT2 Inductor ComboKernels - automatic horizontal fusing (#131675 )" (#133291 ) Reland by reverting commit `844103197d`. #131675 failed a few internal tests because it imported a diff version which wasn't rebased on the proper dependent diffs. Reland from OSS only to avoid the out-of-sync issue. Original description from #131675 Summary: A ComboKernel combines independent Inductor Triton kernels into a single one. This is the part 2 pull request, which 1) adds automatic horizontal fusion at the end of the inductor operator fusion process, 2) adds type annotations for triton_combo_kernel.py. ComboKernel is used in two cases: 1) for existing foreach kernels, combo kernels are used as the backend kernel; the front-end kernel generation logic remains the same. 2) an extra optimization phase is added to the end of the scheduler to generate extra combo kernels if combo_kernels is True in config.py. This part 2 pull request deals with the 2nd case above: The combo kernel generation in the added optimization phase is done in two steps: 1) in the front end inside the scheduler, it topologically sorts the schedule nodes to find all the nodes with no data dependency and creates a front-end schedule node for them. We currently limit the maximum number of sub-nodes for each combo kernel to 8 (but we still need to find the optimal number). 2) then, these sub-nodes are combined in the codegen phase to generate the combo kernel code for them based on a few rules. For example, 1d and 2d kernels are separated into different combo kernels, as mixing them is not supported yet. Note that the algorithms we provide are very basic, and users can register their own combo kernel generation algorithms for both steps. Performance-wise, combining small kernels almost always yields a performance gain. However, combining very large kernels may not yield any perf gain and sometimes even causes a regression, possibly due to improper block sizes.
Thus, a benchmark function is implemented to avoid such perf regression, and it is recommended to turn it on by setting benchmark_combo_kernels to True whenever combo_kernels is True. Please refer to part 1 pull request https://github.com/pytorch/pytorch/pull/124969 for more details. Test Plan: buck2 test mode/dev-nosan caffe2/test/inductor:combo_kernels Pull Request resolved: https://github.com/pytorch/pytorch/pull/133291 Approved by: https://github.com/wdvr	2024-08-13 18:18:12 +00:00
PyTorch MergeBot	844103197d	Revert "[2/2] PT2 Inductor ComboKernels - automatic horizontal fusing (#131675 )" This reverts commit `bb6eef8ed1`. Reverted https://github.com/pytorch/pytorch/pull/131675 on behalf of https://github.com/fbgheith due to breaking internal tests ([comment](https://github.com/pytorch/pytorch/pull/131675#issuecomment-2285069508))	2024-08-12 23:31:16 +00:00
Feng Shi	bb6eef8ed1	[2/2] PT2 Inductor ComboKernels - automatic horizontal fusing (#131675 ) Summary: A ComboKernel combines independent Inductor Triton kernels into a single one. This is the part 2 pull request, which 1) adds automatic horizontal fusion at the end of the inductor operator fusion process, 2) adds type annotations for triton_combo_kernel.py. ComboKernel is used in two cases: 1) for existing foreach kernels, combo kernels are used as the backend kernel; the front-end kernel generation logic remains the same. 2) an extra optimization phase is added to the end of the scheduler to generate extra combo kernels if combo_kernels is True in config.py. This part 2 pull request deals with the 2nd case above: - The combo kernel generation in the added optimization phase is done in two steps: 1) in the front end inside the scheduler, it topologically sorts the schedule nodes to find all the nodes with no data dependency and creates a front-end schedule node for them. We currently limit the maximum number of sub-nodes for each combo kernel to 8 (but we still need to find the optimal number). 2) then, these sub-nodes are combined in the codegen phase to generate the combo kernel code for them based on a few rules. For example, 1d and 2d kernels are separated into different combo kernels, as mixing them is not supported yet. Note that the algorithms we provide are very basic, and users can register their own combo kernel generation algorithms for both steps. - Performance-wise, combining small kernels almost always yields a performance gain. However, combining very large kernels may not yield any perf gain and sometimes even causes a regression, possibly due to improper block sizes. Thus, a benchmark function is implemented to avoid such perf regressions, and it is recommended to turn it on by setting benchmark_combo_kernels to True whenever combo_kernels is True. Please refer to the part 1 pull request https://github.com/pytorch/pytorch/pull/124969 for more details.
Test Plan: buck2 test mode/dev-nosan caffe2/test/inductor:combo_kernels Differential Revision: D60067757 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131675 Approved by: https://github.com/mlazos	2024-08-09 03:14:16 +00:00
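The two grouping rules mentioned in the entry above (at most 8 sub-nodes per combo kernel, and no mixing of 1d and 2d kernels) can be sketched as a simple partitioning step. This is a minimal illustration under stated assumptions: the nodes are taken to be already known as mutually independent, each node is reduced to a hypothetical `(name, ndim)` pair, and the function name is invented. It is not the scheduler's actual algorithm.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_into_combo_kernels(
    independent_nodes: List[Tuple[str, int]],
    max_per_kernel: int = 8,
) -> List[List[str]]:
    """Partition independent nodes into groups, one group per combo kernel."""
    # Rule 1: kernels of different dimensionality never share a combo kernel.
    by_ndim: Dict[int, List[str]] = defaultdict(list)
    for name, ndim in independent_nodes:
        by_ndim[ndim].append(name)

    # Rule 2: cap the number of sub-nodes per combo kernel.
    groups: List[List[str]] = []
    for ndim in sorted(by_ndim):
        names = by_ndim[ndim]
        for start in range(0, len(names), max_per_kernel):
            groups.append(names[start:start + max_per_kernel])
    return groups
```

For ten 1d nodes and three 2d nodes this yields three groups of sizes 8, 2, and 3. Whether a given group is actually worth fusing is a separate question, which is what the recommended `benchmark_combo_kernels` setting addresses.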
Isuru Fernando	de288e2203	Fix inf value reduction in non persistent reduction for scans (#132293 ) Fixes https://github.com/pytorch/pytorch/issues/132107 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132293 Approved by: https://github.com/peterbell10	2024-08-08 19:02:32 +00:00
Rachel Guo	5709375d56	[AOTI][tooling][1/n] Add intermediate value debug printer (#132323 ) Summary: Context: Currently we have a helper to print out AtenTensor in [shim_common.cpp](https://github.com/pytorch/pytorch/blob/v2.4.0-rc4/torch/csrc/inductor/aoti_torch/shim_common.cpp#L866) The way we were using this function was a “manual” process. We inject this function into the generated output.cpp file, and recompile and reload the file. This diff automates the printing value process. Changes: 1. Added a simple initial debug printer helper to print out tensor values 2. Added a filter option to selectively dump tensor values. Usage: Sample cmd : ``` AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=1 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+inductor, +schedule, output_code" python test/inductor/test_aot_inductor.py -k test_addmm_abi_compatible_cuda ``` Sample outputs : ``` [ before_launch - triton_poi_fused_0 - buf0 ]: 0.6331 1.6358 -0.3459 1.0196 -0.4122 1.4279 [ CUDAFloatType{6} ] Min value: -0.412198 Max value: 1.63582 Device: cuda:0 Size: [6] Stride: [1] Dtype: float Layout: Strided Number of elements: 6 Is contiguous: 1 Requires grad: 0 [ after_launch - triton_poi_fused_0 - buf0 ]: 0.6331 1.6358 -0.3459 1.0196 -0.4122 1.4279 [ CUDAFloatType{6} ] Min value: -0.412198 Max value: 1.63582 Device: cuda:0 Size: [6] Stride: [1] Dtype: float Layout: Strided Number of elements: 6 Is contiguous: 1 Requires grad: 0 [ before_launch - aoti_torch_cuda_addmm_out - buf1 ]: Min value: -2.25655 Max value: 2.32996 Device: cuda:0 Size: [16, 6] Stride: [6, 1] Dtype: float Layout: Strided Number of elements: 96 Is contiguous: 1 Requires grad: 0 [ before_launch - aoti_torch_cuda_addmm_out - buf0 ]: 0.6331 1.6358 -0.3459 1.0196 -0.4122 1.4279 [ CUDAFloatType{6} ] Min value: -0.412198 Max value: 1.63582 Device: cuda:0 Size: [6] Stride: [1] Dtype: float Layout: Strided Number of elements: 6 Is contiguous: 1 Requires grad: 0 [ 
after_launch - aoti_torch_cuda_addmm_out - buf1 ]: Min value: -12.0839 Max value: 11.6878 Device: cuda:0 Size: [16, 6] Stride: [6, 1] Dtype: float Layout: Strided Number of elements: 96 Is contiguous: 1 Requires grad: 0 [ after_launch - aoti_torch_cuda_addmm_out - buf0 ]: 0.6331 1.6358 -0.3459 1.0196 -0.4122 1.4279 [ CUDAFloatType{6} ] Min value: -0.412198 Max value: 1.63582 Device: cuda:0 Size: [6] Stride: [1] Dtype: float Layout: Strided Number of elements: 6 Is contiguous: 1 Requires grad: 0 stats [('calls_captured', 1), ('unique_graphs', 1)] inductor [('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('extern_calls', 2)] . ---------------------------------------------------------------------- Ran 1 test in 10.867s OK ``` The user is able to filter kernel names to print out values by specifying env var `AOT_INDUCTOR_FILTERED_KERNELS_TO_PRINT` and see choices of kernel names in a log message like below: ``` torch/_inductor/graph.py:1642] Finished codegen for all nodes. The list of kernel names available: ['triton_poi_fused_0', 'aoti_torch_cuda_addmm_out'] ``` In the follow-up diff, will add `torch.save()` to dump/save the intermediate tensors into individual `.pt` files that can be further `torch.load()`. Test Plan: Run Unit Tests in OSS: (similar cmd as mentioned above in the usage part) `AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=1 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+inductor, output_code" python test/inductor/test_aot_inductor.py -k test_addmm_abi_compatible_cuda` Differential Revision: D60538496 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132323 Approved by: https://github.com/ColinPeppler	2024-08-08 01:39:59 +00:00
Nicolas Macchioni	5cb05a82b4	[BC breaking] move benchmarking + prefer inductor path (#132827 ) Move benchmarking out of `torch._inductor.runtime.runtime_utils` and into `torch._inductor.runtime.benchmarking`, and prefer this path over directly accessing Triton's benchmarking. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132827 Approved by: https://github.com/eellison	2024-08-08 00:47:45 +00:00
Edward Z. Yang	837898d9c8	Stop using preserve_rng_state as decorator (#132774 ) See https://github.com/pytorch/pytorch/pull/132073 for motivation Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/132774 Approved by: https://github.com/albanD	2024-08-07 01:07:12 +00:00
Feng Shi	55b0c39d82	Reland "[1/2] PT2 Inductor ComboKernels - Foreach cases (#124969 )" (#132182 ) Summary: Reland #124969 by backing out D60397377 "Back out "[1/2] PT2 Inductor ComboKernels - Foreach cases (#124969)"" The original diff D54134695 was reverted because of failure of ads nightly cogwheel tests. The root cause: the logic for generating mask in Triton kernel needed update after a recent refactoring on triton.py. This diff includes the fix of the root cause. See D54134695 or #124969 for more details. Test Plan: Originally failed tests f585704630 f585733786 Diff patched: f586664028 f586663820 Differential Revision: D60458597 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132182 Approved by: https://github.com/Yuzhen11	2024-08-05 06:57:30 +00:00
Oguz Ulgen	09f9c256ad	Add basic mypy annotations to inductor (#132416 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132416 Approved by: https://github.com/XuehaiPan, https://github.com/jamesjwu ghstack dependencies: #132415	2024-08-04 18:43:37 +00:00
PyTorch MergeBot	f2ddd5e9e0	Revert "Add basic mypy annotations to inductor (#132416 )" This reverts commit `78927d37f6`. Reverted https://github.com/pytorch/pytorch/pull/132416 on behalf of https://github.com/ZainRizvi due to Sorry, this PR has entered a weird state in the diff train. Trying to revert it to skip it, and then we can try relanding it ([comment](https://github.com/pytorch/pytorch/pull/132415#issuecomment-2267631785))	2024-08-04 18:39:29 +00:00
Oguz Ulgen	78927d37f6	Add basic mypy annotations to inductor (#132416 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132416 Approved by: https://github.com/XuehaiPan, https://github.com/jamesjwu ghstack dependencies: #132415	2024-08-01 20:14:25 +00:00
Oguz Ulgen	72d2dba992	Add None return type to init (#132335 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132335 Approved by: https://github.com/albanD	2024-08-01 15:26:45 +00:00
eellison	f32ab3b9e3	Migrate Inductor scheduler, dependencies, ir, and codegen/common to use OrderedSet (#130004 ) Python's set has non-deterministic iteration order. We recently ran into an internal failure that did not fail consistently (see the repro in P1453035092). With these changes, it fails consistently. In follow-ups we could also consider adding a lint rule for uses of either set() or set literals. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130004 Approved by: https://github.com/oulgen	2024-08-01 04:37:15 +00:00
Peter Bell	260c991e20	[inductor] Fix unsoundness with negative-valued indexing expressions (#131761 ) This fixes a few instances where we assumed indexing expressions were non-negative. This is not valid when we have more complicated expressions involving masking e.g. pointwise cat. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131761 Approved by: https://github.com/ezyang	2024-07-31 21:32:20 +00:00
PyTorch MergeBot	784a6ec5a3	Revert "Migrate Inductor scheduler, dependencies, ir, and codegen/common to use OrderedSet (#130004 )" This reverts commit `13d744464f`. Reverted https://github.com/pytorch/pytorch/pull/130004 on behalf of https://github.com/clee2000 due to broke lint [GH job link](https://github.com/pytorch/pytorch/actions/runs/10183945999/job/28170099930) [HUD commit link](`13d744464f`) probably a landrace, the base is 21 hours old ([comment](https://github.com/pytorch/pytorch/pull/130004#issuecomment-2260946562))	2024-07-31 16:49:21 +00:00
eellison	13d744464f	Migrate Inductor scheduler, dependencies, ir, and codegen/common to use OrderedSet (#130004 ) Python's set has non-deterministic iteration order. We recently ran into an internal failure that did not fail consistently (see the repro in P1453035092). With these changes, it fails consistently. In follow-ups we could also consider adding a lint rule for uses of either set() or set literals. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130004 Approved by: https://github.com/oulgen	2024-07-31 16:22:11 +00:00
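To illustrate why an ordered set makes failures reproducible, here is a minimal insertion-ordered set built on `dict`, which preserves insertion order in Python 3.7+. The class name is hypothetical and this is not the `OrderedSet` implementation used by Inductor; it only shows the property the commit relies on.

```python
from typing import Dict, Generic, Hashable, Iterable, Iterator, TypeVar

T = TypeVar("T", bound=Hashable)


class InsertionOrderedSet(Generic[T]):
    """Set-like container whose iteration order is the insertion order."""

    def __init__(self, items: Iterable[T] = ()) -> None:
        # dict.fromkeys drops duplicates and keeps the first occurrence's position.
        self._items: Dict[T, None] = dict.fromkeys(items)

    def add(self, item: T) -> None:
        self._items[item] = None

    def __contains__(self, item: object) -> bool:
        return item in self._items

    def __iter__(self) -> Iterator[T]:
        return iter(self._items)

    def __len__(self) -> int:
        return len(self._items)
```

A built-in `set` of strings may iterate in a different order from one process to the next because of hash randomization, so code that depends on that order can pass in one run and fail in another. With an insertion-ordered container the order is fixed by the program itself.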
Yuzhen Huang	5298acb5c7	Back out "[1/2] PT2 Inductor ComboKernels - Foreach cases (#124969 )" (#132065 ) Summary: Original commit changeset: 1d8cfdcef69d Original Phabricator Diff: D54134695 back out: D54134695 Test Plan: more details see: https://docs.google.com/document/d/1noPTmTdNYHVDFyk7AJSSO7jQoNw6fTo4o6k9eTNeZh8/edit#heading=h.xeo30usu77nc Reviewed By: zw2326 Differential Revision: D60397377 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132065 Approved by: https://github.com/zw2326, https://github.com/qchip	2024-07-29 22:48:29 +00:00
eellison	8b507a922a	Mode to emulate amp numerics (#131595 ) ``` # Mode to emulate pytorch eager numerics for lower precision (fp16, bf16) # Pytorch eager computes bf16/fp16 by upcasting inputs to fp32 and downcasting after # For multiple, fused pointwise nodes, inductor will elide the intermediary upcasts and downcasts # Typically this should be closer to fp64 ref numerics. However, it can be useful for debugging # to emulate the eager numerics. ``` We add extra upcasts and downcasts for pointwise nodes that correspond to casts that existed in the original user program (excluding pointwise nodes that are emitted during decomposition). Since this is mostly for debugging, I added this information in the `meta` so that this mode does not have unintended side effects like changing pattern matching. In theory there could also be some other casts with fused reduction -> reduction, although I haven't seen this in practice as much; this could be done as a follow-up. Note: this only works with the cuda backend right now. This mode was sufficient to eliminate compile differences from https://fb.workplace.com/groups/385893200869952/posts/464263173032954/?comment_id=465199259606012&reply_comment_id=465676792891592. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131595 Approved by: https://github.com/shunting314, https://github.com/bdhirsh, https://github.com/jansel	2024-07-29 22:42:23 +00:00
PyTorch MergeBot	957a89f56c	Revert "[inductor] Fix unsoundness with negative-valued indexing expressions (#131761 )" This reverts commit `03760be271`. Reverted https://github.com/pytorch/pytorch/pull/131761 on behalf of https://github.com/atalman due to Broke CI: inductor/test_cpu_cpp_wrapper.py::DynamicShapesCppWrapperCpuTests::test_linear_binary_dynamic_shapes_cpp_wrapper [GH job link](https://github.com/pytorch/pytorch/actions/runs/10145214748/job/28051168920) [HUD commit link](`03760be271`) ([comment](https://github.com/pytorch/pytorch/pull/131761#issuecomment-2256287736))	2024-07-29 15:52:08 +00:00
Peter Bell	03760be271	[inductor] Fix unsoundness with negative-valued indexing expressions (#131761 ) This fixes a few instances where we assumed indexing expressions were non-negative. This is not valid when we have more complicated expressions involving masking e.g. pointwise cat. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131761 Approved by: https://github.com/ezyang	2024-07-29 03:14:13 +00:00
Peter Bell	16cd1aaa1d	[inductor] Improve sort kernel perf (#131719 ) Closes #129507 This makes two changes to the sort kernel: 1. Use int16 for the indices since we only operate on small dims anyway 2. Instead of passing an explicit mask, we pass the rnumel and imply the mask from that which saves an additional reduction in the sort kernel's inner loop. In my benchmarks, this gives enough of a perf improvement to bump up the max rblock to 512. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131719 Approved by: https://github.com/eellison	2024-07-26 21:56:47 +00:00
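The two changes in the sort-kernel entry above can be pictured in plain Python. This is only a sketch of the idea: the function names are hypothetical, and the actual Triton kernel is not shown in this log.

```python
from array import array
from typing import List


def implied_mask(rnumel: int, rblock: int) -> List[bool]:
    """Derive the validity mask from rnumel instead of passing a mask in."""
    return [r < rnumel for r in range(rblock)]


def block_indices_int16(rblock: int) -> array:
    """Store per-block indices as signed shorts (typecode 'h').

    Raises OverflowError if an index does not fit, which cannot happen
    for small blocks such as the 512 mentioned in the commit message.
    """
    return array("h", range(rblock))
```

Because the mask is a pure function of `rnumel` and the position within the block, it does not need to be carried through the sort as a separate operand.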
Peter Bell	2784b3f1b7	[inductor] Fix split-scan interaction with multi-kernel (#131044 ) This fixes a couple errors that come up when multi-kernel is used with split-scan. 1. The split-scan was being marked as a persistent kernel, which allowed a multi-kernel to be created but this isn't supported. Fix is to never mark split-scan as persistent. 2. Benchmark codegen was not handling WorkspaceArg, and would raise a KeyError during codegen. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131044 Approved by: https://github.com/shunting314	2024-07-25 11:36:36 +00:00
Feng Shi	404d640c39	[1/2] PT2 Inductor ComboKernels - Foreach cases (#124969 ) Summary: A ComboKernel combines independent Inductor Triton kernels into a single one. Consolidation with Foreach kernel: 1) For the scheduler node, the logic is consolidated into ForeachKernelSchedulerNode 2) The backend kernel is consolidated into ComboKernel. (Note: this is part 1 which only deals with the 1st case above.) Details: 1. ComboKernel can be viewed as an extension of the Foreach kernel (see the examples below). The main differences are: 1) the block size is tunable (but currently shared by the sub-kernels). 2) it supports multiple kernel types, like pointwise, reduce, and may extend to matmm as well (it doesn't support mixed 1d and 2d kernels yet, but it can be extended for such cases) 3) the blocks are interleaved among the sub-kernels (can be extended to other arrangements), 4) it is designed to be general enough to combine kernels without dependency and doesn't rely on certain patterns. 5) it doesn't support dynamic sizes yet but can be easily extended for them. 2. ComboKernel is used in two cases: 1) for existing foreach kernels, combo kernels are used as the backend kernel; the front-end kernel generation logic remains the same. 2) an extra optimization phase is added to the end of the scheduler to generate extra combo kernels if combo_kernels is True in config.py. 3. The combo kernel generation in the added optimization phase is done in two steps: 1) in the front end inside the scheduler, it topologically sorts the schedule nodes to find all the nodes with no data dependency and creates a front-end schedule node for them. We currently limit the maximum number of sub-nodes for each combo kernel to 8 (but we still need to find the optimal number). 2) then, these sub-nodes are combined in the codegen phase to generate the combo kernel code for them based on a few rules.
For example, 1d and 2d kernels are separated into different combo kernels, as mixing them is not supported yet. Note that the algorithms we provide are very basic, and users can register their own customized combo kernel generation algorithms for both steps. 4. Performance-wise, combining small kernels almost always yields a performance gain. However, combining very large kernels may yield no gain and sometimes even a regression, possibly due to improper block sizes. Thus, a benchmark function is implemented to avoid such perf regressions, and it is recommended to turn it on by setting benchmark_combo_kernels to True whenever combo_kernels is True. Example: - element-wise kernels Original PyTorch function: ``` def test_activations(a, b, c): a1 = torch.nn.functional.relu(a) b1 = torch.nn.functional.sigmoid(b) c1 = torch.nn.functional.tanh(c) return a1, b1, c1 ``` Generated combo kernel: ``` @triton_heuristics.pointwise( size_hints=[512], tile_hint=TileHint.DEFAULT, filename=__file__, triton_meta={'signature': {0: 'fp32', 1: 'fp32', 2: 'fp32', 3: 'fp32', 4: 'fp32', 5: 'fp32'}, 'device': 0, 'device_type': 'cuda', 'constants': {}, 'configs': [AttrsDescriptor(divisible_by_16=(0, 1, 2, 3, 4, 5), equal_to_1=())]}, inductor_meta={'kernel_name': 'triton_poi_fused_0', 'mutated_arg_names': []} ) @triton.jit def triton_(in_ptr0, in_ptr1, in_ptr2, out_ptr0, out_ptr1, out_ptr2, XBLOCK : tl.constexpr): pid = tl.program_id(0) if pid % 3 == 0: pid_offset = pid // 3 xnumel = 100 rnumel = 1 xoffset = pid_offset * XBLOCK xindex = xoffset + tl.arange(0, XBLOCK)[:] xmask = xindex < xnumel x0 = xindex tmp0 = tl.load(in_ptr0 + (x0), xmask) tmp1 = triton_helpers.maximum(0, tmp0) tl.store(out_ptr0 + (x0), tmp1, xmask) elif pid % 3 == 1: pid_offset = pid // 3 xnumel = 400 rnumel = 1 xoffset = pid_offset * XBLOCK xindex = xoffset + tl.arange(0, XBLOCK)[:] xmask = xindex < xnumel x1 = xindex tmp2 = tl.load(in_ptr1 + (x1), xmask) tmp3 = tl.sigmoid(tmp2) tl.store(out_ptr1 + (x1), tmp3, xmask) elif pid % 3 == 2: 
pid_offset = pid // 3 xnumel = 100 rnumel = 1 xoffset = pid_offset * XBLOCK xindex = xoffset + tl.arange(0, XBLOCK)[:] xmask = xindex < xnumel x2 = xindex tmp4 = tl.load(in_ptr2 + (x2), xmask) tmp5 = libdevice.tanh(tmp4) tl.store(out_ptr2 + (x2), tmp5, xmask) else: pass ``` - reduction kernels Original PyTorch function: ``` def test_reduce(a, b, c): a1 = torch.sum(a, dim=0) b1 = torch.max(b, dim=0) c1 = torch.min(c, dim=0) return a1, b1, c1 ``` Generated combo kernel: ``` @triton_heuristics.persistent_reduction( size_hints=[32, 32], reduction_hint=ReductionHint.DEFAULT, filename=__file__, triton_meta={'signature': {0: 'fp32', 1: 'fp32', 2: 'fp32', 3: 'fp32', 4: 'i64', 5: 'fp32', 6: 'i64', 7: 'fp32'}, 'device': 0, 'device_type': 'cuda', 'constants': {}, 'configs': [AttrsDescriptor(divisible_by_16=(0, 1, 2, 3, 4, 5, 6, 7), equal_to_1=())]}, inductor_meta={'kernel_name': 'triton_per_fused_0', 'mutated_arg_names': []} ) @triton.jit def triton_(in_ptr0, in_ptr1, in_ptr2, out_ptr0, out_ptr1, out_ptr2, out_ptr3, out_ptr4, XBLOCK : tl.constexpr): pid = tl.program_id(0) if pid % 3 == 0: pid_offset = pid // 3 xnumel = 20 rnumel = 20 RBLOCK_0: tl.constexpr = 32 xoffset = pid_offset * XBLOCK xindex = xoffset + tl.arange(0, XBLOCK)[:, None] xmask = xindex < xnumel rindex = tl.arange(0, RBLOCK_0)[None, :] roffset = 0 rmask = rindex < rnumel r1 = rindex x0 = xindex tmp0 = tl.load(in_ptr0 + (x0 + (20*r1)), rmask & xmask, other=0.0) tmp1 = tl.broadcast_to(tmp0, [XBLOCK, RBLOCK_0]) tmp3 = tl.where(rmask & xmask, tmp1, float("-inf")) tmp4 = triton_helpers.max2(tmp3, 1)[:, None] tmp6 = tl.broadcast_to(rindex, tmp3.shape) _, tmp5_tmp = triton_helpers.max_with_index(tmp3, tmp6, 1) tmp5 = tmp5_tmp[:, None] tl.store(out_ptr0 + (x0), tmp4, xmask) tl.store(out_ptr1 + (x0), tmp5, xmask) elif pid % 3 == 1: pid_offset = pid // 3 xnumel = 10 rnumel = 10 RBLOCK_1: tl.constexpr = 16 xoffset = pid_offset * XBLOCK xindex = xoffset + tl.arange(0, XBLOCK)[:, None] xmask = xindex < xnumel rindex = 
tl.arange(0, RBLOCK_1)[None, :] roffset = 0 rmask = rindex < rnumel r3 = rindex x2 = xindex tmp7 = tl.load(in_ptr1 + (x2 + (10*r3)), rmask & xmask, other=0.0) tmp8 = tl.broadcast_to(tmp7, [XBLOCK, RBLOCK_1]) tmp10 = tl.where(rmask & xmask, tmp8, float("inf")) tmp11 = triton_helpers.min2(tmp10, 1)[:, None] tmp13 = tl.broadcast_to(rindex, tmp10.shape) _, tmp12_tmp = triton_helpers.min_with_index(tmp10, tmp13, 1) tmp12 = tmp12_tmp[:, None] tl.store(out_ptr2 + (x2), tmp11, xmask) tl.store(out_ptr3 + (x2), tmp12, xmask) elif pid % 3 == 2: pid_offset = pid // 3 xnumel = 10 rnumel = 10 RBLOCK_2: tl.constexpr = 16 xoffset = pid_offset * XBLOCK xindex = xoffset + tl.arange(0, XBLOCK)[:, None] xmask = xindex < xnumel rindex = tl.arange(0, RBLOCK_2)[None, :] roffset = 0 rmask = rindex < rnumel r5 = rindex x4 = xindex tmp14 = tl.load(in_ptr2 + (x4 + (10*r5)), rmask & xmask, other=0.0) tmp15 = tl.broadcast_to(tmp14, [XBLOCK, RBLOCK_2]) tmp17 = tl.where(rmask & xmask, tmp15, 0) tmp18 = tl.sum(tmp17, 1)[:, None] tl.store(out_ptr4 + (x4), tmp18, xmask) else: pass ``` Note: ComboKernels uses masks to allow combination of kernels working with tensors of different sizes. Test Plan: ``` buck2 test mode/dev-nosan caffe2/test/inductor:foreach ``` ``` buck2 test mode/dev-nosan caffe2/test/inductor:combo_kernels ``` Differential Revision: D54134695 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124969 Approved by: https://github.com/mlazos	2024-07-23 17:34:28 +00:00
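The interleaved dispatch described in #124969 (`pid % 3` selects the sub-kernel, `pid // 3` selects its block, and `xmask` guards tensors of different sizes) can be sketched in plain Python. This is only an illustrative simulation, not Inductor's code generator: `combo_launch`, `relu`, and `sigmoid` are hypothetical names, and the grid size (number of sub-kernels times the block count of the largest one) is an assumption made for the sketch.

```python
import math


def combo_launch(sub_kernels, xblock):
    """Simulate one launch of a combined kernel (hypothetical helper).

    sub_kernels: list of (fn, inputs, outputs); fn is applied element-wise.
    Program `pid` serves sub-kernel `pid % n` and handles block `pid // n`
    of it, mirroring the `pid % 3` branches in the generated Triton code.
    """
    n = len(sub_kernels)
    # Assumption: the grid is sized for the sub-kernel with the most blocks.
    max_blocks = max(-(-len(inp) // xblock) for _, inp, _ in sub_kernels)
    for pid in range(n * max_blocks):
        fn, inp, out = sub_kernels[pid % n]
        xoffset = (pid // n) * xblock
        for xindex in range(xoffset, xoffset + xblock):
            if xindex < len(inp):  # plays the role of xmask
                out[xindex] = fn(inp[xindex])


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Three independent element-wise ops on inputs of different lengths.
a, b, c = [-1.0, 2.0, -3.0], [0.0] * 5, [0.5]
out_a, out_b, out_c = [None] * 3, [None] * 5, [None] * 1
combo_launch(
    [(relu, a, out_a), (sigmoid, b, out_b), (math.tanh, c, out_c)],
    xblock=2,
)
```

Programs that land past the end of a short input do no work, which is the cost of sharing one block size across sub-kernels that the commit message mentions.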
Yueming Hao	979429ca89	[inductor]Add DtypeView to avoid memory leak and unnecessary kernel generations (#128883 ) Fixes #126338 ## Issue Summary When torchinductor compiles the combination `functional_collective -> view.dtype -> wait`, a memory leak occurs. This happens because `view.dtype` is compiled into an out-of-place Triton kernel that copies the input data to a new tensor, even if the collective has not yet finished producing the data (that is, before the `wait` operation completes). The tensor used by `collective` is only freed when the `wait` operation triggers the garbage collector, see [~WorkRegistry](https://github.com/pytorch/pytorch/blob/main/torch/csrc/distributed/c10d/Functional.cpp#L41). However, since `wait` now waits on the new tensor, the previous one is never freed. `view.dtype` should only change the metadata instead of creating a new tensor. The current lowering violates these semantics and causes memory leaks. See #126338 for further discussion. This kind of lowering also generates unnecessary Triton kernels for `view.dtype` when it can't be fused with other operations. ## Fix The function `aten.view.dtype` is a CPU operation that changes the metadata of its input. After discussions with @eellison and @bdhirsh, we decided to change the lowering of `aten.view.dtype` so that it falls back to the real `aten.view.dtype` instead of generating a Triton kernel in some cases. This approach also preserves the semantics of the view operation. When the model calls `aten.view.dtype` with a data type whose bit width matches the input's original data type, we lower it to the newly added `DtypeView` in IR, which acts like a `ReinterpretView`. When the operation can be fused, its `make_loader` is called to maintain the correct type conversion for each load instruction. When the operation can't be fused, it falls back to `aten.view.dtype` to avoid Triton kernel generation. 
## Example ```python @torch.compile def fn(x, y): x = x.view(torch.float16) y = y.view(torch.float16) + 1 return x @ y x = torch.randn((2, 2), device=self.device, dtype=torch.bfloat16) y = torch.randn((2, 2), device=self.device, dtype=torch.bfloat16) fn(x, y) ``` The output code generated before this fix looks like the following. ```python triton_poi_fused_view_0... def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr): xnumel = 4 xoffset = tl.program_id(0) * XBLOCK xindex = xoffset + tl.arange(0, XBLOCK)[:] xmask = xindex < xnumel x0 = xindex tmp0 = tl.load(in_ptr0 + (x0), xmask).to(tl.float32) tmp1 = tmp0.to(tl.bfloat16).to(tl.float32, bitcast=True).to(tl.float32) tl.store(out_ptr0 + (x0), tmp1, xmask) triton_poi_fused_add_view_1... def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr): xnumel = 4 xoffset = tl.program_id(0) * XBLOCK xindex = xoffset + tl.arange(0, XBLOCK)[:] xmask = xindex < xnumel x0 = xindex tmp0 = tl.load(in_ptr0 + (x0), xmask).to(tl.float32) tmp1 = tmp0.to(tl.bfloat16).to(tl.float32, bitcast=True).to(tl.float32) tmp2 = 1.0 tmp3 = tmp1 + tmp2 tl.store(out_ptr0 + (x0), tmp3, xmask) def call(args): ... triton_poi_fused_view_0.run(arg0_1, buf0, 4, grid=grid(4), stream=stream0) del arg0_1 buf1 = empty_strided_cuda((2, 2), (2, 1), torch.float16) # Source Nodes: [view_1, y], Original ATen: [aten.add, aten.view] triton_poi_fused_add_view_1.run(arg1_1, buf1, 4, grid=grid(4), stream=stream0) del arg1_1 buf2 = empty_strided_cuda((2, 2), (2, 1), torch.float16) # Source Nodes: [matmul, view_1, x, y], Original ATen: [aten.add, aten.mm, aten.view] extern_kernels.mm(buf0, buf1, out=buf2) ``` As you can see, the two `view` operations are compiled to two kernels, `triton_poi_fused_view_0` and `triton_poi_fused_add_view_1`. Both have a line `tmp1 = tmp0.to(tl.bfloat16).to(tl.float32, bitcast=True).to(tl.float32)` which does the type conversion. The main issue is that the first `view` operation didn't do anything to the actual data. 
But it generates a Triton kernel with a new output tensor. Another small issue is that this Triton kernel can't be compiled because `bitcast=True` only supports type conversion between types of the same bit width. The following is the output code generated after this PR. ```python triton_poi_fused_add_0... def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr): xnumel = 4 xoffset = tl.program_id(0) * XBLOCK xindex = xoffset + tl.arange(0, XBLOCK)[:] xmask = xindex < xnumel x0 = xindex tmp0 = tl.load(in_ptr0 + (x0), xmask).to(tl.float32) tmp1 = tmp0.to(tl.bfloat16).to(tl.float32) tmp2 = 1.0 tmp3 = tmp1 + tmp2 tl.store(out_ptr0 + (x0), tmp3, xmask) def call(args): ... triton_poi_fused_add_0.run(arg1_1, buf0, 4, grid=grid(4), stream=stream0) del arg1_1 buf1 = empty_strided_cuda((2, 2), (2, 1), torch.float16) # Source Nodes: [matmul, y], Original ATen: [aten.add, aten.mm] extern_kernels.mm(aten.view.dtype(arg0_1, torch.float16), buf0, out=buf1) ``` The first `view` operation has been replaced with `aten.view.dtype`, whose result is passed directly as an argument. The second one is still there because it is fused with the following add operation. The invalid bitcast operation is removed too. The following two code snippets handle the upcasts and downcasts. For dtypes in `torch.float16, torch.bfloat16`, each load is upcast to float32, then downcast to its original dtype so that values with the right precision are used. `7bda23ef84/torch/_inductor/codegen/triton.py (L1725-L1726)` `7bda23ef84/torch/_inductor/codegen/triton.py (L629-L642)` Huge thanks to @eellison, @bdhirsh, @shunting314, and @desertfire . Pull Request resolved: https://github.com/pytorch/pytorch/pull/128883 Approved by: https://github.com/eellison	2024-07-23 17:31:39 +00:00
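The point of #128883, that a dtype view with matching bit width should only reinterpret existing memory rather than copy it, can be illustrated with a standard-library analogy. This is not Inductor's `DtypeView`; it is a minimal sketch using `memoryview.cast`, with unsigned and signed 16-bit integers standing in for two 16-bit tensor dtypes.

```python
import array

# Two 16-bit words in one buffer; "H" is unsigned short (assumed 2 bytes here).
buf = array.array("H", [0x3C00, 0x4000])

# Reinterpret the same bytes as signed 16-bit values. memoryview.cast only
# changes how the buffer is interpreted; no data is copied. A cast must go
# through a byte format, hence the intermediate "B".
as_i16 = memoryview(buf).cast("B").cast("h")

same_length = len(as_i16) == len(buf)  # equal bit width keeps the element count

# Writing through the original buffer is visible through the view, which
# shows that both names refer to the same storage.
buf[0] = 0xFFFF
reinterpreted = as_i16[0]
```

A copying implementation would leave the view unchanged after the write, and it would keep a second allocation alive, which is the kind of extra tensor the old lowering produced.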
eellison	16a2a1aad3	Annotate graph.py (#131400 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131400 Approved by: https://github.com/shunting314	2024-07-23 07:04:12 +00:00
Xuehai Pan	b6d477fd56	[BE][Easy][16/19] enforce style for empty lines in import segments in `torch/_i*/` (#129768 ) See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter. You can review these PRs via: ```bash git diff --ignore-all-space --ignore-blank-lines HEAD~1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129768 Approved by: https://github.com/jansel	2024-07-20 16:20:58 +00:00
peaceorwell	6657b14a64	[inductor] Fix the method for checking the variable type of entry.numel (#131026 ) The data type of numel in the IterationRangesEntry class is sympy.Expr. To determine if it's an integer, we need to use sympy.Integer. Co-authored-by: peterbell10 <peterbell10@live.co.uk> Pull Request resolved: https://github.com/pytorch/pytorch/pull/131026 Approved by: https://github.com/peterbell10	2024-07-19 22:51:11 +00:00
chilli	31fc5b8966	Add support for inline_asm_elementwise in Inductor lowerings (#129846 ) This doesn't actually expose `inline_asm_elementwise` from any public API, but makes it pretty easy to register a lowering for a custom op that uses it. <img width="667" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/f125f4bb-4f8c-46e7-8e06-925f37ed2930"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129846 Approved by: https://github.com/shunting314	2024-07-03 02:34:03 +00:00
PyTorch MergeBot	03440a1c13	Revert "Add support for inline_asm_elementwise in Inductor lowerings (#129846 )" This reverts commit `badc638eb6`. Reverted https://github.com/pytorch/pytorch/pull/129846 on behalf of https://github.com/jeffdaily due to introduced ROCm breakages in trunk ([comment](https://github.com/pytorch/pytorch/pull/129846#issuecomment-2203519554))	2024-07-02 15:25:34 +00:00
chilli	badc638eb6	Add support for inline_asm_elementwise in Inductor lowerings (#129846 ) This doesn't actually expose `inline_asm_elementwise` from any public API, but makes it pretty easy to register a lowering for a custom op that uses it. <img width="667" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/f125f4bb-4f8c-46e7-8e06-925f37ed2930"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129846 Approved by: https://github.com/shunting314	2024-07-02 09:31:38 +00:00
Jason Ansel	b93bf55b6a	[halide-backend] Add GPU support (#127506 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127506 Approved by: https://github.com/shunting314, https://github.com/eellison ghstack dependencies: #126417, #129025, #129026	2024-06-29 14:06:21 +00:00
Andres Lugo	b9a1c2c991	[ROCm] Enable F8 Inductor Unit tests (#128353 ) First batch of inductor unit test enablement on ROCm for the fnuz f8 variant on MI300 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128353 Approved by: https://github.com/jansel, https://github.com/eellison	2024-06-26 18:30:43 +00:00

447 Commits