Commit Graph

57 Commits

Author SHA1 Message Date
Bert Maher
d3d85e1c3b Emit torch.cuda.synchronize() after every kernel call in inductor (#90472)
Debugging illegal memory accesses is hard; even CUDA_LAUNCH_BLOCKING=1 and
using C10_CUDA_KERNEL_LAUNCH_CHECK don't necessarily guarantee a stack trace
pointing to the right kernel.  This diff adds a config option to force a CUDA
synchronize after every kernel call in inductor, for debugging those tricky cases.
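
A minimal usage sketch under assumptions: the exact flag name (`config.triton.debug_sync_kernel` below) is a guess based on this description, not confirmed by the message.

```python
import torch
from torch._inductor import config

# Assumed flag name: force torch.cuda.synchronize() after each generated kernel call.
config.triton.debug_sync_kernel = True

@torch.compile
def f(x):
    return (x * 2.0).relu()

# With the sync in place, an illegal memory access surfaces right at the offending kernel.
f(torch.randn(1024, device="cuda"))
```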

Differential Revision: [D41744967](https://our.internmc.facebook.com/intern/diff/D41744967/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90472
Approved by: https://github.com/jansel
2022-12-12 04:35:10 +00:00
blzheng
f9aa099074 [Inductor] fix issue: redeclaration of float g_tmp_buffer_xxx (#90270)
This PR fixes the issue: redeclaration of `float g_tmp_buffer_in_ptr1[16] = {0};`.
If a bool or uint8 tensor is used by multiple ops, the tensor will be loaded multiple times. Each load writes the declaration of this variable, i.e., `self.loads.writeline(f"float {g_tmp_buf}[{nelements}] = {{0}};")`, which introduces a redeclaration error.

![image](https://user-images.githubusercontent.com/69951214/205869956-5c325761-dc09-4aa8-a9ed-fad7f4c85917.png)
![image](https://user-images.githubusercontent.com/69951214/205870695-ee252f17-8f54-484f-9b0a-3a424c479327.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90270
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/desertfire, https://github.com/jansel
2022-12-10 12:59:30 +00:00
PyTorch MergeBot
b2795d3c4e Revert "[inductor] New approach for computing triton load/store masks (#89566)"
This reverts commit c6c2de586d.

Reverted https://github.com/pytorch/pytorch/pull/89566 on behalf of https://github.com/clee2000 due to broke test_invalid_operand_issue1_cuda in inductor/test_torchinductor on https://github.com/pytorch/pytorch/actions/runs/3657444733/jobs/6181700572
2022-12-09 19:36:25 +00:00
PyTorch MergeBot
6581063583 Revert "Dynamo, FX, Inductor Progress Bars (#88384)"
This reverts commit db0ce4acf3.

Reverted https://github.com/pytorch/pytorch/pull/88384 on behalf of https://github.com/malfet due to Broke test_public_bindings across the board
2022-12-09 16:32:25 +00:00
Fabio Rocha
c6c2de586d [inductor] New approach for computing triton load/store masks (#89566)
This PR changes the way masks for loads/stores are computed in triton backend of inductor.

The new approach is to iterate over all variables used in the indexing expression and add the corresponding mask variables to the set that will be used. For indexing variables like `x0`, `y1` and `r3` it adds `xmask`, `ymask` and `rmask` respectively.
For indexing variables like `tmp5` (i.e., indirect indexing), it uses the new `mask_vars` attribute of the corresponding `TritonCSEVariable` object, which is populated when the variable is created.

I started working on this with the aim of fixing https://github.com/pytorch/torchdynamo/issues/1654, which meanwhile was fixed by #89524 with a different approach, making this change less necessary. However note that #89524 fixes the issue by broadcasting the indices that are being loaded to a larger size, while this approach fixes it by making the mask have only the necessary terms.
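
For context, a minimal sketch of how mask variables guard loads and stores in inductor-style Triton code (a simplified, assumed shape, not output from this PR); the change is about attaching only the mask terms each access actually needs:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(in_ptr0, in_ptr1, out_ptr0, xnumel, XBLOCK: tl.constexpr):
    xoffset = tl.program_id(0) * XBLOCK
    x0 = xoffset + tl.arange(0, XBLOCK)
    xmask = x0 < xnumel                        # mask term for the x dimension only
    tmp0 = tl.load(in_ptr0 + x0, xmask)        # masked load
    tmp1 = tl.load(in_ptr1 + x0, xmask)
    tl.store(out_ptr0 + x0, tmp0 + tmp1, xmask)

x = torch.randn(1000, device="cuda")
y = torch.randn(1000, device="cuda")
out = torch.empty_like(x)
add_kernel[(triton.cdiv(1000, 256),)](x, y, out, 1000, XBLOCK=256)
```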

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89566
Approved by: https://github.com/jansel, https://github.com/ngimel
2022-12-09 12:43:19 +00:00
Mark Saroufim
db0ce4acf3 Dynamo, FX, Inductor Progress Bars (#88384)
There are 3 progress bars, each gated behind its own config, all off by default for now
1. Dynamo: Macro level config for dynamo, AOT, inductor
2. FX: Progress bar for each pass, with their names
3. Inductor

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88384
Approved by: https://github.com/wconstab, https://github.com/mlazos
2022-12-09 04:32:31 +00:00
William Wen
d224ac7f77 Remove logging.CODE (#90234)
Fixes https://github.com/pytorch/torchdynamo/issues/1932

Discussed with @mlazos: if we still want to separate streams for code logging and the rest of info, we can use a separate logger object with a unique name.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90234
Approved by: https://github.com/ezyang
2022-12-06 22:24:43 +00:00
Nikita Karetnikov
226e803ecb [Inductor] handle non-positive exponents in Pow (#90146)
Fixes #90125.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90146
Approved by: https://github.com/ezyang, https://github.com/jansel
2022-12-05 09:16:35 +00:00
Elias Ellison
acd68f9097 [Reland] dont clone args (#89766)
Reland of https://github.com/pytorch/pytorch/pull/89519.

Improves first memory compression on pytorch struct from 0.55 -> 0.73. However, it doesn't totally eliminate the overhead from autotuning because of the 250MB cache clearing in triton benchmarking.

Relanding because previously we weren't accounting for inplace buffer reuse correctly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89766
Approved by: https://github.com/jansel
2022-12-02 17:20:40 +00:00
Jean Schmidt
f62e54df8f Reland "Dynamo, FX, Inductor Progress Bars (#88384)" … (#90055)
This commit had an inconsistent internal land and PR merge. This caused merge conflicts that required reverting in both places, normalizing the internal commit stack, and then re-landing properly.

Original commit: #88384 (011452a2a1)
Inconsistent revert: #90018 (8566aa7c0b4bdca50bf85ca14705b4304de030b3)
Revert of the inconsistent revert to restore healthy state (or re-land of the original commit): cf3c3f2280
Landing the correct, internally congruent revert of the original commit: (This PR) #90055 (TBD)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90055
Approved by: https://github.com/DanilBaibak, https://github.com/malfet
2022-12-02 13:28:00 +00:00
PyTorch MergeBot
cf3c3f2280 Revert "Revert "Dynamo, FX, Inductor Progress Bars (#88384)" (#90018)"
This reverts commit bcf4292f04.

Reverted https://github.com/pytorch/pytorch/pull/90018 on behalf of https://github.com/jeanschmidt due to the landed internal commit not matching this one, causing a merge conflict and preventing importing and landing new commits
2022-12-02 09:57:31 +00:00
Wang, Eikan
0bde810572 Add more debug information for Inductor (#90008)
- Add graph index to the profile information of the Inductor kernel for better debuggability.

  The generated code for different graphs could produce kernels with the same name. The side effect is that it is hard to identify each kernel's portion of E2E performance, because the profiler aggregates performance by kernel name regardless of the graph. Hence, this PR adds the graph index to the profile information to address this limitation.

- Label arbitrary code ranges for `eager` and `opt` modes for better debuggability

  The profile information of dynamo benchmarks mixes eager mode and opt mode, making it hard to separate the ranges for the different modes. This PR adds eager and opt marks to the profile information to address this limitation, as sketched below.
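
A hedged illustration of the labeling idea (not the benchmark harness's actual code): `torch.profiler.record_function` can mark the eager and optimized ranges so they show up separately in the trace.

```python
import torch
from torch.profiler import profile, record_function

def f(x):
    return torch.relu(x * 2)

x = torch.randn(64, 64)
opt_f = torch.compile(f)  # assumes a build with torch.compile available
opt_f(x)                  # warm up so compilation is not profiled

with profile() as prof:
    with record_function("eager"):  # mark the eager-mode range
        f(x)
    with record_function("opt"):    # mark the compiled (opt) range
        opt_f(x)

print(prof.key_averages().table(sort_by="cpu_time_total"))
```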

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90008
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-12-02 09:34:48 +00:00
Elias Ellison
6addc8d923 [Inductor] add expm1 lowering (#89961)
Improves perf of inductor no-cudagraphs on nvidia-deeprecommender from 0.88 -> 0.96. I am looking into disabling implicit fallbacks for benchmark models in another PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89961
Approved by: https://github.com/ngimel
2022-12-02 04:29:54 +00:00
Animesh Jain
d09c52e4fd [inductor] Deterministic kernel names (#89713)
`node.origins` is a set and does not have an order. Therefore, inductor experiments with and without cudagraphs generate different kernel names, making them hard to debug.
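
A minimal sketch of the fix's idea (illustrative only, not the PR's code): sort the unordered origin set before joining so the generated name is stable.

```python
# node.origins is a set, so its iteration order can differ between runs/configs.
origins = {"mul", "add", "relu"}

unstable = "triton_" + "_".join(origins)        # order may vary run to run
stable = "triton_" + "_".join(sorted(origins))  # always "triton_add_mul_relu"
print(stable)
```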

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89713
Approved by: https://github.com/soumith, https://github.com/mlazos, https://github.com/ngimel
2022-12-02 02:37:36 +00:00
Eli Uriegas
bcf4292f04 Revert "Dynamo, FX, Inductor Progress Bars (#88384)" (#90018)
This breaks in environments that use the fake tqdm (015b05af18/torch/hub.py (L26)), which doesn't support the 'desc' kwarg and is not iterable.

Original try using pytorchbot did not go through because of a merge
conflict: https://github.com/pytorch/pytorch/pull/88384#issuecomment-1334272489

This reverts commit 011452a2a1.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90018
Approved by: https://github.com/drisspg, https://github.com/dbort
2022-12-01 20:17:07 +00:00
Wu, Chunyuan
a6caa9c54b Add a cpp wrapper for Inductor (#88167)
## Description
Implements https://github.com/pytorch/torchdynamo/issues/1556.
This PR adds a cpp wrapper to invoke the generated kernels. The cpp wrapper is turned off by default and can be turned on by setting:
```python
from torch._inductor import config
config.cpp_wrapper = True
```

### Example
The main part of the generated code:
```python
from torch.utils.cpp_extension import load_inline
wrapper = (
'''
#include <dlfcn.h>
#include <assert.h>
    std::tuple<at::Tensor, at::Tensor> call_0(std::tuple<at::Tensor, at::Tensor> args) {
    at::Tensor arg0_1, arg1_1;
    std::tie(arg0_1, arg1_1) = args;
    auto buf0 = at::empty_strided({8, 8}, {8, 1}, at::ScalarType::Float);
    auto buf1 = at::empty_strided({8, 8}, {1, 8}, at::ScalarType::Float);
    auto kernel0_lib = dlopen("/tmp/torchinductor_user/kn/ckn7ubcn2qbkme2vx5r6antnh5sv6d3o3t6qwdfgfoupnxty6pnm.so", RTLD_NOW);
    assert(kernel0_lib != nullptr);
    void (*kernel0)(const float*,const float*,float*,float*);
    *(void **) (&kernel0) = dlsym(kernel0_lib, "kernel");
    kernel0((float*)(arg0_1.data_ptr()), (float*)(arg1_1.data_ptr()), (float*)(buf0.data_ptr()), (float*)(buf1.data_ptr()));
    arg0_1.reset();
    arg1_1.reset();
    return std::make_tuple(buf0, buf1); }''' )

module = load_inline(
    name='inline_extension_c64wpbccpbre3th2k6oxwrjy5bhvxnmkdxkhcfxlsw7xpsg4eabu',
    cpp_sources=[wrapper],
    functions=['call_0'],
    extra_cflags=['-fPIC -Wall -std=c++14 -Wno-unused-variable -march=native -O3 -ffast-math -fno-finite-math-only -fopenmp'],
    extra_ldflags=['-shared  -lgomp'],
    extra_include_paths=['-I/home/user/pytorch/torch/include -I/home/user/pytorch/torch/include/torch/csrc/api/include -I/home/user/pytorch/torch/include/TH -I/home/user/pytorch/torch/include/THC -I/home/user/miniconda3/envs/pytorch/include/python3.7m'])

def _wrap_func(f):
    def g(args):
        return f(args)
    return g
call = _wrap_func(module.call_0)
```

### Next steps
The below items will be addressed in upcoming PRs.
- [x] Support Reduction: #88561
- [x] Support None: #88560
- [ ] Support ExternKernel
   - [x] ATen GEMM-related OPs: #88667
   - [ ] ATen Conv
   - [ ] Conv/GEMM fusion OPs
- [x] Cache the kernel loading part: #89742
- [ ] De-allocate input buffers when possible by leveraging CPython APIs
- [ ] Support Constant

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88167
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/desertfire
2022-11-30 13:40:47 +00:00
Wang, Eikan
92f08f09d8 Vectorize erf (#89837)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89837
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/jansel
2022-11-30 06:42:36 +00:00
Mark Saroufim
011452a2a1 Dynamo, FX, Inductor Progress Bars (#88384)
There are 3 progress bars, each gated behind its own config, all off by default for now
1. Dynamo: Macro level config for dynamo, AOT, inductor
2. FX: Progress bar for each pass, with their names
3. Inductor

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88384
Approved by: https://github.com/wconstab, https://github.com/mlazos
2022-11-30 06:07:14 +00:00
Jiong Gong
c75434ed4f [Inductor] Add an option to mark wrapper call in PyTorch profiler (#89674)
This PR adds an option `config.profiler_mark_wrapper_call` (disabled by default) to mark the duration of wrapper call in the PyTorch profiler. This makes it easy to identify the duration and start/end of each wrapper call in the profiler output.
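
A minimal usage sketch, assuming the option is toggled like other inductor config flags:

```python
import torch
from torch._inductor import config
from torch.profiler import profile

config.profiler_mark_wrapper_call = True  # option added by this PR (off by default)

@torch.compile
def f(x):
    return x.sin() + 1

x = torch.randn(128, 128)
f(x)  # compile outside the profiled region

with profile() as prof:
    f(x)
# The wrapper call should appear as its own labeled range in the output.
print(prof.key_averages().table(sort_by="cpu_time_total"))
```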

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89674
Approved by: https://github.com/jansel
2022-11-29 00:58:46 +00:00
Jiong Gong
bb77accb4c [Inductor] Record cpp kernel in PyTorch Profiler (#89367)
Add an option `config.cpp.enable_kernel_profile` to record individual cpp kernel time in PyTorch Profiler.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89367
Approved by: https://github.com/jansel
2022-11-26 14:06:44 +00:00
Natalia Gimelshein
3e20d023b1 put descriptive kernel names behind config (#89697)
Per title, generated kernel names are often long and confusing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89697
Approved by: https://github.com/Chillee
2022-11-26 03:08:23 +00:00
Natalia Gimelshein
61a3fe4b64 make inductor correctly propagate nans for maximum and minimum (#89612)
Partially fixes https://github.com/pytorch/torchdynamo/issues/594
Also, small cleanup for `where` codegen
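
For reference, eager `torch.maximum`/`torch.minimum` propagate NaNs, and compiled code should match that (a small illustration, not the PR's test):

```python
import torch

a = torch.tensor([1.0, float("nan")])
b = torch.tensor([2.0, 0.0])
print(torch.maximum(a, b))  # tensor([2., nan]) -- NaN propagates in eager mode
print(torch.minimum(a, b))  # tensor([1., nan])
```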

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89612
Approved by: https://github.com/soumith, https://github.com/jansel
2022-11-25 19:42:38 +00:00
Edward Z. Yang
0884fdaba0 Revert "Dont clone unmutated args in triton autotuning (#89519)" (#89652)
This reverts commit f18f0c70ab.

Testing to see if this fixes gmixer_24_224 mixer_b16_224

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89652
Approved by: https://github.com/eellison
2022-11-24 22:49:09 +00:00
Elias Ellison
f18f0c70ab Dont clone unmutated args in triton autotuning (#89519)
Improves first memory compression on pytorch struct from 0.55 -> 0.73. However, it doesn't totally eliminate the overhead from autotuning. Any other pointers on where the overhead is coming from in autotuning would be great.

Edit: I think it's just the triton cache clearing (44f577984d/python/triton/testing.py (L159))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89519
Approved by: https://github.com/ngimel, https://github.com/jansel
2022-11-23 22:00:03 +00:00
Animesh Jain
1cfd3858ac [inductor] Use dense masks for indirect indexing (#89524)
Fixes https://github.com/pytorch/torchdynamo/issues/1654

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89524
Approved by: https://github.com/jansel
2022-11-23 00:48:00 +00:00
Bin Bao
2823fc5e4c [inductor] generate nan in the cpp backend (#89289)
Summary: Fixes https://github.com/pytorch/torchdynamo/issues/1797

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89289
Approved by: https://github.com/ngimel, https://github.com/jansel, https://github.com/jgong5
2022-11-22 15:54:04 +00:00
Wang, Eikan
40cf214f2d Support masked_fill to address the GPT2 performance issue (#89274)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89274
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-11-22 04:12:43 +00:00
Peter Bell
1267dcf297 [inductor] Fix nan handling for aten.sign (#88937)
ATen gives `sign(nan) == 0` but inductor's cuda codegen would give
`sign(nan) == 1`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88937
Approved by: https://github.com/ngimel
2022-11-21 20:56:40 +00:00
Wang, Eikan
bc716383a6 Redefine the simdlen semantic (#89263)
This PR aims to automatically enable vectorization optimization for TorchInductor. It refines the semantics of `config.cpp.simdlen`.

Originally, `None` meant to disable vectorization while a specific value meant the number of elements to be vectorized at a time. But that number depends on the data type. Regarding 256bit SVE/SIMD ISA for ARM and X86, the `simdlen` should be 16 for Float while 32 for BFloat16. Hence, this PR redefines `simdlen` as the bit width. The detailed semantics are as follows, with a config sketch after the list.

- **_simdlen = None_**: Automatically determine the SIMD bit width. Detect HW information and pick the proper vectorization ISA. Specifically for X86, AVX512 takes priority over AVX2.
- **_simdlen <= 1_**: Explicitly disable SIMD
- **_simdlen > 1_**: Explicitly specify the SIMD bit width. It is equivalent to disabling SIMD if the bit width does not match the ISA width.
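
A small config sketch of the new semantics (the values shown are illustrative):

```python
from torch._inductor import config

config.cpp.simdlen = None   # auto-detect: pick the widest supported ISA (AVX512 preferred over AVX2 on x86)
# config.cpp.simdlen = 1    # <= 1: explicitly disable SIMD
# config.cpp.simdlen = 256  # explicit bit width; acts as disabled if no supported ISA matches this width
```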

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89263
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-11-21 09:08:16 +00:00
Natalia Gimelshein
51e961dd7b use std/libdevice erf in inductor (#89388)
By itself, the libdevice version of erf has the same perf as our decomposition, but in real workloads it leads to better fusion groups (due to fewer ops in the fused kernel).
Bonus: a few fp64 test skips were removed, because our decomposition wasn't accurate enough for fp64, but the libdevice version is.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89388
Approved by: https://github.com/jansel
2022-11-21 00:58:03 +00:00
PyTorch MergeBot
706f791a19 Revert "Support masked_fill (#88736)"
This reverts commit 2b131b1d43.

Reverted https://github.com/pytorch/pytorch/pull/88736 on behalf of https://github.com/kit1980 due to Inductor tests are failing with AttributeError: module 'torch._inductor.codecache' has no attribute 'valid_vec_isa_list'
2022-11-17 18:27:08 +00:00
Wang, Eikan
2b131b1d43 Support masked_fill (#88736)
Support `masked_fill` to address the GPT2 performance issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88736
Approved by: https://github.com/jansel, https://github.com/jgong5
2022-11-17 15:18:29 +00:00
PyTorch MergeBot
4e1d19c5a5 Revert "Redefine the simdlen semantic: (#88482)"
This reverts commit fce6d6b3dc.

Reverted https://github.com/pytorch/pytorch/pull/88482 on behalf of https://github.com/kit1980 due to Broke multiple tests in several trunk workflows, for example https://github.com/pytorch/pytorch/actions/runs/3485086792/jobs/5830429554
2022-11-17 04:58:53 +00:00
Wang, Eikan
fce6d6b3dc Redefine the simdlen semantic: (#88482)
This PR aims to automatically enable vectorization optimization for TorchInductor. It refines the semantics of `config.cpp.simdlen`.

Originally, `None` meant to disable vectorization while a specific value meant the number of elements to be vectorized at a time. But that number depends on the data type. Regarding 256bit SVE/SIMD ISA for ARM and X86, the `simdlen` should be 16 for Float while 32 for BFloat16. Hence, this PR redefines `simdlen` as the bit width. The detailed semantics are as follows.

- **_simdlen = None_**: Automatically determine the SIMD bit width. Detect HW information and pick the proper vectorization ISA. Specifically for X86, AVX512 takes priority over AVX2.
- **_simdlen <= 1_**: Explicitly disable SIMD
- **_simdlen > 1_**: Explicitly specify the SIMD bit width. It is equivalent to disabling SIMD if the bit width does not match the ISA width.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88482
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-11-17 03:27:54 +00:00
Fabio Rocha
9262d18e1b [inductor] Introduce CSEVariable type and use it to track if Triton variables are scalar (#88347)
This fixes https://github.com/pytorch/torchdynamo/issues/1515

To fix it, we need to keep track of whether a Triton variable is a scalar (so that we can skip the mask when doing indirect loads through it). This requires a way of annotating variable names generated by CSE with properties.

So now CSE will use the CSEVariable class to keep track of variables and let backends subclass it so they can annotate them with whatever information they want. TritonCSEVariable is such a subclass that tracks the `is_scalar` property.
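
A simplified sketch of the idea (the class and attribute names come from this description; the bodies are illustrative, not the PR's code): backends subclass `CSEVariable` to attach extra properties to CSE-generated names.

```python
class CSEVariable:
    """Wraps a CSE-generated variable name so backends can annotate it."""
    def __init__(self, name: str):
        self.name = name

    def __str__(self) -> str:
        return self.name

class TritonCSEVariable(CSEVariable):
    """Triton backend variant that tracks whether the value is a scalar."""
    def __init__(self, name: str):
        super().__init__(name)
        self.is_scalar = False  # when True, indirect loads through it skip the mask

tmp = TritonCSEVariable("tmp3")
tmp.is_scalar = True
print(tmp, tmp.is_scalar)  # tmp3 True
```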

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88347
Approved by: https://github.com/jgong5, https://github.com/ngimel
2022-11-15 20:52:37 +00:00
Jongsoo Park
0544a32ba3 [inductor] fix could not find as_strided with config.triton.mm=triton (#88946)
Summary: ReinterpretView doesn't seem to be handled properly with matrix multiply Triton kernels

Reviewed By: bertmaher

Differential Revision: D40836677

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88946
Approved by: https://github.com/jansel
2022-11-15 00:48:49 +00:00
Michael Lazos
c1553880de Have kernel names include fused ops (#88624)
- Propagates origin fx nodes through inlining during lowering
- Concatenates op names into kernel name
- Adds config to cap the number of ops in the kernel name so they don't get too long

Caveats:
- The ordering in the name may not match the order that the ops are executed in the kernel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88624
Approved by: https://github.com/anijain2305, https://github.com/jansel
2022-11-10 21:38:06 +00:00
blzheng
fca6ed02b9 [Inductor] fix c++ compile error with masked float value init (#88298)
Fixes #88201

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88298
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-11-09 10:40:25 +00:00
Peter Bell
8e2627d42f [inductor] Fix aten.fmod lowering (#88602)
Currently, the lowering for aten.fmod promotes integral types to float and calls
`tl.libdevice.fmod`, whereas the ATen behavior is to use the modulo operator.
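
A quick illustration of the ATen behavior the lowering should match (integer inputs stay integral, with the C-style sign convention):

```python
import torch

a = torch.tensor([7, -7], dtype=torch.int64)
b = torch.tensor([3, 3], dtype=torch.int64)
print(torch.fmod(a, b))        # tensor([ 1, -1]) -- result keeps the dividend's sign
print(torch.fmod(a, b).dtype)  # torch.int64 -- no promotion to float
```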

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88602
Approved by: https://github.com/jansel
2022-11-08 20:27:36 +00:00
Wang, Eikan
ad27d762a7 Support sign for HF models like ElectraForQuestionAnswering (#88160)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88160
Approved by: https://github.com/jansel
2022-11-07 09:10:37 +00:00
Wang, Eikan
a9d37ce8f5 Support reduction vectorization (#87356)
This PR optimizes the reduction implementation with `at::vec`. The main idea is the same as the aten implementation.
- Step 1: Parallelize and vectorize the reduction implementation
- Step 2: Invoke `at::vec::vec_reduce_all` to reduce the vector generated in step 1 to a single scalar
- Step 3: Handle the tail elements

For the implementation, we create two kernels - `CppVecKernel` and `CppKernel`. The code block generation proceeds step by step as follows.

- Gen the non-reduction loop - [Code](https://github.com/pytorch/pytorch/blob/gh/EikanWang/9/head/torch/_inductor/codegen/cpp.py#L1008-L1010)
- Gen the reduction initialization both for vectorization and non-vectorization kernel - [Code](https://github.com/pytorch/pytorch/blob/gh/EikanWang/9/head/torch/_inductor/codegen/cpp.py#L1015)
- Gen the reduction loop for the vectorization kernel - [Code](https://github.com/pytorch/pytorch/blob/gh/EikanWang/9/head/torch/_inductor/codegen/cpp.py#L1021-L1023)
- Gen the code to reduce the vector to scalar - [Code](https://github.com/pytorch/pytorch/blob/gh/EikanWang/9/head/torch/_inductor/codegen/cpp.py#L1033)
- Gen the reduction loop for the non-vectorization kernel - [Code](https://github.com/pytorch/pytorch/blob/gh/EikanWang/9/head/torch/_inductor/codegen/cpp.py#L1042)
- Do some post-reduction things like store reduction value - [Code](https://github.com/pytorch/pytorch/blob/gh/EikanWang/9/head/torch/_inductor/codegen/cpp.py#L1049)

```python
# Gen the non-reduction loop
for loop in CppVecKernel.NoneReductionLoop:
    # Gen the reduction initialization both for vectorization and non-vectorization kernel
    CppVecKernel.ReductionPrefix
    # Gen the reduction loop for the vectorization kernel
    for loop in CppVecKernel.ReductionLoop
        CppVecKernel.Loads
        CppVecKernel.Compute
        CppVecKernel.Stores
    # Gen the code to reduce the vector to scalar
    CppVecKernel.ReductionSuffix
    # Gen the reduction loop for the non-vectorization kernel
    for loop in CppKernel.ReductionLoop
        CppKernel.Loads
        CppKernel.Compute
        CppKernel.Stores
    # The reduction is almost finished. To do some post-reduction things like store reduction value.
    CppKernel.ReductionSuffix
```
The code snippet for maximum reduction exemplifies the idea. More detailed comments are inlined.

```C++
    {
        // Declare reduction for at::vec::Vectorized since it is not built-in data type.
        #pragma omp declare reduction(+:at::vec::Vectorized<float>:omp_out += omp_in) initializer(omp_priv={{0}})

        float tmp4 = 0;
        // tmp4_vec is used to vectorize the sum reduction for tmp4
        auto tmp4_vec = at::vec::Vectorized<float>(tmp4);
        float tmp6 = 0;
        // tmp6_vec is used to vectorize the sum reduction for tmp6
        auto tmp6_vec = at::vec::Vectorized<float>(tmp6);
        #pragma omp parallel num_threads(48)
        {
            // Parallelize the vectorized reduction
            #pragma omp for reduction(+:tmp4_vec) reduction(+:tmp6_vec)
            for(long i0=0; i0<192; i0+=1)
            {
                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + 8*i0);
                auto tmp1 = at::vec::Vectorized<float>::loadu(in_ptr1 + 8*i0);
                auto tmp2 = tmp0 - tmp1;
                auto tmp3 = tmp2.abs();
                auto tmp5 = tmp2 * tmp2;
                tmp4_vec += tmp3;
                tmp6_vec += tmp5;
            }
            // Reduce the tmp4_vec as a scalar and store at tmp4
            tmp4 = at::vec::vec_reduce_all<float>([](at::vec::Vectorized<float>& x, at::vec::Vectorized<float>&y) {return x + y;}, tmp4_vec);
            // Reduce the tmp6_vec as a scalar and store at tmp6
            tmp6 = at::vec::vec_reduce_all<float>([](at::vec::Vectorized<float>& x, at::vec::Vectorized<float>&y) {return x + y;}, tmp6_vec);
            // Handle the tail elements that could not be vectorized by aten.
            #pragma omp for simd simdlen(4) reduction(+:tmp4) reduction(+:tmp6)
            for(long i0=1536; i0<1536; i0+=1)
            {
                auto tmp0 = in_ptr0[i0];
                auto tmp1 = in_ptr1[i0];
                auto tmp2 = tmp0 - tmp1;
                auto tmp3 = std::abs(tmp2);
                auto tmp5 = tmp2 * tmp2;
                tmp4 += tmp3;
                tmp6 += tmp5;
            }
        }
        out_ptr0[0] = tmp4;
        out_ptr1[0] = tmp6;
    }
```

Performance (measured by operatorbench; the baseline for the speedup ratio is the aten operator performance):
Softmax (1,16,384,384,dim=3) | Speedup ratio (simdlen=None) |  Speedup ratio (simdlen=8) + this PR
-- | -- | --
24c | 0.37410838067524177 | 0.9036240100351164
4c | 0.24655829520907663 | 1.0255329993674518
1c | 0.21595768114988007 | 1.000587368005134

HW Configuration:
SKU: SKX Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
MemTotal:       196708148 kB
MemFree:        89318532 kB
MemBandwidth:  112195.1MB/S

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87356
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-11-07 06:40:34 +00:00
Wang, Eikan
6541e51ffd Explicit vectorization support for TorchInductor (#87068)
In this PR, we replace OMP SIMD with `aten::vec` to optimize TorchInductor vectorization performance. Take `res = torch.exp(torch.add(x, y))` as an example. The generated code is as follows if `config.cpp.simdlen` is 8.

```C++
extern "C" void kernel(const float* __restrict__ in_ptr0,
                       const float* __restrict__ in_ptr1,
                       float* __restrict__ out_ptr0,
                       const long ks0,
                       const long ks1)
{
    #pragma omp parallel num_threads(48)
    {
        #pragma omp for
        for(long i0=0; i0<((ks0*ks1) / 8); ++i0)
        {
            auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + 8*i0);
            auto tmp1 = at::vec::Vectorized<float>::loadu(in_ptr1 + 8*i0);
            auto tmp2 = tmp0 + tmp1;
            auto tmp3 = tmp2.exp();
            tmp3.store(out_ptr0 + 8*i0);
        }
        #pragma omp for simd simdlen(4)
        for(long i0=8*(((ks0*ks1) / 8)); i0<ks0*ks1; ++i0)
        {
            auto tmp0 = in_ptr0[i0];
            auto tmp1 = in_ptr1[i0];
            auto tmp2 = tmp0 + tmp1;
            auto tmp3 = std::exp(tmp2);
            out_ptr0[i0] = tmp3;
        }
    }
}

```

The major pipeline is as follows.
- Check whether the loop body could be vectorized by `aten::vec`. The checker consists of two parts. [One](bf66991fc4/torch/_inductor/codegen/cpp.py (L702)) is to check whether all the `ops` have been supported. The [other one](355326faa3/torch/_inductor/codegen/cpp.py (L672)) is to check whether the data access could be vectorized.
  - [`CppSimdVecKernelChecker`](355326faa3/torch/_inductor/codegen/cpp.py (L655))
- Create the `aten::vec` kernel and the original omp simd kernel. The original omp simd kernel handles the tail loop when the main loop is vectorized.
  - [`CppSimdVecKernel`](355326faa3/torch/_inductor/codegen/cpp.py (L601))
  - [`CppSimdVecOverrides`](355326faa3/torch/_inductor/codegen/cpp.py (L159)): The ops that we have supported on the top of `aten::vec`
  - Create kernel
    - [`aten::vec` kernel](355326faa3/torch/_inductor/codegen/cpp.py (L924))
    - [`Original CPP kernel - OMP SIMD`](355326faa3/torch/_inductor/codegen/cpp.py (L929))
- Generate code
  - [`CppKernelProxy`](355326faa3/torch/_inductor/codegen/cpp.py (L753)) is used to combine the `aten::vec` kernel and original cpp kernel
    - [Vectorize the most inner loop](355326faa3/torch/_inductor/codegen/cpp.py (L753))
    - [Generate code](355326faa3/torch/_inductor/codegen/cpp.py (L821))

Next steps:
- [x] Support reduction
- [x] Vectorize the tail loop with `aten::vec`
- [ ] Support BF16
- [ ] Optimize the loop condition and loop index calculation by replacing `div` with `add`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87068
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-11-07 06:24:14 +00:00
Natalia Gimelshein
b4fcfe77b2 reduce the number of autotuning iterations, don't autotune simple tiled copies (#88386)

Partially fixes https://github.com/pytorch/torchdynamo/issues/1807; reduces compile time for me from 360s to 90s.

Kernels with multiple outputs sometimes autotune to unexpected configs, so I'm limiting the heuristic to relatively safe application.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88386
Approved by: https://github.com/jansel
2022-11-03 15:58:18 +00:00
Fabio Rocha
4ab5d79b28 [inductor] Updated some triton.libdevice calls (#88242)
triton master no longer requires the `d` or `f` suffix
for some libdevice function calls - it dispatches to the right
library call based on argument type.

triton pin updated to
f16138d447

Also removed some xfails for some unrelated tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88242
Approved by: https://github.com/ngimel
2022-11-02 04:58:43 +00:00
Bin Bao
4e3a0ff92e Update how inductor cpu tests are skipped on fbcode (#87867)
cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87867
Approved by: https://github.com/anijain2305
2022-10-28 00:33:54 +00:00
PyTorch MergeBot
6cc4ae3d2d Revert "[Inductor] Enable Inductor unspec inputs test for different dtypes (#87809)"
This reverts commit 369755f8ce.

Reverted https://github.com/pytorch/pytorch/pull/87809 on behalf of https://github.com/kit1980 due to Broke trunk / cuda11.6-py3.10-gcc7-sm86 / test (default, 4, 4, linux.g5.4xlarge.nvidia.gpu), same error on pull.
2022-10-27 23:55:59 +00:00
Yanbo Liang
369755f8ce [Inductor] Enable Inductor unspec inputs test for different dtypes (#87809)

cc @jansel @mlazos @soumith @voznesenskym @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87809
Approved by: https://github.com/ngimel
2022-10-27 20:58:48 +00:00
William Wen
a605a30732 Fix CODE level usage in dynamo config.py (#87522)
Fixes https://github.com/pytorch/torchdynamo/issues/1718.

Tested by changing `log_level = logging.WARNING` in config.py to `log_level = logging.CODE` and running a test script that doesn't touch `log_level`.

cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87522
Approved by: https://github.com/mlazos
2022-10-25 22:47:54 +00:00
stumpOS
8a2a4ed488 consider numel args when identifying aligned args (#87394)
Fixes https://github.com/pytorch/torchdynamo/issues/1527

cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang @penguinwu
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87394
Approved by: https://github.com/jansel
2022-10-25 17:00:27 +00:00
Yanbo Liang
9ba632253a [Inductor] Convert 0d CPU tensor to scalar during triton codegen (#87329)
This is a follow-up to address [this](https://github.com/pytorch/torchdynamo/pull/1284#pullrequestreview-1130319129). We revised it to use the codegen approach to handle 0d CPU tensors, which no longer supports cudagraphs.

cc @jansel @lezcano @fdrocha
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87329
Approved by: https://github.com/ngimel
2022-10-21 01:24:00 +00:00