- Add the graph index to the profile information of Inductor kernels for better debuggability.
The code generated for different graphs can produce kernels with the same name. As a side effect, it is hard to attribute the E2E performance share of these kernels because the profiler aggregates results by kernel name, regardless of which graph they came from. Hence, this PR adds the graph index to the profile information to address this limitation.
- Label arbitrary code ranges for the `eager` and `opt` modes for better debuggability.
The profile information of the dynamo benchmarks mixes eager mode and opt mode, making it hard to separate the ranges of the two modes. This PR adds `eager` and `opt` marks to the profile information to address this limitation; a sketch of separating the two modes in a trace follows.
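A minimal sketch of how `eager` and `opt` ranges can be separated in a profiler trace, using the public `torch.profiler` API and the current `torch.compile` entry point; the marker names and toy function are illustrative rather than the exact strings emitted by the dynamo benchmarks:
```python
import torch
from torch.profiler import profile, record_function

def fn(x, y):
    return torch.nn.functional.relu(x + y)

opt_fn = torch.compile(fn)  # the "opt" path; the benchmarks use the dynamo/inductor stack
x, y = torch.randn(1024, 1024), torch.randn(1024, 1024)

with profile() as prof:
    with record_function("eager"):  # illustrative marker for the eager range
        fn(x, y)
    with record_function("opt"):    # illustrative marker for the optimized range
        opt_fn(x, y)

# Kernel events now fall under distinct "eager"/"opt" ranges in the trace/table.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=20))
```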
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90008
Approved by: https://github.com/jgong5, https://github.com/jansel
This PR aims to automatically enable the vectorization optimization for TorchInductor. It refines the semantics of `config.cpp.simdlen`.
Originally, `None` meant that vectorization was disabled, while a specific value meant the number of elements vectorized at a time. But that count depends on the data type: with a 256-bit SIMD ISA (e.g., SVE on ARM or AVX2 on x86), a vector holds 8 float32 elements but 16 bfloat16 elements. Hence, this PR redefines `simdlen` as the SIMD bit width. The detailed semantics are as follows (a usage sketch follows the list).
- **_simdlen = None_**: Automatically determine the SIMD bit width: detect the HW capabilities and pick the proper vectorization ISA. Specifically on x86, AVX512 takes priority over AVX2.
- **_simdlen <= 1_**: Explicitly disable SIMD.
- **_simdlen > 1_**: Explicitly specify the SIMD bit width. If the bit width does not match a supported ISA width, this is equivalent to disabling vectorization.
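A minimal usage sketch of these semantics, assuming the current `torch._inductor.config` layout (the values shown are illustrative):
```python
import torch._inductor.config as inductor_config

# simdlen = None: auto-detect the ISA (on x86, AVX512 is preferred over AVX2).
inductor_config.cpp.simdlen = None

# simdlen <= 1: explicitly disable SIMD.
# inductor_config.cpp.simdlen = 1

# simdlen > 1: request a specific SIMD bit width; if no supported ISA matches
# this width, it behaves the same as disabling SIMD.
# inductor_config.cpp.simdlen = 256
```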
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89263
Approved by: https://github.com/jgong5, https://github.com/jansel
By itself, the libdevice version of erf has the same performance as our decomposition, but in real workloads it leads to better fusion groups (due to fewer ops in the fused kernel).
Bonus: a few fp64 test skips were removed, because our decomposition wasn't accurate enough for fp64 while the libdevice version is.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89388
Approved by: https://github.com/jansel
This PR aims to automatically enable the vectorization optimization for TorchInductor. It refines the semantics of `config.cpp.simdlen`.
Originally, `None` meant that vectorization was disabled, while a specific value meant the number of elements vectorized at a time. But that count depends on the data type: with a 256-bit SIMD ISA (e.g., SVE on ARM or AVX2 on x86), a vector holds 8 float32 elements but 16 bfloat16 elements. Hence, this PR redefines `simdlen` as the SIMD bit width. The detailed semantics are as follows.
- **_simdlen = None_**: Automatically determine the SIMD bit width: detect the HW capabilities and pick the proper vectorization ISA. Specifically on x86, AVX512 takes priority over AVX2.
- **_simdlen <= 1_**: Explicitly disable SIMD.
- **_simdlen > 1_**: Explicitly specify the SIMD bit width. If the bit width does not match a supported ISA width, this is equivalent to disabling vectorization.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88482
Approved by: https://github.com/jgong5, https://github.com/jansel
- Propagates origin fx nodes through inlining during lowering
- Concatenates op names into the kernel name
- Adds a config to cap the number of ops in the kernel name so names don't get too long (a rough sketch of the idea follows the caveats below)
Caveats:
- The op ordering in the name may not match the order in which the ops are executed in the kernel
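A rough, hypothetical illustration of the naming scheme (not the actual Inductor code; `kernel_name_max_ops` and `fused_kernel_name` are made-up stand-ins for the new config knob and the naming logic):
```python
kernel_name_max_ops = 8  # illustrative stand-in for the new config cap

def fused_kernel_name(origin_op_names, prefix="triton_"):
    # Concatenate the origin op names into the kernel name, capped so it stays short.
    capped = origin_op_names[:kernel_name_max_ops]
    return prefix + "_".join(capped)

print(fused_kernel_name(["add", "mul", "exp"]))  # -> triton_add_mul_exp
```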
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88624
Approved by: https://github.com/anijain2305, https://github.com/jansel
This PR optimizes the reduction implementation with `at::vec`. The main idea is the same as in the ATen implementation.
- Step 1: Parallelize and vectorize the reduction
- Step 2: Invoke `at::vec::vec_reduce_all` to reduce the vector produced in step 1 to a single scalar
- Step 3: Handle the tail elements
For the implementation, we create two kernels, `CppVecKernel` and `CppKernel`. Code generation proceeds step by step as follows.
- Gen the non-reduction loop - [Code](https://github.com/pytorch/pytorch/blob/gh/EikanWang/9/head/torch/_inductor/codegen/cpp.py#L1008-L1010)
- Gen the reduction initialization both for vectorization and non-vectorization kernel - [Code](https://github.com/pytorch/pytorch/blob/gh/EikanWang/9/head/torch/_inductor/codegen/cpp.py#L1015)
- Gen the reduction loop for the vectorization kernel - [Code](https://github.com/pytorch/pytorch/blob/gh/EikanWang/9/head/torch/_inductor/codegen/cpp.py#L1021-L1023)
- Gen the code to reduce the vector to scalar - [Code](https://github.com/pytorch/pytorch/blob/gh/EikanWang/9/head/torch/_inductor/codegen/cpp.py#L1033)
- Gen the reduction loop for the non-vectorization kernel - [Code](https://github.com/pytorch/pytorch/blob/gh/EikanWang/9/head/torch/_inductor/codegen/cpp.py#L1042)
- Do post-reduction work such as storing the reduction value - [Code](https://github.com/pytorch/pytorch/blob/gh/EikanWang/9/head/torch/_inductor/codegen/cpp.py#L1049)
```python
# Gen the non-reduction loop
for loop in CppVecKernel.NoneReductionLoop:
    # Gen the reduction initialization both for the vectorization and non-vectorization kernels
    CppVecKernel.ReductionPrefix

    # Gen the reduction loop for the vectorization kernel
    for loop in CppVecKernel.ReductionLoop:
        CppVecKernel.Loads
        CppVecKernel.Compute
        CppVecKernel.Stores

    # Gen the code to reduce the vector to a scalar
    CppVecKernel.ReductionSuffix

    # Gen the reduction loop for the non-vectorization kernel
    for loop in CppKernel.ReductionLoop:
        CppKernel.Loads
        CppKernel.Compute
        CppKernel.Stores

    # The reduction is almost finished. Do post-reduction work such as storing the reduction value.
    CppKernel.ReductionSuffix
```
The following code snippet, generated for a sum reduction, exemplifies the idea. More detailed comments are inlined.
```C++
{
// Declare the reduction for at::vec::Vectorized since it is not a built-in data type.
#pragma omp declare reduction(+:at::vec::Vectorized<float>:omp_out += omp_in) initializer(omp_priv={{0}})
float tmp4 = 0;
// tmp4_vec is used to vectorize the sum reduction for tmp4
auto tmp4_vec = at::vec::Vectorized<float>(tmp4);
float tmp6 = 0;
// tmp6_vec is used to vectorize the sum reduction for tmp6
auto tmp6_vec = at::vec::Vectorized<float>(tmp6);
#pragma omp parallel num_threads(48)
{
// Parallelize the vectorized reduction
#pragma omp for reduction(+:tmp4_vec) reduction(+:tmp6_vec)
for(long i0=0; i0<192; i0+=1)
{
auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + 8*i0);
auto tmp1 = at::vec::Vectorized<float>::loadu(in_ptr1 + 8*i0);
auto tmp2 = tmp0 - tmp1;
auto tmp3 = tmp2.abs();
auto tmp5 = tmp2 * tmp2;
tmp4_vec += tmp3;
tmp6_vec += tmp5;
}
// Reduce tmp4_vec to a scalar and store it in tmp4
tmp4 = at::vec::vec_reduce_all<float>([](at::vec::Vectorized<float>& x, at::vec::Vectorized<float>&y) {return x + y;}, tmp4_vec);
// Reduce tmp6_vec to a scalar and store it in tmp6
tmp6 = at::vec::vec_reduce_all<float>([](at::vec::Vectorized<float>& x, at::vec::Vectorized<float>&y) {return x + y;}, tmp6_vec);
// Handle the tail elements that could not be vectorized by aten.
#pragma omp for simd simdlen(4) reduction(+:tmp4) reduction(+:tmp6)
for(long i0=1536; i0<1536; i0+=1)
{
auto tmp0 = in_ptr0[i0];
auto tmp1 = in_ptr1[i0];
auto tmp2 = tmp0 - tmp1;
auto tmp3 = std::abs(tmp2);
auto tmp5 = tmp2 * tmp2;
tmp4 += tmp3;
tmp6 += tmp5;
}
}
out_ptr0[0] = tmp4;
out_ptr1[0] = tmp6;
}
```
Performance (measured with operatorbench; the baseline of the speedup ratio is the ATen operator's performance):
| Softmax (1, 16, 384, 384, dim=3) | Speedup ratio (simdlen=None) | Speedup ratio (simdlen=8) + this PR |
| -- | -- | -- |
| 24 cores | 0.37410838067524177 | 0.9036240100351164 |
| 4 cores | 0.24655829520907663 | 1.0255329993674518 |
| 1 core | 0.21595768114988007 | 1.000587368005134 |
HW Configuration:
SKU: SKX Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
MemTotal: 196708148 kB
MemFree: 89318532 kB
MemBandwidth: 112195.1MB/S
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87356
Approved by: https://github.com/jgong5, https://github.com/jansel
In this PR, we replace OMP SIMD with `aten::vec` to improve TorchInductor's vectorization performance. Take `res = torch.exp(torch.add(x, y))` as an example. The generated code is as follows when `config.cpp.simdlen` is 8.
```C++
extern "C" void kernel(const float* __restrict__ in_ptr0,
const float* __restrict__ in_ptr1,
float* __restrict__ out_ptr0,
const long ks0,
const long ks1)
{
#pragma omp parallel num_threads(48)
{
#pragma omp for
for(long i0=0; i0<((ks0*ks1) / 8); ++i0)
{
auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + 8*i0);
auto tmp1 = at::vec::Vectorized<float>::loadu(in_ptr1 + 8*i0);
auto tmp2 = tmp0 + tmp1;
auto tmp3 = tmp2.exp();
tmp3.store(out_ptr0 + 8*i0);
}
#pragma omp for simd simdlen(4)
for(long i0=8*(((ks0*ks1) / 8)); i0<ks0*ks1; ++i0)
{
auto tmp0 = in_ptr0[i0];
auto tmp1 = in_ptr1[i0];
auto tmp2 = tmp0 + tmp1;
auto tmp3 = std::exp(tmp2);
out_ptr0[i0] = tmp3;
}
}
}
```
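A minimal sketch for generating such a kernel locally, using the current `torch.compile` entry point (this PR predates it). Note that under the refined `simdlen` semantics from the PRs above, `config.cpp.simdlen` is a bit width rather than an element count, so the sketch leaves it at its default (auto-detect); setting `TORCH_COMPILE_DEBUG=1` in the environment is one way to inspect the generated C++:
```python
import torch

# Compile the example expression with the Inductor backend; on CPU this goes
# through the C++ codegen shown above.
fn = torch.compile(lambda x, y: torch.exp(torch.add(x, y)))

# A numel that is not a multiple of the vector length also exercises the tail loop.
x = torch.randn(128, 257)
y = torch.randn(128, 257)
res = fn(x, y)
```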
The major pipeline is as follows.
- Check whether the loop body can be vectorized with `aten::vec` (see the sketch after this list). The checker consists of two parts: [one](bf66991fc4/torch/_inductor/codegen/cpp.py (L702)) checks whether all the `ops` are supported, and the [other](355326faa3/torch/_inductor/codegen/cpp.py (L672)) checks whether the data access can be vectorized.
- [`CppSimdVecKernelChecker`](355326faa3/torch/_inductor/codegen/cpp.py (L655))
- Create the `aten::vec` kernel and the original OMP SIMD kernel. The original OMP SIMD kernel handles the tail loop when the main loop is vectorized.
- [`CppSimdVecKernel`](355326faa3/torch/_inductor/codegen/cpp.py (L601))
- [`CppSimdVecOverrides`](355326faa3/torch/_inductor/codegen/cpp.py (L159)): the ops that we support on top of `aten::vec`
- Create kernel
- [`aten::vec` kernel](355326faa3/torch/_inductor/codegen/cpp.py (L924))
- [`Original CPP kernel - OMP SIMD`](355326faa3/torch/_inductor/codegen/cpp.py (L929))
- Generate code
- [`CppKernelProxy`](355326faa3/torch/_inductor/codegen/cpp.py (L753)) is used to combine the `aten::vec` kernel and original cpp kernel
- [Vectorize the most inner loop](355326faa3/torch/_inductor/codegen/cpp.py (L753))
- [Generate code](355326faa3/torch/_inductor/codegen/cpp.py (L821))
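As a rough illustration of the checker idea only (a hypothetical sketch; `SUPPORTED_VEC_OPS` and `can_use_vec_kernel` are made-up names, not the actual Inductor API):
```python
# Hypothetical sketch: decide whether to emit the aten::vec kernel for a loop body.
SUPPORTED_VEC_OPS = {"load", "store", "add", "sub", "mul", "abs", "exp"}

def can_use_vec_kernel(loop_body_ops, dtype, innermost_access_contiguous):
    # Part 1: every op in the body must have an aten::vec override.
    ops_ok = all(op in SUPPORTED_VEC_OPS for op in loop_body_ops)
    # Part 2: the data access must be vectorizable, i.e. the innermost
    # dimension is read/written contiguously so loadu/store can be used.
    access_ok = dtype == "float32" and innermost_access_contiguous
    return ops_ok and access_ok

print(can_use_vec_kernel(["load", "add", "exp", "store"], "float32", True))  # True
```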
Next steps:
- [x] Support reduction
- [x] Vectorize the tail loop with `aten::vec`
- [ ] Support BF16
- [ ] Optimize the loop condition and loop index calculation by replacing `div` with `add`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87068
Approved by: https://github.com/jgong5, https://github.com/jansel
Porting over [torchdynamo/#1633](https://github.com/pytorch/torchdynamo/pull/1633)
`torch/_inductor/codegen/triton.py` now defines `libdevice_<function>` variants of some functions. You can request dispatch to those for float64 dtypes when using `register_pointwise` by setting `use_libdevice_for_f64=True`.
Other minor changes:
- In Triton, sigmoid now codegens `tl.sigmoid`
- silu now comes from decomp, not lowering
- Some test skips are no longer necessary; they were removed or turned into xfails
Switching to `tl.sigmoid` gives exactly the same performance.
Moving `silu` to the decomp does not change anything; the same Triton code is generated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87189
Approved by: https://github.com/ngimel