Commit Graph

23 Commits

Author SHA1 Message Date
blzheng
f9aa099074 [Inductor] fix issue: redeclaration of float g_tmp_buffer_xxx (#90270)
This PR fixes the issue: redeclaration of 'float g_tmp_buffer_in_ptr1[16] = {0};'.
If a bool or uint8 tensor is used by multiple ops, the tensor is loaded multiple times. Each load writes the declaration of this variable, i.e., `self.loads.writeline(f"float {g_tmp_buf}[{nelements}] = {{0}};")`, which introduces a redeclaration error.

![image](https://user-images.githubusercontent.com/69951214/205869956-5c325761-dc09-4aa8-a9ed-fad7f4c85917.png)
![image](https://user-images.githubusercontent.com/69951214/205870695-ee252f17-8f54-484f-9b0a-3a424c479327.png)
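
One way to avoid the redeclaration (an illustrative sketch, not necessarily the PR's actual fix) is to remember which temporary buffers have already been declared and emit the declaration only once:

```python
# Illustrative sketch only: emit the g_tmp_buffer declaration the first time a
# bool/uint8 buffer is loaded, so later loads reuse the variable instead of
# redeclaring it.
class Loads:
    def __init__(self):
        self.lines = []
        self.declared = set()

    def writeline(self, line):
        self.lines.append(line)

    def declare_tmp_buffer(self, g_tmp_buf, nelements):
        if g_tmp_buf not in self.declared:
            self.writeline(f"float {g_tmp_buf}[{nelements}] = {{0}};")
            self.declared.add(g_tmp_buf)

loads = Loads()
loads.declare_tmp_buffer("g_tmp_buffer_in_ptr1", 16)
loads.declare_tmp_buffer("g_tmp_buffer_in_ptr1", 16)  # second load: no redeclaration
print(loads.lines)  # ['float g_tmp_buffer_in_ptr1[16] = {0};']
```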

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90270
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/desertfire, https://github.com/jansel
2022-12-10 12:59:30 +00:00
Wang, Eikan
0bde810572 Add more debug information for Inductor (#90008)
- Add the graph index to the profile information of the Inductor kernel for better debuggability.

  The generated code for different graphs can produce kernels with the same name. The side effect is that it is hard to attribute the portion of E2E performance for these kernels, because the profiler aggregates results by kernel name regardless of which graph produced them. Hence, this PR adds the graph index to the profile information to address this limitation.

- Label arbitrary code ranges for `eager` and `opt` modes for better debuggability (see the sketch after this list)

  The profile information from the dynamo benchmarks mixes the eager and opt modes, making it hard to separate the ranges belonging to each mode. This PR adds `eager` and `opt` marks to the profile information to address this limitation.
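
A hedged sketch of the kind of range labeling described above (standard `torch.profiler` APIs; not the PR's actual harness code):

```python
import torch
import torch._dynamo
from torch.profiler import profile, record_function

model = torch.nn.Linear(64, 64)
opt_model = torch._dynamo.optimize("inductor")(model)
x = torch.randn(8, 64)

with profile() as prof:
    with record_function("eager"):
        model(x)
    with record_function("opt"):
        opt_model(x)
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=20))
```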

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90008
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-12-02 09:34:48 +00:00
Elias Ellison
6addc8d923 [Inductor] add expm1 lowering (#89961)
Improves the perf of inductor no-cudagraphs on nvidia-deeprecommender from 0.88 -> 0.96. I am looking into disabling implicit fallbacks for benchmark models in another PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89961
Approved by: https://github.com/ngimel
2022-12-02 04:29:54 +00:00
Wu, Chunyuan
a6caa9c54b Add a cpp wrapper for Inductor (#88167)
## Description
Implements https://github.com/pytorch/torchdynamo/issues/1556.
This PR adds a cpp wrapper to invoke the generated kernels. The cpp wrapper is turned off by default and can be turned on by setting:
```python
from torch._inductor import config
config.cpp_wrapper = True
```

### Example
The main part of the generated code:
```python
from torch.utils.cpp_extension import load_inline
wrapper = (
'''
#include <dlfcn.h>
#include <assert.h>
    std::tuple<at::Tensor, at::Tensor> call_0(std::tuple<at::Tensor, at::Tensor> args) {
    at::Tensor arg0_1, arg1_1;
    std::tie(arg0_1, arg1_1) = args;
    auto buf0 = at::empty_strided({8, 8}, {8, 1}, at::ScalarType::Float);
    auto buf1 = at::empty_strided({8, 8}, {1, 8}, at::ScalarType::Float);
    auto kernel0_lib = dlopen("/tmp/torchinductor_user/kn/ckn7ubcn2qbkme2vx5r6antnh5sv6d3o3t6qwdfgfoupnxty6pnm.so", RTLD_NOW);
    assert(kernel0_lib != nullptr);
    void (*kernel0)(const float*,const float*,float*,float*);
    *(void **) (&kernel0) = dlsym(kernel0_lib, "kernel");
    kernel0((float*)(arg0_1.data_ptr()), (float*)(arg1_1.data_ptr()), (float*)(buf0.data_ptr()), (float*)(buf1.data_ptr()));
    arg0_1.reset();
    arg1_1.reset();
    return std::make_tuple(buf0, buf1); }''' )

module = load_inline(
    name='inline_extension_c64wpbccpbre3th2k6oxwrjy5bhvxnmkdxkhcfxlsw7xpsg4eabu',
    cpp_sources=[wrapper],
    functions=['call_0'],
    extra_cflags=['-fPIC -Wall -std=c++14 -Wno-unused-variable -march=native -O3 -ffast-math -fno-finite-math-only -fopenmp'],
    extra_ldflags=['-shared  -lgomp'],
    extra_include_paths=['-I/home/user/pytorch/torch/include -I/home/user/pytorch/torch/include/torch/csrc/api/include -I/home/user/pytorch/torch/include/TH -I/home/user/pytorch/torch/include/THC -I/home/user/miniconda3/envs/pytorch/include/python3.7m'])

def _wrap_func(f):
    def g(args):
        return f(args)
    return g
call = _wrap_func(module.call_0)
```
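
A minimal sketch of how this path might be exercised end to end (illustrative; `torch._dynamo.optimize` is used here as the compile entry point and is not part of this PR):

```python
import torch
import torch._dynamo
from torch._inductor import config

config.cpp_wrapper = True  # opt in to the cpp wrapper described above

def fn(a, b):
    return a + b, (a * b).sigmoid()

opt_fn = torch._dynamo.optimize("inductor")(fn)
out = opt_fn(torch.randn(8, 8), torch.randn(8, 8))
```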

### Next steps
The below items will be addressed in upcoming PRs.
- [x] Support Reduction: #88561
- [x] Support None: #88560
- [ ] Support ExternKernel
   - [x] ATen GEMM-related OPs: #88667
   - [ ] ATen Conv
   - [ ] Conv/GEMM fusion OPs
- [x] Cache the kernel loading part: #89742
- [ ] De-allocate input buffers when possible by leveraging CPython APIs
- [ ] Support Constant

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88167
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/desertfire
2022-11-30 13:40:47 +00:00
Wang, Eikan
92f08f09d8 Vectorize erf (#89837)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89837
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/jansel
2022-11-30 06:42:36 +00:00
Jiong Gong
bb77accb4c [Inductor] Record cpp kernel in PyTorch Profiler (#89367)
Add an option `config.cpp.enable_kernel_profile` to record individual cpp kernel time in PyTorch Profiler.
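
A hedged usage sketch (the option name comes from the description above; the rest is illustrative):

```python
import torch
import torch._dynamo
from torch._inductor import config
from torch.profiler import profile, ProfilerActivity

config.cpp.enable_kernel_profile = True  # record individual cpp kernels

@torch._dynamo.optimize("inductor")
def fn(x, y):
    return (x + y).relu()

x, y = torch.randn(1024, 1024), torch.randn(1024, 1024)
fn(x, y)  # warm-up / compile
with profile(activities=[ProfilerActivity.CPU]) as prof:
    fn(x, y)
print(prof.key_averages().table(sort_by="cpu_time_total"))
```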

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89367
Approved by: https://github.com/jansel
2022-11-26 14:06:44 +00:00
Natalia Gimelshein
61a3fe4b64 make inductor correctly propagate nans for maximum and minimum (#89612)
Partially fixes https://github.com/pytorch/torchdynamo/issues/594
Also, small cleanup for `where` codegen
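
For reference, the eager semantics the compiled kernels must now match (an illustrative check, not from the PR):

```python
import torch

a = torch.tensor([1.0, float("nan")])
b = torch.tensor([2.0, 0.0])
print(torch.maximum(a, b))  # tensor([2., nan]) -- the NaN propagates
print(torch.minimum(a, b))  # tensor([1., nan])
```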

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89612
Approved by: https://github.com/soumith, https://github.com/jansel
2022-11-25 19:42:38 +00:00
Bin Bao
2823fc5e4c [inductor] generate nan in the cpp backend (#89289)
Summary: Fixes https://github.com/pytorch/torchdynamo/issues/1797

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89289
Approved by: https://github.com/ngimel, https://github.com/jansel, https://github.com/jgong5
2022-11-22 15:54:04 +00:00
Wang, Eikan
40cf214f2d Support masked_fill to address the GPT2 performance issue (#89274)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89274
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-11-22 04:12:43 +00:00
Wang, Eikan
bc716383a6 Redefine the simdlen semantic (#89263)
This PR targets automatically enabling vectorization optimization for TorchInductor. It refines the semantics of `config.cpp.simdlen`.

Originally, `None` meant to disable vectorization, while a specific value meant the number of elements to vectorize at a time. But that count depends on the data type: for a 256-bit SVE/SIMD ISA on ARM and X86, `simdlen` would need to be 8 for float but 16 for bfloat16. Hence, this PR redefines `simdlen` as a bit width. The detailed semantics are as follows.

- **_simdlen = None_**: Automatically determine the SIMD bit width: detect the HW capabilities and pick the proper vectorization ISA. On X86, AVX512 takes priority over AVX2.
- **_simdlen <=1_**: Explicitly disable SIMD
- **_simdlen > 1_**: Explicitly specify the SIMD bit width. It falls back to the disabled semantics if the specified bit width does not match an available ISA width (see the hedged examples below).
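
Hedged examples of the new semantics (the values below are illustrative):

```python
from torch._inductor import config

config.cpp.simdlen = None  # auto-detect; on X86, AVX512 is preferred over AVX2
config.cpp.simdlen = 1     # <= 1 explicitly disables SIMD
config.cpp.simdlen = 256   # request a 256-bit ISA (e.g. AVX2); behaves as
                           # disabled if no matching ISA width is available
```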

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89263
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-11-21 09:08:16 +00:00
Natalia Gimelshein
51e961dd7b use std/libdevice erf in inductor (#89388)
By itself, the libdevice version of erf has the same perf as our decomposition, but in real workloads it leads to better fusion groups (due to fewer ops in the fused kernel).
Bonus: a few fp64 test skips were removed, because our decomposition wasn't accurate enough for fp64 but the libdevice version is.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89388
Approved by: https://github.com/jansel
2022-11-21 00:58:03 +00:00
PyTorch MergeBot
706f791a19 Revert "Support masked_fill (#88736)"
This reverts commit 2b131b1d43.

Reverted https://github.com/pytorch/pytorch/pull/88736 on behalf of https://github.com/kit1980 due to Inductor tests are failing with AttributeError: module 'torch._inductor.codecache' has no attribute 'valid_vec_isa_list'
2022-11-17 18:27:08 +00:00
Wang, Eikan
2b131b1d43 Support masked_fill (#88736)
Support `masked_fill` to address the GPT2 performance issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88736
Approved by: https://github.com/jansel, https://github.com/jgong5
2022-11-17 15:18:29 +00:00
PyTorch MergeBot
4e1d19c5a5 Revert "Redefine the simdlen semantic: (#88482)"
This reverts commit fce6d6b3dc.

Reverted https://github.com/pytorch/pytorch/pull/88482 on behalf of https://github.com/kit1980 due to Broke multiple tests in several trunk workflows, for example https://github.com/pytorch/pytorch/actions/runs/3485086792/jobs/5830429554
2022-11-17 04:58:53 +00:00
Wang, Eikan
fce6d6b3dc Redefine the simdlen semantic: (#88482)
This PR targets automatically enabling vectorization optimization for TorchInductor. It refines the semantics of `config.cpp.simdlen`.

Originally, `None` meant to disable vectorization, while a specific value meant the number of elements to vectorize at a time. But that count depends on the data type: for a 256-bit SVE/SIMD ISA on ARM and X86, `simdlen` would need to be 8 for float but 16 for bfloat16. Hence, this PR redefines `simdlen` as a bit width. The detailed semantics are as follows.

- **_simdlen = None_**: Automatically determine the SIMD bit width: detect the HW capabilities and pick the proper vectorization ISA. On X86, AVX512 takes priority over AVX2.
- **_simdlen <=1_**: Explicitly disable SIMD
- **_simdlen > 1_**: Explicitly specify the SIMD bit width. It falls back to the disabled semantics if the specified bit width does not match an available ISA width.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88482
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-11-17 03:27:54 +00:00
Michael Lazos
c1553880de Have kernel names include fused ops (#88624)
- Propagates origin fx nodes through inlining during lowering
- Concatenates op names into the kernel name
- Adds a config to cap the number of ops in the kernel name so names don't get too long (illustrative sketch below)

Caveats:
- The ordering of ops in the name may not match the order in which they are executed in the kernel
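
An illustrative sketch (not the PR's code) of the naming idea: concatenate the originating op names and cap how many appear:

```python
def fused_kernel_name(op_names, max_ops=4, prefix="fused_"):
    # Cap the number of op names so kernel names do not get too long.
    shown = "_".join(op_names[:max_ops])
    extra = len(op_names) - max_ops
    return prefix + shown + (f"_{extra}_more" if extra > 0 else "")

print(fused_kernel_name(["add", "relu", "mul"]))               # fused_add_relu_mul
print(fused_kernel_name(["add", "mul", "exp", "sum", "div"]))  # fused_add_mul_exp_sum_1_more
```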

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88624
Approved by: https://github.com/anijain2305, https://github.com/jansel
2022-11-10 21:38:06 +00:00
blzheng
fca6ed02b9 [Inductor] fix c++ compile error with masked float value init (#88298)
Fixes #88201

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88298
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-11-09 10:40:25 +00:00
Wang, Eikan
ad27d762a7 Support sign for HF models like ElectraForQuestionAnswering (#88160)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88160
Approved by: https://github.com/jansel
2022-11-07 09:10:37 +00:00
Wang, Eikan
a9d37ce8f5 Support reduction vectorization (#87356)
This PR optimizes the reduction implementation with `at::vec`. The main idea is the same as the ATen implementation.
- Step 1: Parallelize and vectorize the reduction implementation
- Step 2: Invoke `at::vec::vec_reduce_all` to reduce the vector generated at Step 1 to a single scalar
- Step 3: Handle the tail elements

For the implementation, we create two kernels, `CppVecKernel` and `CppKernel`. The code block generation proceeds step by step as follows.

- Gen the non-reduction loop - [Code](https://github.com/pytorch/pytorch/blob/gh/EikanWang/9/head/torch/_inductor/codegen/cpp.py#L1008-L1010)
- Gen the reduction initialization both for vectorization and non-vectorization kernel - [Code](https://github.com/pytorch/pytorch/blob/gh/EikanWang/9/head/torch/_inductor/codegen/cpp.py#L1015)
- Gen the reduction loop for the vectorization kernel - [Code](https://github.com/pytorch/pytorch/blob/gh/EikanWang/9/head/torch/_inductor/codegen/cpp.py#L1021-L1023)
- Gen the code to reduce the vector to scalar - [Code](https://github.com/pytorch/pytorch/blob/gh/EikanWang/9/head/torch/_inductor/codegen/cpp.py#L1033)
- Gen the reduction loop for the non-vectorization kernel - [Code](https://github.com/pytorch/pytorch/blob/gh/EikanWang/9/head/torch/_inductor/codegen/cpp.py#L1042)
- Do post-reduction work, such as storing the reduction value - [Code](https://github.com/pytorch/pytorch/blob/gh/EikanWang/9/head/torch/_inductor/codegen/cpp.py#L1049)

```python
# Gen the non-reduction loop
for loop in CppVecKernel.NoneReductionLoop:
    # Gen the reduction initialization both for vectorization and non-vectorization kernel
    CppVecKernel.ReductionPrefix
    # Gen the reduction loop for the vectorization kernel
    for loop in CppVecKernel.ReductionLoop:
        CppVecKernel.Loads
        CppVecKernel.Compute
        CppVecKernel.Stores
    # Gen the code to reduce the vector to scalar
    CppVecKernel.ReductionSuffix
    # Gen the reduction loop for the non-vectorization kernel
    for loop in CppKernel.ReductionLoop:
        CppKernel.Loads
        CppKernel.Compute
        CppKernel.Stores
    # The reduction is almost finished. Do post-reduction work, such as storing the reduction value.
    CppKernel.ReductionSuffix
```
The following generated code snippet (a sum reduction) exemplifies the idea. More detailed comments are inlined.

```C++
    {
        // Declare reduction for at::vec::Vectorized since it is not a built-in data type.
        #pragma omp declare reduction(+:at::vec::Vectorized<float>:omp_out += omp_in) initializer(omp_priv={{0}})

        float tmp4 = 0;
        // tmp4_vec is used to vectorize the sum reduction for tmp4
        auto tmp4_vec = at::vec::Vectorized<float>(tmp4);
        float tmp6 = 0;
        // tmp6_vec is used to vectorize the sum reduction for tmp6
        auto tmp6_vec = at::vec::Vectorized<float>(tmp6);
        #pragma omp parallel num_threads(48)
        {
            // Parallelize the vectorized reduction
            #pragma omp for reduction(+:tmp4_vec) reduction(+:tmp6_vec)
            for(long i0=0; i0<192; i0+=1)
            {
                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + 8*i0);
                auto tmp1 = at::vec::Vectorized<float>::loadu(in_ptr1 + 8*i0);
                auto tmp2 = tmp0 - tmp1;
                auto tmp3 = tmp2.abs();
                auto tmp5 = tmp2 * tmp2;
                tmp4_vec += tmp3;
                tmp6_vec += tmp5;
            }
            // Reduce the tmp4_vec as a scalar and store at tmp4
            tmp4 = at::vec::vec_reduce_all<float>([](at::vec::Vectorized<float>& x, at::vec::Vectorized<float>&y) {return x + y;}, tmp4_vec);
            // Reduce the tmp6_vec as a scalar and store at tmp6
            tmp6 = at::vec::vec_reduce_all<float>([](at::vec::Vectorized<float>& x, at::vec::Vectorized<float>&y) {return x + y;}, tmp6_vec);
            // Handle the tail elements that could not be vectorized by aten (empty here since the main loop's 192 iterations of 8 already cover all 1536 elements).
            #pragma omp for simd simdlen(4) reduction(+:tmp4) reduction(+:tmp6)
            for(long i0=1536; i0<1536; i0+=1)
            {
                auto tmp0 = in_ptr0[i0];
                auto tmp1 = in_ptr1[i0];
                auto tmp2 = tmp0 - tmp1;
                auto tmp3 = std::abs(tmp2);
                auto tmp5 = tmp2 * tmp2;
                tmp4 += tmp3;
                tmp6 += tmp5;
            }
        }
        out_ptr0[0] = tmp4;
        out_ptr1[0] = tmp6;
    }
```

Performance (measured by operatorbench; the baseline of the speedup ratio is the ATen operator performance):
Softmax (1,16,384,384,dim=3) | Speedup ratio (simdlen=None) |  Speedup ratio (simdlen=8) + this PR
-- | -- | --
24c | 0.37410838067524177 | 0.9036240100351164
4c | 0.24655829520907663 | 1.0255329993674518
1c | 0.21595768114988007 | 1.000587368005134

HW Configuration:
SKU: SKX Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
MemTotal:       196708148 kB
MemFree:        89318532 kB
MemBandwidth:  112195.1MB/S

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87356
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-11-07 06:40:34 +00:00
Wang, Eikan
6541e51ffd Explicit vectorization support for TorchInductor (#87068)
In this PR, we replace OMP SIMD with `aten::vec` to optimize TorchInductor vectorization performance. Take `res = torch.exp(torch.add(x, y))` as an example; the generated code is as follows when `config.cpp.simdlen` is 8.

```C++
extern "C" void kernel(const float* __restrict__ in_ptr0,
                       const float* __restrict__ in_ptr1,
                       float* __restrict__ out_ptr0,
                       const long ks0,
                       const long ks1)
{
    #pragma omp parallel num_threads(48)
    {
        #pragma omp for
        for(long i0=0; i0<((ks0*ks1) / 8); ++i0)
        {
            auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + 8*i0);
            auto tmp1 = at::vec::Vectorized<float>::loadu(in_ptr1 + 8*i0);
            auto tmp2 = tmp0 + tmp1;
            auto tmp3 = tmp2.exp();
            tmp3.store(out_ptr0 + 8*i0);
        }
        #pragma omp for simd simdlen(4)
        for(long i0=8*(((ks0*ks1) / 8)); i0<ks0*ks1; ++i0)
        {
            auto tmp0 = in_ptr0[i0];
            auto tmp1 = in_ptr1[i0];
            auto tmp2 = tmp0 + tmp1;
            auto tmp3 = std::exp(tmp2);
            out_ptr0[i0] = tmp3;
        }
    }
}

```
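
A hedged sketch of how such a kernel might be produced (illustrative; note that `config.cpp.simdlen = 8` here uses the element-count semantics that predate #89263):

```python
import torch
import torch._dynamo
from torch._inductor import config

config.cpp.simdlen = 8  # 8 float elements per vector under the old semantics

@torch._dynamo.optimize("inductor")
def fn(x, y):
    return torch.exp(torch.add(x, y))

fn(torch.randn(123, 45), torch.randn(123, 45))  # odd sizes also exercise the tail loop
```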

The major pipeline is as follows.
- Check whether the loop body can be vectorized by `aten::vec`. The checker consists of two parts: [one](bf66991fc4/torch/_inductor/codegen/cpp.py (L702)) checks whether all the `ops` are supported, and the [other](355326faa3/torch/_inductor/codegen/cpp.py (L672)) checks whether the data access can be vectorized.
  - [`CppSimdVecKernelChecker`](355326faa3/torch/_inductor/codegen/cpp.py (L655))
- Create the `aten::vec` kernel and the original OMP SIMD kernel. The original OMP SIMD kernel serves the tail loop when the main loop is vectorized.
  - [`CppSimdVecKernel`](355326faa3/torch/_inductor/codegen/cpp.py (L601))
  - [`CppSimdVecOverrides`](355326faa3/torch/_inductor/codegen/cpp.py (L159)): the ops that we support on top of `aten::vec`
  - Create kernel
    - [`aten::vec` kernel](355326faa3/torch/_inductor/codegen/cpp.py (L924))
    - [`Original CPP kernel - OMP SIMD`](355326faa3/torch/_inductor/codegen/cpp.py (L929))
- Generate code
  - [`CppKernelProxy`](355326faa3/torch/_inductor/codegen/cpp.py (L753)) is used to combine the `aten::vec` kernel and original cpp kernel
    - [Vectorize the most inner loop](355326faa3/torch/_inductor/codegen/cpp.py (L753))
    - [Generate code](355326faa3/torch/_inductor/codegen/cpp.py (L821))

Next steps:
- [x] Support reduction
- [x] Vectorize the tail loop with `aten::vec`
- [ ] Support BF16
- [ ] Optimize the loop condition and loop index calculation by replacing `div` with `add`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87068
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-11-07 06:24:14 +00:00
Horace He
2418ddb1ec Unified symbolic shape variables between Inductor and AOTDispatcher (#87161)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87161
Approved by: https://github.com/jansel
2022-10-19 04:50:34 +00:00
Fabio Rocha
e4285f09b9 [inductor] new way to compile f64 libdevice calls (#87189)
Porting over [torchdynamo/#1633](https://github.com/pytorch/torchdynamo/pull/1633)

`torch/_inductor/codegen/triton.py` now defines `libdevice_<function>` variants
of some functions. You can request dispatch to those for
float64 dtypes when using `register_pointwise` by setting
`use_libdevice_for_f64=True`.
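
An illustrative sketch (not Inductor's actual code) of the dispatch idea, i.e. routing float64 to a `libdevice_` variant:

```python
import torch

def pointwise_fn_name(fn_name, dtype, use_libdevice_for_f64=True):
    # Assumed illustration: float64 dispatches to the libdevice_<fn> variant,
    # everything else to the regular Triton math function.
    if use_libdevice_for_f64 and dtype == torch.float64:
        return f"libdevice_{fn_name}"
    return f"tl.{fn_name}"

print(pointwise_fn_name("sqrt", torch.float64))  # libdevice_sqrt
print(pointwise_fn_name("sqrt", torch.float32))  # tl.sqrt
```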

Other minor changes:
- In Triton, sigmoid now codegens `tl.sigmoid`
- silu now comes from a decomp, not a lowering
- Some test skips are no longer necessary; they were removed or made xfails

Switching to `tl.sigmoid` has exactly the same performance.
Moving `silu` to a decomp does not change anything; the same Triton code is generated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87189
Approved by: https://github.com/ngimel
2022-10-18 19:13:11 +00:00
Jason Ansel
c7c09722ad Move TorchDynamo into PyTorch core (#86461)
Context:
https://github.com/pytorch/torchdynamo/issues/1588

This PR moves [TorchDynamo](https://github.com/pytorch/torchdynamo) and TorchInductor into PyTorch core.
- `torchdynamo` becomes `torch._dynamo`
- `torchinductor` becomes `torch._inductor`

This PR was generated by running `copy_to_core.sh` in https://github.com/pytorch/torchdynamo/pull/1538

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86461
Approved by: https://github.com/voznesenskym
2022-10-13 23:18:06 +00:00