Commit Graph

57 Commits

Author SHA1 Message Date
Bert Maher
d3d85e1c3b Emit torch.cuda.synchronize() after every kernel call in inductor (#90472)
Debugging illegal memory accesses is hard; even CUDA_LAUNCH_BLOCKING=1 and
using C10_CUDA_KERNEL_LAUNCH_CHECK don't necessarily guarantee a stack trace
pointing to the right kernel.  This diff adds a config option to force a CUDA
synchronize after every kernel call in inductor, for debugging those tricky cases.
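
A minimal usage sketch under assumptions: the exact flag name (`config.triton.debug_sync_kernel` below) is a guess based on this description, not confirmed by the message.

```python
import torch
from torch._inductor import config

# Assumed flag name: force torch.cuda.synchronize() after each generated kernel call.
config.triton.debug_sync_kernel = True

@torch.compile
def f(x):
    return (x * 2.0).relu()

# With the sync in place, an illegal memory access surfaces right at the offending kernel.
f(torch.randn(1024, device="cuda"))
```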

Differential Revision: [D41744967](https://our.internmc.facebook.com/intern/diff/D41744967/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90472
Approved by: https://github.com/jansel
2022-12-12 04:35:10 +00:00
blzheng
f9aa099074 [Inductor] fix issue: redeclaration of float g_tmp_buffer_xxx (#90270)
This PR fixes the issue: redeclaration of `float g_tmp_buffer_in_ptr1[16] = {0};`.
If a bool or uint8 tensor is used by multiple ops, the tensor will be loaded multiple times. Each load writes the declaration of this variable, i.e., `self.loads.writeline(f"float {g_tmp_buf}[{nelements}] = {{0}};")`, which introduces a redeclaration error.

![image](https://user-images.githubusercontent.com/69951214/205869956-5c325761-dc09-4aa8-a9ed-fad7f4c85917.png)
![image](https://user-images.githubusercontent.com/69951214/205870695-ee252f17-8f54-484f-9b0a-3a424c479327.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90270
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/desertfire, https://github.com/jansel
2022-12-10 12:59:30 +00:00
PyTorch MergeBot
b2795d3c4e Revert "[inductor] New approach for computing triton load/store masks (#89566)"
This reverts commit c6c2de586d.

Reverted https://github.com/pytorch/pytorch/pull/89566 on behalf of https://github.com/clee2000 due to broke test_invalid_operand_issue1_cuda in inductor/test_torchinductor on https://github.com/pytorch/pytorch/actions/runs/3657444733/jobs/6181700572
2022-12-09 19:36:25 +00:00
PyTorch MergeBot
6581063583 Revert "Dynamo, FX, Inductor Progress Bars (#88384)"
This reverts commit db0ce4acf3.

Reverted https://github.com/pytorch/pytorch/pull/88384 on behalf of https://github.com/malfet due to Broke test_public_bindings across the board
2022-12-09 16:32:25 +00:00
Fabio Rocha
c6c2de586d [inductor] New approach for computing triton load/store masks (#89566)
This PR changes the way masks for loads/stores are computed in triton backend of inductor.

The new approach is to iterate over all variables used in the indexing expression and add the corresponding mask variables to the set that will be used. For indexing variables like `x0`, `y1` and `r3` it adds `xmask`, `ymask` and `rmask` respectively.
For indexing variables like `tmp5` (i.e., indirect indexing), it uses the new `mask_vars` attribute of the corresponding `TritonCSEVariable` object, which is populated when the variable is created.

I started working on this with the aim of fixing https://github.com/pytorch/torchdynamo/issues/1654, which meanwhile was fixed by #89524 with a different approach, making this change less necessary. However note that #89524 fixes the issue by broadcasting the indices that are being loaded to a larger size, while this approach fixes it by making the mask have only the necessary terms.
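
For context, a minimal sketch of how mask variables guard loads and stores in inductor-style Triton code (a simplified, assumed shape, not output from this PR); the change is about attaching only the mask terms each access actually needs:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(in_ptr0, in_ptr1, out_ptr0, xnumel, XBLOCK: tl.constexpr):
    xoffset = tl.program_id(0) * XBLOCK
    x0 = xoffset + tl.arange(0, XBLOCK)
    xmask = x0 < xnumel                        # mask term for the x dimension only
    tmp0 = tl.load(in_ptr0 + x0, xmask)        # masked load
    tmp1 = tl.load(in_ptr1 + x0, xmask)
    tl.store(out_ptr0 + x0, tmp0 + tmp1, xmask)

x = torch.randn(1000, device="cuda")
y = torch.randn(1000, device="cuda")
out = torch.empty_like(x)
add_kernel[(triton.cdiv(1000, 256),)](x, y, out, 1000, XBLOCK=256)
```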

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89566
Approved by: https://github.com/jansel, https://github.com/ngimel
2022-12-09 12:43:19 +00:00
Mark Saroufim
db0ce4acf3 Dynamo, FX, Inductor Progress Bars (#88384)
There are 3 progress bars, each gated behind its own config, all off by default for now
1. Dynamo: Macro level config for dynamo, AOT, inductor
2. FX: Progress bar for each pass, with their names
3. Inductor

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88384
Approved by: https://github.com/wconstab, https://github.com/mlazos
2022-12-09 04:32:31 +00:00
William Wen
d224ac7f77 Remove logging.CODE (#90234)
Fixes https://github.com/pytorch/torchdynamo/issues/1932

Discussed with @mlazos: if we still want to separate streams for code logging and the rest of info, we can use a separate logger object with a unique name.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90234
Approved by: https://github.com/ezyang
2022-12-06 22:24:43 +00:00
Nikita Karetnikov
226e803ecb [Inductor] handle non-positive exponents in Pow (#90146)
Fixes #90125.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90146
Approved by: https://github.com/ezyang, https://github.com/jansel
2022-12-05 09:16:35 +00:00
Elias Ellison
acd68f9097 [Reland] dont clone args (#89766)
Reland of https://github.com/pytorch/pytorch/pull/89519.

Improves first memory compression on pytorch struct from 0.55 -> 0.73. However, it doesn't totally eliminate the overhead from autotuning because of the 250MB cache clearing in triton benchmarking.

Relanding because previously we weren't accounting for inplace buffer reuse correctly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89766
Approved by: https://github.com/jansel
2022-12-02 17:20:40 +00:00
Jean Schmidt
f62e54df8f Reland "Dynamo, FX, Inductor Progress Bars (#88384)" … (#90055)
This commit had an inconsistent internal land and PR merge. This caused merge conflicts that required reverting in both places, normalizing the internal commit stack, and then re-landing properly.

Original commit: #88384 (011452a2a1)
Inconsistent revert: #90018 (8566aa7c0b4bdca50bf85ca14705b4304de030b3)
Revert of the inconsistent revert to restore healthy state (or re-land of the original commit): cf3c3f2280
Landing the correct, internally congruent revert of the original commit: (This PR) #90055 (TBD)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90055
Approved by: https://github.com/DanilBaibak, https://github.com/malfet
2022-12-02 13:28:00 +00:00
PyTorch MergeBot
cf3c3f2280 Revert "Revert "Dynamo, FX, Inductor Progress Bars (#88384)" (#90018)"
This reverts commit bcf4292f04.

Reverted https://github.com/pytorch/pytorch/pull/90018 on behalf of https://github.com/jeanschmidt due to the landed internal commit not matching this one, causing a merge conflict and preventing importing and landing new commits
2022-12-02 09:57:31 +00:00
Wang, Eikan
0bde810572 Add more debug information for Inductor (#90008)
- Add graph index to the profile information of the Inductor kernel for better debuggability.

  The generated code for different graphs could produce kernels with the same name. The side effect is that it is hard to identify each kernel's portion of E2E performance, because the profiler aggregates performance by kernel name regardless of the graph. Hence, this PR adds the graph index to the profile information to address this limitation.

- Label arbitrary code ranges for `eager` and `opt` modes for better debuggability

  The profile information of dynamo benchmarks mixes eager mode and opt mode, making it hard to separate the ranges for the different modes. This PR adds eager and opt marks to the profile information to address this limitation, as sketched below.
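
A hedged illustration of the labeling idea (not the benchmark harness's actual code): `torch.profiler.record_function` can mark the eager and optimized ranges so they show up separately in the trace.

```python
import torch
from torch.profiler import profile, record_function

def f(x):
    return torch.relu(x * 2)

x = torch.randn(64, 64)
opt_f = torch.compile(f)  # assumes a build with torch.compile available
opt_f(x)                  # warm up so compilation is not profiled

with profile() as prof:
    with record_function("eager"):  # mark the eager-mode range
        f(x)
    with record_function("opt"):    # mark the compiled (opt) range
        opt_f(x)

print(prof.key_averages().table(sort_by="cpu_time_total"))
```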

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90008
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-12-02 09:34:48 +00:00
Elias Ellison
6addc8d923 [Inductor] add expm1 lowering (#89961)
Improves perf of inductor no-cudagraphs on nvidia-deeprecommender from 0.88 -> 0.96. I am looking into disabling implicit fallbacks for benchmark models in another PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89961
Approved by: https://github.com/ngimel
2022-12-02 04:29:54 +00:00
Animesh Jain
d09c52e4fd [inductor] Deterministic kernel names (#89713)
`node.origins` is a set and does not have an order. Therefore, inductor experiments with and without cudagraphs generate different kernel names, making them hard to debug.
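
A minimal sketch of the fix's idea (illustrative only, not the PR's code): sort the unordered origin set before joining so the generated name is stable.

```python
# node.origins is a set, so its iteration order can differ between runs/configs.
origins = {"mul", "add", "relu"}

unstable = "triton_" + "_".join(origins)        # order may vary run to run
stable = "triton_" + "_".join(sorted(origins))  # always "triton_add_mul_relu"
print(stable)
```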

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89713
Approved by: https://github.com/soumith, https://github.com/mlazos, https://github.com/ngimel
2022-12-02 02:37:36 +00:00
Eli Uriegas
bcf4292f04 Revert "Dynamo, FX, Inductor Progress Bars (#88384)" (#90018)
This breaks in environments that use the fake tqdm (015b05af18/torch/hub.py (L26)), which doesn't support the 'desc' kwarg and is not iterable.

Original try using pytorchbot did not go through because of a merge
conflict: https://github.com/pytorch/pytorch/pull/88384#issuecomment-1334272489

This reverts commit 011452a2a1.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90018
Approved by: https://github.com/drisspg, https://github.com/dbort
2022-12-01 20:17:07 +00:00
Wu, Chunyuan
a6caa9c54b Add a cpp wrapper for Inductor (#88167)
## Description
Implements https://github.com/pytorch/torchdynamo/issues/1556.
This PR adds a cpp wrapper to invoke the generated kernels. The cpp wrapper is turned off by default and can be turned on by setting:
```python
from torch._inductor import config
config.cpp_wrapper = True
```

### Example
The main part of the generated code:
```python
from torch.utils.cpp_extension import load_inline
wrapper = (
'''
#include <dlfcn.h>
#include <assert.h>
    std::tuple<at::Tensor, at::Tensor> call_0(std::tuple<at::Tensor, at::Tensor> args) {
    at::Tensor arg0_1, arg1_1;
    std::tie(arg0_1, arg1_1) = args;
    auto buf0 = at::empty_strided({8, 8}, {8, 1}, at::ScalarType::Float);
    auto buf1 = at::empty_strided({8, 8}, {1, 8}, at::ScalarType::Float);
    auto kernel0_lib = dlopen("/tmp/torchinductor_user/kn/ckn7ubcn2qbkme2vx5r6antnh5sv6d3o3t6qwdfgfoupnxty6pnm.so", RTLD_NOW);
    assert(kernel0_lib != nullptr);
    void (*kernel0)(const float*,const float*,float*,float*);
    *(void **) (&kernel0) = dlsym(kernel0_lib, "kernel");
    kernel0((float*)(arg0_1.data_ptr()), (float*)(arg1_1.data_ptr()), (float*)(buf0.data_ptr()), (float*)(buf1.data_ptr()));
    arg0_1.reset();
    arg1_1.reset();
    return std::make_tuple(buf0, buf1); }''' )

module = load_inline(
    name='inline_extension_c64wpbccpbre3th2k6oxwrjy5bhvxnmkdxkhcfxlsw7xpsg4eabu',
    cpp_sources=[wrapper],
    functions=['call_0'],
    extra_cflags=['-fPIC -Wall -std=c++14 -Wno-unused-variable -march=native -O3 -ffast-math -fno-finite-math-only -fopenmp'],
    extra_ldflags=['-shared  -lgomp'],
    extra_include_paths=['-I/home/user/pytorch/torch/include -I/home/user/pytorch/torch/include/torch/csrc/api/include -I/home/user/pytorch/torch/include/TH -I/home/user/pytorch/torch/include/THC -I/home/user/miniconda3/envs/pytorch/include/python3.7m'])

def _wrap_func(f):
    def g(args):
        return f(args)
    return g
call = _wrap_func(module.call_0)
```

### Next steps
The below items will be addressed in upcoming PRs.
- [x] Support Reduction: #88561
- [x] Support None: #88560
- [ ] Support ExternKernel
   - [x] ATen GEMM-related OPs: #88667
   - [ ] ATen Conv
   - [ ] Conv/GEMM fusion OPs
- [x] Cache the kernel loading part: #89742
- [ ] De-allocate input buffers when possible by leveraging CPython APIs
- [ ] Support Constant

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88167
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/desertfire
2022-11-30 13:40:47 +00:00
Wang, Eikan
92f08f09d8 Vectorize erf (#89837)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89837
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/jansel
2022-11-30 06:42:36 +00:00
Mark Saroufim
011452a2a1 Dynamo, FX, Inductor Progress Bars (#88384)
There are 3 progress bars, each gated behind its own config, all off by default for now
1. Dynamo: Macro level config for dynamo, AOT, inductor
2. FX: Progress bar for each pass, with their names
3. Inductor

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88384
Approved by: https://github.com/wconstab, https://github.com/mlazos
2022-11-30 06:07:14 +00:00
Jiong Gong
c75434ed4f [Inductor] Add an option to mark wrapper call in PyTorch profiler (#89674)
This PR adds an option `config.profiler_mark_wrapper_call` (disabled by default) to mark the duration of wrapper call in the PyTorch profiler. This makes it easy to identify the duration and start/end of each wrapper call in the profiler output.
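
A minimal usage sketch, assuming the option is toggled like other inductor config flags:

```python
import torch
from torch._inductor import config
from torch.profiler import profile

config.profiler_mark_wrapper_call = True  # option added by this PR (off by default)

@torch.compile
def f(x):
    return x.sin() + 1

x = torch.randn(128, 128)
f(x)  # compile outside the profiled region

with profile() as prof:
    f(x)
# The wrapper call should appear as its own labeled range in the output.
print(prof.key_averages().table(sort_by="cpu_time_total"))
```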

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89674
Approved by: https://github.com/jansel
2022-11-29 00:58:46 +00:00
Jiong Gong
bb77accb4c [Inductor] Record cpp kernel in PyTorch Profiler (#89367)
Add an option `config.cpp.enable_kernel_profile` to record individual cpp kernel time in PyTorch Profiler.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89367
Approved by: https://github.com/jansel
2022-11-26 14:06:44 +00:00
Natalia Gimelshein
3e20d023b1 put descriptive kernel names behind config (#89697)
Per title, generated kernel names are often long and confusing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89697
Approved by: https://github.com/Chillee
2022-11-26 03:08:23 +00:00
Natalia Gimelshein
61a3fe4b64 make inductor correctly propagate nans for maximum and minimum (#89612)
Partially fixes https://github.com/pytorch/torchdynamo/issues/594
Also, small cleanup for `where` codegen
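
For reference, eager `torch.maximum`/`torch.minimum` propagate NaNs, and compiled code should match that (a small illustration, not the PR's test):

```python
import torch

a = torch.tensor([1.0, float("nan")])
b = torch.tensor([2.0, 0.0])
print(torch.maximum(a, b))  # tensor([2., nan]) -- NaN propagates in eager mode
print(torch.minimum(a, b))  # tensor([1., nan])
```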

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89612
Approved by: https://github.com/soumith, https://github.com/jansel
2022-11-25 19:42:38 +00:00
Edward Z. Yang
0884fdaba0 Revert "Dont clone unmutated args in triton autotuning (#89519)" (#89652)
This reverts commit f18f0c70ab.

Testing to see if this fixes gmixer_24_224 mixer_b16_224

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89652
Approved by: https://github.com/eellison
2022-11-24 22:49:09 +00:00
Elias Ellison
f18f0c70ab Dont clone unmutated args in triton autotuning (#89519)
Improves first memory compression on pytorch struct from 0.55 -> 0.73. However, it doesn't totally eliminate the overhead from autotuning. Any other pointers on where the overhead is coming from in autotuning would be great.

Edit: I think it's just the triton cache clearing (44f577984d/python/triton/testing.py (L159))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89519
Approved by: https://github.com/ngimel, https://github.com/jansel
2022-11-23 22:00:03 +00:00
Animesh Jain
1cfd3858ac [inductor] Use dense masks for indirect indexing (#89524)
Fixes https://github.com/pytorch/torchdynamo/issues/1654

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89524
Approved by: https://github.com/jansel
2022-11-23 00:48:00 +00:00
Bin Bao
2823fc5e4c [inductor] generate nan in the cpp backend (#89289)
Summary: Fixes https://github.com/pytorch/torchdynamo/issues/1797

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89289
Approved by: https://github.com/ngimel, https://github.com/jansel, https://github.com/jgong5
2022-11-22 15:54:04 +00:00
Wang, Eikan
40cf214f2d Support masked_fill to address the GPT2 performance issue (#89274)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89274
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-11-22 04:12:43 +00:00
Peter Bell
1267dcf297 [inductor] Fix nan handling for aten.sign (#88937)
ATen gives `sign(nan) == 0` but inductor's cuda codegen would give
`sign(nan) == 1`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88937
Approved by: https://github.com/ngimel
2022-11-21 20:56:40 +00:00
Wang, Eikan
bc716383a6 Redefine the simdlen semantic (#89263)
This PR aims to automatically enable vectorization optimization for TorchInductor. It refines the semantics of `config.cpp.simdlen`.

Originally, `None` meant to disable vectorization while a specific value meant the number of elements to be vectorized at a time. But that number depends on the data type. Regarding 256bit SVE/SIMD ISA for ARM and X86, the `simdlen` should be 16 for Float while 32 for BFloat16. Hence, this PR redefines `simdlen` as the bit width. The detailed semantics are as follows, with a config sketch after the list.

- **_simdlen = None_**: Automatically determine the SIMD bit width. Detect HW information and pick the proper vectorization ISA. Specifically for X86, AVX512 takes priority over AVX2.
- **_simdlen <= 1_**: Explicitly disable SIMD
- **_simdlen > 1_**: Explicitly specify the SIMD bit width. It is equivalent to disabling SIMD if the bit width does not match the ISA width.
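
A small config sketch of the new semantics (the values shown are illustrative):

```python
from torch._inductor import config

config.cpp.simdlen = None   # auto-detect: pick the widest supported ISA (AVX512 preferred over AVX2 on x86)
# config.cpp.simdlen = 1    # <= 1: explicitly disable SIMD
# config.cpp.simdlen = 256  # explicit bit width; acts as disabled if no supported ISA matches this width
```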

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89263
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-11-21 09:08:16 +00:00
Natalia Gimelshein
51e961dd7b use std/libdevice erf in inductor (#89388)
By itself, the libdevice version of erf has the same perf as our decomposition, but in real workloads it leads to better fusion groups (due to fewer ops in the fused kernel).
Bonus: a few fp64 test skips were removed, because our decomposition wasn't accurate enough for fp64, but the libdevice version is.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89388
Approved by: https://github.com/jansel
2022-11-21 00:58:03 +00:00
PyTorch MergeBot
706f791a19 Revert "Support masked_fill (#88736)"
This reverts commit 2b131b1d43.

Reverted https://github.com/pytorch/pytorch/pull/88736 on behalf of https://github.com/kit1980 due to Inductor tests are failing with AttributeError: module 'torch._inductor.codecache' has no attribute 'valid_vec_isa_list'
2022-11-17 18:27:08 +00:00
Wang, Eikan
2b131b1d43 Support masked_fill (#88736)
Support `masked_fill` to address the GPT2 performance issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88736
Approved by: https://github.com/jansel, https://github.com/jgong5
2022-11-17 15:18:29 +00:00
PyTorch MergeBot
4e1d19c5a5 Revert "Redefine the simdlen semantic: (#88482)"
This reverts commit fce6d6b3dc.

Reverted https://github.com/pytorch/pytorch/pull/88482 on behalf of https://github.com/kit1980 due to Broke multiple tests in several trunk workflows, for example https://github.com/pytorch/pytorch/actions/runs/3485086792/jobs/5830429554
2022-11-17 04:58:53 +00:00
Wang, Eikan
fce6d6b3dc Redefine the simdlen semantic: (#88482)
This PR aims to automatically enable vectorization optimization for TorchInductor. It refines the semantics of `config.cpp.simdlen`.

Originally, `None` meant to disable vectorization while a specific value meant the number of elements to be vectorized at a time. But that number depends on the data type. Regarding 256bit SVE/SIMD ISA for ARM and X86, the `simdlen` should be 16 for Float while 32 for BFloat16. Hence, this PR redefines `simdlen` as the bit width. The detailed semantics are as follows.

- **_simdlen = None_**: Automatically determine the SIMD bit width. Detect HW information and pick the proper vectorization ISA. Specifically for X86, AVX512 takes priority over AVX2.
- **_simdlen <= 1_**: Explicitly disable SIMD
- **_simdlen > 1_**: Explicitly specify the SIMD bit width. It is equivalent to disabling SIMD if the bit width does not match the ISA width.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88482
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-11-17 03:27:54 +00:00
Fabio Rocha
9262d18e1b [inductor] Introduce CSEVariable type and use it to track if Triton variables are scalar (#88347)
This fixes https://github.com/pytorch/torchdynamo/issues/1515

To fix it, we need to keep track of whether a Triton variable is a scalar (so that we can skip the mask when doing indirect loads through it). This requires a way of annotating variable names generated by CSE with properties.

So now CSE will use the CSEVariable class to keep track of variables and let backends subclass it so they can annotate them with whatever information they want. TritonCSEVariable is such a subclass that tracks the `is_scalar` property.
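
A simplified sketch of the idea (the class and attribute names come from this description; the bodies are illustrative, not the PR's code): backends subclass `CSEVariable` to attach extra properties to CSE-generated names.

```python
class CSEVariable:
    """Wraps a CSE-generated variable name so backends can annotate it."""
    def __init__(self, name: str):
        self.name = name

    def __str__(self) -> str:
        return self.name

class TritonCSEVariable(CSEVariable):
    """Triton backend variant that tracks whether the value is a scalar."""
    def __init__(self, name: str):
        super().__init__(name)
        self.is_scalar = False  # when True, indirect loads through it skip the mask

tmp = TritonCSEVariable("tmp3")
tmp.is_scalar = True
print(tmp, tmp.is_scalar)  # tmp3 True
```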

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88347
Approved by: https://github.com/jgong5, https://github.com/ngimel
2022-11-15 20:52:37 +00:00
Jongsoo Park
0544a32ba3 [inductor] fix could not find as_strided with config.triton.mm=triton (#88946)
Summary: ReinterpretView doesn't seem to be handled properly with matrix multiply Triton kernels

Reviewed By: bertmaher

Differential Revision: D40836677

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88946
Approved by: https://github.com/jansel
2022-11-15 00:48:49 +00:00
Michael Lazos
c1553880de Have kernel names include fused ops (#88624)
- Propagates origin fx nodes through inlining during lowering
- Concatenates op names into kernel name
- Adds config to cap the number of ops in the kernel name so they don't get too long

Caveats:
- The ordering in the name may not match the order that the ops are executed in the kernel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88624
Approved by: https://github.com/anijain2305, https://github.com/jansel
2022-11-10 21:38:06 +00:00
blzheng
fca6ed02b9 [Inductor] fix c++ compile error with masked float value init (#88298)
Fixes #88201

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88298
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-11-09 10:40:25 +00:00
Peter Bell
8e2627d42f [inductor] Fix aten.fmod lowering (#88602)
Currently, the lowering for aten.fmod promotes integral types to float and calls
`tl.libdevice.fmod`, whereas the ATen behavior is to use the modulo operator.
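
A quick illustration of the ATen behavior the lowering should match (integer inputs stay integral, with the C-style sign convention):

```python
import torch

a = torch.tensor([7, -7], dtype=torch.int64)
b = torch.tensor([3, 3], dtype=torch.int64)
print(torch.fmod(a, b))        # tensor([ 1, -1]) -- result keeps the dividend's sign
print(torch.fmod(a, b).dtype)  # torch.int64 -- no promotion to float
```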

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88602
Approved by: https://github.com/jansel
2022-11-08 20:27:36 +00:00
Wang, Eikan
ad27d762a7 Support sign for HF models like ElectraForQuestionAnswering (#88160)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88160
Approved by: https://github.com/jansel
2022-11-07 09:10:37 +00:00
Wang, Eikan
a9d37ce8f5 Support reduction vectorization (#87356)
This PR optimizes the reduction implementation with `at::vec`. The main idea is the same as the aten implementation.
- Step 1: Parallelize and vectorize the reduction implementation
- Step 2: Invoke `at::vec::vec_reduce_all` to reduce the vector generated in step 1 to a single scalar
- Step 3: Handle the tail elements

For the implementation, we create two kernels - `CppVecKernel` and `CppKernel`. The code block generation proceeds step by step as follows.

- Gen the non-reduction loop - [Code](https://github.com/pytorch/pytorch/blob/gh/EikanWang/9/head/torch/_inductor/codegen/cpp.py#L1008-L1010)
- Gen the reduction initialization both for vectorization and non-vectorization kernel - [Code](https://github.com/pytorch/pytorch/blob/gh/EikanWang/9/head/torch/_inductor/codegen/cpp.py#L1015)
- Gen the reduction loop for the vectorization kernel - [Code](https://github.com/pytorch/pytorch/blob/gh/EikanWang/9/head/torch/_inductor/codegen/cpp.py#L1021-L1023)
- Gen the code to reduce the vector to scalar - [Code](https://github.com/pytorch/pytorch/blob/gh/EikanWang/9/head/torch/_inductor/codegen/cpp.py#L1033)
- Gen the reduction loop for the non-vectorization kernel - [Code](https://github.com/pytorch/pytorch/blob/gh/EikanWang/9/head/torch/_inductor/codegen/cpp.py#L1042)
- Do some post-reduction things like store reduction value - [Code](https://github.com/pytorch/pytorch/blob/gh/EikanWang/9/head/torch/_inductor/codegen/cpp.py#L1049)

```python
# Gen the non-reduction loop
for loop in CppVecKernel.NoneReductionLoop:
    # Gen the reduction initialization both for vectorization and non-vectorization kernel
    CppVecKernel.ReductionPrefix
    # Gen the reduction loop for the vectorization kernel
    for loop in CppVecKernel.ReductionLoop
        CppVecKernel.Loads
        CppVecKernel.Compute
        CppVecKernel.Stores
    # Gen the code to reduce the vector to scalar
    CppVecKernel.ReductionSuffix
    # Gen the reduction loop for the non-vectorization kernel
    for loop in CppKernel.ReductionLoop
        CppKernel.Loads
        CppKernel.Compute
        CppKernel.Stores
    # The reduction is almost finished. To do some post-reduction things like store reduction value.
    CppKernel.ReductionSuffix
```
The code snippet for maximum reduction exemplifies the idea. More detailed comments are inlined.

```C++
    {
        // Declare reduction for at::vec::Vectorized since it is not built-in data type.
        #pragma omp declare reduction(+:at::vec::Vectorized<float>:omp_out += omp_in) initializer(omp_priv={{0}})

        float tmp4 = 0;
        // tmp4_vec is used to vectorize the sum reduction for tmp4
        auto tmp4_vec = at::vec::Vectorized<float>(tmp4);
        float tmp6 = 0;
        // tmp6_vec is used to vectorize the sum reduction for tmp6
        auto tmp6_vec = at::vec::Vectorized<float>(tmp6);
        #pragma omp parallel num_threads(48)
        {
            // Parallelize the vectorized reduction
            #pragma omp for reduction(+:tmp4_vec) reduction(+:tmp6_vec)
            for(long i0=0; i0<192; i0+=1)
            {
                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + 8*i0);
                auto tmp1 = at::vec::Vectorized<float>::loadu(in_ptr1 + 8*i0);
                auto tmp2 = tmp0 - tmp1;
                auto tmp3 = tmp2.abs();
                auto tmp5 = tmp2 * tmp2;
                tmp4_vec += tmp3;
                tmp6_vec += tmp5;
            }
            // Reduce the tmp4_vec as a scalar and store at tmp4
            tmp4 = at::vec::vec_reduce_all<float>([](at::vec::Vectorized<float>& x, at::vec::Vectorized<float>&y) {return x + y;}, tmp4_vec);
            // Reduce the tmp6_vec as a scalar and store at tmp6
            tmp6 = at::vec::vec_reduce_all<float>([](at::vec::Vectorized<float>& x, at::vec::Vectorized<float>&y) {return x + y;}, tmp6_vec);
            // Handle the tail elements that could not be vectorized by aten.
            #pragma omp for simd simdlen(4) reduction(+:tmp4) reduction(+:tmp6)
            for(long i0=1536; i0<1536; i0+=1)
            {
                auto tmp0 = in_ptr0[i0];
                auto tmp1 = in_ptr1[i0];
                auto tmp2 = tmp0 - tmp1;
                auto tmp3 = std::abs(tmp2);
                auto tmp5 = tmp2 * tmp2;
                tmp4 += tmp3;
                tmp6 += tmp5;
            }
        }
        out_ptr0[0] = tmp4;
        out_ptr1[0] = tmp6;
    }
```

Performance (measured by operatorbench; the baseline for the speedup ratio is the aten operator performance):
Softmax (1,16,384,384,dim=3) | Speedup ratio (simdlen=None) |  Speedup ratio (simdlen=8) + this PR
-- | -- | --
24c | 0.37410838067524177 | 0.9036240100351164
4c | 0.24655829520907663 | 1.0255329993674518
1c | 0.21595768114988007 | 1.000587368005134

HW Configuration:
SKU: SKX Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
MemTotal:       196708148 kB
MemFree:        89318532 kB
MemBandwidth:  112195.1MB/S

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87356
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-11-07 06:40:34 +00:00
Wang, Eikan
6541e51ffd Explicit vectorization support for TorchInductor (#87068)
In this PR, we replace OMP SIMD with `aten::vec` to optimize TorchInductor vectorization performance. Take `res = torch.exp(torch.add(x, y))` as an example. The generated code is as follows if `config.cpp.simdlen` is 8.

```C++
extern "C" void kernel(const float* __restrict__ in_ptr0,
                       const float* __restrict__ in_ptr1,
                       float* __restrict__ out_ptr0,
                       const long ks0,
                       const long ks1)
{
    #pragma omp parallel num_threads(48)
    {
        #pragma omp for
        for(long i0=0; i0<((ks0*ks1) / 8); ++i0)
        {
            auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + 8*i0);
            auto tmp1 = at::vec::Vectorized<float>::loadu(in_ptr1 + 8*i0);
            auto tmp2 = tmp0 + tmp1;
            auto tmp3 = tmp2.exp();
            tmp3.store(out_ptr0 + 8*i0);
        }
        #pragma omp for simd simdlen(4)
        for(long i0=8*(((ks0*ks1) / 8)); i0<ks0*ks1; ++i0)
        {
            auto tmp0 = in_ptr0[i0];
            auto tmp1 = in_ptr1[i0];
            auto tmp2 = tmp0 + tmp1;
            auto tmp3 = std::exp(tmp2);
            out_ptr0[i0] = tmp3;
        }
    }
}

```

The major pipeline is as follows.
- Check whether the loop body could be vectorized by `aten::vec`. The checker consists of two parts. [One](bf66991fc4/torch/_inductor/codegen/cpp.py (L702)) is to check whether all the `ops` have been supported. The [other one](355326faa3/torch/_inductor/codegen/cpp.py (L672)) is to check whether the data access could be vectorized.
  - [`CppSimdVecKernelChecker`](355326faa3/torch/_inductor/codegen/cpp.py (L655))
- Create the `aten::vec` kernel and the original omp simd kernel. The original omp simd kernel handles the tail loop when the main loop is vectorized.
  - [`CppSimdVecKernel`](355326faa3/torch/_inductor/codegen/cpp.py (L601))
  - [`CppSimdVecOverrides`](355326faa3/torch/_inductor/codegen/cpp.py (L159)): The ops that we have supported on the top of `aten::vec`
  - Create kernel
    - [`aten::vec` kernel](355326faa3/torch/_inductor/codegen/cpp.py (L924))
    - [`Original CPP kernel - OMP SIMD`](355326faa3/torch/_inductor/codegen/cpp.py (L929))
- Generate code
  - [`CppKernelProxy`](355326faa3/torch/_inductor/codegen/cpp.py (L753)) is used to combine the `aten::vec` kernel and original cpp kernel
    - [Vectorize the most inner loop](355326faa3/torch/_inductor/codegen/cpp.py (L753))
    - [Generate code](355326faa3/torch/_inductor/codegen/cpp.py (L821))

Next steps:
- [x] Support reduction
- [x] Vectorize the tail loop with `aten::vec`
- [ ] Support BF16
- [ ] Optimize the loop condition and loop index calculation by replacing `div` with `add`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87068
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-11-07 06:24:14 +00:00
Natalia Gimelshein
b4fcfe77b2 reduce the number of autotuning iterations, don't autotune simple tiled copies (#88386)

Partially fixes https://github.com/pytorch/torchdynamo/issues/1807; reduces compile time for me from 360s to 90s.

Kernels with multiple outputs sometimes autotune to unexpected configs, so I'm limiting the heuristic to relatively safe application.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88386
Approved by: https://github.com/jansel
2022-11-03 15:58:18 +00:00
Fabio Rocha
4ab5d79b28 [inductor] Updated some triton.libdevice calls (#88242)
triton master no longer requires the `d` or `f` suffix
for some libdevice function calls - it dispatches to the right
library call based on argument type.

triton pin updated to
f16138d447

Also removed some xfails for some unrelated tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88242
Approved by: https://github.com/ngimel
2022-11-02 04:58:43 +00:00
Bin Bao
4e3a0ff92e Update how inductor cpu tests are skipped on fbcode (#87867)
cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87867
Approved by: https://github.com/anijain2305
2022-10-28 00:33:54 +00:00
PyTorch MergeBot
6cc4ae3d2d Revert "[Inductor] Enable Inductor unspec inputs test for different dtypes (#87809)"
This reverts commit 369755f8ce.

Reverted https://github.com/pytorch/pytorch/pull/87809 on behalf of https://github.com/kit1980 due to Broke trunk / cuda11.6-py3.10-gcc7-sm86 / test (default, 4, 4, linux.g5.4xlarge.nvidia.gpu), same error on pull.
2022-10-27 23:55:59 +00:00
Yanbo Liang
369755f8ce [Inductor] Enable Inductor unspec inputs test for different dtypes (#87809)

cc @jansel @mlazos @soumith @voznesenskym @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87809
Approved by: https://github.com/ngimel
2022-10-27 20:58:48 +00:00
William Wen
a605a30732 Fix CODE level usage in dynamo config.py (#87522)
Fixes https://github.com/pytorch/torchdynamo/issues/1718.

Tested by changing `log_level = logging.WARNING` in config.py to `log_level = logging.CODE` and running a test script that doesn't touch `log_level`.

cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87522
Approved by: https://github.com/mlazos
2022-10-25 22:47:54 +00:00
stumpOS
8a2a4ed488 consider numel args when identifying aligned args (#87394)
Fixes https://github.com/pytorch/torchdynamo/issues/1527

cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang @penguinwu
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87394
Approved by: https://github.com/jansel
2022-10-25 17:00:27 +00:00
Yanbo Liang
9ba632253a [Inductor] Convert 0d CPU tensor to scalar during triton codegen (#87329)
This is a follow-up to address [this](https://github.com/pytorch/torchdynamo/pull/1284#pullrequestreview-1130319129). We revised it to use the codegen approach to handle 0d CPU tensors, which no longer supports cudagraphs.

cc @jansel @lezcano @fdrocha
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87329
Approved by: https://github.com/ngimel
2022-10-21 01:24:00 +00:00