Commit Graph

43 Commits

Author SHA1 Message Date
Horace He
2a08a62777 Add extra metadata (as comments) to Inductor generated code (#96581)
New output
<img width="942" alt="image" src="https://user-images.githubusercontent.com/6355099/224794006-a993a2a8-d6ff-49da-8891-7b2373030a3d.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96581
Approved by: https://github.com/ngimel, https://github.com/shunting314, https://github.com/voznesenskym
2023-03-14 03:59:59 +00:00
Shunting Zhang
cc699c56dc reland #96248 [inductor] show performance for each autotune config for a kernel (#96458)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96458
Approved by: https://github.com/ngimel
2023-03-10 01:40:04 +00:00
PyTorch MergeBot
fe05266fda Revert "[reland][inductor] Add an AOT compilation mode for Inductor CPP backend (#95985)"
This reverts commit deaf9e5e65.

Reverted https://github.com/pytorch/pytorch/pull/95985 on behalf of https://github.com/huydhn due to Sorry for reverting this. It increased the test time significantly for ASAN (and maybe other test shards). ASAN tests on the PR passed but only barely avoided timing out. I have updated my initial findings in https://github.com/pytorch/pytorch/issues/96378
2023-03-09 01:45:24 +00:00
Bin Bao
deaf9e5e65 [reland][inductor] Add an AOT compilation mode for Inductor CPP backend (#95985)
Summary: This is a reland of https://github.com/pytorch/pytorch/pull/94822

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95985
Approved by: https://github.com/jansel
2023-03-08 20:02:32 +00:00
Shunting Zhang
962b3f78bd [inductor] run all kernel benchmarks individually in a compiled module (#95845)
This is a follow-up to PR #95506 to run all the triton kernels in a compiled module individually, as suggested by Horace.

Here are the steps:
1. Run the model as usual with a benchmark script and with TORCHINDUCTOR_BENCHMARK_KERNEL enabled. e.g.
```
TORCHINDUCTOR_BENCHMARK_KERNEL=1 python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --dashboard --only resnet18 --disable-cudagraphs --training
```
2. From the output we will see 3 lines like
```
Compiled module path: /tmp/torchinductor_shunting/rs/crsuc6zrt3y6lktz33jjqgpkuahya56xj6sentyiz7iv4pjud43j.py
```
That's because we have one graph module each for fwd/bwd/optimizer. Each graph module will have one such output line corresponding to its compiled module.

3. We can run the compiled module directly. Without any extra arguments, we maintain the previous behavior and run the `call` function, which does what the original graph module does but in a more efficient way. But if we add the `-k` argument, we will run a benchmark for each individual kernel in the file.

```
python /tmp/torchinductor_shunting/rs/crsuc6zrt3y6lktz33jjqgpkuahya56xj6sentyiz7iv4pjud43j.py -k
```

Example output:
<img width="430" alt="Screenshot 2023-03-01 at 4 51 06 PM" src="https://user-images.githubusercontent.com/52589240/222302996-814a85be-472b-463c-9e85-39d2c9d20e1a.png">

Note: I use the first 10 characters of the hash to identify each kernel since
1. the hash is easier to get in the code :)
2. a name like `triton__3` only makes sense within a compiled module, but a hash makes sense even without specifying the compiled module (assuming we keep enough bytes of the hash)

If we find a triton kernel with a hash like c226iuf2wi that has poor performance, we can look it up in the original compiled module file. This works because we comment each compiled triton kernel with its full hash.
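As a rough illustration of that naming idea (a hypothetical helper, not the PR's actual code), a short identifier can be derived from the kernel source the same way a cache key is, by hashing the source and keeping the first few characters:

```python
import base64
import hashlib

def short_kernel_id(kernel_source: str, length: int = 10) -> str:
    # Hash the kernel source and keep only the first few characters; a truncated
    # content hash stays meaningful even outside a particular compiled module.
    digest = hashlib.sha256(kernel_source.encode("utf-8")).digest()
    return base64.b32encode(digest).decode("ascii").lower()[:length]

print(short_kernel_id("triton kernel source ..."))  # 10-character identifier
```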

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95845
Approved by: https://github.com/Chillee
2023-03-06 21:30:33 +00:00
PyTorch MergeBot
879400e4e8 Revert "[inductor] Add an AOT compilation mode for Inductor CPP backend (#94822)"
This reverts commit 73b66098b2.

Reverted https://github.com/pytorch/pytorch/pull/94822 on behalf of https://github.com/clee2000 due to broke inductor_timm_cpu_accuracy, 73b66098b2 (11745396725)
2023-03-03 17:33:27 +00:00
Bin Bao
73b66098b2 [inductor] Add an AOT compilation mode for Inductor CPP backend (#94822)
Summary: The AOT mode currently works for the CPP backend. When turned on, Inductor compiles the model code into a .so file with aot_inductor_entry as the entry function. If the AOT compilation fails, Inductor will explicitly fail.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94822
Approved by: https://github.com/jansel
2023-03-03 14:18:09 +00:00
Will Constable
92a2107375 Support Inductor collectives with wait or collective outside graph (#95893)
Inductor implementations of collectives/wait must match the eager impls in _functional_collectives in terms of how they interact with the _register_tensor_work API. If they do, then splitting a collective-wait pair so that one half is in a compiled graph should work fine.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95893
Approved by: https://github.com/kumpera
2023-03-03 09:00:48 +00:00
Jason Ansel
00ebbba623 Remove torch._inductor.config.triton.convolution (#95842)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95842
Approved by: https://github.com/ngimel
2023-03-02 17:44:41 +00:00
Shunting Zhang
5d29b68bbc [inductor] generate triton kernel benchmark (#95506)
A PR to generate benchmark code for individual triton kernels. We can explore improving autotuning with the saved compiled kernel directly. This can potentially speed up our iteration and separate that concern from the upstream components that generate the compiled module.

Since I'm still ramping up on inductor, I'll reflect what I learned here so people can correct me if I'm wrong. In inductor, the WrapperCodeGen class is used to generate the compiled module for CUDA (or triton). Here is an example compiled module for a toy model like `def f(x): return sin(x) + cos(x)`: https://gist.github.com/shunting314/c6ed9f571919e3b414166f1696dcc61b . A compiled module contains the following parts:
- various triton kernels
- a wrapper (a method named `call`; the name is hardcoded) that calls the triton kernels and potentially ATen kernels to efficiently do the same work as the original Fx graph being compiled by inductor
- some utility code that generates random inputs and runs the wrapper
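A schematic sketch of that layout, heavily simplified and not actual Inductor output, might look like:

```python
# Simplified sketch of a compiled module's structure (not real generated code).
import torch

# 1. The triton kernels would be defined up here (via async_compile.triton in real output).

def call(args):
    # 2. The wrapper: unpack inputs, allocate buffers, launch the triton/ATen
    #    kernels, and return the outputs of the original Fx graph.
    (x,) = args
    buf0 = torch.sin(x) + torch.cos(x)  # stand-in for the launched kernels
    return (buf0,)

if __name__ == "__main__":
    # 3. Utility code: build random inputs and run the wrapper.
    print(call([torch.randn(1024)]))
```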

The triton kernels in the compiled module are annotated with a decorator like `pointwise`, which is used for autotuning.

This PR adds a config so that enabling it prints the path of the compiled module. It can be controlled from an environment variable as well.

The path to each compiled triton kernel is added as a comment in the compiled module, e.g.
```
# kernel path: /tmp/torchinductor_shunting/gn/cgn6x3mqoltu7q77gjnu2elwfupinsvcovqwibc6fhsoiy34tvga.py
triton__0 = async_compile.triton('''
import triton
import triton.language as tl
...
""")
````

Example command:
```
TORCHINDUCTOR_OUTPUT_COMPILED_MODULE_PATH=1 TORCHINDUCTOR_BENCHMARK_KERNEL=1 python benchmarks/dynamo/huggingface.py --backend inductor --amp --performance --training --dashboard --only AlbertForMaskedLM --disable-cudagraphs
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95506
Approved by: https://github.com/Chillee
2023-03-01 18:29:07 +00:00
Edward Z. Yang
58648822b6 Handle int/float arguments for cpp codegen in inductor (#95533)
This is a little questionable because we don't actually know what the dtype of the sympy expression is, and it's not clear we can rely on the assumptions.
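A rough illustration of the kind of heuristic this implies, relying only on sympy's `is_integer` assumption flag (my sketch, not the PR's actual logic):

```python
import sympy

def cpp_type_for(expr: sympy.Expr) -> str:
    # Guess a C++ scalar type from sympy's assumptions; this is exactly the
    # questionable part: if the assumptions are absent we fall back to double.
    return "long" if expr.is_integer else "double"

s = sympy.Symbol("s0", integer=True, positive=True)
print(cpp_type_for(2 * s + 1))  # long
print(cpp_type_for(s / 3))      # double (true division is not known to be integral)
```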

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95533
Approved by: https://github.com/ngimel, https://github.com/jansel
2023-02-28 03:57:35 +00:00
Horace He
01c861af14 Added utilities to instrument kernel bandwidth numbers (#95355)
Looks like

![image](https://user-images.githubusercontent.com/6355099/221048077-33aeff50-0951-42c9-89e9-22049db4f94d.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95355
Approved by: https://github.com/ngimel, https://github.com/jansel
2023-02-24 17:51:11 +00:00
Nicolas Macchioni
dd7e2b7c0e [pt2][inductor] update choice caller hashes (#94853)
Summary:
Update the hashing method for the `ChoiceCaller` class.

`TritonTemplateCaller` objects will now be hashed to:
`{name}-({BLOCK_M}, {BLOCK_N}, {BLOCK_K})-{num_stages}-{num_warps}-{code_hash}`

for example:
`triton_mm-(64, 32, 32)-4-8-cptlntwzcl2gaaofd2oabdwhaqv4ox3lluvbuxitjfhhpz6cyl4o`

`ExternKernelCaller` objects will now be hashed to:
`{name}-{kwargs.keys()[0]}={kwargs.vals()[0]}-...-{code_hash}`

for example:
`addmm-alpha=1-beta=1-c4xxd3iocu4yt6z4udrlqnumays7q6mfnfd3qprh4fxgsvyhqdkf`
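A minimal sketch of composing such keys (hypothetical helpers, not the actual `ChoiceCaller` code):

```python
def triton_choice_key(name, block_m, block_n, block_k, num_stages, num_warps, code_hash):
    # e.g. 'triton_mm-(64, 32, 32)-4-8-<code_hash>'
    return f"{name}-({block_m}, {block_n}, {block_k})-{num_stages}-{num_warps}-{code_hash}"

def extern_choice_key(name, kwargs, code_hash):
    # e.g. 'addmm-alpha=1-beta=1-<code_hash>'
    return "-".join([name, *(f"{k}={v}" for k, v in kwargs.items()), code_hash])

print(triton_choice_key("triton_mm", 64, 32, 32, 4, 8, "cptlntw..."))
print(extern_choice_key("addmm", {"alpha": 1, "beta": 1}, "c4xxd3i..."))
```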

Test Plan: sandcastle

Differential Revision: D43285470

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94853
Approved by: https://github.com/jansel, https://github.com/bertmaher
2023-02-16 00:11:26 +00:00
Natalia Gimelshein
a5daea69fb teach inductor to handle floor (#94341)
Per title; this happens when there's upsampling with a non-integer scale.
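For example, with a non-integer scale the output size is no longer an exact integer multiple of the input size, so a `floor` shows up in the shape/indexing math (an illustrative sketch, not inductor code):

```python
import math

def upsample_output_size(in_size: int, scale: float) -> int:
    # With scale=2.0 the product is exact; with a non-integer scale such as 1.5
    # the result must be floored, which is where floor() enters the expressions.
    return math.floor(in_size * scale)

print(upsample_output_size(10, 2.0))  # 20
print(upsample_output_size(9, 1.5))   # 13
```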

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94341
Approved by: https://github.com/ezyang
2023-02-10 11:21:57 +00:00
PyTorch MergeBot
6007874bbb Revert "teach inductor to handle floor (#94341)"
This reverts commit e7df9aaec8.

Reverted https://github.com/pytorch/pytorch/pull/94341 on behalf of https://github.com/huydhn due to Sorry for reverting your PR, but the CudaTest failure looks related.  It fails on both PR and trunk e7df9aaec8
2023-02-09 19:31:08 +00:00
Natalia Gimelshein
e7df9aaec8 teach inductor to handle floor (#94341)
Per title; this happens when there's upsampling with a non-integer scale.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94341
Approved by: https://github.com/ezyang
2023-02-09 17:09:35 +00:00
Eli Uriegas
567e6152da Revert "[inductor] fix crash issue when input is a view tensor (#90150)" (#94329)
Had to provide a merge conflict resolution due to conflicts with https://github.com/pytorch/pytorch/pull/94118

This was causing issues with internal tests that look similar to:
```
in clone_preserve_strides
    x.size(), x.stride(), x.storage_offset()
AttributeError: 'KeyedJaggedTensor' object has no attribute 'size'
```

See https://fburl.com/testinfra/nc0du2sp for more information

This reverts commit #90150

@jansel can you help @blzheng with re-landing this as a co-development diff?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94329
Approved by: https://github.com/jansel
2023-02-07 20:45:58 +00:00
Natalia Gimelshein
ca8450849b compute dynamic tensor shapes for indexing on the host (#93872)
Hoists computation of some shapes used in triton kernel indexing to the host, so the resulting triton code is
```
x1 = (xindex // pks0) % 64
```
instead of
```
x1 = (xindex // (1 + (((((-1) + ks0) // 4))*((((-1) + ks0) // 4))) + (2*((((-1) + ks0) // 4))))) % 64
```
with `pks0` arg computed on the host
```
ps0 = (1 + ((((-1) + s2) // 4)))*(1 + ((((-1) + s2) // 4)))
```
It doesn't work yet for indexing expressions that are directly in the `load` statement, e.g.
```
tmp0 = tl.load(in_ptr0 + (r1 + x0 + (x0*(((((-1) + ks0) // 32))*((((-1) + ks0) // 32)))) + (2*x0*((((-1) + ks0) // 32)))), rmask & xmask, eviction_policy='evict_last').to(tl.float32)
```
Unfortunately, `unet`, which is one of the examples failing with floor, does the latter:
```
tmp1 = ((-1)*(1/(((-1) + (floor(2.0*(ks0//16))))))) + ((1/(((-1) + (floor(2.0*(ks0//16))))))*(ks0 // 16))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93872
Approved by: https://github.com/jansel
2023-02-03 09:58:39 +00:00
blzheng
a71395dd88 [inductor] fix crash issue when input is a view tensor (#90150)
Fix the crash failure mentioned in https://github.com/pytorch/pytorch/issues/93460

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90150
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-02-03 04:54:14 +00:00
PyTorch MergeBot
5d259425fc Revert "[inductor] fix crash issue when input is a view tensor (#90150)"
This reverts commit b11ec270ba.

Reverted https://github.com/pytorch/pytorch/pull/90150 on behalf of https://github.com/clee2000 due to failing test_inplace_unsqueeze3 (__main__.CPUReproTests) https://github.com/pytorch/pytorch/actions/runs/4074618739/jobs/7020199369 b11ec270ba, marking as landrace cuz all jobs are green on pr
2023-02-02 17:06:34 +00:00
Will Constable
a14e3190e3 Mark buffers that reuse other buffers (#93329)
Provides a way at codegen time to emit code conditioned on
having a fresh allocation vs reusing an input.

- For collective ops, if reusing an input, a copy can be skipped
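A rough sketch of the kind of codegen decision this enables (hypothetical names, not the PR's API):

```python
def emit_collective_input(buffer_name: str, reuses_input: bool) -> str:
    # With a fresh allocation we must copy the input into the buffer first;
    # if the buffer reuses the input, the collective can consume it directly.
    if reuses_input:
        return f"# {buffer_name} reuses its input; copy skipped"
    return f"{buffer_name}.copy_(input_tensor)"

print(emit_collective_input("buf0", reuses_input=True))
print(emit_collective_input("buf0", reuses_input=False))
```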

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93329
Approved by: https://github.com/jansel
2023-02-02 14:22:26 +00:00
blzheng
b11ec270ba [inductor] fix crash issue when input is a view tensor (#90150)
Fix the crash failure mentioned in https://github.com/pytorch/pytorch/issues/93460

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90150
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-02-02 12:49:26 +00:00
Edward Z. Yang
ca9ebf9e2b Delete dynamo_import and inductor_import (#93851)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93851
Approved by: https://github.com/albanD, https://github.com/jansel
2023-02-02 01:51:29 +00:00
Wu, Chunyuan
42633cf5f9 Inductor cpp wrapper: cache the loading of the kernel (#89742)
### Pitch
Cache the loaded kernel to reduce the overhead.

#### Code before:
```cpp
std::vector<at::Tensor> call_0(std::tuple<at::Tensor&, at::Tensor&> args) {
    ...
    auto kernel_cpp_0_lib = dlopen("/tmp/torchinductor_xxx/yr/cyr3uymlc6pgvnimx3fnynaa4t7ldafeqzhe5zpizmvorisx4hb2.so", RTLD_NOW);
    assert(kernel_cpp_0_lib != nullptr);
    void (*kernel_cpp_0)(const float*,const float*,float*,float*);
    *(void **) (&kernel_cpp_0) = dlsym(kernel_cpp_0_lib, "kernel");
    kernel_cpp_0((float*)(arg0_1.data_ptr()), (float*)(arg1_1.data_ptr()), (float*)(buf0.data_ptr()), (float*)(buf1.data_ptr()));
    ...
}
```

#### Code after:
```cpp
template <typename KernelFunc>
KernelFunc load_cpp_kernel(const char* so_filename) {
    KernelFunc kernel_cpp;
    auto kernel_cpp_lib = dlopen(so_filename, RTLD_NOW);
    assert(kernel_cpp_lib != nullptr);
    *(void **) (&kernel_cpp) = dlsym(kernel_cpp_lib, "kernel");
    return kernel_cpp;
}

std::vector<at::Tensor> call_0(std::tuple<at::Tensor&, at::Tensor&> args) {
    ...
    static auto kernel_cpp_0 = load_cpp_kernel<void (*)(const float*,const float*,float*,float*)>("/tmp/torchinductor_xxx/yr/cyr3uymlc6pgvnimx3fnynaa4t7ldafeqzhe5zpizmvorisx4hb2.so");
    kernel_cpp_0((float*)(arg0_1.data_ptr()), (float*)(arg1_1.data_ptr()), (float*)(buf0.data_ptr()), (float*)(buf1.data_ptr()));
    ...
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89742
Approved by: https://github.com/jgong5, https://github.com/desertfire
2023-02-01 05:05:50 +00:00
Jason Ansel
9b173b87b2 Refactor away leftover import indirection (#92188)
These indirect ways of importing are a leftover from when we wanted to support both `import torchdynamo` and `import torch._dynamo`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92188
Approved by: https://github.com/desertfire
2023-01-18 04:53:05 +00:00
Natalia Gimelshein
5625f521a4 generate set_device call to ensure context existence (#92055)
Hopefully Fixes https://github.com/pytorch/torchdynamo/issues/2026
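Presumably the generated wrapper gains a prologue along these lines (a sketch of the idea, assuming a CUDA device is present; not the PR's exact output):

```python
import torch

def call(args):
    # Calling set_device up front ensures a CUDA context exists on the target
    # device before any stream or kernel launch APIs are used.
    torch.cuda.set_device(0)
    (arg0_1,) = args
    buf0 = arg0_1 * 2  # stand-in for the real kernel launches
    return (buf0,)
```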

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92055
Approved by: https://github.com/wconstab
2023-01-12 17:23:49 +00:00
Peter Bell
eece6da162 [inductor] Reduce device context manager overhead (#91045)
This adds `torch.cuda._DeviceGuard` which is a stripped down version of
`torch.cuda.device` with lower overhead. To do this, it only accepts `int` as
the device so we don't need to call `_get_device_index` and is implemented
with a new C++ helper `torch._C._cuda_exchangeDevice` that allows
`_DeviceGuard.__enter__` to be just a single function call. On my machine,
I see a drop from 3.8us of overhead to 0.94 us with this simple benchmark:

```python
def set_device():
    with torch.cuda.device(0):
        pass

%timeit set_device()
```
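For comparison, the lower-overhead path uses the new private `_DeviceGuard` (a sketch based on this PR's description; private API, so subject to change):

```python
import torch

def set_device_guard():
    # _DeviceGuard takes a plain int index, skipping _get_device_index, and its
    # __enter__ boils down to a single torch._C._cuda_exchangeDevice call.
    with torch.cuda._DeviceGuard(0):
        pass

# %timeit set_device_guard()  # ~0.94us vs ~3.8us for torch.cuda.device(0), per the PR
```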

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91045
Approved by: https://github.com/ngimel, https://github.com/anijain2305
2023-01-12 16:51:59 +00:00
Jason Ansel
7c1c239db1 [inductor] Rewrite Triton templates + epilogue fusion (retry) (#91575)
This reverts commit 94262efc7d to reland #91105 / #90738.

Fixes https://github.com/pytorch/torchdynamo/issues/2015

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91575
Approved by: https://github.com/ngimel
2023-01-11 00:08:03 +00:00
PyTorch MergeBot
94262efc7d Revert "[inductor] Rewrite Triton templates + epilogue fusion (retry) (#91105)"
This reverts commit d6dd2e97da.

Reverted https://github.com/pytorch/pytorch/pull/91105 on behalf of https://github.com/atalman due to Broke internal builds
2022-12-21 00:02:38 +00:00
Jason Ansel
d6dd2e97da [inductor] Rewrite Triton templates + epilogue fusion (retry) (#91105)
https://github.com/pytorch/pytorch/pull/90738 seems a bit borked. ghimport fails on it, and I unlinked it from the Phabricator diff, but it still won't land. This is an exact copy of that PR without using ghstack.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91105
Approved by: https://github.com/ngimel
2022-12-20 02:38:23 +00:00
Natalia Gimelshein
a10b3ce876 generate device context managers in inductor code (#90934)
Fixes https://github.com/pytorch/torchdynamo/issues/1717, https://github.com/pytorch/torchdynamo/issues/1990

<s>TODO: add test with multiple devices, figure out extra context initialization</s>

Problems:
<s>It still initializes context on 0-th device that it shouldn't, I'll take a look where that happens and fix before landing</s>
It adds a Python device context manager that is absurdly slow and takes ~2.5 us (it should be nanoseconds). That's not a problem for real models, because it'll be called just once, but it is a bit of an inconvenience for microbenchmarking; we should make that context manager more performant (won't fix in this PR).
It can still have bugs for graphs that run on multiple devices, and buffers could be incorrectly shared between devices by memory reuse; if that happens it'll need to be solved separately.

Generated code:
```
def call(args):
    arg0_1, arg1_1 = args
    args.clear()
    with torch.cuda.device(1):
        buf0 = empty_strided((4, ), (1, ), device='cuda', dtype=torch.float32)
        stream1 = get_cuda_stream(1)
        triton_fused_div_0.run(arg0_1, arg1_1, buf0, 4, grid=grid(4), stream=stream1)
        del arg0_1
        del arg1_1
        return (buf0, )
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90934
Approved by: https://github.com/wconstab
2022-12-16 18:03:39 +00:00
chunyuan
2ba5c1d7c4 Inductor cpp wrapper: change inputs args from tuple to vector (#90754)
## Pitch
Change input args type from `std::tuple` to `std::vector` to reduce the compilation time.

## Description
`std::tie()` takes quite a long time during compilation when the number of input args grows.

For example, for a graph from the `PegasusForConditionalGeneration` model with 318 input args, the compilation of `std::tie` for the args takes about 10s. By changing to `std::vector`, the compilation time of the arg assignment is reduced to less than 1s.

### Code before:
```cpp
at::Tensor call_0(std::tuple<at::Tensor&, at::Tensor&> args) {
    at::Tensor arg0_1, arg1_1;
    std::tie(arg0_1, arg1_1) = args;
    ...
    return buf0;
}
```

### Code after:
```cpp
at::Tensor call_0(std::vector<at::Tensor> args) {
    at::Tensor arg0_1, arg1_1;
    arg0_1 = args[0];
    arg1_1 = args[1];
    ...
    return buf0;
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90754
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-12-15 05:07:16 +00:00
chunyuan
fde5646f3d Inductor cpp wrapper: support bmm, mm, addmm extern call (#88667)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88667
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-12-14 18:19:27 +00:00
chunyuan
d35aa2f65a Inductor cpp wrapper: support Reduction (#88561)
For reductions, the code string in the codegen stage and the execution stage differ because of the `\` line continuations.

- The code string obtained from `code.getvalue()` (`code` is an `IndentedBuffer`) in the codegen stage:
  ```
  #pragma omp declare reduction(argmax : struct IndexValue_1 :\
                  omp_out.value = omp_in.value < omp_out.value ? omp_out.value : omp_in.value,\
                  omp_out.index = omp_in.value < omp_out.value ? omp_out.index : omp_in.index)\
                  initializer(omp_priv = {0, -std::numeric_limits<float>::infinity()})
  ```

- The code string loaded during the execution (`\` will be escaped):
  ```
  #pragma omp declare reduction(argmax : struct IndexValue_1 :                omp_out.value = omp_in.value < omp_out.value ? omp_out.value : omp_in.value,                omp_out.index = omp_in.value < omp_out.value ? omp_out.index : omp_in.index)                  initializer(omp_priv = {0, -std::numeric_limits<float>::infinity()})
  ```

Thus we can't get the same hash value for these two pieces of code.
This PR adds a function that escapes the backslash in the codegen stage so the two strings hash identically.
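A minimal sketch of that normalization (illustrative only; the helper name is hypothetical):

```python
def escape_backslashes(code: str) -> str:
    # Escape backslashes in the codegen-stage string so it matches the escaped
    # form seen at load/execution time and both hash to the same value.
    return code.replace("\\", "\\\\")

snippet = "#pragma omp declare reduction(argmax : struct IndexValue_1 :\\"
print(escape_backslashes(snippet))
```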

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88561
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/desertfire
2022-12-14 12:29:50 +00:00
chunyuan
e2e4a80cdb Inductor cpp wrapper: support None as output (#88560)
Map `None` to `at::Tensor()` in the cpp wrapper

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88560
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/desertfire
2022-12-14 02:28:22 +00:00
Bert Maher
d3d85e1c3b Emit torch.cuda.synchronize() after every kernel call in inductor (#90472)
Debugging an illegal memory access is hard; even CUDA_LAUNCH_BLOCKING=1 and C10_CUDA_KERNEL_LAUNCH_CHECK don't necessarily guarantee that you'll get a stack trace pointing to the right kernel. This diff adds a config option to force a CUDA synchronize after every kernel call in inductor, for debugging those tricky cases.
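Conceptually the generated code gains a synchronize after each launch, roughly like this (a sketch; the config flag name here is a placeholder, not necessarily the real option):

```python
import torch

DEBUG_SYNC_AFTER_KERNEL = True  # placeholder for the inductor config option

def launch_and_maybe_sync(kernel, *args, **kwargs):
    # Synchronizing right after the launch makes an illegal memory access
    # surface at the offending kernel rather than at some later call.
    kernel(*args, **kwargs)
    if DEBUG_SYNC_AFTER_KERNEL:
        torch.cuda.synchronize()
```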

Differential Revision: [D41744967](https://our.internmc.facebook.com/intern/diff/D41744967/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90472
Approved by: https://github.com/jansel
2022-12-12 04:35:10 +00:00
Wu, Chunyuan
a6caa9c54b Add a cpp wrapper for Inductor (#88167)
## Description
Implements https://github.com/pytorch/torchdynamo/issues/1556.
This PR adds a cpp wrapper to invoke the generated kernels. The cpp wrapper is turned off by default and can be turned on by setting:
```python
from torch._inductor import config
config.cpp_wrapper = True
```

### Example
The main part of the generated code:
```python
from torch.utils.cpp_extension import load_inline
wrapper = (
'''
#include <dlfcn.h>
#include <assert.h>
    std::tuple<at::Tensor, at::Tensor> call_0(std::tuple<at::Tensor, at::Tensor> args) {
    at::Tensor arg0_1, arg1_1;
    std::tie(arg0_1, arg1_1) = args;
    auto buf0 = at::empty_strided({8, 8}, {8, 1}, at::ScalarType::Float);
    auto buf1 = at::empty_strided({8, 8}, {1, 8}, at::ScalarType::Float);
    auto kernel0_lib = dlopen("/tmp/torchinductor_user/kn/ckn7ubcn2qbkme2vx5r6antnh5sv6d3o3t6qwdfgfoupnxty6pnm.so", RTLD_NOW);
    assert(kernel0_lib != nullptr);
    void (*kernel0)(const float*,const float*,float*,float*);
    *(void **) (&kernel0) = dlsym(kernel0_lib, "kernel");
    kernel0((float*)(arg0_1.data_ptr()), (float*)(arg1_1.data_ptr()), (float*)(buf0.data_ptr()), (float*)(buf1.data_ptr()));
    arg0_1.reset();
    arg1_1.reset();
    return std::make_tuple(buf0, buf1); }''' )

module = load_inline(
    name='inline_extension_c64wpbccpbre3th2k6oxwrjy5bhvxnmkdxkhcfxlsw7xpsg4eabu',
    cpp_sources=[wrapper],
    functions=['call_0'],
    extra_cflags=['-fPIC -Wall -std=c++14 -Wno-unused-variable -march=native -O3 -ffast-math -fno-finite-math-only -fopenmp'],
    extra_ldflags=['-shared  -lgomp'],
    extra_include_paths=['-I/home/user/pytorch/torch/include -I/home/user/pytorch/torch/include/torch/csrc/api/include -I/home/user/pytorch/torch/include/TH -I/home/user/pytorch/torch/include/THC -I/home/user/miniconda3/envs/pytorch/include/python3.7m'])

def _wrap_func(f):
    def g(args):
        return f(args)
    return g
call = _wrap_func(module.call_0)
```

### Next steps
The below items will be addressed in upcoming PRs.
- [x] Support Reduction: #88561
- [x] Support None: #88560
- [ ] Support ExternKernel
   - [x] ATen GEMM-related OPs: #88667
   - [ ] ATen Conv
   - [ ] Conv/GEMM fusion OPs
- [x] Cache the kernel loading part: #89742
- [ ] De-allocate input buffers when possible by leveraging CPython APIs
- [ ] Support Constant

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88167
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/desertfire
2022-11-30 13:40:47 +00:00
Jiong Gong
c75434ed4f [Inductor] Add an option to mark wrapper call in PyTorch profiler (#89674)
This PR adds an option `config.profiler_mark_wrapper_call` (disabled by default) to mark the duration of wrapper call in the PyTorch profiler. This makes it easy to identify the duration and start/end of each wrapper call in the profiler output.
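Presumably the wrapper body gets bracketed with a profiler range, roughly like this (a sketch; the exact label string is my assumption):

```python
import torch
from torch.profiler import record_function

def call(args):
    # With config.profiler_mark_wrapper_call enabled, the whole wrapper call
    # shows up as one named range in the PyTorch profiler output.
    with record_function("inductor_wrapper_call"):
        (x,) = args
        return (x + 1,)
```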

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89674
Approved by: https://github.com/jansel
2022-11-29 00:58:46 +00:00
Michael Lazos
c1553880de Have kernel names include fused ops (#88624)
- Propagates origin fx nodes through inlining during lowering
- Concatenates op names into the kernel name
- Adds a config to cap the number of ops in the kernel name so they don't get too long

Caveats:
- The ordering in the name may not match the order that the ops are executed in the kernel
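A rough sketch of the naming scheme described above (hypothetical helper and cap, not the PR's code):

```python
MAX_OPS_IN_NAME = 4  # placeholder for the config that caps ops in a kernel name

def fused_kernel_name(origin_ops, index):
    # Concatenate the originating op names, truncated to the configured cap.
    return "triton_" + "_".join(origin_ops[:MAX_OPS_IN_NAME]) + f"_{index}"

print(fused_kernel_name(["add", "mul", "relu"], 0))  # triton_add_mul_relu_0
```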

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88624
Approved by: https://github.com/anijain2305, https://github.com/jansel
2022-11-10 21:38:06 +00:00
Jason Ansel
30f6f6903c [inductor] Move size asserts to C++, fix bug (#87028)
Inductor internally models any `size=1` dimension as having `stride=0` to simplify indexing formulas (sympy will remove these terms from the expression).

This caused a bug in our generated stride assert in detectron2_maskrcnn_r_50_fpn, where we asserted the wrong stride for a size==1 dimension.
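A small illustration of why the asserted stride of a size==1 dimension is arbitrary (my example, not from the PR):

```python
import torch

x = torch.randn(4, 1)
y = x.as_strided((4, 1), (1, 0))  # same memory, but the size-1 dim reports stride 0

# A size-1 dimension is never stepped over, so its stride can be anything;
# inductor models it as 0, which is why asserting a specific value was fragile.
print(x.stride(), y.stride())  # (1, 1) (1, 0)
print(torch.equal(x, y))       # True
```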

This fixes that bug, and moves size/stride assert logic to C++ which should be a small perf gain.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87028
Approved by: https://github.com/anijain2305
2022-10-16 20:17:22 +00:00
Jason Ansel
054a2fd6c2 Sync changes from pytorch/torchdynamo (#87013)
This updates to:
6380959be2

Generated with:
https://github.com/pytorch/torchdynamo/blob/main/copy_to_core.sh
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87013
Approved by: https://github.com/voznesenskym
2022-10-15 21:00:57 +00:00
Jason Ansel
8f71e8de7e Sync changes from pytorch/torchdynamo, enable tests (#86950)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86950
Approved by: https://github.com/Chillee
2022-10-14 23:08:58 +00:00
Jason Ansel
c7c09722ad Move TorchDynamo into PyTorch core (#86461)
Context:
https://github.com/pytorch/torchdynamo/issues/1588

This PR moves [TorchDynamo](https://github.com/pytorch/torchdynamo) and TorchInductor into PyTorch core.
- `torchdynamo` becomes `torch._dynamo`
- `torchinductor` becomes `torch._inductor`

This PR was generated by running `copy_to_core.sh` in https://github.com/pytorch/torchdynamo/pull/1538

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86461
Approved by: https://github.com/voznesenskym
2022-10-13 23:18:06 +00:00