Commit Graph

133 Commits

Author SHA1 Message Date
Sherlock Huang
b9dfdc091b [AOTInductor][Reland] Proxy Executor for Extern Fallback kernels (#107279) (#108350)
Summary:

This is a prototype for running extern fallback kernels with a host side proxy executor.

Sample of generated cpp wrapper call:
```
        at::Tensor buf0;  // output buffer
        void* tensor_args_var_0[] = {&arg0_1, &arg0_1, &arg1_1, &arg0_1, &arg1_1, &buf0};
        int64_t int_args_var_1[] = {81, 81, 7, 7, 7, 81};
        proxy_executor->call_function("buf0", int_args_var_1, tensor_args_var_0);
```

- In my current implementation, the proxy executor interprets the raw pointers according to the op's schema.
This assumes that a custom op MUST have a valid schema registered with the Dispatcher. (I would like to validate this assumption.)
- I am using the callBoxed() API of the custom kernels. This is unavoidable, as we want a single call_function API for all possible custom kernels. (A minimal sketch of this boxed-call path appears after this list.)

- These are all the input argument types supported so far:
```
       union Argument {
         # Bool value does not matter
         1: bool asNone;
         2: TensorArgument asTensor;
         3: list<TensorArgument> asTensors;
         5: i64 asInt;
         7: list<i64> asInts;
         8: double asFloat;
         9: list<double> asFloats;
         10: string asString;
         10.5: list<string> asStrings;
         11: SymIntArgument asSymInt;
         12: list<SymIntArgument> asSymInts;
         13: ScalarType asScalarType;
         14: MemoryFormat asMemoryFormat;
         15: Layout asLayout;
         16: Device asDevice;
         17: bool asBool;
         18: list<bool> asBools;
       }
```

- We need a policy for handling unpopulated arguments with default values. Here are the options, and the choice has BC implications:
1. Require the exported fx graph to explicitly populate default values if the user doesn't specify them.
2. Require the cpp wrapper to explicitly populate default values if the fx graph doesn't specify them.
3. Have the proxy executor look up default values from the op schema.
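
As a minimal sketch of the boxed-call path mentioned above (not the actual fbcode proxy executor; the function name and error handling are illustrative), the executor can resolve the custom op from its registered schema and dispatch it boxed:

```cpp
#include <string>
#include <vector>

#include <ATen/core/dispatch/Dispatcher.h>
#include <ATen/core/ivalue.h>

// Resolve a custom op by its registered schema and call it through the boxed
// calling convention, so one entry point can serve arbitrary custom kernels.
void call_custom_op_boxed(const std::string& op_name,
                          const std::string& overload_name,
                          std::vector<c10::IValue>& stack) {
  // Requires the custom op to have a valid schema registered with the Dispatcher.
  c10::OperatorHandle op = c10::Dispatcher::singleton().findSchemaOrThrow(
      op_name.c_str(), overload_name.c_str());
  // callBoxed consumes the inputs on the stack and leaves the outputs in their place.
  op.callBoxed(stack);
}
```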

For fixing T162112344

Test Plan:
frontend:
buck2 run mode/dev-sand mode/inplace -c fbcode.enable_gpu_sections=True sigmoid/frontend:export_main

test:
 buck2 run mode/dev-sand //deeplearning/aot_inductor/test:test_custom_ops

backend:
buck2 run mode/dev-nosan //deeplearning/aot_inductor/fb:main

buck2 test 'fbcode//mode/opt' fbcode//caffe2/torch/fb/model_transform/experimental/benchmark/test:test_aot_inductor_benchmark -- --exact 'caffe2/torch/fb/model_transform/experimental/benchmark/test:test_aot_inductor_benchmark - test_aot_inductor_benchmark_cmf30x (caffe2.torch.fb.model_transform.experimental.benchmark.test.test_aot_inductor_benchmark.AOTInductorBenchmark)'

Reviewed By: suo

Differential Revision: D48747417

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108350
Approved by: https://github.com/izaitsevfb
2023-09-02 17:14:10 +00:00
Bin Bao
06d74e6b24 Revert "[AOTInductor] Include constants in AOTInductor .so file. (#10… (#108349)
This reverts commit c3239442a3 due to internal test failures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108349
Approved by: https://github.com/aakhundov, https://github.com/zhxchen17
2023-08-31 16:26:02 +00:00
Shunting Zhang
7cb4bf675b [inductor] no-side-effect codegen (#107617)
Inductor kernel codegen previously had the following side effects:
- in `Kernel.__exit__`, we add locally used buffers to graph.removed_buffers
- during codegen, we do memory allocation/free.

These side effects make it hard to do multiple versions of codegen for the same kernel. This PR refactors the code so that kernel codegen does not change graph-level state. After codegening a kernel, the graph-level state is unchanged, so we can go on to codegen another version of the kernel if we want.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107617
Approved by: https://github.com/jansel
2023-08-31 00:25:17 +00:00
Jason Ansel
2c87ef3dbf [inductor] Fix inputs with existing offsets (#108168)
This cherrypicks the reinterpret_tensor change from #102625 in order to fix a subtle correctness bug when the graph inputs already have a storage_offset set.

The view change also fixes some issues with quantized models in torchbench.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108168
Approved by: https://github.com/desertfire
2023-08-29 23:47:03 +00:00
Mu-Chu Lee
c3239442a3 [AOTInductor] Include constants in AOTInductor .so file. (#107718)
Summary:
Include the constants in the AOTInductor .so file.
We do not modify existing API signatures, but instead create the necessary format with the weights lifted out.

Test Plan:
test/inductor/test_aot_inductor.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107718
Approved by: https://github.com/angelayi, https://github.com/eellison
2023-08-29 22:37:30 +00:00
Yang Chen
2179ebde1f [inductor] correctly handle resize for AOTInductor wrapper calls (#107848)
When generating a wrapper call, we may have an implicit resize applied to
the kernel's output. For example, for addmm(3d_tensor, 2d_tensor),
the output buffer is resized to a 2d tensor. This triggers a warning from
ATen's resize_output op:

    "UserWarning: An output with one or more elements was resized since it had...
    This behavior is deprecated, and in a future PyTorch release outputs will
    not be resized unless they have zero elements..."

More importantly, the output shape is not what we would expect, i.e.
a 2d tensor vs. a 3d tensor.

This PR fixes the issue by injecting resize_(0) before calling the relevant
kernel and resize_(expected_shape) after the kernel call.
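
A sketch of the injected pattern is below (buffer names and the concrete op are made up for illustration; the real wrapper calls whatever fallback kernel produced the buffer):

```cpp
#include <ATen/ATen.h>

void wrapper_fragment() {
  at::Tensor mat1 = at::randn({4, 8});
  at::Tensor mat2 = at::randn({8, 4});

  // Buffer pre-allocated by the wrapper with the shape Inductor expects (here 3d).
  at::Tensor buf0 = at::empty({1, 4, 4}, mat1.options());

  buf0.resize_({0});             // injected before the call: the op's implicit resize stays silent
  at::mm_out(buf0, mat1, mat2);  // the fallback kernel resizes buf0 to its own output shape ({4, 4})
  buf0.resize_({1, 4, 4});       // injected after the call: restore the shape the wrapper expects
}
```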

We also fixed a minor typo in the PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107848
Approved by: https://github.com/desertfire, https://github.com/jansel
2023-08-27 09:56:16 +00:00
Wei Wei
497571df58 [aot_inductor] fix hardcoded output dtype (#107825)
Summary: as titled

Reviewed By: chenyang78

Differential Revision: D47779519

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107825
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2023-08-24 22:16:13 +00:00
Adnan Akhundov
1491bae277 [reland][inductor] Adjust dynamic SMEM limit when above default in AOT (#107814)
Summary:

This relands #107601, which was reverted due to the new test failing in the internal CI. Here we skip the new test (as well as the existing tests in `test_aot_inductor.py`, as those are also failing in the internal CI).

Test Plan:

```
$ python test/inductor/test_aot_inductor.py
...
----------------------------------------------------------------------
Ran 5 tests in 87.309s

OK
```

Differential Revision: [D48623171](https://our.internmc.facebook.com/intern/diff/D48623171)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107814
Approved by: https://github.com/eellison
2023-08-24 07:59:51 +00:00
PyTorch MergeBot
42897e8127 Revert "[inductor] Adjust dynamic SMEM limit when above default in AOT (#107601)"
This reverts commit 3920ce2f6e.

Reverted https://github.com/pytorch/pytorch/pull/107601 on behalf of https://github.com/ZainRizvi due to Sorry, but the test added in this PR breaks when run internally. See D48549503 for more details ([comment](https://github.com/pytorch/pytorch/pull/107601#issuecomment-1689049609))
2023-08-22 23:26:50 +00:00
Adnan Akhundov
3920ce2f6e [inductor] Adjust dynamic SMEM limit when above default in AOT (#107601)
Summary:

When AOT Inductor runs a Triton matmul kernel (generated from the Triton mm template) on large inputs of a particular shape, a `RuntimeError: CUDA driver error: 1` may happen. E.g., when `x @ y` is compiled with AOT Inductor and run on the input shapes `[10285, 96]` and `[96, 1]`. Digging deeper into the generated AOT Inductor wrapper code, we see this line:

```
launchKernel(triton_unk_fused_mm_0, 81, 1, 1, 4, 55296, kernel_args_var_0, stream);
```

`55296` is the required amount (in bytes) of dynamic shared memory. This is larger than the default dynamic shared memory on A100: `49152` bytes. In these cases, `cudaFuncSetAttribute` must be called explicitly to set the `cudaFuncAttributeMaxDynamicSharedMemorySize` attribute of the kernel before launching it. Or, because the AOT Inductor wrapper relies on the CUDA Driver API, the equivalent [`cuFuncSetAttribute`](https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__EXEC.html#group__CUDA__EXEC_1g0e37dce0173bc883aa1e5b14dd747f26) function can be called to set the `CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES` attribute.

This PR adds the above call in the AOT Inductor codegen for every case when the required amount of dynamic SMEM is > 0. The call is done *within* the `launchKernel` function, meaning that it will happen only once per kernel and not affect the subsequent AOT Inductor-compiled model performance (after the first run).

P.S. One could, in principle, call `cuFuncSetAttribute` only when the required amount of dynamic SMEM is above the default limit, but that would require detecting the default limit, which differs across devices. Since `cuFuncSetAttribute` is relatively lightweight and is performed only once per kernel, for simplicity the suggestion is to call the function in every non-zero dynamic SMEM case.
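
A simplified sketch of the idea inside the generated `launchKernel` helper (argument marshalling and the driver-check macro are elided; this is illustrative, not the exact generated code):

```cpp
#include <cuda.h>
#include <cuda_runtime.h>

static inline void launchKernel(
    CUfunction func, int gridX, int gridY, int gridZ,
    int numWarps, int sharedMemBytes, void** args, cudaStream_t stream) {
  if (sharedMemBytes > 0) {
    // Raise the kernel's dynamic shared-memory limit; without this, a launch that
    // needs more than the device default (49152 bytes on A100) fails with driver error 1.
    CUresult res = cuFuncSetAttribute(
        func, CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES, sharedMemBytes);
    (void)res;  // error handling elided for brevity
  }
  cuLaunchKernel(func, gridX, gridY, gridZ, 32 * numWarps, 1, 1,
                 sharedMemBytes, stream, args, nullptr);
}
```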

Test Plan:

```
$ python test/inductor/test_aot_inductor.py

...

----------------------------------------------------------------------
Ran 5 tests in 100.177s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107601
Approved by: https://github.com/jansel
2023-08-21 21:06:09 +00:00
chunyuan
c21e9de25d Inductor cpp wrapper: fix optional tensor input (#106847)
Fix cpp wrapper failure on `clip` in Torchbench:
```
RuntimeError: tensor does not have a device
```

An `optional<at::Tensor>` variable whose value equals `at::Tensor()` is considered to _contain a value_: when it's converted to `bool`, it returns `true`. In contrast, `None` in Python converts to `false`.
The fix is to make it an optional variable that _does not contain a value_.
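
A minimal illustration of the pitfall (an assumed standalone example, not the exact generated wrapper code):

```cpp
#include <ATen/ATen.h>

void optional_tensor_example() {
  // Wrapping a default-constructed tensor: has_value() is true, but the tensor is undefined.
  c10::optional<at::Tensor> wrong = at::Tensor();
  // Matching Python's None: the optional contains no value at all.
  c10::optional<at::Tensor> right = c10::nullopt;

  TORCH_CHECK(wrong.has_value() && !wrong->defined());
  TORCH_CHECK(!right.has_value());
}
```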

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106847
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-08-18 13:20:19 +00:00
Wang, Eikan
9921b48558 Extend Inductor to support the third-party backend (#106874)
## Summary

This is re-land PR for https://github.com/pytorch/pytorch/pull/100706 to address the compilation latency performance regression.

## Root Cause

Regarding the C++/OpenMP backend, `codecache.pick_vec_isa()`, which checks the vectorization ISA, is a time-consuming, one-shot operation. It makes importing the `codegen.cpp` package slower because the package's `LoopLevel` is decorated with `@dataclasses.dataclass`, and the decorator invokes `codecache.pick_vec_isa()` to initialize the `simd_nelements` field of `LoopLevel`.
c14cf312c9/torch/_inductor/codegen/cpp.py (L2883C53-L2883C53)

The Triton backend does not need to touch any of this, but we'd prefer uniform code. Therefore, the new design registers `CppScheduling` for CPU and `TritonScheduling` for Triton regardless of whether the current backend is Triton, which brings additional overhead to the Triton backend.

```python
def init_backend_registration(self):
    if get_scheduling_for_device("cpu") is None:
        from .codegen.cpp import CppScheduling

        register_backend_for_device("cpu", CppScheduling, WrapperCodeGen)

    if get_scheduling_for_device("cuda") is None:
        from .codegen.triton import TritonScheduling

        register_backend_for_device("cuda", TritonScheduling, WrapperCodeGen)
```

## Solution

To resolve the compilation latency regression for the Triton backend, we changed `LoopLevel` slightly ([new code changes](https://github.com/pytorch/pytorch/pull/106874/files#diff-5ab7b0235e2076a5fc6629ba0b109208940f5b94f5c13babc3e0f87cf4fcec82R2893-R2904)) by moving the `simd_nelements` initialization into `__post_init__`, which brings the compilation performance back.

## Compilation Latency Performance Result
We ran a single model benchmark and reproduced the compilation regression:

- Run `python benchmarks/dynamo/torchbench.py -dcuda --training --performance --inductor --only hf_Bart`

- W/ PR #100706, the compilation latency is about **57~58** seconds
```
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cuda,hf_Bart,4,1.556712,109.676554,57.055242,0.936330,5.760698,6.152422,642,1,8,7
cuda,hf_Bart,4,1.646658,109.621747,57.909817,0.936330,5.760698,6.152422,642,1,8,7
```

- W/O PR #100706, the compilation latency is about **46~47** seconds
```
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cuda,hf_Bart,4,1.599065,108.702480,47.490346,0.936330,5.760698,6.152422,642,1,8,7
cuda,hf_Bart,4,1.588419,108.431411,46.983041,0.936330,5.760698,6.152422,642,1,8,7
```

This PR fixed the compilation performance regression.

- W/ this PR #106874, the compilation latency is about **47~48** seconds
```
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cuda,hf_Bart,4,1.586261,108.149467,47.481058,0.936330,5.760698,6.152422,642,1,8,7
cuda,hf_Bart,4,1.758915,108.613899,47.925633,0.936330,5.760698,6.152422,642,1,8,7
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106874
Approved by: https://github.com/jansel
2023-08-16 04:11:36 +00:00
Yanbo Liang
1819fe1324 Revert "Extend Inductor to support the third-party backend (#100706)" (#106652)
This reverts commit 05bd24bb35.

It caused a compilation time regression on torchbench, huggingface, and dynamic models.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106652
Approved by: https://github.com/davidberard98, https://github.com/voznesenskym
2023-08-05 06:41:08 +00:00
angelayi
b2d3a2f433 [inductor] Remove ReinterpretView copy_ for AOT Inductor outputs (#106564)
Running the benchmark on HF models results in a 71% pass rate now: P802905571
Updated [dashboard](https://hud.pytorch.org/benchmark/compilers?startTime=Fri%2C%2028%20Jul%202023%2005%3A02%3A20%20GMT&stopTime=Fri%2C%2004%20Aug%202023%2005%3A02%3A20%20GMT&granularity=hour&suite=torchbench&mode=inference&dtype=bfloat16&lBranch=angelayi/bench&lCommit=e35a655e59b2038c0395f972a1f567f862093d9c&rBranch=main&rCommit=3e5a52cedd2d586fc6cb40a73a098252b9edc2a1)

Originally, a lot of the HF export-aot-inductor tests were failing with the error message:
```
RuntimeError: unsupported operation: some elements of the input tensor and the written-to tensor refer to a single memory location. Please clone() the tensor before performing the operation.
```

I looked at the result of one of the models, AlbertForMaskedLM, and the error is due to an additional [`copy_`](https://www.internalfb.com/phabricator/paste/view/P802043305?lines=1460%2C1426%2C1438%2C1451%2C1428) being inserted at the end. Looking at the [exported graph](https://www.internalfb.com/phabricator/paste/view/P802908243?lines=1124), `buf237` in the cpp program corresponds to the `view_269` node. During inductor lowering, this `view_269` node will result in an `ir.ReinterpretView` node, and when generating code for the outputs, this [line](https://fburl.com/code/epola0di) will add an additional `copy_`.

I'm unsure if removing this case will result in other errors, but it seems to raise the HF model benchmark pass rate :)
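
For context, the quoted error is ATen's generic partial-overlap check, which fires whenever the source and destination of a `copy_` partially share storage. A minimal, model-independent reproduction (assumed, not taken from the benchmark) looks like:

```cpp
#include <ATen/ATen.h>

void overlapping_copy_example() {
  at::Tensor base = at::arange(10, at::kFloat);
  at::Tensor dst = base.slice(/*dim=*/0, /*start=*/0, /*end=*/9);
  at::Tensor src = base.slice(/*dim=*/0, /*start=*/1, /*end=*/10);
  // dst and src partially overlap in memory, so this raises:
  // "unsupported operation: some elements of the input tensor and the written-to
  //  tensor refer to a single memory location. Please clone() the tensor ..."
  dst.copy_(src);
}
```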

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106564
Approved by: https://github.com/jansel
2023-08-04 07:51:29 +00:00
Wang, Eikan
05bd24bb35 Extend Inductor to support the third-party backend (#100706)
This PR extends Inductor to support third-party backends that focus only on code generation, just like the existing C++/OpenMP and Triton backends.

Currently, the code generated by Inductor contains two major parts: the kernel and the Python wrapper that glues the kernels together. Therefore, a third-party backend needs to customize these two parts to generate its specific code.

- Python wrapper code generation

  Inductor provides a `WrapperCodeGen` class to generate the Python wrapper code that glues the kernels. Therefore, it is straightforward for a third-party backend to generate backend-specific Python wrapper code: it just needs to inherit from the `WrapperCodeGen` class and override the relevant member functions.

- Kernel code generation

  Kernel code generation is driven by a `Scheduling` class. Hence, a third-party backend needs to provide a custom `Scheduling` for its specific kernel code generation. Currently, `CppScheduling` and `TritonScheduling` serve the C++/OpenMP and Triton backends, respectively, but there is no common `Scheduling` base class. Based on how scheduling is invoked, this PR abstracts a common `Scheduling` class containing the following member functions.

  -   [group_fn](71c4becda7/torch/_inductor/scheduler.py (LL649C64-L649C64))
  - [flush](71c4becda7/torch/_inductor/scheduler.py (L1150))
  - [can_fuse_vertical](71c4becda7/torch/_inductor/scheduler.py (L1006))
  - [can_fuse_horizontal](71c4becda7/torch/_inductor/scheduler.py (LL1008C45-L1008C64))
  -   [codegen_template](71c4becda7/torch/_inductor/scheduler.py (L1234)) _This function is only available for Triton. If the third-party backend behaves as a subclass of `TritonScheduling`, it can override or reuse it._
  - [codegen_nodes](71c4becda7/torch/_inductor/scheduler.py (L1234))
  - [codegen_sync](71c4becda7/torch/_inductor/scheduler.py (LL1251C1-L1251C1)). _This function is only used for Triton debugging purposes, but it might also be useful for other computation devices. Therefore, we'd prefer to keep this function._

  The third-party backend needs to inherit from the `Scheduling` class and implement these functions.

Regarding some other classes like `CppKernel` and `TritonKernel` for code generation, they are used by, or are part of, the logic of either `Scheduling` or `WrapperCodeGen`. Hence, this PR does not define an interface for them and leaves the flexibility to the third-party backend, which can decide to implement these classes from scratch or reuse them by inheriting and overriding them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100706
Approved by: https://github.com/jansel
2023-08-02 05:13:51 +00:00
Nikita Shulga
92cac6bf32 InductorCpp: Fix "call to constructor is ambiguous" error (#106418)
Not sure why `{{}}` is better than just calling a default constructor, but removing it fixes the following error:
```
% python test_cpp_wrapper.py -v -k test_profiler_mark_wrapper_call_cpu_cpp_wrapper
....
clang++ -MMD -MF main.o.d -DTORCH_EXTENSION_NAME=inline_extension_cwnqbbq5lr6hsaktauqhm5hulaxgvvwxphzkz3docrqablnmbd4v -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_clang\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1002\" -I/var/lib/jenkins/workspace/test/inductor/-I/var/lib/jenkins/workspace/torch/include -I/var/lib/jenkins/workspace/torch/include/torch/csrc/api/include -I/var/lib/jenkins/workspace/torch/include/TH -I/var/lib/jenkins/workspace/torch/include/THC -I/opt/conda/envs/py_3.9/include/python3.9 -isystem /var/lib/jenkins/workspace/torch/include -isystem /var/lib/jenkins/workspace/torch/include/torch/csrc/api/include -isystem /var/lib/jenkins/workspace/torch/include/TH -isystem /var/lib/jenkins/workspace/torch/include/THC -isystem /opt/conda/envs/py_3.9/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=1 -fPIC -std=c++17 -std=c++17 -Wno-unused-variable -O3 -ffast-math -fno-finite-math-only -march=native -fopenmp -Wall -DCPU_CAPABILITY_AVX512 -D C10_USING_CUSTOM_GENERATED_MACROS -c /tmp/torchinductor_jenkins/py39_cpu/inline_extension_cwnqbbq5lr6hsaktauqhm5hulaxgvvwxphzkz3docrqablnmbd4v/main.cpp -o main.o
/tmp/torchinductor_jenkins/py39_cpu/inline_extension_cwnqbbq5lr6hsaktauqhm5hulaxgvvwxphzkz3docrqablnmbd4v/main.cpp:41:50: error: call to constructor of 'c10::ArrayRef<c10::IValue>' is ambiguous
        RECORD_FUNCTION("inductor_wrapper_call", c10::ArrayRef<c10::IValue>({{}}));
                                                 ^                          ~~~~
/var/lib/jenkins/workspace/torch/include/ATen/record_function.h:580:38: note: expanded from macro 'RECORD_FUNCTION'
      at::RecordScope::FUNCTION, fn, inputs, ##__VA_ARGS__)
                                     ^~~~~~
/var/lib/jenkins/workspace/torch/include/ATen/record_function.h:561:20: note: expanded from macro 'RECORD_FUNCTION_WITH_SCOPE'
        guard, fn, inputs, ##__VA_ARGS__);                 \
                   ^~~~~~
/var/lib/jenkins/workspace/torch/include/c10/util/ArrayRef.h:40:7: note: candidate constructor (the implicit move constructor)
class ArrayRef final {
      ^
/var/lib/jenkins/workspace/torch/include/c10/util/ArrayRef.h:40:7: note: candidate constructor (the implicit copy constructor)
/var/lib/jenkins/workspace/torch/include/c10/util/ArrayRef.h:71:13: note: candidate constructor
  constexpr ArrayRef(const T& OneElt) : Data(&OneElt), Length(1) {}
            ^
/var/lib/jenkins/workspace/torch/include/c10/util/ArrayRef.h:126:28: note: candidate constructor
  /* implicit */ constexpr ArrayRef(const std::initializer_list<T>& Vec)
                           ^
1 error generated.
```
(The error above appears when clang-12 is used as the host compiler.)
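
A minimal sketch of the fixed call, assuming the default-constructed (empty) `ArrayRef` is the intended replacement for the ambiguous `{{}}` initializer:

```cpp
#include <ATen/core/ivalue.h>
#include <ATen/record_function.h>
#include <c10/util/ArrayRef.h>

void inductor_wrapper_call_fragment() {
  // An explicitly default-constructed ArrayRef is unambiguous: neither the
  // initializer-list nor the single-element constructor can be selected.
  RECORD_FUNCTION("inductor_wrapper_call", c10::ArrayRef<c10::IValue>());
  // ... wrapper body ...
}
```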

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106418
Approved by: https://github.com/desertfire
2023-08-02 04:02:15 +00:00
Sam Larsen
0cf918947d [inductor] Support using the 'stream' param in AOT mode (#105589)
Summary:
When in AOT mode, make use of the existing stream param:
- Pass through and use the stream param in the launchKernel helper function.
- In non-AOT mode, assign the stream param in the caller and pass to launchKernel
- Use a CUDAStreamGuard so all fallback ops execute on the stream
- CUDAStreamGuard subsumes CUDAGuard in AOT mode since it sets both stream and device

Test Plan:
- Ran cpp_wrapper tests: pytest test/inductor/test_cpp_wrapper.py
- Manually inspected cpp output from the alexnet benchmark:

  a) In AOT mode:
```
   static inline void launchKernel(
           CUfunction func,
           int gridX,
           int gridY,
           int gridZ,
           int numWraps,
           int sharedMemBytes,
           cudaStream_t stream) {
       AT_CUDA_DRIVER_CHECK_OVERRIDE(cuLaunchKernel(
           func, gridX, gridY, gridZ, 32*numWraps, 1, 1, sharedMemBytes, stream, args, nullptr));

   ...
   at::cuda::CUDAStreamGuard stream_guard(at::cuda::getStreamFromExternal(stream, 0));
   ...
   launchKernel(triton_poi_fused_convolution_0, 1, 784, 1, 4, 4352, kernel_args_var_0, stream);
   ...
```

   b) Regular cpp wrapper:
```
   ...
   at::cuda::CUDAGuard device_guard(0);
   cudaStream_t stream0 = at::cuda::getCurrentCUDAStream(0);
   ...
   launchKernel(triton_poi_fused_convolution_0, 1, 784, 1, 4, 4352, kernel_args_var_0, stream0);
   ...
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105589
Approved by: https://github.com/desertfire
2023-07-28 20:26:27 +00:00
Bin Bao
b0816e4714 [inductor] Fix AOTInductor output issues (#105773)
Summary: This is a follow-up on https://github.com/pytorch/pytorch/pull/105496. There were several issues with the previous fix:
1) It explicitly copies every output at the end of the main function;
2) When an output is a ReinterpretView, no as_strided was generated for it;
3) There can be duplicated buffer declarations.

This PR fixes these issues by making sure can_reuse behaves consistently between the two AOTInductor passes, and thus always generates the same set of kernels. It also adds handling of ReinterpretView.

Differential Revision: [D47692214](https://our.internmc.facebook.com/intern/diff/D47692214)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105773
Approved by: https://github.com/jansel
2023-07-24 01:58:49 +00:00
Aaron Gokaslan
6d43c89f37 [BE]: Update Ruff to 0.0.280 (#105724)
Removes unused loop values in Python dictionary iteration. Automated fix from Ruff master.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105724
Approved by: https://github.com/ezyang, https://github.com/janeyx99
2023-07-22 23:03:34 +00:00
Justin Chu
79c5e33349 [BE] Enable ruff's UP rules and autoformat nn/ mps/ and torch/ (#105436)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105436
Approved by: https://github.com/malfet, https://github.com/albanD
2023-07-21 07:38:46 +00:00
Shunting Zhang
1e87778552 [inductor] refactor wrapper benchmark code out of utils.py (#105584)
Refactor the wrapper benchmark code out of utils.py since:
1. utils.py is getting too large;
2. I plan to add more code to the wrapper benchmark for multi-kernel.

This is split out from https://github.com/pytorch/pytorch/pull/103469

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105584
Approved by: https://github.com/jansel
2023-07-21 00:01:35 +00:00
Bin Bao
71067631c2 [inductor] Fix an AOTInductor missing output issue (#105496)
Summary: When an output buffer is reused instead of directly referring to the passed-in output, we need to explicitly make a copy.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105496
Approved by: https://github.com/jansel
2023-07-20 08:27:31 +00:00
Bin Bao
b10de43c0a Add aot_inductor as a test backend for benchmarking (#105221)
Summary:
Original PR at https://github.com/pytorch/pytorch/pull/104977. Landing from fbcode instead.

Add an aot_inductor backend (Export+AOTInductor) in the benchmarking harness. Note it is not a dynamo backend.

Moved files from torch/_inductor/aot_inductor_include to torch/csrc/inductor as a more standard way of exposing headers.
Created a caching function in benchmarks/dynamo/common.py for compiling, loading, and caching the .so file, as a proxy for a pure C++ deployment, but easier for benchmarking.

Differential Revision: D47452591

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105221
Approved by: https://github.com/jansel
2023-07-18 13:16:36 +00:00
chunyuan
1fdc88f877 Inductor cpp wrapper: fix codegen of FallbackKernel with kwargs (#104575)
Fix cpp wrapper failure on TorchBench model `hf_Reformer` with `randn`:
```
random_rotations = torch.randn(rotations_shape, device=vectors.device, dtype=vectors.dtype)
```

For the cpp wrapper, when `kwargs` is not empty for an `OpOverloadPacket` kernel, we need to know the exact overload schema to handle the `kwargs` properly when calling the cpp kernel. This includes finding the correct order of the kwargs and getting the default values for optional args that are not provided when calling the function (`layout` in the above case).

The current support in this PR is conservative and we'll extend the functionality in subsequent PRs.
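
For the `randn` call above, a hedged sketch of the kind of C++ call the wrapper needs to emit (variable names are illustrative): the Python kwargs are placed in the overload schema's positional order, and unspecified optional args fall back to their schema defaults via `c10::nullopt`.

```cpp
#include <ATen/ATen.h>

at::Tensor make_random_rotations(const at::Tensor& vectors,
                                 at::IntArrayRef rotations_shape) {
  return at::randn(
      rotations_shape,
      vectors.scalar_type(),  // dtype=vectors.dtype
      c10::nullopt,           // layout: not provided, use the schema default
      vectors.device(),       // device=vectors.device
      c10::nullopt);          // pin_memory: not provided
}
```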

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104575
Approved by: https://github.com/jgong5, https://github.com/desertfire
2023-07-15 03:33:44 +00:00
Bin Bao
528ab477ce [reland][inductor] Register an op for mm_plus_mm (#105153)
Summary: Reland https://github.com/pytorch/pytorch/pull/104835 after fixing internal build issues

Test Plan: CI

Differential Revision: D47442849

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105153
Approved by: https://github.com/clee2000
2023-07-14 14:35:29 +00:00
Kefei Lu
4328138c1e AOT inductor: error: ‘c10::Dispatcher’ has not been declared (#104742)
Differential Revision: D47275262

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104742
Approved by: https://github.com/desertfire
2023-07-14 01:47:52 +00:00
Catherine Lee
c36dca7bc5 Revert "[inductor] Register an op for mm_plus_mm (#104835)" (#105150)
This reverts commit 9c46a1620c.

Actual revert referenced in https://github.com/pytorch/pytorch/pull/105149

#104835 is causing internal builds to fail

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105150
Approved by: https://github.com/atalman
2023-07-13 17:13:45 +00:00
Bin Bao
9c46a1620c [inductor] Register an op for mm_plus_mm (#104835)
Summary: Currently the aten version of mm_plus_mm has no cpp
implementation, and thus cpp_wrapper cannot generate the correct cpp
function call for it.

Differential Revision: [D47372057](https://our.internmc.facebook.com/intern/diff/D47372057)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104835
Approved by: https://github.com/jansel, https://github.com/SherlockNoMad
2023-07-12 02:34:02 +00:00
chunyuan
ba167e6578 Inductor cpp wrapper: fix codegen of ScatterFallback (#104524)
Fix cpp wrapper failure on the TorchBench models `basic_gnn_edgecnn` and `hf_Reformer`, which contain the scatter op.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104524
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-07-11 08:17:56 +00:00
XiaobingSuper
54f33265db inductor(re-land): support cpu fusion path for bfloat16 amp (#104399)
This PR is about fusion on the AMP path.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104399
Approved by: https://github.com/jgong5, https://github.com/eellison
2023-07-10 00:58:04 +00:00
Bin Bao
a860b965f1 [inductor] Relax custom op schema checking for cpp_wrapper (#104349)
Summary: Remove fallback ops whitelist because FallbackKernel.set_cpp_kernel is doing sufficient checking

Differential Revision: [D47269612](https://our.internmc.facebook.com/intern/diff/D47269612)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104349
Approved by: https://github.com/jgong5, https://github.com/chunyuan-w, https://github.com/jansel
2023-07-09 17:31:31 +00:00
XiaobingSuper
8ce3a18b6a inductor: reduce compile time by reducing repr calls of quantize or Opaque tensor (#104696)
For quantized or opaque tensors that are constant values, calls to the tensor's `__repr__` incur a memory copy (https://github.com/pytorch/pytorch/blob/main/torch/_tensor_str.py#L550):
db1ac4e29b/torch/_inductor/codegen/wrapper.py (L289-L292)

For CPP codegen, `WrapperCodeGen` is instantiated many times: https://github.com/pytorch/pytorch/blob/main/torch/_inductor/codegen/cpp.py#L2023, which consumes a lot of time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104696
Approved by: https://github.com/jgong5, https://github.com/desertfire
2023-07-07 01:12:34 +00:00
Yang Chen
d2281e38ae Adds the initial support for AOTInductor model and interface (#104202)
This PR combines the C++ code for the AOTInductor's model and interface with Bin Bao's changes to AOTInductor codegen.

It adds a number of AOTInductor C interfaces that can be used by an inference runtime. Under the hood of the interfaces, the model code generated by the AOTInductor's codegen is wrapped into a class, AOTInductorModel, which manages tensors and runs the model inference.

On top of AOTInductorModel, we provide one more abstract layer, AOTInductorModelContainer, which allows the user to have multiple inference runs concurrently for the same model.

This PR also adjusts the compilation options for AOT codegen, particularly some fbcode-related changes such as libs to be linked and header-file search paths.

Note that this is the very first version of the AOTInductor model and interface, so many features (e.g. dynamic shapes) are incomplete. We will support those missing features in future PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104202
Approved by: https://github.com/desertfire
2023-06-27 00:37:26 +00:00
Jason Ansel
8c54cd434f [inductor] Fix allow_buffer_reuse=False (#103630)
Fixes #103461

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103630
Approved by: https://github.com/anijain2305
2023-06-15 22:50:01 +00:00
chunyuan
17217d367f Inductor cpp wrapper: support Constant in input (#103496)
## Description
Fix cpp wrapper for models which have constants in the graph inputs.

The Python wrapper gets the value directly inside the wrapper call, as a global variable passed in when calling:
4081e924a8/torch/_inductor/codecache.py (L757)
The constant values have been saved in `mod.__dict__` in
4081e924a8/torch/_inductor/graph.py (L874-L875)
For the cpp wrapper, we need to append the constants to the input args, so that these Python values are passed to the `inductor_entry_cpp` function explicitly.

### Example
Example of output code for dlrm in TorchBench with this fix:
```py
module = CppWrapperCodeCache.load(cpp_wrapper_src, 'inductor_entry_cpp', 'cfkc6c36t7cggi6mnokrdm5jhesnunjg5xysv3o3x3vaqmzmpe6r', False)

def _wrap_func(f):
    def g(args):
        args_tensor = [arg if isinstance(arg, torch.Tensor) else torch.tensor(arg) for arg in args]
        constants_tensor = [constant0, constant1]
        args_tensor.extend(constants_tensor)

        return f(args_tensor)
    return g
call = _wrap_func(module.inductor_entry_cpp)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103496
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/desertfire
2023-06-15 05:01:25 +00:00
Animesh Jain
58d2c66a70 [activation checkpointing] Higher order functional rng op wrappers (#102934)
Introduces two higher order operators
* run_and_save_rng_state - Saves the current rng state and then runs the op.
* run_with_rng_state - Runs the op with the rng state supplied as an input

Ideally, we would like to use torch.compile for these operators. But currently the plan is to introduce these operators at the partitioner level, obviating the need to support them fully through the torch.compile stack. To ensure that we have good enough debugging with minifiers, we have ensured that they work with make_fx. In the future, we can move on to torch.compile.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102934
Approved by: https://github.com/jansel, https://github.com/zou3519
2023-06-12 22:54:17 +00:00
Shunting Zhang
daf75c0759 [AOTAutograd] compare with stride hints (#103342)
We previously compared a FakeTensor's strides with the real tensor's strides. This causes dynamic dimensions of the FakeTensor to be specialized to static ints, which may cause a graph specialized for one shape to be used for another shape, which is wrong.

Use stride hints for the comparison instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103342
Approved by: https://github.com/malfet
2023-06-10 06:51:54 +00:00
chunyuan
d61cd03b97 Inductor cpp wrapper: support ConvTranspose and fix Convolution ir (#103308)
The changes in this PR include:
- Support ConvTranspose in cpp wrapper
- Fix cpp wrapper support for aten convolution when bias is `not None`: bias is in `args` instead of `kwargs` when it is `not None`. The change is covered by the ConvTranspose dynamic shapes UT, since we fall back to aten convolution in dynamic shape cases.
- Fix cpp wrapper support for `inf`. This is a UT added in https://github.com/pytorch/pytorch/issues/101865. The cpp wrapper UT is covered in `test_conv2d_unary` of `test_cpp_wrapper.py`. It's in the `slowTest` category and seems not to have been caught by the CI of that PR.

I will submit another PR to remove the hard-coded schema in these `ExternKernel`s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103308
Approved by: https://github.com/jgong5, https://github.com/desertfire
2023-06-10 03:53:05 +00:00
David Berard
cde4657284 [inductor] Support complex fallback for convert_element_type, _fft_c2c, view_as_real to support GoogleFnet with cpp wrapper (#103183)
Fixes #102752

These 3 fallback kernels appear in GoogleFnet because they take complex arguments - i.e., usually they aren't fallback kernels. To support this model, we added support for these 3 ops.

Details:
1. Add these 3 ops to the allowlist. I assume that we eventually want to support all fallback kernels, but for now we just add these 3 ops to the allowlist.
2. Support complex64 in cpp codegen
3. Support List[] arguments and ScalarType arguments in cpp codegen
4. Allow alias_info in schema arguments. In the original PR supporting fallback kernels for cpp wrapper, ops with schemas with non-null alias_info for any of the arguments were disallowed; but I don't think there's any reason we need to disallow these in cpp wrapper code.

Caveats:
* This has not added support for complex32 or complex128
* It only works with static shapes, not dynamic shapes. It seems like the dynamic shapes issue is unrelated to cpp wrapper, since it fails in the test_torchinductor_dynamic_shapes.py test. I checked these `test_fft_.*` tests, which I added in this PR, and verified that they were broken with dynamic shapes before any of the code changes from this PR.

**Test**:

```
benchmarks/dynamo/huggingface.py --inductor --amp --accuracy --inference --device cuda   --cpp-wrapper --only GoogleFnet
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103183
Approved by: https://github.com/desertfire, https://github.com/jgong5, https://github.com/chunyuan-w
2023-06-09 21:12:41 +00:00
Bin Bao
fbbde8df69 [inductor] fix a numel expr codegen issue (#103005)
Summary: Correctly use pexpr or cexpr for generating symbolic expressions
during wrapper codegen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103005
Approved by: https://github.com/jansel
2023-06-06 14:08:05 +00:00
Bin Bao
44fdfd3222 [inductor] Support select_algorithm with cpp_wrapper (#103003)
Summary: This is one step towards getting cpp_wrapper to work with max_autotune.
Switch to using a unique kernel name to cache the generated cubin file.

This is a copy of https://github.com/pytorch/pytorch/pull/102738 to solve a ghstack issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103003
Approved by: https://github.com/jansel
2023-06-06 14:08:05 +00:00
Shunting Zhang
86c7652503 [inductor] layout optimization for conv (#99773)
A convolution kernel with channels-last inputs runs much faster than one with contiguous inputs. This PR leverages that to optimize tensor layouts so that we provide channels-last inputs to convolutions. Some care needs to be taken to avoid converting tensor layouts back and forth between contiguous and channels-last; those extra copies hurt performance quite a bit.
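
A small, self-contained illustration of the layout this optimization targets (plain ATen calls, not Inductor code):

```cpp
#include <ATen/ATen.h>

void channels_last_example() {
  at::Tensor x = at::randn({8, 3, 224, 224});  // NCHW, contiguous
  at::Tensor w = at::randn({16, 3, 3, 3});

  // Same logical shapes; only the strides change to channels-last (NHWC) order.
  at::Tensor x_cl = x.contiguous(at::MemoryFormat::ChannelsLast);
  at::Tensor w_cl = w.contiguous(at::MemoryFormat::ChannelsLast);

  // With channels-last inputs and weights, the convolution typically runs faster and
  // produces a channels-last output, so no layout conversion is needed downstream.
  at::Tensor y = at::conv2d(x_cl, w_cl);
}
```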

Latest perf number [here](https://hud.pytorch.org/benchmark/compilers?startTime=Wed%2C%2024%20May%202023%2023%3A40%3A37%20GMT&stopTime=Wed%2C%2031%20May%202023%2023%3A40%3A37%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&lBranch=shunting-layout-opt-19&lCommit=baa797fc100688dfb044fbcbdebcfd2591710f78&rBranch=main&rCommit=999bae0f54108ffc5b7cf2524a02a83901554b16)
- TB: 1.64x -> 1.69x
- HF: 1.79x -> 1.78x (random noise)
- TIMM: 1.51x -> 1.65x

Right now we disable layout optimization for dynamic shape since there is perf loss in that combination. Here is a GH issue to followup: https://github.com/pytorch/pytorch/issues/102670

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99773
Approved by: https://github.com/jansel
2023-06-02 21:08:18 +00:00
chunyuan
4c9992d5ed Inductor cpp wrapper: cache the wrapper (#89743)
If the wrapper code has been built, directly load the .so file to avoid recompilation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89743
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-06-02 00:02:39 +00:00
Bin Bao
c58264c3e9 [inductor] Support multiple symbolic numel expr in CudaWrapperCodeGen (#102093)
Summary: Add a set to avoid generating extra `auto` when seeing the
symbolic numel expression for the second time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102093
Approved by: https://github.com/jansel
2023-05-30 16:08:00 +00:00
chunyuan
3469f100f3 support ConvUnary in Inductor cpp wrapper (#101392)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101392
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/EikanWang
2023-05-26 15:52:06 +00:00
Natalia Gimelshein
68816e4fa9 Remove inplace buffers when original and mutation are both removed (#102289)
Currently, if we have an inplaced buffer that's completely internal to a fused kernel and thus doesn't need to be allocated, we are still allocating it and sending an unused argument to the kernel, because our analysis for removing buffers treats it separately (assuming that either the original or the mutated value is still needed).
This PR extends buffer removal to inplaced buffers that can be removed.

The generated kernel for e.g. layer norm changes from
```
def triton_(in_out_ptr0, in_out_ptr1, in_ptr0, in_ptr1, in_ptr2, out_ptr0, out_ptr1, xnumel, rnumel, XBLOCK : tl.constexpr):
```
where in_out_ptr0 is unused in the kernel to
```
def triton_(in_out_ptr1, in_ptr0, in_ptr1, in_ptr2, out_ptr0, out_ptr1, xnumel, rnumel, XBLOCK : tl.constexpr):
```
and the corresponding allocation/reuse lines in the wrapper are removed.
`in_out_ptr1` is also mislabeled: it's not `in_out`, it's only written to; but this PR doesn't fix that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102289
Approved by: https://github.com/jansel
2023-05-26 02:06:36 +00:00
Bin Bao
836798e0f3 [inductor] Support precomputed_sizes in CppWrapperCodeGen (#102083)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102083
Approved by: https://github.com/jansel, https://github.com/ngimel
2023-05-25 23:14:28 +00:00
Bin Bao
fd1d442185 [inductor] Add more dynamic shapes support for CudaWrapperCodeGen (#102019)
Summary: Use size hints for autotuning; fix some symbol arg codegen
problems. More PRs are coming to fix unit test failures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102019
Approved by: https://github.com/jansel
2023-05-24 13:29:47 +00:00
Bin Bao
431344f2d0 [inductor] Refactor generate_kernel_call (#102018)
Summary: Refactor generate_kernel_call to support codegen of calls to Triton
kernels.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102018
Approved by: https://github.com/jansel, https://github.com/jgong5
2023-05-23 15:54:49 +00:00
Jason Ansel
0c6f409cda [inductor] Refactor RNG operators (#100064)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100064
Approved by: https://github.com/ngimel
2023-05-20 03:43:33 +00:00