Commit Graph

187 Commits

Author SHA1 Message Date
Scott Wolchok
a0043d4840 [PyTorch] AOTI: cache dtypes and device types at DSO load (#111820)
Calling the `aoti_torch_{device_type,dtype}` functions on
each iteration can impose high costs on overhead-bound CPU models
because they can't be inlined across a DSO boundary. If we call them
on load, we can use simple load instructions at run time.
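
A minimal sketch of the pattern (the functions below are stand-ins for the real `aoti_torch_*` shim calls, and the names/values are illustrative, not the actual generated code):

```cpp
#include <cstdint>

// Stand-ins for the real aoti_torch_{device_type,dtype} shim functions, which
// live in a separate DSO and therefore cannot be inlined at the call site.
extern "C" int32_t aoti_torch_device_type_cuda_stub() { return 1; }
extern "C" int32_t aoti_torch_dtype_float32_stub() { return 6; }

namespace {
// Evaluated once when the DSO is loaded; every later use is a plain memory load.
const int32_t kCachedDeviceTypeCuda = aoti_torch_device_type_cuda_stub();
const int32_t kCachedDtypeFloat32 = aoti_torch_dtype_float32_stub();
}  // namespace

int32_t device_type_for_this_iteration() { return kCachedDeviceTypeCuda; }
int32_t dtype_for_this_iteration() { return kCachedDtypeFloat32; }
```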

Differential Revision: [D50563682](https://our.internmc.facebook.com/intern/diff/D50563682/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111820
Approved by: https://github.com/chenyang78, https://github.com/desertfire
ghstack dependencies: #111815, #111816
2023-10-24 18:37:26 +00:00
Scott Wolchok
6afd00a318 [PyTorch] AOTI: use array of constants (#111815)
We continue to allow the user to set constants via a map, but under the hood we use an array of constants.

model_container thought it was OK to hand over the map, assume we just
kept a pointer, and then mutate the map later; I had to fix that. I
hope there aren't other sites that do the same thing...

Differential Revision: [D50111512](https://our.internmc.facebook.com/intern/diff/D50111512/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111815
Approved by: https://github.com/jansel, https://github.com/desertfire
2023-10-24 18:37:18 +00:00
Jez Ng
cbc6213f5d [inductor] Defer memory operation lowering to wrapper (#111402)
Right now, memory ops are being lowered to strings partly in
scheduler.codegen() and partly in wrapper.codegen(). But that makes
static memory planning (which is done entirely in `wrapper.codegen()`)
difficult to implement as information is "lost" by that point.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111402
Approved by: https://github.com/jansel
2023-10-24 03:47:56 +00:00
Jez Ng
e264b42a2e [re-land][inductor] Refactor and optimize allocation calls (#111117) (#111511)
Summary:
This is a re-land of https://github.com/pytorch/pytorch/pull/111117 with
updates to our internal tests included.

This splits out changes from
https://github.com/pytorch/pytorch/pull/102625 to make things easier to
review.

This diff creates a `make_allocation()` method that extracts the logic
from `make_buffer_allocation()` while allowing us to allocate non-buffer
objects. In particular, we will use this to allocate memory pools during
memory planning.

This diff also includes a small optimization -- if the desired
allocation is contiguous, then we emit a call to `empty()` instead of
`empty_strided()` with its superfluous stride argument.
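
As a rough illustration in plain libtorch C++ (not the exact emitted wrapper code; the helper and its contiguity check are made up for this sketch):

```cpp
#include <ATen/ATen.h>

// Rough illustration of the allocation choice: if the requested strides are the
// standard contiguous strides for `sizes`, the stride argument is redundant and
// a plain empty() call suffices.
static bool is_contiguous_layout(at::IntArrayRef sizes, at::IntArrayRef strides) {
  int64_t expected = 1;
  for (int64_t i = static_cast<int64_t>(sizes.size()) - 1; i >= 0; --i) {
    if (strides[i] != expected) return false;
    expected *= sizes[i];
  }
  return true;
}

at::Tensor allocate_buffer(at::IntArrayRef sizes,
                           at::IntArrayRef strides,
                           const at::TensorOptions& options) {
  if (is_contiguous_layout(sizes, strides)) {
    return at::empty(sizes, options);                 // contiguous: no strides needed
  }
  return at::empty_strided(sizes, strides, options);  // general case keeps strides
}
```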

Test Plan: contbuild & OSS CI, see 9ce0ae836d

Differential Revision: D50429424

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111511
Approved by: https://github.com/jansel
2023-10-23 19:18:32 +00:00
Oguz Ulgen
2b2b6caf8f [inductor] Implement clone removal for user defined triton kernel via reinplace_scatters (#111627)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111627
Approved by: https://github.com/jansel
ghstack dependencies: #111434
2023-10-22 22:28:00 +00:00
Oguz Ulgen
977d3bcc46 [Inductor] Support user defined triton kernels in inductor (#111434)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111434
Approved by: https://github.com/jansel
2023-10-22 17:04:19 +00:00
Jason Ansel
a1154e673b [Compiled Autograd] Turn accumulate_grad into an op (#111700)
Relands #111271

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111700
Approved by: https://github.com/voznesenskym
2023-10-21 17:31:09 +00:00
PyTorch MergeBot
3eb5cae3af Revert "[Compiled Autograd] Turn accumulate_grad into an op (#111271)"
This reverts commit 04b04c0686.

Reverted https://github.com/pytorch/pytorch/pull/111271 on behalf of https://github.com/jeanschmidt due to Breaking internal CI ([comment](https://github.com/pytorch/pytorch/pull/111271#issuecomment-1768527932))
2023-10-18 14:02:34 +00:00
PyTorch MergeBot
ed7739d690 Revert "[aot_inductor] return a copy of any constant (#111356)"
This reverts commit 71e1f34923.

Reverted https://github.com/pytorch/pytorch/pull/111356 on behalf of https://github.com/jeanschmidt due to Breaking internal ci ([comment](https://github.com/pytorch/pytorch/pull/111356#issuecomment-1768503640))
2023-10-18 13:51:30 +00:00
PyTorch MergeBot
08f580d498 Revert "[inductor] Refactor and optimize allocation calls (#111117)"
This reverts commit 9ce0ae836d.

Reverted https://github.com/pytorch/pytorch/pull/111117 on behalf of https://github.com/jeanschmidt due to Breaking internal CI ([comment](https://github.com/pytorch/pytorch/pull/111117#issuecomment-1768489865))
2023-10-18 13:45:02 +00:00
soulitzer
2dc1726ab7 Compile NestedTensor with AOTAutograd (#110529)
This PR has a number of changes that improve subclass support for AOTAutograd/Inductor in general:
- Previously, if a subclass did extra aliasing between graph outputs/inputs, the partitioner would complain because grad_outputs are the graph outputs reused as-is. Now we do a view_as(self) to work around this.
- Use dense -> dense metadata when working with fwd_output_strides during backward. This is important since the stride information comes from inductor which sees the dense to dense graph.
- Inductor requires that the inputs to the compiled backward to match some expected strides computed during compilation. We make sure to make the inner tensors of the subclass contiguous (previously, we only made the subclass itself contiguous)

Changes specific to NestedTensor relevant to compilation:
- Properly handle the case where `__tensor_unflatten__` is passed non-symbolic dense tensors with metadata extracted from fake subclasses.
- Skip var_to_range logic for singleton int
- Skip size hint logic in inductor for singleton int

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110529
Approved by: https://github.com/bdhirsh
2023-10-17 21:17:10 +00:00
Yang Chen
71e1f34923 [aot_inductor] return a copy of any constant (#111356)
When the model returns a constant, we cannot "release" its handle,
because the constant doesn't have any handle at all. Instead,
we should allocate a new tensor and then return a copy of the constant.
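
A hedged sketch of the idea in plain libtorch C++ (not the actual AOTI runtime code; the helper name is made up):

```cpp
#include <ATen/ATen.h>

// Constants are owned by the model, so an output slot must not hand back the
// constant's own storage for the caller to release; give the caller a fresh
// copy instead.
at::Tensor make_output_from_constant(const at::Tensor& constant) {
  at::Tensor copy = at::empty_like(constant);  // newly allocated tensor
  copy.copy_(constant);                        // fill it with the constant's data
  return copy;                                 // safe for the caller to own and free
}
```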

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111356
Approved by: https://github.com/hl475
2023-10-17 08:44:21 +00:00
Jez Ng
9ce0ae836d [inductor] Refactor and optimize allocation calls (#111117)
This splits out changes from
https://github.com/pytorch/pytorch/pull/102625 to make things easier to
review.

This diff creates a `make_allocation()` method that extracts the logic
from `make_buffer_allocation()` while allowing us to allocate non-buffer
objects. In particular, we will use this to allocate memory pools during
memory planning.

This diff also includes a small optimization -- if the desired
allocation is contiguous, then we emit a call to `empty()` instead of
`empty_strided()` with its superfluous stride argument.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111117
Approved by: https://github.com/jansel
2023-10-17 03:06:52 +00:00
Jason Ansel
04b04c0686 [Compiled Autograd] Turn accumulate_grad into an op (#111271)
Rather than baking the behavior of `AccumulateGrad` nodes into the generated graph (either as `+=` or as a return value of the graph), this creates a new `accumulate_grad_` dispatcher op that is included in the generated graph like:
```
def forward(self, inputs, sizes, hooks):
    getitem = inputs[0]
    getitem_1 = inputs[1]
    getitem_2 = inputs[2]
    getitem_3 = inputs[3]
    getitem_4 = inputs[4]
    getitem_5 = inputs[5]
    getitem_6 = inputs[6]
    getitem_7 = inputs[7]
    getitem_8 = inputs[8]
    getitem_9 = inputs[9];  inputs = None
    expand = torch.ops.aten.expand.default(getitem, [2, 4]);  getitem = None
    threshold_backward = torch.ops.aten.threshold_backward.default(expand, getitem_1, 0);  expand = getitem_1 = None
    t = torch.ops.aten.t.default(getitem_3);  getitem_3 = None
    mm = torch.ops.aten.mm.default(threshold_backward, t);  t = None
    t_1 = torch.ops.aten.t.default(threshold_backward)
    mm_1 = torch.ops.aten.mm.default(t_1, getitem_2);  t_1 = getitem_2 = None
    t_2 = torch.ops.aten.t.default(mm_1);  mm_1 = None
    sum_1 = torch.ops.aten.sum.dim_IntList(threshold_backward, [0], True);  threshold_backward = None
    view = torch.ops.aten.view.default(sum_1, [4]);  sum_1 = None
    t_3 = torch.ops.aten.t.default(t_2);  t_2 = None
    accumulate_grad_ = torch.ops.inductor.accumulate_grad_.default(getitem_4, t_3);  getitem_4 = t_3 = None
    threshold_backward_1 = torch.ops.aten.threshold_backward.default(mm, getitem_5, 0);  mm = getitem_5 = None
    t_4 = torch.ops.aten.t.default(threshold_backward_1)
    mm_2 = torch.ops.aten.mm.default(t_4, getitem_6);  t_4 = getitem_6 = None
    t_5 = torch.ops.aten.t.default(mm_2);  mm_2 = None
    sum_2 = torch.ops.aten.sum.dim_IntList(threshold_backward_1, [0], True);  threshold_backward_1 = None
    view_1 = torch.ops.aten.view.default(sum_2, [4]);  sum_2 = None
    t_6 = torch.ops.aten.t.default(t_5);  t_5 = None
    accumulate_grad__1 = torch.ops.inductor.accumulate_grad_.default(getitem_7, t_6);  getitem_7 = t_6 = None
    accumulate_grad__2 = torch.ops.inductor.accumulate_grad_.default(getitem_8, view_1);  getitem_8 = view_1 = None
    accumulate_grad__3 = torch.ops.inductor.accumulate_grad_.default(getitem_9, view);  getitem_9 = view = None
    return []

```

The motivation here is that `AccumulateGrad` nodes are causing trouble in FSDP tracing, since FSDP resizes parameters and parameter storage in place inside hooks. We will model this mutation in dynamo, but not during the initial compiled autograd capture. This allows us to bypass failing shape checks in the initial capture.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111271
Approved by: https://github.com/voznesenskym
2023-10-16 21:16:17 +00:00
Scott Wolchok
84975339bd [PyTorch] AOTI: generate reused thread_locals when tensors provably have static shape (#110892)
If a Tensor can be reused and has static shape, we can just cache it across iterations.

This is meant as a quickly shippable overhead reduction for CPU overhead-bound use cases, without relying on memory planning.
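
A hedged sketch of the caching pattern (shapes, names, and the op used are illustrative; this is not the actual generated code):

```cpp
#include <ATen/ATen.h>

// When a temporary buffer has a statically known shape and can be reused, cache
// the allocation in a thread_local so repeated runs skip the allocator entirely.
static at::Tensor& cached_buf0() {
  thread_local at::Tensor buf =
      at::empty({8, 16}, at::TensorOptions().dtype(at::kFloat));  // shape known at compile time
  return buf;
}

void run_iteration(const at::Tensor& a, const at::Tensor& b) {
  at::Tensor& buf0 = cached_buf0();  // reused across iterations on this thread
  at::add_out(buf0, a, b);           // write into the cached buffer, no new allocation
}
```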

Differential Revision: [D50023678](https://our.internmc.facebook.com/intern/diff/D50023678/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110892
Approved by: https://github.com/bertmaher
ghstack dependencies: #110876, #110877, #110909
2023-10-13 16:07:05 +00:00
Bin Bao
6b4c686b9a [aotinductor] Forward fix a performance regression (#110800)
Summary: Forward fix a performance regression caused by https://github.com/pytorch/pytorch/pull/110510. Once a model has been run, all those kernel pointers are initialized, and removing the if-nullptr check causes those loadKernel calls to be unnecessarily executed again when we rerun the forward function. Another way to do this is to codegen loadKernel in the initializer, which I may do in a later PR.
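
A hedged sketch of the restored guard (the loader and kernel names are stand-ins, not the real AOTI runtime):

```cpp
#include <cstdio>

// Guard each loadKernel call with an if-nullptr check so that rerunning the
// forward function does not reload kernels that are already initialized.
using KernelHandle = void*;

static KernelHandle loadKernel(const char* name) {  // stand-in for the real loader
  std::printf("loading %s\n", name);
  static int dummy = 0;
  return &dummy;
}

static KernelHandle triton_kernel_0 = nullptr;

void forward_once() {
  if (triton_kernel_0 == nullptr) {  // only the first run pays the load cost
    triton_kernel_0 = loadKernel("triton_kernel_0");
  }
  // ... launch triton_kernel_0 ...
}
```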

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110800
Approved by: https://github.com/jansel
2023-10-08 04:06:44 +00:00
Adnan Akhundov
abb00f66d8 [inductor] Add AOTI ABI shim function for repeat_interleave.Tensor (#110745)
Summary: `repeat_interleave.Tensor` doesn't have an inductor lowering. To invoke the operator in AOT Inductor's ABI-compatibility mode, we need a dedicated shim function.

Test Plan:

```
$ python test/inductor/test_aot_inductor.py -k test_repeat_interleave
...
----------------------------------------------------------------------
Ran 4 tests in 70.526s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110745
Approved by: https://github.com/chenyang78
ghstack dependencies: #110713
2023-10-07 08:18:01 +00:00
Bin Bao
298f01d9a2 [aotinductor] Avoid generating redundant kernel loading code (#110510)
Summary: 1) Stop forcing triton.unique_kernel_names to True for AOTInductor, because the unique kernel name can be read from metadata; 2) Only generate load_kernel once for each kernel since we don't have control flow in our generated code.  This solves https://github.com/pytorch/pytorch/issues/105553.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110510
Approved by: https://github.com/chenyang78, https://github.com/jansel
2023-10-05 19:59:38 +00:00
Sherlock Huang
f1b94461aa [AOTInductor] ProxyExecutor support Dynamic Shape (#110526)
Summary:
Extend ProxyExecutor to support dynamic shape.

Example of ProxyExecutor invocation with symints.
```
    int64_t* arg0_1_size;
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_get_sizes(arg0_1, &arg0_1_size));
    auto s0 = arg0_1_size[0];
    auto s1 = arg0_1_size[1];
    int64_t* arg1_1_size;
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_get_sizes(arg1_1, &arg1_1_size));
    auto s2 = arg1_1_size[0];
    auto s3 = arg1_1_size[1];
    ...
    aoti_torch_proxy_executor_call_function(proxy_executor, 0, 15, std::vector<int64_t>{42, 16, 17, s0 + s1, s0 + s1, s2*s3, 45, 67, 16, 17, s2*s3, s2*s3, s0 + s1, 89, 910}.data(), 7, std::vector<AtenTensorHandle>{arg0_1, arg0_1, arg1_1, buf2, arg0_1, arg1_1, buf4}.data());
```

Example of serialized SymInt(s) arguments:
```
          {
            "name": "symint",
            "arg": {
              "asSymInt": {
                "asName": "s0 + s1"
              }
            }
          },
          {
            "name": "symints",
            "arg": {
              "asSymInts": [
                {
                  "asName": "s0 + s1"
                },
                {
                  "asName": "s2*s3"
                }
              ]
            }
          },
          ...
          {
            "name": "o_symint",
            "arg": {
              "asSymInt": {
                "asName": "s2*s3"
              }
            }
          },
          {
            "name": "o_symints",
            "arg": {
              "asSymInts": [
                {
                  "asName": "s2*s3"
                },
                {
                  "asName": "s0 + s1"
                }
              ]
            }
          },
```

Test Plan: buck2 run mode/dev-nosan deeplearning/aot_inductor/test:test_custom_ops

Differential Revision: D49887555

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110526
Approved by: https://github.com/chenyang78
2023-10-05 19:05:20 +00:00
Oleg Khabinov
cf1b494afd [AOTInductor] Store loaded kernels in the model (#110554)
Defining kernels as static vars is problematic for subsequent model loading on non-default CUDA devices.

If those kernels were loaded in the context of device #0, they are no longer nullptr, so they won't work on devices other than device #0.

This change stores the loaded kernels at the model level in AOT mode.
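
A hedged sketch of the before/after idea (types and names are illustrative only, not the real generated model class):

```cpp
// Instead of `static CUfunction kernel = ...` inside the generated function,
// keep the kernel handles as members of the model object, so a model loaded on
// device #1 does not see handles that were loaded for device #0.
struct AOTInductorModelSketch {
  void* triton_kernel_0 = nullptr;  // per-model state, not a function-local static

  void run() {
    if (triton_kernel_0 == nullptr) {
      triton_kernel_0 = load_kernel_for_current_device("triton_kernel_0");
    }
    // ... launch using triton_kernel_0 ...
  }

  static void* load_kernel_for_current_device(const char* /*name*/) {
    // Stand-in loader; the real code loads a .cubin for the active CUDA device.
    static int dummy = 0;
    return &dummy;
  }
};
```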

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110554
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2023-10-05 10:17:05 +00:00
Kazuaki Ishizaki
434a996c42 Fix typo under torch/_inductor directory (#110530)
This PR fixes typos in comments and messages in files under the `torch/_inductor` directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110530
Approved by: https://github.com/kit1980
2023-10-05 02:17:20 +00:00
Yang Chen
46a5558cd5 [AOTInductor] Simplified AOTInductor interface and model class (#110411)
Summary:
This PR removes several APIs from the AOTInductor interface
that are not used by the client.

It also simplified AOTInductor's model class by removing
the dim info for input/output tensors. We included dim info
before to return max output shapes, which was used by the client
to allocate memory for output tensors. Now, we allocate output
tensor memory from the .so so that we don't need to maintain
such information any more. The deletion of dim info from
the model class also simplified the codegen quite a bit.

Test Plan: ci

Reviewed By: khabinov

Differential Revision: D49835430

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110411
Approved by: https://github.com/khabinov, https://github.com/desertfire, https://github.com/jansel
2023-10-04 18:35:24 +00:00
Bin Bao
539367f0bc [aotinductor] Refactor optional value codegen (#110233)
Summary: Simplify the codegen for optional values by using c10::nullopt; we don't need placeholders like OptionalScalar because we can simply use None for that purpose.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110233
Approved by: https://github.com/jansel
2023-10-04 17:18:02 +00:00
Sherlock Huang
50054b1a62 [AOTInductor] ProxyExecutor support ReinterpretView inputs (#110451)
Summary:
See wrapper.codegen_reinterpret_view(): it returns a temporary handle for the tensor, which has the following problem.
```
            # NB, the return handle here represents a temporary tensor, which will be automatically
            # released.
            # Here's a sample usage in the cpp wrapper code:
            # ```
            # aoti_torch_addmm_out(
            #     buf1,
            #     arg1_1,
            #     RAIIAtenTensorHandle(tmp_tensor_handle_0),
            #     buf0,
            #     1L,
            #     1L));
            # ```
            # RAIIAtenTensorHandle(tmp_tensor_handle_0) will be released after the call to addmm_out.
            # This could be problematic when it's used in a different pattern, for example:
            # ````
            # AtenTensorHandle tensor_args[] = {RAIIAtenTensorHandle(tmp_tensor_handle_2), buf5, buf6};
            # aoti_torch_proxy_executor_call_function(..., tensor_args);
            # ````
            # RAIIAtenTensorHandle(tmp_tensor_handle_2) will be invalid when it's used in the latter
            # kernel call.
            return f"RAIIAtenTensorHandle({tmp_name})"
```

As a result, ProxyExecutor would generate the following code, which causes an invalid memory access.

Before:

```
    // Source Nodes: [fn_with_tuple_output], Original ATen: [fb.fn_with_tuple_output]
    AtenTensorHandle tmp_tensor_handle_2;
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch__reinterpret_tensor(buf3, 2, int_array_0, int_array_1, 0L, &tmp_tensor_handle_2));
    ...
    AtenTensorHandle tensor_args[] = {RAIIAtenTensorHandle(tmp_tensor_handle_2), buf5, buf6};
    int64_t int_args[] = {1};
    aoti_torch_proxy_executor_call_function(proxy_executor, 1, 1, int_args, 3, tensor_args);
    buf3.reset();
```

With the fix in this diff, ProxyExecutor generates the following code.

After:

```
    // Source Nodes: [fn_with_tuple_output], Original ATen: [fb.fn_with_tuple_output]
    AtenTensorHandle tmp_tensor_handle_2;
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch__reinterpret_tensor(buf3, 2, int_array_0, int_array_1, 0L, &tmp_tensor_handle_2));
    ...
    aoti_torch_proxy_executor_call_function(proxy_executor, 1, 1, std::vector<int64_t>{1}.data(), 3, std::vector<AtenTensorHandle>{RAIIAtenTensorHandle(tmp_tensor_handle_2), buf5, buf6}.data());
    buf3.reset();
```

I am not exactly a big fan of such `std::vector{...}.data()` for creating a temp array, but I can't think of another fix.

Test Plan: buck2 run mode/dev-nosan deeplearning/aot_inductor/test:test_custom_ops

Reviewed By: desertfire

Differential Revision: D49758764

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110451
Approved by: https://github.com/desertfire
2023-10-04 02:20:31 +00:00
Mu-Chu Lee
836ba6430a [AOTInductor] Initial functionality for Inf and NaN checker (#109526)
Summary:
Add initial functionality for Inf and NaN checker for AOTInductor.

Test Plan:
Included in commit. Skipped for CI as SIGABRT can't be captured by pytest.

Differential Revision: [D49379751](https://our.internmc.facebook.com/intern/diff/D49379751)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109526
Approved by: https://github.com/chenyang78
2023-10-03 22:39:42 +00:00
Yang Chen
da63c7f2c3 [AOTInductor] remove CUDA dependency for cpp backend (#110409)
Summary:
Previously, we linked against CUDA libs even for the pure cpp backend.
This caused issues for cases where the inference platform does not
have GPUs. This diff removes the CUDA dependency for the cpp backend.

Reviewed By: bertmaher, muchulee8, mikekgfb

Differential Revision: D49800712

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110409
Approved by: https://github.com/bertmaher, https://github.com/desertfire
2023-10-03 18:36:00 +00:00
Sherlock Huang
898656e9d1 [AOTInductor] ProxyExecutor supports Tuple of Tensor and List[Tensor] in returns (#110187)
Summary:
ProxyExecutor supports custom ops that return a tuple mixing Tensor and List[Tensor]
e.g. `"fn_with_mix_outputs(Tensor t, Tensor[] tensors) -> (Tensor, Tensor[])"`

Example:
`out7, [out8, out9] = torch.ops.fb.fn_with_mix_outputs(out5, [out6, out4])`
got compiled into
```
    AtenTensorHandle buf11_handle;  // output buffer
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_new_uninitialized_tensor(&buf11_handle));
    RAIIAtenTensorHandle buf11(buf11_handle);
    AtenTensorHandle buf12_handle;  // output buffer
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_new_uninitialized_tensor(&buf12_handle));
    RAIIAtenTensorHandle buf12(buf12_handle);
    AtenTensorHandle buf13_handle;  // output buffer
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_new_uninitialized_tensor(&buf13_handle));
    RAIIAtenTensorHandle buf13(buf13_handle);
    AtenTensorHandle tensor_args_var_7[] = {buf8.get(), buf9.get(), buf6.get(), buf11.get(), buf12.get(), buf13.get()};
    int64_t int_args_var_8[] = {};
    aoti_torch_proxy_executor_call_function(proxy_executor, 3, 0, int_args_var_8, 6, tensor_args_var_7);
```

Serialized extern node
```
    {
      "name": "buf10",
      "node": {
        "target": "fb::fn_with_mix_outputs",
        "inputs": [
          {
            "name": "t",
            "arg": {
              "asTensor": {
                "name": "buf8"
              }
            }
          },
          {
            "name": "tensors",
            "arg": {
              "asTensors": [
                {
                  "name": "buf9"
                },
                {
                  "name": "buf6"
                }
              ]
            }
          }
        ],
        "outputs": [
          {
            "asTensor": {
              "name": "buf11"
            }
          },
          {
            "asTensors": [
              {
                "name": "buf12"
              },
              {
                "name": "buf13"
              }
            ]
          }
        ],
        "metadata": {}
      }
    }
```

Test Plan: Test

Differential Revision: D49710320

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110187
Approved by: https://github.com/chenyang78
2023-09-30 19:47:01 +00:00
Adnan Akhundov
2ead6c2f6e Skip launching kernels with zero grid in AOT Inductor (#110312)
Summary: With the grid computed in terms of unbacked `SymInt`s, it can happen that the grid is zero-sized. This causes a CUDA error on `cuLaunchKernel` in the AOT Inductor codegen.

In this PR, when the grid contains unbacked `SymInt`s, a check is added around the `launchKernel` call in AOT Inductor's C++ wrapper codegen to make sure that the grid is not zero-sized.
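
A hedged sketch of the emitted guard (grid variable names are illustrative):

```cpp
#include <cstdint>

// When a grid dimension is an expression over unbacked symbols, wrap the launch
// so a zero-sized grid is skipped instead of triggering a cuLaunchKernel error.
void maybe_launch(int64_t grid_0, int64_t grid_1, int64_t grid_2) {
  if (grid_0 > 0 && grid_1 > 0 && grid_2 > 0) {
    // launchKernel(kernel, grid_0, grid_1, grid_2, ...);
  }
  // Zero-sized grid: nothing to compute, skip the launch entirely.
}
```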

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110312
Approved by: https://github.com/chenyang78
2023-09-30 09:12:56 +00:00
Sherlock Huang
d7de26804e [AOTInductor] ProxyExecutor supports List[Tensor] return type (#110182)
Summary:
Support custom ops that return List[Tensor], like `"fn_with_list_output(Tensor[] tensors, int i) -> Tensor[]"`

As an example
`out5, out6 = torch.ops.fb.fn_with_list_output([out3, out4], 1)`

got compiled into

```
    AtenTensorHandle buf8_handle;  // output buffer
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_new_uninitialized_tensor(&buf8_handle));
    RAIIAtenTensorHandle buf8(buf8_handle);
    AtenTensorHandle buf9_handle;  // output buffer
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_new_uninitialized_tensor(&buf9_handle));
    RAIIAtenTensorHandle buf9(buf9_handle);
    AtenTensorHandle tensor_args_var_5[] = {buf5.get(), buf6.get(), buf8.get(), buf9.get()};
    int64_t int_args_var_6[] = {1};
    aoti_torch_proxy_executor_call_function(proxy_executor, 2, 1, int_args_var_6, 4, tensor_args_var_5);
```

Test Plan: Test

Differential Revision: D49694691

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110182
Approved by: https://github.com/chenyang78
2023-09-29 18:21:48 +00:00
Yang Chen
30759848fa [inductor] handle non-list/tuple outputs for FallbackKernel (#110145)
generate_output may return non-list/tuple outputs. Let's force
those to be lists, because we will enumerate kernel.outputs
later in the codegen.

Also fixed a minor issue in an assertion message.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110145
Approved by: https://github.com/aakhundov
2023-09-29 17:13:26 +00:00
Sherlock Huang
7f2b51c668 [AOTInductor] ProxyExecutor supports custom op with tuple output (#110140)
Summary:
Extend ProxyExecutor to support custom ops with tuple outputs.

Generated wrapper code for `out3, out4 = torch.ops.fb.fn_with_tuple_output(out2, 1)`

```
    AtenTensorHandle buf5_handle;  // output buffer
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_new_uninitialized_tensor(&buf5_handle));
    RAIIAtenTensorHandle buf5(buf5_handle);
    AtenTensorHandle buf6_handle;  // output buffer
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_new_uninitialized_tensor(&buf6_handle));
    RAIIAtenTensorHandle buf6(buf6_handle);
    AtenTensorHandle tensor_args_var_3[] = {buf3.get(), buf5.get(), buf6.get()};
    int64_t int_args_var_4[] = {1};
    aoti_torch_proxy_executor_call_function(proxy_executor, 1, 1, int_args_var_4, 3, tensor_args_var_3);
```

Test Plan: Test

Differential Revision: D49673994

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110140
Approved by: https://github.com/chenyang78
2023-09-28 02:50:39 +00:00
Sherlock Huang
ec5bbef8af [AOTInductor] Switch ProxyExecutor to use AtenTensorHandle (#109748)
Summary: Switch ProxyExecutor to use AtenTensorHandle.

Test Plan: E2E Test

Differential Revision: D49471659

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109748
Approved by: https://github.com/yifuwang, https://github.com/desertfire, https://github.com/chenyang78
2023-09-27 17:51:30 +00:00
Yang Chen
4d0ae7c9da [inductor] support _scaled_dot_product_flash_attention fallback (#110085)
Summary:
This PR supports the _scaled_dot_product_flash_attention fallback kernel.
Note that in abi_compatible mode, we retrieve outputs by passing
output-argument pointers rather than relying on std::get.
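
A hedged sketch of the calling-convention difference (the function names are made up; this is not the actual fallback shim):

```cpp
#include <cstdint>
#include <tuple>

// In ABI-compatible mode the fallback hands results back through out-pointer
// arguments instead of a std::tuple unpacked with std::get on the caller side.
using Handle = void*;

// Non-ABI-compatible style: a C++ tuple crosses the interface.
std::tuple<Handle, Handle> fallback_returns_tuple() {
  return {nullptr, nullptr};
}

// ABI-compatible style: plain out-parameters, no C++ standard-library types.
int32_t fallback_with_out_pointers(Handle* out_attention, Handle* out_logsumexp) {
  *out_attention = nullptr;
  *out_logsumexp = nullptr;
  return 0;  // 0 == success
}
```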

It also fixes an issue related to dynamic shapes, where we wrongfully
query undefined dynamic symbols.

Test Plan: ci

Reviewed By: frank-wei

Differential Revision: D49620191

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110085
Approved by: https://github.com/desertfire
2023-09-27 00:09:56 +00:00
Bin Bao
993530ee4f [aotinductor] Relax the CUDAGuard device index check (#110030)
Summary: Although AOTInductor only supports running on a single CUDA device, it does work when there is a mix of CPU and CUDA ops. So instead of asserting when a CUDA device index appears for the first time, we check that only one CUDA device index is used. This solves https://github.com/pytorch/pytorch/issues/109655

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110030
Approved by: https://github.com/jansel
2023-09-26 16:23:23 +00:00
Bin Bao
4bf1cd6961 [aotinductor] Rename aot_runtime to aoti_runtime (#110007)
Summary: Make the naming more explicit

Differential Revision: D49593528

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110007
Approved by: https://github.com/houseroad
2023-09-26 00:46:54 +00:00
Yu, Guangye
e9c9b1ed59 [Inductor] Generalize inductor triton backend device agnostic (#109486)
# Motivation
@jansel As discussed before, we want to generalize some CUDA-specific code. This makes inductor friendlier to third-party backends so that they can leverage inductor code as much as possible.

# Solution
To implement this, we introduce a device runtime abstraction. We wrap the device runtime APIs inside `DeviceInterface` and use `register_interface_for_device` to register each kind of device with inductor, then use `get_interface_for_device` to fetch the corresponding runtime for a device type. Usage looks like this:
```python
device_interface = get_interface_for_device("xpu")
device_interface.is_available()  # check if XPU is available
device_interface.device_count()  # check how many XPU devices are available
```
`DeviceInterface` is a simple abstraction that enables third-party backends implementing CUDA-like semantics to be integrated with inductor. It keeps third-party backends from having to monkey-patch utility functions such as `decode_device`, which is hard-coded for CUDA.

# Additional Context
The main code change:
- To leverage AsyncCompile, make it device-agnostic
- Avoid monkey patches, make some utility functions device-agnostic

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109486
Approved by: https://github.com/jansel, https://github.com/jgong5, https://github.com/EikanWang
2023-09-24 07:49:20 +00:00
Oleg Khabinov
54faedf5f2 [AOTInductor] Load model on arbitrary device (#109816)
Reviewed By: desertfire

Differential Revision: D49402404

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109816
Approved by: https://github.com/chenyang78
2023-09-23 04:45:20 +00:00
Bin Bao
c27c56a5c4 [inductor] Add back a missing header include (#109845)
Summary: It was removed in https://github.com/pytorch/pytorch/pull/109678, which regressed GoogleFnet in HF.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109845
Approved by: https://github.com/angelayi, https://github.com/chenyang78
2023-09-22 17:06:06 +00:00
Bin Bao
d7dfa91e12 [inductor] Refactor some libtorch c shim interfaces (#109834)
Summary: Move the returned values to the end of the parameter list, because 1) it is more consistent with the AOTInductor runtime API convention, and 2) since the out-variant ops have the out tensor at the beginning of the parameters, this makes the return values easier to distinguish from it.
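
A hedged before/after sketch with a hypothetical shim function (not the real C shim API):

```cpp
#include <cstdint>

// Return values move behind the input parameters, so they are not confused
// with the leading `out` tensor of out-variant ops.
using Handle = void*;

// Before: the returned handle leads the parameter list.
int32_t shim_op_v1(Handle* ret, Handle out, Handle self);

// After: inputs first, returned handle(s) at the end.
int32_t shim_op_v2(Handle out, Handle self, Handle* ret);
```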

Test Plan:
```
buck test mode/opt caffe2/torch/fb/model_transform/experimental/benchmark/test/aotinductor:test_aot_inductor_benchmark
```

Differential Revision: D49522928

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109834
Approved by: https://github.com/chenyang78
2023-09-22 12:45:23 +00:00
Bin Bao
8856c1628e [inductor] Change AOTInductor to return output tensors (#109790)
Summary:
Change AOTInductor to directly return output tensors instead of taking pre-allocated output tensors to return the results. This gives several benefits:

* It makes sure AOTInductor has the same behavior when managing the output tensors as the default Inductor, which is widely tested and thus more reliable.
* As we have debugged before, there are cases where we still have to codegen extra copy_ ops to fill the pre-allocated output tensors, which doesn't make sense for performance.
* With the coming enhanced memory planning, this again will make sure the memory planning logic is the same between AOTInductor and Inductor, which will greatly simplify the problem and improve reliability.

This change also combines D49494954 from Yang and https://github.com/pytorch/pytorch/pull/109560 from Angela.

Differential Revision: D49502318

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109790
Approved by: https://github.com/chenyang78
2023-09-22 02:31:52 +00:00
Bin Bao
9c2715bbb2 [inductor] Clean up AOTInductor runtime ABI (#109678)
Summary: Change the AOTInductor runtime interface to avoid referring to aten data structures directly, mostly at::Tensor and ProxyExecutor. This a combination of https://github.com/pytorch/pytorch/pull/109436,  https://github.com/pytorch/pytorch/pull/109498, https://github.com/pytorch/pytorch/pull/109450, https://github.com/pytorch/pytorch/pull/109606, plus a few internal build changes.

Reviewed By: frank-wei

Differential Revision: D49374820

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109678
Approved by: https://github.com/frank-wei, https://github.com/chenyang78
2023-09-21 00:25:24 +00:00
Yang Chen
1c4e811565 replace data_ptr with aoti_torch_get_data_ptr for cpp codegen (#109615)
Summary:
in cpp codegen, we should use aoti_torch_get_data_ptr
for retrieving aten tensor pointers if abi_compatible is true

Test Plan: ci

Reviewed By: bertmaher

Differential Revision: D49411392

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109615
Approved by: https://github.com/bertmaher, https://github.com/desertfire, https://github.com/jansel
2023-09-20 17:26:17 +00:00
Edward Z. Yang
b771c04d6e Handle unbacked symints in buffer reuse calculation (#109603)
This is rewritten from https://github.com/pytorch/pytorch/pull/106655 to land faster, with peterbell10's comments.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109603
Approved by: https://github.com/yf225
2023-09-20 16:54:57 +00:00
Bin Bao
0f646b1d15 [inductor] Add a C shim layer for libtorch (#109391)
Summary:
This PR adds a limited C shim layer for libtorch. The ultimate goal is to ban any direct reference to aten/c10 data structures or functions, to avoid ABI breakage by providing stable C interfaces.

To make the review and landing easier, we broke the changes into several steps. In this PR (a combination of https://github.com/pytorch/pytorch/pull/109022 and https://github.com/pytorch/pytorch/pull/109351), we add C interfaces for certain libtorch functions and modify the wrapper codegen to generate calls to those interfaces (a rough sketch of such an interface boundary follows the list below). There are a few other items to be addressed in future PRs:

* The AOTInductor runtime interface still takes lists of aten tensors as input and output
* The interaction with ProxyExecutor (general fallback support) needs to move away from aten tensor
* Remove all references to aten/c10 headers in the AOTInductor-generated code
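
As referenced above, a rough sketch of what such an interface boundary looks like (all names and signatures here are illustrative, not the actual shim header):

```cpp
#include <cstdint>

// aten/c10 types stay behind the DSO boundary; the generated code only sees
// opaque handles plus plain-C scalar types behind a stable extern "C" surface.
extern "C" {

using AtenTensorHandleSketch = void*;  // opaque handle, real type hidden in libtorch
using AOTITorchErrorSketch = int32_t;  // 0 == success

// Hypothetical shim wrapper: a stable C signature in front of an aten call.
AOTITorchErrorSketch aoti_sketch_relu_out(AtenTensorHandleSketch out,
                                          AtenTensorHandleSketch self);

}  // extern "C"
```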

Differential Revision: D49302669

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109391
Approved by: https://github.com/chenyang78
2023-09-16 16:46:26 +00:00
Oleg Khabinov
cc03e3a892 [AOTInductor] Do not hardcode directory with .cubin files (#109151)
Reviewed By: frank-wei, chenyang78

Differential Revision: D49081883

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109151
Approved by: https://github.com/chenyang78
2023-09-15 18:38:05 +00:00
Yang Chen
9cd4548f01 AOTInductor dynamic shape (#109012)
Summary: This PR adds dynamic-shape support for AOTInductor

* On the runtime/interface side, we added two structs, StaticDimInfo
and DynamicDimInfo, to hold values for static and dynamic dimensions,
respectively. Dynamic dimensions are tracked by an unordered map field
defined in AOTInductorModelBase. At inference time, the inference run
method will assign the current real dimensional value to each dynamic
dimension before executing any kernel.

* On the CUDA wrapper codegen side, we generate dynamic symbols
appropriately for shape computations. We simulate kernel launch grids
in the C++ land by re-using the grid functions from the Python world.
The returned grid configs, which may contain symbolic expressions,
are printed out in their C++ forms via the CppPrinter. Note that
when dynamic shapes are involved, we have to compute grid configs
for each kernel at runtime in the same way as we do for launching
the corresponding Triton kernel. Otherwise, we may end up with
memory-access failures or mis-computations caused by invalid indices
for fetching or storing data in device memory.
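
A hedged sketch of the runtime grid computation (the expression and block size are illustrative, not the actual generated wrapper):

```cpp
#include <cstdint>

// With dynamic shapes, the launch grid must be recomputed from the current
// symbolic dimension values on every run, mirroring the Python grid function
// used when launching the corresponding Triton kernel.
static int64_t ceil_div(int64_t a, int64_t b) { return (a + b - 1) / b; }

void launch_with_dynamic_grid(int64_t s0 /* runtime value of a dynamic dim */) {
  const int64_t xnumel = s0 * 16;                // symbolic expression, now concrete
  const int64_t grid_0 = ceil_div(xnumel, 256);  // same formula as the Python grid fn
  // launchKernel(kernel, grid_0, 1, 1, ..., xnumel, ...);
  (void)grid_0;
}
```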

Differential Revision: D49100472

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109012
Approved by: https://github.com/khabinov, https://github.com/desertfire, https://github.com/hl475
2023-09-14 08:00:30 +00:00
pytorchbot
faa5985dfe Fix issue when input/output buffer of functional collective (e.g. allreduce / allgather) is incorrectly reused later (#108811)
For this program:
```python
def func(a, *, tag, ranks, group_size):
    ar = torch.ops.c10d_functional.all_reduce(a, "sum", tag, ranks, group_size)
    ar = torch.ops.c10d_functional.wait_tensor(ar)
    c = torch.relu(a)
    # c = a
    d = torch.matmul(c, c)
    e = d + ar
    return (e,)
```
the generated code is:
```python
def call(args):
    arg0_1, = args
    args.clear()
    assert_size_stride(arg0_1, (4, 4), (4, 1))
    with torch.cuda._DeviceGuard(1):
        torch.cuda.set_device(1) # no-op to ensure context
        buf0 = empty_strided((4, 4), (4, 1), device='cuda', dtype=torch.float32)
        buf0.copy_(arg0_1) #no reuse
        buf1_pg = c10d._find_or_create_pg_by_ranks_and_tag('', [0, 1], 2)
        buf1 = buf0
        buf1_work = dist.all_reduce(buf1, async_op=True, group=buf1_pg, op=fun_col_impl._str_to_reduce_op('sum'))
        fun_col_impl._register_tensor_work(buf1, buf1_work)
        del buf1
        buf0 = _wait_tensor(buf0)
        buf2 = buf0
        buf3 = buf0; del buf0  # reuse
        # Source Nodes: [relu], Original ATen: [aten.relu]
        stream1 = get_cuda_stream(1)
        triton_poi_fused_relu_0.run(arg0_1, buf3, 16, grid=grid(16), stream=stream1)
        del arg0_1
        buf4 = empty_strided((4, 4), (4, 1), device='cuda', dtype=torch.float32)
        # Source Nodes: [add, relu], Original ATen: [aten.add, aten.relu]
        extern_kernels.addmm(buf2, buf3, buf3, alpha=1, beta=1, out=buf4)
        return (buf4, )
```
We can see that the allreduce input (`buf1`, which is an alias of `buf0`) is incorrectly reused as the input (`buf3`) to the in-place triton kernel `triton_poi_fused_relu_0`, diverging from eager-mode logic.

In general, we should make it so that Inductor doesn't try to reuse the input buffer to an inplace functional collective.

We have a similar problem for output buffer of out-of-place functional collectives, see https://github.com/pytorch/pytorch/issues/108780#issuecomment-1714921994.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108811
Approved by: https://github.com/Chillee, https://github.com/wconstab
2023-09-13 21:39:37 +00:00
Ying Zhang
097fd43f8c [Inductor CUTLASS backend] Step 4: CUDA (template) kernels (#107931)
This is step 4 of adding CUTLASS as an alternative Inductor backend.
Full tests can be found in the last PR in the stack.

Feature request: https://github.com/pytorch/pytorch/issues/106991.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107931
Approved by: https://github.com/aakhundov, https://github.com/jansel, https://github.com/kadeng
ghstack dependencies: #107802, #107847, #107901
2023-09-12 17:44:38 +00:00
PyTorch MergeBot
5a7c008b30 Revert "[ROCm] Add ROCm AMDGPU support for inductor cpp codegen (#105141)"
This reverts commit 8ff00360a4.

Reverted https://github.com/pytorch/pytorch/pull/105141 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/105141#issuecomment-1715629007))
2023-09-12 12:29:55 +00:00
Jack Taylor
8ff00360a4 [ROCm] Add ROCm AMDGPU support for inductor cpp codegen (#105141)
Follows from previous enablement attempt: https://github.com/pytorch/pytorch/pull/101797

Adds support for hsaco binaries in inductor's cpp_wrapper codegen and enables the CUDA tests in test_cpp_wrapper.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105141
Approved by: https://github.com/jansel
2023-09-09 16:28:56 +00:00