Commit Graph

215 Commits

Author SHA1 Message Date
Wei Wei
b19cf868e8 Back out "Support fp8 in AOTInductor + support optional<> in C ABI (#112527)" (#113747)
Test Plan: sandcastle

Differential Revision: D51330618

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113747
Approved by: https://github.com/chenyang78, https://github.com/khabinov
2023-11-15 22:42:22 +00:00
Yang Chen
a144eb502a [aotinductor] add versions for the sdpa shim api (#113487)
In our first implementation of the sdpa shim api, we didn't consider
the case where the optional scale argument could be None. This went
unnoticed because we always got a default argument for the cuda backend;
the issue was detected with the cpu backend.

This PR implements versioning for shim kernels. Currently, only the
sdpa api has multiple versions. We expect to maintain only a very
small number of abi-compatible shim APIs with multiple versions.
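
A rough sketch of the scheme, with hypothetical entry-point names and
parameter lists (not necessarily the committed API): the v1 entry point
keeps its old signature for ABI stability, while v2 passes the optional
scale by pointer so that nullptr maps to None.

```
// v1: kept unchanged for ABI stability; cannot express a None scale.
extern "C" AOTITorchError aoti_torch__scaled_dot_product_flash_attention(
    AtenTensorHandle query, /* ...other params..., */ double scale,
    AtenTensorHandle* ret);

// v2: optional<double> passed by pointer; nullptr means std::nullopt.
extern "C" AOTITorchError aoti_torch__scaled_dot_product_flash_attention_v2(
    AtenTensorHandle query, /* ...other params..., */ const double* scale,
    AtenTensorHandle* ret);
```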

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113487
Approved by: https://github.com/int3, https://github.com/desertfire
2023-11-13 20:18:58 +00:00
Oguz Ulgen
6ea20f5dc5 [AOTI] Use expr_printer to print sympy expr (#113317)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113317
Approved by: https://github.com/aakhundov, https://github.com/chenyang78
2023-11-13 20:14:04 +00:00
Jez Ng
7afb503e3c [inductor] Label align() with [[maybe_unused]] (#113502)
This squelches the "defined but not used" warning that occurs when
memory planning is disabled.
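
For illustration, the attribute applied to a helper like this (a sketch;
the real function's body may differ) suppresses the warning without
changing behavior:

```
// [[maybe_unused]] tells the compiler not to warn when this helper is
// never referenced, e.g. in builds where memory planning is disabled.
[[maybe_unused]] static int64_t align(int64_t nbytes) {
  constexpr int64_t kAlignment = 64;  // assumed alignment for illustration
  return (nbytes + kAlignment - 1) & -kAlignment;
}
```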

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113502
Approved by: https://github.com/jansel
2023-11-12 16:33:47 +00:00
Jez Ng
5e03af8295 [inductor] Enable floor_div indexing to work under ABI-compat mode (#113276)
Previously, floor_div operations were defined in
ATen/native/BinaryOps.h. Since this header was not included under
ABI-compat mode, trying to use those indexing operations would result in
compilation errors.

Technically, it is safe to use aten::native::floor_div_* functions in
ABI-compat mode as they are header-only; we could simply include
BinaryOps.h. However, there are other declarations in BinaryOps.h that
are not binary-compatible, so this is not ideal. Thus, I have moved those
functions into a separate file, and put them under c10/util, since they
don't really have tensor-specific logic.

c10 functions are not all header-only, so this still isn't ideal, but
it seems like an improvement nonetheless. Moreover, cpp_prefix.h -- used
when compiling cpp kernels -- already includes c10 header files, so
ABI-compatibility already depends on keeping some c10 functions
header-only.
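
A minimal sketch of the kind of header-only helper being moved (the name
and exact placement under c10/util are illustrative):

```
// Integer floor division: round toward negative infinity rather than
// toward zero, which is what C++ integer division does by default.
template <typename T>
inline T div_floor_integer(T a, T b) {
  T q = a / b;
  T r = a % b;
  // Adjust when there is a remainder and the operands' signs differ.
  return (r != 0 && ((r < 0) != (b < 0))) ? q - 1 : q;
}
```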

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113276
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2023-11-11 02:51:29 +00:00
Oguz Ulgen
06dc2f162d [AOTI] Implement support for user defined kernels that use triton.autotune (#113229)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113229
Approved by: https://github.com/chenyang78
2023-11-10 22:40:51 +00:00
Jez Ng
a2c32b8bd0 [inductor] Make codegen/{common,wrapper,cuda/cutlass_utils}.py pass follow_imports typechecking (#113411)
SymIntType is referenced by wrapper.py, so I added its .pyi definition.
I also added SymBoolType along the way for completeness.

The `isinstance` checks in wrapper.py reference torch.Type, which seems
to cause mypy to choke. I'm not entirely sure why; I've just added
type-ignore comments for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113411
Approved by: https://github.com/Skylion007
ghstack dependencies: #113409, #113410
2023-11-10 19:58:08 +00:00
PyTorch MergeBot
2cd8c0565c Revert "[AOTI] Implement support for user defined kernels that use triton.autotune (#113229)"
This reverts commit 1488bafb27.

Reverted https://github.com/pytorch/pytorch/pull/113229 on behalf of https://github.com/PaliC due to breaking test_aot_inductor.py tests though a forward fix is coming ([comment](https://github.com/pytorch/pytorch/pull/113229#issuecomment-1806159396))
2023-11-10 17:46:14 +00:00
Oguz Ulgen
1488bafb27 [AOTI] Implement support for user defined kernels that use triton.autotune (#113229)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113229
Approved by: https://github.com/chenyang78
2023-11-10 01:39:00 +00:00
Jez Ng
297c26bb8e Support fp8 in AOTInductor + support optional<> in C ABI (#112527)
This was originally ipiszy's PR: https://github.com/pytorch/pytorch/pull/112358

It turns out that we need to add support for optional types in order to
support fp8 gemm (i.e. scaled_mm). Since our ABI-stable C interface
can't support optional<> directly, I am passing in optional types via
pointer instead.

`AtenTensorHandle`s are already pointers, so nothing needs to change
there. Only value types need to change.

We decided on this approach instead of adding an extra `bool` param to
the callee because this simplifies things. Having the same number of
arguments regardless of whether we are emitting Python / C++ /
ABI-compatible C++ makes codegen easier.
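
A sketch of the convention, with a made-up function name (not the
committed API):

```
// optional<double> scale becomes const double*: nullptr encodes nullopt,
// a non-null pointer encodes a present value. Handles are unchanged.
extern "C" AOTITorchError aoti_torch_example_scaled_op(
    AtenTensorHandle self,
    const double* scale,  // None -> nullptr; Some(s) -> &s
    AtenTensorHandle* ret);

// Caller side, as codegen would emit it:
//   double scale = 0.5;
//   aoti_torch_example_scaled_op(self, &scale, &ret);   // scale = 0.5
//   aoti_torch_example_scaled_op(self, nullptr, &ret);  // scale = None
```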

There are a number of existing ABI-compatible functions that have
optional-typed value parameters. Previously, they just assumed they
would never be passed a `nullopt` / `None` at runtime. Changing them to
use pointer types now would break ABI stability, so I have created an
exclude list for those functions.

Finally, I think the current implementation is kind of messy, and only
works for FallbackKernels, even though technically ExternKernels could
also have the same issue. It also doesn't support optional types nested
in lists. I've left FIXME comments for both issues.

Differential Revision: [D51084289](https://our.internmc.facebook.com/intern/diff/D51084289)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112527
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2023-11-08 22:56:48 +00:00
Oguz Ulgen
8ba11bf79d [AOTI] Support non auto-tuned triton kernels in aoti (#113090)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113090
Approved by: https://github.com/aakhundov, https://github.com/chenyang78, https://github.com/desertfire
2023-11-08 07:48:15 +00:00
Oguz Ulgen
dbf44dffc9 [Inductor] Cache generated user defined triton kernels on tensor dtype and non tensor parameters (#112752)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112752
Approved by: https://github.com/jansel
2023-11-07 05:29:16 +00:00
Oguz Ulgen
13d62e28a3 [Inductor] Add Dynamic shape support to user defined triton kernels (#112523)
1) This PR moves the grid function codegen to the wrapper so that we can
   use IndentBuffers instead of manually adding tabs for indentation.
2) It emits the grid function in the body of the kernel call so that the
   grid can use free symbols from dynamic shapes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112523
Approved by: https://github.com/Chillee
2023-11-02 23:58:50 +00:00
Jez Ng
ae85ba820f [inductor] Memory planning (#112178)
This was originally @jansel's PR:
https://github.com/pytorch/pytorch/pull/102625, which I've built upon.

This diff implements static memory planning. It's disabled by default
while we examine its performance.

We use a greedy-by-size approach. For dynamic shapes, the sizes of the
example inputs are used as estimates when making planning decisions. We
generate expressions to calculate the actual memory offsets and sizes at
runtime when the values of the dynamic shapes are known. In order to
simplify these calculations, we have organized the allocations into a
tree that branches on space (address offsets) and time (live ranges).
Finally, we need to align these offsets, so we have added an `align`
sympy Expr to express these calculations.
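
As a rough sketch of the greedy-by-size placement (standalone C++, with
all names made up; the actual planner is more involved and emits sympy
expressions rather than concrete offsets):

```
#include <algorithm>
#include <cstdint>
#include <vector>

struct Block {
  int64_t size, start, end;  // estimated size; [start, end) live range
  int64_t offset = 0;        // assigned address offset
};

static int64_t align64(int64_t x) { return (x + 63) & -64; }

void plan(std::vector<Block>& blocks) {
  // Greedy by size: place the largest allocations first.
  std::sort(blocks.begin(), blocks.end(),
            [](const Block& a, const Block& b) { return a.size > b.size; });
  std::vector<const Block*> placed;
  for (auto& blk : blocks) {
    int64_t offset = 0;
    // Bump the candidate offset past every already-placed block that
    // overlaps this one in both time (live range) and space (addresses).
    for (bool moved = true; moved;) {
      moved = false;
      for (const Block* p : placed) {
        bool live = blk.start < p->end && p->start < blk.end;
        bool addr = offset < p->offset + p->size &&
                    p->offset < offset + blk.size;
        if (live && addr) {
          offset = align64(p->offset + p->size);
          moved = true;
        }
      }
    }
    blk.offset = offset;
    placed.push_back(&blk);
  }
}
```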

Some limitations:

1. It is only enabled during inference for now. Enabling it for training
   increases peak memory usage as we allocate all the memory needed for
   training upfront, before freeing the memory allocated during
   inference. We can probably address this by doing planning for both
   the inference and training passes together.
2. It doesn't work with PyTorch Distributed, because kernels like
   AllGatherIntoTensor codegen strings which do memory operations. We
   can fix this down the line by having them emit MemoryPlanningLines
   instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112178
Approved by: https://github.com/desertfire, https://github.com/jansel
2023-11-02 07:39:13 +00:00
David Berard
8191fb3e06 [Reland2] [inductor][BE] split triton_meta and inductor_meta (#112351)
triton_meta is intended to be passed directly to triton. Previously, we were also putting other metadata into triton_meta; we should instead split that other metadata into a separate dict to avoid possible conflicts in the future.

This PR splits out triton_meta and inductor_meta so we have a place to put additional metadata that isn't intended to be passed to triton.

Tests - wait for CI

Differential Revision: [D50864493](https://our.internmc.facebook.com/intern/diff/D50864493)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112351
Approved by: https://github.com/eellison
2023-11-02 00:40:12 +00:00
PyTorch MergeBot
74e6c877e9 Revert "[inductor] Memory planning (#112178)"
This reverts commit f64a97c6f8.

Reverted https://github.com/pytorch/pytorch/pull/112178 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it seems that ROCm will need to be fixed for the new test too f64a97c6f8 ([comment](https://github.com/pytorch/pytorch/pull/112178#issuecomment-1788195311))
2023-11-01 00:03:56 +00:00
Jez Ng
f64a97c6f8 [inductor] Memory planning (#112178)
This was originally @jansel's PR:
https://github.com/pytorch/pytorch/pull/102625, which I've built upon.

This diff implements static memory planning. It's disabled by default
while we examine its performance.

We use a greedy-by-size approach. For dynamic shapes, the sizes of the
example inputs are used as estimates when making planning decisions. We
generate expressions to calculate the actual memory offsets and sizes at
runtime when the values of the dynamic shapes are known. In order to
simplify these calculations, we have organized the allocations into a
tree that branches on space (address offsets) and time (live ranges).
Finally, we need to align these offsets, so we have added an `align`
sympy Expr to express these calculations.

Some limitations:

1. It is only enabled during inference for now. Enabling it for training
   increases peak memory usage as we allocate all the memory needed for
   training upfront, before freeing the memory allocated during
   inference. We can probably address this by doing planning for both
   the inference and training passes together.
2. It doesn't work with PyTorch Distributed, because kernels like
   AllGatherIntoTensor codegen strings which do memory operations. We
   can fix this down the line by having them emit MemoryPlanningLines
   instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112178
Approved by: https://github.com/desertfire, https://github.com/jansel
2023-10-31 20:02:30 +00:00
Yang Chen
94f3df27e4 [aotinductor] reland: return a copy of any constant (#112370)
When the model returns a constant, we cannot "release" its handle,
because the constant doesn't have any handle at all. Instead,
we should allocate a new tensor and then return a copy of the constant.
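
In sketch form (the helper name is an assumption, not necessarily the
call the commit uses):

```
// A constant has no owning handle to transfer, so allocate a new tensor
// holding a copy of its contents and hand that back instead.
AtenTensorHandle copy = nullptr;
AOTI_TORCH_ERROR_CODE_CHECK(
    aoti_torch_clone(constant_handle, &copy));  // assumed clone-style shim
output_handles[i] = copy;  // caller now owns a real, releasable handle
```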

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112370
Approved by: https://github.com/hl475, https://github.com/desertfire
2023-10-31 18:36:44 +00:00
chunyuan
f50ec341bc inductor cpp wrapper: add GIL release and acquire (#111888)
Support multi-instance inference (in different threads of the same process), as in https://github.com/pytorch/pytorch/issues/93524#issuecomment-1421816158.
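
A minimal sketch of the pattern, using pybind11 scope guards (the actual
wrapper codegen may use the raw CPython API instead):

```
#include <pybind11/pybind11.h>

void run_compiled_model(/* inputs */) {
  // Release the GIL while the C++ kernels execute so other Python threads
  // can run their own inference in the same process concurrently.
  pybind11::gil_scoped_release no_gil;
  // ... launch inductor-generated kernels ...
}  // GIL re-acquired automatically when no_gil goes out of scope
```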

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111888
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/desertfire
2023-10-31 03:23:30 +00:00
Oguz Ulgen
1250032c2e [Inductor] Add triton.autotune support for user defined triton kernels with complex grids (#112290)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112290
Approved by: https://github.com/jansel
2023-10-30 17:48:27 +00:00
Oguz Ulgen
c14c4efc0e [Inductor] Add triton.autotune support for user defined triton kernels with constant/simple grids (#112228)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112228
Approved by: https://github.com/jansel
2023-10-28 17:30:35 +00:00
PyTorch MergeBot
8d44999183 Revert "[Inductor] Add triton.autotune support for user defined triton kernels with constant/simple grids (#112228)"
This reverts commit dbb31a2984.

Reverted https://github.com/pytorch/pytorch/pull/112228 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing ROCm test in trunk dbb31a2984 ([comment](https://github.com/pytorch/pytorch/pull/112228#issuecomment-1783660326))
2023-10-28 01:51:32 +00:00
Oguz Ulgen
dbb31a2984 [Inductor] Add triton.autotune support for user defined triton kernels with constant/simple grids (#112228)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112228
Approved by: https://github.com/jansel
2023-10-27 21:40:22 +00:00
Bin Bao
f66cc67562 [aotinductor] Fix duplicated unbacked symbol declarations (#111823)
Summary: For https://github.com/pytorch/pytorch/issues/111711

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111823
Approved by: https://github.com/ezyang, https://github.com/aakhundov
2023-10-26 21:11:08 +00:00
angelayi
b126adcdee [aotinductor] Pass TorchIR to AOTInductor (#110020)
Updates `_export.aot_compile` to pass a Torch IR graph to inductor, allowing inductor to run the pre_grad_passes and reuse more of its code.
Also updates the API to return only the `so_path`, rather than also returning the exported program. The pytree call spec is now serialized and placed inside the generated model code. Because no c++ pytree implementation is linked yet, when calling the model we can access the call specs through `get_call_spec()` and do the pytree flatten/unflatten in Python.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110020
Approved by: https://github.com/desertfire
2023-10-26 15:54:31 +00:00
Oguz Ulgen
a29a844938 [Inductor] Support top level constants in user defined triton kernels (#111970)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111970
Approved by: https://github.com/jansel
ghstack dependencies: #111956
2023-10-25 02:43:51 +00:00
Oguz Ulgen
bb550b25c9 [Inductor] Support user defined triton kernels calling other triton kernels and activation functions (#111956)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111956
Approved by: https://github.com/jansel
2023-10-25 02:39:43 +00:00
Oguz Ulgen
ddcf9c050b [Inductor] Support calling user defined kernels with different type of arguments (#111939)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111939
Approved by: https://github.com/jansel, https://github.com/zou3519
ghstack dependencies: #111770, #111808
2023-10-24 19:49:48 +00:00
Scott Wolchok
a0043d4840 [PyTorch] AOTI: cache dtypes and device types at DSO load (#111820)
Calling the `aoti_torch_{device_type,dtype}` functions on
each iteration can impose high costs on overhead-bound CPU models
because they can't be inlined across a DSO boundary. If we call them
on load, we can use simple load instructions at run time.
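
Roughly (the caching mechanism and variable names here are illustrative;
the commit itself names the `aoti_torch_{device_type,dtype}` queries):

```
// Query the shim once at DSO load; afterwards every iteration reads a
// plain global instead of making a non-inlinable cross-DSO call.
static int32_t cached_torch_device_type_cpu;
static int32_t cached_torch_dtype_float32;

__attribute__((constructor)) static void init_cached_ids() {
  cached_torch_device_type_cpu = aoti_torch_device_type_cpu();
  cached_torch_dtype_float32 = aoti_torch_dtype_float32();
}
```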

Differential Revision: [D50563682](https://our.internmc.facebook.com/intern/diff/D50563682/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111820
Approved by: https://github.com/chenyang78, https://github.com/desertfire
ghstack dependencies: #111815, #111816
2023-10-24 18:37:26 +00:00
Scott Wolchok
6afd00a318 [PyTorch] AOTI: use array of constants (#111815)
We continue to allow the user to set constants with a map, but under the hood we use an array of constants.

model_container thought it was OK to hand over the map, assume we just
kept a pointer, and then mutate the map later; I had to fix that. I
hope there aren't other sites that do the same thing...

Differential Revision: [D50111512](https://our.internmc.facebook.com/intern/diff/D50111512/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111815
Approved by: https://github.com/jansel, https://github.com/desertfire
2023-10-24 18:37:18 +00:00
Jez Ng
cbc6213f5d [inductor] Defer memory operation lowering to wrapper (#111402)
Right now, memory ops are being lowered to strings partly in
scheduler.codegen() and partly in wrapper.codegen(). But that makes
static memory planning (which is done entirely in `wrapper.codegen()`)
difficult to implement as information is "lost" by that point.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111402
Approved by: https://github.com/jansel
2023-10-24 03:47:56 +00:00
Jez Ng
e264b42a2e [re-land][inductor] Refactor and optimize allocation calls (#111117) (#111511)
Summary:
This is a re-land of https://github.com/pytorch/pytorch/pull/111117 with
updates to our internal tests included.

This splits out changes from
https://github.com/pytorch/pytorch/pull/102625 to make things easier to
review.

This diff creates a `make_allocation()` method that extracts the logic
from `make_buffer_allocation()` while allowing us to allocate non-buffer
objects. In particular, we will use this to allocate memory pools during
memory planning.

This diff also includes a small optimization -- if the desired
allocation is contiguous, then we emit a call to `empty()` instead of
`empty_strided()` with its superfluous stride argument.

Test Plan: contbuild & OSS CI, see 9ce0ae836d

Differential Revision: D50429424

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111511
Approved by: https://github.com/jansel
2023-10-23 19:18:32 +00:00
Oguz Ulgen
2b2b6caf8f [inductor] Implement clone removal for user defined triton kernel via reinplace_scatters (#111627)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111627
Approved by: https://github.com/jansel
ghstack dependencies: #111434
2023-10-22 22:28:00 +00:00
Oguz Ulgen
977d3bcc46 [Inductor] Support user defined triton kernels in inductor (#111434)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111434
Approved by: https://github.com/jansel
2023-10-22 17:04:19 +00:00
Jason Ansel
a1154e673b [Compiled Autograd] Turn accumulate_grad into an op (#111700)
Relands #111271

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111700
Approved by: https://github.com/voznesenskym
2023-10-21 17:31:09 +00:00
PyTorch MergeBot
3eb5cae3af Revert "[Compiled Autograd] Turn accumulate_grad into an op (#111271)"
This reverts commit 04b04c0686.

Reverted https://github.com/pytorch/pytorch/pull/111271 on behalf of https://github.com/jeanschmidt due to Breaking internal CI ([comment](https://github.com/pytorch/pytorch/pull/111271#issuecomment-1768527932))
2023-10-18 14:02:34 +00:00
PyTorch MergeBot
ed7739d690 Revert "[aot_inductor] return a copy of any constant (#111356)"
This reverts commit 71e1f34923.

Reverted https://github.com/pytorch/pytorch/pull/111356 on behalf of https://github.com/jeanschmidt due to Breaking internal ci ([comment](https://github.com/pytorch/pytorch/pull/111356#issuecomment-1768503640))
2023-10-18 13:51:30 +00:00
PyTorch MergeBot
08f580d498 Revert "[inductor] Refactor and optimize allocation calls (#111117)"
This reverts commit 9ce0ae836d.

Reverted https://github.com/pytorch/pytorch/pull/111117 on behalf of https://github.com/jeanschmidt due to Breaking internal CI ([comment](https://github.com/pytorch/pytorch/pull/111117#issuecomment-1768489865))
2023-10-18 13:45:02 +00:00
soulitzer
2dc1726ab7 Compile NestedTensor with AOTAutograd (#110529)
This PR has a number of changes that improve subclass support for AOTAutograd/Inductor in general:
-  previously, if a subclass did extra aliasing between graph outputs/inputs, the partitioner would complain, because grad_outputs are the outputs reused as-is. Now we do a view_as(self) to work around this.
- Use dense -> dense metadata when working with fwd_output_strides during backward. This is important since the stride information comes from inductor, which sees the dense-to-dense graph.
- Inductor requires the inputs to the compiled backward to match expected strides computed during compilation. We now make the inner tensors of the subclass contiguous (previously, we only made the subclass itself contiguous)

Changes specific to NestedTensor relevant to compilation:
- Properly handle the case where `__tensor_unflatten__` is passed non-symbolic dense tensors with metadata extracted from fake subclasses.
- Skip var_to_range logic for singleton int
- Skip size hint logic in inductor for singleton int

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110529
Approved by: https://github.com/bdhirsh
2023-10-17 21:17:10 +00:00
Yang Chen
71e1f34923 [aot_inductor] return a copy of any constant (#111356)
When the model returns a constant, we cannot "release" its handle,
because the constant doesn't have any handle at all. Instead,
we should allocate a new tensor and then return a copy of the constant.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111356
Approved by: https://github.com/hl475
2023-10-17 08:44:21 +00:00
Jez Ng
9ce0ae836d [inductor] Refactor and optimize allocation calls (#111117)
This splits out changes from
https://github.com/pytorch/pytorch/pull/102625 to make things easier to
review.

This diff creates a `make_allocation()` method that extracts the logic
from `make_buffer_allocation()` while allowing us to allocate non-buffer
objects. In particular, we will use this to allocate memory pools during
memory planning.

This diff also includes a small optimization -- if the desired
allocation is contiguous, then we emit a call to `empty()` instead of
`empty_strided()` with its superfluous stride argument.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111117
Approved by: https://github.com/jansel
2023-10-17 03:06:52 +00:00
Jason Ansel
04b04c0686 [Compiled Autograd] Turn accumulate_grad into an op (#111271)
Rather than baking the behavior of `AccumulateGrad` nodes into the generated graph (either as `+=` or as a return value of the graph), this creates a new `accumulate_grad_` dispatcher op that is included in the generated graph like:
```
def forward(self, inputs, sizes, hooks):
    getitem = inputs[0]
    getitem_1 = inputs[1]
    getitem_2 = inputs[2]
    getitem_3 = inputs[3]
    getitem_4 = inputs[4]
    getitem_5 = inputs[5]
    getitem_6 = inputs[6]
    getitem_7 = inputs[7]
    getitem_8 = inputs[8]
    getitem_9 = inputs[9];  inputs = None
    expand = torch.ops.aten.expand.default(getitem, [2, 4]);  getitem = None
    threshold_backward = torch.ops.aten.threshold_backward.default(expand, getitem_1, 0);  expand = getitem_1 = None
    t = torch.ops.aten.t.default(getitem_3);  getitem_3 = None
    mm = torch.ops.aten.mm.default(threshold_backward, t);  t = None
    t_1 = torch.ops.aten.t.default(threshold_backward)
    mm_1 = torch.ops.aten.mm.default(t_1, getitem_2);  t_1 = getitem_2 = None
    t_2 = torch.ops.aten.t.default(mm_1);  mm_1 = None
    sum_1 = torch.ops.aten.sum.dim_IntList(threshold_backward, [0], True);  threshold_backward = None
    view = torch.ops.aten.view.default(sum_1, [4]);  sum_1 = None
    t_3 = torch.ops.aten.t.default(t_2);  t_2 = None
    accumulate_grad_ = torch.ops.inductor.accumulate_grad_.default(getitem_4, t_3);  getitem_4 = t_3 = None
    threshold_backward_1 = torch.ops.aten.threshold_backward.default(mm, getitem_5, 0);  mm = getitem_5 = None
    t_4 = torch.ops.aten.t.default(threshold_backward_1)
    mm_2 = torch.ops.aten.mm.default(t_4, getitem_6);  t_4 = getitem_6 = None
    t_5 = torch.ops.aten.t.default(mm_2);  mm_2 = None
    sum_2 = torch.ops.aten.sum.dim_IntList(threshold_backward_1, [0], True);  threshold_backward_1 = None
    view_1 = torch.ops.aten.view.default(sum_2, [4]);  sum_2 = None
    t_6 = torch.ops.aten.t.default(t_5);  t_5 = None
    accumulate_grad__1 = torch.ops.inductor.accumulate_grad_.default(getitem_7, t_6);  getitem_7 = t_6 = None
    accumulate_grad__2 = torch.ops.inductor.accumulate_grad_.default(getitem_8, view_1);  getitem_8 = view_1 = None
    accumulate_grad__3 = torch.ops.inductor.accumulate_grad_.default(getitem_9, view);  getitem_9 = view = None
    return []

```

The motivation here is that `AccumulateGrad` nodes are causing trouble in FSDP tracing, since FSDP resizes parameters and parameter storage in place in hooks. We will model this mutation in dynamo, but not during the initial compiled autograd capture; this allows us to bypass failing shape checks in the initial capture.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111271
Approved by: https://github.com/voznesenskym
2023-10-16 21:16:17 +00:00
Scott Wolchok
84975339bd [PyTorch] AOTI: generate reused thread_locals when tensors provably have static shape (#110892)
If a Tensor can be reused and has static shape, we can just cache it across iterations.

This is meant as a quickly shippable overhead reduction for CPU overhead-bound use cases, without relying on memory planning.
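
A sketch of the generated pattern (buffer name, shapes, and the exact
shim signature are assumptions):

```
// A temporary with provably static shape is allocated once per thread
// and reused on subsequent iterations instead of re-allocated each call.
static thread_local AtenTensorHandle cached_buf0 = nullptr;
if (cached_buf0 == nullptr) {
  const int64_t sizes[] = {8, 16};
  const int64_t strides[] = {16, 1};
  AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_empty_strided(
      2, sizes, strides, cached_torch_dtype_float32,
      cached_torch_device_type_cuda, /*device_index=*/0, &cached_buf0));
}
// ... use cached_buf0 as scratch for this iteration ...
```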

Differential Revision: [D50023678](https://our.internmc.facebook.com/intern/diff/D50023678/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110892
Approved by: https://github.com/bertmaher
ghstack dependencies: #110876, #110877, #110909
2023-10-13 16:07:05 +00:00
Bin Bao
6b4c686b9a [aotinductor] Forward fix a performance regression (#110800)
Summary: Forward fix a performance regression caused by https://github.com/pytorch/pytorch/pull/110510. When a model is run once, all those kernel pointers are initialized, and removing the if-nullptr check causes those loadKernel calls to be unnecessarily executed again when we rerun the forward function. Another way to do this is to codegen loadKernel in the initializer, which I may do in a later PR.
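
The restored guard, in sketch form (helper and kernel names illustrative):

```
// Without the if-nullptr check, every call to forward() would reload the
// kernel from the cubin even though the pointer is already initialized.
static CUfunction triton_kernel_0 = nullptr;
if (triton_kernel_0 == nullptr) {
  triton_kernel_0 = loadKernel(cubin_path, "triton_kernel_0");  // assumed helper
}
```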

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110800
Approved by: https://github.com/jansel
2023-10-08 04:06:44 +00:00
Adnan Akhundov
abb00f66d8 [inductor] Add AOTI ABI shim function for repeat_interleave.Tensor (#110745)
Summary: `repeat_interleave.Tensor` doesn't have an inductor lowering. To invoke the operator in AOT Inductor's ABI compatibility mode, we need a dedicated shim function.
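
The shim entry point has roughly this shape (the exact committed name and
parameter list are assumptions):

```
extern "C" AOTITorchError aoti_torch_repeat_interleave_Tensor(
    AtenTensorHandle repeats,
    int64_t* output_size,   // optional: nullptr when not provided
    AtenTensorHandle* ret);
```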

Test Plan:

```
$ python test/inductor/test_aot_inductor.py -k test_repeat_interleave
...
----------------------------------------------------------------------
Ran 4 tests in 70.526s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110745
Approved by: https://github.com/chenyang78
ghstack dependencies: #110713
2023-10-07 08:18:01 +00:00
Bin Bao
298f01d9a2 [aotinductor] Avoid generating redundant kernel loading code (#110510)
Summary:
1) Stop forcing triton.unique_kernel_names to True for AOTInductor, because the unique kernel name can be read from metadata.
2) Only generate load_kernel once for each kernel, since we don't have control flow in our generated code.

This solves https://github.com/pytorch/pytorch/issues/105553.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110510
Approved by: https://github.com/chenyang78, https://github.com/jansel
2023-10-05 19:59:38 +00:00
Sherlock Huang
f1b94461aa [AOTInductor] ProxyExecutor support Dynamic Shape (#110526)
Summary:
Extend ProxyExecutor to support dynamic shapes.

Example of ProxyExecutor invocation with symints.
```
    int64_t* arg0_1_size;
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_get_sizes(arg0_1, &arg0_1_size));
    auto s0 = arg0_1_size[0];
    auto s1 = arg0_1_size[1];
    int64_t* arg1_1_size;
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_get_sizes(arg1_1, &arg1_1_size));
    auto s2 = arg1_1_size[0];
    auto s3 = arg1_1_size[1];
    ...
    aoti_torch_proxy_executor_call_function(proxy_executor, 0, 15, std::vector<int64_t>{42, 16, 17, s0 + s1, s0 + s1, s2*s3, 45, 67, 16, 17, s2*s3, s2*s3, s0 + s1, 89, 910}.data(), 7, std::vector<AtenTensorHandle>{arg0_1, arg0_1, arg1_1, buf2, arg0_1, arg1_1, buf4}.data());
```

Example of serialized SymInt(s) arguments:
```
          {
            "name": "symint",
            "arg": {
              "asSymInt": {
                "asName": "s0 + s1"
              }
            }
          },
          {
            "name": "symints",
            "arg": {
              "asSymInts": [
                {
                  "asName": "s0 + s1"
                },
                {
                  "asName": "s2*s3"
                }
              ]
            }
          },
          ...
          {
            "name": "o_symint",
            "arg": {
              "asSymInt": {
                "asName": "s2*s3"
              }
            }
          },
          {
            "name": "o_symints",
            "arg": {
              "asSymInts": [
                {
                  "asName": "s2*s3"
                },
                {
                  "asName": "s0 + s1"
                }
              ]
            }
          },
```

Test Plan: buck2 run mode/dev-nosan deeplearning/aot_inductor/test:test_custom_ops

Differential Revision: D49887555

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110526
Approved by: https://github.com/chenyang78
2023-10-05 19:05:20 +00:00
Oleg Khabinov
cf1b494afd [AOTInductor] Store loaded kernels in the model (#110554)
Defining kernels as static vars is problematic for subsequent model loading on non-default CUDA devices.

If those kernels were loaded in the context of device #0, they are no longer nullptr afterwards, so they will never be reloaded and thus won't work on devices other than device #0.

This change stores the loaded kernels at the model level in AOT mode.
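
In sketch form (member and class names illustrative): the handles move
from function-local statics, which are shared by every model instance,
to per-model members.

```
class AOTInductorModel /* ... */ {
  // Previously: `static CUfunction triton_kernel_0 = nullptr;` inside the
  // run function, shared across all instances and hence all devices.
  CUfunction triton_kernel_0_{nullptr};  // now per-model, per-device
  // ...
};
```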

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110554
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2023-10-05 10:17:05 +00:00
Kazuaki Ishizaki
434a996c42 Fix typo under torch/_inductor directory (#110530)
This PR fixes typos in comments and messages in files under the `torch/_inductor` directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110530
Approved by: https://github.com/kit1980
2023-10-05 02:17:20 +00:00
Yang Chen
46a5558cd5 [AOTInductor] Simplified AOTInductor interface and model class (#110411)
Summary:
This PR removed several APIs from the AOTInductor interface
that are not used by the client.

It also simplified AOTInductor's model class by removing
the dim info for input/output tensors. We previously included dim info
to return max output shapes, which the client used to allocate memory
for output tensors. Now we allocate output tensor memory inside the
.so, so we no longer need to maintain that information. Deleting the
dim info from the model class also simplified the codegen quite a bit.

Test Plan: ci

Reviewed By: khabinov

Differential Revision: D49835430

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110411
Approved by: https://github.com/khabinov, https://github.com/desertfire, https://github.com/jansel
2023-10-04 18:35:24 +00:00