Summary:
https://docs.google.com/document/d/1QJJEGnj2nHGPODlw38BEG3KLLCOTfdOVjPrNQbz_LM8/edit#bookmark=id.lp80wfshq130
Changes:
* `torch.export` will return a functional ATen graph w/o decompositions
* `exported_program.run_decompositions(decomposition_table)` will optionally take a decomposition table, and run decompositions on the exported program, returning a new exported program. By default we will run the Core ATen decomposition table.
Calling convention for Executorch stays the same:
```
pre_autograd_graph = capture_pre_autograd_graph(f, args, ...)
aten_graph_no_decomps = torch.export.export(pre_autograd_graph, args, ...)
# Within to_edge we decompose to core aten and then convert to edge
edge_graph = exir.to_edge(aten_graph_no_decomps)
```
Test Plan: CI
Differential Revision: D49742989
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110410
Approved by: https://github.com/ydwu4
We want to get to a point where most `UserError`s link to `exportdb` examples. This PR makes passing case names non-optional, to make this intent clearer and to encourage developers who raise `UserError`s to create or point to examples that make fixing such errors more obvious for users.
In addition, sometimes there are multiple examples that are relevant to an error. Thus this PR also enables passing multiple case names.
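For illustration, a hedged sketch of how a `UserError` pointing at `exportdb` might be raised (the exact mechanism for passing multiple case names is internal and may differ):
```python
from torch._dynamo.exc import UserError, UserErrorType

# hypothetical call site; `case_name` links the error to an exportdb example
raise UserError(
    UserErrorType.DYNAMIC_CONTROL_FLOW,
    "Detected data-dependent control flow; consider rewriting with torch.cond.",
    case_name="cond_operands",
)
```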
Retry of #110733, which was reverted due to a landrace.
Differential Revision: [D50087148](https://our.internmc.facebook.com/intern/diff/D50087148/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110878
Approved by: https://github.com/gmagogsfm, https://github.com/tugsbayasgalan
We want to get to a point where most `UserError`s link to `exportdb` examples. This PR makes passing case names non-optional to make this intent clearer and encourage developers who raise `UserError`s to make or point to examples that make fixing such errors more obvious for users.
In addition, sometimes there are multiple examples that are relevant to an error. Thus this PR also enables passing multiple case names.
Differential Revision: [D50020465](https://our.internmc.facebook.com/intern/diff/D50020465/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110733
Approved by: https://github.com/zhxchen17
Previously, `Dim` definitions that shared the same name but had different ranges were allowed to appear in the `dynamic_shapes` argument of an `export` call. They would correspond to the *same* dynamic dimension (identified by the shared name), whose effective range would be the *intersection* of the different ranges.
However, this behavior can be confusing, because different definitions sharing the same name are more likely than not unintentional. Therefore, this PR makes it a user error.
We still allow different definitions with the same name to exist at the same time (no global uniqueness) as long as they are not confused in the same `export` call. Redefinitions with the same bounds are also allowed, in case they are accidentally created by executing the same code multiple times.
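A minimal sketch of what this rule allows and forbids (names and bounds are made up):
```python
from torch.export import Dim

batch = Dim("batch", min=2, max=64)
batch_dup = Dim("batch", min=2, max=64)     # allowed: redefinition with identical bounds
batch_other = Dim("batch", min=1, max=128)  # allowed to exist (no global uniqueness)...

# ...but mixing `batch` and `batch_other` in the `dynamic_shapes` of a single
# export() call now raises a user error instead of silently intersecting the
# two ranges.
```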
Differential Revision: [D49965944](https://our.internmc.facebook.com/intern/diff/D49965944/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110638
Approved by: https://github.com/zhxchen17
Summary: The runtime assertions inserted in `torch._export.export` by the `_AddRuntimeAssertionsForInlineConstraintsPass` lead to errors in AOT Inductor like #109884. In `torch._export.aot_compile`, export and AOT compilation are run consecutively, which would lead to the above issue if any assertions are inserted.
In this PR, we're adding a new parameter / flag to `torch._export.aot_compile`, `remove_runtime_assertions`, to remove the assertions inserted during export before AOT compilation. The flag is set to `False` for BC.
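A hedged sketch of the intended call shape (`torch._export.aot_compile` is a private API, so the exact signature and return value may differ; `model` and `example_args` are placeholders):
```python
import torch

so_path = torch._export.aot_compile(
    model,
    example_args,
    remove_runtime_assertions=True,  # defaults to False for backward compatibility
)
```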
Additionally, we remove the flag `add_runtime_assertions_for_inline_constraints` recently added to `torch._dynamo.config`, as it can lead to undesirable `torch._export` behavior and is no longer required for AOT Inductor testing purposes.
Test Plan: CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110710
Approved by: https://github.com/zhxchen17, https://github.com/chenyang78
Ideally all `_dynamo.exc.UserError`s should have "case names", i.e., link to examples in `exportdb`.
This PR adds case names to several instances of `_dynamo.exc.UserError`. In particular, looking at coverage based on `UserErrorType`:
* `DYNAMIC_CONTROL_FLOW`, `ANTI_PATTERN`, and `STANDARD_LIBRARY` are fully covered.
* `CONSTRAINT_VIOLATION` and `DYNAMIC_DIM` have no coverage. We don't seem to have any dedicated examples of specifying dynamic shapes in `exportdb` (although they are used in some other examples without explanation, to avoid some specialization that would make such examples moot).
* `INVALID_INPUT` is only partly covered. Frankly this is tedious to cover via examples.
Differential Revision: [D49928518](https://our.internmc.facebook.com/intern/diff/D49928518/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110555
Approved by: https://github.com/angelayi, https://github.com/ydwu4
Summary:
See `wrapper.codegen_reinterpret_view()`: it returns a temporary handle for a tensor, which has the following problem.
```
# NB, the return handle here represents a temporary tensor, which will be automatically
# released.
# Here's a sample usage in the cpp wrapper code:
# ```
# aoti_torch_addmm_out(
#     buf1,
#     arg1_1,
#     RAIIAtenTensorHandle(tmp_tensor_handle_0),
#     buf0,
#     1L,
#     1L));
# ```
# RAIIAtenTensorHandle(tmp_tensor_handle_0) will be released after the call to addmm_out.
# This could be problematic when it's used in a different pattern, for example:
# ```
# AtenTensorHandle tensor_args[] = {RAIIAtenTensorHandle(tmp_tensor_handle_2), buf5, buf6};
# aoti_torch_proxy_executor_call_function(..., tensor_args);
# ```
# RAIIAtenTensorHandle(tmp_tensor_handle_2) will be invalid when it's used in the latter
# kernel call.
return f"RAIIAtenTensorHandle({tmp_name})"
```
As a result, ProxyExecutor would generate the following code, which causes an invalid memory access.
Before:
```
// Source Nodes: [fn_with_tuple_output], Original ATen: [fb.fn_with_tuple_output]
AtenTensorHandle tmp_tensor_handle_2;
AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch__reinterpret_tensor(buf3, 2, int_array_0, int_array_1, 0L, &tmp_tensor_handle_2));
...
AtenTensorHandle tensor_args[] = {RAIIAtenTensorHandle(tmp_tensor_handle_2), buf5, buf6};
int64_t int_args[] = {1};
aoti_torch_proxy_executor_call_function(proxy_executor, 1, 1, int_args, 3, tensor_args);
buf3.reset();
```
With the fix in this diff, ProxyExecutor generates the following code instead.
After:
```
// Source Nodes: [fn_with_tuple_output], Original ATen: [fb.fn_with_tuple_output]
AtenTensorHandle tmp_tensor_handle_2;
AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch__reinterpret_tensor(buf3, 2, int_array_0, int_array_1, 0L, &tmp_tensor_handle_2));
...
aoti_torch_proxy_executor_call_function(proxy_executor, 1, 1, std::vector<int64_t>{1}.data(), 3, std::vector<AtenTensorHandle>{RAIIAtenTensorHandle(tmp_tensor_handle_2), buf5, buf6}.data());
buf3.reset();
```
I am not exactly a big fan of using `std::vector{...}.data()` to create a temporary array, but I can't think of another fix. Note that the temporary `std::vector` lives until the end of the full expression, so the pointer returned by `.data()` stays valid for the duration of the kernel call.
Test Plan: buck2 run mode/dev-nosan deeplearning/aot_inductor/test:test_custom_ops
Reviewed By: desertfire
Differential Revision: D49758764
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110451
Approved by: https://github.com/desertfire
Summary:
Previously we used export's `constraints` to specify that all batch-size dimensions are dynamic. This is done by creating one constraint `dynamic_dim(inp[0][0], lower, upper)`, followed by `dynamic_dim(inp[0][0]) == dynamic_dim(inp[i][0])` for every input `i`.
Through the new `dynamic_shapes` API, we can instead use `Dim("batch_size")` on every dimension to specify which dimensions are dynamic and equal to each other, and `None` otherwise: `{i: [Dim("batch_size", lower, upper), None] for every input i}`
Note: `dynamic_shapes` and `constraints` utilize the same "constraints" backend so this diff should be idempotent.
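A minimal sketch of the new-style specification (`model`, `example_inputs`, and the bounds are placeholders):
```python
from torch.export import Dim

# one named Dim, reused across all inputs, marks each batch dimension as
# dynamic and equal to the others; None keeps the second dimension static
batch_size = Dim("batch_size", min=1, max=1024)
dynamic_shapes = tuple((batch_size, None) for _ in example_inputs)
ep = torch.export.export(model, example_inputs, dynamic_shapes=dynamic_shapes)
```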
Test Plan: `buck2 run @//mode/dev-nosan //caffe2/torch/fb/model_transform/experimental/benchmark/test/aotinductor:test_aot_inductor_benchmark`
Reviewed By: chenyang78, aakhundov
Differential Revision: D49784351
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110360
Approved by: https://github.com/desertfire
Summary:
https://docs.google.com/document/d/1QJJEGnj2nHGPODlw38BEG3KLLCOTfdOVjPrNQbz_LM8/edit#bookmark=id.lp80wfshq130
`exported_program.run_decompositions(decomposition_table)` will optionally take a decomposition table, and run decompositions on the exported program, returning a new exported program. By default we will run the Core ATen decomposition table.
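A minimal usage sketch (`model` and `example_args` are placeholders):
```python
from torch._decomp import core_aten_decompositions

ep = torch.export.export(model, example_args)
core_ep = ep.run_decompositions()  # defaults to the Core ATen decomposition table
custom_ep = ep.run_decompositions(core_aten_decompositions())  # explicit table
```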
Splitting this diff from the following one (D49742989) to make migrating Executorch easier:
1. Land this diff
2. Wait for a pytorch nightly to include this diff
3. Update executorch's pytorch nightly
4. Land the following diff to have export() return no decomps
Test Plan: Tested in following diff
Differential Revision: D49743208
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110236
Approved by: https://github.com/gmagogsfm
Summary: when the grid is computed in terms of unbacked `SymInt`s, it can happen that the grid is zero-sized. This causes a CUDA error on `cuLaunchKernel` in the AOT Inductor codegen.
In this PR, when the grid contains unbacked `SymInt`s, a check is added around the `launchKernel` call in AOT Inductor's C++ wrapper codegen to make sure that the grid is not zero-sized.
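An illustrative sketch of the generated guard (identifiers are hypothetical):
```
// grid_0..grid_2 are expressions over unbacked SymInts, only known at runtime;
// skip the launch entirely when any grid dimension is zero
if (grid_0 > 0 && grid_1 > 0 && grid_2 > 0) {
    launchKernel(kernel, grid_0, grid_1, grid_2, num_warps, shared_memory, kernel_args, stream);
}
```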
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110312
Approved by: https://github.com/chenyang78
An `ExportedProgram`'s `__call__` signature is different from the original module, so `dynamic_shapes` that follow the original signature would fail when applied to re-export an `ExportedProgram`.
This PR fixes this issue, in other words, the original `dynamic_shapes` should now work when re-exporting.
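A sketch of the now-working pattern (`model` and the input name are placeholders):
```python
from torch.export import Dim, export

batch = Dim("batch")
dynamic_shapes = {"x": {0: batch}}
ep = export(model, (x,), dynamic_shapes=dynamic_shapes)
# the same original-signature dynamic_shapes now also work on re-export, even
# though the ExportedProgram's __call__ signature differs from the module's
ep2 = export(ep, (x,), dynamic_shapes=dynamic_shapes)
```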
Differential Revision: D49764011
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110276
Approved by: https://github.com/tugsbayasgalan
A resubmit of https://github.com/pytorch/pytorch/pull/108447. Copying over the description:
This is a follow-up of the discussion in https://github.com/pytorch/pytorch/pull/108356, where we want to replace `source_fn` with `source_fn_stack`.
Before this PR, for the following example:
```python
backend = EagerAndRecordGraphs()

@torch.compile(backend=backend, fullgraph=True)
def cond_f(pred, pred2, x, y):
    def true_fn(pred2, x, y):
        return x + y

    def false_fn(pred2, x, y):
        def true_fn2(x, y):
            return x.sin() - y.cos()

        def false_fn2(x, y):
            return x.cos() - y.sin()

        return control_flow.cond(pred2, true_fn2, false_fn2, (x, y))

    return control_flow.cond(pred, true_fn, false_fn, (pred2, x, y))
```
The graph captured is shown below:
```python
class GraphModule(torch.nn.Module):
    def forward(self, L_pred_ : torch.Tensor, L_pred2_ : torch.Tensor, L_x_ : torch.Tensor, L_y_ : torch.Tensor):
        l_pred_ = L_pred_
        l_pred2_ = L_pred2_
        l_x_ = L_x_
        l_y_ = L_y_
        cond_true_1 = self.cond_true_1
        cond_false_1 = self.cond_false_1
        cond = torch.ops.higher_order.cond(l_pred_, cond_true_1, cond_false_1, [l_pred2_, l_x_, l_y_]); l_pred_ = cond_true_1 = cond_false_1 = l_pred2_ = l_x_ = l_y_ = None
        return (cond,)

class GraphModule(torch.nn.Module):
    def forward(self, l_pred2_, l_x_, l_y_):
        add = l_x_ + l_y_; l_x_ = l_y_ = None
        return add

class GraphModule(torch.nn.Module):
    def forward(self, l_pred2_, l_x_, l_y_):
        cond_true_0 = self.cond_true_0
        cond_false_0 = self.cond_false_0
        cond = torch.ops.higher_order.cond(l_pred2_, cond_true_0, cond_false_0, [l_x_, l_y_]); l_pred2_ = cond_true_0 = cond_false_0 = l_x_ = l_y_ = None
        return cond

class GraphModule(torch.nn.Module):
    def forward(self, l_x_, l_y_):
        sin = l_x_.sin(); l_x_ = None
        cos = l_y_.cos(); l_y_ = None
        sub = sin - cos; sin = cos = None
        return sub

class GraphModule(torch.nn.Module):
    def forward(self, l_x_, l_y_):
        cos = l_x_.cos(); l_x_ = None
        sin = l_y_.sin(); l_y_ = None
        sub = cos - sin; cos = sin = None
        return sub
```
the `source_fn` for the inner cond, sin, and cos will be a (name, target) tuple:
```
('cond', <torch._ops.HigherOrderOperator object at xxx>)
('sin', 'sin')
('cos', 'cos')
('sub', <built-in function sub>)
```
After this PR, the `source_fn_stack` will be a list of (name, target) tuples. The bottom of the stack is the end of the list.
```
[('cond', <torch._ops.HigherOrderOperator object at xxx>), ('cond', <torch._ops.HigherOrderOperator object at xxx>)],
[('cond', <torch._ops.HigherOrderOperator object at xxx>), ('cond', <torch._ops.HigherOrderOperator object at xxx>), ('sin', 'sin')],
[('cond', <torch._ops.HigherOrderOperator object at xxx>), ('cond', <torch._ops.HigherOrderOperator object at xxx>), ('cos', 'cos')]
[('cond', <torch._ops.HigherOrderOperator object at xxx>), ('cond', <torch._ops.HigherOrderOperator object at xxx>), ('sub', <built-in function sub>)]
```
Test Plan:
See the added tests in test_higher_order_ops.py and the modified existing tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108595
Approved by: https://github.com/angelayi, https://github.com/zou3519
Recently we updated the `export` API to take an experimental `dynamic_shapes` argument that was meant to subsume the existing `constraints` argument.
This PR deprecates `constraints` (with a warning on its use, but without actually removing it). Simultaneously it replaces all uses of `constraints` in docs, examples, and tests with corresponding uses of `dynamic_shapes` (preserving behavior). This exercise fortunately revealed some minor bugs in the implementation which have also been fixed in this PR.
Some uses of `constraints` still remain, e.g., when `torch._dynamo.export` is called directly. (Meta-internal uses will be updated in a separate diff.)
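For illustration, a hedged before/after sketch of the migration (`model`, the input name, and the bounds are made up):
```python
import torch
from torch.export import Dim, dynamic_dim, export

x = torch.randn(8, 16)

# before (deprecated): constraints built from dynamic_dim
constraints = [dynamic_dim(x, 0) >= 2, dynamic_dim(x, 0) <= 64]
ep = export(model, (x,), constraints=constraints)

# after: a named Dim passed through dynamic_shapes
batch = Dim("batch", min=2, max=64)
ep = export(model, (x,), dynamic_shapes={"x": {0: batch}})
```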
Differential Revision: D49676049
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110143
Approved by: https://github.com/tugsbayasgalan
Based on William's recent diff on preserving node metadata during retracing, we no longer need to skip dynamo when retracing. This softens our previous restriction of not allowing any new constraints from the user side, because we can now utilize dynamo to analyze through constraints. As a result, re-export can technically happen with any new constraints. This opens up another question: is it OK to use looser constraints on the retrace? If we allow looser constraints, we can technically diverge from eager behaviour because, for example, we could have eliminated unsafe control flow based on the previous assumptions. But we can also argue this is OK, because we treat the exported callable as independent from its original source code.
We could technically ban looser constraints inside export, but my concern is that doing special-case checks on `ExportedProgram` breaks the abstraction.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109476
Approved by: https://github.com/avikchaudhuri, https://github.com/zhxchen17
Our experience using `constraints` / `dynamic_dim` with the existing export API has shown it to be (subjectively) clunky and (objectively) verbose in common cases.
This PR implements a new design for the export API that replaces the use of `constraints` / `dynamic_dim` with a new way of specifying dynamic shapes, involving the following concepts:
* a constructor `Dim` for first-class named dynamic dimensions with ranges (similar to `functorch.dim`, and analogous to internal symbolic sizes)
* a mechanism that uses the above in `export` calls to associate inputs to their dynamic shape specifications (`dynamic_shapes`)
Design doc: https://docs.google.com/presentation/d/168U7XK72C_WSsZpGESP6Cho9udh193fi0gfjxCNcJ4E/edit#slide=id.p (Meta-only). Note that we only implement Option 1 in that doc. An older version of this PR also implemented Option 3, which is an alternative way of specifying dynamic shapes using tensor type annotations on the exported callable; but we have moved that to future work for now.
See docs for these new features in `torch.export`. The existing `torch.export.export` is modified to use the new API, `torch._export.export__RC__`, whenever `constraints=None`. We have not deprecated the existing API yet, but will do so in a follow-up.
Constraint violation errors arising through use of the new API will now contain suggested fixes using the new API. No longer do we need to report all specializations for static dimensions and suggest all constraints over dynamic dimensions to fix such errors. Instead, due to the redesign, the suggested fixes are much more concise, only involving modifying the definitions of relevant `Dim`s.
Differential Revision: [D48919204](https://our.internmc.facebook.com/intern/diff/D48919204/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108448
Approved by: https://github.com/suo, https://github.com/gmagogsfm
We now have two types of functionalization, C++ Functionalization (through the `Functionalize` dispatch key), and python functionalization (through the `FunctionalTensorMode` torch_dispatch mode).
This means that all higher order ops need custom functionalization rules for the python variant too. I added them here, as well as a helper function `dispatch_functionalize()` - equivalent to `torch.func.functionalize()`, except that it uses `FunctionalTensorMode`.
In theory we could have secretly switched `torch.func.functionalize` to use `FunctionalTensorMode`. This would be BC-breaking, though, since `FunctionalTensorMode` isn't composable with the other functorch transforms (the functorch layer-mode stack doesn't know how to re-order torch_dispatch modes arbitrarily).
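For context, a hedged sketch of what functionalization does, using the public `torch.func.functionalize` (the new `dispatch_functionalize()` plays the same role via `FunctionalTensorMode`):
```python
import torch
from torch.func import functionalize
from torch.fx.experimental.proxy_tensor import make_fx

def f(x):
    y = x.clone()
    y.add_(1)          # an in-place mutation...
    return y.view(-1)  # ...and a view

# tracing the functionalized version yields a graph containing only
# out-of-place ops (e.g. aten.add instead of aten.add_)
gm = make_fx(functionalize(f))(torch.randn(2, 2))
```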
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108656
Approved by: https://github.com/zou3519
ghstack dependencies: #109024, #109248
Fix: #107315
This PR enables dynamo to trace through the `pytree` API by inlining its functions. In order to do so, a few details of `pytree` had to be changed (a sketch of a now-traceable pattern follows the list below).
In summary, this PR:
- Introduces `TreeSpecVariable` for representing `TreeSpec` instances
- Specializes `<type>.__bases__` call, returning a `TupleVariable`
- Enables calling the `id` builtin function on every variable that implements the `as_python_constant` method
- Specializes `ConstantVariable.call_method` for its (un)flatten functions
- Implements `UserDefinedObjectVariable.as_python_constant`
- Modifies `pytree` by:
  - Making `SUPPORTED_NODES` a map from ids (instead of types) to `NodeDef`
  - Removing the use of `functools.wraps`, since it can't be inlined
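For illustration, a sketch of the kind of code that dynamo can now trace without graph breaks (assuming the internal `torch.utils._pytree` module):
```python
import torch
import torch.utils._pytree as pytree

@torch.compile(fullgraph=True)
def f(inputs):
    # tree_flatten / tree_unflatten are inlined instead of breaking the graph
    leaves, spec = pytree.tree_flatten(inputs)
    leaves = [leaf * 2 for leaf in leaves]
    return pytree.tree_unflatten(leaves, spec)

out = f({"a": torch.ones(2), "b": (torch.ones(3), torch.ones(4))})
```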
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108533
Approved by: https://github.com/ezyang, https://github.com/voznesenskym
ghstack dependencies: #109201
Summary:
X-link: https://github.com/pytorch/executorch/pull/359
When exporting using enable_aot (through the torch.export path), we want to lift all constant tensors as buffers to the exported program. The ScalarToTensor pass in EXIR's aten_to_edge passes will create some constant tensors in the graph, so we will need to run a lift_constant_tensors pass afterwards.
Note that this only needs to be applied when exporting using the torch.export path because in the original path, nothing is lifted.
Test Plan: CI
Differential Revision: D49207492
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109382
Approved by: https://github.com/cccclai
Previously, the code for passing inputs to the exported program was:
```
if kwargs:
return (args, kwargs)
else:
return args
```
However, this causes an inconsistency: if the original input contains both args and kwargs, the treespec would be a tuple containing a tuple of arguments and a dictionary of keyword arguments; but if the original input only contained args, the treespec would just be a tuple of arguments. This inconsistency causes some inconvenience at runtime.
So I updated the code to always keep the kwargs around.
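A sketch of the updated convention:
```
# always return the (args, kwargs) pair, even when kwargs is empty,
# so the treespec always has the same shape
return (args, kwargs)
```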
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109160
Approved by: https://github.com/zhxchen17, https://github.com/avikchaudhuri
Summary:
If there are no inline constraints added, just return the original graph.
We want to do this because this pass sometimes messes up the node names;
until we actually fix that, we can make the behavior a bit less buggy
by skipping no-op passes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109395
Approved by: https://github.com/angelayi
Summary:
When we retrace a graph containing constant tensors, the constants get lifted as buffer inputs.
AOTInductor also wants to lift all the constants as inputs.
If we treat constants as a separate category, we add complexity: we would now have to keep track of three kinds of inputs (params, buffers, constants).
Cons: people might care about which buffers are, or are not, actual buffers.
If people want to know specifically which buffers are constants, we can add an additional field in the graph signature to mark this.
Test Plan: CI
Differential Revision: D49153367
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109040
Approved by: https://github.com/zhxchen17
Summary:
Include the constants in the AOTInductor .so file.
Notable differences:
1) Serialize with ctypes instead of torch.storage's native serialization.
2) Use the underlying for_blob instead of from_blob to construct the Tensor.
Test Plan:
Unit tests:
```
test/inductor/test_aot_inductor.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108473
Approved by: https://github.com/angelayi
When we retrace a graph containing constant tensors, the constants get lifted as buffer inputs.
AOTInductor also wants to lift all the constants as inputs.
If we treat constants as a separate category, we add complexity: we would now have to keep track of three kinds of inputs (params, buffers, constants).
Cons: people might care about which buffers are, or are not, actual buffers.
If people want to know specifically which buffers are constants, we can add an additional field in the graph signature to mark this.
Differential Revision: [D49017872](https://our.internmc.facebook.com/intern/diff/D49017872)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108592
Approved by: https://github.com/zhxchen17
Summary:
This is a prototype for running extern fallback kernels with a host-side proxy executor.
Sample of generated cpp wrapper call:
```
at::Tensor buf0; // output buffer
void* tensor_args_var_0[] = {&arg0_1, &arg0_1, &arg1_1, &arg0_1, &arg1_1, &buf0};
int64_t int_args_var_1[] = {81, 81, 7, 7, 7, 81};
proxy_executor->call_function("buf0", int_args_var_1, tensor_args_var_0);
```
- In my current implementation, the proxy executor interprets the raw pointers according to the op's schema.
This assumes that a custom op MUST have a valid schema registered with the Dispatcher. (I would like to validate this assumption.)
- I am using the callBoxed() API of the custom kernels. This is inevitable, as we wish to have a single call_function API for all possible custom kernels.
- These are all the input argument types supported so far:
```
union Argument {
  # Bool value does not matter
  1: bool asNone;
  2: TensorArgument asTensor;
  3: list<TensorArgument> asTensors;
  5: i64 asInt;
  7: list<i64> asInts;
  8: double asFloat;
  9: list<double> asFloats;
  10: string asString;
  10.5: list<string> asStrings;
  11: SymIntArgument asSymInt;
  12: list<SymIntArgument> asSymInts;
  13: ScalarType asScalarType;
  14: MemoryFormat asMemoryFormat;
  15: Layout asLayout;
  16: Device asDevice;
  17: bool asBool;
  18: list<bool> asBools;
}
```
- Need a policy for handling unpopulated arguments with default values. Here are the options, which have BC implications:
  1. Require the exported fx graph to explicitly populate default values if the user doesn't specify them.
  2. Require the cpp wrapper to explicitly populate default values if the fx graph doesn't specify them.
  3. Have the proxy executor look up default values from the op schema.
For fixing T162112344
Test Plan:
frontend:
buck2 run mode/dev-sand mode/inplace -c fbcode.enable_gpu_sections=True sigmoid/frontend:export_main
test:
buck2 run mode/dev-sand //deeplearning/aot_inductor/test:test_custom_ops
backend:
buck2 run mode/dev-nosan //deeplearning/aot_inductor/fb:main
buck2 test 'fbcode//mode/opt' fbcode//caffe2/torch/fb/model_transform/experimental/benchmark/test:test_aot_inductor_benchmark -- --exact 'caffe2/torch/fb/model_transform/experimental/benchmark/test:test_aot_inductor_benchmark - test_aot_inductor_benchmark_cmf30x (caffe2.torch.fb.model_transform.experimental.benchmark.test.test_aot_inductor_benchmark.AOTInductorBenchmark)'
Reviewed By: suo
Differential Revision: D48747417
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108350
Approved by: https://github.com/izaitsevfb