Summary:
We need to properly fakify torchbind objects, including the ones in graph module attributes, so the resgitered fake implementation works properly.
- _fakify_script_objects in `compile_fx`
- Allow fake torchbind objects in `torchbind_constants`
Remove `node.meta["unbacked_bindings"]` for `aot_compile` in `compile_fx`. Otherwise `ShapeProp` will fail when trying to resolve the `unbacked_bindings` of `with_effect` tokens.
Update `sigrid_transforms_test` to use the latest `torch._inductor.aot_compile` API.
Add a test for `Fakify torchbind objects in compile_fx and add tests for SigridTransformsInstanceTorchBind` in `e2e_test`.
Test Plan:
```
buck run //caffe2/torch/fb/sparsenn:sigrid_test -- -r test_transform_torch_bind
buck run //sigmoid/inference/test:e2e_test_cpu -- -r SigridTransforms
buck2 run mode/dev-nosan sigmoid/inference/ts_migration:pt2i_readiness_main -- --model_id 545017754 --test_suite ads_all --mode test_preproc
```
Differential Revision: D70013257
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149529
Approved by: https://github.com/angelayi
WHen we create constraints, we look at the ordering of kwargs according to model signature. But when we trace, we use the ordering that is created based on how user passes in their kwargs. As a result, constraints and dynamic shapes end up having a different order causing issues when they have different dynamic tensor specs.
Differential Revision: [D71478578](https://our.internmc.facebook.com/intern/diff/D71478578)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149528
Approved by: https://github.com/ydwu4
Summary: Remove torch.export.export_for_inference, it is redundant and can always be replaced with torch.export.export_for_training() + run_decompositions()
Test Plan: unit tests
Differential Revision: D71069057
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149078
Approved by: https://github.com/tugsbayasgalan
Summary:
Recall that we use "ivals" to track intermediate values of mutations during unflattening. Previously, for each such intermediate value, we would create a hidden shared attribute that would be updated / read by respective submodules.
Unfortunately this scheme doesn't work when some but not all of those submodules are swapped out. This is because the swapped in submodules have no knowledge of these hidden attributes. Thus the submodules that are not swapped out end up reading / updating dangling state.
This PR does away with these hidden attributes. Instead, we directly read the underlying buffer or placeholder that was updated, and update those underlying buffers and placeholders in place. This makes the graphs look much closer to their eager origins.
Test Plan: added some tests, ensured existing tests pass
Differential Revision: D71203469
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149206
Approved by: https://github.com/tugsbayasgalan
Minor refactor to trace.py
* Removed `_strict_export_lower_to_aten_ir` in favor of just `_strict_export` and `_non_strict_export`
* Matched the APIs of `_strict_export` and `_non_strict_export`
* Instead of a `lower_to_aten_callback` which is a callable, or `dispatch_tracing_mode`, both functions take in a `_to_aten_func` which can be either `_export_to_aten_ir_make_fx` or `_export_to_aten_ir`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149240
Approved by: https://github.com/pianpwk
Summary:
- Flip the default value of strict argument in torch.export.export from True to False
- Update test infra to cope with the change, some of them made the assumption of strict mode as default
- Disabled some tests that fail in non-strict mode
Test Plan: Sandcastle
Differential Revision: D70228628
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148790
Approved by: https://github.com/angelayi
# Fix typo errors across PyTorch codebase
This PR fixes various spelling errors throughout the PyTorch codebase to improve documentation quality and code readability.
## Changes Made
### Documentation Fixes
- Changed "seperate" to "separate" in multiple files:
- `setup.py`: Build system documentation
- `torch/_library/triton.py`: AOT compilation comments
- `torch/csrc/dynamo/compiled_autograd.h`: Node compilation documentation
- `torch/export/_unlift.py`: Pass population comments
- `torch/export/exported_program.py`: Decomposition table notes
### Code Comments and Error Messages
- Changed "occured" to "occurred" in:
- `test/mobile/test_lite_script_module.py`: Exception handling comments
- `torch/export/_draft_export.py`: Error message text
- `aten/src/ATen/native/cuda/linalg/BatchLinearAlgebra.cpp`: MAGMA bug comment
- `torch/csrc/utils/python_numbers.h`: Overflow handling comment
- `torch/csrc/jit/OVERVIEW.md`: Graph compilation documentation
- `torch/_dynamo/symbolic_convert.py`: Error explanation
### API Documentation
- Changed "fullfill" to "fulfill" in `torch/distributed/checkpoint/state_dict_loader.py`
- Changed "accross" to "across" in:
- `torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp`
- `torch/distributed/distributed_c10d.py`
## Motivation
These changes improve code readability and maintain consistent spelling throughout the codebase. No functional changes were made; this is purely a documentation and comment improvement PR.
## Test Plan
No testing required as these changes only affect comments and documentation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148262
Approved by: https://github.com/janeyx99
Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
Plan: avoid the use of unbacked renamings, and introduce a pass run in `_produce_aten_artifact` that recomputes unbacked bindings. Decided to do this because in we don't serialize unbacked renamings (or any ShapeEnv state), so this used to compose poorly with de/serialization. This hopefully establishes the invariant that the unbacked binding keys are always in sync with the example values (i.e. same indices, and removed if the symbol is replaced / specialized).
For de/serialization, we don't stored unbacked bindings, and just rerun the pass.
Involved a refactor of compute_unbacked_bindings.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147574
Approved by: https://github.com/avikchaudhuri
Summary: This matches the export API. To print the report, people can just do `print(ep._report)`. This information is also displayed in the terminal after the draft_export call.
Test Plan: CI
Reviewed By: SherlockNoMad
Differential Revision: D69689154
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147558
Approved by: https://github.com/pianpwk
Summary: There are ~260 tests for all the corner cases of export from test_export.py. utitlizing to test sigmoid in the OSS setting.
Test Plan: buck test mode/opt caffe2/test:test_export -- -r _sigmoid
Differential Revision: D69937387
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147535
Approved by: https://github.com/yiming0416
Note: This is a re-land of https://github.com/pytorch/pytorch/pull/141791, which I reverted due to breaking some Meta-internal tests - an internal ET delegate did not handle the non-decomposed upsample_nearest2d, and it was not caught in CI. I've resolved that issue and should be ready to safely re-land.
Summary:
As upsample_bilinear2d.vec and upsample_nearest2d.vec are core ATen ops, they should not be decomposed by default in the export path. Because the operators have CompositeImplicitAutograd dispatch, their decomposition is registered by default. This change adds an override list for CIA decompositions being registered in the default decomp table.
In the long-term, we likely will want to exclude decompositions for all core-tagged CIA ops, but this will require all consumers to be ready to handle the remaining two ops, avg_pool1d, and adaptive_avg_pool1d. Until they are ready, I believe an explicit override list is the safest option.
Additionally, I've also removed the ExecuTorch XNNPACK delegate ConvertToUpsampleBilinear2d pass, as the pass breaks (and is not needed), given that the op is not decomposed. The purpose of this pass was originally to pattern match the decomposition and recompose it, but this is no longer necessary.
Test Plan:
Added a new test (`test_default_decomposition_core_cia_ops`) in test_export.py to verify that upsample_bilinear2d.vec (and in the future, other core-tagged CIA ops) are not decomposed by default. Also, I manually validated end to end with ExecuTorch that the op is not decomposed in to_edge (see N6238522).
```
buck test //caffe2/test:test_export -- test_default_decomposition_core_cia_ops
```
Differential Revision: D69625112
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147153
Approved by: https://github.com/manuelcandales
Summary: Add support for inputs that no longer exist in `input_fields`, but is not actually used by the original program. In this case, we just give it a dummy input based on the node's metadata.
Test Plan: Verified for S488841
Differential Revision: D69328093
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147238
Approved by: https://github.com/pianpwk
Using a custom logger so that we can store our own buffer to dedup logs that look the same. The schema for deduping is as follows:
```python
if key == "missing_fake_kernel":
return hash((key, data["op"])) # Same ops get deduped
elif key == "mismatched_fake_kernel":
return hash((key, data["op"], data["reason"])) # Same op and reason for errors get deduped
elif key == "propagate_real_tensors":
return hash((key, json.dumps(data["stack"]))) # Guards appearing on the same stacktrace get deduped
elif key == "create_unbacked_symbol":
return hash((key, json.dumps(data["stack"]))) # Unbacked symbols appearing on the same stacktrace get deduped
```
Notably, guards appearing on the same stacktrace get deduped. This is because there are some cases in PT2I models where a piece of code which creates a new unbacked symint + runs into a DDE gets called 800 times, causing 800 new symints to be created, and 800 propagate_real_tensor errors that are all the same expression. This is hard to look at, so we should just deduplicate this.
The con of this is that if there exists multiple DDE on the same stacktrace, we will only show the first issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146533
Approved by: https://github.com/avikchaudhuri
ghstack dependencies: #146532
As upsample_bilinear2d.vec is a core ATen op, it should not be decomposed by default in the export path. Because the operator has CompositeImplicitAutograd dispatch, its decomposition is registered by default. This change adds an override list for CIA decompositions being registered in the default decomp table.
In the long-term, we likely will want to exclude decompositions for all core-tagged CIA ops, but this will require all consumers to be ready to handle the remaining three ops: upsample_nearest2d.vec, avg_pool1d, and adaptive_avg_pool1d. Until they are ready, I believe an explicit override list is the safest option.
Additionally, I've also removed the ExecuTorch XNNPACK delegate ConvertToUpsampleBilinear2d pass, as the pass breaks (and is not needed), given that the op is not decomposed. The purpose of this pass was originally to pattern match the decomposition and un-decomposite it, but this is no longer necessary.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141791
Approved by: https://github.com/tugsbayasgalan, https://github.com/digantdesai
Summary:
Previously we were touching up unbacked bindings between Dynamo and AOTAutograd in strict export, but the logic had a bug: if an unbacked symint gets substituted by a backed symint, we would put the backed symint in the unbacked bindings (the check `is_symbol` was not enough here).
This PR fixes this logic, and moreover, moves it into the serializer instead, because we don't need this adjustment outside serde.
Test Plan: added test
D68880766
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146115
Approved by: https://github.com/pianpwk
Adds `dtrace_structured` logging so when a guard or real-tensor propagation assert is added, the relevant user code with local symbolic values & free symbols are logged, e.g. from the draft export CLI report (soon to be added to tlparse):
1. Guard added:
```
1. Constraint violation error.
The specified input dynamic_shapes spec was found to be incorrect during tracing.
Specifically, this guard was added: Eq(s0, 3), where {'s0': "L['args'][0][0].size()[0]"}.
This occured at the following stacktrace:
File /data/users/pianpwk/pytorch/test/export/test_draft_export.py, lineno 267, in forward:
assert a.shape[0] == 3
Locals:
a: Tensor(shape: torch.Size([s0, 3]), stride: (3, 1), storage_offset: 0)
Symbols:
s0: L['args'][0][0].size()[0]
...
```
2. Real tensor propagation:
```
1. Data dependent error.
When exporting, we were unable to evaluate the value of `u2 < 0`.
This was encountered 8 times.
This occurred at the following stacktrace:
File /data/users/pianpwk/pytorch/test/export/test_draft_export.py, lineno 217, in forward:
return res[:c_item]
Locals:
res: Tensor(shape: torch.Size([u0, u1]), stride: (Max(1, u1), 1), storage_offset: 0)
c_item: u2
...
```
Currently the values are extracted from the traceback, and are only valid for non-strict; strict seems to require storing & fakifying locals in the frames reporting by `TracingContext`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143378
Approved by: https://github.com/avikchaudhuri, https://github.com/bobrenjc93
Previously, in non-strict path, we always error when trying to inplace update a constant tensor because those constant tensors are not actually wrapped by functional tensors. This is correct behaviour in torch.compile, because dynamo makes all constant tensors into buffers and AOTDispatcher just lifts them and wraps them in functional tensors. However, in non-strict, there is no such step that registers constants as buffers so AOTDispatcher panics when it sees these dangling constant tensors when functioanalizing.
Due to recent change in the IR, this is no longer an issue in non-strict path because we don't call AOTDispatcher at training IR level, but now it is a problem for both strict and non-strict when we lower to inference. (lowering to inference is very similar to non-strict tracing) As a result, we have at least one external (https://github.com/pytorch/pytorch/issues/141336) and internal issues reported due to this difference.
To fix this, there are two ways:
1. Make functionalization be aware of constant tensors and map them to functional tensors on the fly. This makes functionalization invariant uglier and could potentially open up a gate for more nasty bugs.
2. Special handle this in export. This seems more aligned with what dynamo does today so i think we should do it this way. I think the current state could benefit from more refactors to make the run_deocmpositions to be more similar to strict export (because both of them now handle this constant registerinig logic) but it is bit complicated to do it now because strict export version of this logic is also not complete because it doesn't take into account of export graph renaming pass etc). I will follow up with more refactors after this PR (T213466691) to unblock users faster.
For future reference:
Why are we not doing "turning constants into non-persistent buffers and never de-register"? The reason is because in some internal models, they rely on module.to to reliably work to move params/buffers to correct device. As a result, buffers are moved while constants are not. In composibility meeting, we agreed that export won't do device agnostic tracing going forward (it will provide a way to specify FakeTensor in CPU that can be configured to be run on GPU), so after that is done, we can always turn constants into non-persistent buffers which will simplify export's constant handling.
Differential Revision: [D68610739](https://our.internmc.facebook.com/intern/diff/D68610739)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145593
Approved by: https://github.com/avikchaudhuri
If a user passes in a namedtuple as an input, currently the input TreeSpec looks like: `TreeSpec(type=namedtuple, context=”class_fqn”, children_spec=[*, *])`
The user then saves the program containing this input TreeSpec. But what happens if they load it in a new environment where `class_fqn` now contains an additional field?
This means that the exported program is now expected to take in another input. But since those fields were not used in the original program, users should be able just drop those additional fields and the program will run successfully. This is needed/used in APS where they use unflattener's adapter to adapt the inputs based on the previously saved treespecs.
There are a couple of [solutions](https://docs.google.com/document/d/1V4ZSdy-8PUISWc8RqvGu3DU01BVegJhHHPWqa1Io7Eg/edit?tab=t.0) for how we can address this, but eventually we settled on saving a side table mapping namedtuple types to their list of field names, which can then be accessed by the adapter.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145956
Approved by: https://github.com/zhxchen17
Summary:
Previously we were touching up unbacked bindings between Dynamo and AOTAutograd in strict export, but the logic had a bug: if an unbacked symint gets substituted by a backed symint, we would put the backed symint in the unbacked bindings (the check `is_symbol` was not enough here).
This PR fixes this logic, and moreover, moves it into the serializer instead, because we don't need this adjustment outside serde.
Test Plan: added test
Differential Revision: D68880766
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146115
Approved by: https://github.com/pianpwk
Summary:
maybe this is too much info, but it's difficult to go through old draft export reports where the stack trace is out of sync with the current codebase. Data-dependent errors now look like:
```
2. Data dependent error.
When exporting, we were unable to evaluate the value of `u306`.
This occurred at the following stacktrace:
File /data/users/pianpwk/fbsource/buck-out/v2/gen/fbcode/78204cab86e8a0fb/sigmoid/inference/ts_migration/__pt2i_readiness_main__/pt2i_readiness_main#link-tree/caffe2/torch/fb/training_toolkit/common/proxy_module_thrift/embedding_bag_proxy.py, lineno 109, in _forward_impl:
`if offsets[-1] > len(input):`
As a result, it was specialized to evaluate to `261`, and asserts were inserted into the graph.
Please add `torch._check(...)` to the original code to assert this data-dependent assumption.
Please refer to https://docs.google.com/document/d/1kZ_BbB3JnoLbUZleDT6635dHs88ZVYId8jT-yTFgf3A/edit#heading=h.boi2xurpqa0o for more details.
```
This would be even more helpful for reports on torch-packaged models, but that requires some more work on PT2I-specific stack trace processing
Test Plan: .
Differential Revision: D68534017
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145443
Approved by: https://github.com/angelayi
Fixes#144976
Using appoach ① `IO[bytes]`, but could also try with a protocol.
## Notes:
- moved `torch.serialization.FILE_LIKE` to `torch.types.FileLike`
- Use `FileLike` annotation where it makes sense
- made sure those functions also support `os.PathLike`
- Replaced `isinstance(x, io.BytesIO)` with `isinstance(x, (io.IOBase, IO))` where appropriate.
- Replaced `BinaryIO` with `IO[bytes]` (the two ABCs are almost identical, the only difference is that `BinaryIO` allows `bytearray` input to `write`, whereas `IO[bytes]` only `bytes`)
- needed to make `torch.serialization._opener` generic to avoid LSP violations.
- skipped `torch/onnx/verification` for now (functions use `BytesIO.getvalue` which is not part of the `IO[bytes]` ABC, but it kind of seems that this is redundant, as e.g. `onnx.load` supports `str | PathLike[str] | IO[bytes]` directly...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144994
Approved by: https://github.com/ezyang, https://github.com/Skylion007
Adds unbacked bindings during deserialization. These are carried by a node's metadata, and map pending fresh unbacked symbols to paths to such symbols inside the corresponding example value carried by the node's metadata.
Since it is awkward to serialize paths, we only serialize the names of these symbols and reconstruct the paths on deserialization, using a shape env util. We also need to bump counters for unbacked symbols here, because the shape env util we use to create these symbols (when deserializing example values) don't do so, and not doing so makes later passes (like `run_decompositions`) crash because new unbacked symbols don't get new names.
This is enough for non-strict. For strict, the unbacked bindings and example values in node metadata can get out of sync, because of running AOTAutograd as an additional step after Dynamo. So we have to sync those back.
Differential Revision: [D68232274](https://our.internmc.facebook.com/intern/diff/D68232274/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144894
Approved by: https://github.com/pianpwk
Summary:
Add experimental support for torch.nn.Module as input types.
Before this change, we don't support module inputs but recently we saw some interesting use cases like gpt-fast https://github.com/pytorch-labs/gpt-fast/blob/main/generate.py#L68 where we directly pass in a module input for different variants of the same models.
Since we don't really care about non-param or non-buffer states in non strict mode, we don't care about those either and pretend they are like plain constants during tracing. We treat any module input like a nested container of tensor, and each time we will automatically register a pytree handler for these module types to flatten its state dict into a group of tensors. We will just inline any module method call during tracing like we did for `self` module in export_for_training. This will make input modules' behavior very similar to the training module in typical case, except that we don't record the inputs as parameter or buffers but rather just plain user inputs.
Test Plan: buck run mode/opt caffe2/test:test_export -- -r test_module_input
Differential Revision: D67680827
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143925
Approved by: https://github.com/tugsbayasgalan
Summary:
A reland of https://github.com/pytorch/pytorch/pull/142426.
Copying the description over here:
For torch.export (strict and non-strict), we don't do functional decomposition. Instead, we preserve the custom triton ops as custom ops. This is because we want the exported program to be high-level and serializable.
The alternative:
If we decompose the custom op to a functional hop and make it a node in exported program, we need to figure out ways of serializing the hop and its arguments, which can be triton.jited python functions and triton dtypes. This is undesireble because:
it can be tedious to maintain layer that serialize the jited function (e.g. with a string) and dtypes.
changes to triton or the serialization logic for triton arguments can be BC breaking
exported program will expose the implementation detail (i.e. triton source code) for a specific backend (GPU) to users, which mixes levels of abstraction.
Future plans:
After this PR, in the short term, we expect users to have a seperate aot_compile stage that compiles the exported program into a Cubin file on the same machine that users call export, which does autotuning and removes triton dependency and serve the model with Cubin. This guarantees that triton changes won't break BC.
In the long term, we may export multiple cubins for the triton op directly.
Test Plan: see new tests.
Differential Revision: D67879685
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144284
Approved by: https://github.com/zou3519
When we unflatten, the submodules we generate (`InterpreterModule` or `InterpreterModuleDispatcher`) are not related by type to the original submodules `N`. This makes `isinstance(mod, N)` checks fail. Since we do not have the original types after export, the best we can do is expose a `type_name()` method that carries the original type name, which we do carry in `nn_module_stack` entries.
Differential Revision: [D67526542](https://our.internmc.facebook.com/intern/diff/D67526542/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143664
Approved by: https://github.com/tugsbayasgalan
We added an is_export flag under torch.compiler.is_exporting. This comes handy when we try to do some special logic in user-level and system-level (e.g. in upper of the stack).
In increasing-scope:
- `_is_fx_tracing` is set to True when we use under symbolic_trace or make_fx.
- `is_exporting` is set to True when we're doing strict or non-strict export, which internally has a step that calls make_fx and set _is_fx_tracing to be True.
- `is_compiling` is set to True when we're either doing strict, non-strict export or torch.compile.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142425
Approved by: https://github.com/avikchaudhuri
Combining several fixes to unflatten for bugs revealed by random graph testing.
The fixes target two categories of bugs:
1. Some bugs show up as exponential blowups for largish system of nn modules. These are fixes by converting lists to sets, using caching, or otherwise rewriting to reuse computation more effiicently.
2. Other bugs were due to missing intermediate modules created when attributes such as submodules and buffers are accessed through longish paths before calling the corresponding intermediate modules, or missing attributes such as buffers and constants in submodules corresponding to multiple calls.
Differential Revision: [D66659795](https://our.internmc.facebook.com/intern/diff/D66659795/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142141
Approved by: https://github.com/ydwu4
Over time, a large number of the existing type ignores have become irrelevant/unused/dead as a result of improvements in annotations and type checking.
Having these `# type: ignore` linger around is not ideal for two reasons:
- They are verbose/ugly syntatically.
- They could hide genuine bugs in the future, if a refactoring would actually introduce a bug but it gets hidden by the ignore.
I'm counting over 1500 unused ignores already. This is a first PR that removes some of them. Note that I haven't touched type ignores that looked "conditional" like the import challenge mentioned in https://github.com/pytorch/pytorch/pull/60006#issuecomment-2480604728. I will address these at a later point, and eventually would enable `warn_unused_ignores = True` in the mypy configuration as discussed in that comment to prevent accumulating more dead ignores going forward.
This PR should have no effect on runtime at all.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142325
Approved by: https://github.com/Skylion007, https://github.com/janeyx99
Summary: Add the helper function to put a const graph back to the toplevel graph, can be useful when we're taking const graphs from delegates.
Test Plan: CI
Reviewed By: trieuat
Differential Revision: D63031982
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140854
Approved by: https://github.com/SherlockNoMad
With largish systems of nn modules with buffers, sinking params suffered from some kind of exponential blowup that is easily fixed by using a set instead of a list to keep track of unlifted buffer placeholders.
Test Plan: added random dag test that failed previously
Differential Revision: D66457661
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141494
Approved by: https://github.com/angelayi
For better tracking, we need to make maybe aliasing/mutating ops with proper tag. We need to special case native_batch_norm because it is not a CIA but has a wrong schema. I guess native_batch_norm will be removed at some point, so until then we just keep it around.
D60347117
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131990
Approved by: https://github.com/bdhirsh
Handling of nested modules in unflatten had several bugs, which were caught by trying to preserve module call signatures for nested modules.
* A module `k` encountered when calling `k.n()` before `k()` used to become an empty nn module. This caused some information to be dropped when `k()` was eventually called. Relatedly, we would also lose call counts for `k.n()` through different paths (say, when `k()` calls `n()`).
* Deleting call-indexed modules and patching up their call sites was broken for nested modules when creating dispatcher modules, because of silliness when handling their fqns.
An interesting aside is that we used random graph generation for testing some of these changes. A future PR will add the infra to create tests using these random graphs.
Differential Revision: D66192799
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141066
Approved by: https://github.com/angelayi
For custom ops that do not have a meta kernel, draft export automatically creates a meta kernel based on the tracing example inputs. To ensure that these assumptions made during tracing is clear to the user, we add assertions into the traced exported program:
An example graph:
```
ExportedProgram:
class GraphModule(torch.nn.Module):
def forward(self, a: "f32[s0, s1]", b: "f32[s2, s3]"):
# File: /data/users/angelayi/pytorch/test/export/test_draft_export.py:172 in forward, code: res1 = torch.ops.mylib.foo4(a, b)
_assert_tensor_metadata = torch.ops.aten._assert_tensor_metadata(a, dtype = torch.float32, device = device(type='cpu')); _assert_tensor_metadata = None
_assert_tensor_metadata_1 = torch.ops.aten._assert_tensor_metadata(b, dtype = torch.float32, device = device(type='cpu')); _assert_tensor_metadata_1 = None
foo4: "f32[u2, u3]" = torch.ops.mylib.foo4.default(a, b); a = b = None
return (foo4,)
```
Differential Revision: [D66321129](https://our.internmc.facebook.com/intern/diff/D66321129)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141072
Approved by: https://github.com/pianpwk
ghstack dependencies: #141071
Currently real tensor tracing raises MetadataMismatchErrors if registered fake kernels don't match the real kernels (e.g. shape, aliasing, dtype, etc.). This adds an option to use fake kernel inference to bypass mismatches - this option defaults to False for real tensor tracing, but is on for draft export.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139766
Approved by: https://github.com/angelayi, https://github.com/zou3519
Differential Revision: [D65362160](https://our.internmc.facebook.com/intern/diff/D65362160)
State after this IR:
1. For the tests that require inference IR, they are replaced with ep.run_decomp({}) so export_for_training_run_decomp is sort of redundant but i guess it is still nice that multiple round of retracing still working. In general, we need some auditing to reduce our redundant testing coverages.
2. After this PR landed and not get reverted for a week or so, i will replace the export_for_training calls with export as they are the same thing now.
3. Added more tests to also cover now "deprecated" old IR by patching export to use old export. For reviewers, please look at the internal version.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139511
Approved by: https://github.com/ydwu4, https://github.com/angelayi, https://github.com/avikchaudhuri
* Automatically applies ruff rule 401. Turns loops into equivalent list comprehensions which are faster and do not leak the scope of the loop variables.
* list comprehensions not only often have better typing, but are 50+% faster than for loops on overhead. They also preserve length information etc and are better for the interpreter to optimize.
* Manually went back and made mypy happy after the change.
* Also fixed style lints in files covered by flake8 but not by pyfmt
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140980
Approved by: https://github.com/justinchuby, https://github.com/malfet
During export, we nub out most CIA ops to return NotImplemented to avoid decomposing them during tracing. To recover the existing shape propagation behavior, we register these CIA decomps directly as FakeTensorMode rules as well. The reason we have to do is because when we return NotImplemented, FakeTensor would fallback to running these CIAs with Meta backend causing device branching CIA ops to fail. (because now the device is Meta. One example is sdpa). If we register a kernel directly to FakeTensorMode, we won't fallback to Meta backend.
Differential Revision: [D65716260](https://our.internmc.facebook.com/intern/diff/D65716260/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140465
Approved by: https://github.com/bdhirsh
Summary: The gm_torch_level can be a _LazyGraphModule(GraphModule) instead of a GraphModule. When we call .recompile(), GraphModule populates the self._out_spec, but _LazyGraphModule(GraphModule).recompile() doesn't populate it.
Test Plan: CI
Differential Revision: D65902135
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140608
Approved by: https://github.com/tugsbayasgalan
Differential Revision: [D65307961](https://our.internmc.facebook.com/intern/diff/D65307961/)
This PR introduces the concept of a "dispatcher" module `n` that carries multiple interpreter modules `n`, `n@1`, `n@2`, etc., each corresponding to a particular call of `n` and thus might carry a different specialized graph. We only do this when we're preserving module call signatures for `n`. The carried modules have the same number and order of calls to `n` appearing in the original module / exported program. In the unflattened module, all those calls go to the "dispatcher" module which internally tracks how many calls have been made so far and invokes the corresponding interpreter module. We reset this tracking after a successful or unsuccessful run of the unflattened module.
Overall this makes swapping easier when module call signatures are preserved.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139439
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #139438
Summary:
Dedup the data-dependent errors based on the stacktrace it points to. Right now we just display every propagate-real-tensor log that shows up, but we actually can dedup them if they are due to the same piece of code (ex. there could multiple calls to a piece of code that does some data dependent computation).
This occurred when trying out draft export on the PT2I model zoo. For a specific model, previously we would get ~3k data dependent errors, but after deduping based on the stacktrace we now only get 4 errors.
Test Plan: CI
Differential Revision: D65374254
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139540
Approved by: https://github.com/pianpwk, https://github.com/zou3519
# Why?
I want the following code to work.
minimal repro:
```
class M(torch.nn.Module):
def forward(self, dilate_flag):
return dilate_flag.item()
input1 = (torch.tensor([1], dtype=torch.bool, device="cuda"),)
model = M().cuda()
ep = torch.export.export(model, input1, strict=True)
path = torch._inductor.aot_compile(ep.module(), input1)
aot_model = torch._export.aot_load(path, device="cuda")
actual_output = aot_model(*input1)
```
error: AssertionError: Encountered an unsupported object of type <class 'torch.SymBool'> while writing the metadata for exported program
second error will be handled by https://github.com/pytorch/pytorch/pull/138760
# Motivation
I could technically bypass it with a torch.int tensor. However, it doesn't work with torch.cond. I want the following to work. It would also require https://github.com/pytorch/pytorch/pull/138760 for aot compile to work.
```
class M(torch.nn.Module):
def __init__(self) -> None:
super().__init__()
self.dilate_flag = 0
def forward(self, dilate_flag):
self.dilate_flag = dilate_flag.item()
def true_fn(dilate_flag):
return dilate_flag.clone()
def false_fn(dilate_flag):
return dilate_flag.clone()
torch.cond(
self.dilate_flag,
true_fn,
false_fn,
(dilate_flag,),
)
return self.dilate_flag
input1 = (torch.tensor([1], dtype=torch.bool, device="cuda"),)
input2 = (torch.tensor([0], dtype=torch.bool, device="cuda"),)
inputs = (input1, input2)
model = M().cuda()
for input in inputs:
expected_output = model(*input)
ep = torch.export.export(model, input, strict=False)
path = torch._inductor.aot_compile(ep.module(), input)
aot_model = torch._export.aot_load(path, device="cuda")
actual_output = aot_model(*input)
assert (
expected_output == actual_output
), f"henry they are not equal {expected_output} != {actual_output}"
```
Differential Revision: D64867504
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138765
Approved by: https://github.com/ydwu4
In this diff, i make test_torchbind.py tests to handle training IR. Today in the training IR, we don't see the effect token and HOP because this happens at the FunctionalTensorMode. Maybe in the future, we should move this logic up to the training IR so that writing passes etc on training Ir is safer. But for the migration purposes, i think it is ok for now. I also fixed two bugs:
1. ep.module() doesn't register all aliased constants in the module.
2. When we retrace, we need to fakify the original Torchbind object.
3. We don't run any DCE on training IR so we need to add some more torch ops to verifier.
Differential Revision: [D64853530](https://our.internmc.facebook.com/intern/diff/D64853530)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138658
Approved by: https://github.com/ydwu4, https://github.com/zhxchen17
Summary:
`torch.fx.Interpreter.run()` only takes args as input. Currently we pass kwargs as well which causes errors during retracing.
Flatten the kwargs and concat them with args will solve the issue.
Several previously failing tests under `_retraceability_non_strict` now passes.
Test Plan:
```
buck2 test @//mode/dev-nosan //caffe2/test:test_export -- -r _retraceability_non_strict
```
Differential Revision: D64980053
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138927
Approved by: https://github.com/angelayi
Summary:
Unflatten was broken for HOPs for a couple of reasons:
(1) we didn't expect `get_attr` nodes in the exported program, but they can occur to hold graph arguments to HOPs; such attributes must be moved from the exported program to the corresponding unflattened submodule containing the HOP call.
(2) we don't record metadata for graph arguments on serialization (there's nothing to hold it in our schema), and accordingly the `get_attr` nodes we create on deserialization don't have `nn_module_stack` metadata, which obviously wrecks unflatten.
Test Plan: added a couple of tests
Differential Revision: D65013647
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138978
Approved by: https://github.com/zhxchen17
As called out in https://github.com/pytorch/pytorch/pull/137999, preserving signatures of multiple calls when buffer mutations are present was NYI. The main problem was that intermediate values of buffers were not tracked, so couldn't be propagated statefully between multiple calls (i.e., they would need to be explicitly passed around, defeating the unlifting needed for preserving signatures).
This PR fixes this situation, by introducing module attributes that carry the necessary intermediate values of buffer mutations. In general, a buffer mutation can have several intermediate values it depends on recursively, even other buffers. So rather than tying an intermediate value with a particular buffer, we tie it with the submodules that create and read it. We install an attribute on all modules that create or read a particular intermediate value, sharing the same initial storage (i.e., initialized with the same empty tensor). For the module that creates this intermediate value, we copy the value into the corresponding attribute; and for the modules that read it, we read the corresponding attribute instead.
Another complication that needed to be addressed was that a `run_decompositions` following an `export_for_training` was not preserving module call graphs, which is needed for unflattening and, in particular, used when remapping inputs. Fortunately some existing metadata already tracks provenance of nodes, which we could use to update a module call graph after functionalization / decomposition.
Differential Revision: D64806175
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138669
Approved by: https://github.com/tugsbayasgalan
Previously we'd been raising UserErrors when `Dim()` and DimHints (`Dim.AUTO/Dim.DYNAMIC`) were both specified in `dynamic_shapes`, this PR stops that, and uses `Dim()` objects to guide DimHints.
The key to this was making the `EqualityConstraint` class happy when it checks that inferred equivalence relations were specified in the original `dynamic_shapes` spec, and this introduces a `RelaxedConstraint` object to mark the hinted dimensions, so equality checks between `RelaxedConstraints` and other constraints are treated as valid.
Current behavior is that:
```
class Foo(torch.nn.Module):
def forward(self, x, y):
return x - y
inputs = (torch.randn(4, 4), torch.randn(4, 4))
shapes = {
"x": (Dim.AUTO, Dim("d1", min=3)),
"y": (Dim("d0", max=8), Dim.DYNAMIC),
}
ep = export(Foo(), inputs, dynamic_shapes=shapes)
```
The dimensions marked `AUTO` and `DYNAMIC` will have max & min ranges of 8 & 3 respectively. Note that inferred equality between `Dim()` objects & `Dim.STATIC` will still raise errors - `Dim()` suggests not specializing to a constant.
Differential Revision: D64636101
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138490
Approved by: https://github.com/avikchaudhuri
Type annotations for compile_fx.
- Some of the stuff here is pretty complicated (functions which return functions that take functions) so I bailed on those and used `Any` just to get the rest landed.
- There are also changes to type signatures in other files which I did just to let mypy know more about the types in compile_fx.py.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138033
Approved by: https://github.com/Skylion007
In this PR, we implement lazy dictionary for export decomp behaviour for following reasons:
1. Custom op loading can happen after import time, as a result, the decomp table might not be able to pick up the decomp. Therefore we try to delay materialization as late as possible.
I intentionally seperated out the core_aten_decomp to not have any custom CIA ops in this PR to mitigate the risk of getting reverted but in the future, core_aten_decomp under torch/_decomp will exist as an alias to official export table (torch.export.default_decompositions)
Differential Revision: [D64140807](https://our.internmc.facebook.com/intern/diff/D64140807)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137650
Approved by: https://github.com/justinchuby, https://github.com/bdhirsh
Previously we would error when trying to preserve the call signature for a module when it was called multiple times. This PR can now do this without erroring. The fix is to propagate call indices in a few more places.
Note that while this works in the presence of params, buffers, and tensor constants, preserving call signatures for multiple calls to a module when buffers are mutated is not supported yet. This is future work. The main problem is that we do not have enough metadata to `copy_` mutated buffers at the end of each call to a module, so the next call can read those buffers at the beginning. Making this work will likely need some explicit tracking of intermediate values of mutated buffers when collecting metadata during functionalization in export.
Note also that we stop short of creating a single graph out of multiple graphs: that is still future work. So the unflattened module will still have different targets `n`, `n@1`, `n@2`, etc. for each call when we ask the module call signature of `n` to be preserved. However it is way easier to swap all of these targets with a replacement that behaves similar to the original, because all of these calls will respect the original module call signature. (In particular, any constant inputs will be carried by the calls.)
Differential Revision: D64406945
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137999
Approved by: https://github.com/tugsbayasgalan
Summary:
Tensor constants can show up through wrapped methods, so that they may not always be found in constant attributes. They need to be fakified and their meta vals need to be found to create graph signatures nevertheless. Otherwise non-strict barfs.
Longer term maybe we should pull this fakification up in non-strict.
Test Plan: added test
Differential Revision: D64480272
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138091
Approved by: https://github.com/tugsbayasgalan
We use nn_module_stack in unflatten to recognize when module calls begin and end. However the current format is not sufficient to detect module call boundaries when we have successive calls to the same module, because the successive instructions (end of one call, begin of next call) have the same nn_module_stack. This causes us to effectively "unroll" successive calls to a single call. This can cause problems when preserving module call signatures because the outputs of the successive calls might be concatenated in the single call.
Previously we introduced the concept of a "call index" to generate multiple graphs when unflattening, one per call. This PR pushes this concept into nn_module_stack itself. In particular, the keys of nn_module_stack now go from `key` to `key@call_index`. (In a previous attempt, https://github.com/pytorch/pytorch/pull/137457, instead values in nn_module_stack go from (fqn, type) to (fqn, type, call_index), which is BC-breaking.)
Note that we still do not have the ability to preserve module call signatures for multiple calls to the same module. But now instead of randomly crashing we give a proper error. OTOH when not preserving module call signatures we simply generate multiple calls, each with its own graph, possibly deduplicated, matching what we would do for non-successive calls.
Test Plan: Like D64014936
Differential Revision: D64136277
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137646
Approved by: https://github.com/angelayi
Resolve#137540
Summary:
We might get different state_dict and named_parameters result when the module has registered custom state_dict_hooks.
For exported_program's state_dict, we want the state_dict to reflect the actual module hierarchy at runtime, and it might be different from the model's state_dict() output if the model has state_dict hooks.
To do weight swapping, one needs to either re-export or turn-off the hooks when saving model's state_dict().
Previously, ExportedProgram uses nn.Module's state_dict() method to populate its own state_dict, but it doesn't work for some models (e.g. llama3_3_vision) because ExportedProgram's state_dict and an nn.Module's state_dict have some subtle differences semantically.
nn.Module's state_dict is about how the state should be serialized, and it reflects the structure of the original user model code. In contrast, export specializes on a “run” of a model, and its state_dict needs to reflect the runtime module hierarchy.
One example where these two are different is TorchTune's Llama3_2_vision text decoder. Here, a FusionLayer is added as a local optimization and it is not part of the "static model definition". In runtime, we have mod.layers[3].layer.sa_norm.scale.
But in nn.Module's state_dict, the authors of the model added a state_dict hook to remove the "layer" in mod.state_dict() to reflect the static model definition, so we have mod.state_dict()["layers.3.sa_norm.scale"].
In this Diff, we change ExportedProgram to populate its state_dict using named_parameters() and named_buffers() instead. So in ExportedProgram's state_dict, we have "layers.3.layer.sa_norm.scale", which reflects the runtime module hierarchy.
Now one problem this presents is weight swapping. Since ExportedProgram's state and the model's state is not the same anymore, weight swapping procedure also needs to change slightly.
In internal Ads and RecSys models deployment, weight swapping is where they have one model that is currently being being deployed and serving traffic, and they want to swap out the weights with newly trained model weights without having to redo the whole exporting/lowering process and create a new artifact. So they would move the deployed model’s pointer to the state dict over to the new state dict. Because of this, it’s previously a requirement that the FQNs are matching between the exported and the eager model’s state dict.
The new ExportedProgram's state dict still supports weight swapping, but the state_dict to be swapped needs to be obtained from torch.export.exported_program instead of model.state_dict() if the model has state_dict hooks.
The new requirement is that the FQNs are matching between the exported’s state dict and the state_dict obtained from `_disabled_load_state_dict_hooks(M)` context manager. One benefit of having this new API is that we are now in full control within export of gathering and updating the model state.
If a model doesn't have any state_dict hooks, one can still use model.state_dict() for weight swapping, so it's BC.
Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r test_export_for_training_with_state_dict_hooks
```
Differential Revision: D64080561
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137609
Approved by: https://github.com/angelayi, https://github.com/pianpwk
Summary: Fixes a couple of problems: constants didn't have metadata before creating graph signatures, and graph signatures weren't updated when lifting constants.
Test Plan: fixed test
Differential Revision: D64081786
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137547
Approved by: https://github.com/tugsbayasgalan
Added an optimization pass to the swap function which removes extraneous pytrees. Currently it removes the pytree flatten/unflatten calls between modules in very specific scenarios (all the inputs of one module go into the other).
Future work can be to remove the input pytree.flatten if the inputs go directly into an unflatten, and output pytree unflatten if the outputs are directly from a pytree.flatten.
Differential Revision: [D62879820](https://our.internmc.facebook.com/intern/diff/D62879820)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136191
Approved by: https://github.com/avikchaudhuri
Summary: In unflatten, when we generate module calls when their signature has been preserved, we do not pass the original constant args. This can cause strange effects, e.g., if the module is swapped out with itself, we may suddenly go down a different path than the original, or even crash.
Test Plan: added a test
Reviewed By: angelayi
Differential Revision: D63913750
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137363
Approved by: https://github.com/angelayi
Summary: We had attribute assignment detection and handling of registered buffer assignments when using `aot_autograd`, but not when using just `make_fx`. Fixed.
Test Plan: expanded coverage of `test_state_tensors` to use `export` instead of `torch.export.export`
Differential Revision: D63802576
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137240
Approved by: https://github.com/tugsbayasgalan
Summary:
When we handle dynamic shapes markers like `Dim.AUTO, Dim.DYNAMIC`, we use dynamo decorators, attaching set attributes to the export input tensors, e.g. `x._dynamo_dynamic_indices = set()`.
I thought this was fine, since it's done all the time with torch.compile, but it breaks some PT2Inference tests, specifically because unpickling a set attribute isn't possible with the C++ torch::jit::pickle_load call.
We've agreed that the PT2Inference side will clone sample inputs & pickle the original inputs to be safe, but this still establishes a nice invariant that user-facing decorators are both ignored & cleaned out in the lifecycle of an export call.
Test Plan: test_export
Differential Revision: D63773534
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137230
Approved by: https://github.com/avikchaudhuri
When we populate unlifted graph module, we actually only "unlift" constant tensor inputs which is problematic because export de-duplicates aliasing constants. As a result, we only register one constant instead of two constants. This PR fixes that by querying ep.constants table instead of ep.graph_signature.lifted_tensor_constants.
Differential Revision: [D63743111](https://our.internmc.facebook.com/intern/diff/D63743111)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137162
Approved by: https://github.com/pianpwk
Previously we were making a fairly restrictive assumption when unflattening an exported program: for any submodule, we would assert that the graph of every call to that submodule must be the same. This assertion is load-bearing, i.e., if we simply remove the assertion then we can get incorrect results, as shown by the following example.
```
class N(torch.nn.Module):
def forward(self, x, b):
if b:
return x + 1
else:
return x + 2
class M(torch.nn.Module):
def __init__(self):
super().__init__()
self.n = N()
def forward(self, x):
x0 = x + 3
x1 = self.n(x0, True)
x2 = x1 + 4
x3 = self.n(x2, False)
return x3 + 5
m = M()
inp = (torch.ones(1),)
print(m(*inp)) # tensor([16.])
ep = torch.export.export(m, inp)
print(ep.module()(*inp)) # tensor([16.])
unflattened = torch.export.unflatten(ep)
print(unflattened(*inp)) # tensor([15.])
```
However, this goes against the spirit of specializing graphs when exporting: we should *expect* that for every call to a submodule we *might* generate a different graph. The goal of this PR is to fix unflattening to handle multiple specialized graphs corresponding to multiple calls to the same submodule.
The idea is simple: for every call to a child module `foo`, we will create potentially different child modules `foo`, `foo@1`, `foo@2`, etc. and use those names as targets in `callmodule` instructions in the parent graph. An immediate consequence of this is that the list of fqns in an unflattened module may not be the same as an exported module. Note that all these variants share the same parameters / buffers, so that multiple calls to the same submodule can share state as expected.
However, as described so far this scheme may end up with needlessly too many submodules. Thus, between calls to the same submodule, if graphs are equal then we optimize away the extra submodules and reuse call names as much as possible. Moreover, when submodules are shared across fqns, we also try to de-duplicate graphs corresponding to their calls as much as possible. Note that no matter what, information about which submodule was called is still preserved, so that if a submodule has to be swapped with another, one can still find all calls to the former submodule and replace them with calls to the latter.
A note on the choice of naming scheme for call names: instead of generating "sibling" modules `foo@1`, `foo@2`, etc. for `foo`, we had considered generating "children" modules `foo._1`, `foo._2`, etc. of `foo`. However this can cause spurious cycles when de-duplicating graphs. E.g., suppose that `foo` is an alias for `bar._1` and `foo._1` is an alias for `bar`, then we must either introduce a cycle or drop the opportunity to optimize. Another idea would be to make `foo` a dummy module that contains `foo._0` corresponding to the first call, but this necessitates too many changes to existing tests and hurts the common case.
Differential Revision: D63642479
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137013
Approved by: https://github.com/pianpwk
Removing `_transform_shapes_for_default_dynamic` and `assume_static_by_default=False` as added in https://github.com/pytorch/pytorch/pull/133620.
This reverts back to `assume_static_by_default=True` with the use of dynamo decorators (e.g. `maybe_mark_dynamic, mark_static`, instead) for handling Dim.AUTO & Dim.STATIC instead. This is easier to maintain, as it doesn't requiring reasoning about "inverting" the dynamic_shapes specs, and also opens up usage of other decorators (`mark_dynamic, mark_unbacked`).
On the user side this change has no effect, but internally this means dynamic behavior is determined only by the `dynamic_shapes` specs (ignoring user-side input decorators following https://github.com/pytorch/pytorch/pull/135536), but transferring this information for _DimHints via decorators, for Dynamo/non-strict to create symbolic_contexts accordingly, e.g. 7c6d543a5b/torch/_dynamo/variables/builder.py (L2646-L2666)
One caveat is we don't raise errors for dynamic decorators on the user side, since we don't know if they're from user markings, or from re-exporting with inputs we've previously marked.
Differential Revision: D63358628
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136591
Approved by: https://github.com/avikchaudhuri
Summary: Title
Test Plan: CI
This fixes some breaking tests in executorch. I think the root cause is when we have aten::matmul which we are not preserving, we register meta implementation from C++ side. It seems like the C++ kernel doesn't work well with mix of FakeTensor and real tensor. This PR sidesteps this problem by always preferring python CIA decomp over C++ Cia decomp
Differential Revision: D63297050
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136492
Approved by: https://github.com/bdhirsh
Summary:
# context
* for the root cause and background please refer to this [post](https://fb.workplace.com/groups/1028545332188949/permalink/1042204770823005/)
* basica idea of this diff is to **short circuit the pytree flatten-unflatten function pairs** between two preserved modules, i.e., EBC/fpEBC and KTRegroupAsDict.
NOTE: There could be multiple EBCs and one single KTRegroupAsDict as shown in the [pic](https://fburl.com/gslide/lcyt8eh3) {F1864810545}
* short-circuiting the EBC-KTRegroupAsDict pairs are very special and a must in most of the cases due to the EBC key-order issue with distributed table lookup.
* hide all the operations behind a control flag `short_circuit_pytree_ebc_regroup` to the torchrec main api call `decapsulate_ir_modules`, which should only be visible to the infra layer, not to the users.
# details
* The `_short_circuit_pytree_ebc_regroup` function finds all the EBCs/fpEBC and KTRegroupAsDict modules in an unflattened module. Retrieve their fqns and sort to in_fqns (regroup_fqns) and out_fqns (ebc_fqns). Because currently the fpEBC is swapped as a whole, so we do some extra fqn logic to filter out the EBC that belongs to an up-level fpEBC.
* a util function `prune_pytree_flatten_unflatten` removes the in-coming and out-going pytree flatten/unflatten function calls in the graph module, based on the given fqns.
WARNING: The flag `short_circuit_pytree_ebc_regroup` should be turned on if EBCs are used and EBC sharding is needed. Assertions are also added if can't find a `KTRegroupAsDict` module, or `finalize_interpreter_modules` is not `True`.
# additional changes
* absorb the `finalize_interpreter_modules` process inside the torchrec main api `decapsulate_ir_modules`.
* set `graph.owning_module` in export.unflatten as required by the graph modification
* add one more layer of `sparse_module` for closely mimicing the APF model structure.
Test Plan:
# run test
* serializer
```
buck2 run fbcode//mode/opt fbcode//torchrec/ir/tests:test_serializer
```
* apf
```
buck2 run fbcode//mode/opt fbcode//aps_models/ads/gmp/tests/ne/e2e_deterministic_tests:gmp_e2e_ne_tests -- --filter-text 'test_mtml_instagram_model_562438350_single_gpu_with_ir'
```
* local mp run
```
==== Finished E2E deterministic test for mtml_instagram_model_gmp_474023725_non_kjt_unary ====
finished
test_mtml_instagram_model_562438350_single_gpu_with_ir
Imports took: 6.0s! Profile with --import-profiler. --_ |""---__
Executed 1 example in 203.1s: |'.| || . """|
Successful: 1 | || || /|\""-. |
Failed: 0 | || || | | |
Skipped: 0 | || || | \|/ |
Not executed: 8 |."| || --"" '__|
https://testslide.readthedocs.io/ --" |__---"""
```
Differential Revision: D62606738
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136045
Approved by: https://github.com/angelayi
In this PR, we deprecate _preserve_ops feature in run_decomposition API. We can't kill this API completely because Executorch team depends on it. As the syncing between two repos is non-trivial, I just leave this argument as deprecated for now. In the next PR, i will immediately remove it.
After this PR, run_decompositions will only decompose what's inside the decomp table and preserve the rest by default. Note that this feature is only rolled out to OSS for now. Old code path is protected under IS_FBCODE flag.
Differential Revision: [D62163161](https://our.internmc.facebook.com/intern/diff/D62163161/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135080
Approved by: https://github.com/justinchuby, https://github.com/avikchaudhuri, https://github.com/bdhirsh
Previously we were accomodating `torch._dynamo.mark_dynamic()` for export's dynamic shapes. Here we clean things up and ignore it, requiring users to specify an export input for `dynamic_shapes`.
Note: there's 4 decorators relevant to export, `mark_dynamic, maybe_mark_dynamic, mark_static, mark_unbacked`. User calls that involve export have only been `mark_dynamic()`, and we use `maybe_mark_dynamic` under the hood for `Dim.AUTO`, but we could start using others. One reason I decided to not warn and just silently ignore is these decorators cause the tensors to carry dynamic info, and it'll be hard to tell whether the markers are from export or user calls when re-exporting with the same inputs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135536
Approved by: https://github.com/avikchaudhuri
Summary:
Added the contextmanager, `_disable_interpreter`, which is meant to put around a call to `unflatten`. This will generate an UnflattendModule and sub-InterpreterModules which will not use torch.fx.Interpreter to run eagerly. We want to have this as a state of the module instead of a contextmanager around running the module because it's not clear where we are calling the unflattened module.
This seems to improve the performance: https://fb.workplace.com/groups/1075192433118967/posts/1473590629945810/?comment_id=1473621763276030
Test Plan: CI
Differential Revision: D60939034
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133996
Approved by: https://github.com/pianpwk
- The new implementation (auto_functionalized_v2) is enabled by default but can be disable
using an inductor flag.
- In export mode the old implementation is used.
**Motiviation**
Previous functionalization fails to re-inplace arguments when they are view over other tensors.
see issue https://github.com/pytorch/pytorch/issues/131192
The new functionalization is easier to re-inplace for views.
**A) Functionalizations pass**
consider a program:
```
func(t)
x = t[0]
y = t[1]
foo(x, y) # custom operator with x, y mutable
return (x, y, t)
```
- To functionalize `foo` we generate a function that operates on the base tensors of the inputs; (x.base() and y.base())
and record how to regenerates the views out of the base for argument x by recording ```ViewInfo=(x.base(), x.size(), x.stride, x,storage_offset())```
- Due to some limitations on the torch.export arguments format, we have to generate alot of arguments, but this is something we can simplify in the future, for the example above we get the following function.
```
auto_functionalized = torch.ops.higher_order.auto_functionalized(torch.ops.mylib.foo.default,
_x_base_index = 0, _x_size = (), _x_stride = (), _x_storage_offset = 0 ,
_y_base_index = 0,_y_size = (), _y_stride = (), _y_storage_offset = 1 ,
_all_bases = [arg0_1])
```
- In the code above:
- _all_bases[t]: refers to a unique set of bases for all foo arguments.
- for each argument x we have _x_base_index, _x_size, _x_stride, _x_storage_offset that can be used to (1) regenerate x from _all_bases[_x_base_index] or a copy of a the base.
- the output of auto_functionalized is foo output , followed by x tensors one for each base in _all_bases, that is a copy of the base tensor after observing the mutations of the all the arguments that are views of that base.
- for each use of a base in _all_bases or a view of it , that are after the call to foo, replace it with a view of the new output
for the function above after functionalization we get :
```
def forward(self, arg0_1: "f32[2][1]cpu"):
auto_functionalized = torch.ops.higher_order.auto_functionalized(torch.ops.mylib.foo.default, _x_base_index = 0, _x_size = (), _x_stride = (), _x_storage_offset = 0, _y_base_index = 0, _y_size = (), _y_stride = (), _y_storage_offset = 1, _all_bases = [arg0_1])
getitem_1: "f32[2][1]cpu" = auto_functionalized[1]; auto_functionalized = None
copy_: "f32[2][1]cpu" = torch.ops.aten.copy_.default(arg0_1, getitem_1); arg0_1 = copy_ = None
# No stacktrace found for following nodes
select_2: "f32[][]cpu" = torch.ops.aten.select.int(getitem_1, 0, 0)
select_3: "f32[][]cpu" = torch.ops.aten.select.int(getitem_1, 0, 1); getitem_1 = None
return (select_2, select_3)
```
**B) Semantics of auto_functionalize**
The new semantics of auto_functionalize is as the following:
1. For each base in all_bases, copy the base and create all_bases copies. (if a base is inplaced we do not need to copy it)
2. For each arg, regenerate the arg from the copy of its base using the view information above.
3. return the original foo output followed by the new bases.
**C) Re-inplace pass**
since auto_functionalize not copy the bases, what we actually inplace is the bases.
(run just like before but on the beses instead of args).
1. For each base b in _all_bases check if there is any use of base (or its aliases/views) after auto_functionalize (before its overwritten with a copy) if there is not any, then inplace it (avoid copying it in step 1 above).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134409
Approved by: https://github.com/zou3519
Summary:
A bit of refactoring to prepare to remove `None` as a way to specify static dimensions in dynamic shapes, given we already have `Dim.STATIC` for the same purpose. We will now warn whenever this happens. However no tests were modified because problematic uses of `None` still need to behave as they do today, until we are ready to remove support. It should be easy to port tests by replacing the warning function to raise instead.
Note that other uses of `None`, such as for entire values (tensor or non-tensor) remain as is. Moving forward this should be the only purpose of `None` (at least externally).
Finally, there's a bit of confusion in our representation now because `AUTO` also internally transforms to `None`. Renamed dynamic_shapes to transformed_dynamic_shapes where this happens. Overall the two forms (pre and post transformation) have different properties so should probably not be represented in the same format in the future.
Test Plan: existing
Differential Revision: D62040729
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134877
Approved by: https://github.com/pianpwk
Summary: When we are placing nodes in the graph, we should also replace the references in module_call_graph.
Test Plan:
buck2 run 'fbcode//mode/opt' torchrec/fb/ir/tests:test_serializer -- --filter-regex test_serialize_deserialize_vlea
buck2 test 'fbcode//mode/opt' fbcode//torchrec/fb/ir/tests:test_serializer -- --exact 'torchrec/fb/ir/tests:test_serializer - torchrec.fb.ir.tests.test_serializer.TestSerializer: test_serialize_empty_value_vlea' --run-disabled
buck2 test 'fbcode//mode/opt' fbcode//torchrec/fb/ir/tests:test_serializer -- --exact 'torchrec/fb/ir/tests:test_serializer - torchrec.fb.ir.tests.test_serializer.TestSerializer: test_deserialized_device_vle' --run-disabled
Differential Revision: D62014035
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134830
Approved by: https://github.com/angelayi
Fixes#133252
In strict mode, we have this routine for mapping traced parameters to their FQNs using tensor ids. Currently we assume there's at least 1 unique FQN for each traced parameter, but this seems to break with parameter reuse when call_module nodes are present. Adding a test case where this breaks.
Fixes this by assigning the same FQN to all traced parameters with the same tensor id. This is fine because we return the original state_dict for the EP, and the unflattener has its own routine of handling aliasing: https://github.com/pytorch/pytorch/pull/125758
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134500
Approved by: https://github.com/angelayi
Summary:
With training IR, we cannot rely on trapping `to()` in `FunctionalTensor` because the regular decomposition kicks it first, and that can cause it to be optimized away.
So instead we preserve it until we functionalize, and then replace it explicitly with `_to_copy()`.
Test Plan: expected test failures go away
Differential Revision: D61883878
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134622
Approved by: https://github.com/zhxchen17, https://github.com/tugsbayasgalan
Summary: Recently https://github.com/pytorch/pytorch/pull/133620 added support for automatic dynamic shapes, where a new enum, `DIM`, was introduced to provide hints like `AUTO` and `STATIC`. This PR is a nominal change where we expose the hints via the existing public `Dim` API, and remove `DIM` from the public API. The main motivation is to avoid having users need to import too many things.
Test Plan: existing
Differential Revision: D61807361
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134484
Approved by: https://github.com/angelayi
Summary: apparently DIM.AUTO leads to duck sizing, I didn't catch this. Doing the least intrusive fix possible by using `torch._dynamo.maybe_mark_dynamic()` under the hood.
Test Plan: added test
Differential Revision: D61809344
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134486
Approved by: https://github.com/avikchaudhuri
Summary:
Today there is no good mechanism to detect progress of non-strict export line-by-line in user code. This caused some pain recently in trying to find the exact line of user code that was triggering a bug where the process appeared stuck because deep down something was calling some symbolic shapes code that was suffering some exponential blowup.
This PR adds a environment variable for extended debugging that will log the line of user code corresponding to every torch function call. It only works in non-strict export for now. Prefix setting this environment variable with `TORCH_LOGS` enabled for `export` logs at `DEBUG` level (i.e., with a `+` prefix), i.e.,.:
```
TORCHEXPORT_EXTENDED_DEBUG_CURRENT_LOC=1 TORCH_LOGS="+export" ...
```
This will show logs with something like:
```
...
prim::device called at .../example.py:4284 in foo
TensorBase.item called at .../example.py:4277 in bar
...
```
We already have an existing place to intercept torch functions where we process data-dependent errors in non-strict, so parking the logging there. An alternative place we could be doing this is where we add `stack_trace` metadata when generating code, but unfortunately at least the example that motivated this gets stuck before generating code, so that would be too late.
Test Plan: ran it on some sample commands
Differential Revision: D61692156
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134298
Approved by: https://github.com/angelayi
Refactors construction of ExportGraphSignature object for export & training IR, explicitly creating AOTAutograd signature for training IR. This will be helpful for upcoming refactors for placeholder naming & runtime asserts prettifying.
Changes:
- dedups `make_argument_spec` call, moved to export/graph_signature.py
- `_sig_to_specs` wrapped into new function `_convert_to_export_graph_signature`, directly converts GraphSignature -> ExportGraphSignature
- `_make_fx_helper` explicitly creates AOTAutograd GraphSignature object
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134059
Approved by: https://github.com/angelayi, https://github.com/ydwu4
Starter version of automatic dynamic shapes for export.
Creates enums `DIM.AUTO`, `DIM.STATIC`, allowing user to specify `AUTO` for dims in dynamic_shapes specs, meaning that corresponding dims are treated as dynamic, and relevant guards will do what's necessary (e.g. refine ValueRanges, set replacements based on equality, or even set static) without raising ConstraintViolationErrors. Basically allows the user to say, "a bunch of these dims can be dynamic, let export do model analysis and return the program with maximum possible dynamism, without complaining".
The usage for specifying `dynamic_shapes` is now:
```
AUTO -> dynamic by default, return whatever produce_guards() says, even if it's static
None/int/STATIC -> static
Dim/DerivedDim -> same as before - will complain if the min/max range is invalid, or if dims related to this are unspecified.
```
Caveat 1: specifying `AUTO` for a dim won't guarantee it'll be dynamic:
- specifying `AUTO` for a dim will return the maximum possible dynamism given your program and other specified constraints, but this can still mean you'll get a static program. For example, with the program below, x is specified dynamic, but it's equal to y, which is specified static, and with how we currently do things we won't promote y to dynamic, but will demote(?) x to static. So this can be surprising if you don't fully know your model, and/or missed one of your other inputs when specifying auto-dynamic shapes.
```
class Foo(torch.nn.Module):
def forward(self, x, y):
return x + y
inputs = (torch.randn(6), torch.randn(6))
export(Foo(), inputs, dynamic_shapes={"x": (DIM.AUTO,), "y": None})
```
Caveat 2: specifying `AUTO` and Dims in the same spec is still problematic:
- The way Dims/DerivedDims are currently handled is very strict. A Dim represents a symbol, and we require a user to specify the symbol for all dims governed by the symbol - that's why we've seen errors in the past like `The values of x must always be related to y by ...`, asking the user to specify the exact relation as in the program. We also require the specified min/max range to be a subset of the valid range from model analysis. All this doesn't compose well with specifying `AUTO` just yet - for example in the program below, ideal behavior could be to return a dynamic program, where `dx = x.size(0) = y.size(0)` has range (3,6). Unfortunately this crashes, and correct behavior is to specify `dx` for both inputs. So currently we raise a UserError and crash if both Dims + `AUTO` are present in the spec.
```
class Foo(torch.nn.Module):
def forward(self, x, y):
return x + y
inputs = (torch.randn(6), torch.randn(6))
export(Foo(), inputs, dynamic_shapes={"x": (DIM.AUTO,), "y": {0: Dim("dx", min=3, max=6)}}) # this doesn't work, because x & y and related
```
Implementation details:
This is done by setting `assume_static_by_default=False`, and doing a transform on the `dynamic_shapes` spec to preserve semantics. `assume_static_by_default=False` will treat unspecified dims or Nones as dynamic. This is the opposite of what `export.export()` currently does - unspecified Dims/Nones are treated as static. Historically this static-by-default behavior, where the user deals with fewer guards, has been desirable, and we would like to respect that in this implementation. So this internal spec transformation is added, `_transform_shapes_for_default_dynamic()`, does the spec conversion necessary to be compatbile with dynamic by default. Specifically, AUTOs are converted into Nones, and Nones/unspecified dims are filled in with explicitly static constraints.
For example, this would look like, for a 3-d tensor: `{0: DIM.AUTO, 1: None, 2: Dim("dx")} -> {0: None, 1: 32, 2: Dim("dx")}`
This does seem overly complicated, but it's done to preserve dynamic shapes semantics for `torch._dynamo.export()`, which already uses `assume_static_by_default=False`, and follows the same process for generating shape constraints , via `_process_dynamic_shapes`. There the semantics are:
```
None/unspecified: dynamic by default
Dim/DerivedDim: also a strict assertion
```
If we don't care about BC for `_dynamo.export(dynamic_shapes)`, then we can just modify semantics for `_process_dynamic_shapes()` and change all the relevant tests in `test/dynamo/test_export.py`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133620
Approved by: https://github.com/avikchaudhuri
Part of #134054.
This corresponds to the pytorch mypy changes from D61493706. Updating takes so
long and touches so many files that it's impossible to land as a whole without conflicting with some other intermediate change.
So landing these 'type: ignore' for pytorch in advance of them actually being needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134202
Approved by: https://github.com/Skylion007
Summary:
Previously, reuse of the same `Dim` was encoded by "sharing" internal constraints among constraint targets. This kind of sharing, implemented using `shared` fields between `_Constraint`s, was originally motivated by `dynamic_dim`, specifically to support `==` between `dynamic_dim`s, but we no longer need to maintain this overcomplicated structure: we can simply use names of `Dims` to directly encode sharing information.
Thus this PR vastly simplifies the structure of `_Constraint` by removing `shared` fields. As a result, both `_Constraint` and its moral subclass, `_DerivedConstraint`, are 1-1 with `Dim` and its moral subclass, `DerivedDim`.
Note that this will break `==` over `dynamic_dim`, so an immediate follow-up will be to remove `dynamic_dim` entirely from our public API. (It's been more than 6 months since the deprecation warning anyway.) I just didn't want to deal with that process in the same PR.
Test Plan: existing
Differential Revision: D61559413
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134045
Approved by: https://github.com/pianpwk
Summary:
In export, we will generate many redundant getitem nodes branching from the same source, inserted by runtime assertions or any passes. This is causing issues with any downstream system relying on any value being uniquely defined by a single node.
I don't think it hurt to remove a bunch of getitem nodes only, so I just added to the ctor.
Test Plan:
rebase on D61256937
```
buck2 run scripts/bearzx:pt2_export_playground
```
Differential Revision: D61351578
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133618
Approved by: https://github.com/tugsbayasgalan
Summary:
Skip re-exporting modules with the duplicated types to speed up the exportability tests.
In real models, there are many duplicated modules, and mostly have the same export issues.
Test Plan: Existing CI
Differential Revision: D61504630
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133930
Approved by: https://github.com/angelayi
Co-authored-by: bearzx <bearzx@fb.com>
Summary: `_ConstraintTarget` is an internal data structure that has some redundancy: tensors are identified by their id but also carry a weak reference. The weak reference was probably useful a year back but everything is done with ids right now, and the lifetime of these tensors ensures that using their ids is OK.
Test Plan: existing tests
Differential Revision: D61488816
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133890
Approved by: https://github.com/tugsbayasgalan
Sorryyyyy for another refactor. This splits `_process_dynamic_shapes` into 3 parts:
1. `_combine_args` - mostly the same thing
2. `_check_dynamic_shapes`, which is responsible for raising 99% of UserErrors if the dynamic shapes spec is invalid (minus 1 UserError with DerivedDims)
3. `_process_dynamic_shapes`, which for now, is the same thing, minus the stuff in 2.
This refactor is helpful for incoming automatic dynamic shapes work, because, we're switching to `assume_static_by_default=False`, which is what `_dynamo.export` currently does. This means any unspecified dims are allocated a symbol, in contrast to export today which keeps unspecified dims static. Historically this has been desirable - export users don't want too much dynamism. So we want to change how the spec is translated into constraints.
This means when we switch over to automatic dynamic shapes, we want to plug in something in between steps 2. and 3. which patches up the spec for `assume_static_by_default=False`, filling in static shapes for any unspecified dims, and potentially clearing out the auto-dynamic dims (since they're no-ops). We would do this in-between 2. and 3. to keep `_process_dynamic_shapes` semantically the same, since it's used with `_dynamo.export`.
We could do this without a refactor, plugging in this transform before `_process_dynamic_shapes`, but since that function's responsible for both spec checking + constraint production, moving spec checking to before we transform the specs helps guarantee we're raising errors on what the user's specified, and not an internal export bug.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133391
Approved by: https://github.com/avikchaudhuri
Summary:
Add a special field in Graph and Node level metadata called "custom" which should be mapped to a json-serializable object, and we guarantee this field should be always preversed across the following transformations:
1. copy/deepcopy
2. run_decompositions()
3. serialization
4. re-exporting
Test Plan: :test_export -- -r custom_tag
Reviewed By: angelayi
Differential Revision: D60291839
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131912
Approved by: https://github.com/angelayi
In joint-graph export we have a `copy.deepcopy(ep.graph_module)` call. This turns out to be an imperfect deepcopy, because deepcopy allows objects to overwrite their `__deepcopy__` methods. For fx.Graph, this ends up deferring to `Graph.create_node()`, which checks the graph namespace, and can avoiding copying the exact name in niche examples, like where the name is a Python keyword (e.g. `input` gets renamed to `input_1`).
Names like `input` happen because export's placeholder naming pass overwrites what the namespace creates, based on the model's `forward()` signature. So we can either 1) avoid overwriting such cases, which requires rewriting the naming pass logic, or 2) force another overwrite after deepcopying. This goes with 2).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133269
Approved by: https://github.com/zhxchen17, https://github.com/dvorjackz, https://github.com/ydwu4
Summary:
Fixes T198245910.
In previous diff D60532628 that causes the test failure, we fix the in-consistency caused by constant tensors is accidentally reigistered as buffer by deleting the buffer and re assign them as constant.
However, this broke several existing tests in pyspeech when the exported program is re-traced with torch.jit.trace (which is an anti-pattern we probably should have some alignment), the jit tracer finds this constant tensor requiring grad and errors out.
This PR force constant attr not requiring grad, which is the correct behavior. A better fix is finding out where the constants are created in user code and why it requires grad. But this has low roi so we warn user about it.
Test Plan: See failures in T198245910.
Differential Revision: D60974869
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133031
Approved by: https://github.com/angelayi
Summary: When PyTree detects a structural mismatch between inputs and dynamic shapes, the error messages are quite horrible. This PR fixes these error messages by adding, for each kind of error, the path to the point where the error happens and an actionable reason for the error.
Test Plan: added test with several cases
Differential Revision: D60956976
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132982
Approved by: https://github.com/yushangdi
Summary:
A re-land of D60006710.
Fixed TrainingIRToRunDecomp failures for test_tensor_attribute_zero_args and also a few re-tracability failures because run_decomposition does a retracing.
edit: also remove the eliminate_dead_code() in _unlift because of one onnx test failure:
a constant tensor attr was lifted as constant_tensor input but it's not used in the graph after aot_autograd due to a short cut in its decomposition. This causes the setattr to be removed by eliminate_dead_code but the graph signature still contains the name of that buffer, which causes an inconsitency between the transformed graph and ep's original signature after _unlift. And it seems that this has happened a few times where some nodes are accidentally removed and we're in an inconsistent state.
The alternative of removing it would be: every time we call elimiate_dead_code, we verify the consistency of the graph with 1. the graph before transformation and 2. all the meta datas but i think this deserves a complete design
edit 2: Also fix the inconsistency of graph signatures when param_constant is marked as lifted_tensor_constants but it's registered as parameters in the output of ep.module().
Differential Revision: D60532628
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132307
Approved by: https://github.com/zhxchen17
Summary:
- make default DCE pass check schema,
- need to rebase onto https://github.com/pytorch/pytorch/pull/131651 after it's in phabricator (for now the change is manually added).
- mark Proxy dump as NotImplemented for better error msg
- Remove Proxy from tensors when dumping models, as Proxy cannot be dumped.
More details in https://docs.google.com/document/d/1G5vmTXjzxoyVGRI2kpA1gQukK_Glyg2NrE0Oh6Nlg9A/edit?usp=sharing.
Test Plan:
CI
```
- buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r qat_conv2d
- test_export.py
- buck2 run 'fbcode//mode/dev-nosan' fbcode//modai/test:test_modai -- -r test_qat_stinson_htp_export
- buck2 run 'fbcode//mode/dev-nosan' fbcode//vizard_projects/ml_depth/tests:test_model -- -r test_qat_model_et
- buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:fx -- -r dce
- buck2 run 'fbcode//mode/dev-nosan' fbcode//bolt/nn/executorch/backends/tests:qnn_test -- -r test_qat_bias=False,use_3d_input=False
- buck2 run 'fbcode//mode/dev-nosan' fbcode//bolt/nn/executorch/backends/tests:qnn_test -- -r test_qat_bias=True,use_3d_input=False
- buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_fold_bn_erases_bn_node
```
Reviewed By: angelayi
Differential Revision: D60319175
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132764
Approved by: https://github.com/angelayi
Summary:
Reland of D60206382.
Suggested in https://github.com/pytorch/pytorch/issues/128394.
If there's an autocast context manager, the predispatch (strict) graph can look something like:
```
class <lambda>(torch.nn.Module):
def forward(self, x: "f32[1]"):
...
_enter_autocast = torch.amp.autocast_mode._enter_autocast('cuda', torch.bfloat16, True, None)
mm: "f32[8, 8]" = torch.ops.aten.mm.default(rand, rand_1); rand = rand_1 = None
_exit_autocast = torch.amp.autocast_mode._exit_autocast(_enter_autocast); _enter_autocast = None
return (mm_1,)
```
But the operator `torch.amp.autocast_mode._enter_autocast` is not a valid ATen op. We remove these nodes by turning autocast into a higher order operator and make a submodule for the blocks between `_enter_autocast` and `_exit_autocast`.
Some potential followup improvement:
1) Merge some of the duplicated logic with `replace_set_grad_with_hop_pass.py`
2) Check the current autocast status (any enabled? dtype?) and not create a submodule if the autocast args matches current autocast status.
Test Plan:
CI
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r "test_predispatch_autocast"
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r "test_predispatch_set_grad"
```
Verified that now we can export the llama model in gh issue 128394 and the gemma model in gh issue 131829 without error.
Differential Revision: D60770038
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132677
Approved by: https://github.com/angelayi
Summary:
Suggested in https://github.com/pytorch/pytorch/issues/128394.
If there's an autocast context manager, the predispatch (strict) graph can look something like:
```
class <lambda>(torch.nn.Module):
def forward(self, x: "f32[1]"):
...
_enter_autocast = torch.amp.autocast_mode._enter_autocast('cuda', torch.bfloat16, True, None)
mm: "f32[8, 8]" = torch.ops.aten.mm.default(rand, rand_1); rand = rand_1 = None
_exit_autocast = torch.amp.autocast_mode._exit_autocast(_enter_autocast); _enter_autocast = None
return (mm_1,)
```
But the operator `torch.amp.autocast_mode._enter_autocast` is not a valid ATen op. We remove these nodes by turning autocast into a higher order operator and make a submodule for the blocks between `_enter_autocast` and `_exit_autocast`.
Some potential followup improvement:
1) Merge some of the duplicated logic with `replace_set_grad_with_hop_pass.py`
2) Check the current autocast status (any enabled? dtype?) and not create a submodule if the autocast args matches current autocast status.
Test Plan:
CI
```
parsh --build-flags fbcode//mode/dev-nosan fbcode//caffe2/test:test_export
run_tests("test_predispatch_autocast")
```
Reviewed By: angelayi
Differential Revision: D60206382
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131914
Approved by: https://github.com/angelayi
Summary: Fixes T192448049. The module call form an unusal call stack for the nodes: https://www.internalfb.com/phabricator/paste/view/P1507230978. This is currently not supported by unflattener and need some extra design to make it work.
Test Plan: buck2 run 'fbcode//mode/opt' torchrec/distributed/tests:test_pt2 -- --filter-text "test_sharded_quant_fpebc_non_strict_export"
Reviewed By: zhxchen17
Differential Revision: D60528900
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132437
Approved by: https://github.com/Skylion007
Summary:
Suggested in https://github.com/pytorch/pytorch/issues/128394.
If there's an autocast context manager, the predispatch (strict) graph can look something like:
```
class <lambda>(torch.nn.Module):
def forward(self, x: "f32[1]"):
...
_enter_autocast = torch.amp.autocast_mode._enter_autocast('cuda', torch.bfloat16, True, None)
mm: "f32[8, 8]" = torch.ops.aten.mm.default(rand, rand_1); rand = rand_1 = None
_exit_autocast = torch.amp.autocast_mode._exit_autocast(_enter_autocast); _enter_autocast = None
return (mm_1,)
```
But the operator `torch.amp.autocast_mode._enter_autocast` is not a valid ATen op. We remove these nodes by turning autocast into a higher order operator and make a submodule for the blocks between `_enter_autocast` and `_exit_autocast`.
Some potential followup improvement:
1) Merge some of the duplicated logic with `replace_set_grad_with_hop_pass.py`
2) Check the current autocast status (any enabled? dtype?) and not create a submodule if the autocast args matches current autocast status.
Test Plan:
CI
```
parsh --build-flags fbcode//mode/dev-nosan fbcode//caffe2/test:test_export
run_tests("test_predispatch_autocast")
```
Reviewed By: angelayi
Differential Revision: D60206382
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131914
Approved by: https://github.com/angelayi
Summary:
- moves logging functionalities into `torch/_export/db/logging.py` file.
- add a check in `_dynamo/eval_frame.py` to check for optional input and error out with `UnsupportedError`
- change the case name of `torch_sym_int` to `unsupported_operator`
- Check if the case name is registered in exportdb, if so, we give a link to the case in exportdb.
- TODO: add test
Test Plan:
CI
Running the example in https://pytorch.org/docs/main/generated/exportdb/index.html#optional-input gives the following error logging:
```
E0730 10:53:33.687000 4155538 torch/_dynamo/eval_frame.py:1086] Parameter y is optional with a default value of tensor([[-0.1633, 1.2414, -0.1071],
E0730 10:53:33.687000 4155538 torch/_dynamo/eval_frame.py:1086] [-0.1936, -0.9425, -0.0824]])
E0730 10:53:33.688000 4155538 torch/export/_trace.py:1043] See optional_input in exportdb for unsupported case. https://pytorch.org/docs/main/generated/exportdb/index.html#optional-input
......
File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/389acaeb40d57230/tutorials/pytorch/nntest/__torchtest__/torchtest#link-tree/torch/_dynamo/eval_frame.py", line 1091, in produce_matching
raise Unsupported(
torch._dynamo.exc.Unsupported: Tracing through optional input is not supported yet
```
It also logs a `export.error.classified` event in Scuba.
Reviewed By: zhxchen17
Differential Revision: D60427208
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132420
Approved by: https://github.com/zhxchen17
Taking inspiration from `GraphModule.print_readable` (aka I copied its [code](17b45e905a/torch/fx/graph_module.py (L824))), I added a `print_readable` to the unflattened module, because it's kind of nontrivial to print the contents of this module.
Example print from `python test/export/test_unflatten.py -k test_unflatten_nested`
```
class UnflattenedModule(torch.nn.Module):
def forward(self, x: "f32[2, 3]"):
# No stacktrace found for following nodes
rootparam: "f32[2, 3]" = self.rootparam
# File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:99 in forward, code: x = x * self.rootparam
mul: "f32[2, 3]" = torch.ops.aten.mul.Tensor(x, rootparam); x = rootparam = None
# No stacktrace found for following nodes
foo: "f32[2, 3]" = self.foo(mul); mul = None
bar: "f32[2, 3]" = self.bar(foo); foo = None
return (bar,)
class foo(torch.nn.Module):
def forward(self, mul: "f32[2, 3]"):
# No stacktrace found for following nodes
child1param: "f32[2, 3]" = self.child1param
nested: "f32[2, 3]" = self.nested(mul); mul = None
# File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:79 in forward, code: return x + self.child1param
add: "f32[2, 3]" = torch.ops.aten.add.Tensor(nested, child1param); nested = child1param = None
return add
class nested(torch.nn.Module):
def forward(self, mul: "f32[2, 3]"):
# File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:67 in forward, code: return x / x
div: "f32[2, 3]" = torch.ops.aten.div.Tensor(mul, mul); mul = None
return div
class bar(torch.nn.Module):
def forward(self, add: "f32[2, 3]"):
# No stacktrace found for following nodes
child2buffer: "f32[2, 3]" = self.child2buffer
# File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:87 in forward, code: return x - self.child2buffer
sub: "f32[2, 3]" = torch.ops.aten.sub.Tensor(add, child2buffer); add = child2buffer = None
return sub
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128617
Approved by: https://github.com/zhxchen17, https://github.com/pianpwk
Summary:
Fixes https://github.com/pytorch/pytorch/issues/130379.
The original error is verifier finds that the placeholder nodes' meta[''val"] are missing in subgraph of WrapSetGradEnabled hop.
In this PR, we fixed it by re-ordering the replace_set_grad_with_hop_pass with lift_constant_tensor pass because only after lift_constant_pass, all the constant attrs start to have meta["val"].
Test Plan: buck2 test test:test_export -- -r "test_setgrad_lifted_tensor"
Differential Revision: D60244935
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131787
Approved by: https://github.com/yushangdi
Suggests fixes for data-dependent errors in non-strict export.
Any data-dependent error has an unresolved condition on unbacked symints. A mechanizable strategy for fixing such errors, which this PR enables, is to "bash" them using `torch._check()`s. For each error we suggest using `torch._check()` on the condition or its negation. The user selects and copy-pastes the suggested fix and continues.
For example, here's an existing data-dependent error message with the suffix following `<snip>...</snip>` added by this PR:
```
Could not guard on data-dependent expression Eq(u2, u1) (unhinted: Eq(u2, u1)). (Size-like symbols: u1)
<snip>...</snip>
User code:
File "test/export/test_export.py", line 1944, in forward
return r.view(items[0], items[2])
Suggested fixes (please choose one of the following):
1. torch._check(items[2] == r.shape[1])
2. torch._check(items[2] != r.shape[1])"
```
Tests in this PR illustrate this workflow, by taking common examples of data-dependent errors and bashing them until success, purely based on suggested fixes. In particular, we test this workflow on the "puzzlers" in https://www.internalfb.com/intern/anp/view/?id=5330476 (thanks @ezyang).
In terms of implementation, we focus on non-strict mode, where we can intercept torch function calls to install a handler that walks up the stack from the error, finding the closest non-torch frame and inspecting its locals for symints appearing in the error. The suggested fixes then access these symints through the local variables so that they can be (a) easily understood by the user (b) directly added to the code.
Implementing this idea in strict mode is follow-up work—we have already investigated what it would take, and decided to separate it out of this PR for reasons described next.
It's not too hard to map symints to locals in Dynamo (although it needs to happen elsewhere, i.e., intercepting torch function calls won't work). However, unfortunately this doesn't seem to be enough; the graph modules created by Dynamo when going through AOTAutograd can raise further data-dependent errors in some cases, and thus we need yet another mechanism to map symints to locals for graph modules, via captured source-level metadata and FX node walking. This latter component will require some care to build properly, or we might conclude it is altogether unnecessary and fix Dynamo instead.
Differential Revision: D56867432
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125378
Approved by: https://github.com/ezyang
Summary:
Dynamo doesn't track whether buffers are `persistent`. This led to some ugly code where we would mark buffers as always persistent when creating signatures, then later check whether the buffers were not in the state dict to infer whether they were non-persistent, and use this to fix up the signature.
This PR instead defines a utility to look up all the non-persistent buffers registered inside a module (this information is recorded in a private `_non_persistent_buffers_set` module attribute), and uses it to (a) correctly set the persistent flag on buffers when creating signatures (b) transfer this information to a Dynamo-traced graph module, which then causes non-persistent buffers to (correctly) not show up in the state dict.
Test Plan: existing tests + new case with non-persistent buffer in nested module
Differential Revision: D60224656
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131756
Approved by: https://github.com/zhxchen17, https://github.com/ydwu4
Summary:
This pr fixes all the places in strict export stack where the output node's meta is not preserved correctly. However, we're getting a new error for the test we intend to fix: `buck2 run caffe2/test/quantization:test_quantization -- -r "test_re_export_preserve_handle"`:
The `get_attr` nodes has wrong metadata. I guess there are more things need to be fixed to get it working but it's beyond the scope of this PR.
Test Plan: buck2 run caffe2/test/quantization:test_quantization -- -r "test_re_export_preserve_handle"
Differential Revision: D60198221
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131706
Approved by: https://github.com/yushangdi
Summary:
When importing `_trace.py`, put `torch._dynamo.exc.Unsupported` in the global variable ``_ALLOW_LIST`` can cause import to ``export/_trace.py`` to fail with error:
ValueError: Artifact name: 'graph_breaks' not registered, please call register_artifact('graph_breaks') in torch._logging.registrations.
The error is directly raise on line `graph_breaks_log = torch._logging.getArtifactLogger(__name__, "graph_breaks")` in `_dynamo/exc.py`. I've checked that ``register_artifact('graph_breaks')`` does already exist in torch._logging.registrations.
Explicitly call `import torch._logging` doesn't fix the issue.
(see T196719676)
We move ``_ALLOW_LIST`` to be a local variable.
Test Plan:
buck2 test 'fbcode//mode/opt' fbcode//aiplatform/modelstore/publish/utils/tests:fc_transform_utils_test -- --exact 'aiplatform/modelstore/publish/utils/tests:fc_transform_utils_test - test_serialized_model_for_disagg_acc (aiplatform.modelstore.publish.utils.tests.fc_transform_utils_test.PrepareSerializedModelTest)'
buck2 test 'fbcode//mode/opt' fbcode//aiplatform/modelstore/publish/utils/tests:fc_transform_utils_test -- --exact 'aiplatform/modelstore/publish/utils/tests:fc_transform_utils_test - test_serialized_test_dsnn_module (aiplatform.modelstore.publish.utils.tests.fc_transform_utils_test.PrepareSerializedModelTest)'
Differential Revision: D60136706
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131523
Approved by: https://github.com/zhxchen17
Summary:
Previously it was unclear what `_convert_input_to_fake` actually does (used in strict), and in particular how it is different from `make_fake_inputs` (used in non-strict).
This PR splits that function to work purely on user inputs, then renames it to `extract_fake_inputs` and adds a comment clarifying what it does—namely, it extracts fake inputs from a given graph module instead of "converting inputs to fake inputs" (as suggested by the current name) or "making fake inputs" (as happens in non-strict, where no tracing has taken place yet).
The remainder of that function used to also fakify params and buffers. It turns out that this part is identical to what happens in non-strict, hence we also pull `make_fake_inputs` out from `non_strict_utils` into `_trace`, merge it with another util, and make both modes call it.
Test Plan: existing tests
Differential Revision: D60084442
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131421
Approved by: https://github.com/zhxchen17
Summary:
- Log export errors to Scuba and mark them with "classified" and "unclassified"
- Classify errors by exception type (ALLOW_LIST) and a `case_name` attribute
- Add `case_name` for some exceptions.
Test Plan:
Running the code below logs a classified error to `torch_export_usage` table in Scuba.
```
import torch
from torch._export.db.case import SupportLevel
class TorchSymMin(torch.nn.Module):
"""
torch.sym_min operator is not supported in export.
"""
def forward(self, x):
return x.sum() + torch.sym_min(x.size(0), 100)
example_args = (torch.randn(3, 2),)
tags = {"torch.operator"}
support_level = SupportLevel.NOT_SUPPORTED_YET
model = TorchSymMin()
torch.export.export(model, example_args)
``
Differential Revision: D59981459
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131327
Approved by: https://github.com/zhxchen17
Summary:
This diff reverts D59561509
D59561509: [FX][export] DCE pass, check schema for node impurity (#130395) by yushangdi causes the following test failure:
Tests affected:
- [cogwheel:cogwheel_mtia_cmf_m5_shrunk_test#test_flow_with_verification](https://www.internalfb.com/intern/test/844425041436985/)
Here's the Multisect link:
https://www.internalfb.com/multisect/6533402
Here are the tasks that are relevant to this breakage:
T191383430: 10+ tests unhealthy for ads_mtia_inference
The backout may land if someone accepts it.
If this diff has been generated in error, you can Commandeer and Abandon it.
Test Plan: NA
Differential Revision: D60029318
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131341
Approved by: https://github.com/angelayi
Fixed TrainingIRToRunDecomp failures for test_tensor_attribute_zero_args and also a few re-tracability failures because run_decomposition does a retracing.
**edit:** also remove the eliminate_dead_code() in _unlift because of one onnx test failure:
a constant tensor attr was lifted as constant_tensor input but it's not used in the graph after aot_autograd due to a short cut in its decomposition. This causes the setattr to be removed by eliminate_dead_code but the graph signature still contains the name of that buffer, which causes an inconsitency between the transformed graph and ep's original signature after _unlift. And it seems that this has happened a few times where some nodes are accidentally removed and we're in an inconsistent state.
The alternative of removing it would be: every time we call elimiate_dead_code, we verify the consistency of the graph with 1. the graph before transformation and 2. all the meta datas but i think this deserves a complete design.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130990
Approved by: https://github.com/pianpwk
Sets `prefer_deferred_runtime_asserts_over_guards=True` for export, so any guards emitted from `SymNode.expect_true` (for example, guards that are implicitly required to be true for an op to succeed) won't lead to constraint violations. Instead these should appear in the graph as runtime asserts, or potentially as replacement expressions for placeholder shapes.
For example, this reshape op should emit s0 * s1 = s2, deferred as a runtime assert.
```
x = torch.randn(4, 8) # [s0, s1]
y = torch.randn(32) # [s2]
out = x.reshape(-1) + y
# this emits Eq(s0 * s1, s2), and we represent y's shape as [s0*s1] in the graph.
```
However, other complex guards can still cause export to fail, for instance guards emitted from `SymNode.guard_bool/guard_size_oblivious` (e.g. explicit if-else conditions in user code or lower-level op implementations hit during tracing) can still raise constraint violations. These can be deferred with `allow_complex_guards_as_runtime_asserts=True`. We don't yet make this default, because while this makes export more likely to succeed, it results in non-trivial asserts being emitted that often represent specialization to a variant of the op, or checks related to 0/1 specialization.
We also remove forced specializations for export and kill the `_disable_forced_specializations` flag - now any guard we can't express with Dims/DerivedDims either are handled with Hybrid SymInts, or should be resolved with rewriting or deferring.
Follow up:
Currently, `ShapeEnv._set_replacement()` is called for complex equality expressions (e.g. s2 -> s0*s1 in the example above), and the ExportedProgram stores `s0*s1` in the input placeholder. This isn't checked for validity when the program is run, so an option is to avoid replacement and/or runtime assert on equality.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130775
Approved by: https://github.com/avikchaudhuri
Summary: Finishing up the mechanism to "register" certain types of operators to a registry so that the serializer can handle them correctly. This is expected to be firstly used by executorch.
Test Plan: buck run mode/opt caffe2/test:test_export -- -r test_export_with_extension_op_serialization
Differential Revision: D59825148
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130851
Approved by: https://github.com/angelayi
Taking inspiration from `GraphModule.print_readable` (aka I copied its [code](17b45e905a/torch/fx/graph_module.py (L824))), I added a `print_readable` to the unflattened module, because it's kind of nontrivial to print the contents of this module.
Example print from `python test/export/test_unflatten.py -k test_unflatten_nested`
```
class UnflattenedModule(torch.nn.Module):
def forward(self, x: "f32[2, 3]"):
# No stacktrace found for following nodes
rootparam: "f32[2, 3]" = self.rootparam
# File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:99 in forward, code: x = x * self.rootparam
mul: "f32[2, 3]" = torch.ops.aten.mul.Tensor(x, rootparam); x = rootparam = None
# No stacktrace found for following nodes
foo: "f32[2, 3]" = self.foo(mul); mul = None
bar: "f32[2, 3]" = self.bar(foo); foo = None
return (bar,)
class foo(torch.nn.Module):
def forward(self, mul: "f32[2, 3]"):
# No stacktrace found for following nodes
child1param: "f32[2, 3]" = self.child1param
nested: "f32[2, 3]" = self.nested(mul); mul = None
# File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:79 in forward, code: return x + self.child1param
add: "f32[2, 3]" = torch.ops.aten.add.Tensor(nested, child1param); nested = child1param = None
return add
class nested(torch.nn.Module):
def forward(self, mul: "f32[2, 3]"):
# File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:67 in forward, code: return x / x
div: "f32[2, 3]" = torch.ops.aten.div.Tensor(mul, mul); mul = None
return div
class bar(torch.nn.Module):
def forward(self, add: "f32[2, 3]"):
# No stacktrace found for following nodes
child2buffer: "f32[2, 3]" = self.child2buffer
# File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:87 in forward, code: return x - self.child2buffer
sub: "f32[2, 3]" = torch.ops.aten.sub.Tensor(add, child2buffer); add = child2buffer = None
return sub
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128617
Approved by: https://github.com/zhxchen17, https://github.com/pianpwk
Summary: Uses original ExportedProgram constants and graph signature to inform decompositions, so that constant tensors and non-persistent buffers are respected for training IR. Removes 7 test failures for training IR.
Test Plan: test_export
Differential Revision: D59820909
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130864
Approved by: https://github.com/angelayi
Summary: Adds non-strict implementation of training IR export. Any expected non-strict training IR failures are also either existing strict training IR or non-strict failures (no new failures added). 4 strict training IR failures also resolved.
Refraining from unifying export/export_for_training, per @ydwu4's feedback :)
Test Plan: added test_export_training_ir_to_run_decomp_non_strict.py for non-strict training IR
Differential Revision: D59349454
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130062
Approved by: https://github.com/ydwu4, https://github.com/zhxchen17
Fixes the failure in `test/export/test_export_training_ir_to_run_decomp.py ` caused by dead code elimination removing node with side effects.
For background, in export, we may want to export higher-level IRs that are not functional, so we need to check for side effects more carefully.
A call_function node is impure if it has at least one mutable argument.
Fixed the tests below:
test_to_module_with_mutated_buffer_multiple_update_sub_later
test_export_input_mutation_static_shape
test_buffer_util
Another attempt modifying the original DCE pass is made in PR #130395, but it breaks some other tests, so here we add a flag and use it for export only.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130552
Approved by: https://github.com/pianpwk
Summary: This diff updates the ExportedProgram class in PyTorch to allow for multiple verifiers to be attached to it. This is done by adding a new field to the ExportedProgram schema called "verifiers" which is a list of strings representing the names of the verifiers to be attached to the program. The verifiers are loaded using the "load_verifier" function which is defined in the "torch._export.serde.serialize" module. The "exported_program.dialect" field is also deprecated in favor of the "verifiers" field.
Test Plan: CI
Differential Revision: D59408546
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130364
Approved by: https://github.com/angelayi, https://github.com/ydwu4
Summary: Previously, remove_effect_tokens pass didn't pass kwargs to the internal nodes. This PR fix it and add a test for it.
Test Plan: buck2 run caffe2/test:test_export -- -r test_remove_effect_token_kwargs
Reviewed By: angelayi
Differential Revision: D59603147
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130491
Approved by: https://github.com/angelayi
original PR: https://github.com/pytorch/pytorch/pull/128599 (re-created after revert + poisoned diff train)
Summary:
This PR adds deduplication and CSE for runtime asserts. Existing size computation in the graph is CSE'd along with added runtime asserts, and redundant asserts are removed. Shape calls on intermediate tensors are also turned into compute on input sizes if possible, allowing intermediate tensors to be freed earlier. For example:
```
z = torch.cat([x, x], dim=0) # 2*s0
w = z.repeat(y.shape[0]) # 2*s0*s1
_w = w.shape[0]
s0 = x.shape[0]
s1 = y.shape[0]
_w0 = 2 * s0
_w = _w0 * s1
```
Additionally, constrain_range calls are deduplicated. Single-symbol bound checks for unbacked symbols (e.g. u0 >= 0, u0 <= 5) and sym_constrain_range.default calls are also removed, since they accumulate range info in the ShapeEnv, and are replaced with two _assert_scalar.default calls that check the min/max bounds. For example:
```
torch.sym_constrain_range_for_size(n, min=2, max=16)
torch.sym_constrain_range(n, min=4, max=20)
torch._check(n >= 0)
torch._check(n >= 3)
torch._check(n <= 14)
torch.sym_constrain_range_for_size(n)
torch._check(n >= 4)
torch._check(n <= 14)
```
Test Plan:
contbuild & OSS CI, see 940e4477ab
Original Phabricator Test Plan:
Imported from GitHub, without a `Test Plan:` line.
Differential Revision: D59543603
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130380
Approved by: https://github.com/izaitsevfb
Fixes the example in #118304 for `torch._functorch.aot_autograd.aot_export_module` and `torch.export.export`.
On a high level, the issue is caused by not detecting fake_mode when there's no input.
Change plan:
1) we add a `dynamic_shapes: Union[bool, None] = None` arg to `aot_export_module` and `_aot_export_function`.
2) if the input is not a graph module, then we can only rely on this `dynamic_shapes` input arg.
3) If the input is a graph module, then we can traverse the graph and check.
4) So we check if the input mod is a graph module or just a module, and do 2) or 3) depending on the type.
Fixes#129927
Bug source: dynamo's fake_mode is not detected correctly in `_convert_input_to_fake` in `_traced.py` when there’s no input to the graph). So in ` _strict_export_lower_to_aten_ir`, we create another fake_mode. `dynamo_fake_mode` is not the same as the fake_mode used by dynamo.
Change plan:
check `gm_torch_level` graph's node meta "example_value" for fake mode in addition.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129928
Approved by: https://github.com/angelayi
This PR adds deduplication and CSE for runtime asserts. Existing size computation in the graph is CSE'd along with added runtime asserts, and redundant asserts are removed. Shape calls on intermediate tensors are also turned into compute on input sizes if possible, allowing intermediate tensors to be freed earlier. For example:
```
z = torch.cat([x, x], dim=0) # 2*s0
w = z.repeat(y.shape[0]) # 2*s0*s1
_w = w.shape[0]
# something with _w ...
# turns into ->
s0 = x.shape[0]
s1 = y.shape[0]
_w0 = 2 * s0
_w = _w0 * s1
```
Additionally, constrain_range calls are deduplicated. Single-symbol bound checks for unbacked symbols (e.g. u0 >= 0, u0 <= 5) and sym_constrain_range.default calls are also removed, since they accumulate range info in the ShapeEnv, and are replaced with two _assert_scalar.default calls that check the min/max bounds. For example:
```
torch.sym_constrain_range_for_size(n, min=2, max=16)
torch.sym_constrain_range(n, min=4, max=20)
torch._check(n >= 0)
torch._check(n >= 3)
torch._check(n <= 14)
# turns into
torch.sym_constrain_range_for_size(n)
torch._check(n >= 4)
torch._check(n <= 14)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128599
Approved by: https://github.com/ezyang