A common complaint when working with data-dependent code in PyTorch is that it's hard to tell how far you are from the finish line: every time a GuardOnDataDependentSymNode error is hit, you have to somehow fix or work around it to see the next one.
This PR adds a new mode `torch._functorch.config.fake_tensor_propagate_real_tensors` which modifies fake tensors to also propagate real tensors. This means that when we try to guard on a data-dependent SymNode, we can actually produce a real result. We also produce a warning which you should consult to figure out what the crux points are.
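For example, a minimal sketch of enabling the mode (assuming the standard config-module `patch` helper; `fn` and its inputs are placeholders, and `capture_scalar_outputs` is just used here to keep `item()` in the graph):
```python
import torch
import torch._functorch.config as functorch_config

torch._dynamo.config.capture_scalar_outputs = True  # keep item() in-graph

def fn(x, n):
    i = n.item()           # unbacked SymInt u0 during tracing
    return x[: max(1, i)]  # data-dependent guard, e.g. Max(1, u0) < ...

# With the mode on, guards on data-dependent expressions are resolved using
# the real values and logged as propagate_real_tensors warnings instead of
# raising GuardOnDataDependentSymNode.
with functorch_config.patch(fake_tensor_propagate_real_tensors=True):
    torch.compile(fn)(torch.ones(8), torch.tensor(4))
```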
I ran this on vision_maskrcnn. In the baseline (without this mode), the model has 27 graph breaks, resulting in 40 graphs. With this mode on, the model has only 11 graph breaks, resulting in 15 graphs (the remaining graph breaks are due to missing support for `item()` on float tensors and some other missing Dynamo features). You get a list of things that would have errored, like this:
```
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u1) < 2) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u1), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u1), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Ne(Max(1, u1), 1)) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u0) < 2) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u0), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u0), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Ne(Max(1, u0), 1)) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u1) < 2) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u1), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u1), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Ne(Max(1, u1), 1)) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u0) < 2) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u0), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u0), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Ne(Max(1, u0), 1)) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u1) < 2) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u1), 1)) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Ne(Max(1, u1), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u0) < 2) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u0), 1)) -> False
```
Potential later follow-ups:
* Improve the warning messages (in particular, should provide user frames)
* GC real tensors when they are no longer needed by tracing. Right now, this will use A LOT of memory, as much as if GC were broken and every intermediate tensor were kept live
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125115
Approved by: https://github.com/IvanKobzarev
To fix data-dependent errors we want to recommend that people use `torch._check*` APIs. The `constrain_as*` APIs should be fully subsumed by them, and in the future we should kill them entirely.
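For instance, a minimal sketch of the recommended style (the `constrain_as*` equivalents noted in the comments are approximate):
```python
import torch

def f(idx_tensor):
    n = idx_tensor.item()     # unbacked SymInt under torch.compile
    torch._check_is_size(n)   # replaces a constrain_as_size-style call
    torch._check(n <= 128)    # replaces a constrain_as_value-style bound
    return torch.zeros(n)
```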
Differential Revision: D56774333
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125253
Approved by: https://github.com/ezyang
Summary: These breakages were discovered by enabling codecache by default and doing a CI run. I'll commit these fixes first; eventually, enabling caching by default will (hopefully) be a one-liner.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125258
Approved by: https://github.com/eellison
This completely subsumes https://github.com/pytorch/pytorch/pull/120816
This makes use of the unbacked binding machinery to teach Inductor how to generate deferred runtime asserts directly. There is some back story about why I did it this way, let me explain.
Previously, our strategy for generating runtime asserts was that Dynamo would insert them into the FX graph after finishing tracing, and we would attempt to code generate them based on the FX graph. This is a good strategy for export, where we immediately export the graph. However, this strategy was afflicted by problems in eager, where we reuse the same ShapeEnv as before. In particular, on subsequent graph passes, we would immediately turn all of these assertions into noops, because when we evaluated their expressions, we already had the deferred runtime assert in the ShapeEnv, so we would conclude "oh, of course this expression is True". Oops!
So, with this PR, we take the attitude that as long as the ShapeEnv sticks around, the ShapeEnv's list of deferred runtime asserts is the source of truth, and we don't put anything in the graph. So we just need to decide when to actually generate asserts, and the place I picked was Inductor lowering, since we already have an AssertScalar buffer concept, and so I just need to insert them at this point. AssertScalar also uses raw sympy.Expr rather than SymInt/Bool, so it is easier to prevent unrestricted simplification at this point.
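For illustration, a minimal sketch of user code whose tracing records a deferred runtime assert (the exact code Inductor emits for the corresponding AssertScalar buffer is an internal detail, not shown here):
```python
import torch

torch._dynamo.config.capture_scalar_outputs = True  # keep item() in-graph

@torch.compile
def f(x, n):
    u0 = n.item()                  # unbacked SymInt u0
    torch._check(u0 <= x.size(0))  # can't be decided at trace time:
                                   # recorded as a deferred runtime assert
    return x[:u0]

f(torch.randn(8), torch.tensor(4))
```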
There are a few things jumbled together in this PR. I can split them if you want; some of the changes predate my change in strategy, but they're useful changes anyway.
**torch/_dynamo/output_graph.py** and **torch/_inductor/lowering.py** - Here, we stop putting deferred runtime asserts in the graph. I also have to make sure we don't DCE unused symbol arguments; we're going to get some goofy graph arguments this way, and it will be good to restore that optimization eventually. We also just disable codegen for `_assert_scalar` entirely; we assume that ShapeEnv will be good enough to capture all of these.
**torch/_inductor/codegen/wrapper.py** and **torch/_inductor/ir.py** - Add a way to codegen sizevars without forcing simplification.
**torch/_inductor/graph.py** - The main logic. Our strategy is to interpose in the same place we are testing that unbacked SymInts are properly showing up in lowered code. The logic is directly analogous to the logic in the existing insert deferred runtime asserts FX pass, but it's simpler because sympy expressions can be directly stored on inductor IR nodes.
**torch/fx/experimental/symbolic_shapes.py** - For extra safety, we have a way of freezing runtime asserts, so that if you try to add more we error. This prevents us from adding runtime asserts after we've done lowering. There's a funny interaction with backwards, for which there's a comment in graph.py.
**torch/fx/passes/runtime_assert.py** - This is not really needed in this PR, but I rewrote the runtime assert logic to use unbacked_bindings rather than inferring it by looking for unbacked SymInts. Now, keypaths are translated into FX node accessors. Unfortunately, I couldn't delete the old inference code, because you still need it to find backed SymInts from arguments (as this pass may be used on graphs which don't explicitly bind all their shape variables as arguments). There are some new tests exercising this.
TODO: I think we need to generate asserts for replacements too. This is a preexisting problem that the old FX pass had too.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124874
Approved by: https://github.com/jansel
ghstack dependencies: #124864
We guard on key order:
1) When a key is a non-constant object
2) When we actually need key order - like `.values()`, `.items()`, etc.

For dicts/OrderedDicts that do not require key-order guarding, we just rely on the usual `GuardManager + DictGetItemGuardAccessor`. This is faster than going through the `list(d.keys())`-based design for OrderedDicts.
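A hedged illustration of case 2: iterating `.values()` makes the output depend on key order, so an order guard is required, while a plain `d[key]` lookup would only need `DictGetItemGuardAccessor`:
```python
import torch

@torch.compile
def f(d):
    return sum(d.values())  # key-order dependent: order guard installed

f({"a": torch.ones(2), "b": torch.ones(2)})
```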
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124779
Approved by: https://github.com/jansel
This subsumes https://github.com/pytorch/pytorch/pull/124069
In the original PR, my idea was that when we run PropagateUnbackedSymInts, we check that the sizes before and after are exactly the same. This ended up turning up lots of bugs that I didn't feel like fixing. Separately, Ivan let me know that this pass was quite expensive in terms of compile time, since we spent a lot of time thinking about the equalities.
To kill two birds with one stone, we now only check for equality precisely when an unbacked SymInt was bound (thanks to the previous PR in this stack, we now have this information). Specifically, we look to see if `meta["unbacked_bindings"]` is set on the old node, and if it is, we assert that the old value is equal to the new value from the repropagation. Note that the pytree key is used to actually extract the new value from the example value, as it may be nested inside, e.g., a tensor size.
We do something a bit naughty at the end: we use `defer_runtime_assert` to actually teach ShapeEnv about the equality. This is implementationally equivalent to what we used to do, but we're going to change this later soon.
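A minimal sketch of that check (the keypath accessor is passed in as a hypothetical helper, and the exact internal signatures are assumptions; only the shape of the logic follows the description above):
```python
import sympy

def check_unbacked_binding(old_node, new_node, shape_env, get_by_keypath):
    bindings = old_node.meta.get("unbacked_bindings")
    if not bindings:
        return
    for symbol, keypath in bindings.items():
        # the new value may be nested, e.g. inside a tensor's size
        new_value = get_by_keypath(new_node.meta["example_value"], keypath)
        # the "naughty" step: record old == new as a deferred runtime assert
        shape_env.defer_runtime_assert(
            sympy.Eq(symbol, new_value), "propagate_unbacked_symints"
        )
```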
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124297
Approved by: https://github.com/lezcano
ghstack dependencies: #124290
Closes #114966
Frozen field assignment in `__init__` in Python 3.8-3.9:
f5bd65ed37/Lib/dataclasses.py (L402-L411)
```python
import builtins
BUILTINS = builtins

def _field_assign(frozen, name, value, self_name):
    # If we're a frozen class, then assign to our fields in __init__
    # via object.__setattr__. Otherwise, just use a simple
    # assignment.
    #
    # self_name is what "self" is called in this function: don't
    # hard-code "self", since that might be a field name.
    if frozen:
        return f'BUILTINS.object.__setattr__({self_name},{name!r},{value})'
    return f'{self_name}.{name}={value}'
```
Frozen field assignment in `__init__` in Python 3.10+:
812245ecce/Lib/dataclasses.py (L436-L445)
```python
__dataclass_builtins_object__ = object

def _field_assign(frozen, name, value, self_name):
    # If we're a frozen class, then assign to our fields in __init__
    # via object.__setattr__. Otherwise, just use a simple
    # assignment.
    #
    # self_name is what "self" is called in this function: don't
    # hard-code "self", since that might be a field name.
    if frozen:
        return f'__dataclass_builtins_object__.__setattr__({self_name},{name!r},{value})'
    return f'{self_name}.{name}={value}'
```
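For context, a hedged example of the pattern this affects: compiling code that constructs a frozen dataclass means Dynamo must trace the generated `object.__setattr__` assignment, whose spelling differs between the two version ranges shown above:
```python
import dataclasses
import torch

@dataclasses.dataclass(frozen=True)
class Point:
    x: torch.Tensor
    y: torch.Tensor

@torch.compile
def f(a):
    p = Point(a, a + 1)  # __init__ assigns fields via object.__setattr__
    return p.x + p.y

f(torch.randn(3))
```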
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124393
Approved by: https://github.com/jansel
Fixes https://github.com/pytorch/pytorch/issues/119607 for 3.11+.
In 3.11+, `_PyFrame_FastToLocalsWithError` could implicitly run `COPY_FREE_VARS` on the original frame, leading to double increfs, since the dynamo shadow frame can rerun `COPY_FREE_VARS`. So the solution is to skip the first `COPY_FREE_VARS` instruction in the shadow frame if it was already executed in the original frame.
Also move the location for clearing the original frame in 3.12 to handle error cases more thoroughly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124238
Approved by: https://github.com/jansel
Some changes to how we handle blocks in 3.11+:
- We only keep track of with blocks that are not enclosed in a try block
- We do not compile partial graphs if we are in a block that is not a tracked with block - i.e., any block enclosed in some non-with try/except/etc. block (see the sketch below)
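A hedged sketch of the second point (a `with` block enclosed in a `try` is not tracked, so a graph break inside it does not produce a partial graph and the frame falls back to eager):
```python
import torch
import torch._dynamo

@torch.compile
def f(x):
    try:
        with torch.no_grad():            # enclosed in try: not tracked
            torch._dynamo.graph_break()  # no partial graph compiled here
            x = x + 1
    except RuntimeError:
        pass
    return x

f(torch.randn(2))
```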
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123978
Approved by: https://github.com/jansel
Accessing co_lnotab causes a deprecation warning to be issued, causing some dynamo-wrapped tests to fail. We do not need to remove co_lnotab from tests for now, as it is still useful as an additional check of linetable correctness, but we will need to deal with co_lnotab's removal in 3.14.
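Where a test still needs the raw line-number table, a minimal sketch of reading it without tripping warning-as-error filters (on 3.12+, the access emits a DeprecationWarning pointing at `co_lines`):
```python
import warnings

def lnotab_of(code):
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", DeprecationWarning)
        return code.co_lnotab  # deprecated accessor for line-number data
```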
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123577
Approved by: https://github.com/jansel
`JUMP_BACKWARD` in 3.12+ may not be in the exception table even though it should be considered a part of the block. Also fix an issue where we didn't propagate the exception table entry to new instructions when expanding the `POP_JUMP_IF_[NOT_]NONE` instruction.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123392
Approved by: https://github.com/jansel
Python 3.12 changed a few things with how `_PyInterpreterFrame`s are allocated and freed:
- Frames are now required to be placed on the Python frame stack. In 3.11, we could allocate frames anywhere in memory. In 3.12, we now need to use `THP_PyThreadState_BumpFramePointerSlow`/`push_chunk`/`allocate_chunk`. This method of allocating/freeing frames is also compatible with 3.11.
- The eval frame function is now responsible for clearing the frame (see https://docs.python.org/3/whatsnew/changelog.html#id128, the point about "...which now clear the frame.")
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122146
Approved by: https://github.com/jansel
Previously, when we applied a replacement, an unbacked SymInt would transmute into whatever we replaced it with (e.g., a constant).

This has a major downside: we often look at SymInts associated with FX nodes (e.g., the meta of an x.item() return) to find out where the unbacked SymInt was allocated. If we replace it, we can no longer find out where, e.g., u1 was allocated! But we need to know this so we can generate deferred runtime asserts like u1 == s0.
To solve this problem, I have a special mode for replace, resolve_unbacked=False, which lets you disable substitutions on unbacked SymInts. When reporting node.expr, we preferentially avoid applying unbacked SymInt substitutions. To understand if we might accidentally reapply the substitution later, before we have reached the deferred runtime assert, we must study the calls to simplify() in ShapeEnv. My audit turns up these sites:
* `produce_guards`: this is fine, deferred runtime asserts never show up here, we must NOT have unbacked SymInts show up here. Similarly `get_nontrivial_guards`.
* `_maybe_evaluate_static`: this is fine, we are using this to determine if it is necessary to produce a guard/runtime assert. We don't want to reissue a runtime assert if we've already asserted on it, and replacements can help us understand if this has occurred.
* `_simplify_floor_div`: this is a legitimate bug, it needs to be `resolve_unbacked=False`
* `_refine_ranges`: this is fine, a refined range doesn't affect what runtime asserts we issue
* `_update_divisible`: this updates the `self.divisible` set, which specifies when we can simplify away divisibility constraints. Since this affects replacements only, it won't cause us to oversimplify a user provided expression.
There are some situations where we DO want to always apply the substitution, specifically when we have the duplicate symbol problem (we retrace an item call and get u0 and u1 which refer to the same thing.) I don't want two symbols in this case, so a special `rename_unbacked_to` is provided which sets up the unconditional renaming.
Along the way, I make a refinement to `_update_var_to_range`: if you update a var range for a size-like unbacked SymInt, you are now no longer allowed to set its lower bound below 2. This is because if you could, then our size oblivious tests for it would be inconsistent. Actually, I think there is still some inconsistency, because if you assert `u0 == 0` we will still end up with this in deferred runtime asserts, and we will then use this to simplify these statements to be True everywhere else. Maybe we should forbid this kind of refinement; not done in this PR.
Fixes https://github.com/pytorch/pytorch/issues/119689
Fixes https://github.com/pytorch/pytorch/issues/118385
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120816
Approved by: https://github.com/lezcano
Fix: https://github.com/pytorch/xla/issues/6009
This PR adds another case to the `TensorVariable.method_new` special-casing, where it re-dispatches `new` to `new_empty`.

Since we are using fake tensors, the `new` call doesn't actually get to the corresponding backend (e.g. XLA). So, things like the following might happen:
```python
@torch.compile(backend="openxla")
def foo(x):
    new_x = x.new(*x.size())
    # new_x.device() == "xla"
    # x.device() == "xla:0"
    return new_x + x

a = torch.arange(10)
foo(a.to(xm.xla_device()))
```
Resulting in the following error:
```python
Traceback (most recent call last):
  ...
  File "torch/_dynamo/utils.py", line 1654, in get_fake_value
    ret_val = wrap_fake_exception(
  File "torch/_dynamo/utils.py", line 1190, in wrap_fake_exception
    return fn()
  File "torch/_dynamo/utils.py", line 1655, in <lambda>
    lambda: run_node(tx.output, node, args, kwargs, nnmodule)
  File "torch/_dynamo/utils.py", line 1776, in run_node
    raise RuntimeError(make_error_message(e)).with_traceback(
  File "torch/_dynamo/utils.py", line 1758, in run_node
    return node.target(*args, **kwargs)
  File "torch/utils/_stats.py", line 20, in wrapper
    return fn(*args, **kwargs)
  File "torch/_subclasses/fake_tensor.py", line 885, in __torch_dispatch__
    return self.dispatch(func, types, args, kwargs)
  File "torch/_subclasses/fake_tensor.py", line 1224, in dispatch
    return self._cached_dispatch_impl(func, types, args, kwargs)
  File "torch/_subclasses/fake_tensor.py", line 955, in _cached_dispatch_impl
    output = self._dispatch_impl(func, types, args, kwargs)
  File "torch/_subclasses/fake_tensor.py", line 1445, in _dispatch_impl
    return self.wrap_meta_outputs_with_default_device_logic(
  File "torch/_subclasses/fake_tensor.py", line 1575, in wrap_meta_outputs_with_default_device_logic
    return tree_map(wrap, r)
  File "torch/utils/_pytree.py", line 900, in tree_map
    return treespec.unflatten(map(func, *flat_args))
  File "torch/utils/_pytree.py", line 736, in unflatten
    leaves = list(leaves)
  File "torch/_subclasses/fake_tensor.py", line 1550, in wrap
    ) = FakeTensor._find_common_device(func, flat_args)
  File "torch/_subclasses/fake_tensor.py", line 625, in _find_common_device
    merge_devices(arg)
  File "torch/_subclasses/fake_tensor.py", line 620, in merge_devices
    raise RuntimeError(
torch._dynamo.exc.TorchRuntimeError: Failed running call_function <built-in function add>(*(FakeTensor(..., device='xla', size=(10,), dtype=torch.int64), FakeTensor(..., device='xla:0', size=(10,), dtype=torch.int64)), **{}):
Unhandled FakeTensor Device Propagation for aten.add.Tensor, found two different devices xla, xla:0
```
Using `new_empty` instead fixes this error, because it takes the device from the source tensor instead of inferring it from the current dispatch key set.
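In effect (a hedged sketch of the rewrite), the special case makes Dynamo treat the call above as:
```python
new_x = x.new_empty(x.size())  # inherits x.device ("xla:0"), not "xla"
```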
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121075
Approved by: https://github.com/jansel