We introduced a special graph break to avoid max-recursion-depth error
in #100296.
After #111415, the original `test_list_self_reference` no longer
triggers the special graph break because we started modeling root frame
free variables with `LazyVariableTracker`.
After #117426, we no longer build the list items eagerly, and they'll hit
`variable_tracker_cache` when they get lazily constructed later.
As a result, this patch updates the `test_list_self_reference` test and
removes the special graph break.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142438
Approved by: https://github.com/jansel
ghstack dependencies: #142437
This PR removes the functorch config that set an upper limit on the number of aliased
inputs with dynamic shapes. After moving them to be run at runtime in C++, the compilation
time and runtime (in true alias cases) improved, rendering the error no longer relevant.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141680
Approved by: https://github.com/bdhirsh
ghstack dependencies: #139554, #139555, #140013
Fixes https://github.com/pytorch/pytorch/issues/141710. Previously, if we called flash attention with a `SymFloat` scale that was properly symbolic, we would unsafely cast its raw `SymFloat._data` into a `float`, which is pretty much guaranteed to give `NaN`.
This avoids the NaNs in the linked issue, but I'm not sure if it's worth landing yet because we'll start specializing and recompiling for every distinct `scale` that is passed in (which in the dynamic shapes case, is some function of `query.size(-1)`).
The real fix would be to ensure that the flash attention (and related) ops all accept a symbolic version of the `scale`. I'm not sure if we should use `SymFloat` or `Scalar` though - more discussion in the issue
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141725
Approved by: https://github.com/ezyang
There has been a series of attempts to provide support for resizing in
torch operators like `torch.sigmoid(x, out=y)`, i.e., `y` would have a
different shape before and after this expression. Prior to this patch,
we have some checks to graph break if the shape changed.
This patch extends
1. extends the existing check and graph break for any shape change, not
just for `TensorVariable` with source field.
2. removes an old code path which was introduced to address the shape
change, but became obselete in that regard because we added extra
checks to graph break upon shape change. Moreover, this old code path
is unsound, it tries to replace references to the old
`TensorVariable` the new one returned by `wrap_fx_proxy`, but it only
does the replacement in `symbolic_locals`, which breaks when cells
are involved. In general the old `TensorVariable` could be _anywhere_,
think the `replace_all` we had for immutable VTs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140202
Approved by: https://github.com/jansel
ghstack dependencies: #140035, #140036, #140149, #140150, #140151, #140201
Before this PR, calling `is_nested` in-graph would result in graph code like the following:
```python
class GraphModule(torch.nn.Module):
def forward(self, L_nt_: "f64[3, s1, 5]", s1: "Sym(s1)"):
l_nt_ = L_nt_
# Note this useless line!
getattr_1 = l_nt_.is_nested; getattr_1 = None
add: "f64[3, s1, 5]" = l_nt_ + 2; l_nt_ = None
return (add,)
```
This PR follows what is done for `is_sparse` / `is_quantized`: store it onto `TensorVariable` and have `getattr` calls to `is_nested` return the stored value as a constant. This removes the useless line above from the graph. Note that guarding is handled through tensor type check guards, so no need to guard on `is_nested` status.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138841
Approved by: https://github.com/soulitzer
Fixes https://github.com/pytorch/pytorch/issues/136640
Today, inductor has some logic to figure out when it needs to do broadcasting during lowering, which just checks if any of the input shapes have sizes equal to 1.
In particular: we should already have this information by the time we get to inductor, because our FakeTensor compute will have branched/guarded on whether any ops performed broadcasting, appropriately.
In particular, if we have a tensor with a size value of `(64//((2048//(s3*((s2//s3)))))))`, and it happens to be equal to one (and it is used in an op that requires this dim to be broadcasted), FakeTensorProp will have generated a guard:
```
Eq((64//((2048//(s3*((s2//s3))))))), 1)
```
I chose the simplest possible way to beef up inductor's checks to know when a given size is equal to 1: loop over the existing shape env guards, and if our current size is a sympy expression on the LHS of one of our `Eq(LHS, 1)` guards, then return True.
I'm hoping for feedback on whether or not this approach is reasonable. One better option I could imagine is that our symbolic reasoning should have automatically simplified the size of our tensor down to a constant as part of evaluating that guard. I was originally going to try to do this directly in the shape env, but I ran into a few issues:
(1) I wanted to call some version of `set_replacement(expr, 1)`. But `set_replacement()` only accepts plain symbols on the LHS, not expressions
(2) in theory I could get this to work if I could rework the above expression to move everything that is not a free variable to the RHS, e.g. `Eq(s2, 32)`. It looks like our existing `try_solve()` logic is... [not quite able](https://github.com/pytorch/pytorch/blob/main/torch/utils/_sympy/solve.py#L27) to do this generally though.
Checking the guards feels pretty simple-and-easy. Are we worried that it is too slow to iterate over all the guards? I could also cache the lookup so we only need to iterate over guards that are of the form `Eq(LHS, 1)`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136670
Approved by: https://github.com/ezyang
Fixes https://github.com/pytorch/pytorch/issues/136640
Today, inductor has some logic to figure out when it needs to do broadcasting during lowering, which just checks if any of the input shapes have sizes equal to 1.
In particular: we should already have this information by the time we get to inductor, because our FakeTensor compute will have branched/guarded on whether any ops performed broadcasting, appropriately.
In particular, if we have a tensor with a size value of `(64//((2048//(s3*((s2//s3)))))))`, and it happens to be equal to one (and it is used in an op that requires this dim to be broadcasted), FakeTensorProp will have generated a guard:
```
Eq((64//((2048//(s3*((s2//s3))))))), 1)
```
I chose the simplest possible way to beef up inductor's checks to know when a given size is equal to 1: loop over the existing shape env guards, and if our current size is a sympy expression on the LHS of one of our `Eq(LHS, 1)` guards, then return True.
I'm hoping for feedback on whether or not this approach is reasonable. One better option I could imagine is that our symbolic reasoning should have automatically simplified the size of our tensor down to a constant as part of evaluating that guard. I was originally going to try to do this directly in the shape env, but I ran into a few issues:
(1) I wanted to call some version of `set_replacement(expr, 1)`. But `set_replacement()` only accepts plain symbols on the LHS, not expressions
(2) in theory I could get this to work if I could rework the above expression to move everything that is not a free variable to the RHS, e.g. `Eq(s2, 32)`. It looks like our existing `try_solve()` logic is... [not quite able](https://github.com/pytorch/pytorch/blob/main/torch/utils/_sympy/solve.py#L27) to do this generally though.
Checking the guards feels pretty simple-and-easy. Are we worried that it is too slow to iterate over all the guards? I could also cache the lookup so we only need to iterate over guards that are of the form `Eq(LHS, 1)`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136670
Approved by: https://github.com/ezyang
Using `fsdp.set_` for unsharded_param inplace update causes difficult-to-debug errors when enabling Traceable FSDP2 on TorchTune models. In this PR, we change it to use `fsdp.copy_` which fixes the error and also strictly follows eager semantics (i.e. if user explictly stores an alias of the unsharded_param during execution of the user's module code, that alias will get updated correctly when the unsharded_param is copy_ into; whereas if we just swap out unsharded_param storage via set_, that user-saved alias will not get updated, which is not good).
This PR also implements the graph pass to remove the resizes and copy if there is a resize_(full) -> copy_ -> resize_(0) pattern.
------
Test commands:
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_trace_fsdp_copy_`
- `pytest -rA test/dynamo/test_repros.py::ReproTests::test_partitioner_cse_respects_mutation_boundaries`
- `pytest -rA test/dynamo/test_repros.py::ReproTests::test_fsdp_set_input_mutation_applied_when_input_gets_no_gradients`
- `pytest -rA test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mutation_op_matching`
- `python test/inductor/test_distributed_patterns.py DistributedPatternTests.test_fake_distributed_aot_eager`
- `PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=1 PYTORCH_TEST_WITH_CROSSREF=1 python test/functorch/test_aotdispatch.py TestEagerFusionOpInfoCPU.test_aot_autograd_exhaustive_norm_cpu_float32`
- `python test/distributed/test_inductor_collectives.py TestCollectivesInductor.test_backwards`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133730
Approved by: https://github.com/bdhirsh