This might cause some new DDEs on call sites that do not use is_contiguous_or_false() or sym_is_contiguous()
but want to find those call sites to handle this properly by calling is_contiguous_or_false() and not is_contiguous() explitly when appropriate.
I had to fix one issue after removing the implicit size oblivious reasoning. here is context
we defined in this https://github.com/pytorch/pytorch/pull/157472 sym_is_contiguous to be the function computing contiguity for dynamic shapes in c++. It returns a symbolic expression that represents contiguity and guaranteed not to throw a DDE.
when people call is_contiguous we do sym_is_contiguous().guard_bool()
when people call is_contiguous_or_false we do sym_is_contiguous().guard_or_false()
one issue not handled well was this path
```
c10::SymBool TensorImpl::sym_is_contiguous_custom(
at::MemoryFormat memory_format) const {
if (C10_UNLIKELY(matches_python_custom(SizesStridesPolicy::CustomStrides))) {
return pyobj_slot_.load_pyobj_interpreter()->is_contiguous(
this, memory_format);
}
return sym_is_contiguous_default(memory_format);
}
```
namely if we call sym_is_contiguous_custom but we have matches_python_custom(SizesStridesPolicy::CustomStrides) return true , then we used to call is_contiguous(this, memory_format);
This used to go through the load_pyobj_interpreter and end up calling the python is_contiguous call which used implicit size oblivious reasoning.
once we removed that implicit size oblivious reasoning, the right thing we want is to call
return pyobj_slot_.load_pyobj_interpreter()->sym_is_contiguous(this, memory_format);
otherwise we would get DDE even if the caller is doing sym_is_contiguous.
so I had to define it for pyinterpreter, and then I had to override it for nested tensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159197
Approved by: https://github.com/ezyang
Summary: We were getting DDE on reshape still!! i looked deeper and found an issue in _view_has_unbacked_input namely when input is [[,,]] it need to be normalized to [..]
Test Plan:
existing tests.
Rollback Plan:
Differential Revision: D79951119
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160255
Approved by: https://github.com/bobrenjc93
Python dispatcher is not always enabled in fake tensors and have to be called explicitly.
While it should be, it requires some work to get all tests working.
I have been running in several issues where I add to add enable_python_dispatcher ex
XLA, Helom ..etc to avoid issues related to that for the view specifically i moved it to fake tensor impl.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158406
Approved by: https://github.com/bobrenjc93
When select has data dependent input, we cant tell if the actual index shall be index+size or index.
to avoid throwing dde, we allocate a new unbacked symbol to represent the storage offset of the
output view and we compute its value dynamically at runtime when inductor is lowered.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157605
Approved by: https://github.com/ColinPeppler
When select has data dependent input, we cant tell if the actual index shall be index+size or index.
to avoid throwing dde, we allocate a new unbacked symbol to represent the storage offset of the
output view and we compute its value dynamically at runtime when inductor is lowered.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157605
Approved by: https://github.com/ColinPeppler
Handles GC for non-strict draft export; GPU memory usage shouldn't be much more than eager mode + input tensors now.
While trying to do draft export CPU offloading, I found out GC is feasible, because in non-strict, there's 2 places holding references to a `.real_tensor` attribute:
1) the FakeTensors in fake tensor prop, but these are held by the actual variables in the model's forward call, and so the real tensor gets gc-ed along with the fake one when the variable goes out of scope.
2) A clone of the fake tensor in 1) stored in `proxy.node.meta["val"]`, which was added in https://github.com/pytorch/pytorch/pull/150948. But we didn't actually need to store them on intermediate values; the placeholders are enough for retracing/lowering.
Avoiding storing the intermediate values in 2), the values in 1) should be naturally GC-ed, and the real-tensor memory usage for non-strict should be pretty similar to eager computation?
Strict still OOMs; dynamo still holds these in variable tracking, and not sure how to GC those.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154630
Approved by: https://github.com/angelayi, https://github.com/yushangdi
## What
- use `definitely_contiguous_for_memory_format` instead of `is_contiguous` when the non-contiguous case is fine if we encounter a DDE.
- use ref's contiguous over Aten's contiguous because Aten's version will DDE and stop tracing. ref's version will use `definitely_contiguous_for_memory_format` and clone if there's a DDE.
## Example DDEs
- Fixed with `definitely_contiguous_for_memory_format` in `fast_binary_impl`
```
torch._dynamo.exc.UserError: Could not guard on data-dependent expression Eq((u0//387), 0) (unhinted: Eq((u0//387), 0)). (Size-like symbols: u0)
Caused by: layer_norm = self.layer_norm(linear) # caffe2/test/export/test_export.py:4566 in forward (_subclasses/fake_impls.py:1022 in fast_binary_impl)
```
- Fixed with `refs.contiguous` instead of calling aten's contiguous (that'd require a bigger re-write in Aten)
```
File "c10/core/TensorImpl.h", line 825, in torch::autograd::THPVariable_contiguous(_object*, _object*, _object*)
File "c10/core/SymbolicShapeMeta.h", line 87, in c10::TensorImpl::is_contiguous_default(c10::MemoryFormat) const
File "c10/core/SymbolicShapeMeta.cpp", line 250, in c10::SymbolicShapeMeta::init_is_contiguous() const
torch.fx.experimental.symbolic_shapes.GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq(128*((u0//387)), 0) (unhinted: Eq(128*((u0//387)), 0)). (Size-like symbols: u0)
Caused by: (_refs/__init__.py:3302 in native_layer_norm)
```
- Fixed with `definitely_contiguous_for_memory_format` in ref's contiguous
```
torch.fx.experimental.symbolic_shapes.GuardOnDataDependentSymNode: Could not guard on data-dependent expression 387*((u0//387)) < 2 (unhinted: 387*((u0//387)) < 2). (Size-like symbols: u0)
Caused by: (_prims_common/__init__.py:279 in is_contiguous)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155260
Approved by: https://github.com/laithsakka
ghstack dependencies: #155499
Handles GC for non-strict draft export; GPU memory usage shouldn't be much more than eager mode + input tensors now.
While trying to do draft export CPU offloading, I found out GC is feasible, because in non-strict, there's 2 places holding references to a `.real_tensor` attribute:
1) the FakeTensors in fake tensor prop, but these are held by the actual variables in the model's forward call, and so the real tensor gets gc-ed along with the fake one when the variable goes out of scope.
2) A clone of the fake tensor in 1) stored in `proxy.node.meta["val"]`, which was added in https://github.com/pytorch/pytorch/pull/150948. But we didn't actually need to store them on intermediate values; the placeholders are enough for retracing/lowering.
Avoiding storing the intermediate values in 2), the values in 1) should be naturally GC-ed, and the real-tensor memory usage for non-strict should be pretty similar to eager computation?
Strict still OOMs; dynamo still holds these in variable tracking, and not sure how to GC those.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154630
Approved by: https://github.com/angelayi, https://github.com/yushangdi
Summary: We need real_tensor on the FakeTensor in node.meta["val"] in order to aot_compile the draft exported programs. Otherwise, we cannot propagate real tensors even when fake_mode.propagate_real_tensors = True.
This also fixes real tensor propagation in `run_decomposition()`.
Test Plan:
```
buck2 run @mode/dev-nosan caffe2/test:test_export -- -r test_dedup_data_dependent_failure
```
Differential Revision: D72732714
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150948
Approved by: https://github.com/angelayi
Summary:
When we divide a FakeTensor by an integer using the fast op implementation, the type promotion should be `ELEMENTWISE_TYPE_PROMOTION_KIND.INT_TO_FLOAT` so we get a float when dividing an int FakeTensor by an integer.
```
FAST = get_fast_op_impls()
fast_div = FAST[torch.ops.aten.div.Tensor]
fast_div(fake_tensor, some_int)
```
Test Plan:
```
python test/test_fake_tensor.py -k test_fast_div
```
Differential Revision: D72667430
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150874
Approved by: https://github.com/angelayi
* Automatically applies ruff rule 401. Turns loops into equivalent list comprehensions which are faster and do not leak the scope of the loop variables.
* list comprehensions not only often have better typing, but are 50+% faster than for loops on overhead. They also preserve length information etc and are better for the interpreter to optimize.
* Manually went back and made mypy happy after the change.
* Also fixed style lints in files covered by flake8 but not by pyfmt
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140980
Approved by: https://github.com/justinchuby, https://github.com/malfet
During export, we nub out most CIA ops to return NotImplemented to avoid decomposing them during tracing. To recover the existing shape propagation behavior, we register these CIA decomps directly as FakeTensorMode rules as well. The reason we have to do is because when we return NotImplemented, FakeTensor would fallback to running these CIAs with Meta backend causing device branching CIA ops to fail. (because now the device is Meta. One example is sdpa). If we register a kernel directly to FakeTensorMode, we won't fallback to Meta backend.
Differential Revision: [D65716260](https://our.internmc.facebook.com/intern/diff/D65716260/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140465
Approved by: https://github.com/bdhirsh
### Summary
The fake impl for `nonzero` sets the symint's upper range to `sys.maxsize - 1` if there are any SymInts in the original input tensor shape. This PR constrains the range more intelligently by using the upper ranges of each SymInt in the input tensor shape.
See https://github.com/pytorch/pytorch/pull/134899 as a merged solution for a similar problem for a different op.
### Test plan
Added unit test to verify upper bound reduction calculation (`python test/export/test_export.py TestExport.test_nonzero_dynamic`)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137663
Approved by: https://github.com/ezyang