The big idea is to add `create_unbacked_symfloat` and `create_unbacked_symint` to ShapeEnv, allowing you to allocate symbolic floats/ints corresponding to data you don't know about at compile time. Then, instead of immediately erroring out when you try to call local_scalar_dense on a FakeTensor, we instead create a fresh symint/symfloat and return that.
There a bunch of odds and ends that need to be handled:
* A number of `numel` calls converted to `sym_numel`
* When we finally return from item(), we need to ensure we actually produce a SymInt/SymFloat when appropriate. The previous binding code assumed that you would have to get a normal Python item. I add a pybind11 binding for Scalar (to PyObject only) and refactor the code to use that. There is some trickiness where you are NOT allowed to go through c10::SymInt if there isn't actually any SymInt involved. See comment.
* One of our unit tests tripped an implicit data dependent access which occurs when you pass a Tensor as an argument to a sizes parameter. This is also converted to support symbolic shapes
* We now support tracking bare SymInt/SymFloat returns in proxy tensor mode (this was already in symbolic-shapes branch)
* Whenever we allocate an unbacked symint, we record the stack trace it was allocated at. These get printed when you attempt data dependent access on the symint (e.g., you try to guard on it)
* Subtlety: unbacked symints are not necessarily > 1. I added a test for this.
These unbacked symints are not very useful right now as you will almost always immediately raise an error later when you try to guard on them. The next logical step is adding an assertion refinement system that lets ShapeEnv learn facts about unbacked symints so it can do a better job eliding guards that are unnecessary.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90624
Approved by: https://github.com/Skylion007, https://github.com/voznesenskym
* Skip a unittest that needs FFT if not built with FFT
* Mark a test with "slow": `python test/test_ops.py -k TestCompositeComplianceCUDA.test_forward_ad_svd_lowrank_cuda_float32` took >5min on my machine.
* Skip a flaky test that's marked "expectedFailure", similar to #90233
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90609
Approved by: https://github.com/soumith
This lowers the `reduce_dtype` retrieval to the `handle` instead of the `state` in preparation for `fully_shard`, and this adds a guard to avoid a no-op `to()` call.
Note that this change pretty much gets overridden in following PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90615
Approved by: https://github.com/rohan-varma
Use register_state_dict_pre_hook in FSDP to simplify state_dict implementations & remove hacks. This removes `def state_dict` entirely and paves the path for composable API as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90436
Approved by: https://github.com/fegin
This saves a data structure `_stream_to_name: Dict[torch.cuda.Stream, str]` that maps each FSDP stream to its name. This can help in debugging by checking `_stream_to_name[torch.cuda.current_stream()]` to see if it is `"default"` or `"unshard"` in the post-backward hook for example.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90611
Approved by: https://github.com/rohan-varma
Optimizes the nccl python bindings to reserve space when converting PythonObject* into Tensors. This should reduce the number of unnecessary allocations in the nccl bindings as the std::vector grows.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88203
Approved by: https://github.com/ezyang
Applies some more missing std::move found by static analysis. This should improve performance and reduce unnecessary copies. This PR only targets ATen for now.
And before you ask about the edits, std::move is optimal in a ternary operator as copy ellision cannot happen one. The best thing is probably rewriting it as an if else, but ultimately this should be performant enough.
Followup to #88512 and #88514
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89000
Approved by: https://github.com/ezyang
Instead of inferring shape mappings from a bunch of data structures that were plumbed in InstructionTranslator, we instead work out mappings by just iterating over the GraphArgs and mapping symbols to arguments as they show up. If multiple argument sizes/strides/offset map to the same symbol, this means they are duck sized, so we also generate extra equality tests that they must be equal. Finally, we generate 0/1 specialization guards. The resulting code is much shorter, and I think also easier to understand.
TODO: Delete all the tensor ref tracking code, it's unnecessary
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90528
Approved by: https://github.com/voznesenskym
So, uh, I have a new strategy for generating dupe guards, one where I don't actually need to allocate symints for every tensor that is fakeified. So I'm reverting the changes I made from earlier PRs in this one.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90381
Approved by: https://github.com/voznesenskym
Wow, I had to sweat so much to get this PR out lol.
This PR enforces the invariant that whenever we allocate SymInts as part of fakeification, the SymInt is associated with a Source, and in fact we store the string source name on SymbolWithSourceName. We use 'sname' as the shorthand for source name, as 'name' is already used by sympy to name symbols.
In order to store source names, we have to plumb source names from Dynamo to PyTorch. This made doing this PR a bit bone crushing, because there are many points in the Dynamo codebase where we are improperly converting intermediate tensors into fake tensors, where there is no source (and there cannot be, because it's a frickin' intermediate tensor). I've fixed all of the really awful cases in earlier PRs in the stack. This PR is just plumbing in source names from places where we do have it.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90295
Approved by: https://github.com/voznesenskym
Summary:
This pull request makes some LazyGraphExecutor private data structures protected such that XLAGraphExecutor can reuse them.
Here is the list:
1. DeviceLocker.
2. DeviceLockerArena.
3. DataCacheArena. In addition, it also introduces LazyGraphExecutor::ResetTrimCounter() such that XLAGraphExecutor can reuse the trim counter.
Test Plan:
CI.
P.S. This is to re-land #90457.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90598
Approved by: https://github.com/JackCaoG
Retry of #89595. Accidentally closed.
## Forked `BaseCppType`
Created a module for Executorch: `torchgen.executorch`.
In `torchgen.executorch.api.types.types`:
* Define `BaseCppType` with `torch::executor` namespace.
In `torchgen.executorch.api.et_cpp`:
* Help generate `NamedCType` for `ExecutorchCppSignature` arguments.
In `torchgen.executorch.api.types.signatures`:
* Define the signature using these types. (`ExecutorchCppSignature`)
In `torchgen.executorch.api.types.__init__`:
* Suppress flake8 error for `import *`.
Differential Revision: [D41501836](https://our.internmc.facebook.com/intern/diff/D41501836/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90591
Approved by: https://github.com/iseeyuan
A retry of #89487. Accidentally closed.
## Split `torchgen.api.types` into `types_base`, `types` and `signatures`.
In `types_base`:
* Created base class `CType`. `BaseCType` and `ConstRefCType` etc are inheriting `CType`.
* Only keep abstract type model definitions, such as `BaseCppType`.
In `types`:
* Define `BaseCppType` with `at` and `c10` namespaces.
* All the signatures using these types.
In `signatures`:
* Define all the signatures.
In `__init__`:
* `from ... import *`, suppress flake8 error.
Differential Revision: [D41455634](https://our.internmc.facebook.com/intern/diff/D41455634/)
**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D41455634/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90589
Approved by: https://github.com/iseeyuan
Variable length arguments can overflow the arena being used to keep overhead
low for torch dims. If we hit this case, we know the amount of work being done
is already relatively big, so we just fallback to standard memory allocation.
Fixes#88586
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88596
Approved by: https://github.com/ezyang
This adds a d3-based interactive visualization for exploring the memory
allocation traces that the caching allocator can capture. This visualization
code can also be attached to kineto trace information in the future to also
provide visualization for the memory events captured there, which come with
addition information about the graph.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90348
Approved by: https://github.com/robieta