Fix: #141974
This PR makes `ViewMeta` sequence, present in functional tensors,
serializable with pickle. In order to accomplish that, it makes
`ViewMeta` an abstract class with overridable `forward` and `reverse`
functions. In this context, each operation that once instanciated
`ViewMeta`, should now create a new specialized class that inherits from
`ViewMeta. Therefore, this PR also uses codegen for creating these
specializations.
In summary, these are the changes this PR introduces:
- `ViewMeta` is turned into an abstract class (see
_FunctionalStorageImpl.cpp_). `forward` and `reverse` are pure virtual
functions that need to be implemented. `to_out_index` should be
implemented by operations that might return more than 1 output.
- New `ViewMeta` specializations for `resize_` and `_unsafe_view` are
created (see _FunctionalizeFallbackKernel.h_).
- New templates _ViewMetaClasses.{cpp,h}_ are created. They hold the
declaration and definition of the `ViewMeta` specializations, which
are automatically generated in the ATen codegen (see _gen.py_).
- New `_functionalization` Python sub-module is created (see
_Module.cpp_). It serves as namespace for the `ViewMeta`
specializations and `InverseReturnMode` enum.
- New template _ViewMetaClassesPythonBinding.cpp_ is created. It holds
the automatically generated Python bindings for the `ViewMeta`
specialization, which are generated in the torch codegen (see
_generate_code.py_).
Note that this PR makes use of codegen at 2 different moments:
- ATen codegen (_gen.py_): generates the `ViewMeta` specialized classes.
- Torch codegen (_generate_code.py_): generated the Python bindings for
them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143712
Approved by: https://github.com/bdhirsh
@tugsbayasgalan found a bug for nested subclasses:
E.g. we have
TwoTensor(TwoTensor(t1, t2), t0).
After right_inverse we have:
rebuilt_stack == [(TwoTensor, meta, ["a", "b"]), (TwoTensor, meta, ["a", "b"])]
plain_tensors == [t0, t1, t2]
We will first put plain tensors, and only then the nested TwoTensor.
But when we unflatten:
todo = [t0, t1, t2]
we first create TwoTensor[t1, t2]
put it to todo [t0, TwoTensor[t1, t2]]
And as a result get
TwoTensor(t0, TwoTensor(t1, t2))
which is swapping original a and b :)
So the fix should be different, we need to preserve the order of elements in the stack for plain/subclasses.
I will think about the fix.
Fix:
Keep order of inner_tensor_attr_names according them added to the stack. (first - plain tensor attributes, then subclass attributes)
Test:
```
python test/functorch/test_aotdispatch.py -k test_subclass_parameters
```
Differential Revision: [D67032477](https://our.internmc.facebook.com/intern/diff/D67032477)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142481
Approved by: https://github.com/tugsbayasgalan, https://github.com/bdhirsh
This implements a new wrapper class AOTDispatchCompiler wrapper, which is just a wrapper around a callable that returns an OutputCode. We can then use it in AOTDispatch to decide whether or not to use the cache: if fw_compiler, bw_compiler and inference_compiler are all AOTDispatchCompilers, then we enable caching.
This type is pretty close to _CompiledFxGraphCallable, except it's not allowed to take any kwargs. Not sure how to consolidate the two ideas together just yet: unfortunately, there's no way to properly annotate the types to make them related. But a lot of the time, the input to this function will be a partially applied _CompiledFxGraphCallable.
This allows the PR above this one to enable AOTAutogradCache everywhere, but not increase instruction count or enable cache on unit tests that use aot_eager or other non inductor compilers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142205
Approved by: https://github.com/oulgen, https://github.com/bdhirsh
Allow mutations mutations for subclasses that are non-contiguous.
Changes:
Removing assert in collect_metadata_analysis
Main requested testcase:
Compilation of NJT.index_put()
Adding test in test_nestedtensor.py, that compiles NJT.index_put()
It is decomposed to NJT split,unbind, which needed additional `torch._check`, `torch._check_is_size` for NJT.unbind() and guard_size_oblivious() usage in _meta_registrations and _inductor/lowering.py.
Special case:
If tangent is mutated outside of the graph, it does not participate in backward graph. Autograd in this case will set this tangent to zeros tensor.
We handle it separately in CompiledFunction.backward: not doing any processing for this tangent and broadcast to number of expected subclass unwrapped arguments.
disabling for dynamo 2 tests:
1/ For nested tensor - symbolic shapes issue on nested_tensor index operation that does splits [0, 0, 0] - there is a failure with "pending unbacked symints". This PR does not add more .tolist()/item() ops than it was before.
2/ As we do not fail with exception in collect_metadata_analysis new paths for dynamo started working and it started failing with smth strange that set_ in storage_offset (because of test for views) handling updates storage "cpu" -> "meta"
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139630
Approved by: https://github.com/bdhirsh
Parametrization can not be registered for non-direct child parameters of the module.
We have to iterate through all submodules and register parametrization at every level.
Original testcase did not test the nested modules case - adding submodule to the test.
Testing:
```
python test/functorch/test_aotdispatch.py -k test_subclass_parameters
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142155
Approved by: https://github.com/bdhirsh
This PR fixes the issue where AOTAutograd would produce a guard that used a symbolic value
that came from one of the input's base.
```python
@torch.compile(backend="aot_eager", dynamic=True)
def f(a, b):
a.add_(1)
b.add_(1)
return a
x = torch.ones(10)
f(x[1:], x[1:])
```
In the example above, AOTAutograd functionalizes the mutation by making use of
`as_strided_scatter` operation, which produces the guard: `s0 >= s1 + 1`, where:
- `s0`: corresponds to `x.size()[0]`
- `s1`: corresponds to `a.size()[0]`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139554
Approved by: https://github.com/bdhirsh
FXGraphCache supports freezing, but AOTAutogradCache does not. This is due to the fact that when freezing is turned on, instead of using the constants from the graph module that was saved on cache miss, we have to take the constants from the AOTAutograd generated graph module. This PR does two things:
- It bypasses AOTAutogradCache when freezing is turned on. We should have always been doing this.
- It refactors the code to be way more clear about the constants we're using and when we're using them.
Basically, there are two possible sets of constants we can grab from the compiled fx graph.
1. If freezing is turned off, we save the constants directly in CompiledFxGraph.
2. If freezing is turned on, we save the *names* of the constants in CompiledFxGraph, and use the runtime GraphModule's actual constant values: we reconstruct them from the saved names + the new graph module from AOTDispatch.
We implement two different classes for doing just this: one that has access to the post aotdispatch gm, which supports freezing, and one that doesn't have it, which does not support freezing. Then we construct the wrappers and unwrap the result as needed.
This makes it clear that the gm passed to AOTAutogradCache is *not* part of post compile, only the cache key generated from it is.
The whole flow is pretty confusing, but hopefully this gives us better types and static information for understanding what the different codepaths are doing.
Will add a specific AOTAutogradCache to confirm we bypass freezing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141897
Approved by: https://github.com/ezyang, https://github.com/masnesral
I am going to break apart the arguments passed to the constituents
to only pass exactly what is needed, so easy access to the insides
is helpful here.
This also moves two helper functions to output_code.py as well.
Also set _boxed_call at constructor.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141877
Approved by: https://github.com/jamesjwu, https://github.com/jansel
Co-authored-by: James Wu <jjwu@meta.com>
I am going to break apart the arguments passed to the constituents
to only pass exactly what is needed, so easy access to the insides
is helpful here.
This also moves two helper functions to output_code.py as well.
Also set _boxed_call at constructor.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141877
Approved by: https://github.com/jamesjwu, https://github.com/jansel
Co-authored-by: James Wu <jjwu@meta.com>
Based on discussion here: https://github.com/pytorch/pytorch/pull/138731
Introducing ability for subclass implement type convertion to expected_type.
```
def __coerce_same_metadata_as_tangent__(
self, expected_metadata: Any, expected_type: Optional[Type] = None
):
```
Here if `expected_type=None` means `SubclassClass` is expected.
E.g. for `DTensor` we may find tangent `AsyncCollectiveTensor` where we expected `Tensor` - in this case
`expected_type=Tensor` will be called during runtime
Adding implementation to AsyncCollectiveTensor, that just triggers `wait()`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139095
Approved by: https://github.com/bdhirsh
Reason:
Currently we have multiple traversals for tangents in runtime:
- To check that types and structure are identical to what we guessed during tracing time
- Coerce metadata
- Coerce memory_format
- Unwrap_tensor_subclass
All of them are traversing tangents via __tensor_flatten__ calls the tree of Subclasses.
Change:
To do everything in one traversal at runtime (including flattening)
Implementation details:
Add memory_format information inside SubclassCreationMeta, for PlainTensors keep not only (int) of unwrapped_index, but memory_format too.
Preparing memory_format is optional (controlled by with_memory_format=True).
2. Removing unused subclass_utils.create_metadata_for_subclass which does not have any usages inside torch and would require update of the logic.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139068
Approved by: https://github.com/bdhirsh
Original Issue: https://github.com/pytorch/pytorch/issues/134644
We assume trace_tangents to have the same memory_format as inputs, outputs, intermediate during first tracing.
=>
Tracing time:
- Store trace_tangents_memory_formats in metadata
- Coerce tangents to deduced memory_format
Runtime:
- Coerce tangents to tracing memory format from metadata
Subclasses logic:
- Previously coercing tangents logic did not handle nested subclasses case, fixing this.
For Subclasses we deduce memory format for subclass_tensor first, then for each element of subclass:
[subclass_tensor_memory_format, subclass_tensor_elem0_memory_format, ... ]
If subclass element (__tensor_flatten__[0] tensors) is also subclass => on its place we will have a nested list of the same structure.
The recursive traversal of subclass tree is expensive. So we do memory format deduction and coercing at the same time, to keep only one traverse for this. With this approach there is no regression in comparison with previous logic which also does one traversal. (`coerce_tangent_and_suggest_memory_format` method).
Other small change:
Remove duplicated not-related comment.
Testing
```
python test/functorch/test_aotdispatch.py -k test_channels_last_grads_no_force_contiguous
```
Benchmarking:
After change:
```
└─ $ PYTORCH_AOTD_DEBUG_PROFILE=1 python test/functorch/test_aotdispatch.py -k test_benchmark_grads_no_force_contiguous
Benchmark SUBCLASS avg_bwd_duration:4.059906005859375 ms
Benchmark NO_SUBCLASS avg_bwd_duration:3.1563830375671387 ms
```
Before change:
```
BEFORE_CHANGE SUBCLASS 4.1194
```
No siginificant changes in processing time.
(We do single traverse of subclass tree for collecting memory_formats and coercing during tracing.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135225
Approved by: https://github.com/bdhirsh
Cleanup:
1/ We do not need to unwrap_subclasses() in freezing wrapper, as it will be wrapped by AOTD wrappers which inclused SubclassesWrapper
2/ No need to use weakreferences for unwrapped list, dynamo optimizers need to clean unwrapped list along with original params_flat.
Verfified fbcode tests compiled_optimizers
Differential Revision: [D63393651](https://our.internmc.facebook.com/intern/diff/D63393651)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136549
Approved by: https://github.com/bdhirsh
Original issue:
https://github.com/pytorch/ao/issues/890
The problem:
TracingContext.flat_params contain original params, with not desugared Subclasses.
While inductor.freezing API works on aot graphs, which already desugared Subclasses.
flat_params are used only for this logic and storing in them desguared subclasses fixes the issue.
Testing:
```
python test/functorch/test_aotdispatch.py -k test_inductor_freezing_with_subclasses
```
Torch AO original failure:
```
python test/integration/test_integration.py -k test_int8_weight_only_quant_with_freeze
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136265
Approved by: https://github.com/bdhirsh
Summary:
After the previous refactor, we can now call load_with_key directly from AOTAutogradCache to use the remote FXGraphCache.
This does *not* implement a remote AOTAutogradCache. It just allows AOTAutogradCache to work with remote FXGraphCache.
Test Plan: (Meta only tests)
Reviewed By: aorenste
Differential Revision: D62384944
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136173
Approved by: https://github.com/oulgen
Original issue:
https://github.com/pytorch/ao/issues/890
The problem:
TracingContext.flat_params contain original params, with not desugared Subclasses.
While inductor.freezing API works on aot graphs, which already desugared Subclasses.
flat_params are used only for this logic and storing in them desguared subclasses fixes the issue.
Testing:
```
python test/functorch/test_aotdispatch.py -k test_inductor_freezing_with_subclasses
```
Torch AO original failure:
```
python test/integration/test_integration.py -k test_int8_weight_only_quant_with_freeze
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136265
Approved by: https://github.com/bdhirsh
Using `fsdp.set_` for unsharded_param inplace update causes difficult-to-debug errors when enabling Traceable FSDP2 on TorchTune models. In this PR, we change it to use `fsdp.copy_` which fixes the error and also strictly follows eager semantics (i.e. if user explictly stores an alias of the unsharded_param during execution of the user's module code, that alias will get updated correctly when the unsharded_param is copy_ into; whereas if we just swap out unsharded_param storage via set_, that user-saved alias will not get updated, which is not good).
This PR also implements the graph pass to remove the resizes and copy if there is a resize_(full) -> copy_ -> resize_(0) pattern.
------
Test commands:
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_trace_fsdp_copy_`
- `pytest -rA test/dynamo/test_repros.py::ReproTests::test_partitioner_cse_respects_mutation_boundaries`
- `pytest -rA test/dynamo/test_repros.py::ReproTests::test_fsdp_set_input_mutation_applied_when_input_gets_no_gradients`
- `pytest -rA test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mutation_op_matching`
- `python test/inductor/test_distributed_patterns.py DistributedPatternTests.test_fake_distributed_aot_eager`
- `PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=1 PYTORCH_TEST_WITH_CROSSREF=1 python test/functorch/test_aotdispatch.py TestEagerFusionOpInfoCPU.test_aot_autograd_exhaustive_norm_cpu_float32`
- `python test/distributed/test_inductor_collectives.py TestCollectivesInductor.test_backwards`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133730
Approved by: https://github.com/bdhirsh
The issue:
Const propagation checks only if arguments do not have FakeTensor. If argument is Subclass, it will pass this condition.
As a result Const Propogation execution happens without FakeTensorMode and having tensor factories inside Subclass.__torch_dispatch__ results that this Tensor is not Fakified.
Solution:
If we have subclasses arguments, do not count that const propagation is doable
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134855
Approved by: https://github.com/zou3519
This threads through all of the necessary parts into aot autograd from the FXGraphCache changes so that we can run cudagraphs properly on a AOTAutograd cache hit.
Specifics:
- AOTAutograd needs access to the `cudagraphs` boxedbool in order to properly set the backward to not use cudagraphs on a cache hit from the forward.
- We have lots of tests that test this already from the previous PR, so I just added an extra test and made the previous test work with both AOTAutogradCache and FXGraphCache at the same time.
```
TORCH_LOGS=torch._functorch._aot_autograd.autograd_cache,cudagraphs ENABLE_AOT_AUTOGRAD_CACHE=1 TORCHINDUCTOR_FX_GRAPH_CACHE=1 tlp python benchmarks/gpt_fast/benchmark.py --output ~/gpt_fast_benchmark.csv
```
Twice, once on cache miss and once and cache hit.
Here is the perfetto trace for each(FB only link):
**Cache Miss:**
Logs:
```
Loading model Llama-2-7b-chat-hf
Time to load model: 0.66 seconds
I0813 10:53:34.416000 911030 torch/_functorch/_aot_autograd/autograd_cache.py:479] [0/0] AOTAutograd cache miss for key alqchc7zw6ynsxj2bzktcsngu4cajwcb3tmhvwlyqkuinx3zhmey
I0813 10:53:51.395000 911030 torch/_functorch/_aot_autograd/autograd_cache.py:558] [0/0] Writing AOTAutograd cache entry to /tmp/torchinductor_jjwu/aotautograd/alqchc7zw6ynsxj2bzktcsngu4cajwcb3tmhvwlyqkuinx3zhmey/entry
I0813 10:54:17.579000 911030 torch/_functorch/_aot_autograd/autograd_cache.py:479] [1/0] AOTAutograd cache miss for key a3nq2ywjxku342c6ag7rsqkalnxfshlcgve3tb2bigg7a45uz6pt
I0813 10:54:38.636000 911030 torch/_functorch/_aot_autograd/autograd_cache.py:558] [1/0] Writing AOTAutograd cache entry to /tmp/torchinductor_jjwu/aotautograd/a3nq2ywjxku342c6ag7rsqkalnxfshlcgve3tb2bigg7a45uz6pt/entry
I0813 10:54:39.228000 911030 torch/_inductor/cudagraph_trees.py:385] [__cudagraphs] recording cudagraph tree for graph without symints
V0813 10:54:39.939000 911030 torch/_inductor/cudagraph_trees.py:2160] [__cudagraphs] Running warmup of function 0
V0813 10:55:10.615000 911030 torch/_inductor/cudagraph_trees.py:2119] [__cudagraphs] Recording function 0 of graph recording id 0
Compilation time: 101.24 seconds
Average tokens/sec: 147.96 tokens/sec
Average bandwidth achieved: 1955.22 GB/s
Memory used: 14.51 GB
```
Chromium Event(fb only):
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom%2Fchromium_events.json#!/viewer?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom%2Fchromium_events.json&local_cache_key

**Cache Hit:**
Logs:
```
Loading model Llama-2-7b-chat-hf
Time to load model: 0.67 seconds
I0813 10:55:51.821000 944420 torch/_functorch/_aot_autograd/autograd_cache.py:474] [0/0] AOTAutograd cache hit for key alqchc7zw6ynsxj2bzktcsngu4cajwcb3tmhvwlyqkuinx3zhmey
I0813 10:55:55.465000 944420 torch/_functorch/_aot_autograd/autograd_cache.py:474] [1/0] AOTAutograd cache hit for key a3nq2ywjxku342c6ag7rsqkalnxfshlcgve3tb2bigg7a45uz6pt
I0813 10:55:56.030000 944420 torch/_inductor/cudagraph_trees.py:385] [__cudagraphs] recording cudagraph tree for graph without symints
V0813 10:55:56.192000 944420 torch/_inductor/cudagraph_trees.py:2160] [__cudagraphs] Running warmup of function 0
V0813 10:55:56.426000 944420 torch/_inductor/cudagraph_trees.py:2119] [__cudagraphs] Recording function 0 of graph recording id 0
Compilation time: 9.40 seconds
Average tokens/sec: 147.94 tokens/sec
Average bandwidth achieved: 1954.98 GB/s
Memory used: 14.51 GB
```
Chromium Event(fb only):
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom2%2Fchromium_events.json#!/viewer?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom2%2Fchromium_events.json&local_cache_key

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132294
Approved by: https://github.com/eellison
The goal of this PR is to avoid stack overflow when we create extremely long chains of thunks, and then evaluate them (e.g., as occurs if you sum(long list of symint)). The basic idea behind this PR is to only thunkify proxies if they're being created in places where they may or may not be used--crucially, symint operations that occur in user code we are tracing are eagerly placed into the graph, even if they may eventually be dead.
I annotated the PR with explanation of changes.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132421
Approved by: https://github.com/Skylion007, https://github.com/zou3519
ghstack dependencies: #132674, #132675
Rewrite of original PR in https://github.com/pytorch/pytorch/pull/130291
To answer review comments from https://github.com/pytorch/pytorch/pull/130291#pullrequestreview-2166671953:
> At a higher level, do we need this?
Today, this should not change the behavior of anything. But an invariant of "same tensor always corresponds to the same FakeTensor" is nice (from discussion with @bdhirsh).
> Why does this happen?
Today, both dynamo and meta_utils do some recursion when it comes to FakeTensors. So whenever we fakify a subclass, the process would roughly like:
```
wrap_to_fake (subclass)
meta_utils (subclass)
meta_utils (values) -> not cached because we use callback
meta_utils(offsets) -> not cached because we use callback
wrap_to_fake (values)
wrap_to_fake (offsets) -> cached because we rely on top-level meta_utils
```
However, we know that:
- Caching only occurs at the top-level of meta_utils.
- The return value of the top-level wrap_to_fake is returned.
This means that after all of this:
- The fakified subclass holds inner FakeTensors that are NOT part of the cache
- values/offsets are Fakified a second time, and those instances are cached.
Differential Revision: [D60406773](https://our.internmc.facebook.com/intern/diff/D60406773)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131803
Approved by: https://github.com/ezyang
ghstack dependencies: #131916
Add similar semantics for creating a buffer object similar to creating a parameter. This is done by introducing a new Buffer class that can be used for type disambiguation. The underlying functionality of registering a buffer remains the same as the register_buffer method has not been changed. The persistent parameter in the Buffer type is to indicate whether a buffer object should be persistent or not. Other non-test changes have to do with getting the new Buffer type recognized by inductor and dynamo. Remaining changes are test changes to make sure that the Buffer type can be used as a drop in replacement for register_buffer as it just leads to register_buffer being called. The addition of this new functionality still allows for normal tensors to be used as buffers so these changes are intended to be backwards compatible.
Fixes#35735
Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125971
Approved by: https://github.com/albanD, https://github.com/anijain2305, https://github.com/mlazos
Original issue: https://github.com/pytorch/pytorch/issues/114338
Reland of: https://github.com/pytorch/pytorch/pull/128016
Summary from previous PR:
We assume only two possible mutually exclusive scenarios:
Running compiled region for training (Any of inputs has requires_grad)
Produced differentiable outputs should have requires_grad.
Running compiled region for inference (None of inputs has requires_grad)
All outputs do not have requires_grad.
Even if user runs the region under no_grad(), but has an input Tensor with requires_grad - we go Training scenario (1).
With current state that means:
1/ needs_autograd should not check torch.is_grad_enabled(), only that any of inputs requires_grad
2/ if needs_autograd => trace_joint (We are in training scenario 1.) => always run compiled region under with.enable_grad()
Changes in partitioner?
Inference and Training graphs had difference in return container, list/tuple.
The changes in partitioner are done to unify and return always tuple.
As a result - some changes in test_aotdispatch.py for graph contents list -> tuple.
Why was revert?
There was a regression of hf_Reformer model on inference.
```
TORCHINDUCTOR_FX_GRAPH_CACHE=0 python benchmarks/dynamo/torchbench.py --performance --inference --bfloat16 --backend inductor --device cuda --only hf_Reformer --cold-start-latency --use-eval-mode
```
Because one of the compiled graphs contained outputs, which are aliases to the inputs that are nn.Parameter(requires_grad=True).
Even if inference bencharmsk torchbench runs inside with` torch.no_grad()` - alias (specifically for hf_Reformer - expand) ops preserve requires_grad.
As a result we started compiling training graph instead of inference.
Fix for view ops:
If we have outputs, that are aliases to inputs that requires_grad, those outputs requires grad is not a reason to generate training graph.
This is handled in aot_autograd.py, where output_and_mutation_safe are calculated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128890
Approved by: https://github.com/bdhirsh
Brian debugged the difference of the output type for inference and train graph.
Partitioner sometimes return list output type.
After this PR it will always return tuple.
Potentially there can be some new graphs inside tests that will be landed between this PR ci jobs finish and landing.
This could be easily fixed with fast-forward fix on:
```
EXPECTTEST_ACCEPT=1 python test/test.py
```
Adding ciflows/periodic to minimize this probability
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131759
Approved by: https://github.com/ezyang, https://github.com/bdhirsh
https://github.com/pytorch/pytorch/issues/127561
Mutations of inputs in backward are emitted manually, after joint_fn tracing.
With default partitioner logic they will be moved to "forward" graph, as this is operation on forward inputs.
To keep those mutations in backward:
- Introduce "subgraph" node key, that can be specified with contextmanager. When we do manual `copy_` in backward on forward input - we know that his is for backward - set subgraph="backward"
In partitioner:
Introducing optional argument subgraph, to filter out nodes with specified subgraph (node_subgraph) and not to add them to subgraph if node_subgraph is different.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129130
Approved by: https://github.com/Chillee
Reland of: https://github.com/pytorch/pytorch/pull/128016
Summary from previous PR:
We assume only two possible mutually exclusive scenarios:
Running compiled region for training (Any of inputs has requires_grad)
Produced differentiable outputs should have requires_grad.
Running compiled region for inference (None of inputs has requires_grad)
All outputs do not have requires_grad.
Even if user runs the region under no_grad(), but has an input Tensor with requires_grad - we go Training scenario (1).
With current state that means:
1/ needs_autograd should not check torch.is_grad_enabled(), only that any of inputs requires_grad
2/ if needs_autograd => trace_joint (We are in training scenario 1.) => always run compiled region under with.enable_grad()
Changes in partitioner?
Inference and Training graphs had difference in return container, list/tuple.
The changes in partitioner are done to unify and return always tuple.
As a result - some changes in test_aotdispatch.py for graph contents list -> tuple.
Why was revert?
There was a regression of hf_Reformer model on inference.
```
TORCHINDUCTOR_FX_GRAPH_CACHE=0 python benchmarks/dynamo/torchbench.py --performance --inference --bfloat16 --backend inductor --device cuda --only hf_Reformer --cold-start-latency --use-eval-mode
```
Because one of the compiled graphs contained outputs, which are aliases to the inputs that are nn.Parameter(requires_grad=True).
Even if inference bencharmsk torchbench runs inside with` torch.no_grad()` - alias (specifically for hf_Reformer - expand) ops preserve requires_grad.
As a result we started compiling training graph instead of inference.
Fix for view ops:
If we have outputs, that are aliases to inputs that requires_grad, those outputs requires grad is not a reason to generate training graph.
This is handled in aot_autograd.py, where output_and_mutation_safe are calculated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128890
Approved by: https://github.com/bdhirsh
Reland https://github.com/pytorch/pytorch/pull/128709.
When the input predicate is a python constant, we specialize into one of the branches and warn users that torch.cond is not preserving the dynamism. The previous behavior is that we baked in True/False in the cond operator. This can be confusing. In this PR, we change it to be specializing into one of the branches when the inputs are constants.
We additionally change the naming of cond operator to default one without overriding its name. This allows better testing on de-serialized graph.
Test Plan:
The predicate in some existing tests is the result of a shape comparison. When no dynamic shape is involved, the predicate is a python bool. To fix them, we either change the predicate to be some data-dependent tensor or change the test to check cond is specialized as one of the branches,
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130493
Approved by: https://github.com/BoyuanFeng
Reland of: https://github.com/pytorch/pytorch/pull/128016
Summary from previous PR:
We assume only two possible mutually exclusive scenarios:
Running compiled region for training (Any of inputs has requires_grad)
Produced differentiable outputs should have requires_grad.
Running compiled region for inference (None of inputs has requires_grad)
All outputs do not have requires_grad.
Even if user runs the region under no_grad(), but has an input Tensor with requires_grad - we go Training scenario (1).
With current state that means:
1/ needs_autograd should not check torch.is_grad_enabled(), only that any of inputs requires_grad
2/ if needs_autograd => trace_joint (We are in training scenario 1.) => always run compiled region under with.enable_grad()
Changes in partitioner?
Inference and Training graphs had difference in return container, list/tuple.
The changes in partitioner are done to unify and return always tuple.
As a result - some changes in test_aotdispatch.py for graph contents list -> tuple.
Why was revert?
There was a regression of hf_Reformer model on inference.
```
TORCHINDUCTOR_FX_GRAPH_CACHE=0 python benchmarks/dynamo/torchbench.py --performance --inference --bfloat16 --backend inductor --device cuda --only hf_Reformer --cold-start-latency --use-eval-mode
```
Because one of the compiled graphs contained outputs, which are aliases to the inputs that are nn.Parameter(requires_grad=True).
Even if inference bencharmsk torchbench runs inside with` torch.no_grad()` - alias (specifically for hf_Reformer - expand) ops preserve requires_grad.
As a result we started compiling training graph instead of inference.
Fix for view ops:
If we have outputs, that are aliases to inputs that requires_grad, those outputs requires grad is not a reason to generate training graph.
This is handled in aot_autograd.py, where output_and_mutation_safe are calculated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128890
Approved by: https://github.com/bdhirsh
When the input predicate is a python constant, we specialize into one of the branches and warn users that torch.cond is not preserving the dynamism. The previous behavior is that we baked in True/False in the cond operator. This can be confusing. In this PR, we change it to be specializing into one of the branches when the inputs are constants.
We additionally change the naming of cond operator to default one without overriding its name. This allows better testing on de-serialized graph.
Test Plan:
The predicate in some existing tests is the result of a shape comparison. When no dynamic shape is involved, the predicate is a python bool. To fix them, we either change the predicate to be some data-dependent tensor or change the test to check cond is specialized as one of the branches,
Differential Revision: [D59589709](https://our.internmc.facebook.com/intern/diff/D59589709)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128709
Approved by: https://github.com/zou3519
Fixes the example in #118304 for `torch._functorch.aot_autograd.aot_export_module` and `torch.export.export`.
On a high level, the issue is caused by not detecting fake_mode when there's no input.
Change plan:
1) we add a `dynamic_shapes: Union[bool, None] = None` arg to `aot_export_module` and `_aot_export_function`.
2) if the input is not a graph module, then we can only rely on this `dynamic_shapes` input arg.
3) If the input is a graph module, then we can traverse the graph and check.
4) So we check if the input mod is a graph module or just a module, and do 2) or 3) depending on the type.
Fixes#129927
Bug source: dynamo's fake_mode is not detected correctly in `_convert_input_to_fake` in `_traced.py` when there’s no input to the graph). So in ` _strict_export_lower_to_aten_ir`, we create another fake_mode. `dynamo_fake_mode` is not the same as the fake_mode used by dynamo.
Change plan:
check `gm_torch_level` graph's node meta "example_value" for fake mode in addition.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129928
Approved by: https://github.com/angelayi
This PR makes it so that we don't try to serialize FunctionalTensorWrappers. FunctionalTensorWrappers don't pickle well because they have no underlying storage. This should be fixable at a later point, but I might not be the right author for implementing the serialization for it. If there's a way to avoid actually saving the FunctionalTensorWrappers themselves and just saving the ViewMetadata so we can replay it, that would also work.
To do this, we disable view_replay_input_mutations when using AOTAutogradCache, and then only keep the functional tensor in the ViewAndMutationMeta if we need it for view_replay_input_mutations (i.e. the cache is off).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128335
Approved by: https://github.com/bdhirsh
Ops in torch, torch.functional, and torch.nn.functional are cache safe by default (at least, based on my cursory audit of the ops). This fixes a few tests that use these ops with the cache.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128329
Approved by: https://github.com/bdhirsh
When the input predicate is a python constant, we specialize into one of the branches and warn users that torch.cond is not preserving the dynamism. The previous behavior is that we baked in True/False in the cond operator. This can be confusing. In this PR, we change it to be specializing into one of the branches when the inputs are constants.
We additionally change the naming of cond operator to default one without overriding its name. This allows better testing on de-serialized graph.
Test Plan:
The predicate in some existing tests is the result of a shape comparison. When no dynamic shape is involved, the predicate is a python bool. To fix them, we either change the predicate to be some data-dependent tensor or change the test to check cond is specialized as one of the branches,
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128709
Approved by: https://github.com/zou3519
Recently we decided to split export IR into two different IRs (training vs inference). In the inference IR, one major change we decided to introduce was we wanted to keep the composite ops that user specified in the IR. This PR does that by overriding the CompositeImplicitAutograd decomp in export inference path.
Differential Revision: [D58701607](https://our.internmc.facebook.com/intern/diff/D58701607)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128077
Approved by: https://github.com/bdhirsh
This is a copy of Brian's PR https://github.com/pytorch/pytorch/pull/128754, with some changes in the test_distributed_patterns.py unit tests to more closely reflect FSDP2 patterns. Also disabled two tests `test_input_mutation_storage_resize_up_down` and `test_input_mutation_storage_resize_not_supported` in test_aotdispatch.py until we figure out the right behavior for them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129203
Approved by: https://github.com/bdhirsh
This diff introduces AOTAutogradTestWithCache, which runs AOTAutogradTests with both dynamo and AOTAutogradCache.
To do this, for any verify_aot_autograd() calls in the original tests, we run compiled_f an extra time. We also turn on a new strict mode that throws any time a cache is missed due to weird reasons, like BypassAOTAutogradCache or FxGraphCacheMiss.
We use a mocked version of FXGraphCache to decrease the number of variables for these tests. The normal tests in test_aot_autograd_cache.py will still run with FXGraphCache. I might change my mind and unmock these in the future.
In total, 87 of the tests pass naturally. None of the tests fail in non strict cache mode, so the cache never crashes, it just misses more often than we'd like. The remaining 27 tests fail due to relatively simple (though not necessarily easy to fix) reasons. I'll fix the remaining test failures in the next few PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128222
Approved by: https://github.com/bdhirsh
https://github.com/pytorch/pytorch/issues/127572
Allow mutations in backward on forward inputs, if
1/ not mutationg metadata
Enforced at compilation time.
2/ if create_graph=True: mutated input does not require_grad
Enforced in runtime, when create_graph mode can be detected by checking torch.is_grad_enabled()
Adding input_joint_info to track mutations of inputs during joint.
Created a separate field in ViewAndMutationMeta as it is filled only after joint fn tracing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128409
Approved by: https://github.com/bdhirsh
Original issue: https://github.com/pytorch/pytorch/issues/114338
We assume only two possible mutually exclusive scenarios:
1. Running compiled region for training (Any of inputs has requires_grad)
- Produced differentiable outputs should have requires_grad.
2. Running compiled region for inference (None of inputs has requires_grad)
- All outputs do not have requires_grad.
Even if user runs the region under no_grad(), but has an input Tensor with requires_grad - we go Training scenario (1).
With current state that means:
1/ needs_autograd should not check torch.is_grad_enabled(), only that any of inputs requires_grad
2/ if needs_autograd => trace_joint (We are in training scenario 1.) => always run compiled region under with.enable_grad()
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128016
Approved by: https://github.com/bdhirsh
If any inputs are mutated that require grad, even if all the outputs don't require grad, we should still run autograd with a backwards graph. This fixes two tests: test_input_mutation_alias_everything and test_view_detach.
Fixes#128035
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128229
Approved by: https://github.com/aorenste
My goal is to run these tests with the autograd cache on, but first I want them running with dynamo. These tests already caught an interesting issue so I thought it would be helpful to just have them.
Next up I'll have a second subclass of these tests, run them twice, and expect a cache hit the second time from autograd.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128047
Approved by: https://github.com/ezyang
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo make `usort` do more and generate the changes in the PR. Except `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127126
Approved by: https://github.com/kit1980
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo make `usort` do more and generate the changes in the PR. Except `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127126
Approved by: https://github.com/kit1980
ghstack dependencies: #127122, #127123, #127124, #127125
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo make `usort` do more and generate the changes in the PR. Except `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127125
Approved by: https://github.com/Skylion007
ghstack dependencies: #127122, #127123, #127124
Fix: #125387
This PR helps keep track of whether an instantiated `ViewMeta` has symbolic values as
input or not. This is used for checking whether we use the AOTAutograd `ViewMeta`-replay
execution path, e.g. it doesn't support tensors that have `ViewMeta` with symbolic inputs.
In summary, the changes are:
- Add the field `ViewMeta::has_symbolic_inputs` and make it a required constructor
parameter
- Add the field `FunctionalTensorWrapper::is_symbolic_` and the method
`FunctionalTensorWrapper::maybe_mark_symbolic`
- Marks a `FunctionalTensorWrapper` as symbolic iff any of its `ViewMeta` have
symbolic inputs
- Add the plumbing of `FunctionalTensorWrapper::is_symbolic` to the Python API
- Codegen the computation of `ViewMeta::has_symbolic_inputs` for each view operation
- Use the AOTAutograd `ViewMeta`-replay path if:
- `target_functional_tensor` is not `None`; and
- `target_functional_tensor` is not symbolic (instead of using a functorch config)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125876
Approved by: https://github.com/ezyang
More details further down, but first a more high-level description of "how do we functionalize storage resizing"
Today, dynamo converts `param.untyped_storage().resize_(x)` calls that it sees from fsdp into a custom op, `ops.inductor.resize_storage_bytes_(x)`
So given this setup, there are 3 main cases that I think we want to handle:
(1) graph input starts with a real storage size, gets resized down to zero in the graph
(2) graph input starts with 0 storage size, gets resized up in the graph
(3) graph input starts with 0 storage size, gets resized up and used in some compute, then resized back down to 0
For case (1) we need to emit a `resize_storage_bytes_` at the end of the graph, similar to how we emit `copy_()` for data mutations.
For case (2), we need to emit a `resize_storage_bytes_` in the graph, and we **also** need to emit a `copy_()` (the input had its storage resized up, and filled in with data, which is we need to reflect as an input mutation)
For case (3), the net effect is that the input had no data on entry and exit of the function, so we don't need to emit any mutable ops in the end of the graph.
The main thing to call out is that: we need to write a functionalization rule for `resize_storage_byte_`, (`FunctionalTensorWrapper::storage_resize_()`) and this rule actually does very little. We would like to **not** emit any new ops in the graph (like say, a functional resize op). Instead, we should expect / rely on the fact that any resize up will be immediately followed by a `copy_()`/`foreach_copy_`/`out=` op, that will fill in the data of the tensor. So `FunctionalTensor` can temporarily live in a state where its data is invalid, until the `x.copy_(y)` "updates" its data with the new tensor.
So effectively, all that this rule does is:
(1) it stores metadata on the storage, indicating that the tensor was resized, as well as the updated storage size. We need this info in AOTAutograd, so it knows whether to emit a mutable resize_() op in the graph epilogue
(2) There is also a corner case: if we are resizing down to zero, but our tensor had **previously** had a zero size storage, then we update `value_` to point to the original value of the tensor. The reason this seems safe is because if we have a zero storage sized tensor `x`, and we resize it up, use it in some compute, resize it back down to zero, and use it somewhere, we would want the functional version of this code to use the original `x` after the second resize. For FSDP, this is important because we end up saving parameters (graph inputs) for backward, and we want to make sure that the thing we save (and the output to the forward graph) is the original, zero-storage-sized parameter, and not the "version 2" of the parameter after the first resize_()
I think a good order to look at changes in this PR would be:
(1) `test_aotdispatch.py` shows the 3 main cases I focused on as well as the expected functionalized graphs
(2) In `FunctionalStorageImpl.h/cpp`, I had to add a notion of "original base", and "original/curr_size". The first is so I can re-use the zero-size tensor after multiple resizes, and the second is so I can tell in AOTAutograd whether any resizes canceled each other out into a no-op
(3) FunctionalTensorWrapper.h/cpp has the new resize functionalizion rule + some extra utils
(4) `_functorch/_autograd`: the main changes in this folder were around adding the logic at trace-time to detect when we need to put a resize_() in the graph. I also have some assertions to check that any inputs that experience storage resizing will **always be in the graph** and not the opaque epilogue, and I also limited the resize_() mutation case so that you can only ever start with zero storage, or end with zero storage (you can't do e.g. `torch.ones(2).storage().resize_(3)`), and banned it on tensor subclasses
(5) `fake_tensor.py`/`meta_utils.py`: we now need to be able to fakeify tensors with zero storage, so I added a quick version of it in meta_utils.py. This also.. has ramifications for fake tensor caching that I need to fix (include the storage size on the cache key, maybe?)
------------------
This PR subsumes https://github.com/pytorch/pytorch/pull/120971.
This PR is enough to **almost** get a simple ppFSDP forward pass tracing with a functionalized resize_() properly. It also attempts to do the updated version from @jansel, where we don't have any notion of `resize_()` in the graph at all, post functionalization. It would probably be good to test it with @yf225 's FSDP changes, and see how many of the FX passes it allows us to remove. I think that in theory, it should allow us to remove all FX passes that affect the forward graph / partitioner, **except** the one that forces views to be recomputed in the backward (more details below).
There are a few things worth calling out:
(1) failed attempt at functionalizing `aten.copy_()`. I originally wanted to get a version takes these operations:
```
param.storage().resize_(all_gather_size)
param.copy_(all_gather_buffer)
out = aten.matmul(param, param)
```
and functionalizes them into:
```
out = aten.matmul(all_gather_buffer, all_gather_buffer)
```
This would involve getting functionalization to turn `x.copy_(y)` into a giant no-op that just returns `y`. Unfortunately, we can't actually do this in a reasonable way within functionalization (instead, there's a functional `aten.copy` in the graph - see the test case graph expecttest for details). Why? In order for that transformation to be safe, `x` and `y` need to have the same metadata. However, it's possible for `x` and `y` to be subclasses of different types. This is not something we can easily tell from within functionalization, and would be a layering violation. So for now I'm leaving it to downstream code to optimize away the `aten.copy` (this is already the case today, so I think inductor can handle this)
(2) The forward doesn't **actually** run successfully in this PR (see the `assertRaisesRegex` in the test). Why?
The final forward graph looks like this:
```
def forward(self, primals_1, primals_2):
_foreach_copy = torch.ops.aten._foreach_copy.default([primals_1], [primals_2]); primals_2 = None
getitem = _foreach_copy[0]; _foreach_copy = None
mm = torch.ops.aten.mm.default(getitem, getitem); getitem = None
t_1 = torch.ops.aten.t.default(primals_1); primals_1 = None
return [mm, t_1]
```
Where `primals_1` starts out as a secretly-zero-storage-size parameter, and gets resized up and back down within the forward (these are functionalized away).
Importantly, the matmul happy on the result of the `foreach_copy`, **but** the activation that we save for backward (`t_1`) is the result of transposing the **original parameter** (the zero-storage-size param). This is exactly the optimization in fsdp that allows us to have good peak memory usage.
The problem is that the min-cut partitioner decides to save `t_1` for backward. Running this code in eager breaks, because the kernel for `aten.permute(x)` is not happy when `x` has secretly-zero-sized-storage.
The real problem here is that in eager mode the `permute` kernel runs during the backward, after backward hooks have properly resized the saved activation. Here, we are running the transpose in the forward.
One option would be to turn off the checks in our view kernels and allow them to work on zero-storage-sized tensors, which feels pretty bad. Another option is to tweak the partitioner (or use one of Will's FX passes) to force the partitioner to not save views for backward, and allow the views to be recomputed in the backward. This seems kind of silly, but is also probably harmless.
(3) The backward is still broken. To be fair, this issue is pretty separable from "functionalizing storage resize calls", and can be fixed later (either by a real fix to our tracing infra, or via another hacky FX pass). More description of this problem is described at issue (8) of my PR description in https://github.com/pytorch/pytorch/pull/120971
(4) I only added support for "full graph" resizing: basically, the limited case where a param starts with zero storage size, and gets resized up and back down. I think we can add support for the graph break case, but I think we can keep that add-on separate from this PR unless we need it immediately. I also added asserts so we should fail loudly when we hit this case
(5) I have a change to FakeTensor creation when inputs have zero storage size that.. is probably ok. But I also removed FakeTensor caching on view ops, which I probably need to fix before I can land this PR
(6) I added a notion of "original_base" to `FunctionalStorageImpl`. More details are in the comments, but my rational for this was that we basically need it to ensure that autograd saves the **original**, zero-storage-sized param for backward, after resizing up and back down
(7) I had to update our eager kernels for `aten.copy` and `aten._foreach_copy`, to handle the case where the `self` argument has secretly-zero-storage. Inductor can probably generate correct code for this case, but we need these ops to work properly in this situation for the `aot_eager` backend to do the right thing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122434
Approved by: https://github.com/jansel
Fix: #120336
This PR fixes an issue on AOTAutograd, specifically on backends that don't support views
by themselves (e.g. XLA). Previously, AOTAutograd tried to reconstruct output views by
calling `as_strided` on the concrete bases using sizes and strides of the outputs that
aliased them. Since backends such as XLA doesn't support tensor aliasing, the sizes and
strides would be that of a contiguous tensor (not a view tensor). Because of that, calling
`as_strided` would error, since the output tensor would be bigger than its base. Instead,
this PR applies the sequence of `ViewMeta` gathered for each output during the
functionalization phase.
**Note:** we intentionally don't support base tensors that went through metadata mutation,
i.e. in-place view operations.
In summary, this PR:
- Introduces one `FunctionalTensorWrapper` member function alongside its Python APIs
- `apply_view_metas(base)`: applies the `ViewMeta` sequence of the given instance onto
another base
- Introduces a `OutputAliasInfo.functional_tensor` field
- Saves the `FunctionalTensorWrapper` instance collected by the functionalization phase
- Wraps it with a new `FunctionalTensorMetadataEq` class for comparing only the
metadata of the tensors
- Plumbs `OutputAliasInfo.functional_tensor` to `gen_alias_from_base` function
- Applies the `ViewMeta` sequence of the saved `FunctionalTensor` onto `aliased_base_tensor`
- Propagates `OutputAliasInfo.functional_tensor` when updating `fw_metadata`
(this PR description was updated in order to reflect the most recent changes)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121007
Approved by: https://github.com/bdhirsh
Summary:
Currently, torch.export (through AOTAutograd) compiles with a torch.no_grad() wrapper, which affects the presence of `set_grad_enabled` nodes in pre-dispatch export graphs. This changes the wrapper to nullcontext (i.e. enable grad) if `pre_dispatch=True`.
An example that previously failed without `with torch.no_grad()` is below:
```
class Model(torch.nn.Module):
def forward(self, x, y):
with torch.enable_grad():
x = x + y
return x
model = Model()
exported_program = torch.export._trace._export(
model,
(torch.tensor(2), torch.tensor(3)),
dynamic_shapes=None,
pre_dispatch=True,
strict=False
)
```
The pass would inline the add call, but then try to construct a HOO subgraph with no inputs/outputs:
```
def forward(self):
_set_grad_enabled_1 = torch._C._set_grad_enabled(False)
```
Test Plan: Test case checking that nullcontext & no_grad wrappers lead to different export behaviors (regarding set grad subgraphs).
Reviewed By: tugsbayasgalan
Differential Revision: D55777804
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123671
Approved by: https://github.com/tugsbayasgalan
We have some (limited) support for `set_()` input mutations in `torch.compile`, but one restriction today is that we force them to run outside of the graph, in the opaque runtime epilogue.
This is a problem for ppFSDP. Why? The usage pattern of ppFSDP forward graphs look something like this:
```
def forward_fsdp(sacrificial_param, sharded_param, inp):
allgathered_param = allgather(sharded_param)
sacrificial_param.set_(allgathered_param) # hidden in an autograd.Function that we trace
out = matmul(sacrificial_param, inp)
sacrificial_param.untyped_storage().resize_(0)
return out
```
When we functionalize this graph, `sacrificial_param` sees two distinct types of input mutations, that we must preserve: a `set_`, and a `resize_`. Importantly, at runtime the `set_()` must run **before** the `resize_()`. Why? the `set_()` updates the storage of our sacrificial param to the allgather'd data, which allows the call to `sacrificial_param.resize_()` to free the allgathered data later. If we run the two mutations in reverse order, we will never free the allgathered data.
We want to put the `resize_()` mutation op inside of the graph (see next PR, also there's a much longer description in that PR for anyone interested). However, this will require us to put `set_()` in the graph as well, in order for them to run in the correct order.
In order to do this, I had to add some extra restrictions: You are now required to run `set_()` under `no_grad()` if you use it with `torch.compile`, and if you perform any other mutations to the input, those must be under no_grad as well (otherwise, the mutations may mutate the `grad_fn` of the input, making it no longer safe to keep in the graph). These restrictions are hopefully reasonable, since `set_()` doesn't see much usage today (and the original impetus for adding set_() support a few months ago was for fsdp anyway)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122981
Approved by: https://github.com/jansel
ghstack dependencies: #122433, #123646
Fixes https://github.com/pytorch/pytorch/issues/122436.
The problem was that even though we were detecting when mutations happen under no_grad or not, we were recording these mutations when they happened to the FunctionalTensor - we should really just be recording them on the underlying storage.
In particular, what would happen is that we would mutate an alias under no_grad (marking the mutation as under no_grad properly), but if we use the base tensor outside of the no_grad region, we would lazily regenerate the base at this point, propagate the mutation to the base, and at that point mark that the base witnessed a mutation (outside of the no_grad region)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122433
Approved by: https://github.com/ezyang
This addresses 2 issues with stack_trace metadata:
- stack_trace is currently missing from nodes in non-strict export
- in strict mode, stack_trace is populated for placeholder nodes, which may not be well-defined (with multiple uses)
We filter the call stack during tracing for calls from forward() methods, or ops in `torch.__init__.py` (e.g. sym_size_int, sym_constrain_range, etc.) to populate stack_trace. A node-level check is also added to _export_non_strict().
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121034
Approved by: https://github.com/angelayi
When we enter map_autograd, we try to trace through fwd/bwd of a map operator that is wrapped in ctx.functionalize wrapper. This forces us to go through PreDispatch functionalization again (only the python part). As a result, it revealed our previous bug where pre-dispatch mode handling doesn't actually manage the local dispatch key set. (If there is no active mode, we need to turn off PreDispatch key). This PR fixes that. Also I shuffled some APIs around so that there is less code duplication as the setting/unsetting logic is quite hard to get it right.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121444
Approved by: https://github.com/bdhirsh
When we enter map_autograd, we try to trace through fwd/bwd of a map operator that is wrapped in ctx.functionalize wrapper. This forces us to go through PreDispatch functionalization again (only the python part). As a result, it revealed our previous bug where pre-dispatch mode handling doesn't actually manage the local dispatch key set. (If there is no active mode, we need to turn off PreDispatch key). This PR fixes that. Also I shuffled some APIs around so that there is less code duplication as the setting/unsetting logic is quite hard to get it right.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121444
Approved by: https://github.com/bdhirsh
When we enter map_autograd, we try to trace through fwd/bwd of a map operator that is wrapped in ctx.functionalize wrapper. This forces us to go through PreDispatch functionalization again (only the python part). As a result, it revealed our previous bug where pre-dispatch mode handling doesn't actually manage the local dispatch key set. (If there is no active mode, we need to turn off PreDispatch key). This PR fixes that. Also I shuffled some APIs around so that there is less code duplication as the setting/unsetting logic is quite hard to get it right.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121444
Approved by: https://github.com/bdhirsh