This PR was inspired by internal models that were cache missing due to PGO. At a high level, the problem looks as follows:
Run 1, Invocation 1: We do static compile, save some example values in PGO/automatic dynamic
Run 1, Invocation 2: We detect varying inputs, do a dynamic compile, get a dynamic graph, and save it to PGO. Crucially, what we save to PGO is a superset of what is actually dynamic: if we notice an input was varying, we mark it as dynamic in PGO even if that value later gets specialized. When a value gets specialized, we actually remove the symbol from the graph. This results in an interesting conundrum where, although we produce the same isomorphic graph, PGO makes the second run cache miss. Let's see how....
Run 2, Invocation 1: We fetch the PGO, over-mark things as dynamic, get an fx graph, look it up in the cache and... whoops! cache miss! This is because of the aforementioned behavior where the PGO profile causes us to over-allocate symbols. In practice this means we end up saving a graph in the cache with symbols x:s1, y:s3, and on the second attempt we cache miss with x:s1, y:s6, where symbols s3, s4, s5 were all optimistically marked dynamic by PGO and subsequently specialized.
We solve this problem by hashing the source names, which gives a reasonably stable symbol assignment, and using linear probing to prevent catastrophic symbol collisions.
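A minimal sketch of the idea (the helper name and table size here are illustrative, not the actual ShapeEnv code): hash the source name to pick a stable starting index, then linearly probe until the slot is free.
```python
import hashlib
from typing import Dict


def assign_symbol_index(source_name: str, taken: Dict[int, str], table_size: int = 2**12) -> int:
    # Deterministic hash of the source name: the same input source gets the
    # same starting slot on every run, independent of how many other sources
    # were optimistically marked dynamic and later specialized.
    idx = int(hashlib.sha256(source_name.encode()).hexdigest(), 16) % table_size
    # Linear probing: walk forward until we find a free slot (or our own name).
    while idx in taken and taken[idx] != source_name:
        idx = (idx + 1) % table_size
    taken[idx] = source_name
    return idx


taken: Dict[int, str] = {}
print(assign_symbol_index("L['x'].size()[0]", taken))
print(assign_symbol_index("L['y'].size()[0]", taken))
```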
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149665
Approved by: https://github.com/Mingming-Ding, https://github.com/laithsakka
When AOTDispatch prepares the AOT backward graph, it does not know the real tangents the user will supply when they run backward.
So AOTD guesses the tangents. Previously, we guessed that the memory format of the tangents would match the memory format of the corresponding outputs, and if the tangents specified at runtime did not have the memory format we guessed during compilation, AOTD coerced (copied) them to the guessed memory_format.
But as Horace found, there are popular use cases where the outputs of the compiled region are in a specific memory format, e.g. a 4D tensor with dims 1 and 2 transposed:
https://github.com/karpathy/nanoGPT/blob/master/model.py#L57
This PR changes the logic so that AOTD expects tangents with the same "strideness" as the outputs. As a result it avoids coercion in the transposed-dims case (see the illustrative repro below).
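For illustration, here is a small standalone repro of the transposed-output pattern (a sketch of the scenario, not the PR's actual test):
```python
import torch


@torch.compile(backend="aot_eager")
def f(x):
    # 4D output with dims 1 and 2 transposed (the attention-style layout from
    # the nanoGPT link above): the output is non-contiguous, so tangents
    # supplied at backward typically share that strideness.
    return (x + 1).transpose(1, 2)


x = torch.randn(2, 3, 4, 5, requires_grad=True)
out = f(x)
# A tangent that matches the output's strides; with this change AOTD should
# not need to coerce (copy) it to a guessed contiguous memory format.
out.backward(torch.randn_like(out))
```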
Limitations:
We keep guessing memory_format for:
1/ Dynamic shapes (needs more changes)
2/ Tensor subclasses (needs more changes)
Other changes:
test_torchinductor was always creating contiguous tangents via `torch.randn()`; they are changed to `torch.randn_like()` so that the computation is compared with the same strideness
(e.g. for cuda float16, strideness affects numerics for fft ops).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144579
Approved by: https://github.com/bdhirsh
The bug was reported by an internal user.
AOTD classifies outputs that are aliases of graph intermediates into different categories:
...
- output is an alias of an intermediate whose base is already an output
- output is an alias of an intermediate whose base is not an output
If we look at the fn:
```
def fn(x):
    ix = x + 1
    a = ix.transpose(0, 1)
    return a.detach(), a
```
Output 0: a detach view of alias `a`, where `a` is already an output.
Output 1: an alias of intermediate `ix`; an additional output `ix` is then added internally.
Output 0's base is TensorAlias(a) in this case, but it could also be a plain Tensor.
Adding runtime unwrapping solves this problem.
Alternatively we could track the base of a.detach() all the way to ix; in that case the base would always be a Tensor, not a TensorAlias.
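Roughly, the runtime unwrapping amounts to the following sketch, where `TensorAlias` is a stand-in dataclass and `maybe_unwrap_base` is a hypothetical helper, not the actual AOTAutograd code:
```python
from dataclasses import dataclass

import torch


@dataclass
class TensorAlias:
    # Stand-in for AOTAutograd's internal alias wrapper; illustrative only.
    alias: torch.Tensor


def maybe_unwrap_base(base):
    # Runtime unwrapping: the recorded base may arrive either as a plain
    # Tensor or wrapped in TensorAlias, so unwrap defensively before
    # regenerating the aliased output.
    return base.alias if isinstance(base, TensorAlias) else base
```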
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147638
Approved by: https://github.com/bdhirsh
Fix: #141974
This PR makes the `ViewMeta` sequence, present in functional tensors,
serializable with pickle. In order to accomplish that, it makes
`ViewMeta` an abstract class with overridable `forward` and `reverse`
functions. In this context, each operation that used to instantiate
`ViewMeta` should now create a new specialized class that inherits from
`ViewMeta`. Therefore, this PR also uses codegen for creating these
specializations.
In summary, these are the changes this PR introduces:
- `ViewMeta` is turned into an abstract class (see
_FunctionalStorageImpl.cpp_). `forward` and `reverse` are pure virtual
functions that need to be implemented. `to_out_index` should be
implemented by operations that might return more than 1 output.
- New `ViewMeta` specializations for `resize_` and `_unsafe_view` are
created (see _FunctionalizeFallbackKernel.h_).
- New templates _ViewMetaClasses.{cpp,h}_ are created. They hold the
declaration and definition of the `ViewMeta` specializations, which
are automatically generated in the ATen codegen (see _gen.py_).
- New `_functionalization` Python sub-module is created (see
_Module.cpp_). It serves as namespace for the `ViewMeta`
specializations and `InverseReturnMode` enum.
- New template _ViewMetaClassesPythonBinding.cpp_ is created. It holds
the automatically generated Python bindings for the `ViewMeta`
specializations, which are generated in the torch codegen (see
_generate_code.py_).
Note that this PR makes use of codegen at 2 different moments:
- ATen codegen (_gen.py_): generates the `ViewMeta` specialized classes.
- Torch codegen (_generate_code.py_): generates the Python bindings for
them.
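For intuition, here is a pure-Python sketch of the pattern; the real specializations are C++ classes produced by the codegen, so the names below are illustrative only:
```python
from abc import ABC, abstractmethod

import torch


class ViewMetaSketch(ABC):
    # Pure-Python illustration of the abstract forward/reverse pattern only;
    # the real ViewMeta specializations are generated C++ classes.
    @abstractmethod
    def forward(self, base: torch.Tensor) -> torch.Tensor:
        ...

    @abstractmethod
    def reverse(self, base: torch.Tensor, mutated_view: torch.Tensor) -> torch.Tensor:
        ...


class TransposeViewMeta(ViewMetaSketch):
    def __init__(self, dim0: int, dim1: int) -> None:
        # Only plain, picklable state is stored, which is what makes the
        # ViewMeta sequence serializable (the previous design captured this
        # state opaquely in callables).
        self.dim0, self.dim1 = dim0, dim1

    def forward(self, base: torch.Tensor) -> torch.Tensor:
        return base.transpose(self.dim0, self.dim1)

    def reverse(self, base: torch.Tensor, mutated_view: torch.Tensor) -> torch.Tensor:
        return mutated_view.transpose(self.dim0, self.dim1)
```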
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143712
Approved by: https://github.com/bdhirsh
@tugsbayasgalan found a bug for nested subclasses:
E.g. we have
TwoTensor(TwoTensor(t1, t2), t0).
After right_inverse we have:
rebuilt_stack == [(TwoTensor, meta, ["a", "b"]), (TwoTensor, meta, ["a", "b"])]
plain_tensors == [t0, t1, t2]
We first put the plain tensors, and only then the nested TwoTensor.
But when we unflatten:
todo == [t0, t1, t2]
we first create TwoTensor(t1, t2),
put it into todo: [t0, TwoTensor(t1, t2)],
and as a result get
TwoTensor(t0, TwoTensor(t1, t2)),
which swaps the original a and b :)
So the fix needs to be different: we need to preserve the order of elements in the stack for plain tensors and subclasses.
I will think about the fix.
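For illustration, a self-contained sketch of the swap described above; `TwoTensor` here is just a namedtuple stand-in and the rebuild loop is a simplification of the real unflatten code:
```python
from collections import namedtuple

# "TwoTensor" is a namedtuple stand-in for the wrapper subclass and the
# tensors are strings, purely to make the ordering visible.
TwoTensor = namedtuple("TwoTensor", ["a", "b"])

t0, t1, t2 = "t0", "t1", "t2"
# Original value: TwoTensor(TwoTensor(t1, t2), t0)
rebuilt_stack = ["inner", "outer"]   # the inner subclass was pushed first
todo = [t0, t1, t2]                  # plain tensors, in flattening order

# Naive unflattening: rebuild each subclass from the *last* two entries and
# append it back, as described above.
for _ in rebuilt_stack:
    b, a = todo.pop(), todo.pop()
    todo.append(TwoTensor(a, b))

print(todo[-1])
# TwoTensor(a='t0', b=TwoTensor(a='t1', b='t2')) -- the nested TwoTensor ends
# up in attribute "b" instead of "a": the original a and b are swapped.
```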
Fix:
Keep the order of inner_tensor_attr_names according to when they were added to the stack (first plain-tensor attributes, then subclass attributes).
Test:
```
python test/functorch/test_aotdispatch.py -k test_subclass_parameters
```
Differential Revision: [D67032477](https://our.internmc.facebook.com/intern/diff/D67032477)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142481
Approved by: https://github.com/tugsbayasgalan, https://github.com/bdhirsh
This implements a new class, AOTDispatchCompiler, which is just a wrapper around a callable that returns an OutputCode. We can then use it in AOTDispatch to decide whether or not to use the cache: if fw_compiler, bw_compiler and inference_compiler are all AOTDispatchCompilers, then we enable caching.
This type is pretty close to _CompiledFxGraphCallable, except it's not allowed to take any kwargs. I'm not sure how to consolidate the two ideas just yet: unfortunately, there's no way to properly annotate the types to make them related. But a lot of the time, the input to this function will be a partially applied _CompiledFxGraphCallable.
This allows the PR above this one to enable AOTAutogradCache everywhere without increasing instruction count or enabling the cache on unit tests that use aot_eager or other non-inductor compilers.
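A rough sketch of the idea (names and signatures below are illustrative, not the actual class):
```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence

import torch


@dataclass
class AOTDispatchCompilerSketch:
    # Illustrative only: a plain callable from (gm, example_inputs) to an
    # OutputCode-like object, with no kwargs allowed, so its presence can be
    # used as a cacheability signal.
    compiler_fn: Callable[[torch.fx.GraphModule, Sequence[Any]], Any]

    def __call__(self, gm: torch.fx.GraphModule, example_inputs: Sequence[Any]) -> Any:
        return self.compiler_fn(gm, example_inputs)


def should_use_cache(fw_compiler, bw_compiler, inference_compiler) -> bool:
    # Caching is enabled only when all three compilers are wrapped.
    return all(
        isinstance(c, AOTDispatchCompilerSketch)
        for c in (fw_compiler, bw_compiler, inference_compiler)
    )
```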
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142205
Approved by: https://github.com/oulgen, https://github.com/bdhirsh
Allow mutations for subclasses that are non-contiguous.
Changes:
Remove the assert in collect_metadata_analysis.
Main requested testcase:
Compilation of NJT.index_put()
A test is added in test_nestedtensor.py that compiles NJT.index_put().
It decomposes to NJT split/unbind, which needed additional `torch._check`/`torch._check_is_size` calls for NJT.unbind() and guard_size_oblivious() usage in _meta_registrations and _inductor/lowering.py (roughly as sketched below).
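An illustrative, heavily simplified sketch of the kind of check this refers to (not the actual meta registration):
```python
import torch
from torch.fx.experimental.symbolic_shapes import guard_size_oblivious


def split_lengths_from_offsets(offsets: torch.Tensor) -> list:
    # Illustrative only: split lengths derived from offsets are asserted to be
    # valid sizes, and the "is this split empty?" comparison is made
    # size-oblivious so a [0, 0, 0] split does not force specialization.
    lengths = []
    prev = 0
    for off in offsets.tolist():
        length = off - prev
        torch._check_is_size(length)
        if guard_size_oblivious(length == 0):
            pass  # an empty split is still allowed
        lengths.append(length)
        prev = off
    return lengths


print(split_lengths_from_offsets(torch.tensor([0, 0, 0])))  # [0, 0, 0]
```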
Special case:
If a tangent is mutated outside of the graph, it does not participate in the backward graph; autograd in this case sets that tangent to a zeros tensor.
We handle it separately in CompiledFunction.backward: we skip processing for this tangent and broadcast it to the number of expected unwrapped subclass arguments.
Disabling 2 tests for dynamo:
1/ For nested tensor: a symbolic-shapes issue on a nested_tensor index operation that does splits [0, 0, 0]; it fails with "pending unbacked symints". This PR does not add more .tolist()/.item() ops than there were before.
2/ Since we no longer fail with an exception in collect_metadata_analysis, new paths for dynamo started working, and one started failing with something strange: set_ handling in storage_offset (because of the views test) updates the storage "cpu" -> "meta".
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139630
Approved by: https://github.com/bdhirsh
A parametrization cannot be registered for parameters that are not direct children of the module.
We have to iterate through all submodules and register the parametrization at every level (see the sketch below).
The original testcase did not cover nested modules; a submodule is added to the test.
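A minimal sketch of the approach, using a trivial `Identity` parametrization and a hypothetical `register_everywhere` helper (not the actual test code):
```python
import torch.nn as nn
from torch.nn.utils import parametrize


class Identity(nn.Module):
    # Trivial parametrization, used only to demonstrate per-submodule registration.
    def forward(self, x):
        return x


def register_everywhere(model: nn.Module) -> None:
    # register_parametrization only sees a module's *direct* parameters, so we
    # walk every submodule and register at the level that owns each parameter.
    for module in list(model.modules()):
        for name, _ in list(module.named_parameters(recurse=False)):
            parametrize.register_parametrization(module, name, Identity())


model = nn.Sequential(nn.Linear(4, 4), nn.Sequential(nn.Linear(4, 2)))
register_everywhere(model)
print(parametrize.is_parametrized(model[1][0], "weight"))  # True
```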
Testing:
```
python test/functorch/test_aotdispatch.py -k test_subclass_parameters
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142155
Approved by: https://github.com/bdhirsh
This PR fixes the issue where AOTAutograd would produce a guard that used a symbolic value
coming from the base of one of the inputs.
```python
@torch.compile(backend="aot_eager", dynamic=True)
def f(a, b):
    a.add_(1)
    b.add_(1)
    return a


x = torch.ones(10)
f(x[1:], x[1:])
```
In the example above, AOTAutograd functionalizes the mutation by making use of the
`as_strided_scatter` operation (illustrated below), which produces the guard `s0 >= s1 + 1`, where:
- `s0`: corresponds to `x.size()[0]`
- `s1`: corresponds to `a.size()[0]`
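Roughly, the functionalized mutation looks like the following sketch (an illustration of where such a guard can come from, not the exact AOTAutograd output):
```python
import torch

# The in-place update of the view gets rewritten as a scatter of the updated
# slice back into the base; as_strided_scatter's meta checks that the view
# (size s1, storage_offset 1) fits inside the base (size s0), which is where a
# guard like s0 >= s1 + 1 can appear.
x = torch.ones(10)          # base, size s0
a = x[1:]                   # view, size s1, storage_offset 1
updated = a + 1             # functionalized "a.add_(1)"
new_x = torch.as_strided_scatter(
    x, updated, size=a.size(), stride=a.stride(), storage_offset=a.storage_offset()
)
```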
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139554
Approved by: https://github.com/bdhirsh
FXGraphCache supports freezing, but AOTAutogradCache does not. This is due to the fact that when freezing is turned on, instead of using the constants from the graph module that was saved on cache miss, we have to take the constants from the AOTAutograd generated graph module. This PR does two things:
- It bypasses AOTAutogradCache when freezing is turned on. We should have always been doing this.
- It refactors the code to be way more clear about the constants we're using and when we're using them.
Basically, there are two possible sets of constants we can grab from the compiled fx graph.
1. If freezing is turned off, we save the constants directly in CompiledFxGraph.
2. If freezing is turned on, we save the *names* of the constants in CompiledFxGraph, and use the runtime GraphModule's actual constant values: we reconstruct them from the saved names + the new graph module from AOTDispatch.
We implement two different classes for doing just this (sketched below): one that has access to the post-AOTDispatch gm, which supports freezing, and one that doesn't, which does not support freezing. Then we construct the wrappers and unwrap the result as needed.
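A hedged sketch of the two lookup strategies; the class and method names below are illustrative, not the actual cache code:
```python
from typing import Dict, List

import torch


class ConstantsFromGraph:
    # Freezing off: the constant tensors were saved directly in the compiled graph.
    def __init__(self, saved_constants: Dict[str, torch.Tensor]):
        self.saved_constants = saved_constants

    def get(self, name: str) -> torch.Tensor:
        return self.saved_constants[name]


class ConstantsFromRuntimeGm:
    # Freezing on: only the constant *names* were saved; the values are looked
    # up on the GraphModule produced by AOTDispatch at runtime.
    def __init__(self, saved_names: List[str], runtime_gm: torch.fx.GraphModule):
        self.saved_names = saved_names
        self.runtime_gm = runtime_gm

    def get(self, name: str) -> torch.Tensor:
        assert name in self.saved_names
        return getattr(self.runtime_gm, name)
```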
This makes it clear that the gm passed to AOTAutogradCache is *not* part of post compile, only the cache key generated from it is.
The whole flow is pretty confusing, but hopefully this gives us better types and static information for understanding what the different codepaths are doing.
Will add a specific AOTAutogradCache test to confirm we bypass freezing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141897
Approved by: https://github.com/ezyang, https://github.com/masnesral
I am going to break apart the arguments passed to the constituents
to only pass exactly what is needed, so easy access to the insides
is helpful here.
This also moves two helper functions to output_code.py.
Also set _boxed_call in the constructor.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141877
Approved by: https://github.com/jamesjwu, https://github.com/jansel
Co-authored-by: James Wu <jjwu@meta.com>
Based on discussion here: https://github.com/pytorch/pytorch/pull/138731
Introduces the ability for a subclass to implement type conversion to an expected_type:
```
def __coerce_same_metadata_as_tangent__(
    self, expected_metadata: Any, expected_type: Optional[Type] = None
):
```
Here `expected_type=None` means the subclass's own class is expected.
E.g. for `DTensor` we may find an `AsyncCollectiveTensor` tangent where we expected a `Tensor`; in this case
the method is called at runtime with `expected_type=Tensor`.
An implementation is added to AsyncCollectiveTensor that just triggers `wait()` (sketched below).
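Sketch of the idea (a stand-in class, not the in-tree AsyncCollectiveTensor implementation):
```python
from typing import Any, Optional, Type


class AsyncCollectiveTensorSketch:
    # Stand-in for AsyncCollectiveTensor (which is really a Tensor subclass);
    # only the coercion hook is illustrated here.
    def __init__(self, elem, work):
        self.elem = elem  # underlying tensor
        self.work = work  # pending collective handle

    def wait(self):
        self.work.wait()
        return self.elem

    def __coerce_same_metadata_as_tangent__(
        self, expected_metadata: Any, expected_type: Optional[Type] = None
    ):
        if expected_type is not type(self):
            # A plain Tensor (or another type) was expected: resolve the
            # pending collective and hand back the wrapped tensor.
            return self.wait()
        return self
```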
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139095
Approved by: https://github.com/bdhirsh
Reason:
Currently we have multiple runtime traversals over the tangents:
- To check that types and structure are identical to what we guessed during tracing time
- Coerce metadata
- Coerce memory_format
- Unwrap_tensor_subclass
All of them traverse the tangents' tree of subclasses via __tensor_flatten__ calls.
Change:
Do everything (including flattening) in one runtime traversal.
Implementation details:
1. Add memory_format information to SubclassCreationMeta; for PlainTensors keep not only the (int) unwrapped_index but the memory_format too (see the sketch below). Preparing the memory_format is optional (controlled by with_memory_format=True).
2. Remove the unused subclass_utils.create_metadata_for_subclass, which has no usages inside torch and would require updating the logic.
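Illustrative sketch of the extra per-plain-tensor bookkeeping (names are hypothetical, not the actual dataclass):
```python
from dataclasses import dataclass
from typing import Optional

import torch


@dataclass
class PlainTensorMetaSketch:
    # Alongside the flat position of the unwrapped tensor we now also remember
    # the memory format guessed at trace time, so runtime coercion can happen
    # in the same traversal as flattening.
    unwrapped_index: int
    memory_format: Optional[torch.memory_format] = None
```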
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139068
Approved by: https://github.com/bdhirsh
Original Issue: https://github.com/pytorch/pytorch/issues/134644
We assume trace_tangents have the same memory_format as the corresponding inputs, outputs, and intermediates during the first tracing.
=>
Tracing time:
- Store trace_tangents_memory_formats in the metadata
- Coerce tangents to the deduced memory_format
Runtime:
- Coerce tangents to the tracing-time memory format from the metadata
Subclasses logic:
- Previously the tangent-coercion logic did not handle the nested-subclasses case; this is fixed here.
For subclasses we deduce the memory format of the subclass_tensor first, then of each element of the subclass:
[subclass_tensor_memory_format, subclass_tensor_elem0_memory_format, ... ]
If a subclass element (one of the __tensor_flatten__()[0] tensors) is itself a subclass, its place holds a nested list of the same structure (see the sketch below).
The recursive traversal of the subclass tree is expensive, so we do memory format deduction and coercion at the same time, keeping only one traversal (the `coerce_tangent_and_suggest_memory_format` method). With this approach there is no regression compared with the previous logic, which also did one traversal.
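A small sketch of the nested structure, assuming a hypothetical `deduce_memory_formats` helper built on `suggest_memory_format`:
```python
import torch
from torch._prims_common import suggest_memory_format


def deduce_memory_formats(t):
    # For a subclass, record the wrapper's format first, then one entry
    # (possibly a nested list) per inner tensor from __tensor_flatten__;
    # a plain tensor contributes a single memory format.
    fmt = suggest_memory_format(t)
    if hasattr(t, "__tensor_flatten__"):
        attrs, _ = t.__tensor_flatten__()
        return [fmt] + [deduce_memory_formats(getattr(t, a)) for a in attrs]
    return fmt


print(deduce_memory_formats(torch.randn(2, 3, 8, 8).to(memory_format=torch.channels_last)))
# torch.channels_last
```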
Other small change:
Remove a duplicated, unrelated comment.
Testing:
```
python test/functorch/test_aotdispatch.py -k test_channels_last_grads_no_force_contiguous
```
Benchmarking:
After change:
```
└─ $ PYTORCH_AOTD_DEBUG_PROFILE=1 python test/functorch/test_aotdispatch.py -k test_benchmark_grads_no_force_contiguous
Benchmark SUBCLASS avg_bwd_duration:4.059906005859375 ms
Benchmark NO_SUBCLASS avg_bwd_duration:3.1563830375671387 ms
```
Before change:
```
BEFORE_CHANGE SUBCLASS 4.1194
```
No significant change in processing time.
(We do a single traversal of the subclass tree to collect memory_formats and coerce during tracing.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135225
Approved by: https://github.com/bdhirsh
Cleanup:
1/ We do not need to unwrap_subclasses() in the freezing wrapper, as it will be wrapped by the AOTD wrappers, which include SubclassesWrapper.
2/ No need to use weak references for the unwrapped list; dynamo optimizers need to clean the unwrapped list along with the original params_flat.
Verified fbcode compiled_optimizers tests.
Differential Revision: [D63393651](https://our.internmc.facebook.com/intern/diff/D63393651)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136549
Approved by: https://github.com/bdhirsh
Original issue:
https://github.com/pytorch/ao/issues/890
The problem:
TracingContext.flat_params contain the original params, with subclasses not desugared,
while the inductor freezing API works on AOT graphs, where subclasses are already desugared.
flat_params are used only for this logic, and storing desugared subclasses in them fixes the issue (roughly as sketched below).
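Roughly, "desugaring" means recursively replacing subclass params with their inner plain tensors; a hedged sketch with a hypothetical helper, not the actual code:
```python
def desugar(params):
    # Recursively replace tensor subclasses by their inner plain tensors via
    # __tensor_flatten__, which is the form the AOT graph (and now
    # flat_params) works with.
    out = []
    for p in params:
        if hasattr(p, "__tensor_flatten__"):
            attrs, _ = p.__tensor_flatten__()
            out.extend(desugar([getattr(p, a) for a in attrs]))
        else:
            out.append(p)
    return out
```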
Testing:
```
python test/functorch/test_aotdispatch.py -k test_inductor_freezing_with_subclasses
```
Torch AO original failure:
```
python test/integration/test_integration.py -k test_int8_weight_only_quant_with_freeze
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136265
Approved by: https://github.com/bdhirsh
Summary:
After the previous refactor, we can now call load_with_key directly from AOTAutogradCache to use the remote FXGraphCache.
This does *not* implement a remote AOTAutogradCache. It just allows AOTAutogradCache to work with remote FXGraphCache.
Test Plan: (Meta only tests)
Reviewed By: aorenste
Differential Revision: D62384944
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136173
Approved by: https://github.com/oulgen