Add a few missing returns in torch._logging and use ruff to infer the obvious ones.
LazyStr now properly checks the return type of the Callable and the args and kwargs passed to it
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155345
Approved by: https://github.com/ezyang
Support for `call_module` in `copy_paste_aot_backward_graph` added recently with PT2.7
Problem is being observed with HPU backend in example repro due to creating fused modules.
```
import torch
device = 'cpu' #'hpu'
backend = 'inductor' #'hpu_backend'
def fn(t1):
t1 = t1 * 1
t1_grad = torch.ones_like(t1, device=device)
t1.backward(t1_grad, retain_graph=True)
return t1
t1 = torch.ones(1, requires_grad=True, device=device) #.squeeze()
compiled_fn = torch.compile(fn, backend=backend)
result = compiled_fn(t1)
with torch._dynamo.compiled_autograd._enable(torch.compile(backend=backend)):
result_grad = torch.ones_like(result, device=device)
result.backward(result_grad)
print(f'{result_grad=}')
print(f'{t1.grad=}')
```
With this change I'm getting same results like on CPU, however I'm facing below problem when running with scalar (t1 tensor after squeeze):
`torch._dynamo.exc.TorchRuntimeError: Dynamo failed to run FX node with fake tensors: call_function <built-in function getitem>(*(FakeTensor(..., device='hpu:0', size=()), 0), **{}): got IndexError('invalid index of a 0-dim tensor. Use `tensor.item()` in Python or `tensor.item<T>()` in C++ to convert a 0-dim tensor to a number')`
While on CPU there's following warning and None returned:
`repro.py:23: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more informations. (Triggered internally at pytorch/build/aten/src/ATen/core/TensorBody.h:489.)
print(f'{t1.grad=}')
t1.grad=None`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153827
Approved by: https://github.com/xmfan
Adds create_graph support if you don't compile or compile only with torch.compile(backend="eager").
Using a backend that uses AOTDispatch produces a post-dispatch AOT backward, where its double backward will be silently incorrect if the forward trace involved any ops that are not composite implicit.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153222
Approved by: https://github.com/jansel
ghstack dependencies: #153193
Compiled Autograd retraces AOT's bw_module at backward runtime into a larger graph, and today this runs into an issue on warm cache runs because the bw_module is not restored. This PR adds it to the cache, by first stripping it bare from unserializable metadata. I also intentionally differentiate the cached and non-cached versions to avoid accidental attempts of AOT compilation with a restored bw_module (would probably crash).
Note that since the cache entry may be used by runs that use compiled autograd and runs that do not, we need to cache both the lowered backward and the bw_module.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151860
Approved by: https://github.com/jamesjwu
ghstack dependencies: #149707
We found that in compiled_autograd, when defining custom op, the custom op will be dce in the backward graph. We added a side effect condition in the dce function to prevent eliminating custom op with side effect in CA graph.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149181
Approved by: https://github.com/xmfan
we use dummy tensors in our initial trace, so we should never inline. the subclass dispatch might not support the dummy tensor, e.g. DTensor accumulate grad will check that both param and grad are DTensors
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149014
Approved by: https://github.com/jansel
ghstack dependencies: #149064
FIXES https://github.com/pytorch/pytorch/issues/137372
sometimes, the aot bwd is lowered lazily. so the bw_module we saved in CompiledFunction._lazy_backward_info hasn't gone through post grad passes, specifically the view_to_reshape pass. Running that directly will then sometimes error, because the AOT forward has already changed its views to reshapes, and it is reflected in the gradients we see in CA.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149030
Approved by: https://github.com/bdhirsh
ghstack dependencies: #148799
i'm changing CA initial trace to always trace as dynamic, fixes these errors:
```python
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
FAILED [0.2139s] test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_autograd_python_custom_function_inplace - RuntimeError: !has_symbolic_sizes_strides_ INTERNAL ASSERT FAILED at "/home/xmfan/core/a/pytorch/aten/src/ATen/TensorGeometry.h":63, please report a bug to PyTorch.
To execute this test, run the following from the base repo dir:
python test/test_autograd.py TestAutogradWithCompiledAutograd.test_autograd_python_custom_function_inplace
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
FAILED [0.0057s] test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_copy_slices_graph_task_updates - RuntimeError: !has_symbolic_sizes_strides_ INTERNAL ASSERT FAILED at "/home/xmfan/core/a/pytorch/aten/src/ATen/TensorGeometry.h":63, please report a bug to PyTorch.
To execute this test, run the following from the base repo dir:
python test/test_autograd.py TestAutogradWithCompiledAutograd.test_copy_slices_graph_task_updates
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
FAILED [0.9662s] test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_inplace_on_view_weak_grad_fn - RuntimeError: !has_symbolic_sizes_strides_ INTERNAL ASSERT FAILED at "/home/xmfan/core/a/pytorch/aten/src/ATen/TensorGeometry.h":63, please report a bug to PyTorch.
To execute this test, run the following from the base repo dir:
python test/test_autograd.py TestAutogradWithCompiledAutograd.test_inplace_on_view_weak_grad_fn
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
FAILED [0.0077s] test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_leaf_assignment - RuntimeError: !has_symbolic_sizes_strides_ INTERNAL ASSERT FAILED at "/home/xmfan/core/a/pytorch/aten/src/ATen/TensorGeometry.h":63, please report a bug to PyTorch.
To execute this test, run the following from the base repo dir:
python test/test_autograd.py TestAutogradWithCompiledAutograd.test_leaf_assignment
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
FAILED [5.0485s] test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_setitem_mask - RuntimeError: !has_symbolic_sizes_strides_ INTERNAL ASSERT FAILED at "/home/xmfan/core/a/pytorch/aten/src/ATen/TensorGeometry.h":63, please report a bug to PyTorch.
To execute this test, run the following from the base repo dir:
python test/test_autograd.py TestAutogradWithCompiledAutograd.test_setitem_mask
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
FAILED [0.0102s] test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_tensor_hooks_inplace_over_view - RuntimeError: !has_symbolic_sizes_strides_ INTERNAL ASSERT FAILED at "/home/xmfan/core/a/pytorch/aten/src/ATen/TensorGeometry.h":63, please report a bug to PyTorch.
To execute this test, run the following from the base repo dir:
python test/test_autograd.py TestAutogradWithCompiledAutograd.test_tensor_hooks_inplace_over_view
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148799
Approved by: https://github.com/jansel, https://github.com/zou3519
## Before
Previously, CA will always unpack all saved variables stored in the autograd graph before executing it. This meant that we can't capture unpack hooks as part of the CA graph, and they would fire out of order wrt to other backward hooks. For memory saving APIs built on top of saved tensor hooks like non-reentrant checkpointing and offloading, we couldn't achieve any savings because all activations would be recomputed/loaded and active at the same time, resulting in no-op.
## After
We add unpack hooks into the CA graph so that they can be executed progressively. The python hook and hook input themselves are wrapped by non-traceable code, so CA polyfills the wrapping as:
```python
# pseudocode
class SavedVariable:
def unpack(self):
if self.hook:
return self.hook(self.packed_data)
else:
return self.packed_data
# This approach won't directly work when we add support for Forward AD or double-backward.
```
Directly executing the CA graph (without torch.compiling it) under checkpointing/offloading, memory profile is expected to stay the same as when using the eager autograd engine. If AOT backward is in the autograd graph, memory profile is expected to be better than the eager autograd engine, since we can now delay saved activations unpacking into the AOT backward's execution.
All tests pass when running the CA graph directly, the remaining issues are in Dynamo.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147242
Approved by: https://github.com/jansel
## Before
Previously, CA will always unpack all saved variables stored in the autograd graph before executing it. This meant that we can't capture unpack hooks as part of the CA graph, and they would fire out of order wrt to other backward hooks. For memory saving APIs built on top of saved tensor hooks like non-reentrant checkpointing and offloading, we couldn't achieve any savings because all activations would be recomputed/loaded and active at the same time, resulting in no-op.
## After
We add unpack hooks into the CA graph so that they can be executed progressively. The python hook and hook input themselves are wrapped by non-traceable code, so CA polyfills the wrapping as:
```python
# pseudocode
class SavedVariable:
def unpack(self):
if self.hook:
return self.hook(self.packed_data)
else:
return self.packed_data
# This approach won't directly work when we add support for Forward AD or double-backward.
```
Directly executing the CA graph (without torch.compiling it) under checkpointing/offloading, memory profile is expected to stay the same as when using the eager autograd engine. If AOT backward is in the autograd graph, memory profile is expected to be better than the eager autograd engine, since we can now delay saved activations unpacking into the AOT backward's execution.
All tests pass when running the CA graph directly, the remaining issues are in Dynamo.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147242
Approved by: https://github.com/jansel
The previous PRs built up to this. We change compiled autograd's initial
trace to stop baking in metadata.
While tracing, we allocate some weirdly shaped tensors that we can put
proxies on. The initial trace should not be accessing any metadata of
these tensors (it will likely error out if it does because of how weird
the shapes are).
This involved fixing some various sites where we do specialize on the
metadata, like:
- we change CopySlices's apply_with_saved to proxy some calls
into the graph (this change is fairly hard to split out by itself).
- we stop calling InputBuffer::add
- we delete the weird metadata from the graph so that no graph passes
can make use of it.
Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143417
Approved by: https://github.com/jansel, https://github.com/xmfan
ghstack dependencies: #143296, #143304, #143387, #143405
We will always proxy autograd.Function nodes in compiled autograd's
initial graph capture (previously there was an
option to proxy vs trace into the autograd.Function)
We have some requirements for the AOTBackward. Compiled Autograd runs
accumulate grad reordering passes on the AOTBackward graph directly
after the initial graph capture, so we can't just proxy a single node for it.
Instead, we:
- proxy the AOTBackward prologue function into the CA graph
- copy-paste the AOTBackward graph into the CA graph
- trace directly through the epilogue (the traced nodes go into the CA
graph).
Tracing through the epilogue is safe (assuming no Tensor subclasses)
because the only thing the epilogue does is drop some outputs. The
Tensor subclass situation was already broken so this doesn't regress
anything but this PR sets it up to be fixed (in a followup, where we
will proxy "make_subclass" calls into the graph from the epilogue).
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143405
Approved by: https://github.com/jansel, https://github.com/xmfan
ghstack dependencies: #143296, #143304, #143387
We define a functional version of a C++ torch::autograd::Function. The
functional version reconstructs the ctx object and then calls
backward with it.
Some more details:
- we define how to pack/unpack ctx.saved_data into an IValue. It's a
Dict[str, IValue], so it wasn't difficult.
- every call to CppNode::apply_with_saved binds a new function to
Python. This is because we're unable to reuse the a previously bound
function for reasons (the schema may change depending on what the user
actually puts into their Dict[str, IValue]).
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143387
Approved by: https://github.com/jansel, https://github.com/xmfan
ghstack dependencies: #143296, #143304
This PR is on the way to getting compiled autograd's initial capture to
stop specializing on Tensor metadata.
This PR changes compiled autograd's initial capture to proxy an opaque
(w.r.t. Dynamo) function into the graph for all built-in codegen'ed
autograd nodes and validate_outputs.
We changed each codegen'ed apply_with_saved (e.g.
MulBackward0::apply_with_saved) to call into Python to proxy a function
(compiled_autograd.ops.MulBackward0) into the graph. Then, we use the
node's InputMetadata to "guess" at the properties of the output Tensors
to create some new FakeTensors.
Some details:
- MulBackward0::apply_with_saved lives in libtorch_cpu, but needs to be
call to Python via libtorch_python. There is an indirection
(PyCompilerInterface) to do this.
- MulBackward0::apply_with_saved passes a C++ function to Python. To make
our lives easier, every codegen'ed apply_with_saved passes a C++
function with the same signature
`(variable_list, ivalue_list) -> variable_list`.
- We define how to pack arbitrary C++ types into IValue via a helper
IValuePacker struct and codegen functional variants of each builtin
C++ autograd node (e.g. MulBackward0_apply_functional_ivalue).
MulBackward0 before this PR:
https://gist.github.com/zou3519/a80381d5fa38e970e413fcd91b0530de
MulBackward0 after this PR:
https://gist.github.com/zou3519/0c2eee8b3d8d96232b51ef430b53c5b0
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143296
Approved by: https://github.com/jansel
This error started popping up in HUD CA benchmarks:
```python
File "/data/users/xmfan/core/b/pytorch/torch/_dynamo/compiled_autograd.py", line 371, in dce
self.fx_tracer.graph.eliminate_dead_code(is_impure)
File "/data/users/xmfan/core/b/pytorch/torch/fx/graph.py", line 1862, in eliminate_dead_code
self.lint()
File "/data/users/xmfan/core/b/pytorch/torch/fx/graph.py", line 1753, in lint
raise RuntimeError(f"Node redefined name {node.name}!")
RuntimeError: Node redefined name aot0_expand!
```
We added CA initial capture's renaming (https://github.com/pytorch/pytorch/pull/133148) to help debug issues with AOT backward, but it errors out when we have multiple instances of the same AOT backward. This likely only showed up now because of increased hierarchical graph reuse. I fix it by adding a postfix counter to the node name
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144202
Approved by: https://github.com/bdhirsh, https://github.com/jansel