Commit Graph

137 Commits

Brian Hirsh
ba86dfcd83 AOTDispatch subclass (#104483)
This is a PoC of AOTDispatch support. This PR actually works on basic examples, and I'm working on testing it out on `DTensor` (with @wanchaol), `SemiStructuredSparsityTensor` (with @jcaip), and `FP8Tensor`.

There are some design decisions baked into the PR that I think we need consensus on though - so I'm planning on writing a larger design doc to go over the changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104483
Approved by: https://github.com/ezyang
2023-10-10 16:13:16 +00:00
chilli
201d02ef77 stop non-differentiable values from being materialized in aotautograd (#110721)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110721
Approved by: https://github.com/bdhirsh
ghstack dependencies: #110720
2023-10-09 20:18:19 +00:00
chilli
c596db762f refactor aotautograd to set requires_grad on info rather than a separate array (#110720)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110720
Approved by: https://github.com/bdhirsh
2023-10-09 20:18:19 +00:00
Kazuaki Ishizaki
b5f9696d81 Fix typo under torch directory (#110824)
This PR fixes typo `the the` of comments and exception messages in files under `torch` directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110824
Approved by: https://github.com/H-Huang
2023-10-09 19:16:43 +00:00
chilli
6d23193aab Added strict=True to zip in aot_autograd (#110668)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110668
Approved by: https://github.com/ezyang
ghstack dependencies: #110501, #110504, #110591
2023-10-06 05:12:05 +00:00
Brian Hirsh
b457e3f79a Reland attempt 2 of "Update AOTAutograd to use FunctionalTensorMode instead of C++ functionalization (#106406)" (#109906)" (#110079)
The first reland broke internal (failing diff: D49617462).

The major error looks like it's because there's an internal-only higher order op that needs a new functionalization rule. I'm going to land an internal diff for that and confirm tests pass before relanding this PR.

Also confirmed that the issue from https://github.com/pytorch/pytorch/issues/110121 is fixed, and added a test.

This reverts commit 1b90f07f5a.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110079
Approved by: https://github.com/ezyang
2023-10-03 18:50:25 +00:00
PyTorch MergeBot
1b90f07f5a Revert "Reland "Update AOTAutograd to use FunctionalTensorMode instead of C++ functionalization (#106406)" (#109906)"
This reverts commit d0fe8fa5db.

Reverted https://github.com/pytorch/pytorch/pull/109906 on behalf of https://github.com/atalman due to Breaks internal tests ([comment](https://github.com/pytorch/pytorch/pull/109906#issuecomment-1735416852))
2023-09-26 12:10:25 +00:00
Brian Hirsh
d0fe8fa5db Reland "Update AOTAutograd to use FunctionalTensorMode instead of C++ functionalization (#106406)" (#109906)
I'm pretty sure this is fixed, but I'll run inductor and trunk CI. The failing test in trunk previously was that the recently landed selective activation checkpointing (SAC) code assumes it can detect whether or not AOTAutograd is running by checking whether the inputs to SAC are C++ `FunctionalTensorWrapper`s.

The previous land broke some inductor trunk tests.

This reverts commit 629a628cc8.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109906
Approved by: https://github.com/ezyang
2023-09-25 14:53:54 +00:00
PyTorch MergeBot
629a628cc8 Revert "Update AOTAutograd to use FunctionalTensorMode instead of C++ functionalization (#106406)"
This reverts commit b5d6e831a9.

Reverted https://github.com/pytorch/pytorch/pull/106406 on behalf of https://github.com/malfet due to Broke lots of tests on trunk ([comment](https://github.com/pytorch/pytorch/pull/106406#issuecomment-1731524917))
2023-09-22 14:32:34 +00:00
Brian Hirsh
b5d6e831a9 Update AOTAutograd to use FunctionalTensorMode instead of C++ functionalization (#106406)
Now that FunctionalTensor and `FunctionalTensorMode` are lower down in this stack, the changes in this PR are more mechanical: Everywhere in AOTAutograd that I used to use the C++ functionalization API, I now use the python functionalization API.

Note that this doesn't actually cause functionalization to run underneath torch_dispatch. I'm saving that re-ordering for later in the stack.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106406
Approved by: https://github.com/ezyang
ghstack dependencies: #108654, #109662, #109632, #109023
2023-09-22 07:09:04 +00:00
Brian Hirsh
25e81f19f3 reland "python functionalization: add helpers, functionalize_sync and mirror_autograd_meta (#107917)" (#109518)
Reland - the previous PR was reverted by internal with this error:
```
  File "/data/sandcastle/boxes/eden-trunk-hg-fbcode-fbsource/buck-out/v2/gen/fbcode/363cd7e240f5d021/caffe2/torch/fb/trainer/data_modules/tests/__test_dataloader__/test_dataloader#link-tree/torch/__init__.py", line 29, in <module>
    from ._utils_internal import _functionalize_sync as _sync
ImportError: cannot import name '_functionalize_sync' from 'torch._utils_internal'
```

I couldn't figure out why internal was unhappy with the import. One potential reason is that I see a build rule for *another* `_utils_internal.py` in the fb folder here ([link](https://www.internalfb.com/code/fbsource/[30ed85cd88409af98b7490be137aaa5dfd7afd01]/fbcode/caffe2/TARGETS?lines=444))

Rather than burn more time investigating, I confirmed internally that the error goes away if I move the util from `torch/_utils_internal.py` to `torch/_utils.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109518
Approved by: https://github.com/albanD
2023-09-19 13:25:24 +00:00
PyTorch MergeBot
49b18ae546 Revert "python functionalization: add helpers, functionalize_sync and mirror_autograd_meta (#107917)"
This reverts commit 0ad595954a.

Reverted https://github.com/pytorch/pytorch/pull/107917 on behalf of https://github.com/clee2000 due to breaking internal builds D49346637 ([comment](https://github.com/pytorch/pytorch/pull/107917#issuecomment-1722566885))
2023-09-17 20:57:41 +00:00
Brian Hirsh
0ad595954a python functionalization: add helpers, functionalize_sync and mirror_autograd_meta (#107917)
Added two new utils to help with turning python functionalization on in AOTAutograd (next PR):

(1) updated `torch._sync()`. Previously, this API could only handle `torch.Tensor` instances that had a `FunctionalTensorWrapper` TensorImpl. It now needs to handle python `FunctionalTensor`s. In theory I can probably break BC and change this API (since it's private?), but I decided not to do it in this PR stack to minimize the chance of reverts. Instead of updating that API directly (which is in C++), I just added a python shim that first tries to unwrap the python `FunctionalTensor` if there is one, then calls the existing C++ logic.

(2) `mirror_autograd_meta` is now a standalone API that tries to mirror the `requires_grad` and `is_leaf` autograd metadata from one tensor to another. Previously this was hardcoded into `torch._to_functional_tensor()`. But I now need to use it in a more standalone way: later in AOTAutograd when we unwrap and re-wrap a tensor subclasses, we need to manually mirror the autograd metadata from the original to the updated version of the subclass.
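The `mirror_autograd_meta` behavior described in (2) can be sketched in plain Python. This is an illustration of the idea only; the stand-in class below is hypothetical, whereas the real helper operates on `torch.Tensor`s:

```python
from dataclasses import dataclass

# Hypothetical stand-in for a tensor's autograd metadata; the real
# helper works on torch.Tensor, this only illustrates the shape of it.
@dataclass
class FakeAutogradMeta:
    requires_grad: bool = False
    is_leaf: bool = True

def mirror_autograd_meta(src: FakeAutogradMeta, dst: FakeAutogradMeta) -> FakeAutogradMeta:
    # Copy the two pieces of autograd metadata named above from src
    # onto dst, so a re-wrapped subclass looks like the original.
    dst.requires_grad = src.requires_grad
    dst.is_leaf = src.is_leaf
    return dst
```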

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107917
Approved by: https://github.com/ezyang
ghstack dependencies: #106404
2023-09-15 20:19:25 +00:00
soulitzer
3cc5c42a23 Fix aot sequence_nr to reset bwd flag (#107210)
The way the aot autograd sequence_nr tracking works is that we run the aot export logic: the dynamo-captured forward graph is run under an fx.Interpreter, which iterates through the nodes of the forward graph while setting the `current_metadata`.
Since what runs during backward doesn't correspond to any node from the forward graph, we fall back to the global `current_metadata`. And since this global metadata ends up being shared between runs, that leads to weirdness if we forget to reset things, e.g., depending on whether this is the first test run, the printed results will differ.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107210
Approved by: https://github.com/bdhirsh
2023-08-24 16:58:12 +00:00
Brian Hirsh
8c44cfef5e Add some support for detecting false aliasing in AOTAutograd (#106461)
This is a partial fix for https://github.com/pytorch/pytorch/issues/106457. In the examples with the shampoo optimizer that i ran, they were enough to remove the parameter aliasing in shampoo.

I added some new logic for detecting if two inputs have overlapping memory in specific cases: if they're both 2D tensors with stride 1. In that case (the case for shampoo), I try to compute a bunch of contiguous intervals on the two tensors, and check if any of the intervals overlap. In theory this is slow, since if our two tensors are e.g. of size (256, N), we'll need to create 256 intervals to check for overlap on. This seems... probably fine, since I think we do more egregious things in the compile stack to cause slowness. Open to suggestions though!
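The interval check described above can be sketched in plain Python. The tuple encoding of a tensor below is an illustrative assumption, not the actual AOTAutograd representation:

```python
def row_intervals(storage_offset, nrows, ncols, row_stride):
    # Each row of a 2D tensor with unit inner stride is one contiguous
    # interval of `ncols` elements in the underlying storage.
    return [(storage_offset + r * row_stride,
             storage_offset + r * row_stride + ncols) for r in range(nrows)]

def tensors_overlap(a, b):
    # a, b: (storage_offset, nrows, ncols, row_stride) tuples.
    # Quadratic in the number of rows, mirroring the "probably fine"
    # cost noted above.
    for s0, e0 in row_intervals(*a):
        for s1, e1 in row_intervals(*b):
            if s0 < e1 and s1 < e0:
                return True
    return False
```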

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106461
Approved by: https://github.com/albanD
ghstack dependencies: #106460
2023-08-15 17:27:37 +00:00
Brian Hirsh
517ba2add7 AOTAutograd: allow input mutations on inputs that are non-contiguous (#106460)
Fixes https://github.com/pytorch/pytorch/issues/106456

I also had to update the logic in functionalization's resize_() kernel to convey to AOTAutograd that resize_() is a metadata mutation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106460
Approved by: https://github.com/ezyang
2023-08-15 17:27:37 +00:00
Alex Settle
9ba0558d48 Add sequence_nr to aot_autograd to map forward ops to their corresponding backward ops (#103129)
Fixes #102375

`sequence_nr` increments in the forward pass and decrements in the backward pass. Backward ops with the same `sequence_nr` as a forward op represent that op's backward implementation. The long-term goal is to make this information available to the profiler, so users can observe which ops are fused by the Inductor OpenAI Triton kernels.

Added a test for this feature **test/dynamo/test_aot_autograd.py::AotAutogradFallbackTests::test_aot_sequence_nr**.  The test case uses **aot_export_module()** to create a joint fwd/bwd fx graph.  Then it walks all the nodes in fx graph using fx_graph.graph.nodes.   The seq_nr of each node is recorded in node.meta.  During the fwd pass the seq_nr increments and it decrements during the bwd pass.  This allows the user to map forward ops to their corresponding bwd ops which is useful for performance analysis.

Expected output from the test case:

```
SeqNr|OrigAten|SrcFn
0|aten.convolution.default|l__self___conv1
0|aten.add.Tensor|l__self___bn1
1|aten._native_batch_norm_legit_functional.default|l__self___bn1
2|aten.relu.default|l__self___relu1
3|aten.add.Tensor|add
4|aten.view.default|flatten
5|aten.t.default|l__self___fc1
6|aten.unsqueeze.default|l__self___fc1
7|aten.mm.default|l__self___fc1
8|aten.squeeze.dim|l__self___fc1
9|aten.add.Tensor|l__self___fc1
10|aten.sub.Tensor|l__self___loss_fn
11|aten.abs.default|l__self___loss_fn
12|aten.mean.default|l__self___loss_fn
12|aten.ones_like.default|
12|aten.expand.default|
12|aten.div.Scalar|
11|aten.sgn.default|
11|aten.mul.Tensor|
8|aten.unsqueeze.default|
7|aten.t.default|
7|aten.mm.default|
7|aten.t.default|
7|aten.t.default|
7|aten.mm.default|
6|aten.squeeze.dim|
5|aten.t.default|
4|aten.view.default|
2|aten.threshold_backward.default|
1|aten.native_batch_norm_backward.default|
0|aten.convolution_backward.default|
0|aten.add.Tensor|
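Given rows like the expected output above, building the forward-to-backward mapping can be sketched in plain Python (the row encoding below is an illustrative assumption, not the test's actual data structure):

```python
from collections import defaultdict

def map_fwd_to_bwd(rows):
    # rows: (seq_nr, op_name, src_fn) tuples in execution order, as in
    # the expected output above. Forward rows carry a non-empty src_fn;
    # backward rows (emitted while seq_nr decrements) leave it empty.
    fwd, bwd = defaultdict(list), defaultdict(list)
    for seq_nr, op, src_fn in rows:
        (fwd if src_fn else bwd)[seq_nr].append(op)
    # A backward op shares the seq_nr of the forward op it implements.
    return {seq_nr: (fwd[seq_nr], bwd[seq_nr]) for seq_nr in fwd}
```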

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103129
Approved by: https://github.com/soulitzer
2023-08-02 00:52:52 +00:00
Brian Hirsh
4a549dd57a AOTAutograd: correctness fix when tracing custom autograd functions that alias inputs (#102992)
Fixes https://github.com/pytorch/pytorch/issues/102970. See the comment [here](https://github.com/pytorch/pytorch/issues/102970#issuecomment-1577223773) for details.

We normally treat "outputs that alias inputs" specially in AOTAutograd, by replaying the views at runtime, instead of baking them into the graph. For views that are part of custom autograd functions though, we can't do that view-replay, since it will clobber the backwards function that the user specified in their custom autograd.Function.

Right now in this PR, I distinguish between "aliased inputs that are normal views" vs. "aliased inputs that are views that came from an autograd.Function call" by checking the output's `.grad_fn` field, to see if it inherits from our custom CBackward function class. Then I added a new `OutputType` enum value, which we effectively treat the "normal" way (the same way that we treat ordinary, non-aliased outputs). The new enum value is mostly for debugging - so we can print it and know that our graph had custom autograd.Function aliased outputs in it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102992
Approved by: https://github.com/ezyang, https://github.com/zou3519
2023-07-31 19:02:12 +00:00
PyTorch MergeBot
48cd8e29c1 Revert "Slightly improve AOTAutograd logging with ViewAndMutationMeta (#105702)"
This reverts commit cc137342d0.

Reverted https://github.com/pytorch/pytorch/pull/105702 on behalf of https://github.com/PaliC due to breaking internal export tests (relevant details shared with author) ([comment](https://github.com/pytorch/pytorch/pull/105702#issuecomment-1650492077))
2023-07-25 20:17:27 +00:00
Edward Z. Yang
cc137342d0 Slightly improve AOTAutograd logging with ViewAndMutationMeta (#105702)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105702
Approved by: https://github.com/albanD
2023-07-25 00:47:38 +00:00
Jason Ansel
c902b84e0b Compiled autograd (#103822)
This branch:
1) converts the autograd tape into an FX graph
2) caches that conversion using a "shadow" graph
3) compiles and runs the generated FX graph instead of the normal autograd

What works currently:
1) Caching, capture, and initial integration
2) Backwards hooks
3) Inlining AotAutograd generated subgraphs
4) torch.compiling the generated FX graph
5) Auto-detecting dynamic shapes based on changes

Future work
1) Larger scale testing
1) Boxed calling convention, so memory can be freed incrementally
1) Support hooks on SavedTensor
1) Additional testing by running eager autograd tests under compiled_autograd.enable()

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103822
Approved by: https://github.com/ezyang, https://github.com/albanD
2023-07-24 21:12:05 +00:00
Edward Z. Yang
2fa7d11b64 Immediately compile backwards graph in AOTAutograd if dynamic shapes (#104971)
Previously, we made backwards graph compilation lazy to avoid paying
for compilation if the user didn't actually end up using the backwards
graph.  This was useful in the old days when a lot of things in Inductor
didn't work and we could bypass errors this way.

However, this has a bad implication for dynamic shapes: the backwards
graph compilation can trigger extra guards, which are too late to
install in the Dynamo context if we wait until backwards is being run.
So in this PR I move us back to compiling backwards graph immediately
if we capture any SymInts for backwards.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104971
Approved by: https://github.com/Chillee
2023-07-17 15:37:17 +00:00
Edward Z. Yang
10cbc9a063 Enable cuda graphs for dynamic shapes (#105064)
The general idea is to do a separate CUDA graph for each size. Because of cuda graph trees, these graphs will all share the same memory pool, so your memory usage will only be the worst case memory usage of the biggest dynamic size you want. This requires an extra dispatch in the cudagraphified callable. You must pay for a CUDA graph recording for every dynamic size you encounter, but this is MUCH cheaper than running the entire PT2 compile stack, so I expect you to still see benefits.

This was surprisingly easy to do.
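The per-size dispatch described above can be sketched in plain Python. All names here are illustrative, and `record_fn` stands in for a real CUDA graph recording, which is far more involved:

```python
def cudagraphify_by_size(record_fn):
    # Keep one recorded "graph" per concrete input shape. record_fn(shape)
    # stands in for an expensive CUDA graph recording and returns a
    # callable that replays the graph.
    cache = {}
    def dispatch(x):
        key = tuple(x["shape"])   # x is a dict stand-in for a tensor
        if key not in cache:      # pay the recording cost once per size
            cache[key] = record_fn(key)
        return cache[key](x)
    return dispatch
```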

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105064
Approved by: https://github.com/voznesenskym
2023-07-14 16:13:50 +00:00
Edward Z. Yang
979f826015 Read out real strides from compilation result, rather than real args (#105010)
This prefigures a refactor that will move the backward compilation
to entirely ahead of time, so I need to extract these strides some
other way.  Straight from the compiler's mouth will do it.

I can't easily get the information via the return result of `fw_compiler` without changing the calling convention, so instead I smuggle it via TracingContext. TracingContext may be None when we are compiling patterns for the joint graph pattern matcher.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105010
Approved by: https://github.com/shunting314
2023-07-12 11:33:08 +00:00
Edward Z. Yang
9d1f5a35df Move more stuff into ViewAndMutationMeta (#105009)
The one sort of tricksy thing about this PR is that `num_symints_saved_for_bw` is populated later; we compute the metadata with a forward pass, but we only know `num_symints_saved_for_bw` once we run partitioning. This seems... fine.

Also, by pushing the conditionals into the slices, I can remove the top level if...else branch, for a nice simplification.
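The "conditionals into slices" trick can be sketched like this (the names below are illustrative, not the actual AOTAutograd fields): an empty slice degrades gracefully, so no top-level branch is needed:

```python
def split_outputs(outs, num_user_outs, num_symints_saved_for_bw):
    # Instead of `if num_symints_saved_for_bw: ... else: ...`, compute
    # slice bounds directly: when the count is zero the middle slice is
    # simply empty, and the same code path handles both cases.
    user_outs = outs[:num_user_outs]
    symints = outs[num_user_outs:num_user_outs + num_symints_saved_for_bw]
    saved_tensors = outs[num_user_outs + num_symints_saved_for_bw:]
    return user_outs, symints, saved_tensors
```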

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105009
Approved by: https://github.com/albanD
2023-07-12 02:22:44 +00:00
Aaron Gokaslan
2f95a3d0fc [BE]: Apply ruff PERF fixes to torch (#104917)
Applies automated ruff fixes in the PERF modules and enables all automatic ones. I also updated ruff which applied some additional fixes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104917
Approved by: https://github.com/ezyang, https://github.com/albanD
2023-07-11 20:45:21 +00:00
Jason Ansel
dffcf999bd Misc changes from compiled autograd branch (#104316)
This PR pulls out some standalone changes from #103822

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104316
Approved by: https://github.com/ezyang
2023-07-08 20:59:20 +00:00
Yukio Siraichi
d0a72ec5e4 Translation validator for dynamo guards. (#102563)
This PR introduces a translation validator for dynamo guards. In summary, it verifies
whether the guards issued as Python code are sound, w.r.t the initial guards.

The main changes in this PR are:

- Create an FX graph for dynamic shapes
- Translate "the original" guards from the FX graph to Z3
- Check if the guards produced by `produce_guards` are sound w.r.t. the ones from the FX graph
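A toy version of the soundness check in the last step, using brute-force sampling in place of Z3 (all names are illustrative): the produced Python guards are sound if every input they accept is also accepted by the original guards:

```python
def guards_sound(original_guards, produced_guards, sample_shapes):
    # Guards are predicates over a shape tuple. A real validator proves
    # the implication produced => original with Z3; here we only
    # spot-check it over a finite sample of shapes.
    for shape in sample_shapes:
        if all(g(shape) for g in produced_guards) and \
           not all(g(shape) for g in original_guards):
            return False  # produced guards admit a shape they shouldn't
    return True
```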

gh-stack version of the PR #101146.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102563
Approved by: https://github.com/ezyang
2023-06-28 22:32:53 +00:00
Shunting Zhang
98f00f881f [inductor] convert layout of conv weight ahead of time for inference (#103642)
This PR handles inference. Will do similar thing for training later.

Some manual testing results shows this can improve inference perf by 2-3% (absolute improvement not relative one).
- convmixer: 4.285x -> 4.309x
- resnet50: 2.170x -> 2.203x

The PR is built upon freezing. Since without freezing, the weight input for a conv node may not be a parameter directly but be the output of precision converting ops. It's so much easier to implement this PR after freezing.

Commands
```
TORCHINDUCTOR_FREEZING=1 python benchmarks/dynamo/timm_models.py --backend inductor --amp --performance --only convmixer_768_32 --inference
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103642
Approved by: https://github.com/eellison
2023-06-28 17:42:32 +00:00
Brian Hirsh
106d3f0115 [AOTAutograd] make _unsafe_view() logic happen during the runtime epilogue (#103919)
Fixes https://github.com/pytorch/pytorch/issues/103153

AOTAutograd has some logic for handling the case when we have:
* a graph output that is a view of an intermediate
* None of the other aliases of that output escape the graph, so from the perspective of the user + the autograd engine, we can pretend that the output is not a view

However, that logic would inject an `_unsafe_view()` call into the graph at trace time. This isn't wrong, but inductor will just immediately decompose `_unsafe_view()` into `view()`, and so the output tensor will continue to show up as having view metadata w.r.t. autograd.

This PR changes the `unsafe_view()` call to be in the runtime epilogue, instead of being part of the graph (where the compiler might do bad things to it - the compiler also shouldn't have to concern itself with autograd metadata).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103919
Approved by: https://github.com/ezyang
2023-06-21 14:37:35 +00:00
ShuaipengLi
df814484f4 remove dynamo fake param/buf check (#103574)
Fixes #103569
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103574
Approved by: https://github.com/ezyang
2023-06-16 14:19:37 +00:00
Thiago Crepaldi
6f655d4195 Add symbolic tracing support to torch._dynamo.export (fake input + weights) (#100017)
Fixes #95900
Using the following repro as guide:

```python
import torch
import torch._dynamo
from torch._subclasses import fake_tensor
from torch.fx.experimental.symbolic_shapes import ShapeEnv
from torch._dynamo.output_graph import config
class Model(torch.nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.linear = torch.nn.Linear(2, 2)
        self.linear2 = torch.nn.Linear(2, 2)

    def forward(self, x):
        out = self.linear(x)
        out = self.linear2(out)
        return out

fake_mode = fake_tensor.FakeTensorMode(allow_non_fake_inputs=False,
                                       allow_fallback_kernels=True,
                                       shape_env=ShapeEnv(
                                            allow_scalar_outputs=config.capture_scalar_outputs,
                                            allow_dynamic_output_shape_ops=config.capture_dynamic_output_shape_ops,
                                            frame_id=0
                                        ),
)
# Fakefying input/model before calling torch._dynamo.export
with fake_mode:
    fake_x = torch.rand(5, 2, 2)
    model = Model()

# Calling torch._dynamo.export without active fake mode
graph_module, guards = torch._dynamo.export(
    model,
    fake_x,
    aten_graph=True,
    fake_mode=fake_mode
)
graph_module.print_readable()
graph_module.graph.print_tabular()
```

Summary of changes:

    * Plumb fake_mode through the torch._dynamo.export API. When specified, it replaces the creation of a new FakeTensorMode at InstructionTranslator on behalf of OutputGraph
    * Hack FakeTensor.__new__ to prevent a torch.Tensor._make_subclass call for inputs that are already fakefied by the user. This probably needs to be fixed in a nicer way. Any idea?
    * Removed a few asserts that didn't want faked tensors coming from the user script
    * Added torch._subclasses.fake_tensor.FakeTensor to the type list of a few assert checks to allow fake inputs

The changes above allowed symbolic tracing with both static and dynamic shapes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100017
Approved by: https://github.com/ezyang
2023-06-15 21:28:10 +00:00
Elias Ellison
d083d444ff Inductor Freezing (#100652)
Adds a freezing pass that will constant-fold parameters in inductor (gated by `config.freezing`). This occurs post-functionalization in aot autograd, both to capture dispatching and to allow passes to run post-functionalization. A few notes:

- There is an option to discard parameters `config.freezing_discard_parameters` which will take the current eager modules and wrap parameters to a Tensor subclass which will error if used.
- I needed to expose flat_params in aot_autograd in order to discard old references when we constant fold away parameters, like with amp. I also exposed `fw_metadata` to avoid constant folding mutated parameters.
- Caching parameter transformations/constant folding across different inferences nyi
- Checking version_counter of constant folded params nyi

I'm not really sure what the actual naming should be. In jit there was both "freezing", which was platform agnostic, and "optimize for inference", which made device specific optimizations. We're doing the latter here but maybe freezing is a better name.

Differential Revision: [D46244033](https://our.internmc.facebook.com/intern/diff/D46244033)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100652
Approved by: https://github.com/jansel
2023-06-12 20:56:03 +00:00
Edward Z. Yang
54daf870bc CUDA graphs overrides dynamic shapes and forces specialization (#103290)
Previously, cudagraphs and dynamic_shapes were incompatible and enabling
dynamic shapes would forcibly disable cudagraphs.  This new strategy
I think is better.  The idea is essentially that cudagraphs is an
"optimization" that happens to guard on every input.  When cudagraphs
is on, we force everything static, and this automatically does the right
thing because we will force a recompile if sizes change.

This obsoletes https://github.com/pytorch/pytorch/pull/101813

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103290
Approved by: https://github.com/voznesenskym, https://github.com/eellison
2023-06-12 20:26:55 +00:00
Shunting Zhang
daf75c0759 [AOTAutograd] compare with stride hints (#103342)
We previously compared a FakeTensor's strides with the real tensor's strides. This causes dynamic dimensions of the FakeTensor to be specialized to static ints, which may cause a graph specialized for one shape to be used for another shape, which is wrong.

Use stride hints for the comparison instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103342
Approved by: https://github.com/malfet
2023-06-10 06:51:54 +00:00
Shunting Zhang
86c7652503 [inductor] layout optimization for conv (#99773)
A convolution kernel with channels-last inputs runs much faster than a kernel with contiguous inputs. The PR leverages that to optimize tensor layouts so we provide channels-last inputs to convolution. Some care needs to be taken to not convert tensor layouts between contiguous and channels-last back and forth; those extra copies hurt performance quite a bit.

Latest perf number [here](https://hud.pytorch.org/benchmark/compilers?startTime=Wed%2C%2024%20May%202023%2023%3A40%3A37%20GMT&stopTime=Wed%2C%2031%20May%202023%2023%3A40%3A37%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&lBranch=shunting-layout-opt-19&lCommit=baa797fc100688dfb044fbcbdebcfd2591710f78&rBranch=main&rCommit=999bae0f54108ffc5b7cf2524a02a83901554b16)
- TB: 1.64x -> 1.69x
- HF: 1.79x -> 1.78x (random noise)
- TIMM: 1.51x -> 1.65x

Right now we disable layout optimization for dynamic shape since there is perf loss in that combination. Here is a GH issue to followup: https://github.com/pytorch/pytorch/issues/102670
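The layout difference can be illustrated by computing NCHW strides for the two memory formats (a sketch of the arithmetic only; PyTorch computes these internally):

```python
def contiguous_strides(n, c, h, w):
    # NCHW contiguous: the innermost (fastest-moving) dimension is W.
    return (c * h * w, h * w, w, 1)

def channels_last_strides(n, c, h, w):
    # NHWC in memory while the logical shape stays NCHW: the channel
    # dimension becomes the fastest-moving one, which is the layout
    # the conv kernels described above prefer.
    return (h * w * c, 1, w * c, c)
```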

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99773
Approved by: https://github.com/jansel
2023-06-02 21:08:18 +00:00
Brian Hirsh
f22148f0ed aotautograd: fix mutation bug when input is noncontiguous (#102767)
Fixes https://github.com/pytorch/pytorch/issues/93363.

See the comment here for details: https://github.com/pytorch/pytorch/issues/93363#issuecomment-1572647261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102767
Approved by: https://github.com/ezyang
2023-06-02 14:31:06 +00:00
Richard Zou
74f10b9ea5 Switch most Python RAII guard usages to context manager (#102642)
There are some I can't easily switch due to reasons like:
- Dynamo modelling the guard
- BC concerns (for torch.autograd.set_multithreading_enabled)

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102642
Approved by: https://github.com/albanD
2023-06-01 16:28:37 +00:00
Edward Z. Yang
dcf0c5fb6e Use safe_is_leaf to test leafness (#102706)
This fixes one of the problems in https://github.com/pytorch/pytorch/issues/101160#issuecomment-1570376548
but I don't have a test case because the full example is fairly
difficult to minify.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102706
Approved by: https://github.com/bdhirsh
2023-06-01 16:02:12 +00:00
Animesh Jain
9c4fd72b53 [aot_autograd][functional_rng] Change calling convention (#102344)
Key change - seed, offset are the last 2 args in both the fwd and bwd graphs
Reason - The cudagraphs implementation in inductor currently relies on very simple ordering guarantees i.e. first n inputs are static for both fwd and bwd graphs. In the current implementation of functionalization of rng ops, this assumption is broken because the first 2 inputs are seed, offset.
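The calling-convention change can be sketched as a simple argument reorder (the names and the helper below are hypothetical, just to show the shape of the change):

```python
def move_rng_args_last(args, num_rng_args=2):
    # Before the change, seed and offset were the first arguments;
    # moving them to the end keeps the first n inputs static, which
    # the cudagraphs integration assumes.
    rng_args, rest = args[:num_rng_args], args[num_rng_args:]
    return rest + rng_args
```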

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102344
Approved by: https://github.com/eellison
2023-05-26 21:27:20 +00:00
Tugsbayasgalan Manlaibaatar
b5ee34e5f2 Disallow module forward input mutation in aot_export (#101834)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101834
Approved by: https://github.com/bdhirsh
2023-05-20 05:41:01 +00:00
Brian Hirsh
ee40cce475 [AOTAutograd] add export entrypoints (#100587)
The main addition in this PR is two new API's in AOTAutograd.

**APIs**

`aot_export_module`: Given a module, exports it into a functionalized FX graph. Returns an `fx.GraphModule`, `GraphSignature` pair. The `GraphSignature` tells you various information about the graph, such as which graph inputs correspond to module params/buffers (and their fqn's), how to pytree-ify the inputs and the outputs of the graph. If you specify `trace_joint=True`, then you'll get back a joint forward-backward graph, that also returns parameter gradients in addition to the user outputs.

There are several restrictions on this API, detailed in the comments. The most notable one is probably that this API does not handle partial graphs: If you want a backward graph, then you module's forward function is **required** to return a scalar loss that we can backprop through. It also does not support capturing the optimizer step.

I (gratefully) used @SherlockNoMad and @suo's internal version of the `GraphSignature` object for this API, with a few minor changes in order to integrate it into AOTAutograd.

`aot_export_joint_simple`: Given a function, we'll trace it into a joint forward-backward graph and return it. Unlike the above API, the function is **not** required to return a scalar loss. However, this API makes the guarantee that you **do not** need to make any calling convention changes between the original function and the exported one, provided that you do the following:
* If you pass `trace_joint=False`, no work is needed: we'll export a functionalized forward graph with the same set of inputs as the original function
* If you pass `trace_joint=True`, then you will need to manually use the `default_partitioner` or `min_cut_partitioner` from functorch. If you do, and get back a fw and bw graph, then the forward graph will be runnable identically to the original user function.

The main use case for this API is higher order ops: a higher order op like `torch.cond()` can implement its derivative formula by using this API to export a joint graph (for both the true subgraph and the false subgraph), partition it into a fw/bw graph, and run cond on the `true_bw`, `false_bw` subgraphs. cc @zou3519 @Chillee

**Implementation Strategy**

A lot of the work in this PR went in to trying to find a reasonable way to re-use existing AOTAutograd components to expose these API's. Concretely:

* The two new API's are both thin wrappers around `_aot_export_function`: this is a general purpose export API, that just re-uses `create_aot_dispatcher_function`. If we want to add e.g. an export API that includes the optimizer step in the future, we could probably implement it using `_aot_export_function`.
* `aot_export_module` works extra hard to re-use as much of AOTAutograd as possible. For example, when tracing an inference graph, I perform the export under `torch.no_grad()` to make sure we don't accidentally trace out a backwards graph. When exporting a joint graph, I manually `.detach()` all user outputs except the loss, to make sure that we don't accidentally compute gradients for any other user outputs (even if the user forgot to manually detach them).
* A large portion of `aot_export_module` comes from parsing out and creating a `GraphSignature` object. We discussed a few weeks ago that there's potentially a lot more information that we could stuff into this object (see [doc](https://docs.google.com/document/d/1_qzdKew5D1J2Q2GkZ1v5jsczSsIU-Sr0AJiPW7DdGjE/edit?usp=sharing)). For now, I ended up deciding to support the more limited use case of exporting a fwd-bwd full graph, without some of the extra annotations in that doc (for example, if we were to export partial graphs, we would need annotations for saved activations). My thought is that once a more concrete use case comes up that the existing API doesn't satisfy, we can revisit the annotations then.
* I factored out `create_functional_call()` and `create_tree_flattened_fn()` for pytree-flattening and lifting-params-and-buffers, since I also need them in the export code.
* I added an `AOTConfig.is_export` flag. The export API re-uses all of the same code paths as the rest of AOTAutograd, but there are a few points where we need to either exit early (and avoid making a runtime epilogue), or add extra error checking, that is only valuable for export.
* `aot_dispatch_autograd()` now exits early if it's being called in an export context, so it returns the full graph instead of also trying to create an `autograd.Function`. I think we probably want to factor this out, although I figured it would be safer to wait a bit for clarity on how functional RNG works with export.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100587
Approved by: https://github.com/ezyang, https://github.com/SherlockNoMad
2023-05-15 18:08:11 +00:00
Brian Hirsh
bba12a4668 aot_autograd: factor out runtime epilogue from aot_dispatch_base (#100586)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100586
Approved by: https://github.com/ezyang
2023-05-15 18:08:11 +00:00
Brian Hirsh
a6b8e69d36 [aot autograd] fix de-dupping metadata computation bug (#100431)
Fixes https://github.com/pytorch/pytorch/issues/100224

There was a bug in the way that metadata was computed when going from "metadata before-removing-dupes" to "metadata after-removing-dupes". In fact, when I ran the repro with `functorch.config.debug_assert = True`, that immediately signaled to me that the metadata was incorrect.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100431
Approved by: https://github.com/ngimel, https://github.com/albanD
2023-05-12 00:50:35 +00:00
Brian Hirsh
5651006b9d [aot_autograd] proper handling for when outputs are aliased but have identical size/stride/offset metadata (#100430)
Fixes https://github.com/pytorch/pytorch/issues/100348, see the discussion in the issue for details. The problem was that for code like this:
```
def f(x):
    out = ...
    return out, out.detach()
```

The `.detach()` would turn into an `.alias()`, and inductor turns `.alias()` calls into no-ops. Inductor would effectively see that the two graph outputs have the same metadata, and return `out, out`. cc @ngimel. Alternatively, we could have inductor try to detect when it's not ok to make `.alias()` a no-op, but that would probably require some custom logic instead of making `.alias()` a decomposition.
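A minimal, self-contained simulation of that failure mode (plain Python, not inductor's actual code): deduplicating outputs by (size, stride, storage offset) alone collapses `out` and `out.detach()`, because those three fields are identical for both.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Meta:
    size: tuple
    stride: tuple
    storage_offset: int


def naive_dedup(outputs):
    # Buggy behavior sketch: collapse outputs whose (size, stride, offset)
    # metadata match. `out` and `out.detach()` share all three fields, so
    # both output slots end up pointing at the same graph output.
    seen = {}
    return [seen.setdefault(meta, name) for name, meta in outputs]


outs = [
    ("out", Meta((4,), (1,), 0)),
    ("out_detached", Meta((4,), (1,), 0)),
]
assert naive_dedup(outs) == ["out", "out"]
```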

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100430
Approved by: https://github.com/ngimel
2023-05-12 00:50:35 +00:00
Edward Z. Yang
2621fbda7d Turn on anomaly detection for AOTAutograd backward tracing (#101047)
Previously, anomaly detection was only enabled on the inner forward function, and not on the overall joint function that calls backward. I believe this impeded us from printing "this is the forward that triggered the backward" because that printing only happens if anomaly mode is enabled when you run backward(). This PR fixes it.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101047
Approved by: https://github.com/albanD, https://github.com/bdhirsh
2023-05-11 03:38:20 +00:00
Michael Voznesensky
fe3ecfe0cf Add AotAutogradFallbackTests to dynamic suite (#100454)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100454
Approved by: https://github.com/ezyang
2023-05-04 04:28:45 +00:00
Animesh Jain
6bc4651193 [philox_rand] Dynamic shape support (#99290)
Extends the RNG functionalization work to dynamic shapes. An example of the generated graph looks like this:

~~~

[2023-04-24 21:41:37,446] torch._functorch.aot_autograd.__aot_graphs: [INFO] TRACED GRAPH
 ===== Forward graph 1 =====
 <eval_with_key>.7 class <lambda>(torch.nn.Module):
    def forward(self, arg0_1: i64[], arg1_1: i64[], arg2_1: Sym(s0), arg3_1: Sym(s1), arg4_1: f32[s0, s1]):
        # File: /scratch/anijain/work/pytorch/test/test_functionalization_of_rng_ops.py:46, code: a = torch.rand_like(x) * x
        add: i64[] = torch.ops.aten.add.Tensor(arg1_1, 0)
        philox_rand = torch.ops.rngprims.philox_rand.default([arg2_1, arg3_1], arg0_1, add, None, device(type='cuda', index=0), torch.float32);  add = None
        getitem: f32[s0, s1] = philox_rand[0]
        getitem_1: i64[] = philox_rand[1];  philox_rand = None
        add_1: i64[] = torch.ops.aten.add.Tensor(getitem_1, 0);  getitem_1 = None
        mul: f32[s0, s1] = torch.ops.aten.mul.Tensor(getitem, arg4_1);  getitem = arg4_1 = None

        # File: /scratch/anijain/work/pytorch/test/test_functionalization_of_rng_ops.py:47, code: a = torch.rand_like(x) * a
        add_2: i64[] = torch.ops.aten.add.Tensor(arg1_1, add_1)
        philox_rand_1 = torch.ops.rngprims.philox_rand.default([arg2_1, arg3_1], arg0_1, add_2, None, device(type='cuda', index=0), torch.float32);  arg2_1 = arg3_1 = arg0_1 = add_2 = None
        getitem_2: f32[s0, s1] = philox_rand_1[0]
        getitem_3: i64[] = philox_rand_1[1];  philox_rand_1 = None
        add_3: i64[] = torch.ops.aten.add.Tensor(add_1, getitem_3);  add_1 = getitem_3 = None
        mul_1: f32[s0, s1] = torch.ops.aten.mul.Tensor(getitem_2, mul);  getitem_2 = mul = None

        # No stacktrace found for following nodes
        add_4: i64[] = torch.ops.aten.add.Tensor(arg1_1, add_3);  arg1_1 = add_3 = None
        return (mul_1, add_4)

 ~~~

Each rand op is accompanied by its offset calculation op.
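A toy sketch of that offset bookkeeping (hypothetical helper names; the real op is `rngprims.philox_rand` and the real advance depends on the Philox implementation, simplified here to the element count):

```python
import math


def philox_rand(shape, seed, offset):
    """Toy stand-in for a functionalized rand op: returns a placeholder
    value plus the number of RNG states it consumed (here: numel)."""
    numel = math.prod(shape)
    return f"rand(seed={seed}, offset={offset})", numel


def traced_forward(shape, seed, base_offset):
    # Mirrors the traced graph: each rand op reads the running offset,
    # and the total advance is added to the base offset at the end.
    total_advance = 0
    a, adv = philox_rand(shape, seed, base_offset + total_advance)
    total_advance += adv
    b, adv = philox_rand(shape, seed, base_offset + total_advance)
    total_advance += adv
    return (a, b), base_offset + total_advance


(_, _), new_offset = traced_forward((2, 3), seed=0, base_offset=10)
assert new_offset == 10 + 6 + 6  # two rand ops over a 2x3 shape
```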

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99290
Approved by: https://github.com/ezyang, https://github.com/bdhirsh
2023-04-25 22:40:28 +00:00
Aaron Gokaslan
e2a3817dfd [BE] Enable C419 rule for any all shortcircuiting (#99890)
Apparently https://github.com/pytorch/pytorch/pull/78142 made `torch.jit` allow simple generator expressions, which lets us enable rules that replace unnecessary list comprehensions with generators in `any`/`all`. This was originally part of #99280, but I split it off into this PR so that it can be easily reverted should anything break.
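A quick illustration of the short-circuiting the rule exploits, counting predicate calls over `range(100)`:

```python
calls = 0


def pred(x):
    global calls
    calls += 1
    return x > 5


xs = list(range(100))

# List comprehension: the predicate runs for every element before any() starts.
assert any([pred(x) for x in xs])
assert calls == 100

calls = 0
# Generator expression: any() stops at the first True (x == 6, the 7th call).
assert any(pred(x) for x in xs)
assert calls == 7
```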

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99890
Approved by: https://github.com/justinchuby, https://github.com/kit1980, https://github.com/malfet
2023-04-25 15:02:13 +00:00
Edward Z. Yang
ebd47b0eec Propagate mark_dynamic in Dynamo compiled outputs. (#99634)
If you run a user operation you'll lose it, but this will at least
get the easy stuff.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99634
Approved by: https://github.com/voznesenskym
2023-04-23 03:24:28 +00:00