Commit Graph

580 Commits

Author SHA1 Message Date
Xuehai Pan
ba3b05fdf3 [1/N][Easy] fix typo for usort config in pyproject.toml (kown -> known): sort stdlib (#127122)
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo make `usort` do more and generate the changes in the PR. Except `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127122
Approved by: https://github.com/kit1980
2024-05-25 08:25:50 +00:00
Edward Z. Yang
9b91c91e64 Don't add to replacements when guard is suppressed (#126210)
Also improve logging when guards are suppressed

Partially addresses https://github.com/pytorch/pytorch/issues/125641

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126210
Approved by: https://github.com/jbschlosser
2024-05-23 20:10:29 +00:00
Edward Z. Yang
0d17aae242 Teach FakeTensor to fill in item_memo when converting scalar CPU tensor (#126245)
This PR requires a little justification, but let's start with what it does first:

1. When you have a 0d CPU scalar int64/float64 tensor input to a graph, we will preallocate a backed SymInt/SymFloat corresponding to what you would get if you call item() on this tensor. This means you can freely change your input to be a Python int/float or a Tensor with an item() call and end up with exactly the same level of expressivity (specifically, you can guard on the internal SymInt/SymFloat no matter what). By default, the source of the backed SymInt/SymFloat is `L['tensor'].item()`, but if you have promoted a float input into a Tensor, we will cancel out `torch.as_tensor(L['float']).item()` into just `L['float']`.
2. We switch wrap_symfloat to use this, instead of hand crafting the new SymNodeVariable. Everything works out, except that we carefully pass the item() result to tracked fakes (and not the fake Tensor argument)

OK, so why do this at all? There is some marginal benefit where now some item() calls on scalar inputs can be guarded on, but IMO this is a pretty marginal benefit, and if it was the only reason, I wouldn't do this. The real reason for this is that I need to be able to propagate fake tensors through the graphs that are produced by Dynamo, and if I am doing the old custom wrap_symfloat logic, there's no way I can do this, because ordinarily an item() call will cause an unbacked SymInt when I reallocate.

The other obvious way to solve the problem above is to make a HOP alternative that item() that "bakes in" the backed SymInt its supposed to return. But this strategy seems more parsimonious, and it does have the marginal benefit I mentioned above. The main downside is that what I have to do next, is make it so that when I run tensor computation, I also apply the equivalent operations to the SymInt/SymFloat as well. That's next PR.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126245
Approved by: https://github.com/eellison
ghstack dependencies: #126637
2024-05-22 15:25:38 +00:00
Edward Z. Yang
3ae118204e Make propagate_real_tensor more safe (#126281)
Internal xref: https://fb.workplace.com/groups/6829516587176185/posts/7228787720582401/

There a few improvements here, which luckily fix some xfails:

* In generally, it can be unsafe to call operations on Tensors under a `no_dispatch()` mode that is purely trying to disable ambient modes, because this ALSO disables tensor subclass handling. So we test to see if there is a tensor subclass and don't propagate real tensors if that's the case. Another acceptable outcome might be to try to only disable the ambient fake tensor mode, this would help us propagate real tensors through more exotic tensor types, but I'm not going to do it until someone asks for it.
* We're graph breaking for wrapped tensors too late. Pull it up earlier so we do it before we try to muck around with the real tensor.
* I noticed that occasionally when I do `storage.copy_(real_storage)`, the sizes mismatch. Careful code reading suggests that I should just copy in the real data when the tensor was initially allocated, so that's what I do now, eliminating the need for a storage copy.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126281
Approved by: https://github.com/Skylion007
2024-05-15 23:57:02 +00:00
Zhengxu Chen
f9dda37a74 [export] Cover more cases to copy tensor conversions. (#125628)
Summary:
Previously we tried to convert all .to() calls to to_copy in the graph, now some user reports that other methods like .float() is not covered: https://github.com/pytorch/PiPPy/issues/1104#issuecomment-2093352734

I think fundemantally .float() should look similar to .to() in export and this diff tries to expand the coverage of the tensor conversion methods here.

Test Plan: buck run mode/opt caffe2/test:test_export -- -r float_conversion

Differential Revision: D56951634

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125628
Approved by: https://github.com/tugsbayasgalan
2024-05-15 15:50:21 +00:00
Edward Z. Yang
41fabbd93f Fanatically correct real tensor cloning for propagate_real_tensors (#126175)
Internal xref:
https://fb.workplace.com/groups/6829516587176185/posts/7211398545654652/

Previously I did it in a crappy way using clone_input in the callback,
but this results in tensors that don't have quite the same
size/stride/storage offset and there was an internal test case where
not having completely accurate information was causing a downstream
problem in propagation.  So now I make real tensors as similar to their
fake equivalents as much as possible.  Though... I don't bother with
autograd lol.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126175
Approved by: https://github.com/albanD
2024-05-14 23:14:17 +00:00
Pian Pawakapan
2973c9bb88 [export] add SchemaCheckMode testing for pre-dispatch export, OpInfo (#125481)
This adds a new dispatch mode, PreDispatchSchemaCheckMode, built on top of SchemaCheckMode, used for verifying op schemas for functionalization for PreDispatch IR. More specifically, the mode runs in eager mode on concrete inputs, checking if op schemas incorrectly claim to be functional, but are aliasing or mutating. This mode is pushed to the pre-dispatch mode stack, and run before decompositions.

Current testing is hooked up to OpInfo, containing 1103 tests on 600 unique ops. Below is a list of ops that fail testing. One caveat is we only raise errors on ops that claim to be functional - if an op schema admits aliasing or mutating but fails testing for the other, it still may decompose further and become functional.

List of failed ops:
```
aten.atleast_1d.default
aten.atleast_2d.default
aten.atleast_3d.default
aten.cartesian_prod.default
aten.conj_physical.default
aten.alpha_dropout.default
aten.feature_dropout.default
aten.feature_alpha_dropout.default
aten.unsafe_chunk.default
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125481
Approved by: https://github.com/tugsbayasgalan
2024-05-14 21:07:21 +00:00
Edward Z. Yang
f19e07b056 Memoize local_scalar_dense calls, refactor all memos (#125623)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125623
Approved by: https://github.com/eellison
2024-05-11 21:12:35 +00:00
PyTorch MergeBot
c6e5d0d2e6 Revert "Memoize local_scalar_dense calls, refactor all memos (#125623)"
This reverts commit fcbf2b61e6.

Reverted https://github.com/pytorch/pytorch/pull/125623 on behalf of https://github.com/malfet due to Broke ROCM, see https://github.com/pytorch/pytorch/actions/runs/9026074378/job/24804583041 ([comment](https://github.com/pytorch/pytorch/pull/125623#issuecomment-2105444091))
2024-05-11 01:58:39 +00:00
Brian Hirsh
f25c7c9699 functionalize storage resizing, minimal ppFSDP traceable forward (#122434)
More details further down, but first a more high-level description of "how do we functionalize storage resizing"

Today, dynamo converts `param.untyped_storage().resize_(x)` calls that it sees from fsdp into a custom op, `ops.inductor.resize_storage_bytes_(x)`

So given this setup, there are 3 main cases that I think we want to handle:

(1) graph input starts with a real storage size, gets resized down to zero in the graph
(2) graph input starts with 0 storage size, gets resized up in the graph
(3) graph input starts with 0 storage size, gets resized up and used in some compute, then resized back down to 0

For case (1) we need to emit a `resize_storage_bytes_` at the end of the graph, similar to how we emit `copy_()` for data mutations.

For case (2), we need to emit a `resize_storage_bytes_` in the graph, and we **also** need to emit a `copy_()` (the input had its storage resized up, and filled in with data, which is we need to reflect as an input mutation)

For case (3), the net effect is that the input had no data on entry and exit of the function, so we don't need to emit any mutable ops in the end of the graph.

The main thing to call out is that: we need to write a functionalization rule for `resize_storage_byte_`, (`FunctionalTensorWrapper::storage_resize_()`) and this rule actually does very little. We would like to **not** emit any new ops in the graph (like say, a functional resize op). Instead, we should expect / rely on the fact that any resize up will be immediately followed by a `copy_()`/`foreach_copy_`/`out=` op, that will fill in the data of the tensor. So `FunctionalTensor` can temporarily live in a state where its data is invalid, until the `x.copy_(y)` "updates" its data with the new tensor.

So effectively, all that this rule does is:

(1) it stores metadata on the storage, indicating that the tensor was resized, as well as the updated storage size. We need this info in AOTAutograd, so it knows whether to emit a mutable resize_() op in the graph epilogue

(2) There is also a corner case: if we are resizing down to zero, but our tensor had **previously** had a zero size storage, then we update `value_` to point to the original value of the tensor. The reason this seems safe is because if we have a zero storage sized tensor `x`, and we resize it up, use it in some compute, resize it back down to zero, and use it somewhere, we would want the functional version of this code to use the original `x` after the second resize. For FSDP, this is important because we end up saving parameters (graph inputs) for backward, and we want to make sure that the thing we save (and the output to the forward graph) is the original, zero-storage-sized parameter, and not the "version 2" of the parameter after the first resize_()

I think a good order to look at changes in this PR would be:

(1) `test_aotdispatch.py` shows the 3 main cases I focused on as well as the expected functionalized graphs

(2) In `FunctionalStorageImpl.h/cpp`, I had to add a notion of "original base", and "original/curr_size". The first is so I can re-use the zero-size tensor after multiple resizes, and the second is so I can tell in AOTAutograd whether any resizes canceled each other out into a no-op

(3) FunctionalTensorWrapper.h/cpp has the new resize functionalizion rule + some extra utils

(4) `_functorch/_autograd`: the main changes in this folder were around adding the logic at trace-time to detect when we need to put a resize_() in the graph. I also have some assertions to check that any inputs that experience storage resizing will **always be in the graph** and not the opaque epilogue, and I also limited the resize_() mutation case so that you can only ever start with zero storage, or end with zero storage (you can't do e.g. `torch.ones(2).storage().resize_(3)`), and banned it on tensor subclasses

(5) `fake_tensor.py`/`meta_utils.py`: we now need to be able to fakeify tensors with zero storage, so I added a quick version of it in meta_utils.py. This also.. has ramifications for fake tensor caching that I need to fix (include the storage size on the cache key, maybe?)

------------------

This PR subsumes https://github.com/pytorch/pytorch/pull/120971.

This PR is enough to **almost** get a simple ppFSDP forward pass tracing with a functionalized resize_() properly. It also attempts to do the updated version from @jansel, where we don't have any notion of `resize_()` in the graph at all, post functionalization. It would probably be good to test it with @yf225 's FSDP changes, and see how many of the FX passes it allows us to remove. I think that in theory, it should allow us to remove all FX passes that affect the forward graph / partitioner, **except** the one that forces views to be recomputed in the backward (more details below).

There are a few things worth calling out:

(1) failed attempt at functionalizing `aten.copy_()`. I originally wanted to get a version takes these operations:
```
param.storage().resize_(all_gather_size)
param.copy_(all_gather_buffer)
out = aten.matmul(param, param)
```
and functionalizes them into:
```
out = aten.matmul(all_gather_buffer, all_gather_buffer)
```

This would involve getting functionalization to turn `x.copy_(y)` into a giant no-op that just returns `y`. Unfortunately, we can't actually do this in a reasonable way within functionalization (instead, there's a functional `aten.copy` in the graph - see the test case graph expecttest for details). Why? In order for that transformation to be safe, `x` and `y` need to have the same metadata. However, it's possible for `x` and `y` to be subclasses of different types. This is not something we can easily tell from within functionalization, and would be a layering violation. So for now I'm leaving it to downstream code to optimize away the `aten.copy` (this is already the case today, so I think inductor can handle this)

(2) The forward doesn't **actually** run successfully in this PR (see the `assertRaisesRegex` in the test). Why?

The final forward graph looks like this:
```
def forward(self, primals_1, primals_2):
    _foreach_copy = torch.ops.aten._foreach_copy.default([primals_1], [primals_2]);  primals_2 = None
    getitem = _foreach_copy[0];  _foreach_copy = None
    mm = torch.ops.aten.mm.default(getitem, getitem);  getitem = None
    t_1 = torch.ops.aten.t.default(primals_1);  primals_1 = None
    return [mm, t_1]
```

Where `primals_1` starts out as a secretly-zero-storage-size parameter, and gets resized up and back down within the forward (these are functionalized away).

Importantly, the matmul happy on the result of the `foreach_copy`, **but** the activation that we save for backward (`t_1`) is the result of transposing the **original parameter** (the zero-storage-size param). This is exactly the optimization in fsdp that allows us to have good peak memory usage.

The problem is that the min-cut partitioner decides to save `t_1` for backward. Running this code in eager breaks, because the kernel for `aten.permute(x)` is not happy when `x` has secretly-zero-sized-storage.

The real problem here is that in eager mode the `permute` kernel runs during the backward, after backward hooks have properly resized the saved activation. Here, we are running the transpose in the forward.

One option would be to turn off the checks in our view kernels and allow them to work on zero-storage-sized tensors, which feels pretty bad. Another option is to tweak the partitioner (or use one of Will's FX passes) to force the partitioner to not save views for backward, and allow the views to be recomputed in the backward. This seems kind of silly, but is also probably harmless.

(3) The backward is still broken. To be fair, this issue is pretty separable from "functionalizing storage resize calls", and can be fixed later (either by a real fix to our tracing infra, or via another hacky FX pass). More description of this problem is described at issue (8) of my PR description in https://github.com/pytorch/pytorch/pull/120971

(4) I only added support for "full graph" resizing: basically, the limited case where a param starts with zero storage size, and gets resized up and back down. I think we can add support for the graph break case, but I think we can keep that add-on separate from this PR unless we need it immediately. I also added asserts so we should fail loudly when we hit this case

(5) I have a change to FakeTensor creation when inputs have zero storage size that.. is probably ok. But I also removed FakeTensor caching on view ops, which I probably need to fix before I can land this PR

(6) I added a notion of "original_base" to `FunctionalStorageImpl`. More details are in the comments, but my rational for this was that we basically need it to ensure that autograd saves the **original**, zero-storage-sized param for backward, after resizing up and back down

(7) I had to update our eager kernels for `aten.copy` and `aten._foreach_copy`, to handle the case where the `self` argument has secretly-zero-storage. Inductor can probably generate correct code for this case, but we need these ops to work properly in this situation for the `aot_eager` backend to do the right thing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122434
Approved by: https://github.com/jansel
2024-05-10 18:09:10 +00:00
Tugsbayasgalan Manlaibaatar
d7fe3c4123 [RELAND] Switch default behavoir of export IR to be predispatch (#125860)
This PR switches export IR from aot-dispatch to pre-dispatch IR.

**What is pre-dispatch IR and why should you care?**

Currently the default IR returned by torch.export can contain only functional ATen operators after ALL pytorch dispatcher decompositions (for example, CompositeImplicitAutograd) run.

In contrast, pre-dispatch IR refers to an IR that can contain all functional ATen operators (i.e., not just from the core subset), before any decomposition happens, as well as operators that manipulate autograd state. Pre-dispatch IR closely resembles eager PyTorch computation, but is still functional and serializable by torch.export. As a result:

You can train the pre-dispatch IR in eager mode as the IR contains necessary information for the autograd engine to automatically generate a backward graph.
You can write sound graph transformations more easily as the IR is functional.
Since it is an ATen IR, it is still normalized. For example, torch.add has multiple overloads, but aten.add.Tensor is unique in this IR.
If you want to get the core aten IR out of torch.export, you will need to:
```
ep = torch.export.export(M(), inputs)
ep_for_core_aten = ep.run_decompositions()
```

Differential Revision: [D57172986](https://our.internmc.facebook.com/intern/diff/D57172986)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125860
Approved by: https://github.com/zhxchen17
2024-05-10 17:36:53 +00:00
Edward Z. Yang
fcbf2b61e6 Memoize local_scalar_dense calls, refactor all memos (#125623)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125623
Approved by: https://github.com/eellison
2024-05-10 01:52:55 +00:00
Aaron Orenstein
e43d656921 FakeTensor speedup: minor cleanups (#124224)
A few cleanup tasks that didn't really fit into the other diffs in this stack.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124224
Approved by: https://github.com/oulgen
ghstack dependencies: #122911, #124223
2024-05-09 22:11:51 +00:00
Aaron Orenstein
a08be4b705 FakeTensor speedup: Split cache_key so we only validate once (#124223)
When dispatching a fake tensor op we cache the result with `(op, args)` as the key. There are some args (such as one with a dynamic output shape) where the output can't be cached. Instead of validating the args every time we compute the cache only validate the args when we first see a new cache key.

18.3% FakeTensor perf win on the microbenchmark (21.7% cumulative)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124223
Approved by: https://github.com/oulgen, https://github.com/masnesral
ghstack dependencies: #122911
2024-05-09 22:11:51 +00:00
Aaron Orenstein
6a8b1da18d FakeTensor speedup: Delay formatting stack trace until it's actually asked for. (#122911)
When constructing a `FakeTensorMode`, instead of immediately formatting a full stack trace, grab the traceback and only format it on demand.

4.2% FakeTensor perf win on the microbenchmark.

```
import time
import torch
import torch._dynamo as dynamo
from torch._subclasses.fake_tensor import FakeTensorMode
import numpy as np

def toy_example(a, b):
    x = a / (torch.abs(a) + 1)
    b = b * -1
    return x * b

def run_test1():
    dynamo.reset()
    j = [1, 2, 3]
    toy_example(torch.randn(j), torch.randn(j))

def run_test2():
    dynamo.reset()
    j = [1, 2, 3]
    with FakeTensorMode():
        toy_example(torch.randn(j), torch.randn(j))

ITERATIONS = 500000
FORMAT_STRING = "{name:12}: TOT: {tot:10.3f}, AVG: {avg:10.3f}, MIN: {min:10.3f}, P50: {p50:10.3f}, P90: {p90:10.3f}, P99: {p99:10.3f}"

def run_tests(name, step):
    step()
    timings = []
    start = time.time()
    for i in range(ITERATIONS):
        a = time.perf_counter_ns()
        step()
        b = time.perf_counter_ns()
        timings.append(b - a)
    end = time.time()
    fmt = {
        "best": min(timings),
        "tot": end - start,
        "avg": np.average(timings),
        "min": min(timings),
        "p50": np.percentile(timings, 50),
        "p90": np.percentile(timings, 90),
        "p99": np.percentile(timings, 99)
    }
    print(FORMAT_STRING.format(name=name, **fmt))
    return fmt

ts = run_tests("tensor", run_test1)
fs = run_tests("fake tensor", run_test2)
ratio = {k: a / b for ((k, a), (_, b)) in zip(fs.items(), ts.items())}
print(FORMAT_STRING.format(name="ratio", **ratio))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122911
Approved by: https://github.com/oulgen, https://github.com/eellison
2024-05-09 22:11:51 +00:00
Aaron Gokaslan
8cad88e1f3 [BE]: Improve exception typing. Remove NOQAs (#125535)
Improve some exception typing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125535
Approved by: https://github.com/albanD
2024-05-08 14:07:13 +00:00
Angela Yi
38baa02a40 Meta kernel for _pack_padded_sequence (#124794)
Summary: Op implementation: 8cf54929e3/aten/src/ATen/native/PackedSequence.cpp (L34)

Fixes https://fb.workplace.com/groups/pytorch.edge.users/permalink/1499571650913123/

I'm not entirely sure how to test this meta kernel.

Differential Revision: D56478332

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124794
Approved by: https://github.com/ezyang
2024-05-08 03:11:22 +00:00
PyTorch MergeBot
7ffa5558ee Revert "[FX] Update type hints in torch.fx._compatibility.py (#125469)"
This reverts commit 235b4d6ec2.

Reverted https://github.com/pytorch/pytorch/pull/125469 on behalf of https://github.com/izaitsevfb due to breaks pyre in dependent projects (internal: see D56986361) ([comment](https://github.com/pytorch/pytorch/pull/125469#issuecomment-2096665396))
2024-05-06 18:36:43 +00:00
Aaron Gokaslan
1dd42e42c4 [BE]: Try TCH autofixes on torch/ (#125536)
Tries TCH autofixes and see what breaks

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125536
Approved by: https://github.com/ezyang
2024-05-05 23:13:59 +00:00
Xuehai Pan
235b4d6ec2 [FX] Update type hints in torch.fx._compatibility.py (#125469)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125469
Approved by: https://github.com/Skylion007
ghstack dependencies: #125468
2024-05-05 19:30:22 +00:00
Edward Z. Yang
e93b57a570 Add propagate_real_tensors mode for unbacked (#125115)
A common complaint when working with data-dependent code in PyTorch is that it's hard to tell how far you are from the finish line: every time a GuardOnDataDependentSymNode error is hit, you have to somehow fix or workaround it to see the next one.

This PR adds a new mode `torch._functorch.config.fake_tensor_propagate_real_tensors` which modifies fake tensors to also propagate real tensors. This means that when we try to guard on a data-dependent SymNode, we can actually produce a real result. We also produce a warning which you should consult to figure out what the crux points are.

I ran this on vision_maskrcnn. In the baseline (without this mode), the model has 27 graph breaks, resulting in 40 graphs. With this mode on, the model has only 11 graph breaks, resulting in 15 graphs (the remaining graph breaks are due to missing functionality for item() on float tensor and some other Dynamo missing features.) You get a list of things that would have errored like this:

```
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u1) < 2) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u1), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u1), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Ne(Max(1, u1), 1)) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u0) < 2) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u0), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u0), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Ne(Max(1, u0), 1)) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u1) < 2) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u1), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u1), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Ne(Max(1, u1), 1)) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u0) < 2) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u0), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u0), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Ne(Max(1, u0), 1)) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u1) < 2) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u1), 1)) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Ne(Max(1, u1), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u0) < 2) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u0), 1)) -> False
```

Potential later follow ups:

* Improve the warning messages (in particular, should provide user frames)
* GC real tensors when they are no longer needed by tracing. Right now, this will use A LOT of memory, equal to as if your GC was broken and every intermediate tensor was kept live

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125115
Approved by: https://github.com/IvanKobzarev
2024-05-02 15:28:26 +00:00
Brian Hirsh
5173cbe260 fix FakeTensor creation on noncontiguous subclasses (#124399)
Fixes https://github.com/pytorch/pytorch/issues/125287

Fixes https://github.com/pytorch/pytorch/issues/124090, context on the issue

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124399
Approved by: https://github.com/soulitzer
ghstack dependencies: #124398
2024-05-01 21:56:01 +00:00
Avik Chaudhuri
746da8755c switch tests from constrain_as* to torch._check* (#125253)
To fix data-dependent errors we want to recommend that people use `torch._check*` APIs. The `constrain_as*` APIs should be fully subsumed by them, and in the future we should kill them entirely.

Differential Revision: D56774333

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125253
Approved by: https://github.com/ezyang
2024-05-01 21:01:27 +00:00
Edward Z. Yang
c511aed27f [Meta Tensor] fix meta inplace set storage (#123880)
Fixes #123879

Co-authored-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123880
Approved by: https://github.com/ezyang
2024-05-01 06:53:49 +00:00
PyTorch MergeBot
ae13c7e593 Revert "[Meta Tensor] fix meta inplace set storage (#123880)"
This reverts commit cccae93551.

Reverted https://github.com/pytorch/pytorch/pull/123880 on behalf of https://github.com/izaitsevfb due to breaks cpu_inductor_torchbench (detectron2_fasterrcnn) ([comment](https://github.com/pytorch/pytorch/pull/123880#issuecomment-2083366385))
2024-04-29 18:19:42 +00:00
Yuanjing Shi
cccae93551 [Meta Tensor] fix meta inplace set storage (#123880)
Fixes #123879

Co-authored-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123880
Approved by: https://github.com/ezyang
2024-04-28 17:01:12 +00:00
Aaron Orenstein
609c958281 Fix mypy issues in fake_tensor.py (#124428)
fake_tensor.py had mypy error ignored. That seems less than desirable.

Also added SafePyObjectT<T> which is a tagged wrapper around a SafePyObject but provides static type checking (with no other guarantees).

Used `SafePyObjectT<TorchDispatchModeKey>` on some of the TorchDispatchModeTLS API to ensure that we don't accidentally inject a different type than expected into the stack.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124428
Approved by: https://github.com/malfet
2024-04-26 15:35:53 +00:00
PyTorch MergeBot
f131c2c199 Revert "Fix mypy issues in fake_tensor.py (#124428)"
This reverts commit 25c0d3f3f0.

Reverted https://github.com/pytorch/pytorch/pull/124428 on behalf of https://github.com/jeanschmidt due to Unfortunately, I needed to revert #123735 and this one depends on it. So please check if there are no merge conflicts or breakages and feel free to merge this PR again ([comment](https://github.com/pytorch/pytorch/pull/124428#issuecomment-2078699836))
2024-04-26 06:15:17 +00:00
Edward Z. Yang
29b22fbef9 Typo fix: s/nonzero/unique/ (#124935)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124935
Approved by: https://github.com/albanD, https://github.com/Skylion007
2024-04-25 17:22:50 +00:00
Aaron Orenstein
25c0d3f3f0 Fix mypy issues in fake_tensor.py (#124428)
fake_tensor.py had mypy error ignored. That seems less than desirable.

Also added SafePyObjectT<T> which is a tagged wrapper around a SafePyObject but provides static type checking (with no other guarantees).

Used `SafePyObjectT<TorchDispatchModeKey>` on some of the TorchDispatchModeTLS API to ensure that we don't accidentally inject a different type than expected into the stack.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124428
Approved by: https://github.com/malfet
2024-04-25 14:07:53 +00:00
Edward Z. Yang
13ab24f192 Reimplement unbacked symbol bindings in Inductor (#124394)
This PR has a lot of "draw the rest of the fucking owl" energy. Here's how to break it down.

1. **torch/_inductor/graph.py** - We start by tightening unbacked symbol invariants. Specifically, as we lower FX nodes, we check whether or not every unbacked_binding recorded on the FX node meta, actually ends up getting bound (according to get_unbacked_symbol_defs) in all the buffers generated by the lowering. Hopefully this invariant is self evident. This leads to a lot of failures.
2. **torch/_inductor/ir.py** - Problem 1: There is softness in how Inductor computes defs of unbacked symbols in IR node. Previously, we tried to infer it by looking at the output sizes/strides/etc and see if new unbacked symbols popped up that we hadn't seen in the inputs. I don't know exactly what was buggy about the old code, but sometimes we would fail to notice an unbacked symbol had been bound, or rebind an unbacked symbol multiple times. Fortunately, thanks to the earlier PRs in our stack, we now have a nice list of unbacked symbol bindings from FX, so we now just store it directly on ExternKernel and use it directly to report defs. This has to be done twice: once for FallbackKernel (e.g., nonzero) and once for DynamicScalar (e.g., item) (see also **torch/_inductor/lowering.py**, **torch/_inductor/codegen/wrapper.py** and  **torch/_inductor/codegen/cpp_wrapper_cpu.py** for the lowering and codegen changes for item)
   * **process_kernel** - Sidequest! It turns out that Inductor lowering can reallocate unbacked symbols. This happens specifically when we repropagate fake tensors through the operator in `process_kernel`. This repropagation process is necessary because Inductor may have changed the strides of input tensors, and it must now recompute the strides so that it can continue to appropriately plan the rest of the lowering process. This is fine: we just make sure we do the rebind unbacked + compute_unbacked_bindings dance we've been doing previously in the PR stack. But instead of putting unbacked_bindings on a new FX node, they go straight into our unbacked_bindings on the Inductor IR node.
    * **codegen_unbacked_symbol_defs** - Sidequest! FallbackKernel lowering is done in two steps. First, you emit the FallbackKernel buffer. Then, you emit MultiOutput buffers which actually give access to the individual outputs of FallbackKernel, which may have been multi-output. There is a design decision here: does the FallbackKernel bind the unbacked symbols, or the MultiOutput buffer? Historically, we put the binding on MultiOutput buffer, because it's more convenient: the FallbackKernel buffer is fake, in fact, it doesn't even get a name in C++ codegen. But it's kind of inconsistent with the keypath model that we've been tracking unbacked bindings with: if you have a multi-output node, you'd expect a keypath like `[0].size()[0]` representing the first output's first dimension size. That suggests that it's the FallbackKernel that should define the things. So that was my first implementation. Unfortunately, the C++ codegen is too cursed and I could not understand how to make it work in that case. So now we just unsoundly assume you cannot have multi-output data dependent output, and do the codegen in MultiOutput. There are some comments explaining exactly what we are improperly assuming.
3. **_rename_unbacked_to** in **torch/fx/experimental/symbolic_shapes.py** - Previously, when we renamed unbacked symbols, we clobbered any facts we previously knew about them. So for example, if we had a replacement `u0 -> s0` but then we renamed u0 to u1, we would now setup the replacement `u0 -> u1`, clobbering the old replacement. This apparently didn't matter in earlier PRs in the stack, but with Inductor now on the ball, there were some tests that indicated this was a problem. The solution is easy: if u0 had a preexisting replacement, reapply it to u1. However...
    * **torch/_functorch/_aot_autograd/collect_metadata_analysis.py** - When we run forward analysis, this triggers fake tensor repropagation and fresh allocations. Previously, we just cleared out the pending symbols when finished the analysis. But with the change above, this would also migrate replacements to the new symbols... which are now dead. So now we explicitly suppress generation of these symbols with `ignore_fresh_unbacked_symbols` so that no rebinding happens at all.
    * **torch/_dynamo/eval_frame.py** - same deal; I just searched for all sites we called clear() on pending
4. The last step is fixing the long tail of extra problems that show up, now that unbacked_bindings are load bearing into Inductor
    * **torch/_dynamo/eval_frame.py** - Some of the exports are making copies of nodes without repropagating fake tensors, so in this case, it is important to also copy the `unbacked_bindings` (apparently this didn't matter before without the Inductor changes)
    * **torch/_export/pass_base.py** - I discover that this is doing fake tensor repropagation via a test suite failure. Do the same playbook as AOTAutograd: PropagateUnbackedSymInts too!  Actually, they also have implemented their own tracer as well, so do the same playbook as proxy_tensor: record unbacked_bindings on the newly traced nodes. UGH code duplication.
    * **torch/_subclasses/fake_tensor.py**, **torch/_subclasses/fake_impls.py** (with call site updates at  **torch/_functorch/_aot_autograd/traced_function_transforms.py** and **torch/fx/passes/fake_tensor_prop.py**) - What's this new epoch thing? I noticed that sometimes I would be retracing, call nonzero() on a fake tensor, and not allocate a new unbacked symbol. This is actually bad, because if I don't get a new unbacked symbol, I don't know there's a binding site, and `unbacked_bindings` is now missing a binding. The reason for this is memoization: if I reuse the exact same fake tensor on my retrace, it will already have an unbacked symint memoized on it and we will short circuit allocation. Well, that's no good. So I associate the memos with a fake tensor epoch, and every time you start a new fake tensor propagation from scratch, you bump the epoch so that I clear all the memos.
    * **torch/_inductor/scheduler.py** - I notice in unit tests that V.current_node is not always set when we call process_kernel. So I save it into the IR node and restore it when we are running `get_estimated_runtime`.
    * **torch/fx/experimental/symbolic_shapes.py** - A few things
      * **rebind_unbacked** (re **_tensor_version**). Ordinarily, when you have an unbacked SymInt, you persistently hvae it all the way to the end of the program. `_tensor_version` violates this: this generates an unbacked SymInt (for reasons I don't quite understand?) and then gets rid of it later. This triggered an assert violation. I think this op is kind of misusing unbacked SymInt, but I didn't know how to refactor it, so it gets a special case.
      * **rebind_unbacked** (re **Simplify SymBool binding**). Ugh, SymBool, what a pain in the butt. I have an assert that you can only rebind unbacked symbol to another unbacked symbol. This assert fails when a boolean is involved, because the result of running keypath on the result is not `u1`, it's `sympy.Piecewise(... sympy.Eq(u1, 1) ...)`. This is actually just `u1`, but Sympy doesn't know it because it doesn't know that `u1` value range is `[0, 1]`. So we manually implement the simplification needed to get the assert to pass.
      * **compute_unbacked_bindings** (re **This is pretty fragile**). There is a really funny disaster involving memoization and Inductor process kernel. Ordinarily when I retrace, if there was a memo hit in the old trace, there will be a memo hit in the new trace. However, Inductor process kernel breaks this, because it recreates fake tensor inputs to the operator call from scratch (since they might have different strides), and obviously these tensor inputs don't have the memo from the old one. I tried a little bit to try to manually transplant the memo to the new fake tensor but it seemed hopeless, so I just let the fresh symbol ride, allocating a new unbacked symbol. However, in one of our tests, we rely on knowing that the first nonzero call is equal to the second (memoized) nonzero call. The equality test looked pretty easy to discharge, so I just went ahead and added a deferred runtime assert to this effect and it worked.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124394
Approved by: https://github.com/jansel
ghstack dependencies: #124310, #124314, #124316
2024-04-25 02:08:59 +00:00
Zhengxu Chen
02ed2992d9 [export] Capture tensor.to() under export. (#123732)
Summary: We use to skip tensor.to() during tracing when the device is the same. This will bring some performance improvement in eager but making graph capture losing the semantics from original model. In this diff, we add an additional condition to skip the fast path when we don't have actual data inside a tensor, which is the case when we're using FakeTensor / FunctionalTensor to trace the model. This won't have perf impact on previous eager models while making sure we can capture the _to_copy() node in the graph.

Test Plan: buck test mode/opt caffe2/test:test_export -- -r device_to

Differential Revision: D55969674

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123732
Approved by: https://github.com/angelayi, https://github.com/tugsbayasgalan
2024-04-24 23:12:19 +00:00
Edward Z. Yang
a22847a9cb We should not be in kernel invocation before we restore fake mode (#124762)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124762
Approved by: https://github.com/eellison
ghstack dependencies: #124760
2024-04-24 20:32:59 +00:00
Edward Z. Yang
0d58aeb73a Handle size/etc accessors in FakeTensor, support accessing symbolic types from toInt/etc in IValue (#124760)
Fixes https://github.com/pytorch/pytorch/issues/122772

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124760
Approved by: https://github.com/albanD, https://github.com/eellison
2024-04-24 20:32:59 +00:00
Tugsbayasgalan (Tugsuu) Manlaibaatar
674e15ae07 Back out "Switch to predispatch" (#124860)
Summary:
Original commit changeset: 1f155b3a0bfc

Original Phabricator Diff: D56273267

Test Plan: CI

Differential Revision: D56526505

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124860
Approved by: https://github.com/angelayi
2024-04-24 17:28:33 +00:00
Tugsbayasgalan Manlaibaatar
c933af2709 Switch to predispatch (#123573)
This PR switches export IR from aot-dispatch to pre-dispatch IR.

**What is pre-dispatch IR and why should you care?**

Currently the default IR returned by torch.export can contain only functional ATen operators after ALL pytorch dispatcher decompositions (for example, CompositeImplicitAutograd) run.

In contrast, pre-dispatch IR refers to an IR that can contain all functional ATen operators (i.e., not just from the core subset), before any decomposition happens, as well as operators that manipulate autograd state. Pre-dispatch IR closely resembles eager PyTorch computation, but is still functional and serializable by torch.export. As a result:
- You can train the pre-dispatch IR in eager mode as the IR contains necessary information for the autograd engine to automatically generate a backward graph.
- You can write sound graph transformations more easily as the IR is functional.
- Since it is an ATen IR, it is still normalized. For example, torch.add has multiple overloads, but aten.add.Tensor is unique in this IR.

If you want to get the core aten IR out of `torch.export`, you will need to:
```
ep = torch.export.export(M(), inputs)
ep_for_core_aten = ep.run_decompositions()
```

Differential Revision: [D56273267](https://our.internmc.facebook.com/intern/diff/D56273267)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123573
Approved by: https://github.com/gmagogsfm
2024-04-24 00:51:09 +00:00
Aaron Gokaslan
c5fafe9f48 [BE]: TRY002 - Ban raising vanilla exceptions (#124570)
Adds a ruff lint rule to ban raising raw exceptions. Most of these should at the very least be runtime exception, value errors, type errors or some other errors. There are hundreds of instance of these bad exception types already in the codebase, so I have noqa'd most of them. Hopefully this error code will get commiters to rethink what exception type they should raise when they submit a PR.

I also encourage people to gradually go and fix all the existing noqas that have been added so they can be removed overtime and our exception typing can be improved.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124570
Approved by: https://github.com/ezyang
2024-04-21 22:26:40 +00:00
rzou
889e3eeed3 Avoid cuda init to FakeTensorMode (#124413)
Also partially fixes #122109

This PR:
- We add a C++ flag (only_lift_cpu_tensors) to toggle the
  torch.tensor(1, device='cuda') ctor strategy.
  When false (default), it does the current PyTorch behavior
  of unconditionally constructing a concrete CUDA tensor then calling
  lift_fresh on it. When true, we instead construct a concrete CPU
  tensor, call lift_fresh, and then call Tensor.to(device) (under any ambient
  modes).
- FakeTensorMode flips this flag depending on if CUDA is available or
  not. We don't unconditionally set the flag to True because that is
  likely BC-breaking.

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124413
Approved by: https://github.com/eellison
2024-04-19 02:39:35 +00:00
Tugsbayasgalan Manlaibaatar
d23bf9cef0 Add fake impl for aten.unique2 (#124306)
Reapply of: https://github.com/pytorch/pytorch/pull/121571
Differential Revision: [D56258431](https://our.internmc.facebook.com/intern/diff/D56258431)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124306
Approved by: https://github.com/gmagogsfm
2024-04-17 22:55:27 +00:00
Brian Hirsh
7efaf54dc4 Fakeifying views shouldnt create symbols when dynamic=False (#123348)
Fixes https://github.com/pytorch/pytorch/issues/123298

I was also seeing some crashes in torchtrain due to dynamic shapes, even when I set `compile(dynamic=False)` (cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @wanchaol). This doesn't fix the underlying dynamic shape issues with compile + DTensor, but it does prevent dynamic shapes from leaking in.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123348
Approved by: https://github.com/ezyang
ghstack dependencies: #122502, #122751
2024-04-12 01:12:23 +00:00
Aaron Orenstein
2bcc83dfbd Preserve dispatch state across function tracing (#122073)
If we throw an exception in the "wrong" place we can end up with the dispatch state being in a weird state which can cause all future dispatching to fail. Preserve and restore it as part of `preserve_global_state` so we know it's sane after that.

Also fake_tensor's in_kernel_invocation_manager() was leaving a bit set in the dispatcher (DispatchKey.Dense) which affected follow-on code.  Fixed that to reset after as well.

Repro:

before:
```
$ rm test/dynamo_skips/TestSparseCPU.test_to_dense_with_gradcheck_sparse_cpu_complex64
$ PYTORCH_TEST_WITH_DYNAMO=1 pytest -s test/dynamo/test_export.py test/test_sparse.py -k 'test_to_dense_with_gradcheck_sparse_cpu_complex64'
======== 1 passed, 6173 deselected in 5.21s =============
$ PYTORCH_TEST_WITH_DYNAMO=1 pytest -s test/dynamo/test_export.py test/test_sparse.py -k 'test_torch_inference_mode_ctx or test_to_dense_with_gradcheck_sparse_cpu_complex64'
========= 1 skipped, 6172 deselected, 1 error in 5.29s =========
```
(note that test_to_dense_with_gradcheck_sparse_cpu_complex64 passes on its own but failed when including the skipped test_export.py tests)
after:
```
$ rm test/dynamo_skips/TestSparseCPU.test_to_dense_with_gradcheck_sparse_cpu_complex64
$ PYTORCH_TEST_WITH_DYNAMO=1 pytest -s test/dynamo/test_export.py test/test_sparse.py -k 'test_to_dense_with_gradcheck_sparse_cpu_complex64'
===================== 1 passed, 6173 deselected in 5.42s =====================
$ PYTORCH_TEST_WITH_DYNAMO=1 pytest -s test/dynamo/test_export.py test/test_sparse.py -k 'test_torch_inference_mode_ctx or test_to_dense_with_gradcheck_sparse_cpu_complex64'
===================== 1 passed, 1 skipped, 6172 deselected in 7.30s ======================
```
(note that test_to_dense_with_gradcheck_sparse_cpu_complex64 passes in both runs)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122073
Approved by: https://github.com/zou3519
2024-04-10 18:57:01 +00:00
drisspg
f4e2a226aa ScoreMod API (#121845)
# Summary

This PR adds a new higher-order_op: `templated_attention`.  This op is designed to extend the functionality of torch.nn.fucntional.scaled_dot_product_attention.  PyTorch has efficient pre-written fused-attention kernels. However, users want to modify how scores are computed (a substep inside attention) -- this traditionally requires the user to write their own attention kernel. One such modification to attention scores that is not currently supported by the top level SDPA op is:[ Attention with Linear Biases (ALiBi](https://arxiv.org/abs/2108.12409)).

This higher-order op will instead accept a callable( 'score_mod') function that is through torch.compile will be used to create an efficient attention kernel instantiation.

### Details

This HOP utilizes the existing fx and HOP infra to capture and convert the User `score-mod` function and convert to an FX graph module. Inductor then consumes this HOP that has a `ir.Subgraph` input. It will inline this lowered subgraph into a triton kernel which performs fused attention with the modification to the scores matrix inlined.

### API

The API for a score_mod function should be as follows:

```Python
def score_mod(score: torch.Tensor, batch: torch.Tensor, head: torch.Tensor, token_1: torch.Tensor, token_kv: torch.Tensor) -> torch.Tensor
```

This function receives five parameters:

- `score`: A scalar tensor representing the attention score, with the same data type and device as the query, key, and value tensors.
- `batch`, `head`, `seq_len_q`, `seq_len_kv`: Scalar tensors indicating the batch index, head index, query index, and key/value index, respectively, with torch.int data type and located on the same device as the score tensor.

Consider inputs query, key, value of shapes (2, 4, 16, 8), leading to an intermediate attention score matrix of shape (2, 4, 16, 16)

The score_mod function will be vectorized over each element of this matrix. For instance, modifying the score at the position corresponding to the 0th batch, 2nd head, between the 8th query and the 9th key element, would be invoked as:

```Python
score_mod(score[0,2,8,9], torch.tensor(0), torch.tensor(2), torch.tensor(8), torch.tensor(9))
```

### Examples
```Python
import torch
from torch.nn.attention.templated_attention import templated_attention

torch.manual_seed(0)

# Lets create some input tensors
# The input tensor has shape (batch_size, num_heads, seq_len, head_dim)
query = torch.randn(8, 8, 2048, 64, device="cuda", dtype=torch.float32)
key = torch.randn(8, 8, 2048, 64, device="cuda", dtype=torch.float32)
value = torch.randn(8, 8, 2048, 64, device="cuda", dtype=torch.float32)

# Lets create a fun new score_modification! I will call this
# Checkerboard. It will reduce the score for neighboring tokens (1 step apart)
# in the sequence. And increase the score for tokens 2 steps apart. For everything
# else, the score will remain the same.

def checkerboard(score, batch, head, token_q, token_kv):
    score = torch.where(torch.abs(token_kv - token_q) == 1, score * 0.5, score)
    score = torch.where(torch.abs(token_kv - token_q) == 2, score * 2.0, score)
    return score

# Lets call templated_attention with this new score modification
output = templated_attention(query, key, value, score_mod=checkerboard)

compiled_templated_attention = torch.compile(templated_attention)
out_compiled = compiled_templated_attention(query, key, value, score_mod=checkerboard)

torch.testing.assert_close(output, out_compiled, atol=2e-2, rtol=2e-2)
```

### Future Work
- This PR is currently only forward only. However the triton kernel for backwards where score_modifications to not rely on external buffers has been explored here: https://github.com/drisspg/transformer_nuggets/blob/main/transformer_nuggets/flash/flash_attention.py
- Kernel Improvements; There are has been some larger updates to the fused attention implementation that Triton uses in its tutorials. The implementation of this kernel is based on a prior version and should be updated.
- We may want to unify this API under the top level SDPA API and leave that as a follow up once this is more stable
- Should we error on CPU?
- There are some issues with dynamic shapes
- Capturing of free variables and lifting to inputs to the subgraph is not working correctly today

### Performance
Comparisons generated by this benchmark:

| Type    |   Speedup |   batch_size |   num_heads |   q_seq_len |   k_seq_len |   head_dim | score_mod     | dtype          |
|---------|-----------|--------------|-------------|-------------|-------------|------------|---------------|----------------|
| Average |     5.412 |              |             |             |             |            |               |                |
| Max     |     8.882 |           16 |          16 |        4096 |        4096 |         64 | relative_bias | torch.bfloat16 |
| Min     |     3.645 |            8 |          16 |         512 |         512 |         64 | causal_mask   | torch.bfloat16 |
| Min     |     0.345 |            1 |          16 |        1024 |        1024 |         64 | pathological  | torch.bfloat16 |

For reference

| Configuration                                 | Forward Time (µ seconds) | Backend          | Speedup |
|-----------------------------------------------|--------------------------|------------------|---------|
| Fastest Config in Sweep (`8 16 4096 4096 64 relative_bias torch.bfloat16`) | 3608                   | Templated Attention                | 1.0  |
| Compiled SDPA (No Mask)                       | 9928                   | Math             | 2.75x   |
| Compiled SDPA (With Mask)                     | 11898                    | Math             | 3.29x   |
| Compiled SDPA (With Mask) | 8704                      | Memory Efficient Attention | 2.42x   |
| Compiled SDPA (No Mask) | 2548                     | FlashAttention2 | 0.706x   |

The speedups are measuring compiled templated attention speed versus different calls to torch.nn.functional.sdpa

<details>

<summary> FULL PERFORMANCE SWEEP NUMBERS </summary>

|   batch_size |   num_heads |   q_seq_len |   k_seq_len |   head_dim | score_mod     | dtype          |   eager_time |   compiled_time |   speedup |
|--------------|-------------|-------------|-------------|------------|---------------|----------------|--------------|-----------------|-----------|
|            1 |          16 |         512 |         512 |         64 | causal_mask   | torch.bfloat16 |      331.444 |          67.221 |     4.931 |
|            1 |          16 |         512 |         512 |         64 | relative_bias | torch.bfloat16 |      335.300 |          64.187 |     5.224 |
|            1 |          16 |         512 |         512 |         64 | head_bias     | torch.bfloat16 |      352.039 |          63.806 |     5.517 |
|            1 |          16 |         512 |         512 |         64 | pathological  | torch.bfloat16 |      371.699 |         711.349 |     0.523 |
|            1 |          16 |        1024 |        1024 |         64 | causal_mask   | torch.bfloat16 |      333.488 |          86.455 |     3.857 |
|            1 |          16 |        1024 |        1024 |         64 | relative_bias | torch.bfloat16 |      322.363 |          82.469 |     3.909 |
|            1 |          16 |        1024 |        1024 |         64 | head_bias     | torch.bfloat16 |      349.967 |          82.233 |     4.256 |
|            1 |          16 |        1024 |        1024 |         64 | pathological  | torch.bfloat16 |      486.359 |        1412.453 |     0.344 |
|            1 |          16 |        4096 |        4096 |         64 | causal_mask   | torch.bfloat16 |     2794.597 |         551.188 |     5.070 |
|            1 |          16 |        4096 |        4096 |         64 | relative_bias | torch.bfloat16 |     3965.150 |         513.101 |     7.728 |
|            1 |          16 |        4096 |        4096 |         64 | head_bias     | torch.bfloat16 |     2408.013 |         504.759 |     4.771 |
|            1 |          16 |        4096 |        4096 |         64 | pathological  | torch.bfloat16 |     6850.531 |       16733.675 |     0.409 |
|            8 |          16 |         512 |         512 |         64 | causal_mask   | torch.bfloat16 |      441.939 |         123.576 |     3.576 |
|            8 |          16 |         512 |         512 |         64 | relative_bias | torch.bfloat16 |      560.379 |         116.710 |     4.801 |
|            8 |          16 |         512 |         512 |         64 | head_bias     | torch.bfloat16 |      421.172 |         115.825 |     3.636 |
|            8 |          16 |         512 |         512 |         64 | pathological  | torch.bfloat16 |      994.492 |        2132.806 |     0.466 |
|            8 |          16 |        1024 |        1024 |         64 | causal_mask   | torch.bfloat16 |     1436.430 |         309.495 |     4.641 |
|            8 |          16 |        1024 |        1024 |         64 | relative_bias | torch.bfloat16 |     1892.216 |         290.186 |     6.521 |
|            8 |          16 |        1024 |        1024 |         64 | head_bias     | torch.bfloat16 |     1360.665 |         282.956 |     4.809 |
|            8 |          16 |        1024 |        1024 |         64 | pathological  | torch.bfloat16 |     3525.532 |        8359.702 |     0.422 |
|            8 |          16 |        4096 |        4096 |         64 | causal_mask   | torch.bfloat16 |    22026.839 |        3864.604 |     5.700 |
|            8 |          16 |        4096 |        4096 |         64 | relative_bias | torch.bfloat16 |    31262.746 |        3609.551 |     8.661 |
|            8 |          16 |        4096 |        4096 |         64 | head_bias     | torch.bfloat16 |    20219.079 |        3480.402 |     5.809 |
|            8 |          16 |        4096 |        4096 |         64 | pathological  | torch.bfloat16 |    54654.647 |      116652.357 |     0.469 |
|           16 |          16 |         512 |         512 |         64 | causal_mask   | torch.bfloat16 |      820.606 |         188.683 |     4.349 |
|           16 |          16 |         512 |         512 |         64 | relative_bias | torch.bfloat16 |     1058.362 |         179.295 |     5.903 |
|           16 |          16 |         512 |         512 |         64 | head_bias     | torch.bfloat16 |      784.372 |         175.714 |     4.464 |
|           16 |          16 |         512 |         512 |         64 | pathological  | torch.bfloat16 |     1890.792 |        4212.877 |     0.449 |
|           16 |          16 |        1024 |        1024 |         64 | causal_mask   | torch.bfloat16 |     2781.830 |         557.017 |     4.994 |
|           16 |          16 |        1024 |        1024 |         64 | relative_bias | torch.bfloat16 |     3694.050 |         525.249 |     7.033 |
|           16 |          16 |        1024 |        1024 |         64 | head_bias     | torch.bfloat16 |     2634.164 |         507.613 |     5.189 |
|           16 |          16 |        1024 |        1024 |         64 | pathological  | torch.bfloat16 |     6959.917 |       15331.116 |     0.454 |
|           16 |          16 |        4096 |        4096 |         64 | causal_mask   | torch.bfloat16 |    43889.096 |        7582.018 |     5.789 |
|           16 |          16 |        4096 |        4096 |         64 | relative_bias | torch.bfloat16 |    62784.293 |        7075.846 |     8.873 |
|           16 |          16 |        4096 |        4096 |         64 | head_bias     | torch.bfloat16 |    40308.606 |        6829.587 |     5.902 |
|           16 |          16 |        4096 |        4096 |         64 | pathological  | torch.bfloat16 |   108892.137 |      233090.953 |     0.467 |
</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121845
Approved by: https://github.com/Chillee, https://github.com/zou3519
2024-04-06 01:10:44 +00:00
rzou
81e7a7c955 Add mutated_args field to custom_op (#123129)
If provided, we:
- autogenerate an ADInplaceOrView implementation
- assume that no mutated inputs are returned as outputs. There are
  already aliasing runtime checks that check this.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123129
Approved by: https://github.com/albanD
ghstack dependencies: #123108, #123109, #123110
2024-04-05 22:03:51 +00:00
rzou
d486cb7c1b Deprecate calling FakeTensor.data_ptr in eager-mode (#123292)
Today, we error out on FakeTensor.data_ptr under torch.compile. This PR
moves to error out on FakeTensor.data_ptr under eager mode to avoid
diverging behavior.

We do this by adding another bit onto FakeTensor that we'll remove after
the deprecation cycle.

Test Plan:
- tested locally
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123292
Approved by: https://github.com/eellison
ghstack dependencies: #123261, #123282, #123291
2024-04-04 20:35:24 +00:00
rzou
fd60752786 Turn _allow_unsafe_data_ptr_access into a config option (#123291)
We're not planning on having this flag around for very long (see
deprecation in next PR), so it's better as a config option.

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123291
Approved by: https://github.com/eellison
ghstack dependencies: #123261, #123282
2024-04-04 20:35:24 +00:00
ydwu4
c77352b5cc Add torch._library.register_fake_class to fakify torchBind class (#122622)
This PR only adds abstract class registration logic without touching existing tests so they still trace with real script object. The added tests are only for registration APIs and test error messages.

Our design is that the abstract implementation should be in Python. This is much better in terms of usability. But this also has implications for custom op that takes script object as input, which is detailed later in this stack.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122622
Approved by: https://github.com/zou3519
ghstack dependencies: #122619, #122620, #122621
2024-04-02 23:52:17 +00:00
Pearu Peterson
72662bf05b [BE] Add torch.ops.aten._sparse_compressed_tensor_with_dims (#123083)
Used in https://github.com/pytorch/pytorch/pull/123084 and allows simplifying `empty_like` implementation for sparse compressed tensors (see https://github.com/pytorch/pytorch/pull/121900#issuecomment-2029835473).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123083
Approved by: https://github.com/cpuhrsch
2024-04-02 10:12:21 +00:00
Guilherme Leobas
9ff2a9dcdd [dynamo] Skip leaf check on assert_metadata_eq if grad tensor level is -2 (#122728)
When fakifying a grad tracking tensor, if the level is -2 (sentinel
value) we can just unwrap the grad tensor and return a fake version of
it. In this PR, we update the `assert_metadata_eq` to not compare if
the grad tensor and the unwrapped ones are leafs or not, as this may
not be always true.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122728
Approved by: https://github.com/zou3519
2024-04-01 15:38:16 +00:00
drisspg
557e7c9c16 Add some type hints to functions and update a few spelling mistakes (#123015)
# Summary
While working on this PR: https://github.com/pytorch/pytorch/pull/121845
I found that these type hints made my ide/ noob experience easier to reason about

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123015
Approved by: https://github.com/Skylion007
2024-03-30 21:15:01 +00:00
Edward Z. Yang
8c8e4e31f2 Some improvements to nonzero post guard_size_oblivious (#122156)
Prompted by https://github.com/pytorch/pytorch/pull/121571

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122156
Approved by: https://github.com/jansel
2024-03-28 03:53:16 +00:00
rzou
c81c9ba472 Disallow {FakeTensor,FunctionalTensor}.data_ptr (#122514)
This PR:
- disallows FakeTensor.data_ptr when it is called inside PT2 or fx tracing.
- disallows FunctionalTensor.data_ptr (python FunctionalTensor is only used in
  PT2)

The motivation behind this is that the leading cause of segfaults when
using custom ops with PT2 is calling .data_ptr on FunctionalTensor or
FakeTensor.

This change is BC-breaking. If your code broke as a result of this, it's
because there was a bug in it (these .data_ptr should never be
accessed!). You can either fix the bug (recommended) or get the previous
behavior back with:
```
from torch._subclasses.fake_tensor import FakeTensor
from torch._subclasses.functional_tensor import FunctionalTensor

data_ptr = 0 if isinstance(tensor, (FakeTensor, FunctionalTensor)) else tensor.data_ptr()
```

Test Plan:
- existing tests

Differential Revision: [D55366199](https://our.internmc.facebook.com/intern/diff/D55366199)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122514
Approved by: https://github.com/ezyang, https://github.com/albanD, https://github.com/yifuwang, https://github.com/kurtamohler
2024-03-26 23:55:42 +00:00
Edward Z. Yang
85845a29db Refactor ShapeEnvSettings so it's directly on ShapeEnv (#122310)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122310
Approved by: https://github.com/masnesral, https://github.com/lezcano
2024-03-26 14:16:33 +00:00
Edward Z. Yang
268b0cc714 Do not run CUDA lazy init if it is triggered with fake mode on. (#122636)
Partially fixes https://github.com/pytorch/pytorch/issues/122109

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122636
Approved by: https://github.com/zou3519
2024-03-26 05:43:59 +00:00
Edward Z. Yang
1af6fc5e03 Remove top-level DisableFuncTorch; clearing interpreter stack should work. (#122610)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122610
Approved by: https://github.com/zou3519
ghstack dependencies: #122202
2024-03-26 03:08:22 +00:00
Edward Z. Yang
05bbcae5bb Refactor functorch meta conversion (#122202)
At a high level, the goal of this refactor was to make it so that `MetaConverter.__call__` has a straightforward code structure in three steps: (1) check if we support doing meta conversion, (2) describe the tensor into MetaTensorDesc, (3) call `meta_tensor` on MetaTensorDesc. However, this is not so easy to do, because there is a big pile of special cases for functional tensor inside `__call__`.

The primarily complication is handling the ambient functionalization state: specifically, the functorch dynamic layer stack and the Python functionalization dispatch. The old code demands that meta tensor conversion happen with this state disabled. But I discovered that when I reconstruct functorch tensors it demands that the functorch layers be active; in fact a batch tensor will have a pointer to the internal functorch layer.

I had some discussion with Richard Zou about what code structure here makes sense. In particular, one of the goals of the refactor here is that I can inflate MetaTensorDesc from an entirely different process, which may not have all of the functorch layers activated at the time we do reconstruction. So it seems to me that we should make it explicit in MetaTensorDesc that there was some functorch layer active at the time the functorch tensor was serialized, so that we could potentially know we need to reconstruct these layers on the other side. This is NOT implemented yet, but there's some notes about how potentially it could proceed. But the important thing here is we SHOULD disable everything when we run `meta_tensor`, and internally be responsible for restoring the stack. Actually, the necessary infra bits in functorch don't exist to do this, so I added some simple implementations in pyfunctorch.py.

The rest is splitting up the manipulations on tensor (we do things like sync the real tensor before describing it; Describer is responsible for this now) and I also tried to simplify the not supported condition, based on my best understanding of what the old thicket of conditions was doing. You may notice that the internal meta_tensor handling of functional tensor is inconsistent with surrounding code: this is because I *exactly* replicated the old reconstruction behavior; a further refactor would be to rationalize this.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122202
Approved by: https://github.com/zou3519
2024-03-25 20:47:21 +00:00
Edward Z. Yang
47a9725de9 Implement prefer_deferred_runtime_asserts_over_guards (#122090)
Fixes https://github.com/pytorch/pytorch/issues/121749

As promised, it is pretty easy.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122090
Approved by: https://github.com/lezcano
2024-03-25 16:31:16 +00:00
Edward Z. Yang
1e404c9b12 Remove redundant query to tensor_to_context (#122278)
from_real_tensor will query it again, so this query is strictly
dominated.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122278
Approved by: https://github.com/eellison
ghstack dependencies: #122044, #122270, #122271
2024-03-25 13:16:21 +00:00
Edward Z. Yang
49b81af45f Delete dead memoized_only kwarg in FakeTensor (#122271)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122271
Approved by: https://github.com/eellison
ghstack dependencies: #122044, #122270
2024-03-25 13:16:21 +00:00
Edward Z. Yang
f32ce4e28e Delete FakeTensorConverter.__call__ in favor of from_real_tensor (#122270)
It's annoying grepping for `__call__` call-sites so they're now all explicit now. I'd do this to MetaConverter too but that one is way more public, a lot more sites.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122270
Approved by: https://github.com/eellison
ghstack dependencies: #122044
2024-03-25 13:16:13 +00:00
Edward Z. Yang
5891c5b3a6 Factor meta conversion through serializable MetaTensorDesc (#122044)
Fixes https://github.com/pytorch/pytorch/issues/121085

This PR pretty involved so pay attention to this description.  At a high
level, the refactor is intended to be mechanical: anywhere in
MetaConverter where previously we took a Tensor as argument, we now take
a MetaTensorDesc, which contains all of the information that we would
have queried off of the Tensor, but placed into a separate data
structure which we can serialize or use to recreate a fake tensor in
a separate fake tensor mode in exact fidelity to the original.

However, this transformation is not always entirely mechanical.  Here
is what you need to pay attention to:

- The memo table from real Tensor -> meta/fake Tensor is now broken
  into two memo tables: real Tensor -> stable int id -> meta/fake
  Tensor.  The stable int id is needed so that when we do serialization,
  we know when tensors/storages alias each other and can ensure we preserve
  this aliasing upon deserialization.

  The way I have implemented changes the weak reference behavior.
  Previously, when either the real Tensor OR the meta/fake Tensor went
  dead, we would remove the entry from the memo table.  Now, this only
  removes entries from one of the two memo tables.  This semantically
  makes sense, because the user may have held on to the stable int id
  out of band, and may expect a real Tensor to continue to be numbered
  consistently / expect to be able to lookup a meta/fake tensor from
  this id.  If this is unacceptable, it may be possible to rejigger
  the memo tables so that we have real Tensor -> stable int id
  and real Tensor -> meta/fake Tensor, but TBH I find the new
  implementation a lot simpler, and arranging the memo tables in this
  way means that I have to muck around with the real tensor to save
  to the memo table; in the current implementation, I never pass the
  Tensor to meta_tensor function AT ALL, which means it is impossible
  to accidentally depend on it.

- When I fill in the fields of MetaTensorDesc in describe_tensor, I need
  to be careful not to poke fields when they are not valid.  Previously,
  preconditions were implicitly checked via the conditional structure
  ("is this sparse? is this nested?") that is tested before we start
  reading attributes.  This structure has to be replicated in
  describe_tensor, and I have almost assuredly gotten it wrong on my
  first try (I'll be grinding through it on CI; a careful audit will
  help too, by auditing that I've tested all the same conditionals that
  the original access was guarded by.)

- I originally submitted https://github.com/pytorch/pytorch/pull/121821
  for the symbolic shapes change, but it turned out the way I did it
  there didn't actually work so well for this PR.  I ended up just
  inlining the symbolic shapes allocation logic into MetaConverter
  (look for calls to maybe_specialize_sym_int_with_hint), maybe there
  is a better way to structure it, but what I really want is to
  just read sizes/strides/offset directly off of MetaTensorDesc; I
  don't want another intermediate data structure.

- Some fields aren't serializable. These are documented as "NOT
  serializable".  ctx/type should morally be serializable and I just
  need to setup a contract with subclasses to let them be serialized.
  The fake_mode is used solely to test if we are refakefying with
  a pre-existing ShapeEnv and we want to reuse the SymInt
  directly--serializing this case is hopeless but I am kind of hoping
  after this refactor we do not need this at all.  view_func is not
  serializable because it's a bound C implemented method.  Joel has
  promised me that this is not too difficult to actually expose as a
  true data structure, but this is the edgiest of edge cases and there
  is no reason to deal with it right now.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122044
Approved by: https://github.com/eellison
2024-03-25 06:21:17 +00:00
Guilherme Leobas
4eaa000acc Teach dynamo about torch.func.jvp (#119926)
List of changes:
- Replace JVP_NESTING by torch._C._functorch.maybe_current_level()
- Remove all increment nesting functions from wrap_fx_proxy_cls
- fwAD.make_dual receives the dual_level as keyword argument
- Add jvp_increment_nesting, set_fwd_grad_enabled and dual_level context managers to dynamo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119926
Approved by: https://github.com/zou3519
2024-03-22 20:25:47 +00:00
PyTorch MergeBot
f65373e278 Revert "Factor meta conversion through serializable MetaTensorDesc (#122044)"
This reverts commit e2d89e9704.

Reverted https://github.com/pytorch/pytorch/pull/122044 on behalf of https://github.com/jeanschmidt due to Seems that some landrace caused this PR to break lint ([comment](https://github.com/pytorch/pytorch/pull/122044#issuecomment-2015025490))
2024-03-22 12:46:21 +00:00
Edward Z. Yang
e2d89e9704 Factor meta conversion through serializable MetaTensorDesc (#122044)
Fixes https://github.com/pytorch/pytorch/issues/121085

This PR pretty involved so pay attention to this description.  At a high
level, the refactor is intended to be mechanical: anywhere in
MetaConverter where previously we took a Tensor as argument, we now take
a MetaTensorDesc, which contains all of the information that we would
have queried off of the Tensor, but placed into a separate data
structure which we can serialize or use to recreate a fake tensor in
a separate fake tensor mode in exact fidelity to the original.

However, this transformation is not always entirely mechanical.  Here
is what you need to pay attention to:

- The memo table from real Tensor -> meta/fake Tensor is now broken
  into two memo tables: real Tensor -> stable int id -> meta/fake
  Tensor.  The stable int id is needed so that when we do serialization,
  we know when tensors/storages alias each other and can ensure we preserve
  this aliasing upon deserialization.

  The way I have implemented changes the weak reference behavior.
  Previously, when either the real Tensor OR the meta/fake Tensor went
  dead, we would remove the entry from the memo table.  Now, this only
  removes entries from one of the two memo tables.  This semantically
  makes sense, because the user may have held on to the stable int id
  out of band, and may expect a real Tensor to continue to be numbered
  consistently / expect to be able to lookup a meta/fake tensor from
  this id.  If this is unacceptable, it may be possible to rejigger
  the memo tables so that we have real Tensor -> stable int id
  and real Tensor -> meta/fake Tensor, but TBH I find the new
  implementation a lot simpler, and arranging the memo tables in this
  way means that I have to muck around with the real tensor to save
  to the memo table; in the current implementation, I never pass the
  Tensor to meta_tensor function AT ALL, which means it is impossible
  to accidentally depend on it.

- When I fill in the fields of MetaTensorDesc in describe_tensor, I need
  to be careful not to poke fields when they are not valid.  Previously,
  preconditions were implicitly checked via the conditional structure
  ("is this sparse? is this nested?") that is tested before we start
  reading attributes.  This structure has to be replicated in
  describe_tensor, and I have almost assuredly gotten it wrong on my
  first try (I'll be grinding through it on CI; a careful audit will
  help too, by auditing that I've tested all the same conditionals that
  the original access was guarded by.)

- I originally submitted https://github.com/pytorch/pytorch/pull/121821
  for the symbolic shapes change, but it turned out the way I did it
  there didn't actually work so well for this PR.  I ended up just
  inlining the symbolic shapes allocation logic into MetaConverter
  (look for calls to maybe_specialize_sym_int_with_hint), maybe there
  is a better way to structure it, but what I really want is to
  just read sizes/strides/offset directly off of MetaTensorDesc; I
  don't want another intermediate data structure.

- Some fields aren't serializable. These are documented as "NOT
  serializable".  ctx/type should morally be serializable and I just
  need to setup a contract with subclasses to let them be serialized.
  The fake_mode is used solely to test if we are refakefying with
  a pre-existing ShapeEnv and we want to reuse the SymInt
  directly--serializing this case is hopeless but I am kind of hoping
  after this refactor we do not need this at all.  view_func is not
  serializable because it's a bound C implemented method.  Joel has
  promised me that this is not too difficult to actually expose as a
  true data structure, but this is the edgiest of edge cases and there
  is no reason to deal with it right now.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122044
Approved by: https://github.com/eellison
ghstack dependencies: #122018
2024-03-22 03:56:34 +00:00
Joel Schlosser
470b44c048 Support for torch.nested.as_nested_tensor(t) (#113280)
This PR adds support for tensor inputs to `as_nested_tensor()`. The tensor is treated as a batch of consistently-sized constituents. It utilizes `_nested_view_from_values_offsets()` to return a real view that allows for propagating gradients into inputs.
Co-authored-by: voznesenskym <voznesenskym@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113280
Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer
ghstack dependencies: #113279
2024-03-22 02:12:37 +00:00
PyTorch MergeBot
12e7602cf9 Revert "Support for torch.nested.as_nested_tensor(t) (#113280)"
This reverts commit 17c9c70265.

Reverted https://github.com/pytorch/pytorch/pull/113280 on behalf of https://github.com/jbschlosser due to Need to fix BC thing ([comment](https://github.com/pytorch/pytorch/pull/113280#issuecomment-2013893099))
2024-03-21 22:00:44 +00:00
Joel Schlosser
17c9c70265 Support for torch.nested.as_nested_tensor(t) (#113280)
This PR adds support for tensor inputs to `as_nested_tensor()`. The tensor is treated as a batch of consistently-sized constituents. It utilizes `_nested_view_from_values_offsets()` to return a real view that allows for propagating gradients into inputs.
Co-authored-by: voznesenskym <voznesenskym@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113280
Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer
2024-03-21 04:13:55 +00:00
PyTorch MergeBot
0696db8202 Revert "Teach dynamo about torch.func.jvp (#119926)"
This reverts commit 17489784b6.

Reverted https://github.com/pytorch/pytorch/pull/119926 on behalf of https://github.com/peterbell10 due to broken mac jobs on main ([comment](https://github.com/pytorch/pytorch/pull/119926#issuecomment-2010327997))
2024-03-20 18:34:43 +00:00
Adnan Akhundov
2e02e1efad Skip nonzero unbacked SymInt memo in inference mode (#122147)
Summary: In `torch.inference_mode()`, fake tensors don't have `_version`s. This breaks unbacked SymInt memoization in `torch.nonzero` tracing. Here we disable the latter in inference mode.

Fixes https://github.com/pytorch/pytorch/issues/122127

Test Plan:

```
$ python test/inductor/test_unbacked_symints.py -k test_nonzero_in_inference_mode
...
----------------------------------------------------------------------
Ran 2 tests in 14.060s

OK
```

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122147
Approved by: https://github.com/ezyang
2024-03-20 14:44:55 +00:00
Guilherme Leobas
17489784b6 Teach dynamo about torch.func.jvp (#119926)
List of changes:
- Replace JVP_NESTING by torch._C._functorch.maybe_current_level()
- Remove all increment nesting functions from wrap_fx_proxy_cls
- fwAD.make_dual receives the dual_level as keyword argument
- Add jvp_increment_nesting, set_fwd_grad_enabled and dual_level context managers to dynamo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119926
Approved by: https://github.com/zou3519
2024-03-20 13:09:19 +00:00
Tugsbayasgalan (Tugsuu) Manlaibaatar
5b7ceab650 Support auto_functionalize in pre-dispatch (#122177)
Summary: Title

Test Plan: CI

Differential Revision: D55042061

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122177
Approved by: https://github.com/zou3519
2024-03-20 04:17:58 +00:00
PyTorch MergeBot
36e5c1dcab Revert "Teach dynamo about torch.func.jvp (#119926)"
This reverts commit edd04b7c16.

Reverted https://github.com/pytorch/pytorch/pull/119926 on behalf of https://github.com/jeanschmidt due to lots of breakages in pull jobs, checking if reverting this one will help ([comment](https://github.com/pytorch/pytorch/pull/119926#issuecomment-2007915919))
2024-03-19 18:59:46 +00:00
PyTorch MergeBot
7673cb534a Revert "Skip nonzero unbacked SymInt memo in inference mode (#122147)"
This reverts commit 5e26873912.

Reverted https://github.com/pytorch/pytorch/pull/122147 on behalf of https://github.com/jeanschmidt due to Reverting to see if trunk error in inductor are related ([comment](https://github.com/pytorch/pytorch/pull/122147#issuecomment-2007513000))
2024-03-19 15:37:24 +00:00
Guilherme Leobas
edd04b7c16 Teach dynamo about torch.func.jvp (#119926)
List of changes:
- Replace JVP_NESTING by torch._C._functorch.maybe_current_level()
- Remove all increment nesting functions from wrap_fx_proxy_cls
- fwAD.make_dual receives the dual_level as keyword argument
- Add jvp_increment_nesting, set_fwd_grad_enabled and dual_level context managers to dynamo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119926
Approved by: https://github.com/zou3519
2024-03-19 13:06:42 +00:00
Adnan Akhundov
5e26873912 Skip nonzero unbacked SymInt memo in inference mode (#122147)
Summary: In `torch.inference_mode()`, fake tensors don't have `_version`s. This breaks unbacked SymInt memoization in `torch.nonzero` tracing. Here we disable the latter in inference mode.

Test Plan:

```
$ python test/inductor/test_unbacked_symints.py -k test_nonzero_in_inference_mode
...
----------------------------------------------------------------------
Ran 2 tests in 14.060s

OK
```

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122147
Approved by: https://github.com/ezyang
2024-03-19 03:20:33 +00:00
Edward Z. Yang
74c09a757b Simplify Storage meta conversion with PyObject preservation (#122018)
Thanks to https://github.com/pytorch/pytorch/pull/109039 we can rely on
finalizers on Storage PyObject to handle removal from dict.

Irritatingly, we still have to attach finalizer, because we don't have
a weak key AND value dict (only one or the other).

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122018
Approved by: https://github.com/eellison, https://github.com/kurtamohler
2024-03-18 18:55:58 +00:00
Edward Z. Yang
6f74b76072 Move get_unwrapped outside of disable_functorch (#121849)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121849
Approved by: https://github.com/albanD
2024-03-16 22:25:07 +00:00
Wenting Wang
02bb2180f4 [torch export] replace traceback.extract_stack with CapturedTraceback.extract (#121449)
Summary:
with a simple bench in TestDeserializer.test_basic function:
```
time_start = time.time()
for i in range(1000):
    self.check_graph(MyModule(), inputs)
warnings.warn(f"time_taken: {time.time() - time_start}")
```
and forcing FakeTensorConfig.debug to True, record_stack_traces to True, logging level to debug, it shows that the the changed code is consistently ard 20 secs faster (~90s vs originally ~110s)

Test Plan:
test passed, see summary

compared debug trace before and after:
- exactly the same for fake tensor and proxy callsite https://www.internalfb.com/intern/diffing/?paste_number=1189883685
- slightly different for the user frame in proxy node https://www.internalfb.com/intern/diffing/?paste_number=1189884347

Differential Revision: D54237017

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121449
Approved by: https://github.com/angelayi
2024-03-13 00:19:05 +00:00
Colin Peppler
13a25c647f [export] improve binary op fast path broadcast check (#121546)
# Context
I believe we have an incorrect guard being created during FakeTensor's binary op fast path.

Consider this case
```
# op.shape: (10, 192); final_shape: (s0, 10, 192)
# Guard Ne(s0, 10) is created when we create SymBool(10 == s0)
if isinstance(op, torch.Tensor) and op.shape == final_shape:
    break
```

As of right now, `op.shape == final_shape` checks whether one of the binary op's operands is the same as the binay op's output shape.
* If one of them is a dynamic shape, then we'll create a guard via`SymBool` creation (i.e. `s0 == 10`).
* If the `SymBool` expr resolves to `false`, then we'll create the guard `Ne(s0, 10)`.

This is a problem when the # of dimensions aren't the same between `op.shape` & `final_shape`. Take the case above for example, `op.shape: (10, 192); final_shape: (s0, 10, 192)`. Although, the shapes aren't the same, it doesn't necessarily mean that `s0 != 10`.

Some thoughts (feel free to ignore). What if the # of dimensions are equal but one of the shapes has symbols. Here's three cases:
  1. `op.shape: (9000, 10, 192); final_shape: (s0, 10, 192)` -- not broadcastable.
  2. `op.shape: (1, 10, 192); final_shape: (s0, 10, 192)` -- 0/1 specialization wins?
  3. `op.shape: (100, 10, 192); final_shape: (s0, 10, 192) where s0 = 100` -- Ask user to mark `s0` as a constant.

# Test
```
$ TORCHDYNAMO_VERBOSE=1 PYTORCH_TEST_WITH_DYNAMO=1 pytest -s test/dynamo/test_dynamic_shapes.py -k test_export_fast_binary_broadcast_check_dynamic_shapes

torch.fx.experimental.symbolic_shapes.ConstraintViolationError: Constraints violated (dim0)! For more information, run with TORCH_LOGS="+dynamic".
  - Not all values of dim0 = L['a'].size()[0] in the specified range 3 <= dim0 <= 1024 satisfy the generated guard Ne(L['a'].size()[0], 3).
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121546
Approved by: https://github.com/aakhundov
2024-03-09 01:49:42 +00:00
albanD
6791b0c09e Change default torch_function behavior to be disabled when torch_dispatch is defined (take 2) (#120632)
This does not introduce a new test but is tested by checking that all the classes we already have still behave as before now that they don't explicitly disable torch_function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120632
Approved by: https://github.com/ezyang
2024-03-09 01:08:37 +00:00
David Berard
0f8c9acc29 Revert "[fake_impls] Fix seed/offset device for attention kernels (#120839)" (#121447)
This reverts commit df3c8b8390.

It regressed cudagraphs+PT2 performance on SDPA.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121447
Approved by: https://github.com/Chillee
2024-03-08 01:48:23 +00:00
Tugsbayasgalan Manlaibaatar
4f120dc2a6 Clean up mode handling in python dispatcher (#121083)
Things that were bad before this PR:
1. Temporarily unsetting functional tensor mode and proxy mode both had duplicate implementation
2. There are variants of mode handling private utils that has duplicate implementation. (different APIs calling repeated implementation, so i refactored)
3. _push_mode API used to take dispatch key argument which is not necessary.
4. There are unused APIs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121083
Approved by: https://github.com/zou3519
2024-03-08 00:30:34 +00:00
Joel Schlosser
ea8f6e2e54 Subclass view fake-ification via reified ViewFuncs (#118405)
This PR:
* Uses reified ViewFuncs to swap in fake tensors / symbolic SymInts for view replay during subclass view fake-ification
* Enables automatic dynamic on view bases -> fakeifies according to the resultant symbolic context instead of the old "all-static" approach
* Covers the following view types:
    * subclass -> dense
    * dense -> subclass
    * subclass -> subclass
* Dense -> dense views are handled the old way via an `as_strided()` call, as it's likely there is no view func available

Differential Revision: [D54269082](https://our.internmc.facebook.com/intern/diff/D54269082)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118405
Approved by: https://github.com/ezyang
2024-03-07 19:56:16 +00:00
Aaron Orenstein
edd80f87b8 Prevent infinite recursion within Tensor.__repr__ (#120206)
`Tensor.__repr__` calls functions which can perform logging which ends up logging `self` (with `__repr__`) causing an infinite loop. Instead of logging all the args in FakeTensor.dispatch log the actual parameters (and use `id` to log the tensor itself).

The change to torch/testing/_internal/common_utils.py came up during testing - in some ways of running the test parts was `('test', 'test_testing.py')` and so `i` was 0 and we were doing a join on `()` which was causing an error.

Repro:
```
import torch
from torch.testing import make_tensor
from torch._subclasses.fake_tensor import FakeTensor, FakeTensorMode
t = torch.sparse_coo_tensor(((0, 1), (1, 0)), (1, 2), size=(2, 2))
t2 = FakeTensor.from_tensor(t, FakeTensorMode())
print(repr(t2))
```
and run with `TORCH_LOGS=+all`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120206
Approved by: https://github.com/yanboliang, https://github.com/pearu
2024-03-07 02:24:45 +00:00
PaulZhang12
c66d68ba51 [PT2] Add tolist() to FunctionalTensor for torch.export (#121242)
Adding tolist() to FunctionalTensor for torch.exporting TorchRec data types
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121242
Approved by: https://github.com/ezyang
2024-03-06 18:10:44 +00:00
angelayi
a7e93c341f [hoo] Add with_effects to handle side effectful ops (#120296)
Proposal: https://docs.google.com/document/d/179QyhicGzTXJ5jvTAoAosP_Nzgf3PpgZwU_E3VV9PlM/edit#heading=h.bnm38nu3yfno
Implementation discussion: https://docs.google.com/document/d/179QyhicGzTXJ5jvTAoAosP_Nzgf3PpgZwU_E3VV9PlM/edit#heading=h.bj61609o1buq

Result with print:
```
graph():
    %arg0_1 : [num_users=1] = placeholder[target=arg0_1]
    %arg1_1 : [num_users=1] = placeholder[target=arg1_1]
    %with_effects : [num_users=1] = call_function[target=torch._higher_order_ops.effects.with_effects](args = (%arg0_1, aten.print.default, moo), kwargs = {})
    %getitem : [num_users=1] = call_function[target=operator.getitem](args = (%with_effects, 0), kwargs = {})
    %add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%arg1_1, %arg1_1), kwargs = {})
    return [getitem, add]
```

Follow ups:
* Add handling to auto_functionalize
* Add support for tokens on the export side
* Add support for tokens on the inductor side

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120296
Approved by: https://github.com/zou3519
2024-03-05 08:58:32 +00:00
David Berard
df3c8b8390 [fake_impls] Fix seed/offset device for attention kernels (#120839)
1) Fix fake_impls to return the correct device for these attention
   kernels.
2) Remove special-casing and test file xfails
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120839
Approved by: https://github.com/drisspg
2024-03-04 22:02:32 +00:00
Pearu Peterson
ce2903080c Add sparse compressed fake tensor support (#120920)
As in the title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120920
Approved by: https://github.com/ezyang
2024-03-04 14:38:45 +00:00
Jason Ansel
0e0a621e0c [dynamo] Minor refactors (#120966)
These are changes I pulled out of the above PRs due to not being
related.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120966
Approved by: https://github.com/yanboliang
2024-03-03 02:20:48 +00:00
Pearu Peterson
b8e6ca6f76 Add sparse compressed meta tensor support (#120707)
As in the title.

Replaces https://github.com/pytorch/pytorch/pull/120498 and https://github.com/pytorch/pytorch/pull/120562

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120707
Approved by: https://github.com/ezyang
ghstack dependencies: #120703
2024-03-01 13:28:47 +00:00
Tugsbayasgalan Manlaibaatar
c646030cd2 Support higher order op functionalization in predispatch IR (#115314)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115314
Approved by: https://github.com/bdhirsh
2024-03-01 09:13:47 +00:00
James Wu
a911eb74ae [dynamo] Graph break when faking named tensors (#120779)
Fixes #120644
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120779
Approved by: https://github.com/zou3519
2024-02-29 18:22:15 +00:00
David Berard
df1e855313 [fake_impls] fix max_seqlen return values in efficient_attention_forward (#120842)
To match the actual implementation, we should return the max_seqlen_q/k, not M, N, when in the sparse case

7e185277cd/aten/src/ATen/native/transformers/cuda/attention.cu (L981-L996)

Note that although the .cu file sets max_seqlen_k = 0 in the sparse case, it actually returns max_seqlen_k or N:

7e185277cd/aten/src/ATen/native/transformers/cuda/attention.cu (L1224-L1231)

Tests - added in the next PR (#102839, which also fixes other parts of the test_fake tests so that we can un-xfail them and actually run the tests)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120842
Approved by: https://github.com/YuqingJ
ghstack dependencies: #120682
2024-02-29 07:12:27 +00:00
David Berard
d6c202975c Move attention kernels from meta_registrations to fake_impls (#120682)
This PR is mostly just code movement to make the code review easier - AFAIK it should not change any functionality. The final goal is to remove the xfails for some of the test_fake opinfos for these ops. The opinfos are failing because the outputs can have mixed devices - we need to move them to fake_impls first before we can support mixed device returns.

This PR:
* Move the `_meta_registrations.py` implementations to `fake_impls.py`
* Change the function signature from taking explicit named variables to taking `{args, kwargs}` and normalizing them
* Wrap all the returned tensors in FakeTensors

Tests: relying on opinfos. I also checked `test_fake_*` for these tests (by removing x-fails and patching things until they passed) to verify general correctness.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120682
Approved by: https://github.com/drisspg
2024-02-28 21:49:13 +00:00
Guilherme Leobas
491c2b4665 Let torch dynamo inline torch.func.grad (#118407)
When dynamo sees torch.func.grad, it tries to inline all frames related
to.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118407
Approved by: https://github.com/zou3519
2024-02-28 20:05:00 +00:00
Isuru Fernando
b7df3bba62 add decomposition for frexp (#119217)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119217
Approved by: https://github.com/peterbell10
ghstack dependencies: #119284, #120027
2024-02-23 21:52:42 +00:00
Tugsbayasgalan (Tugsuu) Manlaibaatar
8646872ff7 Make balance_gradient preserved in export (#120332)
Summary: We can only not-decompose CompositeImplicit functional custom ops. From the looks of the implementation, this op looks functional. So the fix is just fixing the schema.

Test Plan: CI

Differential Revision: D54019265

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120332
Approved by: https://github.com/zhxchen17
2024-02-23 19:14:08 +00:00
Thiago Crepaldi
761fa5d6ec Add FakeTensor support to torch._utils._rebuild_tensor (#108186)
There are two scenarios:

* Scenario 1: The checkpoint was saved with pytorch < 1.6
* Scenario 2: The checkpoint was saved with pytorch >= 1.6

Repro Scenario 1:

```python
from torch._subclasses import fake_tensor
import transformers

fake_mode = fake_tensor.FakeTensorMode()
with fake_mode:
    fake_model = transformers.AutoModel.from_pretrained("sshleifer/tiny-gpt2")
```

Error:

```bash
Some weights of the model checkpoint at sshleifer/tiny-gpt2 were not used when initializing GPT2Model: ['lm_head.weight']
- This IS expected if you are initializing GPT2Model from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing GPT2Model from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /opt/conda/envs/ptca/lib/python3.8/site-packages/transformers/modeling_utils.py:463 in           │
│ load_state_dict                                                                                  │
│                                                                                                  │
│    460 │   │   │   )                                                                             │
│    461 │   │   return safe_load_file(checkpoint_file)                                            │
│    462 │   try:                                                                                  │
│ ❱  463 │   │   return torch.load(checkpoint_file, map_location="cpu")                            │
│    464 │   except Exception as e:                                                                │
│    465 │   │   try:                                                                              │
│    466 │   │   │   with open(checkpoint_file) as f:                                              │
│                                                                                                  │
│ /opt/pytorch/torch/serialization.py:1030 in load                                                 │
│                                                                                                  │
│   1027 │   │   │   │   return _legacy_load(opened_file, map_location, _weights_only_unpickler,   │
│   1028 │   │   │   except RuntimeError as e:                                                     │
│   1029 │   │   │   │   raise pickle.UnpicklingError(UNSAFE_MESSAGE + str(e)) from None           │
│ ❱ 1030 │   │   return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args  │
│   1031                                                                                           │
│   1032                                                                                           │
│   1033 # Register pickling support for layout instances such as                                  │
│                                                                                                  │
│ /opt/pytorch/torch/serialization.py:1258 in _legacy_load                                         │
│                                                                                                  │
│   1255 │   _sys_info = pickle_module.load(f, **pickle_load_args)                                 │
│   1256 │   unpickler = UnpicklerWrapper(f, **pickle_load_args)                                   │
│   1257 │   unpickler.persistent_load = persistent_load                                           │
│ ❱ 1258 │   result = unpickler.load()                                                             │
│   1259 │                                                                                         │
│   1260 │   deserialized_storage_keys = pickle_module.load(f, **pickle_load_args)                 │
│   1261                                                                                           │
│                                                                                                  │
│ /opt/pytorch/torch/_utils.py:201 in _rebuild_tensor_v2                                           │
│                                                                                                  │
│   198 def _rebuild_tensor_v2(                                                                    │
│   199 │   storage, storage_offset, size, stride, requires_grad, backward_hooks, metadata=None    │
│   200 ):                                                                                         │
│ ❱ 201 │   tensor = _rebuild_tensor(storage, storage_offset, size, stride)                        │
│   202 │   tensor.requires_grad = requires_grad                                                   │
│   203 │   if metadata:                                                                           │
│   204 │   │   set_tensor_metadata(tensor, metadata)                                              │
│                                                                                                  │
│ /opt/pytorch/torch/_utils.py:180 in _rebuild_tensor                                              │
│                                                                                                  │
│   177 def _rebuild_tensor(storage, storage_offset, size, stride):                                │
│   178 │   # first construct a tensor with the correct dtype/device                               │
│   179 │   t = torch.tensor([], dtype=storage.dtype, device=storage._untyped_storage.device)      │
│ ❱ 180 │   return t.set_(storage._untyped_storage, storage_offset, size, stride)                  │
│   181                                                                                            │
│   182                                                                                            │
│   183 def get_tensor_metadata(tensor):                                                           │
│                                                                                                  │
│ /opt/pytorch/torch/utils/_stats.py:20 in wrapper                                                 │
│                                                                                                  │
│   17 │   │   if fn.__qualname__ not in simple_call_counter:                                      │
│   18 │   │   │   simple_call_counter[fn.__qualname__] = 0                                        │
│   19 │   │   simple_call_counter[fn.__qualname__] = simple_call_counter[fn.__qualname__] + 1     │
│ ❱ 20 │   │   return fn(*args, **kwargs)                                                          │
│   21 │   return wrapper                                                                          │
│   22                                                                                             │
│                                                                                                  │
│ /opt/pytorch/torch/_subclasses/fake_tensor.py:1160 in __torch_dispatch__                         │
│                                                                                                  │
│   1157 │   def __torch_dispatch__(self, func, types, args=(), kwargs=None):                      │
│   1158 │   │   assert self not in _get_current_dispatch_mode_stack(), func                       │
│   1159 │   │   try:                                                                              │
│ ❱ 1160 │   │   │   return self.dispatch(func, types, args, kwargs)                               │
│   1161 │   │   except TypeError:                                                                 │
│   1162 │   │   │   log.exception("fake tensor raised TypeError")                                 │
│   1163 │   │   │   raise                                                                         │
│                                                                                                  │
│ /opt/pytorch/torch/_subclasses/fake_tensor.py:1318 in dispatch                                   │
│                                                                                                  │
│   1315 │   │                                                                                     │
│   1316 │   │   # we are falling through to running non constant tensors, any input constant tha  │
│   1317 │   │   # is written to must be invalidated                                               │
│ ❱ 1318 │   │   self.invalidate_written_to_constants(func, flat_arg_fake_tensors, args, kwargs)   │
│   1319 │   │                                                                                     │
│   1320 │   │   # Try for fastpath                                                                │
│   1321 │   │   if has_symbolic_sizes:                                                            │
│                                                                                                  │
│ /opt/pytorch/torch/_subclasses/fake_tensor.py:1557 in invalidate_written_to_constants            │
│                                                                                                  │
│   1554 │   │   any_constant = any(e.constant is not None for e in flat_arg_fake_tensors)         │
│   1555 │   │   if any_constant and get_schema_info(func).is_mutable():                           │
│   1556 │   │   │   schema_info = get_schema_info(func)                                           │
│ ❱ 1557 │   │   │   _, new_kwargs = normalize_function(                                           │
│   1558 │   │   │   │   func, args=args, kwargs=kwargs, normalize_to_only_use_kwargs=True         │
│   1559 │   │   │   )                                                                             │
│   1560 │   │   │   for k, v in new_kwargs.items():                                               │
│                                                                                                  │
│ /opt/pytorch/torch/fx/operator_schemas.py:297 in normalize_function                              │
│                                                                                                  │
│   294 │   │   new_args_and_kwargs = _args_kwargs_to_normalized_args_kwargs(sig, args, kwargs,    │
│   295 │   else:                                                                                  │
│   296 │   │   assert callable(target)                                                            │
│ ❱ 297 │   │   torch_op_schemas = get_signature_for_torch_op(target)                              │
│   298 │   │   matched_schemas = []                                                               │
│   299 │   │   if torch_op_schemas:                                                               │
│   300 │   │   │   # Iterate through all of the schema until we find one that matches             │
│                                                                                                  │
│ /opt/pytorch/torch/fx/operator_schemas.py:167 in get_signature_for_torch_op                      │
│                                                                                                  │
│   164 │   │   │   return (None, None) if return_schemas else None                                │
│   165 │   │   schemas = torch._C._jit_get_schemas_for_operator(aten_fn)                          │
│   166 │                                                                                          │
│ ❱ 167 │   signatures = [_torchscript_schema_to_signature(schema) for schema in schemas]          │
│   168 │   return (signatures, schemas) if return_schemas else signatures                         │
│   169                                                                                            │
│   170 @compatibility(is_backward_compatible=False)                                               │
│                                                                                                  │
│ /opt/pytorch/torch/fx/operator_schemas.py:167 in <listcomp>                                      │
│                                                                                                  │
│   164 │   │   │   return (None, None) if return_schemas else None                                │
│   165 │   │   schemas = torch._C._jit_get_schemas_for_operator(aten_fn)                          │
│   166 │                                                                                          │
│ ❱ 167 │   signatures = [_torchscript_schema_to_signature(schema) for schema in schemas]          │
│   168 │   return (signatures, schemas) if return_schemas else signatures                         │
│   169                                                                                            │
│   170 @compatibility(is_backward_compatible=False)                                               │
│                                                                                                  │
│ /opt/pytorch/torch/fx/operator_schemas.py:70 in _torchscript_schema_to_signature                 │
│                                                                                                  │
│    67 │   from inspect import Parameter                                                          │
│    68 │   parameters : List[Parameter] = []                                                      │
│    69 │   for arg in ts_schema.arguments:                                                        │
│ ❱  70 │   │   arg_type = _torchscript_type_to_python_type(arg.type)                              │
│    71 │   │   default = arg.default_value if arg.has_default_value() else Parameter.empty        │
│    72 │   │   # TODO: Figure out if this is safe. It seems like when generating the type signa   │
│    73 │   │   # PythonArgParser, we emit signatures with `input` instead of `self` as the firs   │
│                                                                                                  │
│ /opt/pytorch/torch/fx/operator_schemas.py:64 in _torchscript_type_to_python_type                 │
│                                                                                                  │
│    61 │   eval'ing the annotation_str. _type_eval_globals sets up expressions                    │
│    62 │   like "List" and "Future" to map to actual types (typing.List and jit.Future)           │
│    63 │   """                                                                                    │
│ ❱  64 │   return eval(ts_type.annotation_str, _type_eval_globals)                                │
│    65                                                                                            │
│    66 def _torchscript_schema_to_signature(ts_schema : torch._C.FunctionSchema) -> inspect.Sig   │
│    67 │   from inspect import Parameter                                                          │
│ <string>:1 in <module>                                                                           │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
NameError: name 'Storage' is not defined

During handling of the above exception, another exception occurred:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /opt/conda/envs/ptca/lib/python3.8/site-packages/transformers/modeling_utils.py:467 in           │
│ load_state_dict                                                                                  │
│                                                                                                  │
│    464 │   except Exception as e:                                                                │
│    465 │   │   try:                                                                              │
│    466 │   │   │   with open(checkpoint_file) as f:                                              │
│ ❱  467 │   │   │   │   if f.read(7) == "version":                                                │
│    468 │   │   │   │   │   raise OSError(                                                        │
│    469 │   │   │   │   │   │   "You seem to have cloned a repository without having git-lfs ins  │
│    470 │   │   │   │   │   │   "git-lfs and run `git lfs install` followed by `git lfs pull` in  │
│                                                                                                  │
│ /opt/conda/envs/ptca/lib/python3.8/codecs.py:322 in decode                                       │
│                                                                                                  │
│    319 │   def decode(self, input, final=False):                                                 │
│    320 │   │   # decode input (taking the buffer into account)                                   │
│    321 │   │   data = self.buffer + input                                                        │
│ ❱  322 │   │   (result, consumed) = self._buffer_decode(data, self.errors, final)                │
│    323 │   │   # keep undecoded input until the next call                                        │
│    324 │   │   self.buffer = data[consumed:]                                                     │
│    325 │   │   return result                                                                     │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

During handling of the above exception, another exception occurred:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /opt/pytorch/bug_repro.py:16 in <module>                                                         │
│                                                                                                  │
│   13 fake_model = transformers.AutoModel.from_pretrained("sshleifer/tiny-gpt2")                  │
│   14 assert fake_model is not None                                                               │
│   15 with fake_mode:                                                                             │
│ ❱ 16 │   fake_model = transformers.AutoModel.from_pretrained("sshleifer/tiny-gpt2")  # raises    │
│                                                                                                  │
│ /opt/conda/envs/ptca/lib/python3.8/site-packages/transformers/models/auto/auto_factory.py:484 in │
│ from_pretrained                                                                                  │
│                                                                                                  │
│   481 │   │   │   )                                                                              │
│   482 │   │   elif type(config) in cls._model_mapping.keys():                                    │
│   483 │   │   │   model_class = _get_model_class(config, cls._model_mapping)                     │
│ ❱ 484 │   │   │   return model_class.from_pretrained(                                            │
│   485 │   │   │   │   pretrained_model_name_or_path, *model_args, config=config, **hub_kwargs,   │
│   486 │   │   │   )                                                                              │
│   487 │   │   raise ValueError(                                                                  │
│                                                                                                  │
│ /opt/conda/envs/ptca/lib/python3.8/site-packages/transformers/modeling_utils.py:2604 in          │
│ from_pretrained                                                                                  │
│                                                                                                  │
│   2601 │   │   if from_pt:                                                                       │
│   2602 │   │   │   if not is_sharded and state_dict is None:                                     │
│   2603 │   │   │   │   # Time to load the checkpoint                                             │
│ ❱ 2604 │   │   │   │   state_dict = load_state_dict(resolved_archive_file)                       │
│   2605 │   │   │                                                                                 │
│   2606 │   │   │   # set dtype to instantiate the model under:                                   │
│   2607 │   │   │   # 1. If torch_dtype is not None, we use that dtype                            │
│                                                                                                  │
│ /opt/conda/envs/ptca/lib/python3.8/site-packages/transformers/modeling_utils.py:479 in           │
│ load_state_dict                                                                                  │
│                                                                                                  │
│    476 │   │   │   │   │   │   "model. Make sure you have saved the model properly."             │
│    477 │   │   │   │   │   ) from e                                                              │
│    478 │   │   except (UnicodeDecodeError, ValueError):                                          │
│ ❱  479 │   │   │   raise OSError(                                                                │
│    480 │   │   │   │   f"Unable to load weights from pytorch checkpoint file for '{checkpoint_f  │
│    481 │   │   │   │   f"at '{checkpoint_file}'. "                                               │
│    482 │   │   │   │   "If you tried to load a PyTorch model from a TF 2.0 checkpoint, please s  │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
OSError: Unable to load weights from pytorch checkpoint file for '/root/.cache/huggingface/hub/models--sshleifer--tiny-gpt2/snapshots/5f91d94bd9cd7190a9f3216ff93cd1dd95f2c7be/pytorch_model.bin' at
'/root/.cache/huggingface/hub/models--sshleifer--tiny-gpt2/snapshots/5f91d94bd9cd7190a9f3216ff93cd1dd95f2c7be/pytorch_model.bin'. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set
from_tf=True.
```

Repro scenario 2:

```python
import tempfile
import torch
from torch._subclasses import fake_tensor

class TheModelClass(torch.nn.Module):
    def __init__(self):
        super(TheModelClass, self).__init__()
        self.fc1 = torch.nn.Linear(5, 10)

    def forward(self, x):
        return self.fc1(x)

with tempfile.NamedTemporaryFile() as state_dict_file:
    # Create state_dict to be loaded later
    model = TheModelClass()
    torch.save(model.state_dict(), state_dict_file.name)

    fake_mode = fake_tensor.FakeTensorMode()
    with fake_mode:
        # This is where the bug is triggered
        state_dict = torch.load(state_dict_file.name)
```

Error:

```bash
Traceback (most recent call last):
  File "issue_gh_torch_105077.py", line 22, in <module>
    state_dict = torch.load(state_dict_file.name)
  File "/opt/pytorch/torch/serialization.py", line 1014, in load
    return _load(opened_zipfile,
  File "/opt/pytorch/torch/serialization.py", line 1422, in _load
    result = unpickler.load()
  File "/opt/pytorch/torch/_utils.py", line 205, in _rebuild_tensor_v2
    tensor = _rebuild_tensor(storage, storage_offset, size, stride)
  File "/opt/pytorch/torch/_utils.py", line 184, in _rebuild_tensor
    return t.set_(storage._untyped_storage, storage_offset, size, stride)
  File "/opt/pytorch/torch/utils/_stats.py", line 20, in wrapper
    return fn(*args, **kwargs)
  File "/opt/pytorch/torch/_subclasses/fake_tensor.py", line 1288, in __torch_dispatch__
    return self.dispatch(func, types, args, kwargs)
  File "/opt/pytorch/torch/_subclasses/fake_tensor.py", line 1468, in dispatch
    self.invalidate_written_to_constants(func, flat_arg_fake_tensors, args, kwargs)
  File "/opt/pytorch/torch/_subclasses/fake_tensor.py", line 1733, in invalidate_written_to_constants
    _, new_kwargs = normalize_function(
  File "/opt/pytorch/torch/fx/operator_schemas.py", line 297, in normalize_function
    torch_op_schemas = get_signature_for_torch_op(target)
  File "/opt/pytorch/torch/fx/operator_schemas.py", line 167, in get_signature_for_torch_op
    signatures = [_torchscript_schema_to_signature(schema) for schema in schemas]
  File "/opt/pytorch/torch/fx/operator_schemas.py", line 167, in <listcomp>
    signatures = [_torchscript_schema_to_signature(schema) for schema in schemas]
  File "/opt/pytorch/torch/fx/operator_schemas.py", line 70, in _torchscript_schema_to_signature
    arg_type = _torchscript_type_to_python_type(arg.type)
  File "/opt/pytorch/torch/fx/operator_schemas.py", line 64, in _torchscript_type_to_python_type
    return eval(ts_type.annotation_str, _type_eval_globals)
  File "<string>", line 1, in <module>
NameError: name 'Storage' is not defined
```

This PR adds the ability to create fake tensors during torch.load (when fake mode is active) by changing the storage's device to 'meta'.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108186
Approved by: https://github.com/ezyang, https://github.com/atalman
2024-02-16 23:42:50 +00:00
Sam Larsen
3e5e8590f4 Account for inference mode in FakeTensor cache (#119963)
Summary: an fbcode test exposed a shortcoming where we serve a FakeTensor from the cache with the wrong inference_mode. Take the current mode into account in the cache key so we only serve entries from the same mode we're in currently

Test Plan: New unit test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119963
Approved by: https://github.com/eellison
2024-02-16 02:53:33 +00:00
Edward Z. Yang
3f0fd36835 Introduce size oblivious guards (#118579)
Fixes https://github.com/pytorch/pytorch/issues/117361

The implementation here slightly diverges from what was proposed in the issue, so I will recap what this PR is doing here. Today, when doing computations involving size-like unbacked SymInts, we assume for all operations that the compile time range of the integer is `[2, inf]`, even though at runtime we also accept zero and one.

This PR removes the carte blanche assumption, and instead does the analysis in a much more limited and controlled fashion: only for guards which we have designated as "size oblivious" are we willing to do the analysis under the assumption that the range of all size-like unbacked SymInts is `[2, inf]`; otherwise, we will faithfully only do analysis with `[0, inf]` (or whatever the user provided) bounds.

The infra pieces of this PR are:

* Remove runtime_var_to_range from torch/fx/experimental/symbolic_shapes.py; modify `_constrain_range_for_size` to refine the range without clamping min to 2, and instead add the symbol to a `size_like` set in the ShapeEnv
* When evaluating an expression, if the expression is requested to be evaluated in a `size_oblivious` way, we attempt to statically compute the value of the expression with the assumption that all symbols in `size_like` are updated to assume that they are `>= 2`.
* Add Python and C++ APIs for guarding on a SymBool in a size-oblivious way. In C++, I also need to add some helpers for performing symbolic comparisons, since the stock comparisons immediately specialize in the "normal" way.

The rest of the changes of the PR are marking various spots in PyTorch framework code as size oblivious, based on what our current test suite exercises.

As you review the places where we have marked things as size oblivious, it may become clear why I ended up not opting for the "designate a branch as the default branch when it's not statically obvious which way to go": for some of the conditions, this answer is rather non-obvious. I think potentially there is another refinement on top of this PR, which is something like "I don't care if you can't figure it out with ValueRange analysis, go down this path anyway if there are unbacked sizes involved." But even if we add this API, I think we are obligated to attempt the ValueRange analysis first, since it can lead to better outcomes sometimes (e.g., we are able to figure out that something is contiguous no matter what the unbacked size is.)

When is it permissible to mark something as size oblivious? Heuristically, it is OK anywhere in framework code if it gets you past a guard on unbacked SymInt problem. It is somewhat difficult to provide a true semantic answer, however. In particular, these annotations don't have any observational equivalence guarantee; for example, if I have `torch.empty(u0, 1).squeeze()`, we will always produce a `[u0]` size tensor, even though if `u0 == 1` PyTorch will actually produce a `[]` size tensor. The argument that I gave to Lezcano is that we are in fact defining an alternate semantics for a "special" size = 0, 1, for which we have these alternate eager mode semantics. In particular, suppose that we have a constant `special1` which semantically denotes 1, but triggers alternate handling rules. We would define `torch.empty(special1, 1).squeeze()` to always produce a `[special1]` size tensor, making its semantics coincide with unbacked SymInt semantics. In this model, the decision to designate guards as size oblivious is simply a user API question: you put them where ever you need some handling for special1! As we conservatively error out whenever it is not obvious what `special1` semantics should be, it is always valid to expand these semantics to cover more cases (although you can always choose the wrong semantics!)

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118579
Approved by: https://github.com/eellison, https://github.com/lezcano
2024-02-06 19:45:32 +00:00
Edward Z. Yang
dc4779b010 Split out fake_impls from fake_tensor (#118878)
The motivation is fake_tensor is marked as an uninteresting file for the purposes of backtraces, but operator implementations in fake tensor are interesting and I do want them reported.

How did I decide whether or not to move helper functions or not? It was kind of random, but if they weren't used in fake tensor generally I moved them over.

There are no functional code changes, so you only need to review the import changes.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118878
Approved by: https://github.com/eellison
2024-02-01 23:50:56 +00:00
Edward Z. Yang
f63dc9a21d s/DIRECLTY/DIRECTLY/ (#118877)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118877
Approved by: https://github.com/albanD
2024-02-01 20:25:58 +00:00
Aaron Gokaslan
1562dae62c [BE]: Apply RUF025 dict.fromkeys preview rule (#118637)
Simplifies and optimizes dict construction using the `fromkeys` classmethod ctor. This also makes it really obvious when all the keys will have the same static value, which could be a bug if unintentional. It is also significantly faster than using a dict comprehension. The rule is in preview, but I am adding a forward fix for when it becomes stable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118637
Approved by: https://github.com/albanD
2024-01-30 20:46:54 +00:00
Edward Z. Yang
9bce208dfb Replace follow_imports = silent with normal (#118414)
This is a lot of files changed! Don't panic! Here's how it works:

* Previously, we set `follow_imports = silent` for our mypy.ini configuration. Per https://mypy.readthedocs.io/en/stable/running_mypy.html#follow-imports, what this does is whenever we have an import to a module which is not listed as a file to be typechecked in mypy, we typecheck it as normal but suppress all errors that occurred in that file.
* When mypy is run inside lintrunner, the list of files is precisely the files covered by the glob in lintrunner.toml, but with files in excludes excluded.
* The top-level directive `# mypy: ignore-errors` instructs mypy to typecheck the file as normal, but ignore all errors.
* Therefore, it should be equivalent to set `follow_imports = normal`, if we put `# mypy: ignore-errors` on all files that were previously excluded from the file list.
* Having done this, we can remove the exclude list from .lintrunner.toml, since excluding a file from typechecking is baked into the files themselves.
* torch/_dynamo and torch/_inductor were previously in the exclude list, because they were covered by MYPYINDUCTOR. It is not OK to mark these as `# mypy: ignore-errors` as this will impede typechecking on the alternate configuration. So they are temporarily being checked twice, but I am suppressing the errors in these files as the configurations are not quite the same. I plan to unify the configurations so this is only a temporary state.
* There were some straggler type errors after these changes somehow, so I fixed them as needed. There weren't that many.

In the future, to start type checking a file, just remove the ignore-errors directive from the top of the file.

The codemod was done with this script authored by GPT-4:

```
import glob

exclude_patterns = [
    ...
]

for pattern in exclude_patterns:
    for filepath in glob.glob(pattern, recursive=True):
        if filepath.endswith('.py'):
            with open(filepath, 'r+') as f:
                content = f.read()
                f.seek(0, 0)
                f.write('# mypy: ignore-errors\n\n' + content)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118414
Approved by: https://github.com/thiagocrepaldi, https://github.com/albanD
2024-01-27 02:44:11 +00:00
Sam Larsen
4e45d791e7 Remove set_ exclusion in FakeTensor dispatch cache (#118154)
Summary: Now that set_ is marked as a view op, this special case is no longer necessary

Test Plan: CI exposed the need for this special case in the first place, so I think we can just rely on the existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118154
Approved by: https://github.com/bdhirsh
2024-01-25 21:54:36 +00:00
Tugsbayasgalan Manlaibaatar
fa1e89b337 Ban mutation on dropout outputs in export (#117879)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117879
Approved by: https://github.com/ezyang
ghstack dependencies: #117811
2024-01-21 04:53:40 +00:00
Sam Larsen
208e64a9ba Initial implementation of FakeTensor caching (#113873)
Summary: Cache the result of FakeTensor dispatch and skip re-evaluation on cache hits.

Test Plan: New unit tests. Caching is enabled in this diff, so all existing tests exercise the cache as well.

Differential Revision: [D52841637](https://our.internmc.facebook.com/intern/diff/D52841637)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113873
Approved by: https://github.com/eellison
2024-01-17 20:38:54 +00:00
Tugsbayasgalan Manlaibaatar
81f98f1082 Experimental non-strict mode (#114658)
This is proof-of-concept implementation of how people can use a marker `mark_strict` to enable torchdynamo while exporting under non-strict mode. The main idea is that `mark_strict` will turn into an HOO which then utilizes dynamo to do correctness analysis in the same way how torch.cond works today. There are some notable limitations:
1. This API is not meant for public use yet
2. Strict region can't work with arbitrary container inputs
3. We don't preserve `nn_module_stack` and other node metadata for the strict region.
4. strict_mode HOO will show up in the final graph. This is undesirable in the long term, but for short term experiments, it should be good enough. Will fix this in the follow up PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114658
Approved by: https://github.com/ydwu4
2024-01-04 12:24:58 +00:00
Tugsbayasgalan Manlaibaatar
dfc898ede4 Don't decompose functional ops in predispatch functionalization (#116383)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116383
Approved by: https://github.com/bdhirsh
ghstack dependencies: #115188, #115210
2023-12-28 11:54:04 +00:00
Tugsbayasgalan Manlaibaatar
76b1d44d57 pre_dispatch aot_export (#115188)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115188
Approved by: https://github.com/bdhirsh
2023-12-25 04:51:21 +00:00
Jason Ansel
dbbe8485b4 Fake Tensor refactors part 2 (#116345)
This should help trace time a bit.
This refactors `op_implementations` (which requires O(n) checks per op) to mostly use a dict with O(1) cost per op.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116345
Approved by: https://github.com/yanboliang
2023-12-24 04:54:50 +00:00
Jason Ansel
4912922297 Fake Tensor refactors part 1 (#116344)
These are mostly small performance optimizations to move constant list construction into global scope and replace O(n) `x in list` checks with O(1) `x in dict` checks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116344
Approved by: https://github.com/yanboliang
2023-12-23 08:38:26 +00:00
PyTorch MergeBot
0567f71ac6 Revert " pre_dispatch aot_export (#115188)"
This reverts commit a267d67350.

Reverted https://github.com/pytorch/pytorch/pull/115188 on behalf of https://github.com/jeanschmidt due to sadly, it is required to revert this commit in order to revert https://github.com/pytorch/pytorch/pull/115454 ([comment](https://github.com/pytorch/pytorch/pull/115188#issuecomment-1866310014))
2023-12-21 14:03:18 +00:00
Tugsbayasgalan Manlaibaatar
a267d67350 pre_dispatch aot_export (#115188)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115188
Approved by: https://github.com/bdhirsh
2023-12-20 21:36:25 +00:00
Joel Schlosser
29b198dcf8 Add markDynamoStrictTest to NT tests (#116111)
Decorates all NT tests with `@markDynamoStrictTest` to ensure we get the correct signal. Adds xfails where needed to get things passing.

Includes a fix in meta_utils.py for a bug that was breaking several python 3.11 tests. In particular, a dense tensor graph input that is a view of a strided NT would slip past Dynamo's check and break in meta-ification.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116111
Approved by: https://github.com/soulitzer, https://github.com/zou3519
ghstack dependencies: #115192
2023-12-20 20:13:20 +00:00
Guilherme Leobas
1be6a070bc Add support for torch.cond in vmap (#114523)
Fixes: https://github.com/pytorch/pytorch/issues/114136

Patch enables conversion of a BatchedTensor into FakeTensor and write
torch.cond vmap support using torch.where

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114523
Approved by: https://github.com/zou3519
2023-12-20 19:54:38 +00:00
Elias Ellison
6ffe1da375 Add support for multi device foreach ops (#116064)
Fix for https://github.com/pytorch/pytorch/issues/102023

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116064
Approved by: https://github.com/mlazos
2023-12-20 04:19:40 +00:00
Tugsbayasgalan Manlaibaatar
d85314c95c Support Predispatch functionalization (#113728)
In this PR, we are implementing Functionalization on pre-dispatch graph. Today, every dispatch key except for Dispatchkey.Python has a dedicated mode stack in python. PreDispatch tracing relies on this behaviour by pushing ProxyTorchDispatchMode to Dispatchkey.PreDispatch mode stack and handle the dispatching logic in python. To make pre-dispatch functionalization work, we now need to push FunctionalTensorMode on DispatchKey.PreDispatch mode stack and make sure it runs before ProxyTorchDispatchMode. (this is very similar to how post-dispatch tracing work). Here are some design decisions we made for this flow to work:

1. FunctionalTensorMode internally calls C++ functionalize key. Since C++ functionalization goes after PreDispatch, if we are not careful, we will keep re-entering into PreDispatch key. We solve this by directly dispatching to C++ Functionalize key.

2. We delete mode_stack_per_key logic because the only realistic time it is exercised is for PreDispatch and it is in general not safe to have a plain list because FunctionalTensorMode and ProxyTorchDispatchMode ordering matter and it is hard to enforce it on plain list. Instead, now we have a private class that tracks PreDispatch mode stack.

3.  We will still run CompositeImplicitAutograd decomps in this PR, and disable this logic later as a followup.

Some missing bits after this PR:
1. Preserving autograd ops in a functional form. Right now they still show up in the graph but in a "non-functional" way.
2. Turn off CompositeImplicitAutograd decomps
3. Functionalizing HOO

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113728
Approved by: https://github.com/bdhirsh
2023-12-19 20:28:35 +00:00
rzou
99257002fa Extend auto_functionalized to support ops that return Tensors (#115135)
We can auto-functionalize operators that mutate their inputs as long as
the outputs of the operator do not alias their inputs. The user needs
to provide an abstract impl for the operator if it has non-trivial
returns.
- We update can_auto_functionalize(op) to include ops that return (but
  do not alias) Tensors
- We update auto_functionalized(op, mutated_args_names, kwargs) to
  return (out, mutated_args), where `out = op(**kwargs)` and
  `mutated_args` are the new values of the inputs that would have been
  mutated.

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115135
Approved by: https://github.com/bdhirsh
ghstack dependencies: #114955, #114956, #115134
2023-12-05 22:43:06 +00:00
rzou
d0aad93249 Refactor can_auto_functionalize (#115134)
In preparation for the next PR up in the stack, which is going to update
"can_auto_functionalize" to support more operators than just ones that
return nothing. We are unable to auto-generate FakeTensor kernels for
operators that do not return nothing, but we are able to generate
functionalization kernels for operators that return something.

Test Plan:
Existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115134
Approved by: https://github.com/bdhirsh
ghstack dependencies: #114955, #114956
2023-12-05 22:43:06 +00:00
Joel Schlosser
22704426c3 Expand dynamic dims support for traceable subclasses (#114311)
Continuation of #112185, following the design in this [doc](https://docs.google.com/document/d/1ipSxcTzEMMOAPvxP-YJlD5JBZZmIGgh8Q34ixtOUCRo).

Summary:
* Introduce `SubclassSymbolicPolicy` containing separate dynamic dim / constraint policies for the outer and inner tensors
    * Expand the automatic dynamic algorithm to recurse into inner tensors and produce one of these for a subclass instance
    * Maintain legacy behavior for subclasses by recursively calling `mark_dynamic()` on inner tensors *of the same dim as outer* when `mark_dynamic(outer, ...)` is called
    * Addresses this: 6a86cf00ad/torch/_dynamo/variables/builder.py (L1750)
* Add `outer_size` and `outer_stride` arguments to `__tensor_unflatten__()` so that you can find out what symbols were allocated for the outer size / stride (you are expected to return a tensor that compares equal to the outer symbols)
    * Signatures now:
    ```python
    # attrs is a list of inner tensor attributes on x; inner_tensor = getattr(x, attr)
    # ctx is anything useful for rebuilding the class we want to guard on
    attrs, ctx = x.__tensor_flatten__()
    ...
    # inner_tensors is a dict of {attr -> tensor}
    # ctx is taken unmodified from flattening and (eventually) guarded on
    # outer_size is the expected size of the output; possibly symbolic
    # outer_stride is the expected strides of the output; possibly symbolic
    y = MySubclass.__tensor_unflatten__(inner_tensors, ctx, outer_size, outer_stride)

    # at the __tensor_unflatten__() call-site in PT2, we assert y.shape == outer_size and y.stride() == outer_stride
    # the assert simplifies symbols when there are relationships between outer and inner symbols
    ```
    * Size info needed for `NestedTensor` at least, stride info needed for `DTensor` at least
    * Punting on `outer_storage_offset` because storage_offset handling is horribly broken in PT2 right now
* ~~Add new `__tensor_mark_dynamic__()` to allow overriding the behavior of mark_dynamic on a per-subclass basis~~ (booted to future work)
* ~~Add guards for tensor subclasses by calling `__tensor_flatten__()` in the guard to test equality on `ctx`~~
    * Now handled in #114469
* Next PR: add TENSOR_MATCH guards on inner tensors

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114311
Approved by: https://github.com/ezyang, https://github.com/drisspg, https://github.com/voznesenskym, https://github.com/bdhirsh
2023-12-05 21:09:25 +00:00
rzou
cfa4370c07 torch.compile should auto-functionalize certain mutable ops (#114955)
Users may wish to torch.compile custom ops that mutate their inputs
and return nothing (this is a common class of operators).
torch.compile will automatically support this op without anyone needing
to provide a functionalization kernel for it. Here's how.

Let's say we have a hypothetical mylib::sin_(Tensor(a!) x) -> ()
op. First, when FakeTensor sees this op, it can just return None.
This is the case because custom ops are not allowed to mutate input
metadata, so the FakeTensor rule for one that returns nothing is trivial.

Next, when Python FunctionalTensor sees the op, it will functionalize
it by emitting a call to an auto_functionalize(op, ["x"], {"x": ...})
HOP and replacing the mutated inputs with the outputs of this HOP.
This HOP effectively runs the functional version of the op when
called: it clones inputs that will be mutated, runs the op, and
then returns Tensors with the new values.

In the future we can teach Inductor how to do re-inplacing when it sees
this HOP (like how triton kernels do it) but this isn't urgent (and is
more of a performance problem).

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114955
Approved by: https://github.com/bdhirsh
2023-12-05 14:53:08 +00:00
Antoni Viros
d47f715d29 Expose Flash attn to autograd (#114378)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114378
Approved by: https://github.com/drisspg
2023-12-01 23:42:06 +00:00
voznesenskym
081c5b3adc Add Stateful/Stateless symbolic contexts, use fresh fake mode for dynamo backends (#113926) (#114526)
Summary:

The primary problem we are setting out to solve here is fake tensor freshness. Before this PR, fake tensors after dynamo represented fake tensors *at the end* of trace, so subsequent retraces like aot_autograd would start off with fake tensors in the wrong (end result) state, rather than their expected fresh state. The solution here is to start a fresh fake mode, and re-fakify the tensors. The nuance comes from ensuring that symbols are uniformly created for the symbolic sizes and strides of the tensor.

This PR is the result of *a lot* of back and forth with ezyang and eellison. Initially, the first pass at this was not super different from what we have in the PR - the broad strokes were the same:

1) We cache source->symbol in shape_env
2) We pass policy objects around, stored at dynamo fakificaiton time, and reused for later fakification
3) We create a new fake mode for backends
(from https://github.com/pytorch/pytorch/pull/113605/files)

This is ugly, and has some layering violations. We detoured our decision making through a few other alternatives. Immutable/mutable fake tensor mode was the most interesting alternative, https://github.com/pytorch/pytorch/pull/113653, and was struck down on concerns of complexity in fake mode combined with it not covering all edge cases. We also detoured on what to do about tensor memoization returning back potentially different tensors than requested, and if that was an anti pattern (it is) we want to hack in with the symbol cache (we don't).

We went back to the drawing board here, but with a few concessions:
1) the cache for source->symbol must live outside of shape_env, for both lifecycle, and layering reasons
2) A good amount of work needs to be done to pipe policy around fake_mode and meta_utils correctly, to cover all the cases (ezyang did this)

cc penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 aakhundov kadeng

imported-using-ghimport

Test Plan: Imported from OSS

Reviewed By: huydhn, Chillee

Differential Revision: D51566250

Pulled By: voznesenskym

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114526
Approved by: https://github.com/Chillee, https://github.com/huydhn
2023-11-26 23:40:32 +00:00
PyTorch MergeBot
2f3beb715c Revert "Add Stateful/Stateless symbolic contexts, use fresh fake mode for dynamo backends (#113926)"
This reverts commit 2ca1119d53.

Reverted https://github.com/pytorch/pytorch/pull/113926 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/113926#issuecomment-1822713852))
2023-11-22 12:52:33 +00:00
voznesenskym
2ca1119d53 Add Stateful/Stateless symbolic contexts, use fresh fake mode for dynamo backends (#113926)
The primary problem we are setting out to solve here is fake tensor freshness. Before this PR, fake tensors after dynamo represented fake tensors *at the end* of trace, so subsequent retraces like aot_autograd would start off with fake tensors in the wrong (end result) state, rather than their expected fresh state. The solution here is to start a fresh fake mode, and re-fakify the tensors. The nuance comes from ensuring that symbols are uniformly created for the symbolic sizes and strides of the tensor.

This PR is the result of *a lot* of back and forth with @ezyang and @eellison. Initially, the first pass at this was not super different from what we have in the PR - the broad strokes were the same:

1) We cache source->symbol in shape_env
2) We pass policy objects around, stored at dynamo fakificaiton time, and reused for later fakification
3) We create a new fake mode for backends
(from https://github.com/pytorch/pytorch/pull/113605/files)

This is ugly, and has some layering violations. We detoured our decision making through a few other alternatives. Immutable/mutable fake tensor mode was the most interesting alternative, https://github.com/pytorch/pytorch/pull/113653, and was struck down on concerns of complexity in fake mode combined with it not covering all edge cases. We also detoured on what to do about tensor memoization returning back potentially different tensors than requested, and if that was an anti pattern (it is) we want to hack in with the symbol cache (we don't).

We went back to the drawing board here, but with a few concessions:
1) the cache for source->symbol must live outside of shape_env, for both lifecycle, and layering reasons
2) A good amount of work needs to be done to pipe policy around fake_mode and meta_utils correctly, to cover all the cases (@ezyang did this)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113926
Approved by: https://github.com/ezyang, https://github.com/eellison
2023-11-20 23:06:37 +00:00
Edward Z. Yang
edc5ae3113 Allow for calling lift_fresh_copy manually (#113923)
In this case, the input could be fake!  Just treat it normally in that case.

Fixes https://github.com/pytorch/pytorch/issues/113331

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113923
Approved by: https://github.com/eellison, https://github.com/bdhirsh, https://github.com/leslie-fang-intel
2023-11-19 07:13:49 +00:00
Edward Z. Yang
e2b114ab9f [BE] Package dynamic_dims/constraint_dims into CreateSymbolicPolicy (#113802)
This will make it more convenient to propagate more information through
all of these functions in the future (e.g., for storage offset
information.)

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113802
Approved by: https://github.com/davidberard98, https://github.com/voznesenskym
2023-11-17 18:22:46 +00:00
eellison
f8eb46d623 index put device error checking (#113729)
Fix for https://github.com/pytorch/pytorch/issues/101371

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113729
Approved by: https://github.com/bdhirsh
2023-11-16 00:39:04 +00:00
voznesenskym
6435fc17bb Remove ignore_sublcass from FakeTensorMode (#113795)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113795
Approved by: https://github.com/ezyang
2023-11-15 22:30:13 +00:00
voznesenskym
b8b3c26d3d If we re-fakeify a FakeTensor with the same ShapeEnv, preserve symbols (#113651)
Subsumes half of https://github.com/pytorch/pytorch/pull/113605

We support fakeifying an already fake tensor, which will give you a new fake tensor mirroring the same structure as the original fake tensor, which is what is needed by https://github.com/pytorch/pytorch/issues/113643 . However, when this refakeification happens, we will naively reallocate all new sizes for all of the fake tensor. This is the right thing to do if you are re-fakeifying on a fresh ShapeEnv (because you're reparametrizing the sizes or something), but if you have two fake tensor modes which are sharing a shape environment, you would actually rather just reuse the original sizes/strides/offset from the original fake tensor. This ends up being pretty simple. I recommend viewing with whitespace diff turned off.

There's some fuzz around jagged tensor handling; that code is probably not quite right, but I fixed it for this particular case in the most straightforward way.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113651
Approved by: https://github.com/albanD, https://github.com/eellison, https://github.com/bdhirsh
2023-11-15 00:36:04 +00:00
Antoni Viros
1aece432ba Implement narrow from a regular tensor to jagged tensor (#112770)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112770
Approved by: https://github.com/cpuhrsch
2023-11-13 19:09:59 +00:00
drisspg
c46fc46dba expose mem-eff to autograd (#110495)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110495
Approved by: https://github.com/jbschlosser
2023-11-13 17:47:40 +00:00
Peter Bell
718035791d Prefer e.is_number over not e.free_symbols in SymPy (#112688)
We spend somewhere on the order 1% in `sympy.Expr.free_symbols` as it is called millions of times.
Most of the time we actually just want to know "is this a constant", however `e.is_constant()` is
horribly slow. It turns out though that there is another propery `is_number` that does what we want.

> property is_number:
>
> Returns True if self has no free symbols and no undefined functions (AppliedUndef, to be precise). It will be faster
> than if not self.free_symbols, however, since is_number will fail as soon as it hits a free symbol or undefined
> function.

Even further, we also avoid the overhead of building the unnecessary set object.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112688
Approved by: https://github.com/lezcano
2023-11-06 20:05:13 +00:00
Yukio Siraichi
132cb57e47 Skip aliasing correction for lift_fresh. (#112202)
Fix: #111506

This PR skips aliasing correction on `lift_fresh` calls. Reasoning is: although unlifted and lifted tensors are technically aliases, they are from different levels of abstraction (`FunctionalTensorWrapper` and `XLATensor`).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112202
Approved by: https://github.com/bdhirsh
2023-11-03 20:46:30 +00:00
soulitzer
53fff56ab8 Graph break cleanly for test_nestedtensor (#112662)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112662
Approved by: https://github.com/jbschlosser
2023-11-03 07:20:43 +00:00
Edward Z. Yang
a1ab22b81d Reland "Trigger specialization when you call size()/stride() from C++ (#111935)" (#112605)
This reverts commit 22221c6d60.

Differential Revision: [D50886564](https://our.internmc.facebook.com/intern/diff/D50886564)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112605
Approved by: https://github.com/voznesenskym
2023-11-02 13:27:31 +00:00
Edward Z. Yang
793c62b79c Allow binary pointwise operations to cause refinement on unbacked SymInts (#112155)
To do this, there is a little detour to remove hint caching for unbacked
SymInts; now, we just always attempt to update the hint (using
maybe_evaluate_static; this is much better than the replace we were
doing before) if we don't think we know it.

With this change, we now can generally infer that i0 == 1 is false for
a size-like unbacked SymInt.  So if we write the size match /
broadcasting test very carefully (see comment), we will eventually
end up expect_true(sizeA == sizeB), which is good enough to cause
refinement.  Phew!

I think I still want to setup a replacement if you do i0 == s0, but I'm
going to do that in a follow up.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112155
Approved by: https://github.com/aakhundov, https://github.com/voznesenskym
2023-11-01 23:02:17 +00:00
Peter Bell
9e89c36a54 [FakeTensor] Reuse flat_args throughout FakeTensorMode.dispatch (#112418)
This function repeatedly flattens and unflattens the `args, kwargs` pair so we
get a quite significant perf improvement from saving the `flat_args` and
operating directly on those. I see a 15% improvement in dispatch for
`empty_strided`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112418
Approved by: https://github.com/lezcano
2023-11-01 14:37:28 +00:00
Peter Bell
04024926f4 Use pytree.tree_map_ everywhere (#112417)
Wherever we discard the output of `tree_map` it's better to call `tree_map_`
which doesn't unflatten the mapped results and so is a lot cheaper.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112417
Approved by: https://github.com/lezcano
ghstack dependencies: #112391, #112392, #112393, #112394
2023-10-31 15:57:06 +00:00
Peter Bell
66c32d099a Use pytree.arg_tree_leaves everywhere (#112394)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112394
Approved by: https://github.com/lezcano
ghstack dependencies: #112391, #112392, #112393
2023-10-31 15:57:06 +00:00
soulitzer
3b58755c1c Fix FakeTensor tolist when size is not symbolic (#112206)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112206
Approved by: https://github.com/ezyang
ghstack dependencies: #112205
2023-10-30 19:25:10 +00:00
soulitzer
0cda4c8abe Replay view with view_func instead of as_strided in meta_utils for NT (#112205)
Currently meta_utils relies on as_strided when handling the view case (recursively meta-ify the base, and then do as_strided to simulate the view), but NestedTensor does not support as_strided today (though maybe it could?), so what we want to do instead is call Tensor. _view_func. Conveniently,  _view_func IS always available for nested tensors.

A detail to note is that _view_func actually incurs a guard because it needs to perform some metadata checks to make sure the view is still valid. This PR adds Tensor._unsafe_view_func which can avoid that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112205
Approved by: https://github.com/jbschlosser
2023-10-30 19:25:10 +00:00
Peter Bell
bbd5b935e4 Use pytree.tree_leaves everywhere (#112324)
This changes all the instances I could find of `tree_flatten(...)[0]` or
`x, _ = tree_flatten` to use `tree_leaves`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112324
Approved by: https://github.com/lezcano
ghstack dependencies: #112327, #112323
2023-10-30 03:39:04 +00:00
PyTorch MergeBot
22221c6d60 Revert "Trigger specialization when you call size()/stride() from C++ (#111935)"
This reverts commit 5846705e36.

Reverted https://github.com/pytorch/pytorch/pull/111935 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/111935#issuecomment-1782107024))
2023-10-27 00:23:03 +00:00
lezcano
c8a5bb451e Do not import sympy within torch._prims_common (#112034)
This is the first of a few PRs that avoid importing SymPy at import time.
The pitch here is that we (almost!) do not have SymPy on our API, so
this should be feasible.

This should speed-up torch imports by a good 15% as per
https://dev-discuss.pytorch.org/t/delving-into-what-happens-when-you-import-torch/1589

In this PR we just move a few global imports into local imports.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112034
Approved by: https://github.com/ezyang
2023-10-26 12:53:25 +00:00
Jon Chuang
0ed461ae4c [dynamo] Ensure Dynamo uses this graph's fakes for Tensor example_values (#111954)
Fixes https://github.com/pytorch/pytorch/issues/111869, Fixes (detailed list of cases handled): https://github.com/pytorch/pytorch/pull/111913#discussion_r1370267313, fully fixes: https://github.com/pytorch/pytorch/issues/111873

Adds sanity checks ensuring that Dynamo uses this graph's fakes for Tensor `example_values`

Handles the main (and only?) entrypoints for new `FakeTensor`s in a Dynamo graph:
- `wrap_fx_proxy_cls`
- `VariableTracker.wrap_tensor`

Ensures that `get_fake_value` returns a fake except when we know we are going to properly wrap non-fakes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111954
Approved by: https://github.com/ezyang
2023-10-25 23:54:18 +00:00
Edward Z. Yang
5846705e36 Trigger specialization when you call size()/stride() from C++ (#111935)
This should be the last of the "it used to work with static shapes but
it doesn't work with dynamic shapes" hard errors.  Now we will just
specialize if you hit it from C++.

The strategy here is a bit clever.  We shunt the size() call to Python
binding if an error would have occurred.  Importantly, we already have
logic to make sure the newly allocated ints stay live for the duration
of the ArrayRef access.

storage_offset is intentionally omitted because there are some problems
with it.  I will fix them next.

This should let us get rid of the aotautograd_static test configuration.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111935
Approved by: https://github.com/zou3519
2023-10-25 16:17:55 +00:00
soulitzer
73170b23d4 Add compile support for NT unbind (#111531)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111531
Approved by: https://github.com/ezyang
2023-10-23 21:16:20 +00:00
Oguz Ulgen
4e310fd875 [Autograd] Track when mutations are for triton kernels (#111500)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111500
Approved by: https://github.com/bdhirsh
2023-10-19 15:34:34 +00:00
cdzhan
46000bede6 Fix a typo in fake tensor test. (#111193)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111193
Approved by: https://github.com/janeyx99
2023-10-17 03:36:28 +00:00
Brian Hirsh
4cf23c6a61 FunctionalTensor: avoid spurious not_implemented logging during proxy tracing (#111040)
This is kind of hard to test, but I can try to add a test case if requested.

I noticed locally that we now end up logging to the ProxyTensorMode and FakeTensorMode `not_implemented` logs in very simple compile examples: https://github.com/pytorch/pytorch/blob/main/torch/fx/experimental/proxy_tensor.py#L269

It was because `_mirror_autograd_meta_to()` indirectly queries sizes, and since modes have higher priority than subclasses, `aten::sym_sizes()` was getting dispatched to our modes before going to `FunctionalTensor.__torch_dispatch__`.

This works out fine (they return NotImplemented and we eventually get to `FunctionalTensor`) but I figured we want to avoid cluttering up the logs. So I wrapped the calls with `FunctionalTensorMode`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111040
Approved by: https://github.com/ezyang
2023-10-16 15:18:20 +00:00
Yu, Guangye
918054f422 [Inductor] support channel last for xpu conv in inductor layout opt path (#111018)
# Motivation
support xpu channel last for inductor layout optimization path.
Currently, `_conv_determine_backend_memory_format` always returns torch.contiguous_format for XPU conv.

# Solution
Add xpu channel last detection stragey in `determine_backend_memory_format`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111018
Approved by: https://github.com/jansel, https://github.com/eellison, https://github.com/EikanWang
2023-10-12 15:13:50 +00:00
soulitzer
110382bacf Make NestedTensor compilable with eager backend (#109171)
In this PR:
- Adds support for strides for jagged tensor (design doc for this coming soon)
- NestedTensor skips automatic dynamic
- Make use of @bdhirsh's subclass fakification logic by adding the __tensor_{un,}flatten__ functions.
- Additional logic for fakification: since existing subclass fakification logic does not handle the case where the outer tensor has an additional dimension. We insert one-off logic to (1) insert an extra SingletonSymInt onto the fakified NestedTensor. (2) make sure we call track_symint on both the sizes on the inner and outer tensor during guard creation.

Remaining things that are weird:
- Still need to skip some logic in meta utils for some reason (I was going to write this up more, but decided not to since we're not able to do this anyway for a immediate reason: we cannot arbitrarily compare singleton ints. For now I'm just following Brian's advise from [here](https://github.com/pytorch/pytorch/pull/109171#discussion_r1328137070) )

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109171
Approved by: https://github.com/ezyang, https://github.com/bdhirsh
2023-10-11 04:47:10 +00:00
drisspg
e0dbaa04d2 Fix the meta func for mem_eff_backward (#110893)
Fixes #110832

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110893
Approved by: https://github.com/eellison
2023-10-11 02:58:54 +00:00
angelayi
096b14eae8 Fix numel test to be > 2 (#110731)
This makes it consistent with the comment.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110731
Approved by: https://github.com/angelayi
2023-10-07 19:18:59 +00:00
Oguz Ulgen
f04b1a0d27 [AOTInductor] Implement autograd eager backend for native triton kernels (#110403)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110403
Approved by: https://github.com/zou3519, https://github.com/bdhirsh
2023-10-04 17:56:56 +00:00
Brian Hirsh
b457e3f79a Reland attempt 2 of "Update AOTAutograd to use FunctionalTensorMode instead of C++ functionalization (#106406)" (#109906)" (#110079)
The first reland broke internal (failing diff: D49617462).

The major error looks like it's because there's an internal-only higher order op that needs a new functionalization rule. I'm going to land an internal diff for that and confirm tests pass before relanding this PR.

Also confirmed that the issue from https://github.com/pytorch/pytorch/issues/110121 is fixed, and added a test.

This reverts commit 1b90f07f5a.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110079
Approved by: https://github.com/ezyang
2023-10-03 18:50:25 +00:00
Edward Z. Yang
f7c9ef88f5 Add masked_select abstract impl (#110103)
Fixes https://github.com/pytorch/pytorch/issues/109871

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110103
Approved by: https://github.com/bdhirsh
2023-09-27 04:07:58 +00:00
rzou
f8fcc54f70 Add torch.library.impl_abstract (#109912)
Changelog:
- torch.library.impl_abstract optionally accepts a torch.library.Library
  object. If passed in, then the lifetime of the registration is tied to
  the Library object.
- we've also changed torch.library.impl_abstract to work on all
  operators, including overloads.
- we refactored the `torch._custom_ops.*` and `torch._custom_op.*`
  impl_abstract APIs and put them under torch._library. This is the
  final resting place for them. I will follow-up with deleting
  all the `torch._custom_ops.*` stuff later.
- There is a new "SimpleOperatorRegistry" where we actually collect the
  abstract_impl. We will expand this to also hold the other
  torch._custom_ops.* APIs when we move those to torch.library

NB: Previously we had designed
`impl_abstract` assuming a very high-level Python-only custom op API.
We've revisited that since; now, impl_abstract works for all custom ops,
no matter python or C++, no matter the schema. The new refactored design
reflects this better.

Test Plan:
- existing and new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109912
Approved by: https://github.com/ezyang
2023-09-26 01:59:50 +00:00
Moritz Hennen
09c598745c Rename torch._C._TensorBase to TensorBase (#109940)
I have gone ahead and implemented the renaming of the type `torch._C._TensorBase` to a non-private class name `TensorBase`.
The changes also include leaving `torch._C._TensorBase` as an alias to the new type: 70458768fb/torch/csrc/autograd/python_variable.cpp (L2196-L2197) both in the c++ code and in the corresponding `__init__.pyi.in` file:
70458768fb/torch/_C/__init__.pyi.in (L1522)

Fixes #109438

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109940
Approved by: https://github.com/ezyang
2023-09-25 19:10:22 +00:00
Edward Z. Yang
b4ede53776 Use constrain_range_as_size for nonzero/repeat_interleave (#109857)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109857
Approved by: https://github.com/tugsbayasgalan
2023-09-22 12:14:46 +00:00
Brian Hirsh
63526a63f5 Make FunctionalTensor subclass to be more like functorch (interaction with ZeroTensor + Conjugate key) (#109023)
I added some tests for Conj, Neg and ZeroTensor for both python and C++ functionalization. This also fixes a nasty segfult when running a functorch `jacfwd` test with `torch.compile`, once AOTAutograd is using `FunctionalTensor`.

Changes:

(1) I use Jeffrey's `make_wrapper_subclass(extra_dispatch_keys)` kwarg to plumb extra dispatch keys ontoto the wrapper, mirroring what C++ functionalization does (C++ functionalization will mirror all dispatch keys from the inner tensor to the wrapper, except for python and functorch keys).

(2) FunctionalTensorMode will decompose CompositeImplicitAutograd ops, since (for example) ZeroTensor kernels can send ops like `.to()` directly to the Python key. We'll need a way to toggle this later for pre-dispatch functionalization

(3) Bound `_ForceDispatchKeyGuard` and BatchedTensorImpl's dispatch keyset to python

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109023
Approved by: https://github.com/zou3519
ghstack dependencies: #108654, #109662, #109632
2023-09-22 07:09:04 +00:00
rzou
8124a6c40c [TORCH_LIBRARY] Add impl_abstract_pystub (#109529)
We want users to be able to define custom ops in C++ but put the
abstract impl in Python (since it is easier to write them in Python and
the abstract impl better models device semantics and data-dependent
operators).

`m.impl_abstract_pystub(opname, python_module, context)` declares the
abstract_impl of the operator to exist in the given python module.
When the abstract_impl needs to be accessed (either via FakeTensor or
Meta), and it does not exist, the PyTorch Dispatcher will yell
with a descriptive error message.

Some details:
- We construct a new global AbstractImplPyStub mapping in
  Dispatcher.cpp. Read/write to this map is protected by the Dispatcher
  lock.
- We add a new Meta Tensor fallback kernel. The fallback errors out if there is
  no meta kernel, but also offers a nicer error message if we see that there is
  a pystub.
- We create a `torch._utils_internal.throw_abstract_impl_not_imported_error`
  helper function to throw errors. This way, we can throw different error
  messages in OSS PyTorch vs internal PyTorch. To invoke this from C++, we
  added a PyInterpreter::throw_abstract_impl_not_imported_error.

Differential Revision: [D49464753](https://our.internmc.facebook.com/intern/diff/D49464753/)

Differential Revision: [D49464753](https://our.internmc.facebook.com/intern/diff/D49464753)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109529
Approved by: https://github.com/ezyang, https://github.com/bdhirsh
2023-09-22 04:55:36 +00:00
hauntsaninja
2cd0b94533 Hide __getattr__ from type checkers (#109683)
Visibility of this causes type checkers to conservatively assume that all attributes are defined on torch module.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109683
Approved by: https://github.com/ngimel, https://github.com/ezyang, https://github.com/malfet
2023-09-21 17:01:23 +00:00
Brian Hirsh
238fb66085 python functionalization: support higher order ops (#108656)
We now have two types of functionalization, C++ Functionalization (through the `Functionalize` dispatch key), and python functionalization (through the `FunctionalTensorMode` torch_dispatch mode).

This means that all higher order ops need custom functionalization rules for the python variant too. I added them here, as well as a helper function `dispatch_functionalize()` - equivalent to `torch.func.functionalize()`, except that it uses `FunctionalTensorMode`.

In theory we could have secretly switched `torch.func.functionalize` to use `FunctionalTensorMode`. This would be BC-breaking, though, since `FunctionalTensorMode` isn't composable with the other functorch transforms (the functorch layer-mode stack doesn't know how to re-order torch_dispatch modes arbitrarily).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108656
Approved by: https://github.com/zou3519
ghstack dependencies: #109024, #109248
2023-09-20 04:37:31 +00:00
Brian Hirsh
25e81f19f3 reland "python functionalization: add helpers, functionalize_sync and mirror_autograd_meta (#107917)" (#109518)
Reland - the previous PR was reverted by internal with this error:
```
  File "/data/sandcastle/boxes/eden-trunk-hg-fbcode-fbsource/buck-out/v2/gen/fbcode/363cd7e240f5d021/caffe2/torch/fb/trainer/data_modules/tests/__test_dataloader__/test_dataloader#link-tree/torch/__init__.py", line 29, in <module>
    from ._utils_internal import _functionalize_sync as _sync
ImportError: cannot import name '_functionalize_sync' from 'torch._utils_internal'
```

I couldn't figure out why internal was unhappy with the import. One potential reason is that I see a build rule for *another* `_utils_internal.py` in the fb folder here ([link](https://www.internalfb.com/code/fbsource/[30ed85cd88409af98b7490be137aaa5dfd7afd01]/fbcode/caffe2/TARGETS?lines=444))

Rather than burn more time investigating, I confirmed internally that the error goes away if I move the util from `torch/_utils_internal.py` to `torch/_utils.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109518
Approved by: https://github.com/albanD
2023-09-19 13:25:24 +00:00
PyTorch MergeBot
49b18ae546 Revert "python functionalization: add helpers, functionalize_sync and mirror_autograd_meta (#107917)"
This reverts commit 0ad595954a.

Reverted https://github.com/pytorch/pytorch/pull/107917 on behalf of https://github.com/clee2000 due to breaking internal builds D49346637 ([comment](https://github.com/pytorch/pytorch/pull/107917#issuecomment-1722566885))
2023-09-17 20:57:41 +00:00
Brian Hirsh
0ad595954a python functionalization: add helpers, functionalize_sync and mirror_autograd_meta (#107917)
Added two new utils to help with turning python functionalization on in AOTAutograd (next PR):

(1) updated `torch._sync()`. Previously, this API could only handle `torch.Tensor` instances that had a `FunctionalTensorWrapper` TensorImpl. It now needs to handle python `FunctionalTensor`'s. In theory I can probably break BC and change this API (since it's private?), but I decided not to do it in this PR stack do minimize the chance of reverts. Instead of updating that API directly (which is in C++), I just added a python shim that first tries to unwrap the python `FunctionalTensor` if there is one, then calls the existing C++ logic

(2) `mirror_autograd_meta` is now a standalone API that tries to mirror the `requires_grad` and `is_leaf` autograd metadata from one tensor to another. Previously this was hardcoded into `torch._to_functional_tensor()`. But I now need to use it in a more standalone way: later in AOTAutograd when we unwrap and re-wrap a tensor subclasses, we need to manually mirror the autograd metadata from the original to the updated version of the subclass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107917
Approved by: https://github.com/ezyang
ghstack dependencies: #106404
2023-09-15 20:19:25 +00:00
Brian Hirsh
f22b303f65 Add TorchDispatch version of functionalization (#106404)
This PR adds a new `FunctionalTensor` subclass, and `FunctionalTensorMode` torch dispatch mode. Together, this class/mode are a lightweight wrapper around our existing C++ functionalization logic.

This idea came from Ed - later in the stack, I want to be able to run functionalization **underneath** torch_dispatch, when performing tracing in AOTAutograd. I can't do this easily with vanilla C++ functionalization, because it has a dedicated dispatch key that always runs before TorchDispatch. However, by adding a torch_dispatch mode shim around functionalization, we can use functionalization as a torch_dispatch mode, which will make it easier to run underneath other modes later.

This PR provides the basic new classes, and some light testing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106404
Approved by: https://github.com/ezyang
2023-09-15 20:19:25 +00:00
Edward Z. Yang
55f956f1d2 optests improvements based on torchvision usage on nms (#108929)
- Update cross-ref FakeMode test to use ShapeEnv.  Dynamic ops can now
  return an unbacked SymInt.  We always accept this as equal to whatever
  the real value was.
- Relax test so it works on all classes, not just unittest.TestCase
- Properly wrap the original method, so things like
  pytree.mark.parametrize are carried over
- Support dynamic shapes by default for make_fx `tracing_mode="fake"` without symbolifying everything else

Fixes https://github.com/pytorch/pytorch/issues/108927

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108929
Approved by: https://github.com/zou3519
2023-09-13 13:26:15 +00:00
Edward Z. Yang
e5f300f085 Make mutation test work with quantized tensors (#108935)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108935
Approved by: https://github.com/zou3519
2023-09-13 00:54:01 +00:00
Edward Z. Yang
2b9ad3d5c4 Fix setitem with SymInt (#108873)
Fixes https://github.com/pytorch/pytorch/issues/101939

Several fixes bundled together:

1. When we valueToTensor, we only handled non-symbolic inputs and not symbolic inputs. We support symbolic Scalar, so also handle symbolic values.
2. In the symbolic case, we MUST NOT lift_fresh, as you're not going to inline a constant into the graph, it's going to be from a `scalar_tensor` call (so no need to clone it to avoid mutations)
3. In indexing scalarToTensor, must not do the static, directly read out the scalar contents logic with the scalar is symbolic

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108873
Approved by: https://github.com/jansel
2023-09-10 06:44:22 +00:00
Edward Z. Yang
9b83402666 Add support for symbolic repeat_interleave (#108763)
Fixes https://github.com/pytorch/pytorch/issues/108195

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108763
Approved by: https://github.com/Chillee
2023-09-08 16:48:32 +00:00
vasiliy
3702980717 dynamo: trace autograd.Function with tensor subclass input (#108093)
Summary:

Enables dynamo eager mode tracing for the following situation:
1. we have a torch.autograd.Function
2. the input to that function is a tensor subclass which is an intermediary

This is useful for float8 training UX.

Test Plan:

```
python test/dynamo/test_autograd_function.py -k intermediary_input
```

Reviewers:

Subscribers:

Tasks:

Tags:

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108093
Approved by: https://github.com/bdhirsh, https://github.com/wanchaol
2023-09-01 02:12:38 +00:00
Brian Hirsh
4f34caf164 add return_and_correct_aliasing() util for wrapper subclasses (#107915)
This PR adds a `return_and_correct_aliasing()` utility, that wrapper subclasses can use to get correct aliasing. I updated `TwoTensor` to use it, and added some testing that the aliasing of my `TwoTensor` subclass now matches the aliasing behavior of normal tensors.

Right now my test just uses a few hand-picked opinfos (that have varying aliasing behavior). I thought all op infos might be overkill (does that take a while to run?), but I'm happy to add them all if people prefer.

One more general question about this PR: eventually, proper aliasing will be a **requirement** in order for AOTAutograd to handle aliasing/mutations on subclasses properly during compilation. How can we make sure that wrapper subclasses use this API? A few options (from talking to Richard):

(1) Yolo require subclasses to use the API and hope users do as well (what this PR does)

(2) Yolo require subclasses to use the API, but add a kwarg to `_make_wrapper_subclass`, e.g. `manual_aliasing=True`, that torch.compile checks for before allowing the subclass to be used in compilation

(3) Automatically run this API in our python fallback, for **every** tensor subclass that currently implements `__tensor_flatten__` (aka only the "traceable" subclasses)

(4) Automatically run this API in our python fallback, for **every** tensor subclass. This would be a bit higher blast radius, since it would change the existing aliasing behavior of wrapper subclasses. Maybe.. this is the right thing to do though?

Either way, my tentative plan is to do (1) to unblock, and revisit this later once we want to come up with public docs + a more general "tensor subclass in PT2 requirements" plan

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107915
Approved by: https://github.com/ezyang
2023-08-29 14:27:19 +00:00
Brian Hirsh
da54f3c519 reorder proxy / fake modes so they always run last (#104482)
**Update:** Made refactor of the original PR. See the original description below, but here I'll describe the updates:

(1) TLS changes in `TorchDispatchModeTLS.h/cpp`.

I added a `TorchDispatchModeKey` enum, that (for now) just contains PROXY and FAKE. The ModeTLS used to just contain a `std::vector<std::shared_ptr<c10::SafePyObject>>` corresponding to the mode stack. It now **also** contains a separate array of "infra modes", indexed by mode key (PROXY and FAKE, with a new addition, FUNCTIONAL, coming later in the stack).

`TorchDispatchModeTLS::push_onto_stack` and `TorchDispatchModeTLS::pop_stack` are now a bit more complicated. Pushing accepts an optional mode_key, which if set, tells us to add the given mode directly to our "infra_modes" array. Popping will first check the "user mode" stack, before trying to pop anything from the infra mode stack. It also optionally returns the mode key of the mode we popped if there was one - that way if we push that same mode back onto the TLS later, we know where it goes.

`TorchDispatchModeTLS::dispatch_mode_enabled()` now accepts an optional `skip_infra_modes` param, so you can separately query if there are "any modes at all", or if there are "any user modes".

`TorchDispatchModeTLS::get/set/unset_mode()` all take in a mode key, and get/set/unset the mode at that particular mode key (meaning they are only meant to be used for infra modes).

There were also some mild codegen changes to support the new enum

(2) `fake_tensor.py/proxy_tensor.py/_python_dispatch.py`

The way I tell the infra that certain subclasses/modes are "infra" is through the enum: I gave `FakeTensor` and `FakeTensorMode` a `self._mode_key = torch._C.TorchDispatchModeKey.FAKE`. `TorchDispatchMode.__enter/exit__()` (in `_python_dispatch.py` now check if the current mode has a mode key, and if so they plumb it into any `push_onto_stack()` calls (which eventually instructs `TorchDispatchModeTLS` where to put the mode). Same thing for `ProxyTorchDispatchMode`.

I also had to change both of these mode's enter/exit, to handle the fact that there can no longer be multiple proxy/fake modes on the mode stack at once. I updated them both to have a `self.enter_stack: List[Optional[TorchDispatchMode]]` - whenever we push a given mode in `__enter__`, we remove the current ambient fake/proxy mode from the mode stack, and save it in `enter_stack`, so that on exit we can reset the state properly.

(2) dispatching logic in `python_arg_parser.cpp`

This is where the core dispatching logic changes are. I added two helpers, `dispatch_on_subclass()` and `dispatch_on_mode()`. The overall dispatching order is now:
```
(a) dispatch_on_mode()  # try user modes first (where the mode stack automatically considers infra modes last)
(b) dispatch_on_subclass() # try user subclasses next (skipping infra subclasses)
(c) dispatch_on_subclass() # try infra subclasses next (skipping user subclasses)
```

Note that we still want "user subclasses" to run before "infra modes". As Ed helped me realize, this will work today: If proxy/fake modes in step 1, they'll return NotImplemented if they see a user subclass, allowing us to redispatch to the user subclass.

How do (b) and (c) distinguish between user and infra subclasses? Infra subclasses (FakeTensor, and later FunctionalTensor) are required to have a `_mode_key` hidden on the subclass - so we filter via arguments that do/don't have the _mode_key.

(3) I also changed `DoubleTensor` to `TwoTensor` to minimize confusion (@albanD  pointed out that DoubleTensor would be easily confused with `torch.FloatTensor` and friends).

----- original description below -----

The main purpose of this PR is to fix the "ordering problem" between torch_dispatch modes, where we want to ensure that our Fake and Proxy dispatch modes always run **after** any dispatch modes created by the user, regardless of where they are in the stack. See this doc for more details: https://docs.google.com/document/d/1COQ291nOZvtFnzGTQMJqoYZ3sttEYFw_7HbfSyL8gcA/edit

Full set of changes below. I ended up including a few semi-related changes in this PR that I documented - but if folks would rather I separate them out, happy to try to do that.

**(1) Add dedicated TLS slots for FakeTensorMode and ProxyTensorMode**

This is the main component of this PR. There are two new slots, `TorchDispatchModeTLS.fake_mode_` and `TorchDispatchModeTLS.proxy_mode_`, which correspond to a single "global" fake and proxy mode. There is now an invariant that `torchDispatchModeState.stack_` can never contain either of these modes.

I also added a `TorchDispatchModeTLS::maybe_highest_mode()` helper that consults the `stack_` as well as both the proxy and fake slots, and returns the highest priority mode - this is because there are a few places in the codebase where we legitimately want to get the highest priority mode, *including* fake or proxy, if one is set.

This also made the implementations of the existing `disable_proxy_modes_tracing()` and `get_innermost_proxy_mode()` marginally simpler.

**(2) Updated the dispatching logic in handle_torch_function_no_python_arg_parser()**

This is the function that actually figures out which torch_dispatch implementation to call, given the current mode stack and tensor subclass inputs. This function got marginally more complicated as part of the refactor: First we inspect the mode stack and any non-fake subclass inputs. Then we check for the proxy mode slot. Then we check for the Fake mode slot, before finally checking for any fake subclass inputs.

**(3) new python `_get_fake_tensor_mode()` and `_get_proxy_tensor_mode()` API's**

Before, if you wanted to see if proxy or fake modes were active in python, you would have to consult the mode stack. Since these two modes are no longer part of the actual mode stack, I added two new API's to directly check if either proxy or fake modes are active.

**(4) Allow traceable tensor subclasses to access storages from python**
This is convenient later in the stack, where AOTAutograd needs to detect aliasing of inputs and outputs, where those inputs and outputs might be tensor subclasses. Previously, `x.untyped_storage()` would raise an error if `x` was a subclass. In this PR, I tried to relax this constraint as little as possible: `THPVariable_storage()` will only try to return a storage to python if the tensor subclass that you are passing in is "traceable"

**(5) Fixed subclass fakeification**

@wanchaol recently added support to be able to fakeify tensor subclasses. That fakeification logic works in most cases, but there is one case it doesn't handle: autograd metadata. In particular, since autograd sees our tensor subclasses and not their desugared tensors, we need to make sure that our fakeified subclass has the same autograd metadata as the original subclass. I updated `meta_utils.py` to make sure that the autograd metadata is correct.

**(6) make tensor subclasses resizeable**

Previously we didn't allow tensor subclasses to be resizeable. I ran into an issue where fakeifying a tensor subclass occasionally requires swapping out its storage, which can involve resizing the tensor. Mechanically, this required updating `at::for_blob()` to expose a way to request that the tensor that you create has resizeable storage, and then using this new API in `_make_wrapper_tensor()`.

**(7) Added a basic DoubleTensor subclass for testing**

I use this subclass more later in this stack in my AOTAutograd tests - but it serves as a simple subclass example to test the dispatch ordering in this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104482
Approved by: https://github.com/ezyang
ghstack dependencies: #107415
2023-08-29 02:36:48 +00:00
Brian Hirsh
5efd63b1b8 better support for fakeifying and dynamoing through torch_dispatch subclasses (with dynamic shapes) (#107415)
There is already some support for plumbing `__torch_dispatch__` tensor subclasses through dynamo, but this PR beefs it up a bit and adds a test. In particular:

(1) Fakeifying tensor subclasses didn't properly set autograd metadata (requires_grad, is_leaf) on the newly fakeified wrapper subclass. I don't actually have a test for this in this PR, but it's tested pretty heavily later in my aot autograd tests

(2) Fakeifying tensor subclasses didn't properly track source information for dynamic shapes on the inner tensors. I added a new `WrapperSubclassFieldSource` subclass, that represents a source coming from a tensor field on a wrapper subclass, which I use in the fakeifying logic, and again in symbolic_shapes.py to generate proper guards.

(3) `_make_wrapper_subclass()` marginally updated this code to work better with dynamic shapes. One thing that's a bit weird about `_make_wrapper_subclass`: it has two overloads, and the first explicitly does not support dynamic shapes (and the second.. does not support kwargs). I think that later we probably want to consolidate / at least make the first overload work with dynamic shapes, but I didn't want to handle that in this PR (so these smaller changes seemed like a strict improvement).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107415
Approved by: https://github.com/ezyang
2023-08-29 02:36:48 +00:00
leslie-fang-intel
9319dd1c7c [Quant][Inductor] Enable the lowering of quantized maxpool2d (#105906)
**Summary**
Enable the `dq-maxpool2d-q` pattern match and lower into `torch.ops.quantized.max_pool2d`.

**Test Plan**
```
python -m pytest test_mkldnn_pattern_matcher.py -k test_qmaxpool2d
python -m pytest test_quantized_op.py -k test_max_pool2d_pt2e
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105906
Approved by: https://github.com/jgong5, https://github.com/eellison
ghstack dependencies: #104580, #104581, #104588, #104590, #105455, #105456, #105639
2023-08-26 08:36:47 +00:00
Vishwa Raj Singh
35de780aa6 Fix Inplace tensor update on transpose (#104689)
Fixes #https://github.com/pytorch/pytorch/issues/103650

- To align with HPU device backend architecture.
   Ensure all non-view ops return contiguous fake tensor outputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104689
Approved by: https://github.com/ezyang
2023-08-24 16:58:50 +00:00
ydwu4
cbcd551045 Fix torch.compile FunctionalTensor inputs for higherOrderOps (#107604)
Before this PR, for the added [test](https://github.com/pytorch/pytorch/pull/107604/files#diff-c618f2274b6b5ccc533c580549d2e552edbd9fc5ac0da1aa4b00338525c8f78dR224), which feeds FunctionTensorWrapper inputs to higherOrderOperator, we have an assertion error in this line [code](https://github.com/pytorch/pytorch/pull/107604/files#diff-9f0663783bcd93e948e0491ef61b48123bdc9977bcc632fd707da578df13bfa1R1284).

The key difference of this PR is this [line ](https://github.com/pytorch/pytorch/pull/107604/files#diff-9f0663783bcd93e948e0491ef61b48123bdc9977bcc632fd707da578df13bfa1L1263)of check:
```python
        elif (
            isinstance(example_value, FakeTensor)
            and example_value.fake_mode is tx.fake_mode
        ):
```
The original intention of it seems to be dealing with case where we want to wrap an fx proxy for an intermediate fake tensor that's produced by some tensor ops and an example value is provided (as is the case for higherOrderOps [here](https://github.com/pytorch/pytorch/blob/main/torch/_dynamo/variables/higher_order_ops.py#L85)). A fakified FunctionalTensorWrapper(FakeTensor) always fails this check. This PR changes it to checking whether it's already fakified by tx.fake_mode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107604
Approved by: https://github.com/zou3519
ghstack dependencies: #107569
2023-08-23 02:42:18 +00:00
ydwu4
a408920817 Reland fakify FunctionalTensor (#107569)
Try to rebase and reland https://github.com/pytorch/pytorch/pull/107062 . One difference compared with previous is to make the DTensor logic same as previously in _clone_input.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107569
Approved by: https://github.com/zou3519
2023-08-22 15:46:25 +00:00
Yukio Siraichi
bcede143bd Do not mutate SymNode expression. (#107492)
This PR stops `SymNode` from mutating (i.e. simplifying) its expression. Instead, the
simplification (without mutation) is deferred to the `SymNode.maybe_as_int` method.

```python
- FakeTensor(size=(s0,), ...)
- FakeTensor(size=(s1, s2, s3), ...)

- Eq(s0, s1 + s2 + s3)

- FakeTensor(size=(s0,), ...)
- FakeTensor(size=(s1, s2, s3), ...)
```

In summary, this PR:
- Replaces `SymNode._expr` by `SymNode.expr`, removing the old property function
    - This makes it so `SymNode` instances never update their expression
- Creates `SymNode.simplified_expr()` method for actually calling `ShapeEnv.replace` on
  its expression. Note that this doesn't updates `SymNode.expr`
- Changes how `tensor.size()` gets converted to its Python `torch.Size` type
    - Instead of calling `SymInt::maybe_as_int()` method, we create a new
      `SymInt::is_symbolic()` method for checking whether it is actually a symbolic value
    - This is needed so that when we call `tensor.size()` in the Python side, the returned
      sequence is faithful to the actual data, instead of possibly simplifying it and
      returning an integer
    - 2 files needs this modification:
        - _torch/csrc/Size.cpp_: for handling `torch.Tensor.size` Python calls
        - _torch/csrc/utils/pybind.cpp_: for handling `symint.cast()` C++ calls

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107492
Approved by: https://github.com/ezyang
ghstack dependencies: #107523
2023-08-22 12:38:05 +00:00
PyTorch MergeBot
96c5be8bc4 Revert "Fakify leaf of FunctionalTensor (#107062)"
This reverts commit 3349725766.

Reverted https://github.com/pytorch/pytorch/pull/107062 on behalf of https://github.com/ydwu4 due to This appears to have broken the test TestDTensorCompile.test_dtensor_fullgraph. Probably a land race ([comment](https://github.com/pytorch/pytorch/pull/107062#issuecomment-1685447747))
2023-08-21 00:30:16 +00:00
ydwu4
3349725766 Fakify leaf of FunctionalTensor (#107062)
This PR allows dynamo to fakify FunctionalTensorWrapper by unwrapping, replacing and wrapping again for FunctionalTensorWrapper so that FunctionalTensorWrapper can be passed in as input for dynamo.optimize and we can support something like this
```python
ff = torch.func.functionalize(f)
torch.compile(ff)(x)
```

This PR didn't follow the \_\_tensor_flatten\_\_ and \_\_tensor_unflatten\_\_ protocol right now because we're not sure the plan of doing that for FunctionalTensorWrapper (it's implemented in C++).

**Test Plan:**
Add a new test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107062
Approved by: https://github.com/zou3519
ghstack dependencies: #107042
2023-08-19 17:33:42 +00:00
PyTorch MergeBot
3c11184ca8 Revert "Fakify leaf of FunctionalTensor (#107062)"
This reverts commit 6cb0128c8a.

Reverted https://github.com/pytorch/pytorch/pull/107062 on behalf of https://github.com/ZainRizvi due to This appears to have broken the test TestDTensorCompile.test_dtensor_fullgraph.  Probably a land race ([comment](https://github.com/pytorch/pytorch/pull/107062#issuecomment-1684124230))
2023-08-18 16:02:54 +00:00
ydwu4
6cb0128c8a Fakify leaf of FunctionalTensor (#107062)
This PR allows dynamo to fakify FunctionalTensorWrapper by unwrapping, replacing and wrapping again for FunctionalTensorWrapper so that FunctionalTensorWrapper can be passed in as input for dynamo.optimize and we can support something like this
```python
ff = torch.func.functionalize(f)
torch.compile(ff)(x)
```

This PR didn't follow the \_\_tensor_flatten\_\_ and \_\_tensor_unflatten\_\_ protocol right now because we're not sure the plan of doing that for FunctionalTensorWrapper (it's implemented in C++).

**Test Plan:**
Add a new test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107062
Approved by: https://github.com/zou3519
ghstack dependencies: #107042
2023-08-18 03:05:45 +00:00
eellison
3495f0c999 Generate mypy hints for torch.Tag, add a couple of pointwise ops (#106910)
Replace https://github.com/pytorch/pytorch/pull/106739, since i had a bad CLA commit.

- adds clone, and convert_element_dtype to pointwise
- adds codegen for mypy hints of torch.Tag and removes existing ignores for them

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106910
Approved by: https://github.com/mlazos
2023-08-10 05:12:27 +00:00
Jason Lu
bc88028e8e Back out "Reland "Make adding buffers more like adding parameters (#104069)" (#106224)" (#106743)
Summary:
Original commit changeset: 81319beb97f3

Original Phabricator Diff: D47961182

Test Plan: revert to maintain backward compat with legacy ads_dper3 production package. Read details in: S357822

Reviewed By: atuljangra

Differential Revision: D48131623

@diff-train-skip-merge
(D48131623 landed internally)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106743
Approved by: https://github.com/malfet
2023-08-08 15:27:34 +00:00
Peter Bell
d4d090e2da [FakeTensor] Workaround FFT ops with incorrect meta strides (#106319)
Currently there are FFT operators which raise `UnsupportedOperatorException`
because their meta implementations sometimes give incorrect strides. This works
around the problem for static shapes by falling back to eager. Though we still
don't support calls with dynamic shapes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106319
Approved by: https://github.com/ezyang
2023-08-07 20:59:30 +00:00
Edward Z. Yang
91afefb55b Fix some fake mode confusion between inner/outer fake mode in export (#106515)
Fixes https://github.com/pytorch/pytorch/issues/106412

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106515
Approved by: https://github.com/voznesenskym, https://github.com/BowenBao, https://github.com/thiagocrepaldi
2023-08-04 15:42:23 +00:00
Richard Zou
fd6e052a8a Some minor improvements to FakeTensor testing (#106311)
Summary:
- PyTorch testing chokes sometimes when it sees an exception where the first
  argument is not a string. fake_tensor.UnsupportedOperatorException's first
  arg is an OpOverload. This PR fixes PyTorch testing to not choke. I'm not
  really sure how to reproduce this in OSS.
- It turns out that if an operator does not have a meta kernel, the FakeTensor
  rule is really slow (30ms in OSS in debug mode, 3s on some internal config).
  The thing that is slow (aside from the previous diff) is waiting for the Dispatcher to
  report NotImplemented and then attempting to catch that. I'm not really sure
  why this is slow but it's easy to workaround so I added a workaround.

Test Plan: - existing tests

Differential Revision: D47917554

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106311
Approved by: https://github.com/eellison
2023-08-03 01:44:15 +00:00
drisspg
f533791cd0 [SDPA] Mirror c++ implementation in FlashAttention meta func (#106477)
# Summary
Test edge case and update meta function to match the c++ implementation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106477
Approved by: https://github.com/eellison
2023-08-03 00:28:27 +00:00
PyTorch MergeBot
fdd4b3aaa8 Revert "faketensor: prevent deepcopy from cloning FakeTensorMode (#104476)"
This reverts commit c54afea6ee.

Reverted https://github.com/pytorch/pytorch/pull/104476 on behalf of https://github.com/jeanschmidt due to sadly it is breaking internal tests, and I can't coordinate a FF due to timezone differences ([comment](https://github.com/pytorch/pytorch/pull/104476#issuecomment-1661808343))
2023-08-02 08:56:33 +00:00
Mikayla Gawarecki
d8e5f2aa6d Reland "Make adding buffers more like adding parameters (#104069)" (#106224)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106224
Approved by: https://github.com/atalman, https://github.com/albanD
2023-07-31 17:18:56 +00:00
Brian Hirsh
c54afea6ee faketensor: prevent deepcopy from cloning FakeTensorMode (#104476)
fixes https://github.com/pytorch/pytorch/issues/104465

A more detailed repro is here, which uses `nn.TransformerLayer` (this breaks with AOTAutograd today, due to the presence of multiple FakeTensorMode objects lying around) https://github.com/pytorch/pytorch/issues/103505#issuecomment-1614817132

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104476
Approved by: https://github.com/ezyang
2023-07-31 15:49:08 +00:00
Jason Ansel
3ecd05d9f3 Fix FakeTensor issues with copy_ between devices (#106172)
Used to fail with:
```
RuntimeError: Unhandled FakeTensor Device Propagation for aten.copy_.default, found two different devices cpu, cuda:0
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106172
Approved by: https://github.com/eellison
2023-07-29 15:55:32 +00:00
Richard Zou
f3d165bf61 [fake_tensor] Don't run fallback for fbgemm ops (#106210)
Summary:
This diff also adds more warning messages around allowing a namespace into the
fallback. We need to grandfather in an operator to actually merge this diff.

Test Plan: - existing tests

Differential Revision: D47873841

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106210
Approved by: https://github.com/eellison
2023-07-28 22:31:54 +00:00
Edward Z. Yang
884cd53e49 Unconditionally record when FakeTensorMode is allocated and report it on inconsistency (#105927)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105927
Approved by: https://github.com/albanD
2023-07-26 03:38:42 +00:00
Edward Z. Yang
4af9a914ab Improve FakeTensor to work with mixed meta-cpu embedding bag arguments (#105924)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105924
Approved by: https://github.com/mikaylagawarecki, https://github.com/eellison
2023-07-26 01:19:08 +00:00
Edward Z. Yang
5403c7770c Provide a refined upper bound for nonzero when original numel is static (#105843)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105843
Approved by: https://github.com/lezcano
2023-07-25 00:51:35 +00:00